# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

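| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |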


In [13]:
corr = data.corr()

In [32]:
data.shape

(4870, 65)

## SOLUTIONS

#### 1. What test is appropriate for this problem? Does CLT apply?

Because I am testing whether resumes with black-sounding names received calls at a different rate than resumes with white-sounding names, I will use a two-sample t-test. Since 'call' is binary, each group's mean is simply its callback rate. Because the question asks whether race had any impact, not specifically whether black-sounding names had a lower rate than white-sounding names, the test is two-sided.

CLT applies because each sample is large (far beyond the usual $n \geq 30$ rule of thumb) and the observations can be treated as iid, since names were assigned to resumes at random.
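For a binary outcome, a common supplementary rule of thumb is that each group should have at least about 10 expected successes and 10 expected failures. The sketch below checks that condition. The count of 157 comes from cell `In [3]`; the group size of 2435 (half of the 4870 rows) and the white-sounding count of 235 are inferred from the outputs further down and should be treated as assumptions. The helper name `normal_approx_ok` is hypothetical.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: both successes and failures reach `threshold`."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# (callbacks, resumes) per group; 157 is from the notebook output,
# 235 and 2435 are inferred values (assumptions, see the note above)
groups = {"b": (157, 2435), "w": (235, 2435)}

for race, (calls, n) in groups.items():
    print(race, round(calls / n, 4), normal_approx_ok(calls, n))
```

Both groups clear the threshold by a wide margin, which supports using a normal (or t) approximation for the difference in callback rates.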

#### 2. What are the null and alternate hypotheses?

Under a two-sided two-sample t-test, the null hypothesis states that there is no difference in call rates between black-sounding and white-sounding names. The alternative hypothesis states that the call rates of the two groups differ.

More properly stated:

($H_{0}$): $\mu_{white} = \mu_{black}$

($H_{A}$): $\mu_{white} \neq \mu_{black}$

#### 3. Compute margin of error, confidence interval, and p-value.

I will calculate the above values at a 99% confidence level (a 1% significance level). Population variances are unknown, so they are estimated from the sample variances and assumed equal across the two groups (pooled).
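For equal group sizes $n$, the quantities computed in the next cell can be written as follows. The symbols $s_p^2$, $SE$, and $ME$ are notation introduced here for illustration; $t_{0.995,\,2n-2}$ is the 99.5th percentile of the t distribution, which leaves 0.5% in each tail for a two-sided 99% interval.

$$
s_p^2 = \frac{s_{white}^2 + s_{black}^2}{2}, \qquad
SE = \sqrt{\frac{2\,s_p^2}{n}}, \qquad
ME = t_{0.995,\,2n-2} \cdot SE
$$

$$
CI = (\bar{x}_{white} - \bar{x}_{black}) \pm ME
$$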

In [46]:
# Callback indicators (1 = callback, 0 = none) for each group
white_names = data[data.race == 'w'].call
black_names = data[data.race == 'b'].call
mu_white = np.mean(white_names)
mu_black = np.mean(black_names)
# np.var defaults to ddof=0; with thousands of rows the difference from ddof=1 is negligible
var_white = np.var(white_names)
var_black = np.var(black_names)

# Two-sided 99% interval: 0.5% in each tail, so use the 0.995 quantile in both branches
if len(white_names) == len(black_names):
    n = len(white_names)
    t = stats.t.ppf(q = .995, df = (2*n - 2))
    var_pooled = (var_black + var_white)/2
    standard_error = np.sqrt(2*var_pooled/n)
else:
    n_w = len(white_names)
    n_b = len(black_names)
    t = stats.t.ppf(q = .995, df = (n_w + n_b - 2))
    var_pooled = (((n_w - 1)*var_white + (n_b - 1)*var_black)/(n_w + n_b - 2))
    standard_error = np.sqrt(var_pooled/n_w + var_pooled/n_b)
    print('WARNING - Unequal sample sizes by', n_w - n_b, 'equal population variances may be unsafe to assume.\n')

margin_of_error = t*standard_error
# Confidence interval for the difference in callback rates (white minus black)
ci = mu_white - mu_black + np.array([-1,1])*margin_of_error
p = stats.ttest_ind(white_names, black_names).pvalue

print('The margin of error is {}, producing a confidence interval of {}, with a p-value of {}'.format(margin_of_error, ci, p))

The margin of error is 0.020056337083217784, producing a confidence interval of [ 0.01197652  0.05208919], with a p-value of 3.940802103128886e-05
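As a cross-check, the same comparison can be framed as a two-proportion z-test, which is a common alternative when the outcome is binary. This is a minimal sketch, not the method used above. It uses only the standard library and works from aggregate counts: 157 is taken from cell `In [3]`, while 235 and 2435 are inferred from the outputs above and are assumptions. The function name `two_proportion_z_test` is hypothetical.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equal proportions using the pooled estimate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# white-sounding vs. black-sounding (235 and 2435 are inferred, see above)
z, p_value = two_proportion_z_test(235, 2435, 157, 2435)
print(z, p_value)
```

With samples this large, the z-test and the pooled t-test should give p-values of the same order of magnitude, so either framing leads to the same conclusion.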


#### 4. Write a story describing the statistical significance in the context or the original problem.

After comparing the callbacks received by resumes with black-sounding and white-sounding names, all other qualifications being equal, the difference in callback rates is statistically significant at the 1% level (p ≈ 3.9e-05). The 99% confidence interval for the difference in callback rates (white minus black) runs from about 1.2 to 5.2 percentage points. If the experiment were repeated many times, intervals constructed this way would contain the true difference about 99% of the time. Because the interval excludes 0 and is entirely positive, the data indicate that resumes with white-sounding names receive callbacks at a higher rate than resumes with black-sounding names.

#### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

This does not inherently mean that race is the most important factor, only that it is a statistically significant one. I'll look at the correlation coefficients between the other variables in the data and 'call', making sure to 'dummify' the categorical race variable first.

In [51]:
data_cat = pd.get_dummies(data, columns = ['race'])

call_corr = data_cat.corr().call.sort_values()

print('The most detrimental factors towards getting a callback are: \n', call_corr.head(5))
print('\n The most helpful factors in getting a callback are: \n', call_corr.tail(10)[::-1])

The most detrimental factors towards getting a callback are: 
 race_b             -0.058872
fracdropout        -0.056671
lmedhhinc_empzip   -0.049879
req                -0.041699
educreq            -0.033864
Name: call, dtype: float64

 The most helpful factors in getting a callback are: 
 call             1.000000
specialskills    0.111074
honors           0.071951
empholes         0.071888
adid             0.063178
yearsexp         0.061436
race_w           0.058872
linc             0.049649
offsupport       0.047783
lmedhhinc        0.047699
Name: call, dtype: float64


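It appears that having a black-sounding name is the factor most negatively correlated with a callback. Because 'race_b' and 'race_w' are complements, their correlations with 'call' are equal in magnitude (0.0589), and that magnitude is smaller than the correlations for special skills, honors, or experience. All of these correlations are weak, so none of them alone explains much of the variation in callbacks.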

Knowing this, I would amend the analysis to compare callback rates across combinations of race and the most correlated variables. For example, how likely are black-sounding names with special skills to get a callback compared to white-sounding names without special skills?
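One way to start on that comparison is to tabulate the callback rate for every combination of race and another factor. The sketch below assumes a DataFrame with the 'race', 'specialskills', and 'call' columns seen above; the helper `callback_rate_by` is hypothetical, and the small frame is made-up data used only to show the call, not the study data.

```python
import pandas as pd

def callback_rate_by(df, factors, outcome="call"):
    """Mean callback rate and row count for each combination of `factors`."""
    return (
        df.groupby(factors)[outcome]
          .agg(rate="mean", n="size")
          .reset_index()
    )

# Tiny made-up frame purely for illustration (NOT the study data)
toy = pd.DataFrame({
    "race":          ["b", "b", "b", "b", "w", "w", "w", "w"],
    "specialskills": [0, 0, 1, 1, 0, 0, 1, 1],
    "call":          [0, 0, 0, 1, 0, 1, 1, 1],
})

rates = callback_rate_by(toy, ["race", "specialskills"])
print(rates)
```

On the real data, the call would be `callback_rate_by(data, ['race', 'specialskills'])`. The per-cell counts in the `n` column matter: small cells give noisy rates, so any comparison between cells should be paired with a significance test.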