# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises

You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions in this notebook and submit it to your GitHub account.

1) What test is appropriate for this problem? Does CLT apply?

2) What are the null and alternate hypotheses?

3) Compute margin of error, confidence interval, and p-value.

4) Write a story describing the statistical significance in the context of the original problem.

5) Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 

- In the control panel at the top, choose Cell > Cell Type > Markdown
- Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

### Resources

- Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
- Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
- Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet 

In [60]:
import pandas as pd
import numpy as np
from scipy import stats

In [61]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [62]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [73]:
# Descriptive statistics for the black and white groups, based on the callback rate for resumes
black = data[data.race=='b']
white = data[data.race=='w']

n_b = len(black)
mean_b = black.call.mean()
std_b = black.call.std()
sem_b = black.call.sem()
skewness_b = black.call.skew()
print('Black group')
print('n: ', n_b)
print('Mean: ', mean_b)
print('SD: ', std_b)
print('SE: ', sem_b)
print('Skewness: ', skewness_b)
print('--------------------------')

n_w = len(white)
mean_w = white.call.mean()
std_w = white.call.std()
sem_w = white.call.sem()
skewness_w = white.call.skew()
print('White group')
print('n: ', n_w)
print('Mean: ', mean_w)
print('SD: ', std_w)
print('SE: ', sem_w)
print('Skewness: ', skewness_w)

Black group
n:  2435
Mean:  0.0644763857126236
SD:  0.24564945697784424
SE:  0.00497813117346
Skewness:  3.5488
--------------------------
White group
n:  2435
Mean:  0.09650924056768417
SD:  0.2953455150127411
SE:  0.00598523067594
Skewness:  2.73454


### Statistical test
The call values are binary, so both group sample distributions are strongly right-skewed (skewness of about 3.5 and 2.7) and are not normal. Therefore, the Mann-Whitney U-test, a nonparametric alternative to the two-sample t-test, will be used. While the t-test assumes that the data come from a normally distributed population (or that the samples are large enough for the sample means to be approximately normal), the Mann-Whitney U-test makes no such assumption.
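As a rough check on the skewness figures and on whether the central limit theorem could still apply to the callback rates, the sketch below uses the group size and rates printed above. The callback counts (157 and 235) do not appear in the original output; they are inferred here as mean × n. The `n * p >= 10` threshold is a common rule of thumb, not part of the original analysis.

```python
import math

n = 2435                      # resumes per group (from the output above)
calls = {'b': 157, 'w': 235}  # assumed counts, inferred as round(mean * n)

results = {}
for group, k in calls.items():
    p = k / n  # callback rate for this group
    # Population skewness of a Bernoulli(p) variable
    skew = (1 - 2 * p) / math.sqrt(p * (1 - p))
    # Rule of thumb for a normal approximation to the sample proportion
    clt_ok = n * p >= 10 and n * (1 - p) >= 10
    results[group] = (p, skew, clt_ok)
    print(group, 'rate:', round(p, 4), 'skewness:', round(skew, 2), 'CLT rule of thumb met:', clt_ok)
```

If the rule of thumb holds for both groups, a test based on the normal approximation to the difference in proportions would also be a defensible choice alongside the Mann-Whitney U-test.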

### Null and alternative hypotheses

The null hypothesis is that the call values in the two groups come from the same distribution: a resume chosen at random from the black-sounding group is just as likely to have a higher call value than a resume chosen at random from the white-sounding group as the reverse. Because call is binary, this amounts to equal callback rates in the two groups. The alternative hypothesis is that the two outcomes are not equally likely, i.e. that the callback rates differ.
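To make the hypotheses concrete, the following sketch computes the pairwise probabilities that the Mann-Whitney U-test compares. It assumes callback counts of 157 and 235 out of 2435 resumes per group; these counts are inferred from the means and group sizes printed earlier rather than read from the data directly, and the variable names are illustrative.

```python
n = 2435          # resumes per group
p_b = 157 / n     # assumed callback rate, black-sounding names
p_w = 235 / n     # assumed callback rate, white-sounding names

# Draw one resume from each group at random and compare call values.
p_b_higher = p_b * (1 - p_w)           # black = 1, white = 0
p_w_higher = p_w * (1 - p_b)           # white = 1, black = 0
p_tie = 1 - p_b_higher - p_w_higher    # both 0 or both 1

# Under the null hypothesis this quantity equals 0.5.
prob_superiority = p_b_higher + 0.5 * p_tie

print('P(black > white):', p_b_higher)
print('P(white > black):', p_w_higher)
print('P(black > white) + 0.5 * P(tie):', prob_superiority)
```

Multiplying `prob_superiority` by the number of pairs, `n * n`, gives the U statistic for the black-sounding group, which ties this interpretation back to the test itself.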

### Assumptions for the Mann-Whitney U-test

- the sample groups are independent
- the two group distributions have a similar shape (both are right-skewed), although neither is normal

In [67]:
# Creating lists
call_b = sorted(black.call)
call_w = sorted(white.call)

In [80]:
# Calculating confidence intervals at a 95% confidence level
low, high = stats.norm.interval(0.95, loc=mean_b, scale=sem_b)
margerr = high - mean_b
print('Confidence intervals for the black group: ', low, '-', high)
print('Margin of error for the black group: ', margerr)

low, high = stats.norm.interval(0.95, loc=mean_w, scale=sem_w)
margerr = high - mean_w
print('Confidence intervals for the white group: ', low, '-', high)
print('Margin of error for the white group: ', margerr)

Confidence intervals for the black group:  0.0547194279023 - 0.0742333435229
Margin of error for the black group:  0.0097569578103
Confidence intervals for the white group:  0.0847784040037 - 0.108240077132
Margin of error for the white group:  0.011730836564
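The intervals above describe each group separately. A complementary view is a confidence interval for the difference in callback rates, sketched below under a normal approximation. The counts (157 and 235) are assumptions inferred from the printed means and group sizes, and `statistics.NormalDist` from the standard library stands in for `stats.norm` so that the sketch runs without SciPy.

```python
import math
from statistics import NormalDist

n_b = n_w = 2435   # resumes per group
p_b = 157 / n_b    # assumed callback rate, black-sounding names
p_w = 235 / n_w    # assumed callback rate, white-sounding names

# Difference in callback rates and its (unpooled) standard error
diff = p_w - p_b
se_diff = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# Margin of error and 95% confidence interval for the difference
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * se_diff
ci = (diff - margin, diff + margin)

print('Difference in callback rates:', diff)
print('Margin of error:', margin)
print('95% confidence interval:', ci)
```

If the interval excludes zero, the data are inconsistent with equal callback rates at the 5% level.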


In [75]:
# Null hypothesis test - Mann-Whitney U-test (nonparametric test)
# mannwhitneyu returns the U statistic (not a z-score) and the p-value
u, p = stats.mannwhitneyu(call_b, call_w)
print('U statistic: ', u)
print('p-value: ', p)

U statistic:  2869647.5
p-value:  1.99577095968e-05
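As a cross-check on the nonparametric result, a two-proportion z-test can be run on the same data. This is not the test used in the analysis above; it is a minimal sketch that assumes the normal approximation is adequate and uses callback counts (157 and 235) inferred from the printed means and group sizes.

```python
import math
from statistics import NormalDist

n_b = n_w = 2435
calls_b, calls_w = 157, 235   # assumed counts, inferred as round(mean * n)

p_b = calls_b / n_b
p_w = calls_w / n_w

# Pooled callback rate under the null hypothesis of equal rates
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

# z statistic and two-sided p-value
z = (p_w - p_b) / se_pool
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print('z statistic:', z)
print('two-sided p-value:', p_two_sided)
```

Agreement between the two tests would suggest that the conclusion does not depend on the choice of test.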


### Conclusion:

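The null hypothesis is rejected because the p-value (about 2e-05) is far below the 0.05 significance level. In the context of the original problem, this means that the race suggested by a name has a statistically significant effect on callback success: resumes with white-sounding names received callbacks at a rate of about 9.65%, compared with about 6.45% for resumes with black-sounding names. Applicants with white-sounding names therefore have an advantage over applicants with black-sounding names at the callback stage of the hiring process in the U.S.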