# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [20]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


# 1

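A two-sample z-test for the difference between two proportions is appropriate here, because the outcome (`call`) is binary and we are comparing callback rates between two groups.
Conditions for the CLT to apply:
1. Independence: the observations are independent, since the 'b' and 'w' labels were assigned to the resumes at random.
2. Sample size: each group ('b' and 'w') should have at least 10 successes and 10 failures.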

In [26]:
# Proportion of resumes with white-sounding names that received a callback
w_call = sum(data[data.race=='w'].call)
w_total = data[data.race=='w'].call.count()
print(w_call,w_total)
w_prop = w_call/w_total
print(w_prop)

235.0 2435
0.0965092402464


In [25]:
# Proportion of resumes with black-sounding names that received a callback
b_call = sum(data[data.race=='b'].call)
b_total = data[data.race=='b'].call.count()
print(b_call,b_total)
b_prop = b_call/b_total
print(b_prop)

157.0 2435
0.064476386037


In [29]:
# Pooled callback proportion across both groups
p_pooled = (w_call+b_call)/(b_total+w_total)
p_pooled

0.080492813141683772

In [31]:
# Check the success/failure condition for the CLT in each group,
# using the expected counts under the pooled proportion
print('np white:', w_total*p_pooled)
print('nq white:', w_total*(1-p_pooled))
print('np black:', b_total*p_pooled)
print('nq black:', b_total*(1-p_pooled))

np white: 196.0
nq white: 2239.0
np black: 196.0
nq black: 2239.0


All four expected counts above are well above 10, so the sample size condition is met and the central limit theorem applies.
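The check above uses expected counts under the pooled proportion. A complementary check uses the observed successes and failures in each group. The following is a minimal sketch of that variant; the helper name `meets_success_failure_condition` is hypothetical, and the counts are copied from the cell outputs above rather than read from the dataset.

```python
def meets_success_failure_condition(successes, total, minimum=10):
    """Return True when both successes and failures reach `minimum`."""
    failures = total - successes
    return successes >= minimum and failures >= minimum

# Observed (callbacks, total resumes) per group, copied from the outputs above.
groups = {"w": (235, 2435), "b": (157, 2435)}

for label, (calls, total) in groups.items():
    # Print the group label, successes, failures, and whether the condition holds.
    print(label, calls, total - calls, meets_success_failure_condition(calls, total))
```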

# 2

H0: the proportion of resumes with white-sounding names that received a callback equals the proportion of resumes with black-sounding names that received a callback.

Ha: the two callback proportions are not equal.
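In symbols, the same pair of hypotheses can be written as a two-sided test. The notation is introduced here for illustration: $p_w$ and $p_b$ denote the true callback probabilities for resumes with white-sounding and black-sounding names.

$$
H_0:\; p_w - p_b = 0
\qquad\text{versus}\qquad
H_a:\; p_w - p_b \neq 0
$$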

# 3

In [35]:
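# Compute the margin of error, confidence interval, and p-value.
# Standard error of the difference between the two sample proportions (unpooled):
SE = np.sqrt((w_prop*(1-w_prop)/w_total) + (b_prop*(1-b_prop)/b_total))
# 1.96 is the critical z value for a 95% confidence level
ME = 1.96*SE
diff_of_prop = w_prop - b_prop
CI = (diff_of_prop - ME, diff_of_prop + ME)
# z statistic and two-sided p-value from the standard normal distribution
Z_stats = diff_of_prop/SE
p_value = stats.norm.sf(abs(Z_stats))*2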

print('margin of error:', ME)
print('Confidence interval:', CI)
print('p-value:', p_value)

margin of error: 0.0152554063499
Confidence interval: (0.016777447859559147, 0.047288260559332024)
p-value: 3.86256520752e-05
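The cell above uses the unpooled standard error for both the confidence interval and the z statistic. A common alternative for the hypothesis test is to use the pooled standard error, since the null hypothesis assumes a single shared callback rate. The sketch below shows that variant. It is not the method used in this notebook. The counts are copied from the earlier outputs, the `_pooled` names are introduced here, and `math.erfc` stands in for `stats.norm.sf` so the block has no third-party dependencies.

```python
import math

# Counts copied from the notebook outputs above.
w_call, w_total = 235, 2435
b_call, b_total = 157, 2435

w_prop = w_call / w_total
b_prop = b_call / b_total
diff_of_prop = w_prop - b_prop

# Pooled proportion: the common callback rate assumed under the null hypothesis.
p_pooled = (w_call + b_call) / (w_total + b_total)

# Standard error of the difference when both groups share p_pooled.
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / w_total + 1 / b_total))

z_pooled = diff_of_prop / se_pooled

# Two-sided p-value for a standard normal: P(|Z| > z) = erfc(z / sqrt(2)).
p_value_pooled = math.erfc(abs(z_pooled) / math.sqrt(2))

print('pooled SE:', se_pooled)
print('z (pooled):', z_pooled)
print('p-value (pooled):', p_value_pooled)
```

With equal group sizes, the pooled standard error is never smaller than the unpooled one, so the pooled z statistic should come out slightly smaller here. The conclusion is not expected to change.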


# 4

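The 95% confidence interval for the difference in callback proportions (white-sounding minus black-sounding names), roughly 0.017 to 0.047, does not contain 0, which indicates that the two callback rates are not the same. The p-value of about 3.9e-05 is far below 0.05, so we reject the null hypothesis of equal proportions. In context, resumes with white-sounding names received a callback about 9.7% of the time, compared with about 6.4% for resumes with black-sounding names, a gap of about 3.2 percentage points.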

# 5

This analysis shows only that race, as signaled by the name, is one factor that affects resume callbacks. It does not show that race is the most important factor. To find out which factor matters most, a more in-depth analysis is required, such as examining how the callback outcome relates to the other variables in the dataset (for example `education`, `yearsexp`, or `honors`).
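One simple way to start such a comparison is to compute the callback-rate gap for each binary attribute and see how the gaps compare. The sketch below illustrates the idea only. The helper `callback_rate_gap` is hypothetical, and the rows are toy values invented for illustration, not taken from the study data.

```python
def callback_rate_gap(rows, attribute):
    """Callback rate where `attribute` is 1 minus the rate where it is 0."""
    with_attr = [r["call"] for r in rows if r[attribute] == 1]
    without_attr = [r["call"] for r in rows if r[attribute] == 0]
    return sum(with_attr) / len(with_attr) - sum(without_attr) / len(without_attr)

# Toy rows invented purely for illustration; they are NOT from the study data.
toy_rows = [
    {"honors": 1, "volunteer": 0, "call": 1},
    {"honors": 1, "volunteer": 1, "call": 0},
    {"honors": 0, "volunteer": 1, "call": 0},
    {"honors": 0, "volunteer": 0, "call": 0},
]

for attribute in ("honors", "volunteer"):
    print(attribute, callback_rate_gap(toy_rows, attribute))
```

On the notebook's DataFrame, a similar per-group comparison could be obtained with `data.groupby('honors')['call'].mean()`. A raw gap like this does not control for the other variables, so it is a first look and not a ranking of importance.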