# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet




In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [7]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [8]:
data.head()

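| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |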


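1) A two-sample test of the difference in callback rates between white-sounding and black-sounding names is appropriate. The CLT applies because each group is large (2435 résumés, with 235 and 157 callbacks respectively), so each sample callback rate is approximately normally distributed. Because names are assigned at random to otherwise equivalent résumés, any difference in callback rates is due to either the name or chance.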
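A common rule of thumb for the normal approximation to a sample proportion is that each group should contain at least about 10 callbacks and 10 non-callbacks. The sketch below checks this using the counts reported in this notebook (2435 résumés per group, 235 and 157 callbacks). The helper name `normal_approx_ok` and the threshold of 10 are assumptions made for illustration, not part of the original analysis.

```python
# Counts taken from the notebook outputs.
n_w, calls_w = 2435, 235   # white-sounding names
n_b, calls_b = 2435, 157   # black-sounding names


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


clt_ok = normal_approx_ok(calls_w, n_w) and normal_approx_ok(calls_b, n_b)
print("Normal approximation reasonable for both groups:", clt_ok)
```

Both groups clear the threshold by a wide margin, which supports treating the sample callback rates as approximately normal.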

2) The null hypothesis is that the name does not matter, so the callback rates for white-sounding and black-sounding names are equal. The alternative hypothesis is that the callback rates differ depending on the name.
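Stated in symbols, the two hypotheses can be written as below. The notation is an assumption introduced here: $p_w$ and $p_b$ denote the true callback rates for white-sounding and black-sounding names.

```latex
H_0:\; p_w - p_b = 0
\qquad\text{versus}\qquad
H_A:\; p_w - p_b \neq 0
```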

In [21]:
w_count = data[data.race=='w']['call'].count()
w_count

2435

In [22]:
w_calls = data[data.race=='w']['call'].sum()
w_calls

235.0

In [26]:
w_rate = w_calls/w_count
w_rate *100.0 #white name call back rate

9.6509240246406574

In [27]:
b_count = data[data.race=='b']['call'].count()
b_count

2435

In [28]:
b_calls = data[data.race=='b']['call'].sum()
b_calls

157.0

In [30]:
b_rate = b_calls/b_count
b_rate*100.0 #black name call back rate

6.4476386036960989

In [36]:
# relative difference in callback rate, as a percentage of the white-name rate
(w_rate-b_rate)/w_rate *100.0

33.191489361702125

In [48]:
import math
w_err=math.sqrt((w_rate*(1-w_rate))/w_count)
w_err *100.0 #std err < 1%

0.5984072178128066

In [49]:
b_err=math.sqrt((b_rate*(1-b_rate))/b_count)
b_err *100.0 #std err < 1%

0.4977121442811946

In [50]:
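import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    a = 1.0*np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    # half-width: standard error times the two-sided t critical value
    # (public ppf method instead of the private _ppf)
    h = se * stats.t.ppf((1+confidence)/2., n-1)
    return m, m-h, m+h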

mean_confidence_interval(data[data.race=='w']['call'])

(0.096509241, 0.084772429031432048, 0.1082460521039363)

In [51]:
mean_confidence_interval(data[data.race=='b']['call'])

(0.064476386, 0.054714549604450616, 0.074238221820796577)

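The 95% confidence interval corresponds to a margin of error of about +/- 1.2 percentage points on the white-name callback rate (8.5% to 10.8%) and +/- 1.0 percentage points on the black-name callback rate (5.5% to 7.4%). The two intervals do not overlap.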
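Since the question is about the gap between the two groups, a margin of error and confidence interval for the difference in callback rates can also be useful. The following is a minimal sketch under the normal approximation, using the counts reported in this notebook and an unpooled standard error. It uses the standard library's `statistics.NormalDist` for the critical value, and the variable names are illustrative rather than part of the original analysis.

```python
import math
from statistics import NormalDist

# Counts taken from the notebook outputs.
n_w, calls_w = 2435, 235   # white-sounding names
n_b, calls_b = 2435, 157   # black-sounding names

p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se_diff = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Two-sided 95% critical value of the standard normal distribution.
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * se_diff
ci_low, ci_high = diff - margin, diff + margin

print(f"difference: {diff:.4f}, margin of error: {margin:.4f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

With these counts the interval lies entirely above zero, which is consistent with the two separate intervals not overlapping.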

In [53]:
# p value for independent t test
from scipy.stats import ttest_ind

cat1 = data[data['race']=='w']
cat2 = data[data['race']=='b']

ttest_ind(cat1['call'], cat2['call'])

Ttest_indResult(statistic=4.1147052908617514, pvalue=3.9408021031288859e-05)
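Because `call` is a 0/1 outcome, a two-proportion z-test is a natural cross-check on the t-test above. The sketch below is an alternative technique, not the method used in the cell above. It uses the counts reported in this notebook, a pooled proportion under the null hypothesis, and the standard library's `statistics.NormalDist`. Variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts taken from the notebook outputs.
n_w, calls_w = 2435, 235   # white-sounding names
n_b, calls_b = 2435, 157   # black-sounding names

p_w, p_b = calls_w / n_w, calls_b / n_b

# Under the null hypothesis both groups share one callback rate,
# so the standard error uses the pooled proportion.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

z = (p_w - p_b) / se_pool
# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, p-value = {p_value:.2e}")
```

With samples this large the z statistic should land close to the t statistic reported above, so the two tests lead to the same conclusion.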

4) With the other résumé characteristics randomized or held equal, the name shows a statistically significant effect on callback rate (t = 4.11, p = 3.9e-05). In this sample, for the same qualifications, a résumé with a white-sounding name received a callback about 10 times in 100 tries, whereas one with a black-sounding name received a callback only about 6 times in 100.

5) While race/name has a significant relative impact on callback rate (about 33%), the absolute change in the overall callback rate is small (about 3 per 100 résumés), so this analysis alone does not establish race/name as the most important factor. A next step is to examine the other columns in the data to see which factors lead to a higher callback rate for either group. Presumably, adding a name/race-independent résumé score would also improve an analysis along these lines.
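One simple way to start examining other factors is to tabulate the callback rate for each value of a factor, split by race. The sketch below is illustrative only: `callback_rate_by` is a hypothetical helper, and the `toy` rows are invented purely to show the shape of the output. They are not taken from the experiment. The column names `race`, `call`, and `honors` match those shown in `data.head()`.

```python
import pandas as pd


def callback_rate_by(df, factor):
    """Mean of `call` for each value of `factor`, one column per race."""
    return df.groupby([factor, 'race'])['call'].mean().unstack('race')


# Toy rows, invented for illustration (NOT real experiment data).
toy = pd.DataFrame({
    'race':   ['b', 'w', 'b', 'w', 'b', 'w', 'b', 'w'],
    'honors': [0, 0, 0, 0, 1, 1, 1, 1],
    'call':   [0, 0, 0, 1, 0, 1, 1, 1],
})

rates = callback_rate_by(toy, 'honors')
print(rates)
```

On the real dataset the same helper could be called as `callback_rate_by(data, 'honors')`, or with any other column of interest, to compare how strongly each factor moves the callback rate within each group.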