
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [4]:
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

In [5]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [6]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [7]:
data.columns

Index([u'id', u'ad', u'education', u'ofjobs', u'yearsexp', u'honors',
       u'volunteer', u'military', u'empholes', u'occupspecific', u'occupbroad',
       u'workinschool', u'email', u'computerskills', u'specialskills',
       u'firstname', u'sex', u'race', u'h', u'l', u'call', u'city', u'kind',
       u'adid', u'fracblack', u'fracwhite', u'lmedhhinc', u'fracdropout',
       u'fraccolp', u'linc', u'col', u'expminreq', u'schoolreq', u'eoe',
       u'parent_sales', u'parent_emp', u'branch_sales', u'branch_emp', u'fed',
       u'fracblack_empzip', u'fracwhite_empzip', u'lmedhhinc_empzip',
       u'fracdropout_empzip', u'fraccolp_empzip', u'linc_empzip', u'manager',
       u'supervisor', u'secretary', u'offsupport', u'salesrep', u'retailsales',
       u'req', u'expreq', u'comreq', u'educreq', u'compreq', u'orgreq',
       u'manuf', u'transcom', u'bankreal', u'trade', u'busservice',
       u'othservice', u'missind', u'ownership'],
      dtype='object')

In [8]:
df=data.groupby(['call','race']).size()

In [9]:
df1=df.unstack('race')

# Question 1. Does the CLT apply? What test is appropriate?
#### Answer: Because 'call' takes only the values 1 and 0, it is a Bernoulli variable rather than a normally distributed one, and the data are counts. A 2x2 contingency test could be appropriate, but the column totals are fixed by the experimental design, so this is not simple count data sampled from a population. Instead, one can quantify the uncertainty around the proportion of callbacks in each group and compare the two proportions. If one were to sample these proportions repeatedly, the CLT implies that, for samples this large, the sample proportions would be approximately normally distributed around the true proportion. Thus, we can use standard statistical tests based on the sampling distributions of the proportions and of their difference.
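
A quick way to sanity-check the normal approximation is the common rule of thumb that each group should contain at least 10 successes and 10 failures. The threshold of 10 and the helper name `normal_approx_ok` are assumptions made for illustration; the counts are the ones shown in the contingency table later in this notebook.

```python
# Counts taken from the notebook's contingency table (callbacks per group).
n_per_group = 2435
callbacks = {"b": 157, "w": 235}


def normal_approx_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Hypothetical helper applied to both groups.
checks = {race: normal_approx_ok(k, n_per_group) for race, k in callbacks.items()}
print(checks)
```

Both groups clear the threshold comfortably, which is consistent with treating the sample proportions as approximately normal.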

# Question 2. What are the null and alternate hypotheses?
#### Answer: H<sub>0</sub> is that the two proportions are the same and, thus, that their difference equals 0.  
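#### H<sub>0</sub>: P1 - P2 == 0 ; H<sub>1</sub>: P1 - P2 != 0.
Here P1 and P2 are the callback proportions for white-sounding and black-sounding names, respectively, matching the order used when the difference is computed in the code. Let's refer to the difference P1 - P2 as the test statistic.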

In [10]:
df1       # Here are the counts of the cases in their categories

| call | race = b | race = w |
|------|----------|----------|
| 0.0  | 2278     | 2200     |
| 1.0  | 157      | 235      |


In [11]:
sums=df1.apply(np.sum,axis=0)
tot=np.sum(df1.apply(np.sum,axis=0))
freq=df1.apply(np.sum,axis=1)/tot
sums

race
b    2435
w    2435
dtype: int64

In [12]:
props=df1/sums    # Proportions of resumes with black-sounding (b) and white-sounding (w) names that were called back (call==1) or not (call==0)
props

| call | race = b | race = w |
|------|----------|----------|
| 0.0  | 0.935524 | 0.903491 |
| 1.0  | 0.064476 | 0.096509 |


## 3. Finding the standard error of the test statistic, and generating the 95% confidence interval (2.5% and 97.5% limits)

#### Here are the variances of each of the sampling distributions of the two proportions in the bottom row of the table just above.

In [30]:
vars=(props.iloc[0,:]*props.iloc[1,:])/sums  # Here are the variances of sampling distributions of each of the two
vars                                         # proportions.

race
b    0.000025
w    0.000036
dtype: float64
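
For reference, the quantity computed above is the usual plug-in estimate of the variance of a sample proportion. The symbols $\hat{p}_g$ (observed callback proportion in group $g$) and $n_g$ (number of resumes in group $g$) are notation introduced here for illustration, not names used in the notebook:

$$
\widehat{\mathrm{Var}}(\hat{p}_g) = \frac{\hat{p}_g\,(1-\hat{p}_g)}{n_g}, \qquad g \in \{b, w\}
$$

For example, for group b this gives $0.064476 \times 0.935524 / 2435 \approx 0.000025$, matching the output above.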

#### The variances of the sampling distributions of the two proportions can be summed, and the square root taken, to give the standard error of the test statistic, 'sd' below

In [31]:
sd=np.sqrt(vars.sum())  
sd

0.0077833705866767553

##### Here is the distance from the observed value of P1-P2 that spans 95% of the test statistic's sampling distribution (the margin of error). One multiplies the z value that leaves 2.5% in each tail of the normal distribution (1.96) by the standard error of the test statistic.

In [32]:
d= 1.96*sd  
d

0.01525540634988644
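
The multiplier 1.96 need not be typed in by hand. As a minimal sketch, it can be looked up from the standard normal distribution with `scipy.stats.norm.ppf`; the variable names `confidence` and `z_crit` are illustrative only.

```python
from scipy import stats

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
# of the standard normal distribution.
confidence = 0.95
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
print(round(z_crit, 4))
```

Changing `confidence` to, say, 0.99 gives the multiplier for a 99% interval without consulting a table.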

####  Of course, we need the observed value of the test statistic

In [27]:
test_stat=props.iloc[1,1] - props.iloc[1,0]  # The calculation of the observed value of the test statistic
test_stat

0.032032854209445585

#### Subtracting 'd' from, and adding it to, test_stat gives the lower and upper limits of the 95% confidence interval

In [28]:
lower= test_stat-d
upper= test_stat+d
print "The lower limit is ", lower
print "The upper limit is ", upper

The lower limit is  0.0167774478596
The upper limit is  0.0472882605593
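
The confidence-interval steps above can be collected into one function that works directly from the counts. This is a sketch under the same normal approximation; the function name `diff_proportion_ci` and its argument names are hypothetical, and it uses the exact normal quantile rather than the rounded 1.96, so the limits differ from the ones above only in the sixth decimal place or so.

```python
import math

from scipy import stats


def diff_proportion_ci(k1, n1, k2, n2, confidence=0.95):
    """Normal-approximation CI for p2 - p1 using the unpooled standard error."""
    p1, p2 = k1 / float(n1), k2 / float(n2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


# Counts from the notebook's table: 157 of 2435 ('b'), 235 of 2435 ('w').
ci_lower, ci_upper = diff_proportion_ci(157, 2435, 235, 2435)
print("{:.4f} {:.4f}".format(ci_lower, ci_upper))
```

Because the interval excludes zero, it already suggests that the two callback rates differ.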


#### Conducting a test of statistical significance means determining the sampling distribution of the test statistic under the assumption of the null hypothesis.

#### The best estimate of the proportion called back under H0 is the pooled proportion, which here equals the mean of the two proportions because the two groups are the same size

In [38]:
df1a=df1.apply(np.sum,axis=1)
print df1a


call
0.0    4478
1.0     392
dtype: int64


In [45]:
p_null=df1a.loc[1.0]/float(df1a.sum())  
p_null

0.080492813141683772

#### Here is the standard error of the sampling distribution of the test statistic under H<sub>0</sub>

In [54]:
se=np.sqrt((2*p_null*(1-p_null))/sums.iloc[0])
se

0.0077968940361704568

#### We want to conduct a 2-tailed test of whether the difference between the test statistic and its value assumed under H0 (i.e. zero) is rare under H0. So we want to find the z value corresponding to the observed difference, given the sampling distribution of the test statistic under H0. This is straightforward now that we have the standard error of the test statistic under H<sub>0</sub>: we simply divide the difference between the test statistic and its expected value under H<sub>0</sub> (which is zero) by the standard error we just calculated. For a 2-tailed test at the 5% level (2.5% in each tail of the normal distribution), the critical value of z is 1.96.

In [50]:
z=test_stat/se     # z value of the observed difference under H0
z

4.1084121524343464
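
The same z value can be reproduced from the raw counts alone. The sketch below assumes the pooled-proportion standard error described above; the function name `two_proportion_z` and the variable `z_check` are hypothetical names chosen for illustration.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for H0: p1 == p2, using the pooled proportion for the SE."""
    p1, p2 = k1 / float(n1), k2 / float(n2)
    p_pool = (k1 + k2) / float(n1 + n2)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n1 + 1.0 / n2))
    return (p2 - p1) / se_pool


# Counts from the notebook's table: 157 of 2435 ('b'), 235 of 2435 ('w').
z_check = two_proportion_z(157, 2435, 235, 2435)
print(round(z_check, 4))
```

Writing the standard error with `1/n1 + 1/n2` keeps the function usable when the two groups are not the same size.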

#### We see that the observed value of the test statistic lies more than four standard errors from zero (its expected value under H<sub>0</sub>), far beyond the critical value of 1.96. We therefore reject the null hypothesis and conclude that the proportion of applicants with African-American-sounding names who get called back differs from the proportion of applicants with white-sounding names.

#### To get the p-value, one has to double one minus the value given by the CDF of the normal distribution, because we want the probability that lies in both tails of the distribution. One minus the CDF value gives only the proportion in the right-hand tail.

In [61]:
P= 2*(1-stats.norm.cdf(4.1,loc=0,scale=1))   # 4.1 is the observed z value, rounded to one decimal place
print "The value of P is ",P

 The value of P is  4.1315013825e-05
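
As a hedged alternative, the p-value can be computed from the unrounded z value with the survival function `scipy.stats.norm.sf`, which returns the upper-tail probability directly and avoids subtracting a number very close to 1 from 1. The names `z_obs` and `p_two_sided` are illustrative; the result is slightly smaller than the value above because z is not rounded down to 4.1.

```python
from scipy import stats

z_obs = 4.1084121524343464  # unrounded z value reported earlier in the notebook

# sf(z) is the upper-tail probability 1 - cdf(z); doubling it covers both tails.
p_two_sided = 2 * stats.norm.sf(abs(z_obs))
print(p_two_sided)
```

Taking `abs(z_obs)` makes the same line work when the observed difference is negative.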


## 4. Discuss statistical significance
#### Just briefly, this result is highly statistically significant. A p-value of about 0.00004 means that a difference in callback proportions at least this large would be very unlikely to arise by chance if the name had no effect, given 2,435 resumes in each of the 'b' and 'w' groups. In fact, the probability of such a result under H<sub>0</sub> is about 4 in 100,000. This result supports the contention that racial bias exists in the callback decisions for this sample of job applications.
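
As a cross-check on the z-test, the 2x2 contingency approach mentioned under Question 1 can be run with `scipy.stats.chi2_contingency`. This is a sketch rather than part of the original analysis: with the continuity correction switched off, the chi-square statistic for a 2x2 table equals the square of the pooled two-proportion z statistic, so the two approaches should agree. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Rows: no callback / callback; columns: 'b' / 'w' (counts from the notebook).
table = np.array([[2278, 2200],
                  [157, 235]])

# correction=False turns off the Yates continuity correction so that the
# statistic equals the square of the pooled two-proportion z statistic.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)
print("chi2={:.4f}, p={:.3g}, dof={}".format(chi2, p_chi2, dof))
```

The resulting p-value matches the two-sided p-value from the unrounded z statistic, up to floating-point error.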