
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
# number of callbacks for white-sounding names (race is either 'b' or 'w')
sum(data[data.race!='b'].call)

235.0

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1,0,1,0,0,0,0,0,0,
1,b,1,3,3,6,0,1,1,0,316,...,1,0,1,0,0,0,0,0,0,
2,b,1,4,1,6,0,0,0,0,19,...,1,0,1,0,0,0,0,0,0,
3,b,1,3,4,6,0,1,0,1,313,...,1,0,1,0,0,0,0,0,0,
4,b,1,3,3,22,0,0,0,0,313,...,1,1,0,0,0,0,0,1,0,Nonprofit


In [9]:
data.columns

Index([u'id', u'ad', u'education', u'ofjobs', u'yearsexp', u'honors',
       u'volunteer', u'military', u'empholes', u'occupspecific', u'occupbroad',
       u'workinschool', u'email', u'computerskills', u'specialskills',
       u'firstname', u'sex', u'race', u'h', u'l', u'call', u'city', u'kind',
       u'adid', u'fracblack', u'fracwhite', u'lmedhhinc', u'fracdropout',
       u'fraccolp', u'linc', u'col', u'expminreq', u'schoolreq', u'eoe',
       u'parent_sales', u'parent_emp', u'branch_sales', u'branch_emp', u'fed',
       u'fracblack_empzip', u'fracwhite_empzip', u'lmedhhinc_empzip',
       u'fracdropout_empzip', u'fraccolp_empzip', u'linc_empzip', u'manager',
       u'supervisor', u'secretary', u'offsupport', u'salesrep', u'retailsales',
       u'req', u'expreq', u'comreq', u'educreq', u'compreq', u'orgreq',
       u'manuf', u'transcom', u'bankreal', u'trade', u'busservice',
       u'othservice', u'missind', u'ownership'],
      dtype='object')

In [10]:
data.groupby(['education','race']).call.mean().unstack('race')

education,race=b,race=w
0,0.107143,0.0
1,0.090909,0.055556
2,0.068182,0.112676
3,0.058824,0.107212
4,0.064773,0.093463


In [20]:
data.groupby(['race']).call.sum()

race
b    157
w    235
Name: call, dtype: float32

In [18]:
data.groupby(['race']).call.describe()

race       
b     count    2435.000000
      mean        0.064476
      std         0.245649
      min         0.000000
      25%         0.000000
      50%         0.000000
      75%         0.000000
      max         1.000000
w     count    2435.000000
      mean        0.096509
      std         0.295346
      min         0.000000
      25%         0.000000
      50%         0.000000
      75%         0.000000
      max         1.000000
dtype: float64

In [23]:
# Sample callback proportions for white- and black-sounding names
p_w = data[data.race=='w'].call.mean()
p_b = data[data.race=='b'].call.mean()
# Estimated variance of each sample proportion: sample variance divided by group size
sig2_w = data[data.race=='w'].call.var()/data[data.race=='w'].call.count()
sig2_b = data[data.race=='b'].call.var()/data[data.race=='b'].call.count()
print p_w, p_b, sig2_w, sig2_b

0.0965092405677 0.0644763857126 3.58229851086e-05 2.47817891946e-05
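The z-based interval and test that follow rely on the normal approximation to the sampling distribution of each proportion, which is the CLT question posed in the exercise. A common rule of thumb, assumed here rather than taken from the notebook, is that each group should contain at least 10 observed successes and 10 observed failures. A minimal sketch using the counts reported above (157 and 235 callbacks out of 2,435 résumés each); the helper name `normal_approx_ok` is hypothetical:

```python
# Counts taken from the notebook outputs above.
n_b, n_w = 2435, 2435          # resumes per group
calls_b, calls_w = 157, 235    # callbacks per group

MIN_COUNT = 10  # rule-of-thumb threshold (an assumption, not from the notebook)


def normal_approx_ok(successes, n, min_count=MIN_COUNT):
    """Return True if both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= min_count and failures >= min_count


clt_ok_b = normal_approx_ok(calls_b, n_b)
clt_ok_w = normal_approx_ok(calls_w, n_w)
print("black-sounding: %s, white-sounding: %s" % (clt_ok_b, clt_ok_w))
```

Both groups clear the threshold comfortably, so a two-sample z-test for proportions is a reasonable choice here.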


In [24]:
# Sampling distribution of the difference in proportions (white minus black)
mu_diff = p_w-p_b
sig_diff = (sig2_w+sig2_b)**0.5
print mu_diff, sig_diff

0.0320328548551 0.00778490682688


In [27]:
stats.norm.interval(0.95, loc=mu_diff, scale=sig_diff)

(0.016774717851368581, 0.047290991858752573)

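The 95% confidence interval for the difference in callback proportions (white-sounding minus black-sounding names) runs from about 0.017 to 0.047: a point estimate of 0.032 with a margin of error of about 0.015. The interval lies entirely above zero.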
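The interval can also be reproduced from the summary counts alone, which makes the margin of error explicit. A minimal sketch, assuming the plain binomial variance p(1-p)/n rather than the notebook's `.var()` (which divides by n-1), so the last digits differ slightly; the variable names are illustrative:

```python
from scipy import stats

# Counts taken from the notebook outputs above.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

p_b = calls_b / float(n_b)
p_w = calls_w / float(n_w)
diff = p_w - p_b

# Unpooled standard error of the difference in proportions
# (binomial variance; an assumption, see the note above).
se_diff = (p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b) ** 0.5

# Two-sided 95% critical value and the resulting margin of error.
z_crit = stats.norm.ppf(0.975)
margin = z_crit * se_diff
ci = (diff - margin, diff + margin)

print("difference = %.4f, margin of error = %.4f" % (diff, margin))
print("95%% CI = (%.4f, %.4f)" % ci)
```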

Set up a hypothesis test to determine whether the callback rate differs between candidates with white-sounding and black-sounding names.

Null Hypothesis - H0: There is no difference in callback proportion between candidates with white-sounding and black-sounding names

Alternate Hypothesis - H1: There is a difference in callback proportion between candidates with white-sounding and black-sounding names

Find how likely the observed difference is under the sampling distribution of the difference, assuming the null hypothesis is true.
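In symbols, the test can be written as follows. The notation is introduced here for illustration and is not the notebook's own: $x_w, x_b$ are the callback counts, $n_w, n_b$ the group sizes, $\hat p_w, \hat p_b$ the sample proportions, and $\hat p$ the pooled proportion used under the null hypothesis.

```latex
H_0 : p_w = p_b, \qquad H_1 : p_w \neq p_b

\hat p = \frac{x_w + x_b}{n_w + n_b}, \qquad
z = \frac{\hat p_w - \hat p_b}
         {\sqrt{\hat p \,(1 - \hat p)\left(\dfrac{1}{n_w} + \dfrac{1}{n_b}\right)}}
```

With equal group sizes $n_w = n_b = n$, the denominator reduces to $\sqrt{2\,\hat p\,(1-\hat p)/n}$.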

In [31]:

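# Under the null hypothesis, the difference in proportions has mean 0
mu_diff_h0 = 0
# Pooled callback proportion across both groups
p_hat = data.call.mean()
print p_hat , data.call.count()
# Standard error of the difference under H0 (both groups have the same size)
sig_diff_h0 = (2*p_hat*(1-p_hat)/data[data.race=='w'].call.count())**0.5
# z-score of the observed difference under the null hypothesis
z_score = (mu_diff-mu_diff_h0)/sig_diff_h0
print z_score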

0.0804928168654 4870
4.10841214853


The z-score is about 4.1. What is the minimum |z| needed to reject the null hypothesis? For a two-sided test at a significance level (alpha) of 5%, the critical value is the 97.5th percentile of the standard normal distribution:

In [32]:
stats.norm.ppf(1-0.025)

1.959963984540054

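The critical z-value is 1.96. The observed z-score of 4.1 lies far beyond it, so a difference this large would be very unlikely if the null hypothesis were true. We therefore reject the null hypothesis and **conclude that there is racial discrimination**: because names were assigned to the résumés at random, the gap in callbacks can be attributed to the name, and résumés with white-sounding names are more likely to receive a callback.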
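The exercise also asks for a p-value, which the comparison against 1.96 leaves implicit. A minimal sketch, assuming a two-sided test and using only the counts reported earlier (157 and 235 callbacks out of 2,435 résumés per group). The chi-square test without continuity correction is included as a cross-check that is not part of the original analysis; for a 2x2 table its statistic equals the square of the pooled z-score.

```python
from scipy import stats

# Counts taken from the notebook outputs above.
n = 2435                       # resumes per group
calls_b, calls_w = 157, 235    # callbacks per group

p_b = calls_b / float(n)
p_w = calls_w / float(n)

# Pooled proportion and standard error under the null hypothesis.
p_pool = (calls_b + calls_w) / float(2 * n)
se_h0 = (2 * p_pool * (1 - p_pool) / n) ** 0.5

z = (p_w - p_b) / se_h0
# Two-sided p-value: probability of |Z| at least this large under H0.
p_value = 2 * stats.norm.sf(abs(z))

# Cross-check (an addition, not the notebook's method): chi-square test
# on the 2x2 table of callbacks vs. no callbacks, without Yates correction.
table = [[calls_b, n - calls_b],
         [calls_w, n - calls_w]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print("z = %.3f, two-sided p-value = %.2e" % (z, p_value))
print("chi2 = %.3f, p-value = %.2e" % (chi2, p_chi2))
```

The p-value is well below 0.05, which agrees with the rejection of the null hypothesis above.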