# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

In [27]:
import pandas as pd
import numpy as np
from scipy import stats
import math

In [28]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [59]:
# number of callbacks for résumés with black-sounding names
sum_bs_call=sum(data[data.race=='b'].call)
print("Total call-backs for black-sounding names: "+ str(sum_bs_call))
# number of callbacks for résumés with white-sounding names
sum_ws_call=sum(data[data.race=='w'].call)
print("Total call-backs for white-sounding names: "+ str(sum_ws_call))
total_app=len(data.race)
print("Total number of applicants: "+ str(total_app))

total_bs_names=len(data[data.race=='b'])
print("Total black-sounding names: "+ str(total_bs_names))

total_ws_names=len(data[data.race=='w'])
print("Total white-sounding names: "+ str(total_ws_names))

# call-back rate for each group
p_bs=sum_bs_call/total_bs_names
print("Proportion of black-sounding names that got a call-back: " + str(p_bs))

p_ws=sum_ws_call/total_ws_names
print("Proportion of white-sounding names that got a call-back: " + str(p_ws))

# pooled call-back rate across both groups
sum_total_call=sum_bs_call+sum_ws_call
print("Total number of call-backs for both black and white sounding names: " + str(sum_total_call))

p_total=sum_total_call/total_app
print("Total proportion of call-backs for both black and white sounding names: " + str(p_total))



Total call-backs for black-sounding names: 157.0
Total call-backs for white-sounding names: 235.0
Total number of applicants: 4870
Total black-sounding names: 2435
Total white-sounding names: 2435
Proportion of black-sounding names that got a call-back: 0.06447638603696099
Proportion of white-sounding names that got a call-back: 0.09650924024640657
Total number of call-backs for both black and white sounding names: 392.0
Total proportion of call-backs for both black and white sounding names: 0.08049281314168377
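
The per-group counts and rates printed above can also be produced in one step with a pandas `groupby`. The sketch below is only an illustration: because the `.dta` file is not bundled here, it rebuilds a stand-in frame (`toy`, a hypothetical name) from the counts printed above rather than loading the real data.

```python
import pandas as pd

# Stand-in frame rebuilt from the counts printed above (not the original .dta file).
n_per_group = 2435
calls = {"b": 157, "w": 235}

rows = []
for race, n_calls in calls.items():
    rows += [(race, 1)] * n_calls + [(race, 0)] * (n_per_group - n_calls)
toy = pd.DataFrame(rows, columns=["race", "call"])

# One row per group: number of call-backs, number of résumés, call-back rate.
summary = toy.groupby("race")["call"].agg(callbacks="sum", resumes="count", rate="mean")
print(summary)
```

With the real dataset loaded, the same `groupby` line should work on `data` directly, assuming the `race` and `call` columns described earlier.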


In [30]:
data.head(10)

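| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |
| 5 | b | 1 | 4 | 2 | 6 | 1 | 0 | 0 | 0 | 266 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Private |
| 6 | b | 1 | 4 | 2 | 5 | 0 | 1 | 0 | 0 | 13 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Private |
| 7 | b | 1 | 3 | 4 | 21 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |
| 8 | b | 1 | 4 | 3 | 3 | 0 | 0 | 0 | 0 | 316 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Private |
| 9 | b | 1 | 4 | 2 | 6 | 0 | 1 | 0 | 0 | 263 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Private |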


In [31]:
data.describe()

| | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | occupbroad | workinschool | ... | educreq | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | ... | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 | 4870.0 |
| mean | 3.61848 | 3.661396 | 7.842916 | 0.052772 | 0.411499 | 0.097125 | 0.448049 | 215.637782 | 3.48152 | 0.559548 | ... | 0.106776 | 0.437166 | 0.07269 | 0.082957 | 0.03039 | 0.08501 | 0.213963 | 0.267762 | 0.154825 | 0.165092 |
| std | 0.714997 | 1.219126 | 5.044612 | 0.223601 | 0.492156 | 0.296159 | 0.497345 | 148.127551 | 2.038036 | 0.496492 | ... | 0.308866 | 0.496083 | 0.259649 | 0.275854 | 0.171677 | 0.278932 | 0.410141 | 0.442847 | 0.361773 | 0.371308 |
| min | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 267.0 | 4.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 4.0 | 4.0 | 9.0 | 0.0 | 1.0 | 0.0 | 1.0 | 313.0 | 6.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 4.0 | 7.0 | 44.0 | 1.0 | 1.0 | 1.0 | 1.0 | 903.0 | 6.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [32]:
w = data[data.race=='w']
b = data[data.race=='b']

1. What test is appropriate for this problem? Does CLT apply?
Ans: We are comparing call-back proportions between two groups and both samples are large, so a two-sided two-proportion z-test is appropriate. A confidence interval for the difference in proportions leads to the same conclusion.

Requirements for CLT:

1) The samples must be independent: black-sounding and white-sounding names were randomly assigned to otherwise similar
    résumés, so the two groups are random samples and are independent of each other.
    This requirement is satisfied.

2) The sample size must be 'big enough':

Each group should contain at least 10 successes and 10 failures.
The black-sounding group has 157 call-backs and 2278 non-call-backs; the white-sounding group has 235 and 2200.
The sample should also be no larger than 10% of the population it is drawn from.
With 2435 résumés in each group, we can assume this is far less than 10% of all résumés submitted to employers in the US.
This requirement is satisfied.
Therefore, CLT can be applied to our dataset.
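
A common rule of thumb for the normal approximation to a proportion is that each group has at least 10 successes and 10 failures. The sketch below checks that rule using the call-back counts printed earlier; the function name and the threshold of 10 are illustrative choices, not part of the original notebook.

```python
def enough_successes_and_failures(successes, n, minimum=10):
    """Rule-of-thumb check: at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

# (call-backs, résumés) per group, taken from the printed output above
groups = {"b": (157, 2435), "w": (235, 2435)}

for race, (callbacks, n) in groups.items():
    ok = enough_successes_and_failures(callbacks, n)
    print(f"{race}: successes={callbacks}, failures={n - callbacks}, ok={ok}")
```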


2. What are the null and alternate hypotheses?
Ans:
Null Hypothesis: The probability of getting a callback is the same for résumés with white-sounding names and résumés with black-sounding names.

    H0: p_w - p_b = 0

Alternative Hypothesis: The probability of getting a callback is not the same for résumés with white-sounding names and résumés with black-sounding names.

    H1: p_w - p_b ≠ 0


In [67]:
# Your solution to Q3 here
# Frequentist Approach
# 2-proportion Z-test (math and scipy.stats are imported in the first cell)
# numerator: observed difference in call-back proportions
upper=p_ws-p_bs
# denominator: standard error under the null, using the pooled proportion p_total
lower=math.sqrt(p_total*(1-p_total)*((1/total_bs_names)+(1/total_ws_names)))
z_value=upper/lower
print("z-value is : "+str(z_value))
p_value = stats.norm.sf(abs(z_value))*2 #twosided
print("p-value is : "+str(p_value))


z-value is : 4.108412152434346
p-value is : 3.983886837585077e-05
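
The z-test above can be wrapped in a small reusable function. This is a minimal sketch using only the standard library, with the counts from the printed output as inputs; the function name `two_proportion_ztest` is hypothetical, and the two-sided p-value is computed from the identity 2·(1 − Φ(|z|)) = erfc(|z|/√2).

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    # Standard error of (p1 - p2) under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# white-sounding: 235 of 2435; black-sounding: 157 of 2435
z, p = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.4f}, p = {p:.3g}")
```

The result should agree with the z-value and p-value printed above.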


In [72]:
# 95% confidence interval
prop_diff = p_ws - p_bs
print('Observed difference in proportions:  '+str(prop_diff))

z_crit = 1.96
# Note: this reuses the pooled standard error from the z-test. An interval for the
# difference is more commonly built from the unpooled standard error.
se_pooled = math.sqrt(p_total*(1-p_total)*((1/total_bs_names)+(1/total_ws_names)))
ci_high = prop_diff + z_crit*se_pooled
ci_low = prop_diff - z_crit*se_pooled
print('95% conf int: \t {} - {}'.format(ci_low, ci_high))
moe = (ci_high - ci_low)/2
print('Margin of err: \t +/-{}'.format(moe))

Observed difference in proportions:  0.032032854209445585
95% conf int: 	 0.01675094189855149 - 0.04731476652033968
Margin of err: 	 +/-0.015281912310894097
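
The interval above reuses the pooled standard error from the hypothesis test. A confidence interval for the difference is more often built from the unpooled standard error, since it does not assume the two proportions are equal. The sketch below shows that variant with the same counts; `diff_proportion_ci` is a hypothetical helper name, and the result is expected to be close to, but not identical with, the interval printed above.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Normal-approximation CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

# white-sounding: 235 of 2435; black-sounding: 157 of 2435
low, high = diff_proportion_ci(235, 2435, 157, 2435)
print(f"95% CI (unpooled SE): {low:.4f} to {high:.4f}")
```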


In [88]:
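# Bootstrap Approach
# Simulate the null hypothesis: both groups draw call-backs from the same pooled set

all_callbacks = np.array([True] * int(sum_total_call) + [False] * int(total_app-sum_total_call))

size=10000
bs_reps_diff = np.empty(size)

for i in range(size):
    w_bs_replicates = np.sum(np.random.choice(all_callbacks, size=total_ws_names))
    b_bs_replicates = np.sum(np.random.choice(all_callbacks, size=total_bs_names))
    
    # both groups have the same size, so one denominator serves for both
    bs_reps_diff[i] = (w_bs_replicates - b_bs_replicates)/total_bs_names

# one-sided p-value: share of simulated differences at least as large as the observed one
bs_p_value = np.sum(bs_reps_diff >= prop_diff) / len(bs_reps_diff)
print(" One-sided p-value under the null: " + str(bs_p_value))
# central 95% range of the simulated null differences (not a CI for the observed difference)
bs_ci = np.percentile(bs_reps_diff, [2.5, 97.5])
print(" 95% range of differences under the null: " + str(bs_ci))

 One-sided p-value under the null: 0.0
 95% range of differences under the null: [-0.01519507  0.01560575]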
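
The simulation above draws both groups from the pooled call-backs, so its percentile range describes the null distribution rather than the observed difference. Two related resampling schemes are sketched below as an illustration: a permutation test (shuffle the group labels) for a two-sided p-value, and a bootstrap that resamples each group separately to get a percentile interval for the difference itself. The arrays are rebuilt from the printed counts, the seed is arbitrary, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235
observed = calls_w / n_w - calls_b / n_b

# Stand-in outcome arrays rebuilt from the counts (1 = call-back, 0 = none)
b_calls = np.array([1] * calls_b + [0] * (n_b - calls_b))
w_calls = np.array([1] * calls_w + [0] * (n_w - calls_w))
pooled = np.concatenate([w_calls, b_calls])

reps = 10000
perm_diffs = np.empty(reps)
boot_diffs = np.empty(reps)
for i in range(reps):
    # Permutation: shuffle labels, keeping the total number of call-backs fixed
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()
    # Bootstrap: resample each group separately, with replacement
    boot_w = rng.choice(w_calls, size=n_w, replace=True)
    boot_b = rng.choice(b_calls, size=n_b, replace=True)
    boot_diffs[i] = boot_w.mean() - boot_b.mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed gap
p_two_sided = np.mean(np.abs(perm_diffs) >= abs(observed))
boot_ci = np.percentile(boot_diffs, [2.5, 97.5])
print("Permutation p-value (two-sided):", p_two_sided)
print("Bootstrap 95% CI for the difference:", boot_ci)
```

If the bootstrap interval excludes zero, it points the same way as the z-test; the exact numbers depend on the seed and the number of replicates.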


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Question 4 : Write a story describing the statistical significance in the context of the original problem.
Answer: The analysis gives strong evidence that résumés with white-sounding names receive call-backs at a higher rate than otherwise identical résumés with black-sounding names. The observed call-back rates are about 9.65% versus 6.45%, a gap of about 3.2 percentage points. The z-test gives z ≈ 4.11 and a two-sided p-value of about 4e-05, so a gap this large would be very unlikely if names had no effect. In this sample, résumés with white-sounding names are approximately 50% more likely to receive a callback.
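
The "approximately 50% more likely" figure is a ratio of the two call-back rates, which is distinct from the absolute gap in percentage points. A short check using the counts printed earlier (variable names are illustrative):

```python
rate_w = 235 / 2435  # call-back rate, white-sounding names
rate_b = 157 / 2435  # call-back rate, black-sounding names

relative_rate = rate_w / rate_b  # how many times as likely
absolute_gap = rate_w - rate_b   # difference in proportions

print(f"relative rate: {relative_rate:.3f}")
print(f"absolute gap:  {absolute_gap:.4f}")
```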

Question 5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
Answer: No. The analysis above looks only at the name on the résumé, so it cannot say how name compares with other factors such as education or years of experience. It shows only that résumés with black-sounding names have a significantly lower call-back rate. For further analysis, I would repeat the same study within individual states to see which states contribute most to the overall gap in call-backs for black-sounding names in the US.
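
One way to start such a follow-up is to compute the call-back gap within each level of another column. The sketch below is an assumption-laden illustration: `callback_gap_by` is a hypothetical helper, the tiny `toy` frame is made-up demonstration data rather than the study data, and grouping by `education` is used only because that column appears in the table above. A per-state breakdown would need a state column, which is not shown in this notebook.

```python
import pandas as pd

def callback_gap_by(df, column):
    """Call-back rate per race within each level of `column`, plus the w - b gap."""
    rates = df.groupby([column, "race"])["call"].mean().unstack("race")
    rates["gap"] = rates["w"] - rates["b"]
    return rates

# Made-up demonstration data (NOT from the study), just to show the shape of the output
toy = pd.DataFrame({
    "race":      ["b", "w", "b", "w", "b", "w", "b", "w"],
    "education": [3, 3, 3, 3, 4, 4, 4, 4],
    "call":      [0, 1, 0, 0, 1, 1, 0, 1],
})

result = callback_gap_by(toy, "education")
print(result)
```

On the real data, the call would be `callback_gap_by(data, "education")`, assuming `call` is numeric as described in the Data section.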