# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats



In [2]:
# Load the Stata file through the public pandas reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
#Inspecting the shape of the data
data.shape

(4870, 65)

In [5]:
#Inspecting the first 20 columns
data.iloc[:,0:20].head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,firstname,sex,race,h,l
0,b,1,4,2,6,0,0,0,1,17,1,0,0,1,0,Allison,f,w,0.0,1.0
1,b,1,3,3,6,0,1,1,0,316,6,1,1,1,0,Kristen,f,w,1.0,0.0
2,b,1,4,1,6,0,0,0,0,19,1,1,0,1,0,Lakisha,f,b,0.0,1.0
3,b,1,3,4,6,0,1,0,1,313,5,0,1,1,1,Latonya,f,b,1.0,0.0
4,b,1,3,3,22,0,0,0,0,313,5,1,1,1,0,Carrie,f,w,1.0,0.0


In [6]:
#Inspecting the next 20 columns
data.iloc[:,20:40].head()

Unnamed: 0,call,city,kind,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,expminreq,schoolreq,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip
0,0.0,c,a,384.0,0.98936,0.0055,9.527484,0.274151,0.037662,8.706325,1.0,5,,1.0,,,,,,
1,0.0,c,a,384.0,0.080736,0.888374,10.408828,0.233687,0.087285,9.532859,0.0,5,,1.0,,,,,,
2,0.0,c,a,384.0,0.104301,0.83737,10.466754,0.101335,0.591695,10.540329,1.0,5,,1.0,,,,,,
3,0.0,c,a,384.0,0.336165,0.63737,10.431908,0.108848,0.406576,10.412141,0.0,5,,1.0,,,,,,
4,0.0,c,a,385.0,0.397595,0.180196,9.876219,0.312873,0.030847,8.728264,0.0,some,,1.0,9.4,143.0,9.4,143.0,0.0,0.204764


#Inspecting the next 20 columns
data.iloc[:,40:60].head()

In [7]:
#Inspecting the final columns
data.iloc[:,60:].head()

Unnamed: 0,trade,busservice,othservice,missind,ownership
0,0.0,0.0,0.0,0.0,
1,0.0,0.0,0.0,0.0,
2,0.0,0.0,0.0,0.0,
3,0.0,0.0,0.0,0.0,
4,0.0,0.0,1.0,0.0,Nonprofit


In [8]:
print(data.call.unique())
print(data.race.unique())
data.iloc[:,20:40].describe()

[ 0.  1.]
['w' 'b']


Unnamed: 0,call,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip
count,4870.0,4870.0,4784.0,4784.0,4784.0,4784.0,4784.0,4784.0,4870.0,4870.0,1672.0,1722.0,608.0,658.0,3102.0,1918.0
mean,0.080493,651.777832,0.310831,0.542772,10.147275,0.185674,0.213816,9.550801,0.719507,0.29117,587.686035,2287.051025,196.050522,755.416992,0.114765,0.079096
std,0.272079,388.690582,0.332473,0.329467,0.34578,0.081747,0.169305,0.557097,0.449287,0.454347,2907.629395,8902.84375,896.510864,1665.165039,0.318791,0.149742
min,0.0,1.0,0.0,0.004814,8.841738,0.0,0.030847,8.507345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,306.25,0.045275,0.252164,9.965053,0.139711,0.092559,9.220489,0.0,0.0,12.975,98.0,13.0,97.0,0.0,0.007125
50%,0.0,647.0,0.15995,0.571833,10.144078,0.190751,0.145053,9.438432,1.0,0.0,33.35,220.0,34.9,200.0,0.0,0.017404
75%,0.0,979.75,0.516854,0.873805,10.342871,0.238196,0.284315,9.668208,1.0,1.0,133.099998,700.0,86.699997,500.0,0.0,0.089956
max,1.0,1344.0,0.992043,0.981653,11.11929,0.356164,0.780124,11.07866,1.0,1.0,47947.601562,124500.0,10500.0,12208.0,1.0,0.98936


### 1. What test is appropriate for this problem? Does CLT apply?

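This resume dataset contains 4870 observations of two binary variables (race: white/black; call: call/no call). The population standard deviation is unknown, but each group contains 2435 resumes, with far more than 10 callbacks and 10 non-callbacks in each, so the sampling distribution of the difference in callback rates is approximately normal and the Central Limit Theorem applies. A two-tailed, two-sample test on the difference in callback rates is appropriate; the analysis below uses a two-tailed $T$-test, which at this sample size is practically equivalent to a $z$-test for two proportions.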
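
The normal-approximation condition can be checked directly with a common rule of thumb (at least 10 successes and 10 failures per group). This is a minimal sketch, not part of the original analysis: the counts are copied from the notebook's outputs, the 235 white callbacks are inferred from the overall callback rate rather than printed directly, and the helper name `normal_approx_ok` is hypothetical.

```python
# Counts taken from the notebook's outputs: 2435 resumes per group and
# 157 callbacks for black-sounding names. The 235 callbacks for
# white-sounding names are inferred (assumption): 4870 * 0.080493 - 157.
groups = {"b": (157, 2435), "w": (235, 2435)}


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# One True/False flag per group
clt_ok = {race: normal_approx_ok(calls, n) for race, (calls, n) in groups.items()}
print(clt_ok)
```

Both groups clear the threshold by a wide margin, which supports using a normal-based test here.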

### 2. What are the null and alternative hypotheses?

$H_0$: Race has no impact on call back rates.  
$H_a$: Race has an impact on call back rates.

Significance level ($\alpha$) and confidence level:  
$\alpha$ = 0.05  
Confidence level = $1 - \alpha$ = 0.95
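
Stated in terms of callback proportions, the hypotheses can be written as follows. The symbols $p_w$ and $p_b$ are labels introduced here (they do not appear in the original exercise) for the true callback rates of resumes with white-sounding and black-sounding names:

$$
H_0: p_w - p_b = 0 \qquad\qquad H_a: p_w - p_b \neq 0
$$

The alternative is two-sided, so a large difference in either direction counts as evidence against $H_0$.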

### 3. Compute margin of error, confidence interval, and p-value.

In [9]:
#Creating two data sets, 1 for black and 1 for whites.
b = data[data['race'] == 'b']
b_qty = len(b)
print('Total Number of resumes for blacks: ' + str(b.shape[0]))

w = data[data['race'] == 'w']
w_qty = len(w)
print('Total Number of resumes for whites: ' + str(w.shape[0]))

#Determining the probabilities of call back for blacks and whites.
w_prob = np.sum(w.call) / w_qty
b_prob = np.sum(b.call) / b_qty
print()
print('Probability of call for whites is %2.2f percent' %(w_prob * 100))
print('Probability of call for blacks is %2.2f percent' %(b_prob * 100))
print('Whites are %2.3f times more likely to get a callback than blacks.' %(w_prob/b_prob))



#Determining the percentage difference
prob_diff = w_prob - b_prob
print()
print('The percentage difference in callbacks is %2.3f percent' %(prob_diff * 100))


#Calculating t-statistic
t_stat, p = stats.ttest_ind(w.call, b.call)  #Unpacking the results.
#print('p-value: ' + str(p))

#Calculation of the standard error
w_var = np.var(w.call) #Determining the variance for whites
b_var = np.var(b.call) #Determining the variance for blacks
s_error = np.sqrt(w_var/w_qty + b_var/b_qty)

#Calculating the Margin of Error
m_error = s_error * 1.96 #With 95% confidence level, the z-score is 1.96
print('Margin of error: %2.4f' %(m_error * 100))

#Calculating the Confidence Interval
c_int = prob_diff + (np.array([-1,1]) * m_error)
print('Confidence Interval: ' + str(c_int))

#Two-tailed p-value from the normal approximation (reasonable at this sample size)
p_value = 2 * stats.norm.sf(abs(t_stat))
print()
print('T-statistic: %2.4f' %t_stat)
print('p_value: ' + str(p_value))

Total Number of resumes for blacks: 2435
Total Number of resumes for whites: 2435

Probability of call for whites is 9.65 percent
Probability of call for blacks is 6.45 percent
Whites are 1.497 times more likely to get a callback than blacks.

The percentage difference in callbacks is 3.203 percent
Margin of error: 1.5255
Confidence Interval: [ 0.01677757  0.04728814]

T-statistic: 4.1147
p_value: 3.87674401894e-05
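
As a cross-check on the figures above, the same quantities can be computed from the raw counts with a two-proportion $z$-test. This is a sketch under stated assumptions, not the notebook's own method: it uses only the standard library, a pooled standard error for the test statistic (so the statistic will differ slightly from the $T$-statistic above), and an unpooled standard error for the confidence interval. The white callback count of 235 is inferred from the overall callback rate rather than printed directly.

```python
import math

# Counts from the notebook's outputs; calls_w = 235 is inferred (assumption).
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# Test statistic: pooled standard error, since H0 says both rates are equal
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pool

# Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

# Confidence interval: unpooled standard error, 1.96 for a 95% level
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
margin = 1.96 * se_unpooled
ci = (diff - margin, diff + margin)

print("z = %.4f, p-value = %.3e" % (z, p_value))
print("margin of error = %.4f, 95%% CI = (%.4f, %.4f)" % (margin, ci[0], ci[1]))
```

The margin of error and interval from this sketch should agree with the values printed above, since both use the unpooled standard error with a multiplier of 1.96.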


### 4. Write a story describing the statistical significance in the context of the original problem.

The United States labor market is continually plagued by accusations of racial discrimination in its hiring processes. To address this issue directly, researchers randomly assigned identical résumés to black-sounding and white-sounding names and counted the number of callbacks received. A two-tailed $T$-test was then conducted on these data to test the null hypothesis that race has no impact on employer callback rates.

The findings indicate that, with an equal number of resumes assigned black-sounding and white-sounding names (2435 each), the resumes with white-sounding names had a higher callback rate: 9.65% versus 6.45%, a difference of 3.2 percentage points. In relative terms, resumes with white-sounding names were about 1.5 times as likely to receive a callback as those with black-sounding names. The margin of error on this difference is 1.53 percentage points, giving a 95% confidence interval of 1.68 to 4.73 percentage points.

The calculated $T$-statistic of 4.1147 is large, and the calculated $p$-value of 3.877e-05 is well below our alpha level of 0.05. We therefore reject the null hypothesis and conclude that race *does* influence callback rates from employers.


### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The analysis above does not show that race is the most important factor in callback success, because it considers only two variables, race and callbacks. It is therefore blind to other potentially influential variables such as education, qualifications or certifications, and experience, among several others. If these additional variables were accounted for in a more extensive analysis, a better determination could be made as to how much a callback is influenced by a candidate's race relative to other factors.
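
One simple way to start such an extended analysis is to compute the same callback-rate gap for several resume attributes and compare their sizes. The sketch below is illustrative only: the records are hypothetical toy rows (not rows from the study's dataset), `race_w` is a made-up 0/1 indicator for a white-sounding name, and `rate_gap` is a hypothetical helper. A fuller analysis would more likely fit a regression on all attributes at once.

```python
# Hypothetical toy records for illustration; NOT data from the study.
records = [
    {"race_w": 1, "honors": 1, "call": 1},
    {"race_w": 1, "honors": 0, "call": 1},
    {"race_w": 1, "honors": 1, "call": 0},
    {"race_w": 1, "honors": 0, "call": 0},
    {"race_w": 0, "honors": 1, "call": 1},
    {"race_w": 0, "honors": 0, "call": 0},
    {"race_w": 0, "honors": 1, "call": 0},
    {"race_w": 0, "honors": 0, "call": 0},
]


def rate_gap(rows, feature):
    """Callback rate where feature == 1 minus the rate where feature == 0."""
    with_f = [r["call"] for r in rows if r[feature] == 1]
    without_f = [r["call"] for r in rows if r[feature] == 0]
    return sum(with_f) / len(with_f) - sum(without_f) / len(without_f)


# Gap in callback rate associated with each binary attribute
gaps = {feature: rate_gap(records, feature) for feature in ("race_w", "honors")}
print(gaps)
```

Applied to the real data, comparing these gaps (each with its own confidence interval) would indicate whether race matters more or less than other attributes, which the two-variable test above cannot show.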