# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
white_cb = sum(data[data.race=='w'].call)
# number of callbacks for black-sounding names
black_cb = sum(data[data.race=='b'].call)

#print(white_cb)
#print(black_cb)

white_len = (len(data[data.race=='w']))
black_len = (len(data[data.race=='b']))

p_w = white_cb/white_len
p_b = black_cb/black_len

#n*p > 10 & n*(1-p) > 10 to be close to normal


print(p_w*white_len)
print((1-p_w)*white_len)
print(p_b*black_len)
print((1-p_b)*black_len)

#appears to be close to normal

#print(white_len)
#print(black_len)

print('Probability of success for "W" name: ' + str(round(white_cb/white_len,3)))
print('Probability of success for "B" name: ' + str(round(black_cb/black_len,3)))



235.0
2200.0
157.0
2278.0
Probability of success for "W" name: 0.097
Probability of success for "B" name: 0.064
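
The four counts printed above are the inputs to the usual rule of thumb for the normal approximation: each group needs at least 10 expected successes and 10 expected failures. A minimal sketch of that check, using only the standard library and the counts from the output above (the helper name `normal_approx_ok` and the default threshold of 10 are assumptions for illustration, not part of the original notebook):

```python
def normal_approx_ok(successes, n, threshold=10):
    """Return True when both successes (n*p) and failures (n*(1-p)) reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Counts from the output above: 235 of 2435 'w' resumes and 157 of 2435 'b' resumes got a call.
for label, calls, total in [("w", 235, 2435), ("b", 157, 2435)]:
    print(label, normal_approx_ok(calls, total))
```

Both groups pass comfortably, which is why a z-test on the proportions is a reasonable choice here.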


In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data.describe()


Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,...,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,...,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,...,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,...,0.308866,0.496083,0.259649,0.275854,0.171677,0.278932,0.410141,0.442847,0.361773,0.371308
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
data.shape

(4870, 65)

In [7]:
df=data[['race','call']]

In [8]:
print(df.tail(25))

     race  call
4845    w   0.0
4846    w   1.0
4847    w   1.0
4848    b   1.0
4849    b   0.0
4850    b   0.0
4851    w   0.0
4852    w   0.0
4853    b   0.0
4854    w   0.0
4855    w   0.0
4856    b   0.0
4857    b   0.0
4858    b   0.0
4859    b   1.0
4860    w   0.0
4861    w   1.0
4862    w   0.0
4863    w   0.0
4864    b   0.0
4865    b   0.0
4866    b   0.0
4867    w   0.0
4868    b   0.0
4869    w   0.0




<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

# Questions 1 and 2

In [9]:
w = data[data.race=='w']
b = data[data.race=='b']


In [None]:
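# The outcome is a callback rate (a proportion, not a mean), so a two-proportion z-test is appropriate.
# H0: the callback rate is the same for black-sounding and white-sounding names (p_w = p_b).
# Ha: the callback rates differ (p_w != p_b).
# CLT applies: as shown above, n*p and n*(1-p) both exceed 10 in each group,
# so the sampling distribution of each proportion is approximately normal.
# Race labels were assigned to resumes at random, so the observations can be treated as independent.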

# Question 3

Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
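
For reference, the frequentist quantities asked for here can be written out as follows. This is a sketch of the standard two-proportion z-test, where $x_w, x_b$ are the callback counts and $n_w, n_b$ the group sizes. The symbols are notation introduced here, not taken from the exercise text.

```latex
\begin{align}
\hat{p}_w &= \frac{x_w}{n_w}, \qquad \hat{p}_b = \frac{x_b}{n_b}, \qquad
\hat{p} = \frac{x_w + x_b}{n_w + n_b} \\
z &= \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}} \\
\text{p-value} &= 2\,\bigl(1 - \Phi(|z|)\bigr) \\
\text{MOE} &= z_{0.975}\,\sqrt{\frac{\hat{p}_w (1 - \hat{p}_w)}{n_w} + \frac{\hat{p}_b (1 - \hat{p}_b)}{n_b}}, \qquad
\text{CI} = (\hat{p}_w - \hat{p}_b) \pm \text{MOE}
\end{align}
```

The pooled proportion $\hat{p}$ is used for the test statistic because it is the best estimate of the common rate under the null hypothesis. The confidence interval usually uses the unpooled standard error, since it does not assume the two rates are equal.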

In [10]:
#Standard error of the pooled proportion: sqrt(p_hat*(1 - p_hat)/n), with n the total number of resumes
p_hat = (white_cb+black_cb)/(white_len+black_len)
stnd_error = np.sqrt((p_hat*(1-p_hat))/(white_len+black_len))
print(stnd_error)

0.0038984470180852284


In [11]:
print(round(p_b,3))
print(round(p_w,3))

0.064
0.097


In [None]:
#H0: there is no difference in the callback rate between races
#Ha: there is a difference in the callback rate between races
#Alpha is set to 0.05 (5% significance level)

In [12]:
#need the standard deviation of the difference between the two sample proportions
comb_sample_prop = (white_cb+black_cb)/(white_len+black_len) #combined (pooled) sample proportion
#print(comb_sample_prop)

#under H0 both groups share one rate, so the pooled proportion estimates each variance
var_w = (comb_sample_prop*(1-comb_sample_prop))/white_len
var_b = (comb_sample_prop*(1-comb_sample_prop))/black_len
std_wminusb= np.sqrt(var_w+var_b)
print(std_wminusb)
z_val = (p_w - p_b)/std_wminusb
print('z-value: ' + str(round(z_val,3)))
#two-sided p-value: abs() keeps it valid if the difference is negative
p_value = stats.norm.sf(abs(z_val))*2
#the p-value is tiny but not exactly zero; it only prints as 0.0 at three decimals
print('p-value: ' + str(round(p_value,3)))

#1.96 is the critical z-value for a 95% confidence level
MOE = 1.96 * std_wminusb 
print ('Margin of error: ' + str(round(MOE,3)))
print('95% confidence interval: ' + str(round(p_w-p_b,3)) + ' +- ' + str(round(MOE,3)))

0.007796894036170457
z-value: 4.108
p-value: 0.0
Margin of error: 0.015
95% confidence interval: 0.032 +- 0.015
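
The same calculation can be packaged as a reusable function. The sketch below is one possible arrangement, not the notebook's own method: it uses only the standard library (`statistics.NormalDist` in place of `scipy.stats.norm`), takes the counts printed earlier as inputs, and uses the unpooled standard error for the interval, a common convention that differs slightly from the pooled value used above. The function name `two_proportion_ztest` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2, confidence=0.95):
    """Two-sided z-test for a difference in proportions, plus a confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Pooled standard error: valid under H0, where both groups share one rate.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval (assumption: the usual convention).
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    moe = z_crit * se_unpooled
    return z, p_value, moe, (diff - moe, diff + moe)

# Callback counts and group sizes from the earlier output: 235/2435 ('w') and 157/2435 ('b').
z, p_value, moe, ci = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.3f}, p-value = {p_value:.2e}")
print(f"margin of error = {moe:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Printing the p-value in scientific notation avoids the misleading `0.0` that results from rounding to three decimals.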


In [13]:
def bootstrap_replicate_1d(data, func):
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    bs_replicates=np.empty(size)
    for i in range(size):
        bs_replicates[i]=bootstrap_replicate_1d(data,func)
    return bs_replicates

In [22]:
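#Resample under H0: both groups draw their callbacks from the pooled 'call' column
size = 10000 #number of bootstrap replicates
bs_replicates = np.empty(size)
for i in range(size):
    w_rate = np.mean(np.random.choice(df.call, size=white_len))
    b_rate = np.mean(np.random.choice(df.call, size=black_len))
    bs_replicates[i] = w_rate - b_rate

#print(bs_replicates)
#two-sided p-value: share of replicates at least as extreme as the observed difference
p_val_bs = np.sum(np.abs(bs_replicates)>=abs(p_w-p_b))/size
print('Bootstrap P Value: ' + str(p_val_bs))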

Bootstrap P Value: 0.0
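
The bootstrap can also produce a confidence interval for the difference in callback rates. The sketch below is a percentile bootstrap under stated assumptions: it rebuilds each group's 0/1 outcomes from the counts printed earlier (235 of 2435 and 157 of 2435), resamples each group separately with replacement, and uses only the standard library so it runs without the dataset. The function name, the number of replicates, and the seed are illustrative choices.

```python
import random

def bootstrap_diff_ci(x1, n1, x2, n2, n_reps=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the difference in two proportions."""
    rng = random.Random(seed)
    # Rebuild the 0/1 outcome lists from the counts.
    group1 = [1] * x1 + [0] * (n1 - x1)
    group2 = [1] * x2 + [0] * (n2 - x2)
    diffs = []
    for _ in range(n_reps):
        # Resample each group with replacement, keeping its original size.
        rate1 = sum(rng.choices(group1, k=n1)) / n1
        rate2 = sum(rng.choices(group2, k=n2)) / n2
        diffs.append(rate1 - rate2)
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[int(tail * n_reps)]
    upper = diffs[int((1 - tail) * n_reps) - 1]
    return lower, upper

ci_low, ci_high = bootstrap_diff_ci(235, 2435, 157, 2435)
print(f"bootstrap 95% CI for p_w - p_b: ({ci_low:.4f}, {ci_high:.4f})")
```

Unlike the p-value calculation, this resamples within each group rather than from the pooled data, because a confidence interval describes the observed difference rather than the null hypothesis.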


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Write a story describing the statistical significance in the context of the original problem.
Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

4.) As noted in the background, racial discrimination is prevalent throughout the world.  When analyzing callbacks for identical resumes in the USA, we found a statistically significant gap between the two pools: about 9.7% of resumes with 'white-sounding' names received a callback, versus about 6.4% of resumes with 'black-sounding' names.  If race/name had no effect, the probability of seeing a difference at least this large would be essentially 0, so we reject the null hypothesis.  Having a 'black-sounding' name was a deterrent for callbacks from employers.  Are there methods that can minimize the impact of racial discrimination in these situations, not just in the callback phase but in the whole process (assigning numbers to individuals instead of sharing names, not having a race box in a potential employee survey, etc.)?

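5.) No.  Assuming that the study was conducted fairly, the analysis shows that race/name has a statistically significant effect on callback success: the names were assigned at random to otherwise identical resumes, so the gap can be attributed to the name.  It does not show that race/name is the most important factor, because the effect of the other resume attributes (education, years of experience, honors, and so on) was never measured or compared.  To amend the analysis, those attributes could be examined alongside race, for example by comparing callback rates across them or by fitting a regression of 'call' on all of them.  It would also be interesting to look at ways to minimize the impact of race/names in the job-seeking process.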
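
One simple way to put the race gap in context is to compute the same callback-rate gap for other binary resume attributes and rank them by size. The sketch below only illustrates the idea: the helper `callback_gap` is hypothetical, and the toy rows are made up to show the call shape, not values from the study. The column names `honors` and `volunteer` do appear in the dataset.

```python
def callback_gap(rows, factor):
    """Callback rate where `factor` is 1 minus the callback rate where it is 0."""
    with_factor = [r["call"] for r in rows if r[factor] == 1]
    without = [r["call"] for r in rows if r[factor] == 0]
    return sum(with_factor) / len(with_factor) - sum(without) / len(without)

# Made-up toy rows, only to show how the helper is called; not data from the study.
toy_rows = [
    {"call": 1, "honors": 1, "volunteer": 0},
    {"call": 0, "honors": 0, "volunteer": 1},
    {"call": 1, "honors": 1, "volunteer": 1},
    {"call": 0, "honors": 0, "volunteer": 0},
]

gaps = {factor: callback_gap(toy_rows, factor) for factor in ("honors", "volunteer")}
# Rank the attributes by the absolute size of their callback gap.
ranked = sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)
print(ranked)
```

A raw gap comparison like this ignores correlations between attributes, so a regression on all attributes at once would be the more careful follow-up.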
