# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [141]:
import pandas as pd
import numpy as np
from scipy import stats

from IPython.display import Markdown, FileLink, display

import matplotlib.pyplot as plt

In [142]:
#Exploration
def display_all(data,max_rows=1000,max_columns=1000):
    """Display a data frame within a pandas option context.
    """
    with pd.option_context("display.max_rows", max_rows, "display.max_columns", max_columns):
        display(data)
        
def view(data, sample_size=5, max_rows=1000, max_columns=1000):
    """Display the shape and sample of observations for a dataset.
    """
    print("DF shape: {shp}".format(shp=data.shape))
    with pd.option_context("display.max_rows",max_rows, "display.max_columns", max_columns):
        display(Markdown('##### DF Sample \n({sple} Observations)'.format(sple=sample_size)))
        display(data.sample(sample_size))
    display(data.dtypes)
    
    
def expHist(x,title):
    mn = np.mean(x)
    sdev = np.std(x)
    _ = plt.hist(x)
    #mean
    _ = plt.axvline(mn,linestyle='-', color='black', alpha=0.8)
    #s1
    _ = plt.axvline(mn-sdev, linestyle='-', color='black', alpha = 0.65)
    _ = plt.axvline(mn+sdev, linestyle='-', color='black', alpha = 0.65)
    #s2
    _ = plt.axvline(mn-sdev*2, linestyle='--', color='black', alpha = 0.50)
    _ = plt.axvline(mn+sdev*2, linestyle='--', color='black', alpha = 0.50)
    #s3
    _ = plt.axvline(mn-sdev*3, linestyle='-.', color='black', alpha = 0.25)
    _ = plt.axvline(mn+sdev*3, linestyle='-.', color='black', alpha = 0.25)

    _ = plt.xlabel(title)
    display(Markdown('#### {ttl}'.format(ttl=title)))
    plt.show()
    

def ks_plot_norm(data):
    """Plot the empirical CDF of the data against the CDF of a simulated normal sample."""
    n = len(data)
    ecdf_y = np.linspace(0, 1, n, endpoint=False)
    plt.figure(figsize=(12, 7))
    plt.plot(np.sort(data), ecdf_y)
    # Simulated normal sample with the same mean and standard deviation as the data
    plt.plot(np.sort(stats.norm.rvs(loc=np.mean(data), scale=np.std(data), size=n)), ecdf_y)
    plt.legend(['Data', 'Theoretical Values'], loc='upper right')
    plt.title('Comparing CDFs for KS-Test')

In [143]:
data = pd.io.stata.read_stata('us_job_market_discrimination.dta')

In [144]:
# number of callbacks for white-sounding and black-sounding names
wcalls = data.loc[data.loc[:,'race']=='w'].call
bcalls = data.loc[data.loc[:,'race']=='b'].call
print("White-sounding Names w/Calls: ",sum(wcalls))
print("Black-sounding Names w/Calls: ",sum(bcalls))
print('Total Candidates: ', len(data))

White-sounding Names w/Calls:  235.0
Black-sounding Names w/Calls:  157.0
Total Candidates:  4870


In [145]:
display_all(data.head())
view(data)
display_all(data.describe())

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,firstname,sex,race,h,l,call,city,kind,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,expminreq,schoolreq,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,1,0,0,1,0,Allison,f,w,0.0,1.0,0.0,c,a,384.0,0.98936,0.0055,9.527484,0.274151,0.037662,8.706325,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,6,1,1,1,0,Kristen,f,w,1.0,0.0,0.0,c,a,384.0,0.080736,0.888374,10.408828,0.233687,0.087285,9.532859,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,1,1,0,1,0,Lakisha,f,b,0.0,1.0,0.0,c,a,384.0,0.104301,0.83737,10.466754,0.101335,0.591695,10.540329,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,5,0,1,1,1,Latonya,f,b,1.0,0.0,0.0,c,a,384.0,0.336165,0.63737,10.431908,0.108848,0.406576,10.412141,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,5,1,1,1,0,Carrie,f,w,1.0,0.0,0.0,c,a,385.0,0.397595,0.180196,9.876219,0.312873,0.030847,8.728264,0.0,some,,1.0,9.4,143.0,9.4,143.0,0.0,0.204764,0.727046,10.619399,0.070493,0.369903,10.007352,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


DF shape: (4870, 65)


##### DF Sample 
(5 Observations)

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,firstname,sex,race,h,l,call,city,kind,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,expminreq,schoolreq,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
1744,b,2,4,2,5,0,0,0,0,263,4,0,0,0,1,Todd,m,w,0.0,1.0,0.0,c,s,754.0,0.945291,0.04833,9.892426,0.251865,0.055982,8.93406,1.0,some,,0.0,,,,,0.0,,,,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
3405,b,4,4,2,6,1,0,0,1,266,4,0,0,0,1,Brett,m,w,0.0,1.0,0.0,c,s,1070.0,0.000904,0.918852,10.758413,0.127148,0.288414,10.074201,1.0,some,,0.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
1317,b,17,2,2,4,0,0,0,0,189,1,0,0,1,0,Ebony,f,b,0.0,1.0,0.0,c,a,669.0,0.948767,0.033041,10.295766,0.18482,0.100121,9.191259,0.0,,,0.0,,,,,,0.190948,0.770569,10.566845,0.05786,0.41213,10.04055,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
3513,167,42,4,4,6,0,0,0,0,285,4,1,0,1,0,Jermaine,m,b,0.0,1.0,0.0,b,s,38.0,0.303463,0.471892,10.230702,0.189629,0.119756,9.334149,1.0,5,,0.0,19.0,73.0,19.0,73.0,0.0,0.012145,0.93715,10.930049,0.087009,0.286532,9.92,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
3112,32,32,4,4,2,0,1,1,0,267,4,1,1,1,0,Keisha,f,b,1.0,0.0,0.0,b,s,158.0,0.003336,0.981653,10.299677,0.190751,0.328431,10.003831,1.0,some,,0.0,,,,,0.0,0.144843,0.716077,9.967776,0.068858,0.263203,9.643809,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,Private


id                     object
ad                     object
education                int8
ofjobs                   int8
yearsexp                 int8
honors                   int8
volunteer                int8
military                 int8
empholes                 int8
occupspecific           int16
occupbroad               int8
workinschool             int8
email                    int8
computerskills           int8
specialskills            int8
firstname              object
sex                    object
race                   object
h                     float32
l                     float32
call                  float32
city                   object
kind                   object
adid                  float32
fracblack             float32
fracwhite             float32
lmedhhinc             float32
fracdropout           float32
fraccolp              float32
linc                  float32
                       ...   
parent_emp            float32
branch_sales          float32
branch_emp

Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,h,l,call,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4784.0,4784.0,4784.0,4784.0,4784.0,4784.0,4870.0,4870.0,1672.0,1722.0,608.0,658.0,3102.0,1918.0,1918.0,1908.0,1918.0,1918.0,1918.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,0.479261,0.820534,0.328747,0.502259,0.497741,0.080493,651.777832,0.310831,0.542772,10.147275,0.185674,0.213816,9.550801,0.719507,0.29117,587.686035,2287.051025,196.050522,755.416992,0.114765,0.079096,0.843764,10.655662,0.101692,0.333872,10.031516,0.152156,0.077207,0.332854,0.118686,0.151129,0.167967,0.787269,0.435318,0.124846,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,0.499621,0.383782,0.469806,0.500051,0.500051,0.272079,388.690582,0.332473,0.329467,0.34578,0.081747,0.169305,0.557097,0.449287,0.454347,2907.629395,8902.84375,896.510864,1665.165039,0.318791,0.149742,0.182991,0.441931,0.071293,0.192012,0.567816,0.359208,0.266945,0.471274,0.323461,0.358204,0.373869,0.409275,0.495846,0.330582,0.308866,0.496083,0.259649,0.275854,0.171677,0.278932,0.410141,0.442847,0.361773,0.371308
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004814,8.841738,0.0,0.030847,8.507345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0055,9.170247,0.0,0.030847,8.662505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,306.25,0.045275,0.252164,9.965053,0.139711,0.092559,9.220489,0.0,0.0,12.975,98.0,13.0,97.0,0.0,0.007125,0.82414,10.448976,0.047958,0.201971,9.691531,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,647.0,0.15995,0.571833,10.144078,0.190751,0.145053,9.438432,1.0,0.0,33.349998,220.0,34.900002,200.0,0.0,0.017404,0.900727,10.666441,0.087009,0.288414,9.914428,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,979.75,0.516854,0.873805,10.342871,0.238196,0.284315,9.668208,1.0,1.0,133.100006,700.0,86.699997,500.0,0.0,0.089956,0.956356,10.867444,0.142636,0.412352,10.386931,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1344.0,0.992043,0.981653,11.11929,0.356164,0.780124,11.07866,1.0,1.0,47947.601562,124500.0,10500.0,12208.0,1.0,0.98936,1.0,11.814311,0.356164,0.892857,11.36244,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## 1. What test is appropriate for this problem? Does CLT apply?

Links:
  - http://www.randomservices.org/random/hypothesis/Bernoulli.html
  - https://en.wikipedia.org/wiki/Bernoulli_distribution
  - http://www.cs.cmu.edu/~bhiksha/courses/10-601/hypothesistesting/hyptesting_onesample_Bernoulli.html

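Each résumé either receives a callback or does not, so `call` is a Bernoulli variable and the two groups can be compared with a two-sample z-test for the difference in proportions. Because we are asking whether the callback rates differ in either direction, we will use a two-sided test.

**Does the Central Limit Theorem apply?** Yes. Names were assigned to résumés at random, and each group contains 2,435 résumés with far more than 10 callbacks and 10 non-callbacks (235 and 2,200 for white-sounding names, 157 and 2,278 for black-sounding names), so the sampling distribution of each callback rate is approximately normal.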
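The sample-size condition behind the normal approximation can be checked with a few lines of code. This is a minimal sketch that hard-codes the callback counts reported in this notebook; the helper name `normal_approx_ok` and the threshold of 10 are illustrative assumptions (a common rule of thumb), not part of the original analysis.

```python
# Counts taken from the notebook's output: callbacks and group sizes.
groups = {
    "white-sounding": {"callbacks": 235, "n": 2435},
    "black-sounding": {"callbacks": 157, "n": 2435},
}

def normal_approx_ok(callbacks, n, threshold=10):
    """Rule of thumb: require at least `threshold` successes and failures."""
    failures = n - callbacks
    return callbacks >= threshold and failures >= threshold

# Evaluate the rule of thumb for each group.
checks = {name: normal_approx_ok(g["callbacks"], g["n"]) for name, g in groups.items()}

for name, ok in checks.items():
    status = "normal approximation reasonable" if ok else "too few observations"
    print(name, "->", status)
```

Both groups pass comfortably, which supports using a z-test rather than an exact test.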

## 2. What are the null and alternate hypotheses?

**Null Hypothesis:** The expected value of a callback (the callback probability) is the same for white-sounding names and black-sounding names.

$H_0: \mu_w = \mu_b$

**Alternative Hypothesis:** The expected value of a callback is not the same for white-sounding names and black-sounding names.

$H_a: \mu_w \neq \mu_b$

## 3. Compute margin of error, confidence interval, and p-value.

Try using both the bootstrapping and the frequentist statistical approaches.
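For reference, one common large-sample formulation of these quantities is sketched below. It is an assumption here, since the exercise does not prescribe specific formulas. $\hat{p}_w$ and $\hat{p}_b$ denote the observed callback rates, $n_w$ and $n_b$ the group sizes, and the confidence interval uses the unpooled standard error:

$$\mathrm{ME} = z_{\alpha/2}\sqrt{\frac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \frac{\hat{p}_b(1-\hat{p}_b)}{n_b}}, \qquad \mathrm{CI} = (\hat{p}_w - \hat{p}_b) \pm \mathrm{ME}$$

For a two-sided test with statistic $Z$, the p-value is $2\,(1 - \Phi(|Z|))$, where $\Phi$ is the standard normal CDF.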

In [146]:
w = data[data.race=='w'].call
display(w.head())
b = data[data.race=='b'].call
display(b.head())

0    0.0
1    0.0
4    0.0
5    0.0
6    0.0
Name: call, dtype: float32

2    0.0
3    0.0
7    0.0
8    0.0
9    0.0
Name: call, dtype: float32

In [147]:
#What is the expected value?
#X = {1 with probability P, 0 with probability (1 - P)}
#X = {Callback with probability P, No Call with probability (1 - P)}
wp = data.loc[data.race=='w','call']
wp = wp.sum()/wp.count()

display(Markdown("Expected Value of Callback with White-sounding name: {0}".format(round(wp,3))))

Expected Value of Callback with White-sounding name: 0.097

In [148]:
bp = data.loc[data.race=='b','call']
bp = bp.sum()/bp.count()

display(Markdown("Expected Value of Callback with Black-sounding name: {0}".format(round(bp,3))))

Expected Value of Callback with Black-sounding name: 0.064

### Frequentist Approach

In [150]:
#Create contingency table to visualize.
wtab = w.reset_index().drop('index',axis=1)
wtab.loc[:,'wh_callbacks'] = wtab.loc[:,'call'].copy()

btab = b.reset_index().drop('index',axis=1)
btab.loc[:,'bl_callbacks'] = btab.loc[:,'call'].copy()

wtab = wtab.groupby('call').count()
btab = btab.groupby('call').count()

ct = pd.concat([wtab,btab],axis=1)
ct.loc['Total'] = ct.sum()
ct.loc[:,'Total'] = ct.loc[:,'wh_callbacks'] + ct.loc[:,'bl_callbacks']
display(ct)

call,wh_callbacks,bl_callbacks,Total
0.0,2200,2278,4478
1.0,235,157,392
Total,2435,2435,4870


In [151]:
from IPython.display import Math, Latex

z_stat_latex = r'Z = \frac{(\hat{p_1} - \hat{p_2}) - 0}{\sqrt{\hat{p}(1 - \hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}'
display(Math(z_stat_latex))

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

In [154]:
def sampleProportion(x1, x2):
    phat = (x1.sum() + x2.sum())/(x1.count() + x2.count())
    return phat


In [166]:
def bTest(x1, x2, alpha=0.05):
    """Two-sided, two-proportion z-test with a (1 - alpha) confidence interval."""
    p1 = x1.sum()/x1.count()
    n1 = x1.count()
    p2 = x2.sum()/x2.count()
    n2 = x2.count()
    
    #Calculate pooled PHat and Z Statistic (null hypothesis: p1 == p2)
    phat = sampleProportion(x1,x2)
    diff = p1 - p2
    z = diff/np.sqrt(phat*(1 - phat) * (1/n1 + 1/n2))

    #Two-sided p-value: area in both tails beyond |z|
    pvalue = 2 * stats.norm.sf(abs(z))
    
    #Margin of error and confidence interval for the difference (unpooled standard error)
    zstar = stats.norm.ppf(1 - alpha/2)
    marg_err = zstar * np.sqrt((p1*(1-p1))/n1 + (p2*(1-p2))/n2)
    conf_int = (diff - marg_err, diff + marg_err)
    
    return(z, conf_int, marg_err, pvalue)

ztest, conf, marg_err, pvalue = bTest(w,b)

print("The Margin of error: {:.4f}".format(marg_err))
print("The Confidence Interval of the difference in expected values: \n ({:.3f}, {:.3f})".format(*conf))
print("The P-Value of the difference of the expected values: \n {:.2e}".format(pvalue))

The Margin of error: 0.0153
The Confidence Interval of the difference in expected values: 
 (0.017, 0.047)
The P-Value of the difference of the expected values: 
 3.98e-05
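
As a cross-check on the frequentist result, the same counts can be run through a Pearson chi-square test of independence, which for a 2x2 table is equivalent to the two-sided pooled z-test. The sketch below uses only the standard library and hard-codes the counts from the contingency table; the function name `chi_square_2x2` and the dictionary layout are illustrative assumptions.

```python
import math

# Observed counts from the contingency table: rows are outcomes, columns are name groups.
observed = {
    "callback":    {"white": 235,  "black": 157},
    "no_callback": {"white": 2200, "black": 2278},
}

def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, count in cols.items():
            col_totals[c] = col_totals.get(c, 0) + count
    grand_total = sum(row_totals.values())

    chi2 = 0.0
    for r, cols in table.items():
        for c, count in cols.items():
            # Expected count if outcome and name group were independent
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (count - expected) ** 2 / expected
    return chi2

chi2 = chi_square_2x2(observed)
# With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
z_equivalent = math.sqrt(chi2)

print("chi-square = {:.3f}, |z| = {:.3f}, p-value = {:.2e}".format(chi2, z_equivalent, p_value))
```

The square root of the chi-square statistic should agree with the absolute value of the pooled z statistic, and the p-value with the two-sided p-value of the z-test.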


### Bootstrapping Approach

In [None]:
#bootstrap confidence interval for the white-sounding callback rate
def bootstrap(x, samples=10000):
    """Return a list of bootstrapped samples."""
    n = len(x)
    bt_straps = [np.random.choice(x,size=n) for i in range(samples)]
    return bt_straps

wbts = bootstrap(w)
wmeans = [np.mean(i) for i in wbts]
bbts = bootstrap(b)
bmeans = [np.mean(i) for i in bbts]

#Plot the bootstrapped means, not the raw 0/1 resamples
expHist(wmeans, "Bootstrapped Means (White-sounding Names)")

#Compute the 95% confidence interval of the bootstrapped means: conf_int
conf_int = np.percentile(wmeans, [2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int)
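
To bootstrap the quantity the hypotheses are about, the difference in callback rates, one option is the percentile method sketched below. It is a standard-library-only illustration that rebuilds the 0/1 callback vectors from the counts in the contingency table, so it runs without the dataset. The seed, the replicate count `n_reps`, and the helper `bootstrap_rate` are assumptions chosen for the example.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Rebuild the 0/1 callback vectors from the counts in the contingency table.
white = [1] * 235 + [0] * (2435 - 235)
black = [1] * 157 + [0] * (2435 - 157)

def bootstrap_rate(sample):
    """Callback rate of one resample drawn with replacement."""
    resample = random.choices(sample, k=len(sample))
    return sum(resample) / len(resample)

n_reps = 1000  # hypothetical replicate count, kept small for speed

# Each replicate resamples both groups independently and records the difference.
diffs = sorted(bootstrap_rate(white) - bootstrap_rate(black) for _ in range(n_reps))

# Percentile method: take the 2.5th and 97.5th percentiles of the replicates.
lower = diffs[int(0.025 * n_reps)]
upper = diffs[int(0.975 * n_reps) - 1]
margin = (upper - lower) / 2

print("95% bootstrap CI for the difference: ({:.3f}, {:.3f})".format(lower, upper))
print("Approximate margin of error: {:.3f}".format(margin))
```

If the interval excludes zero, the bootstrap agrees with the frequentist test in rejecting equal callback rates at the 5% level.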


In [None]:
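n_its = 100000
n = int(len(data)*0.5)
calls = data.call.values

#Named perm_diffs rather than stats to avoid shadowing the scipy stats module
perm_diffs = list()

#One possible completion (an assumption): a permutation test that shuffles the
#callbacks, splits them into two groups of n, and records the difference in rates
for i in range(n_its):
    shuffled = np.random.permutation(calls)
    perm_diffs.append(shuffled[:n].mean() - shuffled[n:].mean())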


## 4. Write a story describing the statistical significance in the context of the original problem.


## 5. Does your analysis mean that race/name is the most important factor in callback success? 
Why or why not? If not, how would you amend your analysis?