# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [112]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [113]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

# Q1: What test is appropriate for this problem? Does CLT apply?


We are trying to determine whether race has a significant impact on the rate of callbacks, so we are testing whether race and callback rate are related.

The CLT says that, even for non-normal data, the distribution of sample means is approximately normal when the sample is large enough. Three conditions must hold for the central limit theorem to apply:

   <li> large sample size (n>=30)
   <li> randomly generated samples
   <li> independent samples

We can assume the observations are independent and were randomly assigned, and each group's sample size is far greater than 30, so the Central Limit Theorem applies to this dataset.



Because the samples are large and we are comparing two proportions, we will use a two-sample z-test.
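For proportions, the large-sample requirement is often checked with the success-failure rule of thumb (roughly 10 or more callbacks and 10 or more non-callbacks in each group) rather than with n>=30 alone. Here is a minimal sketch of that check, using the callback counts reported later in this notebook (235 of 2435 and 157 of 2435). The threshold of 10 and the helper name `success_failure_ok` are assumptions made for illustration.

```python
def success_failure_ok(successes, n, threshold=10):
    """Rule-of-thumb check that a normal approximation to a proportion is reasonable."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# (callbacks, resumes) per group, taken from the value_counts() output in this notebook
groups = {'w': (235, 2435), 'b': (157, 2435)}
checks = {race: success_failure_ok(k, n) for race, (k, n) in groups.items()}
print(checks)
```

Both groups clear the threshold comfortably, which supports using the normal approximation.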

# Q2: What are the null and alternate hypotheses?

**Null Hypothesis**: There is no difference between the callback rates for white-sounding and black-sounding names. $$H_0:P_1=P_2$$


**Alternative Hypothesis**: There is a difference between the callback rates for white-sounding and black-sounding names. $$H_a:P_1\neq P_2$$

We will run this test at a significance level of 5%. Assuming the null hypothesis is true, we compute the probability of observing a difference in callback rates at least as large as the one in our sample. If that probability is below the significance level, we reject the null hypothesis.

In [5]:
w = data[data.race=='w']
b = data[data.race=='b']

In [11]:
w.shape

(2435, 65)

In [12]:
b.shape

(2435, 65)

In [8]:
w.call.value_counts()

0.0    2200
1.0     235
Name: call, dtype: int64

In [14]:
b.call.value_counts()

0.0    2278
1.0     157
Name: call, dtype: int64

# Q3: Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [7]:
# Your solution to Q3 here

## Frequentist approach

### Confidence Interval

In [39]:
p1 = 235/2435
p2 = 157/2435

var_p1 = ((235/2435)*(1-235/2435))/2435
var_p2 = ((157/2435)*(1-157/2435))/2435

print('The proportion of white names that received a call back: ', round(p1,4))
print('The proportion of black names that received a call back: ', round(p2,4))



The proportion of white names that received a call back:  0.0965
The proportion of black names that received a call back:  0.0645


We want to find out whether the difference p1-p2 is meaningful, so we will build a 95% confidence interval for the true difference in proportions.

In [43]:
# Sampling distribution of the difference in sample proportions

p3 = p1-p2
var_p3 = var_p1+var_p2
std_p3 = np.sqrt(var_p3)

print('The difference between the sample proportions is: ', round(p3,4))


The difference between the sample proportions is:  0.032


We are trying to determine whether there is a meaningful difference between the proportion of white-sounding names that received a callback and the proportion of black-sounding names that received a callback. We sampled 2435 white-sounding names and 2435 black-sounding names, and observed callback proportions of 0.0965 and 0.0645 respectively. Our goal is a 95% confidence interval for the difference.

We want a distance (d) such that we are 95% confident the true difference (p1-p2) lies within d of 0.032. This is equivalent to saying that, in 95% of samples, the observed difference falls within d of the true difference.

Because the sampling distribution is approximately normal, we ask how many standard deviations from the mean we need to go to contain 95% of the probability. A z-table gives the cumulative probability up to a z-value, so we look for the z-value with 97.5% of the distribution below it; applying that value on both sides of the mean leaves 2.5% in each tail and 95% in the middle. That z-value is 1.96.
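The table lookup can also be reproduced with `scipy.stats.norm.ppf`, the inverse of the standard normal CDF. This is a short sketch; the variable names `alpha`, `z_crit` and `central_mass` are illustrative only.

```python
from scipy import stats

alpha = 0.05                             # 1 - confidence level
z_crit = stats.norm.ppf(1 - alpha / 2)   # z-value with 97.5% of the distribution below it

# probability mass between -z_crit and +z_crit
central_mass = stats.norm.cdf(z_crit) - stats.norm.cdf(-z_crit)

print(round(z_crit, 2))        # 1.96
print(round(central_mass, 2))  # 0.95
```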

In [84]:
z_scr = 1.96
d = z_scr * std_p3
print('The Margin Of Error (distance):', round(d,4))
      
conf_int = (round(p3-d,4),round(p3+d,4))
print('The Confidence Interval :',conf_int)


The Margin Of Error (distance): 0.0153
The Confidence Interval : (0.0168, 0.0473)


**We are 95% confident that the true difference in proportions is within 0.0153 of 0.032, 
giving a confidence interval of (0.0168, 0.0473). The interval excludes 0, so we are 95% confident that white-sounding names get more callbacks than black-sounding names.**

### Hypothesis Test

$$H_0:P_1-P_2 = 0$$
$$H_a:P_1-P_2 \neq0$$

Assuming $H_0$ is true, we compute the probability of observing a difference at least as extreme as our $\bar{p_1}-\bar{p_2}$.

If $P(\bar{p_1}-\bar{p_2}| H_0) < .05 $, we reject the null hypothesis.

If the population proportions are actually the same, the sampling distribution of $\bar{p_1}-\bar{p_2}$ has mean 0. Given that, what is the probability of observing a difference of 0.032?

To answer this we compute a z-score: how many standard deviations 0.032 lies from the mean of 0. We then check whether the probability of a result that extreme is more or less than our significance level, in this case 5%.
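Written out, the z-score uses a pooled proportion estimated from both groups combined, with 2435 resumes per group. The symbols $\hat{p}$ and $n$ are notation introduced here for illustration:

$$\hat{p} = \frac{235+157}{2435+2435} \approx 0.0805$$

$$z = \frac{(\bar{p_1}-\bar{p_2}) - 0}{\sqrt{\dfrac{2\,\hat{p}(1-\hat{p})}{n}}} \approx \frac{0.032}{0.0078} \approx 4.1, \qquad n = 2435$$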

In [72]:
# Under the null hypothesis p1 and p2 are the same value,
# so we can pool both groups into one big sample

p_h = 392/4870
print('The best estimate of the common population proportion for both white names and black names: ', round(p_h,4))

std_h = np.sqrt((2*p_h)*(1-p_h)/2435)
print('The standard deviation of our distribution given our null hypothesis is true: ',round(std_h,4))



The best estimate of the common population proportion for both white names and black names:  0.0805
The standard deviation of our distribution given our null hypothesis is true:  0.0078


In [93]:
# 0.032 is the rounded observed difference p1 - p2
z_h = (0.032-0)/std_h
print('The z-score:', round(z_h,3))
p_val = stats.norm.cdf(-z_h) * 2
print('The p-value:', (p_val))

The z-score: 4.104
The p-value: 4.0571918338242116e-05


**We use a 5% significance level with a two-tailed test, so the critical z-value is the one with 97.5% of the distribution below it, which is 1.96. If the null hypothesis is correct, there is only a 5% chance of sampling a z-statistic farther than 1.96 from 0 in either direction. Our z-score of 4.104 is well beyond that threshold, and its p-value is far below 5%, so we reject the null hypothesis.**

**There is a statistically significant difference between the callback rates of white-sounding and black-sounding names.**

In [99]:
n_w = len(w)
n_b = len(b)

prop_w = np.sum(w.call)/len(w)
prop_b = np.sum(b.call)/len(b)

prop_diff = prop_w - prop_b
phat = (np.sum(w.call)+np.sum(b.call))/ (len(w)+len(b))

z=prop_diff/np.sqrt(phat*(1-phat)* ((1/n_w)+(1/n_b)))
pval = stats.norm.cdf(-z)* 2
print('Z score: {}'.format(z))
print('P value: {}'.format(pval))

Z score: 4.108412152434346
P value: 3.983886837585077e-05
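As a cross-check, a pooled two-proportion z-test is equivalent to a Pearson chi-square test on the 2x2 table of race by callback when no continuity correction is applied: the chi-square statistic equals z squared and the p-values agree. The sketch below rebuilds the table from the counts reported above rather than from the data file. It is one possible cross-check, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# 2x2 contingency table: rows = race (w, b), columns = (callback, no callback)
table = np.array([[235, 2435 - 235],
                  [157, 2435 - 157]])

# pooled two-proportion z-test, same formula as in the cell above
n_w, n_b = table.sum(axis=1)
p_w, p_b = table[:, 0] / table.sum(axis=1)
p_pool = table[:, 0].sum() / table.sum()
z = (p_w - p_b) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
p_z = 2 * stats.norm.sf(abs(z))

# Pearson chi-square test without continuity correction
result = stats.chi2_contingency(table, correction=False)
chi2, p_chi2 = result[0], result[1]

print('z^2  = {:.4f}, chi2    = {:.4f}'.format(z**2, chi2))
print('p(z) = {:.3e}, p(chi2) = {:.3e}'.format(p_z, p_chi2))
```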


## Bootstrap Approach

**Null Hypothesis**: There is no difference between the callback rates for white-sounding and black-sounding names. $$H_0:P_1=P_2$$


**Alternative Hypothesis**: There is a difference between the callback rates for white-sounding and black-sounding names. $$H_a:P_1\neq P_2$$

We will use permutation sampling to simulate the null hypothesis. A permutation is a random reordering of the entries in an array. This technique is at the heart of simulating two quantities that we assume are identically distributed.



In [129]:
#Generate permutation functions for random sampling

def permutation_sample(data_1, data_2):
    '''Generate a permutation sample from two data sets'''
    
    #concatenate two data sets
    data = np.concatenate((data_1,data_2))
    
    #permute the concatenated array
    permuted_data = np.random.permutation(data)
    
    #split the permuted array into two
    perm_sample_1 = permuted_data[:len(data_1)]
    perm_sample_2 = permuted_data[len(data_1):]
    
    return perm_sample_1, perm_sample_2

def draw_perm_reps(data_1,data_2,func,size=1):
    '''Generate multiple permutation replicates.'''
    
    #initialize array of replicates
    perm_replicates = np.empty(size)
    
    for i in range(size):
        #generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1,data_2)
        
        #compute the test statistic
        
        perm_replicates[i] = func(perm_sample_1,perm_sample_2)
        
    return perm_replicates



What about the data are we assessing, and how do we quantify the assessment? This hinges on a test statistic. A test statistic is a single number that can be computed from observed data and from data you simulate under the null hypothesis. It serves as a basis of comparison between what the hypothesis predicts and what we actually observed.

Our question is whether the callback rates differ. Because `call` is coded as 0 or 1, the mean of each group is its callback rate, so we choose the difference in means as our test statistic and test it against the null hypothesis that the difference is 0.

In [132]:
#test statistic

def diff_of_means(data_1,data_2):
    '''difference in means of two arrays.'''
    
    #the difference of means of data_1, data_2
    diff = np.mean(data_1) - np.mean(data_2)
    
    return diff


empirical_diff_means = diff_of_means(w.call,b.call)
print('The difference between callback rate:', empirical_diff_means)

The difference between callback rate: 0.03203285485506058


We will repeat the simulation under the null hypothesis 100,000 times by generating many permutation replicates. A permutation replicate is the test statistic computed from one permutation sample.

In [137]:
#draw 100,000 permutation replicates
perm_replicates = draw_perm_reps(w.call,b.call,diff_of_means,100000)

#compute one-sided p-value: fraction of replicates exceeding the observed difference
p_value_perm = np.sum(perm_replicates > empirical_diff_means)/  len(perm_replicates)

print('P-value of permutation replicates: ', p_value_perm)



P-value of permutation replicates:  2e-05


**The p-value is practically 0, so we reject the null hypothesis that white-sounding and black-sounding names have the same callback rate.**
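The exercise also asks for a margin of error and confidence interval from resampling. One way to get them is a percentile bootstrap: resample each group with replacement, recompute the difference in callback rates, and take the 2.5th and 97.5th percentiles of the replicates. The sketch below is a hedged illustration, not the notebook's original method. It rebuilds the 0/1 callback arrays from the counts reported above, and the helper name `draw_bs_diff_reps`, the seed, and the 10,000 replicates are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# rebuild the 0/1 callback arrays from the counts reported in this notebook
w_call = np.concatenate((np.ones(235), np.zeros(2435 - 235)))
b_call = np.concatenate((np.ones(157), np.zeros(2435 - 157)))

def draw_bs_diff_reps(data_1, data_2, size=1):
    """Bootstrap replicates of the difference in means (hypothetical helper)."""
    reps = np.empty(size)
    for i in range(size):
        # resample each group with replacement, keeping its original size
        bs_1 = rng.choice(data_1, size=len(data_1))
        bs_2 = rng.choice(data_2, size=len(data_2))
        reps[i] = np.mean(bs_1) - np.mean(bs_2)
    return reps

bs_reps = draw_bs_diff_reps(w_call, b_call, size=10000)

# percentile interval and the corresponding half-width
ci_low, ci_high = np.percentile(bs_reps, [2.5, 97.5])
margin = (ci_high - ci_low) / 2

print('Bootstrap 95% CI:', (round(ci_low, 4), round(ci_high, 4)))
print('Bootstrap margin of error:', round(margin, 4))
```

The resulting interval should land close to the frequentist interval of (0.0168, 0.0473), with small differences from run to run if the seed is changed.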

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

# Q4: Write a story describing the statistical significance in the context of the original problem.


Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

Based on this sample, we can say with confidence that there is a statistically significant gap between the callback rates of white-sounding and black-sounding names.

**White-sounding names have a 9.65% callback rate**, while **black-sounding names have a 6.45% callback rate**.

To test whether this gap could have arisen by random chance, we ran a frequentist two-sample z-test and a permutation test. Both reject the hypothesis that the two callback rates are equal, with p-values far below the 5% significance level, and the 95% confidence interval for the difference, (0.0168, 0.0473), excludes 0. **Race, as signaled by the name on the resume, has a statistically significant effect on callbacks.**




# Q5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

Not necessarily. This analysis shows that the name on a resume has a statistically significant effect on callbacks, but it examines only one variable, and a single test on one sample cannot tell us how that effect ranks against others. The dataset contains many other columns, such as education and years of experience, that may also influence callbacks. To claim that race is the most important factor, we would need to amend the analysis to include those variables and compare their effects.
