# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [5]:
################################    1. What test is appropriate for this problem? Does CLT apply?
################################    2. What are the null and alternate hypotheses?

print('1. For this problem, we should opt for the difference in proportions test.  More specifically, we should compute the \
proportions of candidates with white- and black-sounding names that do receive a call for an interview, and then test \
whether these two proportions are statistically different.')

print('\n2. The null hypothesis should be that the difference between the proportion of white- and black-sounding names \
that receive a call for an interview is zero, and the alternative hypothesis should be that there is a difference.  That is: \
\n\nH0: p_b - p_w = 0\nHa: p_b - p_w != 0\n\nWe can alter the alternative hypothesis by stating that the proportion of \
white-sounding names that receive an interview request is greater than that of black-sounding names:\n\nHa2: p_b - p_w < 0')

1. For this problem, we should opt for the difference in proportions test.  More specifically, we should compute the proportions of candidates with white- and black-sounding names that do receive a call for an interview, and then test whether these two proportions are statistically different.

2. The null hypothesis should be that the difference between the proportion of white- and black-sounding names that receive a call for an interview is zero, and the alternative hypothesis should be that there is a difference.  That is: 

H0: p_b - p_w = 0
Ha: p_b - p_w != 0

We can alter the alternative hypothesis by stating that the proportion of white-sounding names that receive an interview request is greater than that of black-sounding names:

Ha2: p_b - p_w < 0
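
For reference, the hypotheses above and the unpooled test statistic that the frequentist cell computes can be written in symbols. The sample sizes $n_b$ and $n_w$ and the hat notation for sample proportions are notation assumed here, not taken from the original exercise.

```latex
\[
H_0:\; p_b - p_w = 0, \qquad H_a:\; p_b - p_w \neq 0
\]
\[
z = \frac{\hat{p}_w - \hat{p}_b}
         {\sqrt{\dfrac{\hat{p}_b\,(1-\hat{p}_b)}{n_b} + \dfrac{\hat{p}_w\,(1-\hat{p}_w)}{n_w}}}
\]
```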


# Question 3 - frequentist

In [6]:
w = data[data.race=='w']
b = data[data.race=='b']

In [7]:
################################    3. Compute margin of error, confidence interval, and p-value.

print("The number of black-sounding observations is: {}".format(b.shape[0]))
print("The number of white-sounding observations is: {}".format(w.shape[0]))

# 'call' is coded 0/1, so its mean is the callback proportion for each group
p_w_call = w.call.mean()
p_b_call = b.call.mean()

print('The empirical probability that someone with a white-sounding name receives a call is: {:.4f}'.format(p_w_call))
print('The empirical probability that someone with a black-sounding name receives a call is: {:.4f}'.format(p_b_call))
print('A difference of proportions test is appropriate in this scenario, and CLT does apply:')
print('n * p_w_call = {0:.2f} > 5'.format(w.shape[0] * p_w_call))
print('n * p_b_call = {0:.2f} > 5'.format(b.shape[0] * p_b_call))
print('Clearly n times (1 minus p) is also greater than 5')

sigma_b = np.sqrt(p_b_call * (1 - p_b_call) / b.shape[0])
sigma_w = np.sqrt(p_w_call * (1 - p_w_call) / w.shape[0])

MOE_b = 1.96 * sigma_b
MOE_w = 1.96 * sigma_w

print('\nMargins of error for the black and white proportions, respectively, are: {0:.4f} and {1:.4f}'.format(MOE_b, MOE_w))

CI_b = (p_b_call - MOE_b, p_b_call + MOE_b)
CI_w = (p_w_call - MOE_w, p_w_call + MOE_w)

print('\nThe 95% confidence interval for the true proportion of those black-sounding names who receive a call is: ({0:.4f}, \
{1:.4f})'.format(CI_b[0], CI_b[1]))
print('The 95% confidence interval for the true proportion of those white-sounding names who receive a call is: ({0:.4f}, \
{1:.4f})'.format(CI_w[0], CI_w[1]))

z_stat = (p_w_call - p_b_call) / np.sqrt(sigma_b ** 2 + sigma_w ** 2)

print('\nThe test statistic is equal to {0:.2f} and the p-value is equal to {1:.8f}'. \
      format(z_stat, 2 * (1 - stats.norm.cdf(abs(z_stat)))))

print('\nThe low p-value indicates that we reject the null hypothesis and instead conclude that white-sounding names \
do not receive the same proportion of calls that black-sounding names receive, and by extension we conclude that \
white-sounding names receive more calls than black-sounding names do')

The number of black-sounding observations is: 2435
The number of white-sounding observations is: 2435
The empirical probability that someone with a white-sounding name receives a call is: 0.0965
The empirical probability that someone with a black-sounding name receives a call is: 0.0645
A difference of proportions test is appropriate in this scenario, and CLT does apply:
n * p_w_call = 235.00 > 5
n * p_b_call = 157.00 > 5
Clearly n times (1 minus p) is also greater than 5

Margins of error for the black and white proportions, respectively, are: 0.0098 and 0.0117

The 95% confidence interval for the true proportion of those black-sounding names who receive a call is: (0.0547, 0.0742)
The 95% confidence interval for the true proportion of those white-sounding names who receive a call is: (0.0848, 0.1082)

The test statistic is equal to 4.12 and the p-value is equal to 0.00003863

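The low p-value indicates that we reject the null hypothesis and instead conclude that white-sounding names do not receive the same proportion of calls that black-sounding names receive, and by extension we conclude that white-sounding names receive more calls than black-sounding names do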
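
The cell above reports a margin of error and confidence interval for each proportion separately. A minimal sketch of the same quantities for the difference in proportions follows, together with a pooled-variance z-test, which is a common alternative under the null of equal proportions. It works from the counts printed above (2435 resumes per group, 235 and 157 callbacks) rather than from the data file; the variable names are illustrative, and the pooled test is a substitution for the unpooled statistic used in the notebook.

```python
import numpy as np
from scipy import stats

# Counts taken from the printed output above.
n_w, n_b = 2435, 2435          # resumes per group
calls_w, calls_b = 235, 157    # callbacks per group

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# 95% confidence interval for the difference (unpooled standard error).
se_unpooled = np.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)
moe_diff = z_crit * se_unpooled
ci_diff = (diff - moe_diff, diff + moe_diff)

# Two-sided z-test using the pooled proportion, which is the
# estimate of the common callback rate if H0 is true.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_pooled = diff / se_pooled
p_value = 2 * stats.norm.sf(abs(z_pooled))

print('Difference in proportions: {:.4f}'.format(diff))
print('Margin of error: {:.4f}'.format(moe_diff))
print('95% CI for the difference: ({:.4f}, {:.4f})'.format(*ci_diff))
print('Pooled z = {:.2f}, p-value = {:.8f}'.format(z_pooled, p_value))
```

If the interval for the difference excludes zero, it agrees with the rejection of the null reported above.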

# Question 3 - bootstrapping

In [8]:
################################    3. Compute margin of error, confidence interval, and p-value.

# the below code is adapted from DataCamp: https://www.datacamp.com/courses/statistical-thinking-in-python-part-2

############ ensure reproducibility of results

np.random.seed(1)

############ custom function to generate p-values

def p_val(df):

    b = df[df.race=='b']
    w = df[df.race=='w']

    p_b_call = sum(df[df.race=='b'].call) / b.shape[0]
    p_w_call = sum(df[df.race=='w'].call) / w.shape[0]

    sigma_b = np.sqrt(p_b_call * (1 - p_b_call) / b.shape[0])
    sigma_w = np.sqrt(p_w_call * (1 - p_w_call) / w.shape[0])

    z_stat = (p_w_call - p_b_call) / np.sqrt(sigma_b ** 2 + sigma_w ** 2)

    return 2 * (1 - stats.norm.cdf(abs(z_stat)))

############ sample from df w/ replacement and return bootstrap statistic of interest

def bootstrap_replicate_1d(df, func):
    bs_sample = df.sample(n=len(df), replace=True)
    return func(bs_sample)

############ draw n instances of the bootstrap statistic of interest

def draw_bs_reps(df, func, size=1):
    bs_replicates = np.empty(size)
    
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(df, func)
        
    return bs_replicates

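############ 10,000 p-values, one per bootstrap replicate; each replicate resamples all len(df) rows of 'race' and
############ 'call' with replacement. Note that averaging these p-values is a heuristic, not a standard bootstrap test.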

bs_replicates = draw_bs_reps(data[['race', 'call']], p_val, 10000)

print('Using the bootstrapping approach, drawing 10000 samples with replacement from the \'race\' and \'call\' columns, \
we arrive at an average p-value of: {0:.4f}.  Hence, we reject the null that there is no difference between the \
proportions of black- and white-sounding names receiving calls for interviews.'.format(bs_replicates.mean()))

Using the bootstrapping approach, drawing 10000 samples with replacement from the 'race' and 'call' columns, we arrive at an average p-value of: 0.0035.  Hence, we reject the null that there is no difference between the proportions of black- and white-sounding names receiving calls for interviews.
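
A more conventional resampling treatment of Question 3 is sketched below: a bootstrap percentile confidence interval for the difference in callback rates, and a permutation test for the p-value. This swaps in a different technique from the averaged p-value above. It assumes the 0/1 outcomes can be rebuilt from the counts reported earlier (2435 resumes per group, 235 and 157 callbacks); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Rebuild the 0/1 callback outcomes from the counts reported earlier.
n = 2435
calls_w = np.concatenate([np.ones(235), np.zeros(n - 235)])
calls_b = np.concatenate([np.ones(157), np.zeros(n - 157)])
observed_diff = calls_w.mean() - calls_b.mean()

n_reps = 10000

# Bootstrap: resample each group with replacement and recompute the difference.
bs_diffs = np.empty(n_reps)
for i in range(n_reps):
    bs_w = rng.choice(calls_w, size=n, replace=True)
    bs_b = rng.choice(calls_b, size=n, replace=True)
    bs_diffs[i] = bs_w.mean() - bs_b.mean()

# 95% percentile confidence interval for the difference in proportions.
ci_low, ci_high = np.percentile(bs_diffs, [2.5, 97.5])

# Permutation test: under H0 the race label is exchangeable, so shuffle the
# pooled outcomes and split them back into two groups of size n.
pooled = np.concatenate([calls_w, calls_b])
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print('Observed difference: {:.4f}'.format(observed_diff))
print('Bootstrap 95% CI: ({:.4f}, {:.4f})'.format(ci_low, ci_high))
print('Permutation p-value: {:.4f}'.format(p_perm))
```

The permutation step mirrors the experimental design, in which names were assigned to resumes at random.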


<div class="span5 alert alert-success">
<p>Your answers to Q4 and Q5 here</p>
</div>

In [9]:
################################    4. Write a story describing the statistical significance in the context of the
################################    original problem.

################################    5. Does your analysis mean that race/name is the most important factor in callback 
################################    success? Why or why not? If not, how would you amend your analysis?

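print("4. Resumes with white-sounding names received callbacks at a rate of 9.65%, compared with 6.45% for resumes with \
black-sounding names. Because names were assigned to resumes at random, this gap of about 3.2 percentage points \
(z = 4.12, p < 0.0001) is unlikely to be due to chance, so race as signaled by the name is a statistically significant \
factor in callback success.")

print("\n5. No. The analysis shows that race matters, but it's not sufficient to conclude that race is the most important \
factor because it fails to consider other factors that employers might weigh, such as education or years of experience. \
Further study may involve a multi-variable logistic regression, comparing the regressors' coefficients and Wald z-stats \
and checking overall fit with a likelihood-ratio test.")

4. Resumes with white-sounding names received callbacks at a rate of 9.65%, compared with 6.45% for resumes with black-sounding names. Because names were assigned to resumes at random, this gap of about 3.2 percentage points (z = 4.12, p < 0.0001) is unlikely to be due to chance, so race as signaled by the name is a statistically significant factor in callback success.

5. No. The analysis shows that race matters, but it's not sufficient to conclude that race is the most important factor because it fails to consider other factors that employers might weigh, such as education or years of experience. Further study may involve a multi-variable logistic regression, comparing the regressors' coefficients and Wald z-stats and checking overall fit with a likelihood-ratio test.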
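
The multi-variable logistic regression suggested in answer 5 needs the full dataset. As a starting point, the race-only special case can be sketched from the counts alone, because with a single binary predictor the fitted logistic coefficient equals the log odds ratio of the 2x2 table. The counts come from the output earlier in the notebook; the variable names are illustrative, and this sketch says nothing about how race compares with other regressors.

```python
import numpy as np
from scipy import stats

# 2x2 table built from the counts reported earlier in the notebook.
calls_w, no_calls_w = 235, 2435 - 235
calls_b, no_calls_b = 157, 2435 - 157

# Odds of a callback for white-sounding vs. black-sounding names.
odds_ratio = (calls_w / no_calls_w) / (calls_b / no_calls_b)
log_or = np.log(odds_ratio)

# Standard error of the log odds ratio, then the Wald z-statistic.
se_log_or = np.sqrt(1 / calls_w + 1 / no_calls_w + 1 / calls_b + 1 / no_calls_b)
wald_z = log_or / se_log_or
p_value = 2 * stats.norm.sf(abs(wald_z))

# 95% confidence interval for the odds ratio.
z_crit = stats.norm.ppf(0.975)
ci_or = np.exp(log_or + np.array([-1, 1]) * z_crit * se_log_or)

print('Odds ratio (white vs. black): {:.3f}'.format(odds_ratio))
print('95% CI for the odds ratio: ({:.3f}, {:.3f})'.format(ci_or[0], ci_or[1]))
print('Wald z = {:.2f}, p-value = {:.8f}'.format(wald_z, p_value))
```

Adding regressors such as `education` or `yearsexp` would require fitting the model on the full data, for example with a statistics library's logistic regression routine.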
