# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [20]:
import pandas as pd
import numpy as np
from scipy import stats

In [21]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [24]:
w = data[data.race=='w']
b = data[data.race=='b']

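# len() counts rows (resumes); DataFrame.size would count cells (rows x columns)
print("number of resumes in w, b and total: ", len(w), len(b), len(data))

number of resumes in w, b and total:  2435 2435 4870


## Question 1:
1) We will use a two-sample t-test to compare the callback rates of w and b. Because `call` is a 0/1 (Bernoulli) variable, at this sample size the test is practically equivalent to a two-proportion z-test.
2) Each group contains 2435 resumes (`DataFrame.size` would report 158275, which is rows × 65 columns, not the number of resumes). The observations are independent because names were assigned to resumes at random, and the samples are large enough that the CLT is a reasonable assumption.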
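
As a rough check on the CLT assumption, the usual rule of thumb for proportions asks for at least 10 successes and 10 failures in each group. The sketch below assumes 2435 resumes per group (158275 cells / 65 columns) with 235 and 157 callbacks. These counts are not printed in the notebook; they are inferred from the callback rates reported in Question 3 (0.0965 and 0.0645).

```python
# Success-failure check for the normal approximation of a proportion.
# Assumed inputs: inferred from the notebook's output, not printed directly.
n_per_group = 2435
callbacks = {"w": 235, "b": 157}

def clt_reasonable(successes, n, threshold=10):
    """Return True if both successes and failures reach the threshold."""
    return successes >= threshold and (n - successes) >= threshold

clt_ok = {race: clt_reasonable(k, n_per_group) for race, k in callbacks.items()}
print(clt_ok)
```

Both groups clear the threshold by a wide margin, which supports treating the sample callback rates as approximately normal.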


## Question 2: Hypothesis
H0: Race has no impact on the call-back rates<BR>
Ha: Race does have an impact on the call-back rates
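
In symbols, writing $p_w$ and $p_b$ for the true callback rates of white-sounding and black-sounding names (notation introduced here for illustration, not taken from the exercise), the hypotheses and the unpooled large-sample test statistic can be written as:

$$H_0:\; p_w = p_b \qquad\qquad H_a:\; p_w \neq p_b$$

$$z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\dfrac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \dfrac{\hat{p}_b(1-\hat{p}_b)}{n_b}}}$$

Here $\hat{p}_w$ and $\hat{p}_b$ are the sample callback rates and $n_w$, $n_b$ are the numbers of resumes in each group.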

## Question 3:

### Conclusion:
Based on the results below, the p-value is very small under both approaches (the t-test gives p ≈ 3.9e-05, well below 0.05). We therefore reject the null hypothesis and conclude that race has an impact on callback rates.

In [27]:
# Solution to Q3: margin of error, confidence interval and p-value
# Margin of error: half-width of the confidence interval around the observed difference

def bootstrap_replicate_1d(data, func):
    """Draw one bootstrap sample (with replacement) and apply func to it."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of func(data)."""
    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])

# 2-sample test
w_call = w.call
b_call = b.call

# a) Bootstrapping
# 95% confidence interval for the difference in callback rates (white - black)
mean_diff = np.mean(w_call) - np.mean(b_call)
bs_replicates_w = draw_bs_reps(w_call, np.mean, 10000)
bs_replicates_b = draw_bs_reps(b_call, np.mean, 10000)
bs_diff_replicates = bs_replicates_w - bs_replicates_b
conf_int = np.percentile(bs_diff_replicates, [2.5, 97.5])

# p-value: shift both groups to the common mean to simulate the null hypothesis
total_mean = np.mean(data.call)
w_shifted = w_call - np.mean(w_call) + total_mean
b_shifted = b_call - np.mean(b_call) + total_mean
bs_replicates_w = draw_bs_reps(w_shifted, np.mean, 10000)
bs_replicates_b = draw_bs_reps(b_shifted, np.mean, 10000)
bs_diff_replicates = bs_replicates_w - bs_replicates_b
# One-sided: fraction of null replicates at least as large as the observed difference
p = np.sum(bs_diff_replicates >= mean_diff) / len(bs_diff_replicates)

print("Bootstrapping Approach")
print("callback rate for white-sounding names:", np.mean(w_call))
print('95% confidence interval = ', conf_int)
print("p_val: ", p, '\n')

# b) Frequentist
# np.var defaults to ddof=0, so t_val differs slightly from the scipy statistic
std_error = np.sqrt(np.var(w_call)/w_call.size + np.var(b_call)/b_call.size)
t_val = (np.mean(w_call) - np.mean(b_call)) / std_error
# ttest_ind returns (statistic, pvalue); unpack so that p_val holds only the p-value
t_stat, p_val = stats.ttest_ind(w_call, b_call)

print("Frequentist Approach")
print("mean callbacks for white from samples:", np.mean(w_call))
print("mean callbacks for black from samples:", np.mean(b_call))
print("t_val:", t_val)
print("t_val (scipy):", t_stat)
print("p_val: ", p_val, '\n')

Bootstrapping Approach
callback rate for white-sounding names: 0.09650924
95% confidence interval =  [0.01642711 0.04722793]
p_val:  0.0 

Frequentist Approach
mean callbacks for white from samples: 0.09650924
mean callbacks for black from samples: 0.064476386
t_val: 4.1155504738096065
t_val (scipy): 4.114705290861751
p_val:  3.940802103128886e-05 
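
The cell above reports a confidence interval and p-values but does not print the margin of error. A minimal sketch of the frequentist calculation follows, using only the standard library. It assumes 2435 resumes per group with 235 and 157 callbacks; these counts are inferred from the printed callback rates rather than read from the dataset, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Assumed summary counts, inferred from the printed callback rates
# (0.0965 and 0.0645) and 2435 resumes per group; not printed by the notebook.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
std_error = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Margin of error at 95% confidence: critical z value times the standard error.
z_crit = NormalDist().inv_cdf(0.975)
margin_of_error = z_crit * std_error
conf_int = (diff - margin_of_error, diff + margin_of_error)

# Two-sided p-value under the normal approximation.
z_stat = diff / std_error
p_value = 2 * (1 - NormalDist().cdf(z_stat))

print(f"margin of error: {margin_of_error:.4f}")
print(f"95% confidence interval: ({conf_int[0]:.4f}, {conf_int[1]:.4f})")
print(f"z: {z_stat:.3f}, p-value: {p_value:.2e}")
```

The margin of error is the half-width of the confidence interval, so the interval is the observed difference plus or minus that value. Under these assumptions the interval should land close to the bootstrap percentile interval printed above.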



## Question 4: Data Story:
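Both the bootstrapping and the frequentist approaches yield a very small p-value: if race had no effect, a gap in callback rates as large as the one observed (about 9.65% for white-sounding names versus 6.45% for black-sounding names) would be very unlikely to arise by chance. The 95% bootstrap confidence interval puts the gap between roughly 1.6 and 4.7 percentage points. We therefore reject the null hypothesis and conclude that race has an impact on callback rates.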

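Because the names were assigned to resumes at random, the other elements of a resume (e.g. skills, work experience) are balanced across the two groups on average, so the gap can be attributed to the perceived race of the name. What this statistic alone cannot tell us is how large the effect of race is relative to the effect of those other elements on requests for interviews.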

## Question 5: 
The analysis here shows that race has a statistically significant impact on callback success, but it does not show that race/name is the most important factor, because we have not measured the effect of any other element of the resume. To amend the analysis, we should run similar comparisons of how other elements (e.g. yearsexp, sex, computerskills, education) affect callback rates and compare the sizes of those effects with the effect of race.
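
One way to start that comparison is to compute the callback rate for each level of each factor. The sketch below uses a hypothetical helper, `callback_rate_by`, and a handful of made-up toy rows purely to show the mechanics; the rows are not taken from the dataset and the resulting rates mean nothing.

```python
from collections import defaultdict

def callback_rate_by(rows, column):
    """Return {level: callback rate} for each level of `column`."""
    totals = defaultdict(lambda: [0, 0])  # level -> [callbacks, resumes]
    for row in rows:
        totals[row[column]][0] += row["call"]
        totals[row[column]][1] += 1
    return {level: calls / n for level, (calls, n) in totals.items()}

# Hypothetical toy rows for illustration only; NOT taken from the dataset.
toy_rows = [
    {"race": "w", "computerskills": 1, "call": 1},
    {"race": "w", "computerskills": 0, "call": 0},
    {"race": "b", "computerskills": 1, "call": 0},
    {"race": "b", "computerskills": 0, "call": 0},
]

for column in ("race", "computerskills"):
    print(column, callback_rate_by(toy_rows, column))
```

On the notebook's DataFrame, `data.groupby(column)['call'].mean()` should give the same per-level rates. Comparing factors fairly would likely also need a model that includes them jointly, such as a logistic regression of `call` on several columns, which is beyond this sketch.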

In [28]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')