# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [54]:
import pandas as pd
import numpy as np
from scipy import stats

In [55]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

## Q1 And Q2
<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

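<p>Q1. This is an A/B test, so a two-sample z-test for proportions is appropriate for checking whether the callback rate for black-sounding names equals the callback rate for white-sounding names. The CLT applies because the samples are large and the observations are independent (names were assigned to résumés at random).</p>
<p>Q2. Null hypothesis: the callback rate for black-sounding names is the same as for white-sounding names. Alternative hypothesis: the callback rate for black-sounding names differs from (is lower than) the rate for white-sounding names.</p>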
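A quick way to check the "large enough" claim is the usual success/failure rule of thumb: each group should contain at least about 10 callbacks and 10 non-callbacks. The sketch below is only illustrative. The counts are assumptions inferred from the percentages printed later in this notebook (157 and 235 callbacks out of 2,435 résumés per group), and `clt_ok` is a hypothetical helper name, not part of any library.

```python
# Counts inferred from the notebook's printed output (an assumption,
# not read from the data file).
n_black, n_white = 2435, 2435
calls_black, calls_white = 157, 235


def clt_ok(successes, n, threshold=10):
    """Success/failure rule of thumb: both the number of successes
    and the number of failures should be at least `threshold`."""
    return successes >= threshold and (n - successes) >= threshold


print('Black-sounding names:', clt_ok(calls_black, n_black))
print('White-sounding names:', clt_ok(calls_white, n_white))
```

The threshold of 10 is a convention rather than a hard rule; some texts use 5.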

## Q3

In [84]:
# Set up datasets: count resumes by race and callback outcome.
# Combining the conditions with & avoids the chained-indexing UserWarning.
black_call = int(((data.race == 'b') & (data.call == 1.0)).sum())
white_call = int(((data.race == 'w') & (data.call == 1.0)).sum())
black_no = int(((data.race == 'b') & (data.call == 0.0)).sum())
white_no = int(((data.race == 'w') & (data.call == 0.0)).sum())
total_call = int((data.call == 1.0).sum())
total_no = int((data.call == 0.0).sum())

blacks = np.array([True] * black_call + [False] * black_no)
whites = np.array([True] * white_call + [False] * white_no)
total = np.array([True] * total_call + [False] * total_no)

print('Black interview :',100.0*np.sum(blacks)/len(blacks), '%  White interview : ',100.0*np.sum(whites)/len(whites), '% \n  Difference : ',100.0*(np.sum(whites)/len(whites)-np.sum(blacks)/len(blacks)),'%')

Black interview : 6.4476386037 %  White interview :  9.65092402464 % 
  Difference :  3.20328542094 %


  


In [77]:
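# Sample callback proportions for each group
ppt = np.sum(blacks) / len(blacks)
ppt1 = np.sum(whites) / len(whites)
# Unpooled standard error of the difference, in percentage points
se = 100.0 * np.sqrt(ppt * (1 - ppt) / len(blacks) + ppt1 * (1 - ppt1) / len(whites))
# Margin of error at the 95% level (z = 1.96)
MoE = 1.96 * se
print(' Margin of Error = {0:.2f}%  \n 95% Confidence Interval = ( {1:.2f}% , {2:.2f}% )'
      .format(MoE, 100 * (ppt1 - ppt) - MoE, 100 * (ppt1 - ppt) + MoE))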

 Margin of Error = 1.53%  
 95% Confidence Interval = ( 1.68% , 4.73% )
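The interval above uses the normal approximation. As a cross-check, a bootstrap percentile interval can be computed by resampling each group with replacement. This is a minimal sketch, not part of the original analysis: the arrays are rebuilt from counts inferred from the printed percentages (157 and 235 callbacks out of 2,435 per group, an assumption), and `bootstrap_diff` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Rebuilt from counts inferred from the printed output (assumption).
blacks = np.array([True] * 157 + [False] * (2435 - 157))
whites = np.array([True] * 235 + [False] * (2435 - 235))


def bootstrap_diff(a, b, size, rng):
    """Bootstrap replicates of mean(b) - mean(a), resampling each
    group independently with replacement."""
    reps = np.empty(size)
    for i in range(size):
        reps[i] = rng.choice(b, len(b)).mean() - rng.choice(a, len(a)).mean()
    return reps


reps = bootstrap_diff(blacks, whites, 10000, rng)
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
print(' Bootstrap 95% CI = ( {0:.2f}% , {1:.2f}% )'.format(100 * ci_low, 100 * ci_high))
```

With samples this large, the bootstrap interval should land close to the normal-approximation interval; the exact endpoints vary slightly with the seed.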


In [57]:
# Permutation test (resampling without replacement under the null hypothesis)

#Combining the sample
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

#Create permutation
def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    perm_replicates = np.empty(size)
    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(data_1,data_2)
        perm_replicates[i] = func(perm_sample_1,perm_sample_2)
    return perm_replicates

def frac_yay_blacks(blacks, whites):
    """Compute the fraction of callbacks in the first sample (the second argument is unused)."""
    frac = np.sum(blacks) / len(blacks)
    return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(blacks, whites, frac_yay_blacks, 10000)

# Compute and print p-value: fraction of replicates at or below the observed black callback rate
p = np.sum(perm_replicates <= np.sum(blacks) / len(blacks)) / len(perm_replicates)
print('p-value =', p)

p-value = 0.0001


In [78]:
#Two sample proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = np.array([black_call,white_call])
nobs = np.array([black_call+black_no, white_call+white_no])
z,p = proportions_ztest(count, nobs, value=0, alternative='two-sided')
print(' z-stat = {z} \n p-value = {p}'.format(z=z,p=p))

 z-stat = -4.108412152434346 
 p-value = 3.983886837585077e-05
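To see where the z-statistic comes from, the pooled two-proportion z-test can be computed by hand. The sketch below assumes the counts implied by the printed percentages (157 and 235 callbacks out of 2,435 résumés per group) and uses only the standard library; it is meant as an illustration of the formula, not a replacement for `proportions_ztest`.

```python
import math

# Counts inferred from the notebook's printed output (assumption).
calls_black, calls_white = 157, 235
n_black, n_white = 2435, 2435

p_black = calls_black / n_black
p_white = calls_white / n_white

# Under the null hypothesis both groups share one callback rate,
# so the standard error uses the pooled proportion.
p_pool = (calls_black + calls_white) / (n_black + n_white)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_black + 1 / n_white))

z_manual = (p_black - p_white) / se_pool
# Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_manual = math.erfc(abs(z_manual) / math.sqrt(2))

print(' z-stat = {0:.4f} \n p-value = {1:.3e}'.format(z_manual, p_manual))
```

Note that the test statistic uses the pooled standard error, while the confidence interval for the difference uses the unpooled one, which is why the two standard errors differ slightly.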


Both tests show that we can reject the null hypothesis: each p-value is far below the 0.05 significance level.

## Q4 And Q5
<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

<p>Q4. We can reject the hypothesis that résumés with black-sounding names receive callbacks at the same rate as résumés with white-sounding names. We are 95% confident that the difference in callback rates between white-sounding and black-sounding names is between 1.68 and 4.73 percentage points.</p>
<p>Q5. Not necessarily. This analysis looks only at race/name, so it cannot say how that factor compares in importance with others, and there may be hidden variables associated with the names that drive the outcome. The data do, however, indicate that black-sounding names have a lower callback rate than white-sounding names.</p>