# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [11]:
data.head()
print(data.columns)
print(data.race.unique())
print(data.call.unique())

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')
['w' 'b']
[ 0.  1.]


In [88]:
# get calls grouped by race
call_data = data[['race', 'call']].groupby('race')
# get percentage of callbacks for each group
call_data_pct = call_data.sum() / call_data.count()

# save to variables p_w and p_b (select the scalar explicitly before converting)
p_w = float(call_data_pct.loc['w', 'call'])
p_b = float(call_data_pct.loc['b', 'call'])

# get the counts for each group
n_w = len(data[data['race']=='w'])
n_b = len(data[data['race']=='b'])

print(p_w, p_b)
print(n_w, n_b)

0.09650924024640657 0.06447638603696099
2435 2435


#### Question 1
For this problem, a two-sample z-test for a difference in proportions is appropriate. In the data, the values for race are either 'w' for white-sounding or 'b' for black-sounding names. The values for callback are either 1 for called back or 0 for not called back, so each group is summarized by a single proportion.

We want the proportion of callbacks for each race group.

$P_w$ is the proportion of callbacks for white-sounding names. <br>
$P_b$ is the proportion of callbacks for black-sounding names.

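The central limit theorem applies: the names were randomly assigned to the resumes, and each group has well over 10 callbacks and 10 non-callbacks (235 and 2,200 for white-sounding names; 157 and 2,278 for black-sounding names). The sampling distribution of each sample proportion, and of their difference, is therefore approximately normal.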
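The sample-size condition behind the normal approximation can be checked directly. The sketch below hard-codes the counts reported in this notebook (2435 resumes per group, 157 callbacks for 'b', and 235 callbacks for 'w', the latter derived from `p_w * n_w`). The threshold of 10 is a common rule of thumb rather than a strict requirement, and `normal_approx_ok` is a helper name introduced here for illustration.

```python
# Success/failure check for the normal approximation to a sample proportion.
# Counts are taken from the outputs above; 235 is derived as p_w * n_w.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157


def normal_approx_ok(successes, n, threshold=10):
    """Return True if both successes and failures reach the rule-of-thumb threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for label, calls, n in [("w", calls_w, n_w), ("b", calls_b, n_b)]:
    # print the group label, callbacks, non-callbacks, and whether the check passes
    print(label, calls, n - calls, normal_approx_ok(calls, n))
```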

#### Question 2
The null hypothesis is that the proportion of callbacks is the same for white-sounding and black-sounding names. The alternative is that the two proportions differ.

$H_0$ - $P_w = P_b$<br>
$H_A$ - $P_w \ne P_b$
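One way to see what $H_0$ claims is to simulate it. If the name on a resume had no effect, the 'w' and 'b' labels would be interchangeable, so shuffling the callback outcomes between the two groups should produce differences as large as the observed one reasonably often. The sketch below is a permutation test, which is a different technique from the z-test used in this notebook and is offered only as an optional cross-check. It rebuilds the outcomes from the reported counts (235 and 157 callbacks out of 2435 each) rather than loading the data file; the seed and the number of permutations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157  # counts reported in this notebook

# Pool all outcomes: 1 = callback, 0 = no callback
total_calls = calls_w + calls_b
outcomes = np.concatenate([np.ones(total_calls), np.zeros(n_w + n_b - total_calls)])

observed_diff = calls_w / n_w - calls_b / n_b

n_perm = 5000  # arbitrary; more permutations give a finer-grained p-value
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(outcomes)
    # the first n_w shuffled outcomes play the role of the 'w' group
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()

# Two-sided: fraction of shuffles at least as extreme as the observed difference
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(observed_diff, p_perm)
```

With only a few thousand permutations, the estimate cannot resolve very small p-values precisely; it mainly shows whether the observed gap is typical under $H_0$.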



In [97]:
# use 1.96 for a 95% confidence interval
z_star = 1.96

# we want a pooled sample proportion, p
p = (p_w * n_w + p_b * n_b) / (n_w + n_b)
print(p)
# compute the standard error and margin of error
se = np.sqrt(p*(1-p)/n_w + p*(1-p)/n_b)
me = z_star * se

# confidence interval
ci = [p_w - p_b - me, p_w - p_b + me] 

# z-statistic
z = (p_w - p_b) / se
# two-tailed p-value: twice the upper-tail area beyond |z|
pval = 2 * (1 - stats.norm.cdf(abs(z)))

print(se, me)
print(z)
print(ci)
print(pval)

0.08049281314168377
0.00779689403617 0.0152819123109
4.10841215243
[0.016750941898551489, 0.047314766520339682]
3.98388683758e-05


#### Question 3

To test the hypothesis, we use a two-tailed z-test at the $5\%$ significance level, together with a $95\%$ confidence interval for the difference in proportions.

First, we calculate the pooled sample proportion $\bar{p}$ as 

$$\bar{p} = \dfrac{p_w n_w + p_b n_b}{n_w + n_b} = \dfrac{0.096509 \times 2435 + 0.064476 \times 2435}{2435 + 2435} = 0.080493$$

Calculate the standard error as 
$$SE_{\bar{p}} = \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_w} + \dfrac{\bar{p}(1-\bar{p})}{n_b}} = 0.0077969$$

Margin of error is computed as $$z^* \times SE_{\bar{p}} = 1.96 \times 0.0077969 = 0.015282$$

The $95\%$ confidence interval for the difference is

$$(p_w - p_b) \pm 0.015282 = [0.016751,\ 0.047315]$$

Our $z$-statistic is calculated as 

$$z_{\bar{p}} = \dfrac{p_w - p_b}{SE_{\bar{p}}} = 4.10841215243$$

The two-tailed $p$-value for this z-score is $2\,(1 - \Phi(|z|)) \approx 3.98\times10^{-5}$, where $\Phi$ is the standard normal CDF. Because this is far below the $5\%$ significance level, and the confidence interval excludes zero, we reject the null hypothesis.
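The Question 3 calculation can be collected into one reusable function. This is a minimal sketch, not part of the original analysis: `two_proportion_ztest` and its argument names are introduced here for illustration, and the counts are hard-coded from this notebook (157 callbacks for 'b', and 235 for 'w' derived from `p_w * n_w`). It uses the pooled standard error for both the test and the interval, as the notebook does, and `stats.norm.sf` for the upper-tail area. The critical value comes from `stats.norm.ppf`, so the margin differs very slightly from the one obtained with the rounded 1.96.

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Two-sided z-test for a difference in proportions using the pooled SE."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    # pooled proportion under H0: p_1 == p_2
    p_pool = (successes_1 + successes_2) / (n_1 + n_2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    # two-tailed p-value: twice the upper-tail area beyond |z|
    p_value = 2 * stats.norm.sf(abs(z))
    # critical value for the requested confidence level (about 1.96 for 95%)
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_star * se
    diff = p_1 - p_2
    return {
        "diff": diff,
        "se": se,
        "z": z,
        "p_value": p_value,
        "margin": margin,
        "ci": (diff - margin, diff + margin),
    }


result = two_proportion_ztest(235, 2435, 157, 2435)
for name, value in result.items():
    print(name, value)
```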

#### Question 4

Our hypothesis test indicates that the gap in callback rates between resumes with white-sounding and black-sounding names is very unlikely to have occurred by chance. Resumes with white-sounding names received callbacks about $9.7\%$ of the time, compared with about $6.4\%$ for black-sounding names, a difference of roughly 3.2 percentage points. With a $p$-value below $0.004\%$, there is statistically significant evidence that the race associated with an applicant's name has a measurable effect on whether the applicant is called back.

#### Question 5

This finding does not necessarily mean that the race associated with an applicant's name is the most important variable in callback rates. The test only shows that the name has a statistically significant effect; it says nothing about how that effect compares with others. The dataset contains many other variables, including education and years of experience, which could have a measurable impact on the likelihood of a callback, perhaps a stronger one than the racial association of the name. To determine this, we would run a multivariable regression on the explanatory variables (for example, a logistic regression, since `call` is binary) and compare both the size and the statistical significance of each variable's effect.
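A rough sketch of what such a regression could look like follows. It is illustrative only: the data are **synthetic** (generated from made-up coefficients, not taken from the study), `is_white` is a hypothetical 0/1 indicator standing in for `race`, and `fit_logistic` is a small NumPy-only Newton's method fit written for this example. In practice a statistics library would normally be used, and the real columns would replace the simulated ones.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))  # predicted probabilities
        w = mu * (1.0 - mu)                   # per-observation variance weights
        grad = X.T @ (y - mu)                 # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])         # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
    # standard errors from the inverse information matrix at the final estimate
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(cov))


# Synthetic data: every value below is made up for illustration
rng = np.random.default_rng(0)
n = 5000
is_white = rng.integers(0, 2, size=n)   # hypothetical indicator for race == 'w'
yearsexp = rng.integers(1, 20, size=n)  # simulated years of experience
true_beta = np.array([-2.8, 0.45, 0.03])  # hypothetical intercept and effects

X = np.column_stack([np.ones(n), is_white, yearsexp])
prob = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, prob)

beta_hat, se_hat = fit_logistic(X, y)
z_scores = beta_hat / se_hat
for name, b, s, z in zip(["intercept", "is_white", "yearsexp"], beta_hat, se_hat, z_scores):
    print(name, b, s, z)
```

Coefficients on variables with different units are not directly comparable, so judging which factor matters most would also require standardizing the predictors or comparing predicted callback probabilities.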