
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [3]:
from __future__ import division

import pandas as pd
import numpy as np
from scipy import stats

In [4]:
# pd.read_stata is the public entry point for reading Stata .dta files
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [5]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [6]:
# Build dataframes with the relevant columns, split by race and by call-back outcome
df = data[['race', 'call']]
dfb = df[df.race == 'b']
dfb0 = dfb[dfb.call == 0]
dfb1 = dfb[dfb.call == 1]
dfw = df[df.race == 'w']
dfw0 = dfw[dfw.call == 0]
dfw1 = dfw[dfw.call == 1]

## 1. What test is appropriate for this problem? Does CLT apply?

A two-sample hypothesis test for the difference between two proportions (a two-proportion z-test) is the appropriate test for this problem. There are two independent samples (race = black and race = white), and each résumé has an outcome of 0 or 1 for no call-back vs. call-back (respectively), so each group can be modeled as a set of Bernoulli trials. The central limit theorem applies because names were assigned to résumés at random and each group is large. For proportions, the usual rule of thumb is that each group contains at least 10 call-backs and at least 10 non-call-backs, which is a stricter check than the generic n > 30 rule.
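
The success/failure rule of thumb mentioned above can be checked with a few lines of code. This is a minimal sketch, not part of the original analysis: the helper name `clt_ok_for_proportion` and the threshold of 10 are assumptions, and the group size used in the example is a placeholder rather than a value taken from the dataset (only the 157 call-backs comes from the notebook output).

```python
def clt_ok_for_proportion(successes, n, min_count=10):
    """Return True if both successes and failures reach min_count.

    Hypothetical helper: a common rule of thumb for using the normal
    approximation to a sample proportion.
    """
    failures = n - successes
    return successes >= min_count and failures >= min_count


# 157 call-backs is the count reported in the notebook for black-sounding names;
# the group size below is a PLACEHOLDER, not the real number of resumes.
n_b_assumed = 2000
print(clt_ok_for_proportion(157, n_b_assumed))
```

In the notebook itself, the same check could be run with `len(dfb1)`, `len(dfb)`, `len(dfw1)`, and `len(dfw)` in place of the placeholder values.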

## 2. What are the null and alternate hypotheses?

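H0: p_b1 = p_w1, i.e., résumés with black-sounding and white-sounding names are called back at the same rate (no racial discrimination in the U.S. labor market)

H1: p_b1 != p_w1, i.e., the call-back rates differ (there is racial discrimination in the U.S. labor market)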
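
Written out in symbols, the hypotheses and the textbook test statistic for a difference of two proportions look as follows. The symbols $x_b, x_w$ (call-back counts) and $n_b, n_w$ (group sizes) are notation introduced here for illustration. This is the pooled form commonly used under H0; the code in this notebook uses the unpooled standard error instead, which gives a very similar result when the groups are large.

$$
H_0: p_{b1} = p_{w1} \qquad H_1: p_{b1} \neq p_{w1}
$$

$$
z = \frac{\hat{p}_{b1} - \hat{p}_{w1}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}},
\qquad
\hat{p} = \frac{x_b + x_w}{n_b + n_w}
$$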

## 3. Compute margin of error, confidence interval, and p-value.

In [7]:
# Proportion of resumes in each group that did not (0) or did (1) get a call-back
p_b0 = len(dfb0) / len(dfb)
p_b1 = 1 - p_b0
p_w0 = len(dfw0) / len(dfw)
p_w1 = 1 - p_w0

# For a Bernoulli distribution, p * (1 - p) is the variance (not the standard deviation),
# so sigma_b and sigma_w hold the per-group variances of the call-back indicator
sigma_b = p_b1 * (1 - p_b1)
sigma_w = p_w1 * (1 - p_w1)

In [22]:
# Absolute difference between the call-back proportions for black-sounding and white-sounding names
sampdist_prop = np.abs(p_b1 - p_w1)

# Standard error of that difference: square root of the summed per-group variances of the sample proportions
sampdist_sigma = np.sqrt((sigma_b / len(dfb)) + (sigma_w / len(dfw)))

# Margin of error is the half-width of the confidence interval. For a 95% level it is about
# 2 standard errors (1.96, rounded up) of the sampling distribution of p_b1 - p_w1
margin_of_error = 2 * sampdist_sigma

# A single formatted string prints the same way under Python 2 and Python 3
print("Margin of Error = %.4f %%" % (margin_of_error * 100))

Margin of Error = 1.5567 %


In [21]:
print("95%% Confidence Interval = %.6f +- %.6f" % (sampdist_prop, margin_of_error))

95% Confidence Interval = 0.032033 +- 0.015567
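
The interval above uses a multiplier of 2 as a stand-in for the 95% critical value. A minimal sketch of the same calculation with the exact normal quantile is shown below. It uses only the Python 3 standard library, the function name `diff_confidence_interval` is hypothetical, and the counts in the example are made up for illustration rather than taken from the dataset.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(x1, n1, x2, n2, level=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error.

    x1, x2 are success counts and n1, n2 are group sizes (hypothetical names).
    Returns (lower, upper, margin_of_error).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference between two independent sample proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, e.g. about 1.96 for a 95% level
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_crit * se
    diff = p1 - p2
    return diff - margin, diff + margin, margin


# Illustrative counts only (NOT the notebook's data)
lower, upper, margin = diff_confidence_interval(90, 1000, 60, 1000)
print("CI = (%.4f, %.4f), margin of error = %.4f" % (lower, upper, margin))
```

Using 1.96 instead of 2 makes the interval slightly narrower, which does not change the conclusion when the interval is already well away from zero.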


In [23]:
# Two-sided p-value from the normal approximation (unpooled standard error)
print("P-Value = %.6e" % (2 * stats.norm.cdf(0, sampdist_prop, sampdist_sigma)))

P-Value = 1.931283e-05
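
For comparison, the textbook two-proportion z-test estimates the standard error from the pooled call-back rate, since H0 assumes both groups share one rate. The sketch below shows that variant using only the Python 3 standard library. The function name `two_proportion_z_test` is hypothetical and the counts are illustrative, not the notebook's data.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2 using the pooled standard error.

    x1, x2 are success counts and n1, n2 are group sizes (hypothetical names).
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one proportion, estimated from the combined data
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: probability of a statistic at least this extreme in either tail
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Illustrative counts only (NOT the notebook's data)
z, p_value = two_proportion_z_test(90, 1000, 60, 1000)
print("z = %.3f, p-value = %.4f" % (z, p_value))
```

With large groups and similar proportions, the pooled and unpooled standard errors are close, so both approaches should lead to the same decision.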


## 4. Discuss statistical significance.

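The p-value (about 1.9e-05) is far below the 5% significance level, so I reject the null hypothesis that résumés with black-sounding and white-sounding names are called back at the same rate. The 95% confidence interval for the difference in call-back rates (0.032 +- 0.016) also excludes zero. Because the names were assigned to résumés at random, the gap can be attributed to the perceived race of the applicant. In other words, there does appear to be racial discrimination in the U.S. labor market.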