# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Load the Stata-format dataset via the public pandas entry point
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

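| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |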


**1. What test is appropriate for this problem? Does CLT apply?**

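The CLT applies: the resumes were randomly assigned to the two name groups, so the observations can be treated as independent, and each group is large (2,435 resumes) with far more than 10 callbacks and 10 non-callbacks. A two-sample z-test for the difference in proportions is therefore appropriate, because the outcome is binary (callback or no callback) and we are comparing the callback rate between two groups.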
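The large-sample condition can be checked directly from the counts. The sketch below applies the common success-failure rule of thumb (at least 10 callbacks and 10 non-callbacks per group). The helper name `clt_conditions_met` and the threshold of 10 are illustrative assumptions, and the white-sounding count of 235 is derived from the proportion printed later in this notebook (0.0965 × 2435).

```python
# Counts taken from this notebook: 157 callbacks for black-sounding names,
# and 0.0965 * 2435 = 235 callbacks for white-sounding names.
n_per_group = 2435
callbacks = {"b": 157, "w": 235}


def clt_conditions_met(successes, n, threshold=10):
    """Success-failure rule of thumb: enough successes AND enough failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for race, calls in callbacks.items():
    no_calls = n_per_group - calls
    ok = clt_conditions_met(calls, n_per_group)
    print(f"race={race}: callbacks={calls}, no callbacks={no_calls}, condition met={ok}")
```

Both groups clear the threshold by a wide margin, which supports using the normal approximation for the sample proportions.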

**2. What are the null and alternate hypotheses?**

The null hypothesis is that the proportion of applicants with black-sounding names who receive a call is equal to the proportion of applicants with white-sounding names who receive a call. The alternative hypothesis is that the proportion of applicants who receive a call is different between these groups.
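In symbols, writing $p_{w}$ and $p_{b}$ for the true callback rates for white-sounding and black-sounding names (notation introduced here for convenience), the two-sided test is:

$$H_{0}: p_{w} - p_{b} = 0 \qquad \text{versus} \qquad H_{A}: p_{w} - p_{b} \neq 0$$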

**3. Compute margin of error, confidence interval, and p-value.**

First, I find the confidence interval for the difference in proportions. To start, I compute the proportion of black-sounding and white-sounding candidates who receive calls:

In [26]:
black_prop = sum(data[data.race == 'b'].call) / data[data.race == 'b']['race'].count()
white_prop = sum(data[data.race == 'w'].call) / data[data.race == 'w']['race'].count()

print(black_prop)
print(white_prop)

0.064476386037
0.0965092402464


Next, I will compute the difference in proportions, as well as the standard error:

In [27]:
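# Observed difference in callback rates (white minus black)
print(white_prop - black_prop)

# Unpooled standard error of the difference; each group contains 2435 resumes
SE_diff = ((black_prop * (1 - black_prop) / 2435) + (white_prop * (1 - white_prop) / 2435))**0.5

print(SE_diff)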

0.0320328542094
0.00778337058668


I now compute the confidence interval for the difference in proportions using the formula $(p_{1} - p_{2}) \pm 1.96 \cdot SE$, where 1.96 is the z-value corresponding to 95% confidence.

In [24]:
CI_diff = [white_prop - black_prop - 1.96 * SE_diff, white_prop - black_prop + 1.96 * SE_diff]
print(CI_diff)

[0.016777447859559147, 0.047288260559332024]
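The same interval, together with the margin of error that question 3 asks for, can be reproduced from the raw counts alone. This is a minimal sketch using only the standard library. The function name `diff_proportion_ci` is an illustrative assumption, and the count of 235 white-sounding callbacks is derived from `white_prop` × 2435.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Return (difference, margin of error, CI) for p1 - p2, unpooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_crit * se  # margin of error at the chosen confidence level
    return diff, margin, (diff - margin, diff + margin)


# White-sounding (235 of 2435) minus black-sounding (157 of 2435)
diff, margin, ci = diff_proportion_ci(235, 2435, 157, 2435)
print(f"difference:      {diff:.4f}")
print(f"margin of error: {margin:.4f}")
print(f"95% CI:          ({ci[0]:.4f}, {ci[1]:.4f})")
```

The margin of error is simply half the width of the confidence interval, so reporting one determines the other.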


Thus, the margin of error is 1.96 × SE ≈ 0.0153 (about 1.53 percentage points), and the 95% confidence interval for the difference in callback rates between white-sounding and black-sounding candidates runs from 1.68 to 4.73 percentage points. Because this interval excludes zero, it already suggests a real difference. Finally, we can conduct a z-test for the difference in proportions:

In [32]:
from statsmodels.stats.proportion import proportions_ztest

white_count = sum(data[data.race=='w'].call)

black_count = sum(data[data.race=='b'].call)

count = np.array([white_count, black_count])
n_obs = np.array([2435, 2435])

z, p = proportions_ztest(count, n_obs)

print("Z statistic: " + str(z))
print("p-value: " + str(p))

Z statistic: 4.10841215243
p-value: 3.98388683759e-05


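The p-value (about 0.00004) is far below the conventional 0.05 threshold, so we reject the null hypothesis that applicants with black-sounding names receive calls at the same rate as applicants with white-sounding names. We conclude that call rates differ by whether a name sounds white or black: resumes with white-sounding names received calls about 9.7% of the time, compared with about 6.4% for otherwise identical resumes with black-sounding names.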
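To make the test statistic less of a black box, the pooled two-proportion z-test can be written out by hand. This is a sketch under the assumption that the test pools the two groups to estimate the standard error under the null hypothesis, which is the default behaviour of `proportions_ztest`. The function name `two_proportion_ztest` is illustrative, and the counts 235 and 157 come from the proportions computed above.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common callback rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z_manual, p_manual = two_proportion_ztest(235, 2435, 157, 2435)
print(f"Z statistic: {z_manual:.4f}")
print(f"p-value: {p_manual:.3e}")
```

Note that the confidence interval uses the unpooled standard error while the test uses the pooled one, which is why the two standard errors differ slightly.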

**5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?**

This analysis does not show that the perceived race of a candidate's name is the most important factor in callback success. We have shown only that race makes a statistically significant difference, not that this difference is larger than the differences associated with other factors. To determine whether race is the most important factor, I would compute confidence intervals and z-tests for the other columns in the data and compare their effect sizes and p-values with the results from this analysis.
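One way to sketch that comparison is a helper that computes the callback-rate gap and pooled z-test for any 0/1 column, then ranks the columns by absolute effect size. Everything below is an assumption for illustration: the helper name `callback_gap` is hypothetical, and the `toy` frame is synthetic data made up for this example, not the study data. With the real dataset, `race` would first need recoding from `'b'`/`'w'` to 0/1.

```python
import math

import pandas as pd


def callback_gap(df, factor, outcome="call"):
    """Callback-rate gap between factor == 1 and factor == 0, with pooled z-test."""
    grp1 = df.loc[df[factor] == 1, outcome]
    grp0 = df.loc[df[factor] == 0, outcome]
    n1, n0 = len(grp1), len(grp0)
    p1, p0 = grp1.mean(), grp0.mean()
    pooled = (grp1.sum() + grp0.sum()) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return {"factor": factor, "diff": p1 - p0, "z": z, "p_value": p_value}


# SYNTHETIC toy data for illustration only (NOT the study data)
toy = pd.DataFrame({
    "call":     [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    "honors":   [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    "military": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Rank the factors by the absolute size of the callback-rate gap
results = sorted(
    (callback_gap(toy, f) for f in ["honors", "military"]),
    key=lambda r: abs(r["diff"]),
    reverse=True,
)
for r in results:
    print(f"{r['factor']}: diff={r['diff']:+.3f}, z={r['z']:.2f}, p={r['p_value']:.3f}")
```

Running many separate tests this way inflates the chance of a spurious finding, and unlike race, the other columns were not necessarily randomized, so their gaps should be read as associations rather than causal effects.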