# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Load the Stata file through the public pandas entry point
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

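|   | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|----|----|-----------|--------|----------|--------|-----------|----------|----------|---------------|-----|---------|--------|-------|----------|----------|-------|------------|------------|---------|-----------|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |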


In [5]:
len(data)

4870

## 1. What test is appropriate for this problem? Does CLT apply?

I think the best test is a chi-square test of independence, which checks whether the proportion of callbacks differs between the two name groups.

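The CLT applies: there are sufficient data points (2,435 resumes per group, with well over 10 callbacks and 10 non-callbacks in each), the names were assigned to resumes at random, and the observations can be treated as independent.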
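
The sample-size part of that claim can be checked with the usual success/failure rule of thumb. This is a minimal sketch, assuming the common threshold of at least 10 callbacks and 10 non-callbacks per group (the threshold and the helper name `large_enough` are illustrative choices, not part of the exercise); the counts are the ones tabulated in this notebook.

```python
# Counts taken from this notebook: 2435 resumes per group,
# with 235 (white-sounding) and 157 (black-sounding) callbacks.
groups = {"white": (235, 2435), "black": (157, 2435)}

def large_enough(successes, n, threshold=10):
    """Success/failure rule of thumb: both counts must reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

for name, (callbacks, n) in groups.items():
    print(name, callbacks, n - callbacks, large_enough(callbacks, n))
```

Both groups clear the threshold by a wide margin, so a normal approximation to the sampling distribution of each callback proportion looks reasonable.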

## 2. What are the null and alternate hypotheses?

The null hypothesis is that the frequency of callbacks does not depend on race.

The alternative is that it does. Specifically, in looking at discrimination, the question is whether white-sounding names get more responses than black-sounding names.
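
In symbols, writing $p_w$ and $p_b$ for the true callback rates of white-sounding and black-sounding names (notation introduced here for illustration), the two-sided hypotheses that a chi-square test addresses can be stated as:

$$
H_0:\; p_w = p_b \qquad\qquad H_A:\; p_w \neq p_b
$$

A directional version of the discrimination question would instead use $H_A:\; p_w > p_b$.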

## 3. Compute margin of error, confidence interval, and p-value.

In [6]:
race = data.race
call = data.call

In [7]:
call.value_counts(),race.value_counts()

(0.0    4478
 1.0     392
 Name: call, dtype: int64, b    2435
 w    2435
 Name: race, dtype: int64)

In [12]:
x = call.value_counts()
x[1]

392

In [8]:
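# Count callbacks per group with boolean masks instead of a Python loop
wc = int(((race == 'w') & (call == 1)).sum())  # white-sounding names with a callback
bc = int(((race == 'b') & (call == 1)).sum())  # black-sounding names with a callback

# No-callback counts: group size (2435 each) minus callbacks
wnc = int((race == 'w').sum()) - wc
bnc = int((race == 'b').sum()) - bc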

In [9]:
table = pd.DataFrame([[wnc, bnc],[wc, bc]],columns = ['White', 'Black'])
print(table)

   White  Black
0   2200   2278
1    235    157


In [47]:
stats.chi2_contingency(table)

(16.449028584189371, 4.9975783899632552e-05, 1, array([[ 2239.,  2239.],
        [  196.,   196.]]))

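My own calculations, done without a continuity correction, were very similar: chi^2 = 16.879, p = .00004. The small gap from the output above arises because `stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.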
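
Both figures can be reproduced by hand from the contingency table. The sketch below uses only the standard library; the helper `chi_square_2x2` is a hypothetical name written for this illustration, and the p-value relies on the fact that a chi-square variable with 1 degree of freedom has survival function erfc(sqrt(x / 2)).

```python
import math

# Observed counts from the contingency table above:
# rows are no-callback / callback, columns are White / Black.
observed = [[2200, 2278],
            [235, 157]]

def chi_square_2x2(obs, yates=False):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(obs[i][j] - expected)
            if yates:
                # Yates' correction shrinks each |observed - expected| by 0.5
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

plain_stat, plain_p = chi_square_2x2(observed)
yates_stat, yates_p = chi_square_2x2(observed, yates=True)
print(f"uncorrected: chi2 = {plain_stat:.3f}, p = {plain_p:.2e}")
print(f"with Yates:  chi2 = {yates_stat:.3f}, p = {yates_p:.2e}")
```

The uncorrected statistic comes out at about 16.879 and the corrected one at about 16.449, so the choice of correction does not change the conclusion here.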

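Note that the chi-square test itself does not produce a confidence interval or margin of error. It is simply an analysis of how likely this distribution is to have occurred assuming equal treatment. A margin of error and confidence interval apply instead to the difference in callback rates, which calls for a two-proportion z-procedure.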
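
Although the chi-square test yields no interval, the margin of error and confidence interval that the exercise asks for can be obtained from a two-proportion z-procedure on the difference in callback rates. This is a sketch of that alternative, not the method used above; it assumes a 95% confidence level with the approximate critical value 1.96, and all variable names are illustrative.

```python
import math

# Callback counts and group sizes from the contingency table above
calls_w, n_w = 235, 2435
calls_b, n_b = 157, 2435

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# Unpooled standard error, used for the confidence interval
se_ci = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = 1.96  # approximate two-sided 95% critical value (assumed level)
margin = z_crit * se_ci
ci = (diff - margin, diff + margin)

# Pooled standard error, used for the test of H0: p_w == p_b
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_test
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

print(f"difference in callback rates: {diff:.4f}")
print(f"margin of error (95%): {margin:.4f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.3f}, p = {p_value:.2e}")
```

The interval excludes zero, and the squared z statistic equals the uncorrected chi-square statistic for the same 2x2 table, so the two approaches agree.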

## 4. Write a story describing the statistical significance in the context of the original problem.

In order to test racial discrimination, resumes were sent out with the same qualifications but different names. Some of these names sounded black, while some sounded white. The rate of contact from these resumes was recorded and organized. 

When looking at the frequency of callbacks, there is an advantage to white-sounding names (235 callbacks compared to 157). With an overall callback rate of about 8% (392 of 4,870 resumes), the expected count would be 196 callbacks for each race assuming no discrimination.

A chi-square analysis comparing the observed frequencies to the expected ones gave a statistic of around 16.5. If race had no effect, a result at least this extreme would occur through random chance alone with a probability of about 0.005%. This is strong evidence of discrimination based on names.

## 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

This does not mean that race/name is the most important factor in receiving a call. It simply means that callback rates differed between the two name groups by a statistically significant amount.
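
A p-value says nothing about how large the difference is, so an effect size is a useful companion when judging importance. The sketch below computes two common ones, the ratio of callback rates and the odds ratio, from the counts in this notebook; the choice of these two measures is an illustration rather than part of the original analysis.

```python
# Callback counts and group sizes from the contingency table in this notebook
calls_w, n_w = 235, 2435
calls_b, n_b = 157, 2435

rate_w = calls_w / n_w
rate_b = calls_b / n_b

# Ratio of callback rates (white-sounding relative to black-sounding names)
rate_ratio = rate_w / rate_b

# Odds ratio: odds of a callback for white-sounding vs. black-sounding names
odds_w = calls_w / (n_w - calls_w)
odds_b = calls_b / (n_b - calls_b)
odds_ratio = odds_w / odds_b

print(f"callback rates: white = {rate_w:.3f}, black = {rate_b:.3f}")
print(f"rate ratio = {rate_ratio:.2f}, odds ratio = {odds_ratio:.2f}")
```

Computing the same measures for other resume attributes would give a like-for-like way to compare their influence with that of the name.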

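Other variables, however, such as gender, experience level, resume quality, or address, could also play a (potentially bigger) role. Because names were assigned at random, these variables do not confound the race effect itself, but the analysis so far says nothing about their relative importance. Conducting additional frequency tests, or moving to a regression analysis, would be illuminating.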