# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternative hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [2]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

In [12]:
data = pd.read_stata('data/us_job_market_discrimination.dta')
black = data[data.race == 'b']
number_black = len(black)
white = data[data.race == 'w']
number_white = len(white)
total = len(data)


In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

## Question 1 

What test is appropriate for this problem? Does CLT apply?

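The chi-squared test of independence is appropriate for this problem because both variables, race and callback, are categorical. The CLT applies: each group contains 2,435 resumes with well over 10 callbacks and 10 non-callbacks, so the sampling distribution of each callback proportion is approximately normal.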
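
A common rule of thumb for applying the CLT to a proportion is that each group has at least 10 successes and 10 failures. The following is a minimal sketch of that check, with the counts hard-coded from the two-way table in Question 3. The threshold of 10 and the helper name `meets_success_failure_rule` are assumptions for illustration, not part of the original analysis.

```python
# Counts copied from the two-way table in Question 3.
groups = {
    'black': {'call_back': 157, 'no_call_back': 2278},
    'white': {'call_back': 235, 'no_call_back': 2200},
}


def meets_success_failure_rule(counts, threshold=10):
    """Return True when every outcome count reaches the threshold."""
    return all(n >= threshold for n in counts.values())


# Apply the rule of thumb to each group separately.
clt_ok = {race: meets_success_failure_rule(counts)
          for race, counts in groups.items()}
print(clt_ok)
```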



## Question 2

What are the null and alternative hypotheses? 

* Null hypothesis: Race is not associated with callbacks
* Alternative hypothesis: Race is associated with callbacks

## Question 3

Create a two-way table of race and callback tallies.


In [7]:
# Build a two-way table of race and callback tallies.

index = ['black', 'white', 'total']
cols = ['call_back', 'no_call_back', 'total']
b_c = sum(data[data.race == 'b'].call)
w_c = sum(data[data.race == 'w'].call)
total_c = b_c + w_c
b_nc = number_black - b_c
w_nc = number_white - w_c
total_nc = b_nc + w_nc

table = pd.DataFrame(index=index, columns=cols)
# Assign with .loc[row, column]; .ix has been removed from pandas,
# and chained indexing may write to a temporary copy.
table.loc['black', 'call_back'] = b_c
table.loc['white', 'call_back'] = w_c
table.loc['total', 'call_back'] = total_c

table.loc['black', 'no_call_back'] = b_nc
table.loc['white', 'no_call_back'] = w_nc
table.loc['total', 'no_call_back'] = total_nc

table.loc['black', 'total'] = b_c + b_nc
table.loc['white', 'total'] = w_nc + w_c
table.loc['total', 'total'] = total_nc + total_c

table

|       | call_back | no_call_back | total |
|-------|-----------|--------------|-------|
| black | 157       | 2278         | 2435  |
| white | 235       | 2200         | 2435  |
| total | 392       | 4478         | 4870  |


## Question 3a
Compute the p-value using a chi-squared test of independence.

In [8]:
p_value = scipy.stats.chi2_contingency(table)[1]
p_value


0.0020403793672093755
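
The `table` passed to `chi2_contingency` above includes the `total` row and column, so SciPy treats it as a 3×3 table with 4 degrees of freedom, which appears to be where the reported p-value comes from. The following is a minimal sketch of the same test restricted to the four observed cells, with the counts hard-coded from the two-way table. The names `observed` and `p_inner` are illustrative and not from the original notebook.

```python
import numpy as np
from scipy import stats

# Inner 2x2 cells of the two-way table (margin totals excluded).
# Rows: black, white. Columns: call_back, no_call_back.
observed = np.array([[157, 2278],
                     [235, 2200]])

# Returns the test statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2_stat, p_inner, dof, expected = stats.chi2_contingency(observed)
print(dof, chi2_stat, p_inner)
```

By default `chi2_contingency` applies Yates' continuity correction when the table has one degree of freedom; passing `correction=False` turns it off.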

## Question 3b
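Compute a 95% confidence interval. I'm unsure whether the degrees of freedom should be based on the two race categories (black and white) or on all the categories in the original data. For a 2×2 table of race by callback, the degrees of freedom are (2 − 1)(2 − 1) = 1, which is the value used below.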

In [11]:
lower_limit = stats.chi.ppf(q=.025, df=1)
upper_limit = stats.chi.ppf(q=.975,df=1)
lower_limit, upper_limit

(0.031337982021426597, 2.2414027276049451)
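
The two values above are quantiles of a chi distribution, so they bound a test statistic rather than the callback rates themselves. One way to get the margin of error and confidence interval asked for in the exercise is a normal-approximation interval for the difference between the two callback proportions. The following is a minimal sketch under that assumption, with counts hard-coded from the two-way table. The variable names are illustrative and not from the original notebook.

```python
import numpy as np
from scipy import stats

# Counts copied from the two-way table in Question 3.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

# Observed callback proportions and their difference (white minus black).
p_b = calls_b / n_b
p_w = calls_w / n_w
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# Two-sided 95% interval using the standard normal critical value.
z_crit = stats.norm.ppf(0.975)
margin_of_error = z_crit * se
ci_lower, ci_upper = diff - margin_of_error, diff + margin_of_error
print(diff, margin_of_error, (ci_lower, ci_upper))
```

If the resulting interval excludes zero, that is consistent with rejecting the null hypothesis of no association at the 5% level.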