# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

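| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |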


In [8]:
len(data)

4870

### 1. What test is appropriate for this problem? Does CLT apply?
For this data, the appropriate test is a two-sample z-test for the difference between two proportions. The two groups are resumes with black-sounding names and resumes with white-sounding names, and I will test the difference in their callback proportions. The z-statistic is appropriate because both groups are large (2,435 resumes each), so the sampling distribution of each sample proportion is approximately normal.

For CLT to apply, the data:
* must be independent
* must have a large enough sample: a common rule of thumb is 30 or more observations and, for proportions, at least 10 successes and 10 failures in each group.

In this case, CLT does apply. The sample size is 4870, far larger than 30, and each group has well over 10 callbacks (157 and 235) and well over 10 non-callbacks. The observations can be treated as independent because the names were randomly assigned to the resumes, so one resume's chance of getting a call does not affect another's.
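
The sample-size condition for proportions can be checked directly from the counts. The following is a minimal sketch, assuming the callback counts (157 and 235) and the group size (2,435) that are computed later in this notebook, and assuming the common rule of thumb of at least 10 successes and 10 failures per group. The helper name `success_failure_ok` is hypothetical.

```python
# Counts taken from the notebook: callbacks and resumes per group.
n_per_group = 2435
callbacks = {"b": 157, "w": 235}

def success_failure_ok(successes, n, minimum=10):
    """Rule-of-thumb check: at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

# Apply the check to each group (assumed threshold: 10).
checks = {race: success_failure_ok(k, n_per_group) for race, k in callbacks.items()}
print(checks)
```

Both groups pass this check, which supports using the normal approximation for the two sample proportions.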

### 2. What are the null and alternate hypotheses?
H<sub>0</sub>: p<sub>w</sub> = p<sub>b</sub>

H<sub>A</sub>: p<sub>w</sub> $\neq$ p<sub>b</sub>

$\alpha$ = 0.01 (99% confidence level)

where p<sub>w</sub> = proportion of resumes with white-sounding names that receive a call from the employer and p<sub>b</sub> = proportion of resumes with black-sounding names that receive a call from the employer
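
For reference, the test statistic implied by these hypotheses can be written out as follows, assuming the usual pooled two-proportion z-test. The symbols $x_w$, $x_b$ (callback counts), $n_w$, $n_b$ (group sizes), and $\hat{p}$ (pooled proportion) are notation introduced here for illustration.

$$
\hat{p} = \frac{x_w + x_b}{n_w + n_b}, \qquad
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}}
$$

Under H<sub>0</sub>, $z$ is approximately standard normal, so a large $|z|$ is evidence against equal callback proportions.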

### 3. Compute margin of error, confidence interval, and p-value.
Note: There are 2,435 resumes in each group (black-sounding and white-sounding names).

In [37]:
# Count the # of people who were called back with black-sounding names
b=data[data['race']=='b']
b[b['call']==1].count()

id                    157
ad                    157
education             157
ofjobs                157
yearsexp              157
honors                157
volunteer             157
military              157
empholes              157
occupspecific         157
occupbroad            157
workinschool          157
email                 157
computerskills        157
specialskills         157
firstname             157
sex                   157
race                  157
h                     157
l                     157
call                  157
city                  157
kind                  157
adid                  157
fracblack             153
fracwhite             153
lmedhhinc             153
fracdropout           153
fraccolp              153
linc                  153
                     ... 
parent_emp             46
branch_sales           19
branch_emp             19
fed                    94
fracblack_empzip       47
fracwhite_empzip       47
lmedhhinc_empzip       47
fracdropout_

In [41]:
# Calculate the proportion of people who were called back with black-sounding names
b_prop=157/2435
b_prop

0.06447638603696099

In [42]:
# Count the # of people who were called back with white-sounding names
w=data[data['race']=='w']
w[w['call']==1].count()

id                    235
ad                    235
education             235
ofjobs                235
yearsexp              235
honors                235
volunteer             235
military              235
empholes              235
occupspecific         235
occupbroad            235
workinschool          235
email                 235
computerskills        235
specialskills         235
firstname             235
sex                   235
race                  235
h                     235
l                     235
call                  235
city                  235
kind                  235
adid                  235
fracblack             231
fracwhite             231
lmedhhinc             231
fracdropout           231
fraccolp              231
linc                  231
                     ... 
parent_emp             74
branch_sales           28
branch_emp             30
fed                   143
fracblack_empzip       78
fracwhite_empzip       78
lmedhhinc_empzip       77
fracdropout_

In [44]:
# Calculate the proportion of people who were called back with white-sounding names
w_prop=235/2435
w_prop

0.09650924024640657

In [47]:
# Calculate the difference in proportions pw - pb
w_b=w_prop-b_prop
w_b

0.032032854209445585

In [69]:
# Calculate the standard error of the difference in proportions, using the pooled proportion p
p=(235+157)/4870
q=(1/2435)*2*p*(1-p)
std=np.sqrt(q)
std

0.0077968940361704568

In [72]:
# Calculate z
z=w_b/std
z

4.1084121524343464

The z-value obtained is far more extreme than the critical value of 2.576 for a two-tailed test at the 0.01 significance level (99% confidence), so we reject the null hypothesis in favor of the alternative. In other words, the difference in callback proportions between black-sounding and white-sounding names is statistically significant.
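
The exercise also asks for a p-value. The following is a minimal sketch of one way to get it, assuming a two-tailed test and using only the counts from this notebook. It uses the standard-library identity $P(|Z| \ge z) = \mathrm{erfc}(z/\sqrt{2})$ for a standard normal $Z$; with the `stats` module imported at the top of the notebook, `2 * stats.norm.sf(z)` should give the same value.

```python
import math

# Values from the notebook: callbacks per group and resumes per group.
x_w, x_b, n = 235, 157, 2435

p_w, p_b = x_w / n, x_b / n
p_pool = (x_w + x_b) / (2 * n)                         # pooled proportion under H0
se_pool = math.sqrt(p_pool * (1 - p_pool) * (2 / n))   # pooled standard error
z = (p_w - p_b) / se_pool

# Two-tailed p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.4f}, p-value = {p_value:.2e}")
```

The resulting p-value is on the order of 4e-5, far below $\alpha$ = 0.01, which agrees with the comparison against the critical value.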

In [89]:
# Calculate the margin of error at the 99% level
ww=(w_prop*(1-w_prop))/2435
bb=(b_prop*(1-b_prop))/2435
margin_error=(np.sqrt(ww+bb))*2.576
margin_error

0.020049962631279322

In [88]:
# Calculate the 99% confidence interval
confidence_interval=(w_b-margin_error, w_b+margin_error)
confidence_interval

(0.011982891578166264, 0.05208281684072491)
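
The margin of error and interval above hardcode the critical value 2.576. As a sketch of an alternative, the critical value can be derived from the chosen confidence level, here with the standard library's `statistics.NormalDist` (Python 3.8+). The variable names are illustrative and the counts are the ones from this notebook.

```python
import math
from statistics import NormalDist

# Values from the notebook: callbacks per group and resumes per group.
x_w, x_b, n = 235, 157, 2435
p_w, p_b = x_w / n, x_b / n

confidence = 0.99
# Two-sided critical value: the (1 - alpha/2) quantile of the standard normal.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Unpooled standard error, matching the interval computed in the notebook.
se = math.sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)
margin = z_crit * se
diff = p_w - p_b
ci = (diff - margin, diff + margin)
print(f"z* = {z_crit:.4f}, margin = {margin:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Changing `confidence` to, say, 0.95 reuses the same calculation without looking up a new critical value.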

### 4. Write a story describing the statistical significance in the context of the original problem.
Based on my analysis, we are 99% confident that resumes with white-sounding names have a higher callback rate than otherwise identical resumes with black-sounding names, by between 1.2 and 5.2 percentage points, assuming that the sample provided is representative of employers in the United States. However, this analysis did not take other characteristics such as education level and experience into account.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
My analysis does not mean that one's race/name is the most important factor in callback success, because there are other factors that an employer considers when looking at a potential candidate, such as education, experience level, and (illegally) gender and age. I would amend my analysis by grouping candidates based on these other traits and then comparing the callback proportions for black- and white-sounding names within each group, which would make the analysis more accurate. When looking at the effect of one particular characteristic of a population, it is best to control for every other possible influence in order to maintain one's analytical integrity.
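
The grouping described above can be sketched as follows. This is only an illustration: the rows are hypothetical toy data, not the study data, and the helper `callback_rates_by_stratum` is a made-up name. With the notebook's DataFrame, a comparable table could likely be obtained with `data.groupby(['education', 'race'])['call'].mean()`.

```python
from collections import defaultdict

# Hypothetical toy rows of (education level, race, call); NOT the study data.
rows = [
    (3, "b", 0), (3, "b", 1), (3, "w", 1), (3, "w", 1),
    (4, "b", 0), (4, "b", 0), (4, "w", 0), (4, "w", 1),
]

def callback_rates_by_stratum(rows):
    """Return {stratum: {race: callback rate}} for (stratum, race, call) rows."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [calls, count]
    for stratum, race, call in rows:
        cell = totals[stratum][race]
        cell[0] += call
        cell[1] += 1
    return {
        stratum: {race: calls / count for race, (calls, count) in by_race.items()}
        for stratum, by_race in totals.items()
    }

rates = callback_rates_by_stratum(rows)
for stratum, by_race in sorted(rates.items()):
    # Within-stratum gap in callback rate between the two name groups.
    print(stratum, by_race, "gap (w - b):", by_race["w"] - by_race["b"])
```

Comparing the gap within each stratum shows whether the difference in callback rates persists once the grouping trait is held fixed.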