# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [5]:
import pandas as pd
import numpy as np
from scipy import stats

In [6]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [7]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [11]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [12]:
# number of callbacks for black-sounding names
tot_b_called=sum(data[data.race=='b'].call)
tot_b_called

157.0

In [13]:
total_b=data[data.race=='b'].race.size
total_b

2435

In [14]:
## percentage of resumes with black-sounding names that received a callback
per_b=tot_b_called/total_b*100
per_b

6.447638603696099

In [15]:
data.race.size

4870

In [16]:
# number of callbacks for white-sounding names
tot_w_called=sum(data[data.race=='w'].call)
tot_w_called

235.0

In [17]:
total_w=data[data.race=='w'].race.size
total_w

2435

In [18]:
## percentage of resumes with white-sounding names that received a callback
per_w=tot_w_called/total_w * 100
per_w

9.650924024640657

In [19]:
diff=per_w-per_b
diff

3.2032854209445585

### Answer 1

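Each resume has a binary outcome (callback = 1, no callback = 0), so the callbacks in each group follow a Bernoulli distribution and their totals a binomial distribution. Both groups are large (2,435 resumes each, with 157 and 235 callbacks), so the CLT applies and the difference between the two sample callback proportions is approximately normal. A two-sample test for the difference in proportions is therefore appropriate; with samples this large, the z and t versions of the test give nearly identical results.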
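
One common rule of thumb for the normal approximation is that each group should contain at least 10 successes and 10 failures. The sketch below checks that rule with the counts reported in the cells above; the helper name `normal_approx_ok` and the threshold of 10 are assumptions made for illustration, not part of the original analysis.

```python
# Counts taken from the notebook cells above.
n_b, calls_b = 2435, 157   # black-sounding names: resumes sent, callbacks
n_w, calls_w = 2435, 235   # white-sounding names: resumes sent, callbacks


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb (hypothetical helper): successes and failures both >= threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for label, k, n in [("b", calls_b, n_b), ("w", calls_w, n_w)]:
    # Print the group label, successes, failures, and whether the rule is met.
    print(label, k, n - k, normal_approx_ok(k, n))
```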

### Answer 2

H0: There is no difference between the callback rates of resumes with black-sounding and white-sounding names.

H1: There is a difference between the callback rates of resumes with black-sounding and white-sounding names.

Each group has far more than 30 observations, so a z-statistic is appropriate.
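
As a rough illustration of the z-statistic mentioned here, the sketch below runs a pooled two-proportion z-test on the counts reported earlier in the notebook. Pooling the two groups under the null hypothesis is a standard variant, but it is not the calculation used in the solution cells, which use the unpooled standard error; the variable names are chosen for this example only.

```python
import math

# Counts taken from the notebook cells above.
n_b, calls_b = 2435, 157
n_w, calls_w = 2435, 235

p_b = calls_b / n_b
p_w = calls_w / n_w

# Under H0 both groups share one callback rate, so pool the counts.
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z = (p_w - p_b) / se_pool

# Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value_z = math.erfc(abs(z) / math.sqrt(2))
print(z, p_value_z)
```

The pooled and unpooled standard errors are very close for these data, so both approaches should lead to the same conclusion.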

In [21]:
# Your solution to Q3 here


## variance of the sample callback proportion, black-sounding names
P1=per_b/100
n1=total_b
var_b=(P1*(1-P1)/n1)
var_b

2.4771737856498466e-05

In [22]:
## variance of the sample callback proportion, white-sounding names
P2=per_w/100
n2=total_w
var_w=(P2*(1-P2)/n2)
var_w

3.580911983304638e-05

In [23]:
## variance of the sampling distribution of P1-P2 (sum of the two variances, groups are independent)
var_b_w= var_b + var_w
var_b_w

6.058085768954485e-05

In [24]:
std_b_w=np.sqrt(var_b_w)
std_b_w

0.0077833705866767544
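
For reference, the standard error computed in the cells above corresponds to the usual unpooled formula for the difference of two independent sample proportions, written here with the notebook's own symbols (`P1`, `P2`, `n1`, `n2`) and the rounded values printed above:

$$
\mathrm{SE}(P_1 - P_2) = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}} = \sqrt{2.477\times10^{-5} + 3.581\times10^{-5}} \approx 0.00778
$$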

In [25]:
abs(P1-P2)

0.032032854209445585

In [26]:
## margin of error for (P1-P2) at the 95% confidence level (z = 1.96)
moe=1.96*std_b_w
moe

0.015255406349886438

In [27]:
min_P1_P2=abs(P1-P2)-moe
min_P1_P2

0.016777447859559147

In [28]:
max_P1_P2=abs(P1-P2)+moe
max_P1_P2

0.047288260559332024

In [29]:
## confidence interval
ci = abs(P1-P2) + np.array([-1, 1]) * moe
ci

array([0.01677745, 0.04728826])
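
Question 3 also asks for a bootstrap approach, which the cells above do not cover. The following is a minimal sketch of a percentile bootstrap for the same confidence interval. It rebuilds the 0/1 callback arrays from the counts reported earlier instead of reading the data file; the seed, the number of replicates, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Rebuild the 0/1 callback arrays from the counts reported above.
calls_b = np.r_[np.ones(157), np.zeros(2435 - 157)]
calls_w = np.r_[np.ones(235), np.zeros(2435 - 235)]

n_reps = 10_000  # number of bootstrap replicates (an assumption)
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    # Resample each group with replacement, keeping its original size.
    b_star = rng.choice(calls_b, size=calls_b.size, replace=True)
    w_star = rng.choice(calls_w, size=calls_w.size, replace=True)
    boot_diffs[i] = w_star.mean() - b_star.mean()

# Percentile interval: the middle 95% of the bootstrap differences.
boot_ci = np.percentile(boot_diffs, [2.5, 97.5])
boot_moe = (boot_ci[1] - boot_ci[0]) / 2
print(boot_ci, boot_moe)
```

The bootstrap interval should land close to the frequentist interval above, since the sampling distribution of the difference is nearly normal at this sample size.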

In [30]:
## Standard Error Calculation
SE=std_b_w
SE

0.0077833705866767544

In [31]:
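## degrees of freedom (Welch-Satterthwaite approximation)
# Note: var_b and var_w are already divided by n, so dividing by n again below
# is redundant; it cancels out here only because n1 == n2. The textbook formula
# also uses n - 1 rather than n in the denominator, a negligible change at n = 2435.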
B1=var_b/n1
W1=var_w/n2

DF=((B1+W1)**2)/(((B1**2)/n1)+((W1**2)/n2))
DF

4713.53819343226

In [33]:
t_val=((P1-P2)-0)/SE
t_val

-4.11555043573

In [34]:
p_value = stats.t.sf(np.abs(t_val), DF)*2  # two-sided pvalue = Prob(abs(t)>tt)
p_value

3.9285451158654165e-05
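
A resampling counterpart to this p-value is a permutation test, sketched below. Note that this is a permutation test, not a bootstrap in the strict sense: it shuffles the group labels to simulate the null hypothesis of no difference. The arrays are rebuilt from the counts reported earlier, and the seed, number of permutations, and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Rebuild the 0/1 callback arrays from the counts reported above.
calls_b = np.r_[np.ones(157), np.zeros(2435 - 157)]
calls_w = np.r_[np.ones(235), np.zeros(2435 - 235)]
observed = calls_w.mean() - calls_b.mean()

pooled = np.concatenate([calls_b, calls_w])
n_b = calls_b.size

n_perms = 10_000  # number of label shuffles (an assumption)
perm_diffs = np.empty(n_perms)
for i in range(n_perms):
    # Shuffle all outcomes, then split them back into two groups of the original sizes.
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[n_b:].mean() - shuffled[:n_b].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference.
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed))
print(observed, p_perm)
```

Because the frequentist p-value is so small, very few shuffles (possibly none) are expected to be as extreme as the observed difference; in that case the permutation p-value can only be reported as below roughly `1 / n_perms`.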

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

### Answer 4

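At the 95% confidence level, the difference in callback rates between resumes with white-sounding and black-sounding names lies between 0.017 and 0.047, that is, between 1.7 and 4.7 percentage points. The interval excludes zero, and the p-value (about 3.9e-05) is far below 0.05, so we reject the null hypothesis: resumes with white-sounding names received significantly more callbacks (9.65% versus 6.45%).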

### Answer 5

This analysis does not show that race/name is the most important factor in callback success; it shows only that race/name affects callback success. To find out which feature matters most, the relationship between callback success and each of the other features in the dataset (for example education, years of experience, or honors) would have to be analyzed, and the features ranked by the size of their effect.
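
One rough way to start such a comparison is to compute, for each binary feature, the gap in callback rate between resumes that have the feature and those that do not. The sketch below uses a hypothetical helper, `callback_gap`, and a tiny made-up DataFrame purely to show the mechanics; the numbers are not from the study. Raw gaps do not control for the other features, so a model such as logistic regression would be a more careful next step.

```python
import pandas as pd


def callback_gap(df, feature, outcome="call"):
    """Callback rate where feature == 1 minus the rate where feature == 0 (hypothetical helper)."""
    rates = df.groupby(feature)[outcome].mean()
    return rates.loc[1] - rates.loc[0]


# Tiny made-up frame, for illustration only; NOT the study data.
toy = pd.DataFrame({
    "call":     [1, 0, 0, 1, 0, 0, 1, 0],
    "honors":   [1, 1, 0, 1, 0, 0, 0, 0],
    "military": [0, 1, 0, 0, 1, 0, 1, 0],
})

# Gap in callback rate for each binary feature, ranked by absolute size.
gaps = {col: callback_gap(toy, col) for col in ["honors", "military"]}
ranked = sorted(gaps, key=lambda col: abs(gaps[col]), reverse=True)
print(gaps, ranked)
```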