# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [21]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

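| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |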


In [5]:
data.shape

(4870, 65)

In [6]:
sum(data[data.race=='b'].call)

157.0

In [7]:
sum(data.call)

392.0

<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

### 1. What test is appropriate for this problem? Does CLT apply?
Answers:
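* Each resume either receives a callback or does not, so `call` is a Bernoulli variable, and the number of callbacks in each group follows a binomial distribution. Comparing two such groups calls for a two-sample test of proportions (a z-test). Because `call` is coded 0/1, a two-sample t-test on the group means gives nearly the same result at this sample size.
* The CLT applies. The names were assigned to resumes at random, the sample is large (4,870 resumes), and both groups have far more than 10 callbacks (157 and 235), so the normal distribution is a reasonable approximation to the binomial for each group's callback rate.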
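The usual rule of thumb behind that statement can be checked in a few lines. This is a minimal sketch, not part of the original analysis: it uses the callback counts printed above and assumes the 4,870 resumes are split evenly between the two groups, which the notebook does not confirm. The helper name `clt_ok` and the threshold of 10 are illustrative choices.

```python
# Callback counts reported in the notebook output above
callbacks_w, callbacks_b = 235, 157
n_total = 4870

# ASSUMPTION: the resumes are split evenly between the two groups.
# With the data loaded, use len(data[data.race == 'w']) and
# len(data[data.race == 'b']) instead.
n_w = n_b = n_total // 2


def clt_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    return successes >= threshold and (n - successes) >= threshold


print("white-sounding:", clt_ok(callbacks_w, n_w))
print("black-sounding:", clt_ok(callbacks_b, n_b))
```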

### 2. What are the null and alternate hypotheses?
Answers:
* Null Hypothesis: The callback rate is the same for resumes with black-sounding and white-sounding names.
* Alternative Hypothesis: The callback rate differs between resumes with black-sounding and white-sounding names.
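In symbols, writing $p_w$ and $p_b$ for the true callback rates of white-sounding and black-sounding names (notation introduced here for convenience), the two-sided test is:

$$
H_0: p_w - p_b = 0 \qquad \text{versus} \qquad H_A: p_w - p_b \neq 0
$$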

In [12]:
white = data[data.race=='w']
black = data[data.race=='b']

In [13]:
# Your solution to Q3 here

# Let's look at our data

In [15]:
# Total no. of callbacks across all records:
total_callbacks = sum(data.call)
total_callbacks

392.0

In [16]:
# total black-sounding and white sounding callbacks out of total
black_total = sum(data[data.race=='b'].call)
print(black_total)

white_total = sum(data[data.race=='w'].call)
print(white_total)

157.0
235.0


In [25]:
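# black-sounding names' share of all callbacks, in percent (not the group's callback rate)
bp = black_total/total_callbacks * 100

In [26]:
# white-sounding names' share of all callbacks, in percent (not the group's callback rate)
wp = white_total/total_callbacks * 100

White-sounding names account for a larger share of the callbacks.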

In [27]:
print("\nDifference in callback shares (percentage points):", wp-bp)


Difference in callback shares (percentage points): 19.89795918367348


In [33]:
# Test our hypotheses and check whether this difference is significant, using
# the t-test, margin of error, confidence intervals, and p-value.

# Isolate the 0/1 call indicator for each group
w = data[data.race=='w'].call
b = data[data.race=='b'].call

# Two-sample t-test on the group means of the call indicator
t, p = stats.ttest_ind(w, b)
print(t, p)

print("\nBased on the names alone, the difference in callbacks is statistically significant.")

4.114705290861751 3.940802103128886e-05

Based on the names alone, the difference in callbacks is statistically significant.
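The frequentist quantities asked for in question 3 can also be computed directly for the difference in callback rates with a two-proportion z-test. The following is a sketch under stated assumptions rather than the notebook's own method: the callback counts come from the output above, while the even split of the 4,870 resumes into two groups is an assumption that should be replaced with `len(w)` and `len(b)` when the data is loaded.

```python
import numpy as np
from scipy import stats

# Callback counts from the notebook output above
callbacks_w, callbacks_b = 235, 157
# ASSUMPTION: even split of the 4870 resumes; use len(w) and len(b) with real data
n_w = n_b = 4870 // 2

# Callback rates per group and their difference
p_w = callbacks_w / n_w
p_b = callbacks_b / n_b
diff = p_w - p_b

# Test statistic: pooled standard error, because H0 says the two rates are equal
p_pool = (callbacks_w + callbacks_b) / (n_w + n_b)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pooled
p_value = 2 * stats.norm.sf(abs(z))  # two-sided

# 95% confidence interval: unpooled standard error
se_unpooled = np.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)
moe = z_crit * se_unpooled
ci = (diff - moe, diff + moe)

print("difference in callback rates:", diff)
print("z statistic:", z, "p-value:", p_value)
print("margin of error:", moe)
print("95% CI for the difference:", ci)
```

If the even-split assumption holds, the z statistic should land close to the t statistic reported above, since the two tests nearly coincide for samples this large.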


In [36]:
# 95% CI for the white-sounding share of callbacks
z_crit = 1.96
# white-sounding names' share of all callbacks (a share of callbacks, not a callback rate)
mean = wp/100
print("mean: ", mean)
# n is the total number of resumes in the dataset
n = len(data)
moe = np.sqrt((mean*(1-mean)/n))*z_crit
print("margin of error: ", moe)
ci_range = (mean - moe, mean + moe)
print("CI for white-sounding names: ", ci_range)

mean:  0.5994897959183674
margin of error:  0.013762244874422664
CI for white-sounding names:  (0.5857275510439447, 0.61325204079279)


In [37]:
# 95% CI for the black-sounding share of callbacks
z_crit = 1.96
# black-sounding names' share of all callbacks (a share of callbacks, not a callback rate)
mean = bp/100
print("mean: ", mean)
# n is the total number of resumes in the dataset
n = len(data)
moe = np.sqrt((mean*(1-mean)/n))*z_crit
print("margin of error: ", moe)
ci_range = (mean - moe, mean + moe)
print("CI for black-sounding names: ", ci_range)

mean:  0.4005102040816326
margin of error:  0.013762244874422664
CI for black-sounding names:  (0.38674795920720995, 0.4142724489560553)


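* White-sounding names account for an estimated 58.6% to 61.3% of all callbacks.
* Black-sounding names account for an estimated 38.7% to 41.4% of all callbacks.
* These intervals describe each group's share of the 392 callbacks, not the callback rate per resume sent.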
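Question 3 also asks for a bootstrap approach. One possible sketch is below: it rebuilds the 0/1 call arrays from the callback counts above, again assuming (without confirmation from the notebook) that the 4,870 resumes are split evenly between the groups. With the data loaded, `data[data.race=='w'].call.values` and its counterpart for `'b'` would replace the reconstructed arrays. The seed and the number of replicates are arbitrary choices. The confidence interval comes from bootstrap resampling within each group, and the p-value from a permutation test that shuffles the group labels under the null hypothesis.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

# Callback counts from the notebook output above
callbacks_w, callbacks_b = 235, 157
# ASSUMPTION: even split of the 4870 resumes between the two groups
n_w = n_b = 4870 // 2

# Rebuild the 0/1 call indicators for each group
w = np.concatenate([np.ones(callbacks_w), np.zeros(n_w - callbacks_w)])
b = np.concatenate([np.ones(callbacks_b), np.zeros(n_b - callbacks_b)])
observed_diff = w.mean() - b.mean()

n_reps = 10_000
pooled = np.concatenate([w, b])
boot_diffs = np.empty(n_reps)
perm_diffs = np.empty(n_reps)

for i in range(n_reps):
    # Bootstrap: resample each group with replacement
    boot_w = rng.choice(w, size=n_w, replace=True)
    boot_b = rng.choice(b, size=n_b, replace=True)
    boot_diffs[i] = boot_w.mean() - boot_b.mean()

    # Permutation: shuffle the pooled outcomes, then split into two groups
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()

# 95% percentile bootstrap confidence interval for the difference in rates
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
# Two-sided permutation p-value
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print("observed difference in callback rates:", observed_diff)
print("95% bootstrap CI:", (ci_low, ci_high))
print("permutation p-value:", p_value)
```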

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

### 4. Write a story describing the statistical significance in the context of the original problem.
Answer:
The t-test indicates that the difference in callbacks between white-sounding and black-sounding names is statistically significant (t ≈ 4.11, p ≈ 3.9e-05), so for the question originally posed we reject the null hypothesis. Because the names were assigned to the resumes at random, this difference is unlikely to come from differences in the resumes themselves. However, this comparison uses only two of the 65 columns in the dataset. Many other factors could also affect callbacks, so a race-based difference that looks significant when only `race` and `call` are analyzed should also be examined alongside those other variables.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

Answer: No. The analysis did not test whether race/name is the most important factor in callback success. It isolated one variable and measured its effect on the outcome alone. Such an analysis is incomplete, because the dataset contains many other factors to consider, such as education, number of jobs held, and total years of experience. To amend the analysis, we would include as many of these factors as we can, so that their effects on callbacks can be compared with the effect of race/name.
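A first step in that direction could be to compute the same callback-rate gap for other binary columns and set it beside the gap for `race`. The sketch below is illustrative only: `callback_gap` is a hypothetical helper, and `toy` is a small made-up frame that merely mirrors the column names `race`, `honors`, and `call` so the example runs on its own. Its values are not taken from the dataset. A fuller analysis would fit a model such as a logistic regression on all factors at once.

```python
import pandas as pd


def callback_gap(df, factor, outcome="call"):
    """Difference in mean callback rate between the two levels of a binary factor.

    Levels are sorted by groupby, so the result is rate(second level) minus
    rate(first level), e.g. 'w' minus 'b' for race, or 1 minus 0 for honors.
    """
    rates = df.groupby(factor)[outcome].mean()
    if len(rates) != 2:
        raise ValueError(f"{factor!r} is not binary in this data")
    return rates.iloc[1] - rates.iloc[0]


# TOY DATA for illustration only; replace `toy` with the real `data` frame
toy = pd.DataFrame({
    "race":   ["w", "w", "w", "w", "b", "b", "b", "b"],
    "honors": [1, 0, 1, 0, 1, 0, 1, 0],
    "call":   [1, 0, 1, 0, 1, 0, 0, 0],
})

gaps = {col: callback_gap(toy, col) for col in ["race", "honors"]}
print(gaps)
```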