# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.describe()
data.info()
data.shape
data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

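| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |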


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

## Q1. What test is appropriate for this problem? Does CLT apply?

In [5]:
# Is there a better way to separate the dataframe into two groups?
df = data[['race','call']]
print(df.race.value_counts())
black = df[df.race == "b"]
white = df[df.race == 'w']

# To treat the two sample callback rates as approximately normal, check the CLT conditions.
# There are 3 conditions to meet:
# Randomization:
# From the description above: researchers examined the level of racial discrimination in the United States
#                             labor market by randomly assigning identical résumés to
#                             black-sounding or white-sounding names and observing the impact
#                             on requests for interviews from employers.
# 10% rule:
# When sampling without replacement, the sample must not be larger than 10% of the entire population.
# According to the US Bureau of Labor Statistics, there were 143,357,000 people in the workforce in June 2011.

sample_pop_ratio = len(df) / 143357000
if sample_pop_ratio < 0.1:
    print("sample population ratio meets the CLT condition")
else:
    print("sample population ratio does not meet the CLT condition")

# Large enough sample size: a. Sample size n should be large enough so that np ≥ 10 and nq ≥ 10
#                        or b. Sample size ≥ 30

p_for_black = np.sum(black.call) / len(black.call)
q_for_black = 1 - p_for_black
p_for_white = np.sum(white.call) / len(white.call)
q_for_white = 1 - p_for_white

print("np and nq for black: " + str(len(black) * p_for_black), str(len(black) * q_for_black))
print("np and nq for white: " + str(len(white) * p_for_white), str(len(white) * q_for_white))

# np and nq are all ≥ 10, and both sample sizes are ≥ 30

# Since the CLT conditions hold for both samples, the sample callback rates are approximately normal.
# Because call is binary, a two-proportion z-test is the natural choice; with samples this large,
# an independent two-sample t-test gives nearly the same result, so we apply the t-test below.
print(np.std(black.call))
print(np.std(white.call))
# The two standard deviations are close (about 0.25 vs 0.30), so assuming equal variances is reasonable



w    2435
b    2435
Name: race, dtype: int64
sample population ratio meets the CLT condition
np and nq for black: 157.0 2278.0
np and nq for white: 235.0 2200.0
0.24559901654720306
0.29528486728668213
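
Because `call` is a 0/1 outcome, the comparison can also be framed as a two-proportion z-test. The following is a minimal sketch of that alternative, not the method used in this notebook. It rebuilds the test from the counts printed above (157 of 2435 and 235 of 2435 callbacks), and the variable names are chosen here for illustration.

```python
import math
from scipy import stats

# Counts taken from the printed output above (assumed to be exact integers).
n_black, calls_black = 2435, 157
n_white, calls_white = 2435, 235

p_black = calls_black / n_black
p_white = calls_white / n_white

# Success/failure check: every np and nq count should be at least 10.
counts = [calls_black, n_black - calls_black, calls_white, n_white - calls_white]
clt_ok = min(counts) >= 10

# Under H0 both groups share one callback rate, so the standard error uses the pooled rate.
p_pooled = (calls_black + calls_white) / (n_black + n_white)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_black + 1 / n_white))

z_stat = (p_white - p_black) / se_pooled
p_one_sided = stats.norm.sf(z_stat)  # P(Z >= z_stat), matching H1: white rate > black rate

print("CLT conditions met:", clt_ok)
print("z statistic: %.3f" % z_stat)
print("one-sided p-value: %.2e" % p_one_sided)
```

The z statistic should land close to the magnitude of the t statistic reported by `stats.ttest_ind`, since the two tests differ only in how the standard error is estimated.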


## Q2. What are the null and alternate hypotheses?

In [6]:
# Assume significance level α = 0.05
# H0: callback rate for black-sounding names == callback rate for white-sounding names
# H1: callback rate for black-sounding names < callback rate for white-sounding names
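
Written in terms of callback proportions, the hypotheses can be stated compactly. The symbols $p_b$ and $p_w$ (the true callback rates for black-sounding and white-sounding names) are notation introduced here for illustration and do not appear in the original exercise.

$$H_0: p_b = p_w \qquad \text{versus} \qquad H_1: p_b < p_w$$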

## Q3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [9]:
# Still did not get the idea of bootstrapping and replicates

# z_value = stats.norm.ppf(q=0.975)
# margin_error_blk = z_value * math.sqrt((p_for_black * q_for_black) / len(black))
# print(margin_error_blk)
# CI95_blk = (np.mean(black.call) - margin_error_blk, np.mean(white.call) + margin_error_blk)
# print("95% Confidence Interval: " + str(np.mean(black.call) - margin_error_blk), str(np.mean(white.call) + margin_error_blk))
# np.mean(black.call)

# Margin of error for a two-tailed 95% CI:
import math

# margin of error for mean call back rate of white - mean call back rate of black:
# since we are dealing with sample sizes >> 30, the t critical value approaches the z critical value,
# so we use z_value here to construct our 95% CI
print("mean black call back rate: " + str(np.mean(black.call)))
print("mean white call back rate: " + str(np.mean(white.call)))
z_value = stats.norm.ppf(q=0.975)
n_black, n_white = len(black.call), len(white.call)
# pooled variance: weight the sample variances (not the standard deviations) by their degrees of freedom
pooled_var = ((n_black - 1) * black.call.var() + (n_white - 1) * white.call.var()) / (n_black + n_white - 2)
margin_error = z_value * math.sqrt(pooled_var) * math.sqrt(1 / n_black + 1 / n_white)
print("%.4f" % margin_error)
mean_diff = np.mean(white.call) - np.mean(black.call)
print(mean_diff)
CI_95 = (mean_diff - margin_error, mean_diff + margin_error)
print("The confidence interval for mean call back rate of white - mean call back rate of black is: " + "%.4f" % CI_95[0] + "," + "%.4f" % CI_95[1])


# H0: callback rate for black-sounding names == callback rate for white-sounding names
# H1: callback rate for black-sounding names < callback rate for white-sounding names

result = stats.ttest_ind(black.call, white.call, axis=0, equal_var=True, nan_policy='propagate')
print(result)
print("t-test_p-value: " + "%.16f" % result.pvalue)

# The p-value is the probability of a result at least as extreme as the sample result
# if the null hypothesis were true. A p-value of 0.0000394080210313 means that, if the
# null hypothesis were true, a result this extreme would occur only 0.0000394080210313 of the time.
# ttest_ind reports a two-sided p-value. Because the statistic is negative, in the direction of H1,
# the one-sided p-value is half of it.


# p-value < 0.01: there is overwhelming evidence to infer that the alternative hypothesis is true. The test is highly significant.
# 0.01 < p-value < 0.05: there is strong evidence to infer that the alternative hypothesis is true. The test is significant.
# 0.05 < p-value < 0.10: there is weak evidence to infer that the alternative hypothesis is true. The test is not statistically significant.
# 0.10 < p-value: there is no evidence to infer that the alternative hypothesis is true. The test is not statistically significant.

# Since the p-value (0.00004) is below 0.05, we reject the null hypothesis with overwhelming evidence
# in favor of the alternative hypothesis that the callback rate for black-sounding names is lower
# than the callback rate for white-sounding names


mean black call back rate: 0.0644763857126236
mean white call back rate: 0.09650924056768417
0.0153
0.03203285485506058
The confidence interval for mean call back rate of white - mean call back rate of black is: 0.0168,0.0473
Ttest_indResult(statistic=-4.114705290861751, pvalue=3.940802103128886e-05)
t-test_p-value: 0.0000394080210313
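
Q3 also asks for a bootstrapping approach. The sketch below is one possible way to do it, under stated assumptions: it rebuilds the 0/1 callback arrays from the counts printed in Q1 (157 of 2435 and 235 of 2435) instead of reading the data file, and the seed and the number of replicates are arbitrary choices. A bootstrap replicate is the statistic recomputed on a sample drawn with replacement from the observed data. The permutation step simulates the null hypothesis by shuffling the group labels.

```python
import numpy as np

# Rebuild the 0/1 callback arrays from the counts printed in Q1 (assumed exact).
n_black, calls_black = 2435, 157
n_white, calls_white = 2435, 235
black_calls = np.r_[np.ones(calls_black), np.zeros(n_black - calls_black)]
white_calls = np.r_[np.ones(calls_white), np.zeros(n_white - calls_white)]
observed_diff = white_calls.mean() - black_calls.mean()

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
n_reps = 10_000                  # arbitrary number of replicates

# Bootstrap confidence interval: resample each group with replacement
# and recompute the difference in callback rates each time.
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    b = rng.choice(black_calls, size=n_black, replace=True)
    w = rng.choice(white_calls, size=n_white, replace=True)
    boot_diffs[i] = w.mean() - b.mean()
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
margin_of_error = (ci_high - ci_low) / 2

# Permutation test: under H0 the labels are interchangeable, so shuffle the
# pooled outcomes and split them back into two groups of the original sizes.
pooled = np.concatenate([black_calls, white_calls])
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[n_black:].mean() - shuffled[:n_black].mean()
p_value = np.mean(perm_diffs >= observed_diff)  # one-sided, matching H1

print("observed difference: %.4f" % observed_diff)
print("95%% bootstrap CI: (%.4f, %.4f)" % (ci_low, ci_high))
print("bootstrap margin of error: %.4f" % margin_of_error)
print("permutation p-value: %.4f" % p_value)
```

With only 10,000 replicates, a printed permutation p-value of 0 means "smaller than about 1 in 10,000" and not exactly zero.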


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

## Q4. Write a story describing the statistical significance in the context of the original problem.

In [10]:
# The purpose of this research is to find out whether discrimination between blacks and whites
# exists in the US job market. Data were collected by randomly assigning black-sounding
# or white-sounding names to identical résumés and observing the impact on requests for interviews from employers.

# The statistical test shows that the difference in callback rates between white-sounding
# and black-sounding names is significant: résumés with white-sounding names receive about
# 50% more calls for interviews than résumés with black-sounding names.

# The result indicates that, all other things being equal, white-sounding names have a
# callback rate about 50% higher than black-sounding names, so discrimination between
# blacks and whites did exist in the US job market between 2000 and 2002
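
The "about 50% more" figure can be checked directly from the callback counts. This is a small sketch using the counts printed in Q1 (157 of 2435 and 235 of 2435), with variable names chosen here for illustration.

```python
# Callback counts and group sizes taken from the Q1 output (assumed exact).
calls_black, n_black = 157, 2435
calls_white, n_white = 235, 2435

rate_black = calls_black / n_black
rate_white = calls_white / n_white

# Relative increase of the white-sounding callback rate over the black-sounding one.
relative_increase = rate_white / rate_black - 1
print(f"relative increase in callback rate: {relative_increase:.1%}")  # prints 49.7%
```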

## Q5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

In [11]:
# Our analysis only tells us that callback rates differ between black-sounding and
# white-sounding names. We cannot conclude that race is the most important factor, since it is the only
# factor we checked.

# The dataset has 63 other columns whose relationship with the callback rate we can check,
# for example sex and yearsexp, which are examined below

## Xtra 1: sex and callback rate

In [22]:
df_sex = data[['call','sex']]
df_sex.head()
male = df_sex[df_sex.sex=='m']
female = df_sex[df_sex.sex=='f']
print("Male call back rate: " + str(np.mean(male.call)))
print("Female call back rate: " + str(np.mean(female.call)))
print(stats.ttest_ind(male.call, female.call, axis=0, equal_var=True, nan_policy='propagate'))

# At the 5% significance level, we cannot reject the null hypothesis that the male call back rate
# is equal to the female call back rate

Male call back rate: 0.07384341955184937
Female call back rate: 0.08248798549175262
Ttest_indResult(statistic=-0.9341989341332145, pvalue=0.3502476207298205)


## Xtra 2: experience and callback rate

In [37]:
df_exp = data[['yearsexp','call']]
df_exp.head()

ten_yrs_plus_exp = df_exp[df_exp.yearsexp > 10]
less_exp = df_exp[df_exp.yearsexp <= 10]

print("Call back rate for people that have more than 10 years experience: " + str(np.mean(ten_yrs_plus_exp.call)))
print("Call back rate for people that have 10 years experience or less: " + str(np.mean(less_exp.call)))

print(stats.ttest_ind(ten_yrs_plus_exp.call, less_exp.call, axis=0, equal_var=True, nan_policy='propagate'))

# From the result of the t-test, we can reject the null hypothesis and accept the alternative
# that the call back rate for people with more than 10 years of experience is higher than for those
# with 10 years of experience or less.


Call back rate for people that have more than 10 years experience: 0.11386138945817947
Call back rate for people that have 10 years experience or less: 0.07176166027784348
Ttest_indResult(statistic=4.3861180762876035, pvalue=1.1782773739952898e-05)
