# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [9]:
import warnings
warnings.filterwarnings('ignore')

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
import seaborn as sns

In [11]:
%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\Section8.3Mini-Projects\2ExamineRacialDiscrimination\EDA_racial_discrimination

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\Section8.3Mini-Projects\2ExamineRacialDiscrimination\EDA_racial_discrimination


In [12]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [13]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [14]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p> Question 1: What test is appropriate for this problem? Does CLT apply? </p>
</div>

In [15]:
#1: 

#Problem statement: is the callback rate for white-sounding names different from the callback rate for black-sounding names?

#A z-test on proportions should be able to answer the problem statement.

#Each resume is a Bernoulli trial because, similar to a coin toss, there are only two outcomes:
#success - a callback occurs, and failure - a callback does not occur. The number of callbacks is therefore binomial,
#and for a large sample the CLT makes the sample proportion approximately normal, so the CLT does apply.
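
A common rule of thumb for "large enough" is that the expected numbers of successes and failures under the null hypothesis are both at least 10. The sketch below checks that rule; the helper name `normal_approx_ok` and the threshold of 10 are assumptions (some texts use 5), and the total of 392 callbacks is implied by the share printed in the Question 3 output (235/392 ≈ 0.5995).

```python
def normal_approx_ok(n, p0, minimum=10):
    """Rule-of-thumb check: expected successes and failures are both >= minimum."""
    return n * p0 >= minimum and n * (1 - p0) >= minimum


# 392 callbacks in total, tested against a null share of 0.5
large_sample = normal_approx_ok(392, 0.5)
# A much smaller hypothetical sample, shown only for contrast
small_sample = normal_approx_ok(15, 0.5)
print(large_sample, small_sample)
```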

<div class="span5 alert alert-success">
<p> Question 2: What are the null and alternate hypotheses? </p>
</div>

In [None]:
#Hypotheses:

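#uw = share of all callbacks that go to white-sounding names
#(0.5 is the right benchmark only if equal numbers of resumes carried each type of name)

#H0: uw = 0.5    #white-sounding and black-sounding names are equally likely to receive a callback
#H1: uw > 0.5    #white-sounding names receive more than half of the callbacks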
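
One way to state the hypotheses so that the null carries the equality, writing $p_w$ for the share of callbacks that go to white-sounding names (uw in the cell above) and assuming equal numbers of resumes were sent with each type of name:

$$
H_0: p_w = 0.5 \qquad H_1: p_w > 0.5
$$

$$
z = \frac{\hat{p}_w - 0.5}{\sqrt{0.5\,(1 - 0.5)/n}}
$$

Here $\hat{p}_w$ is the observed share and $n$ is the number of callbacks. A large positive $z$ counts as evidence against $H_0$.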

<div class="span5 alert alert-success">
<p> Question 3: 
Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches. </p>
</div>

In [62]:
#3 The standard error of a sample proportion is sqrt(p(1-p)/samplesize);
#  the margin of error at 95% confidence is 1.96 times the standard error.

#Since this approach only compares white versus black callbacks, filter out all the no-callback data.

dfc = data[['race','call']]
dfc = dfc[dfc.call == 1]

# dfc already holds only callbacks, so splitting by race is enough
dfwc = dfc[dfc.race == 'w']
dfbc = dfc[dfc.race == 'b']

wc_count = dfwc.call.count()
bc_count = dfbc.call.count()
tc_count = dfc.call.count()

# share of all callbacks that went to white-sounding names (a sample proportion, not a p-value)
p_hat = wc_count/tc_count

se = np.sqrt(p_hat * (1-p_hat)/tc_count)

print('white share of callbacks: ' + str(p_hat))
print('standard error: ' + str(se))

#standard error is 2.5%, so the 95% margin of error is 1.96 * 2.5% = about 4.9%

white share of callbacks: 0.5994897959183674
standard error: 0.024748829105890144
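
The arithmetic behind the margin of error and the confidence interval can be reproduced without the data file. This is a minimal sketch using only the standard library; the function name `proportion_ci` is hypothetical, 235 is the white-sounding callback count printed earlier, and 392 is the total implied by the printed share (235/392 ≈ 0.5995).

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return sample proportion, standard error, margin of error and normal-approximation CI."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    moe = z * se  # z = 1.96 corresponds to 95% confidence
    return p_hat, se, moe, (p_hat - moe, p_hat + moe)


p_hat, se, moe, (ci_low, ci_high) = proportion_ci(235, 392)
print(f"share = {p_hat:.4f}, standard error = {se:.4f}, margin of error = {moe:.4f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

The interval is centered on the observed share, so whether it excludes 0.5 is a quick visual check of the null hypothesis.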


In [None]:
#3 Confidence Interval - using a 95% confidence level
#In the z-table the critical value that leaves 0.025 ((1-0.95)/2) in each tail is 1.96

#The sample proportion is 59.9% and the margin of error is 1.96 * 0.0247 = 0.049, so we can be 95% confident that
#when a callback occurs it is for a white-sounding name between 55% and 65% of the time.
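
Question 3 also asks for a frequentist p-value. One way to get it is a one-sided, one-sample z-test of the white share of callbacks against 0.5. The sketch below uses only the standard library; `one_sided_z_test` is a hypothetical helper, and the counts (235 of 392 callbacks) come from the printed outputs in this notebook.

```python
import math


def one_sided_z_test(successes, n, p0=0.5):
    """z statistic and upper-tail p-value for H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error computed under the null
    z = (p_hat - p0) / se0
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p_value


z, p_value = one_sided_z_test(235, 392)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

The standard error here uses the null value 0.5 rather than the observed share, which is the usual choice for a test as opposed to a confidence interval.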

In [64]:
#3 Function 1 of 2 needed for bootstrapping
def bootstrap_replicate_1d(data,func):
    bs_sample = np.random.choice(data,size=len(data))
    return func(bs_sample)

In [66]:
#3 Function 2 of 2 needed for bootstrapping
def draw_bs_reps(data, func, size=10000):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)

    return bs_replicates

In [67]:
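#3 - calculate p-value via bootstrapping
#The above "dfc" dataframe is a concatenation of all the white and black callbacks that occurred in the sample data

# dfc.call is 1 in every row, so resample an indicator of race instead:
# 1 = white-sounding name, 0 = black-sounding name
white = (dfc.race == 'w').astype(float)

# Shift the indicator so that its mean is 0.5, the value under the null hypothesis
white_shifted = white - np.mean(white) + 0.5

# Take bootstrap replicates of the shifted sample
bs_replicates = draw_bs_reps(white_shifted, np.mean, 10000)

# Compute fraction of replicates that are at least as large as the observed white share
p = np.sum(bs_replicates >= np.mean(white)) / 10000

# Print the p-value
# (the observed share is about 4 standard errors above 0.5, so p should come out close to 0)
print('p = ', p)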
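
Bootstrapping can also produce the confidence interval asked for in Question 3. The following is a minimal percentile-bootstrap sketch that uses only the standard library, so it runs without the data file; `bootstrap_share_ci`, the 2,000 replicates, and the fixed seed are illustrative assumptions, and the counts (235 of 392 callbacks) come from the printed outputs in this notebook.

```python
import random


def bootstrap_share_ci(successes, n, reps=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a proportion."""
    rng = random.Random(seed)
    sample = [1] * successes + [0] * (n - successes)  # 1 = white-sounding name
    # Resample with replacement and record the share of 1s in each replicate
    shares = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(reps))
    tail = (1 - level) / 2
    lower = shares[int(round(tail * reps))]
    upper = shares[int(round((1 - tail) * reps)) - 1]
    return lower, upper


lower, upper = bootstrap_share_ci(235, 392)
print(f"95% bootstrap CI for the white share of callbacks: {lower:.3f} to {upper:.3f}")
```

With a sample this large, the percentile interval should land close to the normal-approximation interval.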


<div class="span5 alert alert-success">
<p> Question 4: Write a story describing the statistical significance in the context of the original problem. </p>
</div>

In [None]:
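#Based on the sample data and using race as the only variable, we reject the null hypothesis (uw = 0.5) in favor of the
#alternative (uw > 0.5): resumes with white-sounding names are more likely to receive a callback than otherwise
#identical resumes with black-sounding names.
#
#This conclusion is based on the...
#  frequentist approach: white-sounding names received 59.9% of the callbacks, with a 95% margin of error of about
#  4.9 percentage points, so the confidence interval (55% to 65%) excludes 50%
#  bootstrapping approach: the observed share is about 4 standard errors above 0.5, so replicates resampled under the
#  null hypothesis should almost never reach it, which means a p-value close to 0
#
#Note that a p-value is not the probability that a hypothesis is correct; it is the probability of seeing data at least
#this extreme if the null hypothesis were true.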

<div class="span5 alert alert-success">
<p> Question 5: Does your analysis mean that race/name is the most important factor in callback success? <br>
    Why or why not? If not, how would you amend your analysis? </p>
</div>

In [None]:
#This analysis does not mean that race/name is the most important factor in callback success. It tests only one variable,
#and there are a number of other variables in the study (e.g. education, years of experience) that could be influencing
#the callback decision and whose effects were not measured here.
#
#At this time, I would not amend my analysis because the additional analysis to factor in the influence of other variables
#is out of the scope of this exercise.
