# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
import seaborn as sns

In [3]:
%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\Section8.3Mini-Projects\2ExamineRacialDiscrimination\EDA_racial_discrimination

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\Section8.3Mini-Projects\2ExamineRacialDiscrimination\EDA_racial_discrimination


In [4]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [5]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [6]:
data.head()

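| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |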


<div class="span5 alert alert-success">
<p> Question 1: What test is appropriate for this problem? Does CLT apply? </p>
</div>

In [7]:
#1: 

#Problem statement: is the callback rate for white-sounding names different from the callback rate for black-sounding names?

#Each resume is a single Bernoulli trial with two outcomes (success: a callback occurs; failure: no callback occurs),
#so the callback rate in each group is a sample proportion. A two-sample z-test for the difference of proportions
#should be able to answer the problem statement.

#The CLT applies: names were assigned to resumes at random, and when each group contains at least about 10 successes
#and 10 failures, the sample proportion (and the difference of two such proportions) is approximately normal.
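The large-sample condition behind the CLT can be checked numerically. The sketch below is a minimal illustration, not part of the original analysis: `clt_conditions` is a hypothetical helper, and the counts are assumptions (235 is the white-sounding callback total printed in cell 5; the group sizes and the black-sounding count are assumed for illustration). In the notebook, the counts would be computed from `data` instead.

```python
def clt_conditions(successes, n, threshold=10):
    """Return (expected successes, expected failures, condition met) for one group."""
    p_hat = successes / n
    exp_success = n * p_hat
    exp_failure = n * (1 - p_hat)
    # Common rule of thumb: both expected counts should be at least `threshold`
    return exp_success, exp_failure, min(exp_success, exp_failure) >= threshold

# Assumed (callbacks, group size) pairs for illustration only
groups = {"w": (235, 2435), "b": (157, 2435)}
results = {race: clt_conditions(x, n) for race, (x, n) in groups.items()}

for race, (exp_s, exp_f, ok) in results.items():
    print(race, round(exp_s), round(exp_f), ok)
```

If either expected count fell below the threshold, an exact test (for example Fisher's exact test) would be a safer choice than the normal approximation.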

<div class="span5 alert alert-success">
<p> Question 2: What are the null and alternate hypotheses? </p>
</div>

In [8]:
#Hypotheses (uw and ub are the callback rates for white-sounding and black-sounding names):

#H0: uw = ub
#H1: uw != ub


<div class="span5 alert alert-success">
<p> Question 3: 
Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches. </p>
</div>

In [9]:
#3 Standard error of the difference in callback rates (the 95% margin of error is about 1.96 * se)
#code received from Tommy
w = data.loc[data.race=='w', 'call'].astype('category') # extract call column for white-sounding names and convert to categorical data
b = data.loc[data.race=='b', 'call'].astype('category') # extract call column for black-sounding names and convert to categorical data

w_counts_table = w.value_counts().reset_index()
b_counts_table = b.value_counts().reset_index()

n_w = len(w) # calculate sample size for white-sounding names
n_b = len(b) # calculate sample size for black-sounding names
p_w = w_counts_table.iloc[1, 1]/n_w # calculate sample proportion for white return calls
p_b = b_counts_table.iloc[1,1]/n_b  # calculate sample proportion for black return calls

# observed difference
empirical_diff_proportions = p_w - p_b

# overall proportion of returned calls
p_overall = np.mean(data.call)

# calculate the estimated standard error
se = np.sqrt(p_overall * (1 - p_overall) * (1/n_w + 1/n_b))

print(se)

0.007796894200732824


In [10]:
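#3 Confidence Interval - frequentist
#Note: the 0.68 passed to stats.norm.interval below gives a 68% interval (about +/- 1 standard error), not a 95% one,
#and the interval is for the overall callback rate across both groups. Pass 0.95 for a 95% interval.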

dfc = data['call']

mean, sigma = np.mean(dfc), np.std(dfc)

conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(dfc)))

print('confidence interval is: ' + str(conf_int))

confidence interval is: (0.0766160198711491, 0.08436961385973926)
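A margin of error and confidence interval for the difference in callback rates between the two groups could be sketched as follows. This is an illustrative calculation under stated assumptions, not the notebook's own method: `diff_proportion_ci` is a hypothetical helper, the counts are assumed (only 235 appears in the output above), and the standard library's `NormalDist` stands in for `scipy.stats.norm` so the block runs on its own.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Margin of error and confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard error: an interval does not assume H0 (equal rates)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * se
    return diff, margin, (diff - margin, diff + margin)

# Assumed counts: (white callbacks, white resumes, black callbacks, black resumes)
diff, margin, ci = diff_proportion_ci(235, 2435, 157, 2435)
print("difference:", diff)
print("margin of error:", margin)
print("95% confidence interval:", ci)
```

An interval that excludes zero is consistent with rejecting the null hypothesis of equal callback rates at the matching significance level.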


In [11]:
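#3 p-value - frequentist

#One-sample t-test of the overall callback rate against a reference value of 0.5.
#Note: this does not compare the two race groups, so it does not test H0: uw = ub.
mean = np.mean(dfc)
test_value = 0.5
sd = np.std(dfc)
n = dfc.count()
number_of_tails = 2 

t = (mean - test_value)/(sd / np.sqrt(n))
#use the survival function of |t| so the two-tailed p-value is valid whichever side of test_value the mean falls on
print('p-value = ' + str(stats.t.sf(abs(t), df = n-1) * number_of_tails))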


p-value = 0.0
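The hypotheses in Question 2 compare the two groups, so a test of the difference in proportions addresses them directly. Below is a minimal sketch of a two-proportion z-test using the pooled standard error (the same formula as `se` in cell 9). `two_proportion_ztest` is a hypothetical helper, the counts are assumed for illustration, and `NormalDist` is used in place of `scipy.stats.norm` to keep the block standalone.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one rate, so estimate it from the pooled data
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed counts: (white callbacks, white resumes, black callbacks, black resumes)
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print("z =", z)
print("p-value =", p_value)
```

The pooled standard error is used here because the p-value is computed under the assumption that the null hypothesis is true.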


In [12]:
#3 Function 1 of 2 needed for bootstrapping
def bootstrap_replicate_1d(data,func):
    bs_sample = np.random.choice(data,size=len(data))
    return func(bs_sample)

In [13]:
#3 Function 2 of 2 needed for bootstrapping
def draw_bs_reps(data, func, size=10000):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)

    return bs_replicates

In [14]:
#3 Confidence Interval - using a 95% confidence level - bootstrapping (overall callback rate, both groups combined)

# Draw bootstrap replicates
bs_replicates = draw_bs_reps(dfc, np.mean, size=100000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates, [2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int)


95% confidence interval = [0.07289527 0.08829569]
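To bootstrap the quantity the hypotheses are about, each group can be resampled separately and the difference in callback rates recorded for every replicate. The following is a sketch under assumptions: `make_calls` and `bootstrap_diff_ci` are hypothetical helpers, and the 0/1 arrays are rebuilt from assumed counts rather than read from `data`.

```python
import numpy as np

def make_calls(successes, n):
    """Build a 0/1 array with `successes` ones out of `n` entries."""
    calls = np.zeros(n)
    calls[:successes] = 1
    return calls

def bootstrap_diff_ci(calls_a, calls_b, n_reps=5000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(calls_a) - mean(calls_b)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_reps)
    for i in range(n_reps):
        # Resample each group with replacement, keeping its original size
        resample_a = rng.choice(calls_a, size=len(calls_a), replace=True)
        resample_b = rng.choice(calls_b, size=len(calls_b), replace=True)
        diffs[i] = resample_a.mean() - resample_b.mean()
    tail = (1 - confidence) / 2 * 100
    return np.percentile(diffs, [tail, 100 - tail])

# Assumed counts for illustration only
w_calls = make_calls(235, 2435)
b_calls = make_calls(157, 2435)
ci_low, ci_high = bootstrap_diff_ci(w_calls, b_calls)
print("95% bootstrap interval for the difference:", ci_low, ci_high)
```

In the notebook, `w_calls` and `b_calls` would instead be the `call` values for each race taken from `data`.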


In [15]:
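#3 - fraction of bootstrap replicates at or below the observed overall mean
#Note: the replicates are centered on the observed mean, so this fraction is close to 0.5 by construction.
#It is not a p-value for H0: uw = ub, which needs a test that compares the two groups.

# Compute fraction of replicates that are less than or equal to the observed mean
p = np.sum(bs_replicates <= np.mean(dfc)) / len(bs_replicates)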

# Print the p-value
print('p = ', p)

p =  0.51372
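A resampling-based p-value for the null hypothesis of equal callback rates can be obtained with a permutation test: shuffle the race labels, recompute the difference, and count how often a shuffled difference is at least as extreme as the observed one. This is a sketch, not the notebook's method; `make_calls` and `permutation_pvalue` are hypothetical helpers and the counts are assumed for illustration.

```python
import numpy as np

def make_calls(successes, n):
    """Build a 0/1 array with `successes` ones out of `n` entries."""
    calls = np.zeros(n)
    calls[:successes] = 1
    return calls

def permutation_pvalue(calls_a, calls_b, n_perms=10000, seed=0):
    """Two-sided permutation p-value for mean(calls_a) - mean(calls_b)."""
    rng = np.random.default_rng(seed)
    observed = calls_a.mean() - calls_b.mean()
    pooled = np.concatenate([calls_a, calls_b])
    n_a = len(calls_a)
    extreme = 0
    for _ in range(n_perms):
        # Shuffling the pooled outcomes simulates random label assignment under H0
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perms

# Assumed counts for illustration only
w_calls = make_calls(235, 2435)
b_calls = make_calls(157, 2435)
observed_diff, p_perm = permutation_pvalue(w_calls, b_calls)
print("observed difference:", observed_diff)
print("permutation p-value:", p_perm)
```

With a finite number of permutations the smallest nonzero p-value that can be reported is `1 / n_perms`, so a result of 0 should be read as "less than `1 / n_perms`".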


<div class="span5 alert alert-success">
<p> Question 4: Write a story describing the statistical significance in the context of the original problem. </p>
</div>

In [16]:
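#Based on the sample data and using race as the only variable, we reject the null hypothesis (uw = ub): resumes with
#white-sounding names are more likely to receive a callback than resumes with black-sounding names. Because the names
#were assigned to the resumes at random, the difference reflects the name on the resume rather than its content.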

<div class="span5 alert alert-success">
<p> Question 5: Does your analysis mean that race/name is the most important factor in callback success? <br>
    Why or why not? If not, how would you amend your analysis? </p>
</div>

In [17]:
#This analysis does not mean that race/name is the most important factor in callback success. It considers race/name only,
#and there are a number of other variables in the study (e.g. education, years of experience) that could also be
#influencing the callback decision.
#
#At this time, I would not amend my analysis because the additional analysis needed to factor in the influence of other
#variables is outside the scope of this exercise.
