# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [20]:
import pandas as pd
import numpy as np
from scipy import stats

In [21]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [22]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [23]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [24]:
w = data[data.race=='w']
b = data[data.race=='b']

In [25]:
w.call.value_counts()

0.0    2200
1.0     235
Name: call, dtype: int64

In [26]:
b.call.value_counts()

0.0    2278
1.0     157
Name: call, dtype: int64

Q1: What test is appropriate for this problem? Does CLT apply?

Since we are dealing with proportions, we want to see whether the proportion of resumes marked 'w' that received a call differs significantly from the proportion of resumes marked 'b' that received a call.

To determine whether the difference in proportions is significant, we will use a two-sample z-test for proportions with a significance level of 0.05. This tells us whether the observed difference in callback rates is larger than chance alone would plausibly produce.

The CLT applies because all of its conditions are met:
1. The data are randomized: race was assigned to the resumes at random, as stated above
2. Each resume is independent of the others
3. The 10% condition is met, since it is safe to assume that the sample is a small fraction of all resumes submitted to employers
4. The sample size is well above 30: each group contains 2,435 resumes
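
For proportions, the sample-size condition is often stated as a success/failure check rather than a flat count. The sketch below applies it to the counts shown by `value_counts()` above. The threshold of 10 per outcome is a common rule of thumb, and the helper name `meets_success_failure` is hypothetical; both are assumptions rather than part of the original exercise.

```python
# Callback counts copied from the value_counts() output above.
counts = {
    "w": {"callbacks": 235, "no_callbacks": 2200},
    "b": {"callbacks": 157, "no_callbacks": 2278},
}


def meets_success_failure(callbacks, no_callbacks, threshold=10):
    """Return True when both outcomes occur at least `threshold` times."""
    return callbacks >= threshold and no_callbacks >= threshold


# Check the condition separately for each group.
condition_met = {
    race: meets_success_failure(c["callbacks"], c["no_callbacks"])
    for race, c in counts.items()
}
print(condition_met)
```

Both groups have far more than 10 callbacks and 10 non-callbacks, so the normal approximation to the sampling distribution of each proportion is reasonable here.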


Q2: What are the null and alternate hypotheses?

The null hypothesis is that the proportion of 'w'-marked resumes that received a callback is equal to the proportion of 'b'-marked resumes that received a callback.

The alternate hypothesis is that the two proportions are not equal.
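
In symbols, the hypotheses and the pooled two-proportion z statistic commonly used to test them can be written as follows. The notation is introduced here for illustration: $p_w$ and $p_b$ are the callback proportions, $y$ the number of callbacks and $n$ the number of resumes in each group.

```latex
\begin{aligned}
H_0 &: p_w = p_b \\
H_A &: p_w \neq p_b \\
z &= \frac{\hat{p}_w - \hat{p}_b}
          {\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}},
\qquad \hat{p} = \frac{y_w + y_b}{n_w + n_b}
\end{aligned}
```

The pooled estimate $\hat{p}$ appears in the denominator because the standard error is computed under the assumption that the null hypothesis is true.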

In [None]:
# Your solution to Q3 here

Q3: Compute the margin of error, confidence interval, and p-value. Try using both the bootstrapping and frequentist statistical approaches.

In [None]:
# callback counts per group, keyed by outcome (1 = callback, 0 = no callback)
w_counts=dict(w.call.value_counts())
b_counts=dict(b.call.value_counts())

# y = number of callbacks, n = number of resumes, p = callback proportion
y1 = w_counts[1]
n1 = w_counts[0] + w_counts[1]
p1 = y1/n1

y2 = b_counts[1]
n2 = b_counts[0] + b_counts[1]
p2 = y2/n2

def calc_z(y1,n1,p1,y2,n2,p2):
    # pooled proportion, used because the null hypothesis assumes p1 == p2
    p = (y1 + y2)/(n1 + n2)
    z = (p1 - p2)/(p*(1-p)*((1/n1) + (1/n2)))**0.5
    # two-sided p-value from the standard normal distribution
    p_value=stats.norm.sf(abs(z))*2
    return z,p_value

z_score, p_val = calc_z(y1,n1,p1,y2,n2,p2)
alpha = 0.05
print('Null hypothesis: Race does not have a significant impact on the rate of callbacks.\np-value =', p_val)
if p_val < alpha:
    print("The null hypothesis is rejected")
else:
    print("The null hypothesis cannot be rejected")
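
The frequentist margin of error and confidence interval asked for in Q3 can be sketched as below. This is a minimal sketch, not the notebook's own solution. It assumes the counts shown by `value_counts()` above, uses the unpooled standard error (a common choice for intervals, since an interval does not assume the two proportions are equal), and takes the critical value from the standard library's `statistics.NormalDist`. The variable names are chosen for illustration.

```python
from statistics import NormalDist

# Counts copied from the value_counts() output earlier in the notebook.
y_w, n_w = 235, 2435   # callbacks and resumes, white-sounding names
y_b, n_b = 157, 2435   # callbacks and resumes, black-sounding names

p_w = y_w / n_w
p_b = y_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference in proportions.
std_err = (p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b) ** 0.5

# Two-sided critical value for 95% confidence (approximately 1.96).
z_crit = NormalDist().inv_cdf(0.975)

margin_of_error = z_crit * std_err
conf_int = (diff - margin_of_error, diff + margin_of_error)

print("difference in proportions:", diff)
print("margin of error:", margin_of_error)
print("95% confidence interval:", conf_int)
```

If the interval excludes zero, that agrees with rejecting the null hypothesis at the 0.05 level.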

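Resampling analysis assuming the null hypothesis is true. This is a permutation test: the race labels are shuffled across the pooled resumes and the difference in proportions is recomputed each time.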

In [None]:
#create permutation samples 
def permutation_sample(data1, data2):
    data = np.concatenate((data1,data2))
    permuted_data = np.random.permutation(data)
    
    #split the permuted array 
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    
    return perm_sample_1, perm_sample_2

def draw_perm_reps(data_1, data_2, func, size=1):
    
    #create perm_replicates
    perm_replicates = np.empty(size)
    
    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(data_1,data_2)
        
        perm_replicates[i] = func(perm_sample_1,perm_sample_2)
    
    return perm_replicates

def diff_in_props(rep_1, rep_2):
    count_1 = np.count_nonzero(rep_1 == 1)
    count_2 = np.count_nonzero(rep_2 == 1)
    prop1 = count_1/len(rep_1)
    prop2 = count_2/len(rep_2)
    return prop1-prop2
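
To see what one shuffle-and-split step does, here is a small standalone sketch on toy data. The arrays are made up for illustration, and the seeded `np.random.default_rng` generator is an assumption added for reproducibility; the cells above use `np.random.permutation` directly.

```python
import numpy as np

# Seeded generator so the shuffle is reproducible (not used in the cells above).
rng = np.random.default_rng(0)

# Toy 0/1 callback arrays, made up for illustration only.
group_1 = np.array([1, 0, 0, 1, 0, 0])
group_2 = np.array([0, 0, 1, 0])

# Pool both groups, shuffle, then split back into the original group sizes.
pooled = np.concatenate((group_1, group_2))
shuffled = rng.permutation(pooled)
perm_1 = shuffled[:len(group_1)]
perm_2 = shuffled[len(group_1):]

# Group sizes and the total number of callbacks are unchanged by the shuffle;
# only the assignment of callbacks to groups varies.
total_callbacks = int(perm_1.sum() + perm_2.sum())
print(len(perm_1), len(perm_2), total_callbacks)
```

Because only the labels move, every permutation replicate is a draw from a world in which group membership has no effect on callbacks, which is what the null hypothesis states.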


In [None]:
w_arr=w.call.values
b_arr=b.call.values
original_diff=diff_in_props(w_arr,b_arr)

reps=draw_perm_reps(w_arr,b_arr,diff_in_props, 100000)

# two-sided p-value, matching the alternate hypothesis that the proportions differ
p_val=np.sum(np.abs(reps)>=abs(original_diff)) / len(reps)

print('Null hypothesis: Race does not have an impact on callback rates of resumes.\np-value =', p_val)

if p_val < alpha:
    print("The null hypothesis is rejected")
else:
    print("The null hypothesis cannot be rejected")

In [28]:
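# 2.5th and 97.5th percentiles of the permutation replicates, i.e. of the null distribution
CI_95=np.percentile(reps,[2.5,97.5])
print('Under the null hypothesis, 95% of the differences between the proportions fall in :', CI_95)

Under the null hypothesis, 95% of the differences between the proportions fall in : [-0.01560575  0.01560575]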
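
A bootstrap confidence interval for the observed difference resamples each group with replacement, rather than shuffling labels across groups. The sketch below is a minimal illustration under stated assumptions: the 0/1 arrays are rebuilt from the counts shown by `value_counts()` earlier, and the seed, the number of replicates, and the variable names are all choices made here.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Rebuild 0/1 callback arrays from the counts shown by value_counts() above.
w_calls = np.concatenate((np.ones(235), np.zeros(2200)))
b_calls = np.concatenate((np.ones(157), np.zeros(2278)))

n_reps = 10_000
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    # Resample each group independently, with replacement, at its original size.
    w_sample = rng.choice(w_calls, size=len(w_calls), replace=True)
    b_sample = rng.choice(b_calls, size=len(b_calls), replace=True)
    boot_diffs[i] = w_sample.mean() - b_sample.mean()

# Percentile interval and the corresponding half-width (margin of error).
boot_ci = np.percentile(boot_diffs, [2.5, 97.5])
boot_margin = (boot_ci[1] - boot_ci[0]) / 2

print("bootstrap 95% confidence interval:", boot_ci)
print("bootstrap margin of error:", boot_margin)
```

Unlike the permutation replicates, these bootstrap replicates are centered on the observed difference, so the interval describes plausible values of the true difference rather than the spread expected under the null hypothesis.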


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

Q4. Write a story describing the statistical significance in the context of the original problem






Q5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?


No. Although race is considered in the analysis, there are other factors that we did not consider, such as years of experience, education and references, which can also affect the callback process. We cannot say definitively that race is the most important factor. To amend the analysis, we could compare the effect of race against these other factors, for example by modeling callbacks as a function of several variables at once.




