# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats  # provides stats.norm for the z-test
import statsmodels.stats as smd
import pylab
%matplotlib inline

# fix the seed so the bootstrap results are reproducible
np.random.seed(42)

In [9]:
data = pd.io.stata.read_stata(r'C:\Users\ozeiri\Downloads\us_job_market_discrimination.dta')

In [10]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [11]:
data.describe()

Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,...,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,...,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,...,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,...,0.308866,0.496083,0.259649,0.275854,0.171677,0.278932,0.410141,0.442847,0.361773,0.371308
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
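# Empirical cumulative distribution function (ECDF)
def ecdf(data):
    """Return sorted values x and cumulative fractions y for a 1-D array."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y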


In [13]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [18]:
# callback rates for resumes with black-sounding and white-sounding names
b_n = len(data[data.race == 'b'])  # sample size, black-sounding names
b_r = sum(data[data.race == 'b'].call)  # callbacks, black-sounding names
rate_b = round(b_r / b_n, 4)  # callback proportion, black-sounding names

w_n = len(data[data.race == 'w'])  # sample size, white-sounding names
w_r = sum(data[data.race == 'w'].call)  # callbacks, white-sounding names
rate_w = round(w_r / w_n, 4)  # callback proportion, white-sounding names

print('The probability that a black-sounding candidate gets a call is', rate_b)
print('The probability that a white-sounding candidate gets a call is', rate_w)

The probability that a black-sounding candidate gets a call is 0.0645
The probability that a white-sounding candidate gets a call is 0.0965


### Question 1: What test is appropriate for this problem? Does CLT apply?

We want to find out whether resumes with white-sounding and black-sounding names receive callbacks at the same rate, so the null hypothesis is that the callback proportions of the two populations are equal. The most appropriate method for testing a hypothesis about two proportions is the two-sample z-test, which requires the samples to be large enough. Let us verify this condition.






In [20]:
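# check the sample-size conditions for the z-test:
# each group needs at least 5 expected callbacks and 5 expected non-callbacks
tb = b_n * rate_b >= 5
tw = w_n * rate_w >= 5
tb1 = b_n * (1 - rate_b) >= 5
tw1 = w_n * (1 - rate_w) >= 5

# a cell displays only its last expression, so combine the four checks
all([tb, tw, tb1, tw1])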



True

Since all four conditions hold, we can use the z-test for the hypothesis test.
The CLT applies as well: the sample size is large enough, as the check above shows, and the observations are independent because black-sounding and white-sounding names were randomly assigned to similar resumes. This requirement is met.

In [21]:
w = data[data.race=='w']
b = data[data.race=='b']


### Question 2: What are the null and alternate hypotheses?

Null Hypothesis (Ho): rate_w - rate_b = 0

Alternative Hypothesis (Ha): rate_w - rate_b != 0

Significance Level = .05

### Question 3: Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.


Pooled sample proportion. Since the null hypothesis states that p1 = p2, we use a pooled sample proportion (p) to compute the standard error of the sampling distribution.

p = (p1 * n1 + p2 * n2) / (n1 + n2)

where p1 is the sample proportion from population 1, p2 is the sample proportion from population 2, n1 is the size of sample 1, and n2 is the size of sample 2.

Standard error. Compute the standard error (SE) of the sampling distribution of the difference between two proportions.

SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }

where p is the pooled sample proportion, n1 is the size of sample 1, and n2 is the size of sample 2.

Test statistic. The test statistic is a z-score (z) defined by the following equation.

z = (p1 - p2) / SE
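
As a worked check of these three formulas, the sketch below plugs in the callback counts. It assumes 2,435 resumes per group and 157 callbacks for black-sounding names. These counts are not printed in the notebook, but they are consistent with its 4,870 rows, the 235 callbacks for white-sounding names, and the rates 0.0645 and 0.0965. The variable names are illustrative, and only the Python standard library is used so that the example runs without the dataset.

```python
import math
from statistics import NormalDist

# Assumed counts (see the note above): callbacks r and sample sizes n
r_w, n_w = 235, 2435  # white-sounding names
r_b, n_b = 157, 2435  # black-sounding names

p_w = r_w / n_w
p_b = r_b / n_b

# pooled proportion under the null hypothesis p_w == p_b
p_pool = (r_w + r_b) / (n_w + n_b)

# standard error of the difference between the two proportions
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

# z-score and two-sided p-value from the standard normal distribution
z = (p_w - p_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print('pooled p: {:.4f}, SE: {:.4f}, z: {:.2f}'.format(p_pool, se, z))
```

Under these assumed counts the z-score comes out at about 4.11, which agrees with the z_stat value reported by the notebook.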


In [22]:
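def Z_test_proportions_two_samples(r1, n1, r2, n2, one_sided=False):
    """Two-sample z-test for proportions; r = successes, n = sample size."""
    p1 = r1 / n1
    p2 = r2 / n2
    # pooled proportion under the null hypothesis p1 == p2
    p_pooled = (r1 + r2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # upper-tail probability of |z|, doubled unless the test is one-sided
    p_value = 1 - stats.norm.cdf(abs(z))
    p_value *= 2 - one_sided
    return z, p_value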



In [23]:
# difference in callback proportions between the two samples
prop_diff = rate_w - rate_b

# 95% confidence interval based on the unpooled standard error
z_critical = 1.96
var_w = rate_w * (1 - rate_w) / w_n
var_b = rate_b * (1 - rate_b) / b_n
se_diff = np.sqrt(var_w + var_b)
upperlimit = prop_diff + z_critical * se_diff
lowerlimit = prop_diff - z_critical * se_diff
print('Confidence interval:\t {} to {}'.format(lowerlimit, upperlimit))
z_stat, p_value = Z_test_proportions_two_samples(w_r, w_n, b_r, b_n)
print('z_stat:\t {}\np-value:\t{}'.format(z_stat, p_value))
margin_of_error = (upperlimit - lowerlimit) / 2
print('margin of error:\t+/-{}'.format(margin_of_error))


Confidence interval:	 0.016743915691800497 to 0.047256084308199504
z_stat:	 4.108412152434346
p-value:	3.983886837577444e-05
margin of error:	+/-0.015256084308199504
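
The interval above is built from the unpooled standard error, because a confidence interval does not assume that the two proportions are equal. Written out with the rounded rates, and assuming 2,435 resumes per group (a count that is not printed in the notebook but is consistent with its 4,870 rows and reported rates), the arithmetic is roughly:

```latex
\text{ME} = z^{*}\sqrt{\frac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \frac{\hat{p}_b(1-\hat{p}_b)}{n_b}}
          = 1.96\sqrt{\frac{0.0965 \times 0.9035}{2435} + \frac{0.0645 \times 0.9355}{2435}}
          \approx 0.0153

(\hat{p}_w - \hat{p}_b) \pm \text{ME} = 0.032 \pm 0.0153 \approx (0.0167,\ 0.0473)
```

Here $\hat{p}_w$ and $\hat{p}_b$ stand for rate_w and rate_b, and $z^{*}$ for z_critical; the symbols are introduced only for this illustration. Because the interval excludes zero, it is consistent with rejecting the null hypothesis at the 5% level.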


Bootstrap approach



In [24]:
r = np.sum(data.call)
n = len(data)
# build a pooled callback array: under the null hypothesis both groups share one callback rate
call_back = np.array([True] * int(r) + [False] * int(n - r))
size = 10000
bs_replicates_diff = np.empty(size)


for i in range(size):
    # resample both groups from the pooled array and record the difference in rates
    bs_replicates_w = np.sum(np.random.choice(call_back, size=w_n))
    bs_replicates_b = np.sum(np.random.choice(call_back, size=b_n))
    bs_replicates_diff[i] = bs_replicates_w / w_n - bs_replicates_b / b_n
# one-sided p-value: share of simulated differences at least as large as the observed one
bs_pvalue = np.sum(bs_replicates_diff >= prop_diff) / len(bs_replicates_diff)
# central 95% range of the differences simulated under the null hypothesis
bs_conf = np.percentile(bs_replicates_diff, [2.5, 97.5])
bs_mean_diff = np.mean(bs_replicates_diff)

print('sample diff :{}\n'.format(prop_diff))
print('Results \np-value : {} \n 95% interval under the null :{} '.format(bs_pvalue, bs_conf))
    

sample diff :0.032

Results 
p-value : 0.0 
 95% interval under the null :[-0.01478439  0.01519507] 
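
The resampling above draws both groups from the pooled callbacks, so its percentiles describe the distribution of differences under the null hypothesis rather than a confidence interval for the observed difference. A bootstrap confidence interval could instead be sketched by resampling each group from its own data. The example below is an illustration, not the notebook's method. It assumes 2,435 resumes per group with 235 and 157 callbacks (counts consistent with the reported rates, though not all are printed in the notebook), uses only the standard library, and runs 1,000 replicates to keep the pure-Python runtime short.

```python
import random
import statistics

random.seed(42)

# Assumed data, rebuilt from counts: 1 = callback, 0 = no callback
white = [1] * 235 + [0] * (2435 - 235)
black = [1] * 157 + [0] * (2435 - 157)


def callback_rate(sample):
    """Proportion of resumes in the sample that received a callback."""
    return sum(sample) / len(sample)


n_replicates = 1000  # illustrative; more replicates give smoother percentiles
diffs = []
for _ in range(n_replicates):
    # resample each group with replacement from its own observations
    bs_white = random.choices(white, k=len(white))
    bs_black = random.choices(black, k=len(black))
    diffs.append(callback_rate(bs_white) - callback_rate(bs_black))

# with n=40, the first and last cut points are the 2.5th and 97.5th percentiles
cuts = statistics.quantiles(diffs, n=40)
ci_lower, ci_upper = cuts[0], cuts[-1]

print('bootstrap 95% CI for the difference: {:.4f} to {:.4f}'.format(ci_lower, ci_upper))
```

If the resulting interval excludes zero, it supports the same conclusion as the frequentist confidence interval.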


### Question 4: Write a story describing the statistical significance in the context of the original problem.

Both approaches give a p-value far below the 0.05 significance level (about 4e-05 from the z-test, and 0 out of 10,000 replicates from the bootstrap), so we reject the null hypothesis that P1 - P2 = 0, that is, that resumes with black-sounding and white-sounding names receive callbacks at the same rate. In this sample, resumes with white-sounding names were more likely to get a callback than resumes with black-sounding names: 9.65% compared with 6.45%.

### Question 5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

According to the study's background, the resumes were identical and the names were randomly assigned as white-sounding or black-sounding. As a result, I believe the only variable that differed systematically between the two groups was the name, while the resumes themselves were essentially identical.

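Based on the above analysis, I found that race/name has a statistically significant effect on callback success. The analysis by itself does not show that it is the most important factor, because no other factor was tested. To amend the analysis, the other columns in the dataset (for example education or yearsexp) could be compared with race, for instance in a regression on call.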