In [55]:
from IPython.display import HTML
HTML('<iframe src=http://stanford.edu/~mwaskom/software/seaborn/index.html width=700 height=350></iframe>')
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

# 1. What test is appropriate for this problem? Does CLT apply?

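We have a categorical outcome (callback or no callback) measured in two groups, and we want to compare the proportion of callbacks between the two groups.

So we will use a hypothesis test for comparing two proportions (a two-proportion z-test).

Yes, the Central Limit Theorem applies here: names were assigned to resumes at random, the samples are large, and each group contains well over 10 callbacks and 10 non-callbacks, so each sample proportion is approximately normally distributed.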
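
A common rule of thumb (an assumption here, not something stated in the exercise) is that the normal approximation is reasonable when each group has at least 10 successes and 10 failures. A minimal sketch of that check, using the callback counts printed in section 3 (235 of 2435 and 157 of 2435); the helper name `clt_conditions_met` and the threshold of 10 are illustrative:

```python
# Callbacks and total resumes per group, copied from the output in section 3.
counts = {"w": (235, 2435), "b": (157, 2435)}

def clt_conditions_met(successes, n, threshold=10):
    """Return True if both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

for group, (calls, total) in counts.items():
    # Print group label, callbacks, non-callbacks, and whether the check passes.
    print(group, calls, total - calls, clt_conditions_met(calls, total))
```

Both groups pass this check comfortably, which supports using a z-based test.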


# 2. What are the null and alternate hypotheses?

Null Hypothesis H0: the proportion of callbacks for black-sounding names (P_b) equals the proportion of callbacks for white-sounding names (P_w), i.e. P_w - P_b = 0.

Alternate Hypothesis H1: the two proportions differ, i.e. P_w - P_b != 0, which would indicate that the race suggested by the name affects the callback rate.

# 3. Compute margin of error, confidence interval, and p-value.

In [56]:
# Read the Stata data file into a DataFrame
data = pd.read_stata('/home/kiran/Desktop/Springboard/Racial Discrimination/racial_disc/data/us_job_market_discrimination.dta')

In [57]:
# First of all separate the resumes by race: b and w.
df_white=data[data.race=='w']
df_black=data[data.race=='b']

# Total Number of resumes per race:
w_total=len(df_white.race)
b_total=len(df_black.race)
print(w_total,b_total)

# number of callbacks for black-sounding names
b_calls=sum(data[data.race=='b'].call)

# number of callbacks for white-sounding names
w_calls=sum(data[data.race=='w'].call)
print(w_calls,b_calls)

# Sample proportions:
w_sample_p = w_calls/w_total
b_sample_p = b_calls/b_total

print (w_sample_p,b_sample_p)


2435 2435
235.0 157.0
0.0965092402464 0.064476386037


In [58]:
# Sampling variance for whites and blacks sample

var_white= (w_sample_p*(1-w_sample_p))/w_total
var_black= (b_sample_p*(1-b_sample_p))/b_total

print(var_white,var_black)

3.5809119833e-05 2.47717378565e-05


To compute the 95% confidence interval, we use the sampling distribution of the difference in callback proportions between white-sounding and black-sounding names. Its mean is the difference in the two sample proportions, and its standard deviation is sqrt(var_white+var_black).
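
Written out, and assuming the two samples are independent, the approximation being used is roughly the following. Here $n_w$ and $n_b$ are the group sizes and the hats denote sample proportions; this notation is introduced only for illustration:

$$
\hat{p}_w - \hat{p}_b \;\approx\; \mathcal{N}\!\left(p_w - p_b,\; \frac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \frac{\hat{p}_b(1-\hat{p}_b)}{n_b}\right)
$$

The two variance terms correspond to `var_white` and `var_black`, and the square root of their sum is the standard error of the difference.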


In [59]:
mean_sample_combined= w_sample_p-b_sample_p
std_deviation_combined=np.sqrt(var_white+var_black)
print(mean_sample_combined,std_deviation_combined)

0.0320328542094 0.00778337058668


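The 95% confidence interval is the range within a distance d (the margin of error) of the observed difference, 0.0320; we are 95% confident that the true difference P_w - P_b lies in that range.
Since the sample size is large, we assume the sampling distribution of the difference is approximately normal.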


In [60]:
# Find the Z score for the 97.5th percentile: the interval is two-tailed, so 2.5% lies in each tail
Z=stats.norm.ppf(0.975)
print(Z)

1.95996398454


So the distance d (the margin of error) is 1.96 times the standard deviation of the sampling distribution. We do not know the true population proportions, so we estimate this standard deviation (the standard error) from the sample proportions.



In [61]:
d = Z*std_deviation_combined # margin of error (distance d)
print(d)

0.0152551260282


In [62]:
Upper_CI = mean_sample_combined+d     # upper bound of the 95% confidence interval
Lower_CI= mean_sample_combined -d     # lower bound of the 95% confidence interval

print(Lower_CI,Upper_CI)

0.0167777281812 0.0472879802377


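The above result indicates that we are 95% confident that the true difference in callback rates (white-sounding minus black-sounding names) lies between about 0.0168 and 0.0473. Because the whole interval is above 0, white-sounding names receive more callbacks than black-sounding names.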
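
The interval steps above can be collected into one reusable helper. This is a minimal sketch under stated assumptions: it uses only the Python standard library (`statistics.NormalDist` in place of `scipy.stats.norm`), hard-codes the counts printed earlier, and the function name `diff_proportion_ci` is hypothetical:

```python
import math
from statistics import NormalDist

def diff_proportion_ci(calls_1, n_1, calls_2, n_2, confidence=0.95):
    """Normal-approximation CI for p1 - p2; returns (lower, upper, margin)."""
    p1, p2 = calls_1 / n_1, calls_2 / n_2
    # Unpooled standard error of the difference in sample proportions.
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    # Two-tailed critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * se
    diff = p1 - p2
    return diff - margin, diff + margin, margin

# White-sounding names first, then black-sounding names.
lower, upper, margin = diff_proportion_ci(235, 2435, 157, 2435)
print(lower, upper, margin)
```

Passing a different `confidence` (for example 0.99) widens the interval, which is a quick way to see how sensitive the conclusion is to the chosen level.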

# Hypothesis test to find p-value

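H0: There is no difference in callback rates between black-sounding and white-sounding names: P_w = P_b, i.e. P_w - P_b = 0
H1: There is a difference in callback rates between black-sounding and white-sounding names: P_w != P_b, i.e. P_w - P_b != 0
Significance level (alpha) = 5%
Decision rule: if the probability, given H0, of a difference at least as extreme as the observed one is below 5%, we reject the null hypothesis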

So we now find the probability, under H0, of a difference at least as extreme as mean_sample_combined, the observed difference in callback proportions between white-sounding and black-sounding names.

We first calculate the pooled proportion, which estimates the common callback rate assuming there is no difference between the groups, and then plug it in to calculate the Z value assuming the null hypothesis is true.
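
In symbols, a sketch of the statistic being computed, where $x_w, x_b$ are the callback counts and $n_w, n_b$ the group sizes (notation introduced here for illustration):

$$
\hat{p} = \frac{x_w + x_b}{n_w + n_b}, \qquad
Z = \frac{(\hat{p}_w - \hat{p}_b) - 0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_w} + \dfrac{1}{n_b}\right)}}
$$

With equal group sizes $n_w = n_b = n$, the denominator reduces to $\sqrt{2\,\hat{p}\,(1-\hat{p})/n}$.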

In [63]:
true_proportion= (b_calls+w_calls)/(w_total+b_total) # pooled proportion, assuming there is no difference
true_std_deviation= np.sqrt(true_proportion*(1-true_proportion)*(1.0/w_total+1.0/b_total)) # standard error under H0
Z_value = (mean_sample_combined-0)/true_std_deviation


# Now find the p-value
p_value= 1-stats.norm.cdf(Z_value) # upper-tail (one-sided) probability of a Z value at least this large
p_value_two_sided= 2*p_value       # H1 is two-sided, so the matching p-value doubles the tail area

print(Z_value,p_value)

4.10841215243 1.99194341879e-05


The p-value is far below the 5% significance level (about 2e-05 one-sided, so about 4e-05 for the two-sided alternative), so we reject the null hypothesis at the 5% level: there is a statistically significant difference in callback rates between black-sounding and white-sounding names. Because the names were randomly assigned to otherwise identical resumes, this is evidence that racial discrimination exists in the US job market to some extent.
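
As a cross-check, the pooled two-proportion z-test can be wrapped in one function that reports both tail conventions. This is a sketch rather than the notebook's own method: it uses only the Python standard library, hard-codes the counts printed in section 3, and the function name `two_proportion_z_test` is hypothetical:

```python
import math

def two_proportion_z_test(calls_1, n_1, calls_2, n_2):
    """Pooled two-proportion z-test; returns (z, one_sided_p, two_sided_p)."""
    p1, p2 = calls_1 / n_1, calls_2 / n_2
    # Pooled callback rate, i.e. the common rate assumed under H0.
    pooled = (calls_1 + calls_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    # Tail area beyond |z| in the observed direction; erfc stays accurate for small tails.
    one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, one_sided, 2 * one_sided

# White-sounding names first, then black-sounding names.
z, p_one, p_two = two_proportion_z_test(235, 2435, 157, 2435)
print(z, p_one, p_two)
```

The two-sided value is the one that matches the alternative hypothesis P_w != P_b; either way the result is well below 0.05.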