
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [3]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [5]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)
#list(data.columns.values)

157.0

In [22]:
#print data
#print data.race
#list(data.columns.values)
#print  data.call

#### What test is appropriate for this problem? Does CLT apply?
The callback rate for resumes with white-sounding names can be compared against the callback rate for resumes with black-sounding names. Because `call` is a 0/1 outcome and both groups are large, a two-sample z-test on the difference between the two callback rates is appropriate. Ideally, other attributes such as skill, education, and experience would also be taken into account, for example by grouping similar resumes into clusters and comparing callback rates within each cluster, but that is beyond the scope of this project.

The CLT should apply: the names were assigned to the resumes at random, so the individual entries in each group can be treated as independent, and each group is reasonably large (the counts are computed below).

In [14]:
n_b = sum(data.race == 'b')
print 'Total Black Applicants: ', n_b

Total Black Applicants:  2435


In [15]:
n_w = sum(data.race == 'w')
print 'Total White Applicants: ', n_w

Total White Applicants:  2435


In [16]:
print data.id.count()

4870


In [17]:
data_b = data[data.race == 'b']

In [19]:
data_b.id.count()

2435

In [20]:
data_w = data[data.race=='w']

In [21]:
data_w.id.count()

2435
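
A common rule of thumb for applying the CLT to 0/1 outcomes is that each group should contain at least about 10 successes and 10 failures. The sketch below checks that condition. It takes the group size and the black-sounding callback count from the cells above; the white-sounding callback count of 235 is an assumption inferred from the white-sounding callback rate reported later in this notebook, the threshold of 10 is a convention rather than a hard rule, and `clt_ok` is a hypothetical helper name.

```python
def clt_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

n_per_group = 2435   # resumes per group, from the counts above
callbacks_b = 157    # callbacks for black-sounding names, from the cell above
callbacks_w = 235    # assumption: implied by the white-sounding callback rate

for label, k in [("b", callbacks_b), ("w", callbacks_w)]:
    print("%s: successes=%d, failures=%d, rule of thumb met: %s"
          % (label, k, n_per_group - k, clt_ok(k, n_per_group)))
```

Both groups clear the threshold by a wide margin, which supports using a normal approximation here.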

#### What are the null and alternate hypotheses?

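Null hypothesis: race does not matter; resumes with black-sounding and white-sounding names have an equal chance of getting a call. With u_b and u_w denoting the callback rates of the two groups, this means u_b == u_w.


Alternate hypothesis: resumes with white-sounding names have a higher chance of getting a call, i.e. u_w > u_b.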

#### Computing Confidence Interval

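We test at a 5% significance level (alpha = 0.05): the null hypothesis is rejected if the p-value falls below 0.05.

The margin of error is not assumed in advance. It is computed from the data as the critical z-value times the standard error of the difference in callback rates.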
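
The standard two-sample z statistic for this comparison can be written as follows. The notation is assumed here rather than taken from the exercise: $\bar{x}_w$ and $\bar{x}_b$ are the sample callback rates, $s_w$ and $s_b$ the sample standard deviations, and $n_w$ and $n_b$ the group sizes.

$$
\mathrm{SE} = \sqrt{\frac{s_w^2}{n_w} + \frac{s_b^2}{n_b}}, \qquad z = \frac{\bar{x}_w - \bar{x}_b}{\mathrm{SE}}
$$

Under the null hypothesis u_b == u_w, $z$ is approximately standard normal, so a one-tailed test at the 5% level rejects the null when $z > 1.645$, which is the same as $\bar{x}_w - \bar{x}_b > 1.645 \cdot \mathrm{SE}$.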

In [46]:
mean_b = data_b.call.mean() 
mean_w = data_w.call.mean()
print "Means are: w/b  ", mean_w, mean_b

Means are: w/b   0.0965092405677 0.0644763857126


In [47]:
sd_b = data_b.call.std()
sd_w = data_w.call.std()
print "SD: W,B:",sd_w,",",sd_b

SD: W,B: 0.295345507397 , 0.245649458963


In [30]:
# difference between the two sample callback rates (white minus black)
mean_z = mean_w - mean_b
print("Mean_Z: %.4f" % mean_z)

Mean_Z: 0.0320


In [48]:
import math
# standard error of the difference between two independent sample means
sd_mean_z = math.sqrt(sd_b**2/n_b + sd_w**2/n_w)
print("SD of Mean: %.4f" % sd_mean_z)

SD of Mean: 0.0078


In [49]:
# critical z-value for a one-tailed test at the 5% significance level (about 1.645)
z_value = 1.65
print("%.4f" % (z_value*sd_mean_z))

0.0128


In [50]:
if mean_z - 0 < z_value*sd_mean_z:
    print("Fail to reject the null hypothesis: no significant difference in callback rates.")
else:
    print("Reject the null hypothesis: white-sounding names have a higher chance of getting a call.")

Reject the null hypothesis: white-sounding names have a higher chance of getting a call.
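
The exercise also asks for a margin of error, a confidence interval, and a p-value. The sketch below shows one way to compute them using only the standard library. It uses 2435 resumes per group and 157 black-sounding callbacks from the cells above; the white-sounding count of 235 is an assumption inferred from the reported callback rate. The pooled standard error used for the test is one common convention, not the approach taken in the cells above, and `two_proportion_summary` is a hypothetical helper name.

```python
import math

def two_proportion_summary(k_w, n_w, k_b, n_b, z_crit=1.96):
    """z-test and confidence interval for the difference p_w - p_b."""
    p_w = float(k_w) / n_w
    p_b = float(k_b) / n_b
    diff = p_w - p_b

    # Pooled standard error: used for the test, where H0 says p_w == p_b.
    p_pool = float(k_w + k_b) / (n_w + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_w + 1.0 / n_b))
    z = diff / se_pooled

    # One-tailed p-value P(Z > z) for a standard normal Z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))

    # Unpooled standard error: used for the confidence interval.
    se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
    margin = z_crit * se_unpooled
    return {"diff": diff, "z": z, "p_value": p_value,
            "margin": margin, "ci": (diff - margin, diff + margin)}

result = two_proportion_summary(k_w=235, n_w=2435, k_b=157, n_b=2435)
print("difference in callback rates: %.4f" % result["diff"])
print("z statistic: %.2f" % result["z"])
print("one-tailed p-value: %.2e" % result["p_value"])
print("95%% margin of error: %.4f" % result["margin"])
print("95%% confidence interval: (%.4f, %.4f)" % result["ci"])
```

With these counts the z statistic comes out at roughly 4.1 and the one-tailed p-value is on the order of 1e-5, far below 0.05. The 95% interval for the difference runs from about 0.017 to 0.047 and excludes zero. `scipy.stats.norm.sf(z)` would give the same tail probability as the `math.erfc` expression.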


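##### The two callback rates differ significantly: resumes with white-sounding names have a higher chance of getting a call than resumes with black-sounding names. However, we haven't looked into other aspects of the data, such as level of education, computer skills, and work experience. Because the names were assigned at random, these attributes should be balanced across the two groups on average, but we have not checked this, so the sample might still be biased.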