# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [126]:
import pandas as pd
import numpy as np
from scipy import stats

In [127]:
data = pd.io.stata.read_stata(r'C:\Users\hhtph\Documents\Heather\Big Data Classes\Unit8\us_job_market_discrimination.dta')

In [128]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [129]:
data.shape

(4870, 65)

In [130]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [131]:
list(data.columns.values)

['id',
 'ad',
 'education',
 'ofjobs',
 'yearsexp',
 'honors',
 'volunteer',
 'military',
 'empholes',
 'occupspecific',
 'occupbroad',
 'workinschool',
 'email',
 'computerskills',
 'specialskills',
 'firstname',
 'sex',
 'race',
 'h',
 'l',
 'call',
 'city',
 'kind',
 'adid',
 'fracblack',
 'fracwhite',
 'lmedhhinc',
 'fracdropout',
 'fraccolp',
 'linc',
 'col',
 'expminreq',
 'schoolreq',
 'eoe',
 'parent_sales',
 'parent_emp',
 'branch_sales',
 'branch_emp',
 'fed',
 'fracblack_empzip',
 'fracwhite_empzip',
 'lmedhhinc_empzip',
 'fracdropout_empzip',
 'fraccolp_empzip',
 'linc_empzip',
 'manager',
 'supervisor',
 'secretary',
 'offsupport',
 'salesrep',
 'retailsales',
 'req',
 'expreq',
 'comreq',
 'educreq',
 'compreq',
 'orgreq',
 'manuf',
 'transcom',
 'bankreal',
 'trade',
 'busservice',
 'othservice',
 'missind',
 'ownership']

In [132]:
data.dtypes

id                     object
ad                     object
education                int8
ofjobs                   int8
yearsexp                 int8
honors                   int8
volunteer                int8
military                 int8
empholes                 int8
occupspecific           int16
occupbroad               int8
workinschool             int8
email                    int8
computerskills           int8
specialskills            int8
firstname              object
sex                    object
race                   object
h                     float32
l                     float32
call                  float32
city                   object
kind                   object
adid                  float32
fracblack             float32
fracwhite             float32
lmedhhinc             float32
fracdropout           float32
fraccolp              float32
linc                  float32
                       ...   
parent_emp            float32
branch_sales          float32
branch_emp

 A two-sample z test for the difference between two proportions should be applied here
 Ho = the call back rates for black sounding names and white sounding names are the same
 Ha = the call back rates for black sounding names and white sounding names are not 
 the same

 The central limit theorem (CLT) applies because the sample size is large and race was assigned to
 the resumes randomly
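
 One common rule of thumb for "large enough" with proportions is at least 10 successes and 10 failures in each group. The sketch below checks that rule using the counts reported by the cells further down (157 and 235 callbacks out of 2435 resumes each). The helper name `clt_ok` and the threshold of 10 are assumptions made for illustration, not part of the original exercise.

```python
# Callback counts and group sizes as reported in the notebook outputs.
groups = {
    "b": {"calls": 157, "n": 2435},
    "w": {"calls": 235, "n": 2435},
}

def clt_ok(calls, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and `threshold` failures."""
    return calls >= threshold and (n - calls) >= threshold

# Hypothetical helper applied to each group.
clt_checks = {race: clt_ok(g["calls"], g["n"]) for race, g in groups.items()}
print(clt_checks)
```

 Both groups pass comfortably, which supports using the normal approximation for the difference in callback rates.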

In [133]:
b_call = np.sum(data[data.race=='b'].call)

In [134]:
print(b_call)

157.0


In [135]:
sum_b = np.sum(data.race=='b')

In [136]:
print(sum_b)

2435


In [137]:
w_call = np.sum(data[data.race=='w'].call)

In [138]:
print(w_call)

235.0


In [139]:
sum_w = np.sum(data.race=='w')

In [140]:
print(sum_w)

2435


 Compute the z statistic
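
 Written out, the cells below appear to compute the pooled two-proportion z statistic. The notation here is an assumption added for readability: $x_w$, $x_b$ are the callback counts (`w_call`, `b_call`) and $n_w$, $n_b$ are the group sizes (`sum_w`, `sum_b`).

$$
\hat{p}_w = \frac{x_w}{n_w}, \qquad \hat{p}_b = \frac{x_b}{n_b}, \qquad \hat{p} = \frac{x_w + x_b}{n_w + n_b}
$$

$$
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_w} + \dfrac{1}{n_b}\right)}}
$$

 The numerator corresponds to `z_top`; the three factors under the square root correspond to `z_bottom_1`, `z_bottom_2`, and `z_bottom_3`.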

In [141]:
z_top = (w_call/sum_w) - (b_call/sum_b)

In [142]:
print(z_top)

0.0320328542094


In [143]:
z_bottom_1 = (w_call + b_call)/(sum_w + sum_b)

In [144]:
print(z_bottom_1)

0.0804928131417


In [145]:
z_bottom_2 = 1 - ((w_call + b_call)/(sum_w + sum_b))

In [146]:
print(z_bottom_2)

0.919507186858


In [147]:
z_bottom_3 = (1/sum_w)+(1/sum_b)

In [148]:
print(z_bottom_3)

0.00082135523614


In [149]:
z_bottom = np.sqrt(z_bottom_1 * z_bottom_2 * z_bottom_3)

In [150]:
z = z_top/z_bottom

In [151]:
# Compute the z statistic
print(z)

4.10841215243


In [152]:
# Compute the two-sided p value (abs() keeps it valid if z is negative)
p = stats.norm.cdf(-abs(z))*2
print(p)

3.98388683759e-05
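
 The exercise also asks for a margin of error and a confidence interval, which the cells above do not compute. The following is a minimal sketch, assuming a normal approximation and an unpooled standard error for the interval (a choice made here, not taken from the notebook). The counts are copied from the outputs above; the names `se_diff`, `margin`, `ci_low`, and `ci_high` are illustrative.

```python
import numpy as np
from scipy import stats

# Counts reported by the cells above.
w_call, sum_w = 235, 2435
b_call, sum_b = 157, 2435

p_w = w_call / sum_w
p_b = b_call / sum_b
diff = p_w - p_b

# Unpooled standard error: the interval should not assume the null hypothesis.
se_diff = np.sqrt(p_w * (1 - p_w) / sum_w + p_b * (1 - p_b) / sum_b)

# Critical value for a 95% two-sided interval (about 1.96).
z_crit = stats.norm.ppf(0.975)
margin = z_crit * se_diff
ci_low, ci_high = diff - margin, diff + margin

print("difference in callback rates:", diff)
print("margin of error:", margin)
print("95% confidence interval:", (ci_low, ci_high))
```

 Under these assumptions the interval runs from roughly 1.7 to 4.7 percentage points and excludes zero, which is consistent with rejecting the null hypothesis.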


 Conclusion

 The data set for this exercise consists of 4870 rows and 65 columns. Each row represents an
 individual's resume and the columns are the attributes of the individual such as education attained, 
 computer skills, race, sex, etc. The question posed for this exercise is, 'Are individuals with
 black sounding names called back for an interview at the same rate as individuals with white 
 sounding names?' A two-sample z test was utilized with the null hypothesis being that people with 
 black sounding names were called back for an interview at the same rate as people with white 
 sounding names. The alternative hypothesis was that the call back rate was not the same. Resumes
 with white sounding names received 235 call backs out of 2435 (about 9.7%) and resumes with black
 sounding names received 157 out of 2435 (about 6.4%). The resulting z statistic was 4.1, with a
 two-sided p value of about 0.00004. At a 95% confidence level the null hypothesis is rejected if 
 the z statistic is greater than 1.96 or smaller than -1.96. Our null hypothesis was therefore
 rejected and we conclude that people with black sounding names do not get called back for an
 interview at the same rate as people with white sounding names.
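
 The exercise also suggests a bootstrapping approach. As a cross-check on the z test, the sketch below rebuilds the 0/1 call vectors from the reported counts, runs a permutation test for the p value, and builds a bootstrap percentile interval for the difference in rates. The seed, the number of replicates, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Rebuild the 0/1 callback vectors from the counts reported in the notebook.
w = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
b = np.concatenate([np.ones(157), np.zeros(2435 - 157)])
observed = w.mean() - b.mean()

n_reps = 10_000  # assumed number of replicates

# Permutation test: under the null, race labels are exchangeable,
# so shuffle the pooled outcomes and split them back into two groups.
pooled = np.concatenate([w, b])
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:len(w)].mean() - shuffled[len(w):].mean()
# Small tolerance guards against floating-point ties.
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed) - 1e-12)

# Bootstrap interval: resample each group with replacement.
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    boot_w = rng.choice(w, size=len(w))
    boot_b = rng.choice(b, size=len(b))
    boot_diffs[i] = boot_w.mean() - boot_b.mean()
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print("observed difference:", observed)
print("permutation p value:", p_perm)
print("bootstrap 95% interval:", (ci_low, ci_high))
```

 With only 10,000 shuffles the permutation p value cannot resolve anything smaller than 0.0001, so it should be read as "very small" rather than compared digit for digit with the z test result.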

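 Call back success
 
 I don't believe this analysis shows that race alone is the most important factor in call back 
 success. The z test shows only that race has a statistically significant effect; it does not
 compare that effect with the effect of any other attribute. Call back success is also influenced
 by the individual's other attributes, including education and experience. If the analysis were
 repeated within groups of individuals with the same education, experience, or computer skills and
 the result were the same, and if the effect of race were larger than the effects of those other
 attributes, then the claim that race is the most important factor in call back success would have
 a stronger basis.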
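
 One way to start on that amendment is to compare call back rates by race within each level of another attribute. The sketch below shows the idea with pandas on a tiny made-up table; the column names `race`, `education`, and `call` match the data set, but the values are invented for illustration and are NOT taken from the study.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the real data set.
toy = pd.DataFrame({
    "race":      ["b", "b", "b", "b", "w", "w", "w", "w"],
    "education": [3, 3, 4, 4, 3, 3, 4, 4],
    "call":      [0, 0, 0, 1, 0, 1, 1, 1],
})

# Call back rate for each (education, race) cell, one column per race.
rates = toy.groupby(["education", "race"])["call"].mean().unstack("race")

# Gap between white-sounding and black-sounding names within each education level.
rates["gap"] = rates["w"] - rates["b"]
print(rates)
```

 On the real data, `toy` would be replaced by `data`, and the same grouping could be repeated for `yearsexp` or `computerskills`. A regression of `call` on race and the other attributes together would be a natural next step for comparing effect sizes directly.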