# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [105]:
import numpy as np
import pandas as pd
import scipy.stats 
import matplotlib.pyplot as plt

In [106]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [112]:
# number of callbacks for black-sounding names
bcb = sum(data[data.race=='b'].call)
bp = sum(data.race=='b')

# number of callbacks for white-sounding names
wcb = sum(data[data.race=='w'].call)
wp = sum(data.race=='w')

bcr = (bcb / bp) * 100
wcr = (wcb / wp) * 100

print(bcr)
print(wcr)

6.4476386037
9.65092402464


In [108]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [109]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [110]:
scipy.stats.chisquare(bcr, f_exp=wcr)

Power_divergenceResult(statistic=1.0632181397177682, pvalue=nan)

1. Several tests could be run on this data set, depending on what we decide to look at. The simplest is a chi-square test of independence between race and callback. That test has to be run on the counts of callbacks and non-callbacks in each group; passing the two callback percentages to `scipy.stats.chisquare`, as in the cell above, compares one observed value with one expected value, leaves zero degrees of freedom, and returns a p-value of `nan`. A two-sample t-test on the difference in mean callback rate between the two groups is also appropriate, and for 0/1 data of this size it is nearly equivalent to a two-proportion z-test. The data could be broken down further by gender as well as race, or analyzed with a multivariate model that includes education or work experience. The sample is large, above 4000 resumes, so the CLT applies: each `call` value is 0 or 1 rather than normally distributed, but with this many observations the sampling distribution of each group's callback rate, and of the difference between the two rates, is approximately normal.
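
The chi-square test mentioned above is defined on counts rather than percentages, and the CLT rule of thumb can be checked on the same counts. The following is a minimal sketch, not the notebook's original method. It assumes 157 callbacks out of 2,435 resumes with black-sounding names and 235 out of 2,435 with white-sounding names; these counts are inferred from the printed callback rates and are not read from the data file. The helper names `normal_approx_ok` and `chi2_2x2` are made up for illustration, and only the standard library is used so the arithmetic stays visible.

```python
import math

# Assumed counts, inferred from the printed callback rates (6.45% and 9.65%)
# under the assumption of 2,435 resumes per group; not read from the data file.
n_b, calls_b = 2435, 157
n_w, calls_w = 2435, 235


def normal_approx_ok(n, successes, minimum=10):
    """Rule of thumb for the CLT: at least `minimum` successes and failures."""
    return successes >= minimum and (n - successes) >= minimum


def chi2_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    No continuity correction is applied. Returns (statistic, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: black-sounding, white-sounding. Columns: callback, no callback.
table = [[calls_b, n_b - calls_b],
         [calls_w, n_w - calls_w]]

clt_ok = normal_approx_ok(n_b, calls_b) and normal_approx_ok(n_w, calls_w)
chi2_stat, chi2_p = chi2_2x2(table)
print(clt_ok, chi2_stat, chi2_p)
```

With the real data, the same table could be built from `data` and passed to `scipy.stats.chi2_contingency`, which applies a continuity correction to 2x2 tables by default and so gives a slightly smaller statistic.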

2. The null hypothesis is that resumes with white-sounding names and resumes with black-sounding names receive callbacks at the same rate. The alternate hypothesis is that the two callback rates differ. If we control for other factors, the null hypothesis also means that comparable candidates receive callbacks at the same rate regardless of whether their names sound white or black.

In [111]:
b = data[data.race=='b'].call
w = data[data.race=='w'].call

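# Half-width computed as 1.96 * std / n. A 95% interval needs 1.96 * std / sqrt(n),
# which is what `error` holds, so the interval printed from `confint` is too narrow.
confint = 1.96 * (b.std() / len(b))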
error = 1.96 * (b.std() / np.sqrt(len(b)))

print(b.mean() - confint, b.mean() + confint)
print(scipy.stats.ttest_ind(b, w))
print(error)

0.06427865555423486 0.06467411587101234
Ttest_indResult(statistic=-4.1147052908617514, pvalue=3.9408021031288859e-05)
0.00975713686648


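3. The p-value from the two-sample t-test is 3.94E-5. The margin of error for the callback rate of black-sounding names is 0.009757 at the 95% level, so the 95% confidence interval for that rate is 0.0645 ± 0.0098, or roughly 0.0547 to 0.0742. The narrower interval printed above (0.06428 to 0.06467) divides the standard deviation by n instead of by the square root of n and understates the uncertainty.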
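
Because the question is about the gap between the two groups, it can help to compute the margin of error, confidence interval, and p-value for the difference in callback rates directly. The following is a minimal sketch of a two-proportion z-test, not the notebook's original method. It assumes 157 callbacks out of 2,435 resumes with black-sounding names and 235 out of 2,435 with white-sounding names (counts inferred from the printed rates, not read from the data file). The function name `two_proportion_summary` is made up for illustration.

```python
import math

# Assumed counts, inferred from the printed callback rates;
# not read from the data file.
n_b, calls_b = 2435, 157
n_w, calls_w = 2435, 235


def two_proportion_summary(x1, n1, x2, n2, z_crit=1.96):
    """Difference in proportions (group 1 minus group 2) with a 95% CI and z-test."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # The confidence interval uses the unpooled standard error.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_crit * se_unpooled

    # The test uses the pooled standard error, since the null hypothesis
    # says both groups share one callback rate.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "diff": diff,
        "margin": margin,
        "ci": (diff - margin, diff + margin),
        "z": z,
        "p_value": p_value,
    }


result = two_proportion_summary(calls_b, n_b, calls_w, n_w)
for name, value in result.items():
    print(name, value)
```

If the whole interval for the difference lies below zero, the data are consistent with a lower callback rate for black-sounding names at the 95% level, which is the same conclusion the t-test's small p-value points to.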

4. Black college-educated households have the same amount of wealth as white high-school-educated households [(source)](http://www.pewresearch.org/fact-tank/2013/08/30/black-incomes-are-up-but-wealth-isnt/). This conflicts with the idea that racial wealth inequality is caused by lack of education and experience. Studies like the one we analyze here show that structural disadvantages and implicit racism are involved as well. Resumes with black-sounding names received callbacks 6.45% of the time, compared to 9.65% for otherwise identical resumes with white-sounding names, and a gap this large would be very unlikely to arise by chance alone (p = 3.94E-5). It is important to find these kinds of effects because they help us construct more robust policies to mitigate the disadvantage facing black people. While affirmative action is one way to try to resolve this, we could also do something similar to blind orchestra auditions [(source)](https://www.nber.org/papers/w5903) and remove identifying information, in this case names, from resumes.

5. The analysis does not show that name or race is the strongest factor in predicting whether a candidate will receive a callback. It shows only that the race signaled by the name has a statistically significant effect on the callback rate. To see how much of a role name/race plays relative to other factors, we need to control for variables like education and experience. If the gap in callback rates persists among candidates with the same amount of education and experience, that indicates name/race plays a role independent of qualifications. Comparing the size of that gap with the effect of the other variables, for example in a multivariate regression, would show which factor matters most.
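
One simple way to control for a variable is to compare callback rates within each level of it. The following is a minimal sketch under stated assumptions, not a full multivariate analysis. The function name `callback_gap_by_stratum` is made up for illustration, and the `toy` records are invented purely to show the call shape; they are not taken from the data set.

```python
from collections import defaultdict


def callback_gap_by_stratum(records):
    """Return {stratum: white_rate - black_rate}.

    `records` is an iterable of (race, stratum, call) tuples, where race is
    'b' or 'w' and call is 0 or 1. Strata missing either group are skipped.
    """
    # For each stratum and race, keep [number of callbacks, number of resumes].
    counts = defaultdict(lambda: {"b": [0, 0], "w": [0, 0]})
    for race, stratum, call in records:
        cell = counts[stratum][race]
        cell[0] += call
        cell[1] += 1

    gaps = {}
    for stratum, by_race in counts.items():
        (calls_b, n_b), (calls_w, n_w) = by_race["b"], by_race["w"]
        if n_b and n_w:
            gaps[stratum] = calls_w / n_w - calls_b / n_b
    return gaps


# Invented records for illustration only: (race, education level, call).
toy = [("b", 3, 0), ("b", 3, 1), ("w", 3, 1), ("w", 3, 1),
       ("b", 4, 0), ("b", 4, 0), ("w", 4, 0), ("w", 4, 1)]
gaps = callback_gap_by_stratum(toy)
print(gaps)
```

With the notebook's DataFrame, the same function could be called as `callback_gap_by_stratum(zip(data.race, data.education, data.call))`. A gap that stays positive across education levels would suggest that the difference is not explained by education.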