# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
</div>
****

In [12]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [13]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

## 1. What test is appropriate for this problem? Does CLT apply?
Yes, the CLT applies in this case. The sample size is far greater than 30 (4870 résumés in total) and we know the sample means. We want to compare the two sample means, and we do not know the 
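
A common rule of thumb for using the normal approximation with a proportion is that each group has at least 10 successes and 10 failures. The check below is a minimal sketch of that rule, not part of the original analysis. The helper `normal_approx_ok` is a hypothetical name, and the counts (2435 résumés per group, with 157 and 235 callbacks) are assumptions that are consistent with the 4870 rows and the callback rates printed in the next cell.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Assumed counts (not read from the data here): (callbacks, resumes) per group
groups = {'b': (157, 2435), 'w': (235, 2435)}

checks = {race: normal_approx_ok(k, n) for race, (k, n) in groups.items()}
print(checks)
```

With counts of this size both groups pass comfortably, which supports treating the difference of the two sample proportions as approximately normal.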

In [15]:
#We are only focusing on two columns of this data. Does being b or w affect the call back rate?
#Let's isolate the two variables for easy access
df = data[['race', 'call']]

#Let's look at the percentage of call backs before delving further

b_calls = sum((df[df.race=='b']).call)
w_calls = sum((df[df.race=='w']).call)

b_length = len(df[df.race=='b'])
w_length = len(df[df.race=='w'])
              
b_percent = b_calls / b_length
w_percent = w_calls / w_length
              
print('Percent of calls made to black candidates:', b_percent)
print('Percent of calls made to white candidates:', w_percent)

Percent of calls made to black candidates: 0.064476386037
Percent of calls made to white candidates: 0.0965092402464


White-sounding names have a callback rate about 3.2 percentage points higher (9.65% versus 6.45%). This will be used to form our hypotheses.
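
The gap can be read either as an absolute difference in percentage points or as a relative increase, and the two give very different impressions. A short sketch of both, using the two rates copied from the output above (the variable names are introduced here for illustration only):

```python
# Callback rates copied from the printed output above
b_rate = 0.064476386037
w_rate = 0.0965092402464

# Absolute difference, expressed in percentage points
abs_diff_points = (w_rate - b_rate) * 100

# Relative increase of the white-sounding rate over the black-sounding rate
rel_increase = (w_rate / b_rate - 1) * 100

print('Absolute difference: %.2f percentage points' % abs_diff_points)
print('Relative increase: %.1f%%' % rel_increase)
```

In absolute terms the difference is a little over 3 percentage points, while in relative terms white-sounding names receive roughly half again as many callbacks.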

## 2. What are the null and alternate hypotheses?

Ho: The callback rate for white-sounding names is equal to the callback rate for black-sounding names. This means that race has nothing to do with callback rates.

$\bar{x}_w = \bar{x}_b$

Ha: The callback rate for white-sounding names is not equal to the callback rate for black-sounding names. This means that race does affect callback rates.

$\bar{x}_w \neq \bar{x}_b$

## 3. Compute margin of error, confidence interval, and p-value.

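I'll use the standard 95% confidence level. The steps to compute the relevant statistics are as follows:
- std of the call column for each group
- diff of the sample means (w_percent - b_percent)
- variance of each sample, p * (1 - p) for a Bernoulli variable
- se (standard error), which is the square root of var_w/n_w + var_b/n_b
- tcrit = 1.96 (the normal critical value for 95%; with samples this large the t critical value is essentially the same)
- marg_error = tcrit * se
- CI_95 = diff of the means +/- marg_error
- t-statistic = diff of the means / se
- p-value from stats.t.sf, which returns the upper-tail (one-sided) probability; double it for a two-sided test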
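
Written out as formulas, the steps above amount to the following. The symbols are introduced here for illustration: $\hat{p}_w$ and $\hat{p}_b$ stand for `w_percent` and `b_percent`, and $n_w$ and $n_b$ for `w_length` and `b_length`.

```latex
\begin{aligned}
\mathrm{SE} &= \sqrt{\frac{\hat{p}_w(1-\hat{p}_w)}{n_w} + \frac{\hat{p}_b(1-\hat{p}_b)}{n_b}} \\
\mathrm{ME} &= 1.96 \cdot \mathrm{SE} \\
\mathrm{CI}_{95} &= (\hat{p}_w - \hat{p}_b) \pm \mathrm{ME} \\
t &= \frac{\hat{p}_w - \hat{p}_b}{\mathrm{SE}}
\end{aligned}
```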

In [21]:
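# Standard deviation of the 0/1 call column within each group
# (b_calls and w_calls are scalar totals, so calling .std() on them is not meaningful)
b_std = df[df.race=='b'].call.std()
w_std = df[df.race=='w'].call.std()

# Difference of the sample means (callback rates)
mean = w_percent - b_percent

# Variance of each sample from the Bernoulli distribution: p * (1 - p)
var_b = b_percent * (1-b_percent)
var_w = w_percent * (1-w_percent)

# Standard error of the difference
se = np.sqrt((var_b/b_length)+ (var_w/w_length))

# Margin of error at the 95% level (critical value 1.96)
marg_error = 1.96 * se

CI95 = [mean - marg_error, mean + marg_error]

t_stat = mean/se

# Upper-tail (one-sided) p-value; doubling it gives the two-sided p-value for Ha
pval = stats.t.sf(t_stat, df=b_length-1)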

print('The diff of the mean of the two samples is:', mean)
print('The Standard Error is:', se)
print('The 95% CI is:', CI95)
print('The t_stat is:', t_stat)
print('The pvalue is:', pval)

The diff of the mean of the two samples is: 0.0320328542094
The Standard Error is: 0.00778337058668
The 95% CI is: [0.016777447859559147, 0.047288260559332024]
The t_stat is: 4.11555043573
The pvalue is: 1.99553729284e-05
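
Because Ha is two-sided, the p-value to compare against 0.05 is the two-tailed one. The sketch below is a cross-check rather than the notebook's own method: it swaps in the standard normal distribution (via the standard library's `math.erfc`) for `stats.t.sf`, on the assumption that with samples this large the t and normal distributions nearly coincide. The statistic is copied from the output above.

```python
import math

# Test statistic copied from the printed output above
t_stat = 4.11555043573

# Two-sided p-value under a standard normal approximation:
# P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2))

# The one-sided (upper-tail) p-value is half of the two-sided one
p_one_sided = p_two_sided / 2

print('One-sided p-value (normal approx.):', p_one_sided)
print('Two-sided p-value (normal approx.):', p_two_sided)
```

Either way the p-value stays several orders of magnitude below 0.05, so the conclusion does not depend on the choice of one or two tails.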


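## 4. Write a story describing the statistical significance in the context of the original problem.
The p-value is very small (about 2e-05, and still far below 0.05 when doubled for the two-sided alternative), which leads to the rejection of Ho. The 95% confidence interval for the difference in callback rates, roughly 1.7 to 4.7 percentage points, excludes zero. Because names were assigned to the résumés at random, this is strong evidence that the race suggested by the name affects the probability of a callback.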

## 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
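No. This analysis only shows that race has a statistically significant effect on callbacks; it says nothing about how that effect compares with other factors. The dataframe contains many other variables, such as sex, education level, and skills, and without examining how they affect callbacks it is not possible to conclude that race/name is the most important factor. To amend the analysis, the effect of those variables on the callback rate would need to be estimated and compared with the effect of race.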
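
One simple way to start such a comparison is to compute the callback rate for each level of several columns and look at the size of the gaps. The sketch below uses a hypothetical helper, `callback_gap`, and a toy dataframe with made-up values; only the column names `race`, `sex` and `call` come from the dataset. It compares raw gaps only, so a model that includes several variables at once would be a fuller amendment.

```python
import pandas as pd

def callback_gap(frame, column, outcome='call'):
    """Return the callback rate per level of `column` and the max-min gap."""
    rates = frame.groupby(column)[outcome].mean()
    return rates, rates.max() - rates.min()

# Toy data with made-up values, only to show the shape of the call;
# in the notebook one would pass `data` instead.
toy = pd.DataFrame({
    'race': ['b', 'b', 'w', 'w', 'b', 'w', 'b', 'w'],
    'sex':  ['f', 'm', 'f', 'm', 'f', 'm', 'm', 'f'],
    'call': [0, 0, 1, 0, 0, 1, 1, 0],
})

for col in ['race', 'sex']:
    rates, gap = callback_gap(toy, col)
    print(col, rates.to_dict(), 'gap =', gap)
```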