# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [49]:
import pandas as pd
import numpy as np
from scipy import stats
%precision %.4g

u'%.4g'

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [55]:
data.head()

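|   | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|----|----|-----------|--------|----------|--------|-----------|----------|----------|---------------|-----|---------|--------|-------|----------|----------|-------|------------|------------|---------|-----------|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Nonprofit |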


Question1: What test is appropriate for this problem? Does CLT apply?

In this problem, we want to test whether the proportion of resumes that received a call differs between two populations (resumes with black-sounding names and resumes with white-sounding names). The appropriate test here is known as a two-proportion z-test.

The CLT applies: for a sufficiently large sample, the sample proportion of a 0/1 outcome is approximately normally distributed with:

mean = p
standard deviation = sqrt{p(1-p)/n} , 

where p is the population proportion and n is the size of the sample. A common rule of thumb is that the approximation is adequate when both n*p and n*(1-p) are at least 10.

source: https://www.stat.auckland.ac.nz/~wild/ChanceEnc/Ch07.propCLT.pdf
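
As a rough check on the normal approximation described above, the following sketch simulates many samples and compares the spread of the simulated sample proportions with sqrt{p(1-p)/n}. The values of `p_true`, `n`, and `trials` are illustrative assumptions, not figures taken from the dataset, and the helper names are hypothetical.

```python
import math
import random


def proportion_sd(p, n):
    """Theoretical standard deviation of a sample proportion."""
    return math.sqrt(p * (1.0 - p) / n)


def simulate_sample_proportions(p, n, trials, seed=0):
    """Draw `trials` samples of size n and return each sample's proportion."""
    rng = random.Random(seed)
    props = []
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        props.append(successes / float(n))
    return props


# Assumed, illustrative settings (not the notebook's data).
p_true, n, trials = 0.08, 500, 2000
props = simulate_sample_proportions(p_true, n, trials)

mean_hat = sum(props) / len(props)
sd_hat = math.sqrt(sum((x - mean_hat) ** 2 for x in props) / (len(props) - 1))

print("simulated mean: %.4f, theoretical: %.4f" % (mean_hat, p_true))
print("simulated sd:   %.4f, theoretical: %.4f" % (sd_hat, proportion_sd(p_true, n)))
```

If the approximation holds, the simulated mean and standard deviation should land close to the theoretical values.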

Question2: What are the null and alternate hypotheses?
    
Let's say that P1 is the proportion of black-sounding resumes that were called, and P2 the proportion 
of white-sounding resumes that were called. Then the null (Ho) and alternative (Ha) hypotheses can be stated as:
        
   Ho: P1 - P2 = 0   /   Ha: P1 - P2 ≠ 0

Question3: Compute margin of error, confidence interval, and p-value.
    
    source1: https://onlinecourses.science.psu.edu/stat500/node/55
    source2: http://knowledgetack.com/python/statsmodels/two-sample-hypothesis-testing-in-python-with-statsmodels/
    source3: https://stackoverflow.com/questions/20864847/probability-to-z-score-and-vice-versa-in-python
    source4: http://www.kean.edu/~fosborne/bstat/06d2pop.html
    source5: http://www.statisticshowto.com/how-to-calculate-margin-of-error/

In [63]:
# For calculating the statistics in this exercise, we will use the following info:

black = data[data.race == 'b']
white = data[data.race == 'w']

s1 = float(black.call.sum())  # number of black-sounding resumes that received a call
n1 = len(black)               # total number of black-sounding resumes
s2 = float(white.call.sum())  # number of white-sounding resumes that received a call
n2 = len(white)               # total number of white-sounding resumes
p1 = s1 / n1  # callback proportion for black-sounding resumes
p2 = s2 / n2  # callback proportion for white-sounding resumes
p = (s1 + s2) / (n1 + n2)  # pooled proportion

# Use 1.0/n rather than 1/n: under Python 2, 1/n1 is integer division and
# evaluates to 0, which would make SE equal to 0.
SE = np.sqrt(p * (1 - p) * ((1.0 / n1) + (1.0 / n2)))  # pooled standard error

z = (p1 - p2) / SE  # z-test statistic, matching Ho: P1 - P2 = 0



Q3a:  What is the p value for the observed difference?

In [64]:
# The p-value for this z-score can be calculated as follows.
# We multiply by 2 (two-tailed test) because we are not assuming a specific direction for the difference between P1 and P2:

p_value = 2 * stats.norm.sf(abs(z))
print('%.4g' % p_value)

3.984e-05


In [66]:
# print standard error
print('%.4g' % SE)

0.007797


In [67]:
# print difference between proportions
print(p1-p2)

-0.0320328542094


Q3b:  What is the confidence interval for the observed difference?

A confidence interval for the difference between proportions p1 and p2 can be calculated as follows (source 5):

(p1 - p2) +/- z_alpha*SE

where the value of z_alpha depends on the degree of confidence. For a 95% confidence interval, z_alpha = 1.96.

In this case SE is approximately 0.0078, so the margin of error is 1.96 * 0.0078, or approximately 0.0153.

Hence, the 95% confidence interval for the difference in the two proportions is approximately -0.0320 +/- 0.0153, that is, (-0.0473, -0.0168). The interval does not contain 0, which agrees with the small p-value.

Strictly speaking, a confidence interval is usually built from the unpooled standard error, sqrt{p1(1-p1)/n1 + p2(1-p2)/n2}; for this data it gives a nearly identical interval.
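
The test statistic, p-value, margin of error, and interval discussed above can be bundled into one helper. This is a minimal sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function name `two_proportion_summary` and the counts passed to it are hypothetical and are not the notebook's data.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+


def two_proportion_summary(s1, n1, s2, n2, confidence=0.95):
    """Two-proportion z-test and confidence interval for p1 - p2.

    s1, s2 are success counts; n1, n2 are group sizes.
    """
    p1 = s1 / float(n1)
    p2 = s2 / float(n2)
    diff = p1 - p2

    # Pooled standard error: used for the test, where Ho says p1 == p2.
    pooled = (s1 + s2) / float(n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

    # Unpooled standard error: the usual choice for the interval.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_alpha * se_unpooled

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "margin_of_error": margin,
        "ci": (diff - margin, diff + margin),
    }


# Illustrative counts only (hypothetical, not the notebook's data).
result = two_proportion_summary(s1=60, n1=1000, s2=90, n2=1000)
for name, value in result.items():
    print(name, value)
```

With the real data, the same call would take the callback counts and group sizes computed from `data`.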

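Q4: Write a story describing the statistical significance in the context of the original problem.
    
Since p_value < 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference between the proportions P1 and P2.
Resumes with black-sounding names received callbacks about 3.2 percentage points less often than identical resumes with white-sounding names.
Because the names were assigned to the resumes at random, this gap can be attributed to the perceived race of the applicant.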
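
Because the race labels were assigned at random, a permutation test offers a cross-check that does not rely on the normal approximation. The sketch below is one possible version, not the method used in this notebook; the function name `permutation_test` and the callback counts are hypothetical.

```python
import random


def permutation_test(calls_a, calls_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in callback rates."""
    rng = random.Random(seed)
    n_a = len(calls_a)
    n_b = len(calls_b)
    pooled = list(calls_a) + list(calls_b)
    total = sum(pooled)
    observed = sum(calls_a) / float(n_a) - sum(calls_b) / float(n_b)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels, mimicking the random name assignment.
        s_a = sum(rng.sample(pooled, n_a))
        s_b = total - s_a
        diff = s_a / float(n_a) - s_b / float(n_b)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1.0) / (n_permutations + 1.0)
    return observed, p_value


# Illustrative data only (hypothetical counts): 1 = callback, 0 = no callback.
group_b = [1] * 60 + [0] * 940
group_w = [1] * 90 + [0] * 910
observed_diff, perm_p = permutation_test(group_b, group_w)
print("observed difference: %.3f" % observed_diff)
print("permutation p-value: %.4f" % perm_p)
```

A permutation p-value close to the z-test p-value would suggest the normal approximation is not driving the conclusion.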

Q5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? 
    If not, how would you amend your analysis?
    
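While the present analysis shows that race has a statistically significant effect on callback success, it cannot tell us 
whether race is the most important factor. For that, one would need to analyze whether other factors, such as "education" 
or "years of experience", also have a statistically significant effect, and compare the sizes of those effects, for example with a logistic regression of 'call' on all the resume characteristics.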
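
A simple first step toward that amendment is to repeat the two-proportion comparison for several resume characteristics and compare the results. The sketch below does this on synthetic data: the generating probabilities, the `white_name` flag, and the helper names are assumptions made for illustration, and only the column names `honors` and `volunteer` come from the dataset.

```python
import math
import random


def two_proportion_z(s1, n1, s2, n2):
    """Return (difference in proportions, pooled z statistic)."""
    p1 = s1 / float(n1)
    p2 = s2 / float(n2)
    pooled = (s1 + s2) / float(n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    return p1 - p2, (p1 - p2) / se


def factor_effect(records, factor):
    """Compare callback rates for records with factor == 1 vs factor == 0."""
    with_factor = [r["call"] for r in records if r[factor] == 1]
    without_factor = [r["call"] for r in records if r[factor] == 0]
    return two_proportion_z(sum(with_factor), len(with_factor),
                            sum(without_factor), len(without_factor))


# Synthetic resumes (assumed effect sizes, for illustration only).
rng = random.Random(42)
records = []
for _ in range(4000):
    row = {
        "white_name": rng.randint(0, 1),
        "honors": int(rng.random() < 0.2),
        "volunteer": int(rng.random() < 0.4),
    }
    p_call = 0.06 + 0.03 * row["white_name"] + 0.05 * row["honors"]
    row["call"] = int(rng.random() < p_call)
    records.append(row)

factors = ("white_name", "honors", "volunteer")
effects = {f: factor_effect(records, f) for f in factors}

# List factors from largest to smallest absolute z statistic.
for factor, (diff, z) in sorted(effects.items(), key=lambda kv: -abs(kv[1][1])):
    print("%-10s diff=%+.4f  z=%+.2f" % (factor, diff, z))
```

Comparing factors one at a time ignores correlations among them, so this is only a first look at relative importance.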