# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.


### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [3]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


### Percentage of success of both races (w, b)

In [6]:
# number of callbacks for white-sounding names
print("Total w candidates %d and %d received call from employer which is about ***%0.2f percent***"
      %(sum(data.race=="w"), sum(data[data.race=='w'].call),(sum(data[data.race=='w'].call)/sum(data.race=="w"))*100))
# number of callbacks for black-sounding names
print("Total b candidates %d and %d received call from employer which is about ***%0.2f percent***"
      %(sum(data.race=="b"), sum(data[data.race=='b'].call),(sum(data[data.race=='b'].call)/sum(data.race=="b"))*100))

Total w candidates 2435 and 235 received call from employer which is about ***9.65 percent***
Total b candidates 2435 and 157 received call from employer which is about ***6.45 percent***


### 1. What test is appropriate for this problem? Does CLT apply?

We should carry out a hypothesis test to examine whether the callback rates of the two groups (black-sounding and white-sounding names) differ significantly. Because `call` is a binary outcome and we are comparing two independent proportions, a two-sample z-test for proportions is appropriate. If the difference is significant, we can then say which group has the higher callback rate.

Yes, the Central Limit Theorem applies in this case. Each callback is a Bernoulli trial, and both groups are large (2435 resumes each) with well over 10 successes and 10 failures (235 and 2200 for w; 157 and 2278 for b), so the sampling distribution of each callback rate is approximately normal.
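
A common rule of thumb treats the normal approximation as reasonable when each group has at least about 10 successes and 10 failures. The sketch below checks that rule using only the counts printed in the callback summary above. The threshold of 10 and the helper name `clt_conditions_met` are assumptions made for illustration, not part of the original analysis.

```python
# Counts taken from the callback summary printed earlier in this notebook
N_PER_GROUP = 2435
CALLBACKS = {"w": 235, "b": 157}


def clt_conditions_met(successes, n, threshold=10):
    """Return True when both the success and failure counts reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for race, calls in CALLBACKS.items():
    ok = clt_conditions_met(calls, N_PER_GROUP)
    print(f"race {race}: successes={calls}, failures={N_PER_GROUP - calls}, CLT ok={ok}")
```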

### 2. What are the null and alternate hypotheses?

We first set up our hypotheses:
    - Null hypothesis H0: pw = pb
    - Alternative hypothesis Ha: pw ≠ pb

where
    - pw = callback success rate for race w
    - pb = callback success rate for race b
    
Based on the experiment that follows, which draws a resample of 10000 outcomes per group, we reject the null hypothesis that the callback success rates of groups w and b are the same. The confidence intervals of the two groups do not overlap, which implies that the two success rates differ significantly. We can also conclude that the callback success rate of race w is significantly higher than that of race b.
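
The same hypotheses can also be tested with the frequentist approach. The following is a minimal sketch of a two-proportion z-test, assuming the group sizes and callback counts from the summary printed earlier (2435 resumes per group, 235 and 157 callbacks). It uses the pooled standard error for the test statistic and the unpooled one for the confidence interval, which is one common convention rather than the only one. All variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts taken from the callback summary printed earlier in this notebook
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# Test statistic: under H0 (pw = pb) both groups share one pooled rate
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_stat = diff / se_pooled

# Two-sided p-value from the standard normal: P(|Z| >= |z_stat|)
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

# 95% confidence interval for pw - pb, using the unpooled standard error
z_crit = NormalDist().inv_cdf(0.975)
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
margin = z_crit * se_unpooled
ci = (diff - margin, diff + margin)

print(f"difference in callback rates: {diff:.4f}")
print(f"z statistic: {z_stat:.3f}, two-sided p-value: {p_value:.2e}")
print(f"margin of error: {margin:.4f}")
print(f"95% CI for pw - pb: ({ci[0]:.4f}, {ci[1]:.4f})")
```

A confidence interval for the difference that excludes zero points the same way as a small p-value: both are evidence against H0.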

#### Now we perform a **Bernoulli bootstrapping** experiment on both races to find the success rate, variance and C.I.

In [12]:
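from sklearn.utils import resample

# Callback outcomes (0/1) for resumes with white-sounding names
wCallLst = data.call[data.race=='w']
# Draw 10000 outcomes with replacement from the observed w callbacks
wSample = resample(wCallLst, n_samples = 10000)

n=len(wSample)
p = sum(wSample)/len(wSample) # resampled callback rate
z=1.96 # z critical value for a 95% confidence interval

print("Race w Sample Size: %d"%(n))
print("Race w success rate (pw): %0.3f" %(p))
print("Race w Variance (pw): %0.3f" %(p*(1-p)))
# Normal-approximation interval. Note that n is the resample size (10000),
# not the 2435 original resumes, so this interval is narrower than one
# computed from the original sample.
CI = (p-z*np.sqrt(p*(1-p)/n) , p+z*np.sqrt(p*(1-p)/n))
print("Race w Confidence Interval: %0.3f - %0.3g"%(CI[0], CI[1]))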


Race w Sample Size: 10000
Race w success rate (pw): 0.093
Race w Variance (pw): 0.084
Race w Confidence Interval: 0.087 - 0.0985


In [11]:
# Callback outcomes (0/1) for resumes with black-sounding names
bCallLst = data.call[data.race=='b']
# Draw 10000 outcomes with replacement from the observed b callbacks
bSample = resample(bCallLst, n_samples = 10000)

n=len(bSample)
p = sum(bSample)/len(bSample) # resampled callback rate
z=1.96 # z critical value for a 95% confidence interval

print("Race b Sample Size: %d"%(n))
print("Race b success rate (pb): %0.3f" %(p))
print("Race b Variance (pb): %0.3f" %(p*(1-p)))
# Normal-approximation interval; n is the resample size (10000), as in the w cell
CI = (p-z*np.sqrt(p*(1-p)/n) , p+z*np.sqrt(p*(1-p)/n))
print("Race b Confidence Interval: %0.3f - %0.3g"%(CI[0], CI[1]))


Race b Sample Size: 10000
Race b success rate (pb): 0.066
Race b Variance (pb): 0.061
Race b Confidence Interval: 0.061 - 0.0706
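
As a complement to the per-group intervals above, the sketch below bootstraps the difference in callback rates directly, resampling each group at its original size of 2435 so that the interval width reflects the actual amount of data. It assumes the counts from the callback summary printed earlier. The seed, the 10000 replicates, and all variable names are illustrative choices. It draws binomial counts instead of resampling the 0/1 column, which is equivalent for binary outcomes, so it is a swapped-in shortcut rather than the cells' `resample` approach.

```python
import numpy as np

# Counts taken from the callback summary printed earlier in this notebook
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157
n_boot = 10000  # number of bootstrap replicates (an arbitrary choice)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Resampling n 0/1 outcomes with replacement and counting the ones is the
# same as one Binomial(n, observed rate) draw, so each replicate is one draw.
boot_w = rng.binomial(n_w, calls_w / n_w, size=n_boot) / n_w
boot_b = rng.binomial(n_b, calls_b / n_b, size=n_boot) / n_b
boot_diff = boot_w - boot_b

# Percentile bootstrap 95% confidence interval for pw - pb
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])
margin = (ci_high - ci_low) / 2

print(f"mean bootstrap difference: {boot_diff.mean():.4f}")
print(f"95% CI for pw - pb: ({ci_low:.4f}, {ci_high:.4f})")
print(f"approximate margin of error: {margin:.4f}")
```

If the percentile interval for the difference stays above zero, the bootstrap agrees with the conclusion that the w callback rate is higher.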


### 4. Write a story describing the statistical significance in the context of the original problem.

The original problem asks whether racial discrimination affects the chance of being invited to an interview by an employer. This is not a new question; it has been a serious, long-running discussion, and I believe it is not limited to one company or country, so everyone in the workforce may face it. Even though we have been aware of the issue for a long time, it is hard to find a precise answer on whether racial discrimination really changes the outcome of job hunting for people of different races. In my view, other factors may also be related to this topic. The dataset for this experiment shows that every resume carries a set of skills and personal attributes (e.g. sex, address). As concern about racial discrimination grows, I would propose carrying out more experiments to discover how these other features correlate with race, for example the proportion of each group with an advanced educational level. More analysis of this type might reveal more of what lies behind the result.

Based on this experiment, we gain some insight into how the interview callback rates of the two groups differ. The two groups are defined only by how the names sound, so we assume that the race signaled by each name is perceived accurately enough for this experiment. Under that assumption, we conclude that there is a significant difference in callback rate between the two groups, regardless of how accurately each individual name is classified.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The original experiment successfully showed how the race associated with a name affects the success rate of interview callbacks. However, there is still NO substantial evidence that race/name is the most important factor affecting callback success. Race/name may be one of the factors that affect the callback rate, but we should look at other factors that might also be correlated with it. This dataset contains several features that appear related to the callback outcome, e.g. years of experience, computer skills, or sex. Based on these leads, we could examine the correlation between these features and the callback rate, and then carry out further statistical analysis to test how differently each feature affects it.
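
One simple first step toward that follow-up analysis is to break the callback rate down by another feature and by race at the same time. The sketch below is only an illustration: the helper `callback_rate_by` is hypothetical, and the toy rows are made up so the block runs on its own. With the notebook's `data` frame, the equivalent call would be `callback_rate_by(data, "sex")`, using the `sex`, `race` and `call` columns listed in `data.columns`.

```python
import pandas as pd


def callback_rate_by(df, feature):
    """Mean callback rate for each (feature value, race) combination."""
    return df.groupby([feature, "race"])["call"].mean().unstack("race")


# Made-up toy rows for illustration only; they are NOT from the study data
toy = pd.DataFrame({
    "race": ["w", "w", "b", "b", "w", "w", "b", "b"],
    "sex":  ["f", "f", "f", "f", "m", "m", "m", "m"],
    "call": [1, 0, 0, 0, 1, 1, 1, 0],
})

rates = callback_rate_by(toy, "sex")
print(rates)
```

A table like this only describes the data. Judging which factor matters most would need a model that considers the features jointly, such as a logistic regression of `call` on several of them.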

In [9]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')