
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# pd.read_stata is the public entry point for the Stata reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

#### Note that `data.info()` reports fewer non-null values for some fields than for others. Those fields contain NaNs, which we should account for in any further analysis that uses them. The `race` and `call` columns used here are complete, with 4870 non-null values each.

## The first problem we want to tackle is **which test is appropriate** for this problem?

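Both `race` and `call` are binary categorical variables, so the appropriate approach is to **estimate the difference between two proportions** and test it with a two-sample z-test. The CLT applies as long as the independence and sample-size conditions checked below hold.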

In [5]:
callbacks_b = sum(data[data.race == 'b'].call)
total_people_b = len(data[data.race == 'b'].call)

callbacks_w = sum(data[data.race == 'w'].call)
total_people_w = len(data[data.race == 'w'].call)

proportions = {'bn': {
                        'success':callbacks_b,
                        'n':total_people_b
                        },
              'wn':{
                        'success':callbacks_w,
                        'n':total_people_w}
              }

df = pd.DataFrame(proportions)
df['total'] = df.bn + df.wn

prop = df.T
prop['p_hat'] = prop.success.astype(float)/prop.n.astype(float)

# precompute p_hat * (1 - p_hat) / n, the per-group term under the square root of the standard error
prop['mid_step'] = (prop.p_hat * (1 - prop.p_hat) / prop.n)
prop

       n     success  p_hat     mid_step
bn     2435  157      0.064476  2.5e-05
wn     2435  235      0.096509  3.6e-05
total  4870  392      0.080493  1.5e-05
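
The same per-group summary can also be produced with a `groupby`. The sketch below is only an illustration: because the data file is not bundled here, it rebuilds a stand-in DataFrame (the hypothetical name `demo`) from the counts in the table above rather than reading the real dataset, and `summary` is likewise a name introduced for this example.

```python
import pandas as pd

# Stand-in for the real dataset, rebuilt from the counts reported above:
# 157 of 2435 'b' resumes and 235 of 2435 'w' resumes received a callback.
demo = pd.DataFrame({
    'race': ['b'] * 2435 + ['w'] * 2435,
    'call': [1.0] * 157 + [0.0] * 2278 + [1.0] * 235 + [0.0] * 2200,
})

# One row per group: number of callbacks and number of resumes.
summary = demo.groupby('race')['call'].agg(['sum', 'count'])
summary.columns = ['success', 'n']

# Sample proportion and the per-group term under the square root of the SE.
summary['p_hat'] = summary['success'] / summary['n']
summary['mid_step'] = summary['p_hat'] * (1 - summary['p_hat']) / summary['n']

print(summary)
```

With the real data, replacing `demo` by `data` should give the same `success`, `n` and `p_hat` values, since only the `race` and `call` columns are used.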


# Checking conditions

Before calculating the margin of error, we check the conditions for inference on a difference of two proportions.

1. **Independence**
    * names were assigned to the resumes at random, and there is no reason to believe the resumes are related in any way, so independence is a reasonable assumption.
    * each group has 2435 resumes, which is certainly less than 10% of the population of job applications, so this condition is also met
    
    
2. **Sample size/Skew**
    * n1p1 >= 10 and n1(1-p1) >= 10
    * n2p2 >= 10 and n2(1-p2) >= 10

In [6]:
race = ['bn','wn']
success_fail = {val: {
        'success': prop.p_hat[val] * prop.n[val], 
        'failure':((1-prop.p_hat[val]) * prop.n[val])
    } for val in race}

# .items() and print(...) work under both Python 2 and Python 3
for group, values in success_fail.items():
    if values['success'] >= 10 and values['failure'] >= 10:
        print("Condition is met for {0}, we have {1} successes and {2} failures".format(group, values['success'], values['failure']))
        

Condition is met for wn, we have 235.0 successes and 2200.0 failures
Condition is met for bn, we have 157.0 successes and 2278.0 failures


## Since we're already down this road, we might as well tackle the question of margin of error and confidence interval

In [7]:
point_estimate = prop.p_hat['wn'] - prop.p_hat['bn']
standard_error = (prop.mid_step['bn'] + prop.mid_step['wn']) ** 0.5
a1 = 0.68
a2 = 0.95
a3 = 0.997

z1 = stats.norm.ppf(1 - ((1-a1)/2))
z2 = stats.norm.ppf(1 - ((1-a2)/2))
z3 = stats.norm.ppf(1 - ((1-a3)/2))

z_vals = {a1:z1, a2:z2,a3:z3}

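# margin of error = critical z value * standard error, one entry per confidence level
ME = {level: z_score * standard_error for level, z_score in z_vals.items()}

CI = {level: (point_estimate - me_level, point_estimate + me_level) for level, me_level in ME.items()}
# sort by confidence level so the intervals print from narrowest to widest
for level, ci in sorted(CI.items()):
    print("CI for {0}% is {1}".format(level*100, ci))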

CI for 68.0% is (0.024292619971581962, 0.039773088447309209)
CI for 95.0% is (0.016777728181230755, 0.047287980237660412)
CI for 99.7% is (0.0089338501323753469, 0.055131858286515824)


### We are 95% confident that the callback rate for white-sounding names is *1.68 to 4.73 percentage points higher* than the callback rate for black-sounding names. 
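
As a self-contained check of the interval above, the calculation can be wrapped in a small function that works directly from the counts. This is a minimal sketch, not the notebook's own code: the function name `diff_in_proportions_ci` and its parameters are introduced here for illustration, and it uses the same unpooled standard error as the cell above.

```python
from math import sqrt

from scipy import stats


def diff_in_proportions_ci(success_1, n_1, success_2, n_2, level=0.95):
    """Normal-approximation CI for p_1 - p_2 using the unpooled standard error."""
    p_1 = success_1 / float(n_1)
    p_2 = success_2 / float(n_2)
    standard_error = sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    # two-sided critical value, e.g. about 1.96 for level=0.95
    z_star = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p_1 - p_2
    return diff - z_star * standard_error, diff + z_star * standard_error


# white-sounding names: 235 callbacks out of 2435; black-sounding: 157 out of 2435
lower, upper = diff_in_proportions_ci(235, 2435, 157, 2435)
print("95% CI: {0:.4f} to {1:.4f}".format(lower, upper))
```

Multiplying the endpoints by 100 gives the interval in percentage points, which is the unit used in the interpretation above.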

## Now about that hypothesis test.

H0: p_white - p_black = 0 (white-sounding and black-sounding names have the same callback rate)

HA: p_white - p_black != 0 (the callback rates for white-sounding and black-sounding names differ)

We will need to calculate a pooled proportion in order to continue this exercise

In [8]:
p_pool = (prop.success['bn'] + prop.success['wn']) / (prop.n['bn'] + prop.n['wn'])
p_pool

0.080492813141683772

Because the null hypothesis says both groups share a single callback rate, and we have no known population proportion, we use the pooled proportion as the best estimate of that rate when checking whether the conditions for a hypothesis test are met.

1. **Independence**: checked above.
2. **Success - Failure Condition:**

In [9]:
groups = ['bn', 'wn']
# under H0 both groups share one callback rate, so the expected counts use p_pool
success_fail = {group: {
        'success': p_pool * prop.n[group], 
        'failure': (1 - p_pool) * prop.n[group]
    } for group in groups}

for group in groups:
    values = success_fail[group]
    if values['success'] >= 10 and values['failure'] >= 10:
        print("Condition is met for {0}, we have {1:.1f} expected successes and {2:.1f} expected failures".format(group, values['success'], values['failure']))
        

Condition is met for bn, we have 196.0 expected successes and 2239.0 expected failures
Condition is met for wn, we have 196.0 expected successes and 2239.0 expected failures


Now that we have checked these conditions, we can calculate the **p-value**.

In [10]:
p_pool_bn = p_pool * (1-p_pool) / prop.n['bn'] 
p_pool_wn = p_pool * (1-p_pool) / prop.n['wn'] 
standard_error = np.sqrt(p_pool_bn + p_pool_wn)

z_score = (point_estimate - 0)/standard_error
z_score

4.1084121524343464
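
For reference, this z-score can be reproduced by hand from the values already shown. This is a worked sketch with rounded intermediate numbers; the symbols $n_b$, $n_w$, $\hat{p}_{pool}$ and $SE_{pool}$ are notation introduced here, not names from the code.

$$\hat{p}_{pool} = \frac{157 + 235}{2435 + 2435} = \frac{392}{4870} \approx 0.080493$$

$$SE_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)} \approx \sqrt{0.080493 \times 0.919507 \times \frac{2}{2435}} \approx 0.007797$$

$$z = \frac{(\hat{p}_w - \hat{p}_b) - 0}{SE_{pool}} \approx \frac{0.096509 - 0.064476}{0.007797} \approx 4.11$$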

Given that we have such a high z-score, we can tell that the p-value will be very close to 0.

In [11]:
p = (1-stats.norm.cdf(np.abs(z_score)))*2
p

3.9838868375774439e-05
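
As a cross-check that is not part of the original analysis, the same counts can be run through a chi-square test of independence. For a 2x2 table with no continuity correction, the chi-square statistic equals the square of the pooled z-score, so the p-value should agree with the one above. The sketch below uses `scipy.stats.chi2_contingency`; the array name `observed` is introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Rows: black-sounding, white-sounding names.
# Columns: callbacks, no callbacks (counts taken from the table earlier on).
observed = np.array([[157, 2278],
                     [235, 2200]])

# correction=False turns off Yates' continuity correction so that the
# statistic matches the square of the two-proportion z-score.
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print("chi2 = {0:.3f}, dof = {1}, p-value = {2:.3g}".format(chi2, dof, p_value))
print("sqrt(chi2) = {0:.4f}".format(np.sqrt(chi2)))
```

The square root of the statistic should come out at about 4.11, the z-score computed earlier.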

## Discuss statistical significance

##### Because the p-value (about 0.00004) is far below the conventional alpha = 0.05 threshold, we reject the null hypothesis in favor of the alternative: there is a statistically significant difference between the callback rates for white-sounding and black-sounding names. Because the names were assigned to the resumes at random, this difference can reasonably be attributed to the name on the resume rather than to other resume characteristics.