# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [4]:
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [6]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
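<p>1. A two-sample z-test for the difference between two proportions is appropriate: the outcome is binary (call or no call), and each group contains over 2,400 resumes.
    For the central limit theorem to apply, the observations must be random and independent, and the sample must be large enough that the sampling distribution of the proportion is approximately normal. For a binary outcome, the usual rule of thumb is at least 10 successes and 10 failures in each group, rather than n >= 30 alone. In this sample, names were randomly assigned to resumes, so there is no reason to believe that the sampling was non-random; the population is large relative to the sample, so independence can be assumed; and each group has far more than 10 callbacks and 10 non-callbacks, so normality of the underlying data is not required. The CLT therefore does apply.</p>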
</div>
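
The sample-size condition can be checked directly. The sketch below is illustrative only: the counts are assumptions back-calculated from outputs printed in this notebook (235 white-sounding callbacks at a rate of about 0.0965 implies roughly 2,435 resumes per group, and the observed difference of about 0.032 implies roughly 157 black-sounding callbacks), and `meets_success_failure` is a hypothetical helper, not part of the original analysis.

```python
# Assumed counts, back-calculated from outputs shown in this notebook
# (not read from the data file).
n_white, calls_white = 2435, 235
n_black, calls_black = 2435, 157


def meets_success_failure(successes, n, threshold=10):
    """Return True if both successes and failures reach the rule-of-thumb threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for label, calls, n in [("white", calls_white, n_white), ("black", calls_black, n_black)]:
    print(label, "callbacks:", calls, "no callback:", n - calls,
          "condition met:", meets_success_failure(calls, n))
```

With the real data loaded, the same check could be run on `data[data.race=='w'].call` and `data[data.race=='b'].call` instead of the assumed counts.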

<div class="span5 alert alert-success">
<p>2. The null hypothesis is that the mean of "call", which is equal to the proportion of applicants who receive a call, will be equal for applicants with typically white names and applicants with typically black names. The alternative hypothesis is that they are unequal.
    An alternative way of stating the null hypothesis is that the difference between the call-back likelihood for those with white-sounding names and those with black-sounding names is zero, with an alternative hypothesis that the difference is not zero. This equivalent set of hypotheses is what will be used for the calculations below.</p>
</div>

In [8]:
w = data[data.race=='w']
b = data[data.race=='b']

# 3. Margin of Error/Confidence Interval and P-values

## Bootstrap method

For the resampling approach, a permutation test is used: the callback outcomes of both groups are pooled and randomly reshuffled 10,000 times, which simulates a world in which race has no effect on callbacks. The difference in callback rates in each permutation is then compared to the difference observed in the data to determine how likely a result at least as extreme would be by chance alone.

The following two functions are defined to facilitate the bootstrap permutations.

In [54]:
def create_permutation(d1, d2):
    """Takes two arrays as parameters. Scrambles the values of the arrays and returns two permuted arrays"""

    # Permute the combination of the two datasets
    permuted_data = np.random.permutation(np.concatenate((d1, d2)))

    # Split the permuted array into two permutations of the sizes of the original arrays
    p1 = permuted_data[:len(d1)]
    p2 = permuted_data[len(d1):]

    return p1, p2


def perm_mean_diff(d1, d2, reps=10000):
    """Takes two arrays and the specified number of replicates as parameters. Returns an array of the specified size
    with the difference in means between the replicates"""

    # Create the array of replicates
    perm_replicates = np.empty(reps)

    # For each item in the array, create a replicate
    for i in range(reps):
        # Generate a permutation using the create_permutation function
        p1, p2 = create_permutation(d1, d2)

        # Compute the difference in means between the two permuted groups
        perm_replicates[i] = np.mean(p1) - np.mean(p2)

    return perm_replicates

Now, the functions are used to generate 10,000 replicates using the observed data.

In [63]:
white_calls = w.call.values.astype(int)
black_calls = b.call.values.astype(int)
wb_mean_replicates = perm_mean_diff(white_calls, black_calls)

The wb_mean_replicates array contains 10,000 values for what the difference between white-sounding and black-sounding callback rates might have been by random chance if their true population callback rates were identical. Not one of the 10,000 replicates is as extreme as the observed result, suggesting that it is extremely unlikely that the callback rates differed due to chance alone.

In [87]:
observed_rate_diff = np.mean(white_calls) - np.mean(black_calls)
print('observed difference in callback rates: {}'.format(observed_rate_diff))

# The p-value is the proportion of simulated differences at least as extreme, in either direction, as the observed difference
p = np.sum(np.abs(wb_mean_replicates) >= abs(observed_rate_diff)) / len(wb_mean_replicates)
print('p-value: {}'.format(p))

observed difference in callback rates: 0.032032854209445585
p-value: 0.0
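
A simulated p-value of exactly zero only means that no replicate was as extreme as the observed difference; the true p-value is small but not zero. One common convention, sketched below, adds one to both the numerator and the denominator so the estimate can never be exactly zero. This is a minimal sketch under stated assumptions: the 0/1 arrays are rebuilt from counts back-calculated from this notebook's outputs (235 and 157 callbacks out of about 2,435 resumes each), the seed is arbitrary, and 2,000 permutations are used only to keep the example quick.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Assumed counts, back-calculated from outputs shown in this notebook
white_calls = np.array([1] * 235 + [0] * (2435 - 235))
black_calls = np.array([1] * 157 + [0] * (2435 - 157))

observed = white_calls.mean() - black_calls.mean()
pooled = np.concatenate((white_calls, black_calls))
n_white = len(white_calls)

reps = 2000  # fewer than the 10,000 used above, to keep the sketch fast
count_extreme = 0
for _ in range(reps):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_white].mean() - shuffled[n_white:].mean()
    # Two-sided: count differences at least as large in either direction
    if abs(diff) >= abs(observed):
        count_extreme += 1

# Add-one correction: the observed labeling counts as one permutation,
# so the estimated p-value can never be exactly zero.
p_two_sided = (count_extreme + 1) / (reps + 1)
print('two-sided permutation p-value: {}'.format(p_two_sided))
```

If no permutation is as extreme as the observed difference, this reports 1 / (reps + 1), which reads as an upper bound on the p-value rather than a claim that it is zero.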


Next, the margin of error and confidence interval are calculated. These can be estimated with the bootstrap: each group is repeatedly resampled with replacement from its existing sample, and the difference in callback rates is recomputed for each pair of resamples. The following function is defined to facilitate this replicate sampling.

In [73]:
def draw_replicates(d, func, size=10000):
    """The data, function to be tested, and number of replicates are passed in as function parameters. An array of the specified
    size is returned, with each element equal to the given func applied to a random bootstrap sample of the data."""
    replicates = np.empty(size)
    for i in range(size):
        replicates[i] = func(np.random.choice(d, size=len(d)))
        
    return replicates

Next, 10,000 replicates are drawn for both the black- and white-sounding callback rates. The difference between each pair of draws is one of the 10,000 simulated differences.

In [75]:
white_rep = draw_replicates(white_calls, np.mean)
black_rep = draw_replicates(black_calls, np.mean)

bootstrap_diff = white_rep - black_rep

The 95% confidence interval for the difference between white and black callback rates spans the 2.5th through 97.5th percentiles of the bootstrapped differences.

In [80]:
ci = np.percentile(bootstrap_diff, [2.5, 97.5])
print('the 95% CI for the difference between white and black callback rates is: [{}, {}]'.format(ci[0], ci[1]))
print('the mean is {} and the margin of error is {}'.format(np.mean(bootstrap_diff), np.mean(bootstrap_diff)-ci[0]))

the 95% CI for the difference between white and black callback rates is: [0.01683778234086243, 0.047227926078028754]
the mean is 0.032020739219712525 and the margin of error is 0.015182956878850094
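
For 0/1 data, resampling n values with replacement and taking the mean is equivalent to drawing a binomial count with the sample proportion as its probability and dividing by n. The sketch below uses that shortcut to reproduce the bootstrap interval without a Python loop. It is an illustration, not the method used above: the counts are assumptions back-calculated from this notebook's outputs, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Assumed counts, back-calculated from outputs shown in this notebook
n = 2435
calls_white, calls_black = 235, 157
size = 10000

# Each draw is the callback rate of one bootstrap resample of n resumes
white_rep = rng.binomial(n, calls_white / n, size=size) / n
black_rep = rng.binomial(n, calls_black / n, size=size) / n
bootstrap_diff = white_rep - black_rep

ci_low, ci_high = np.percentile(bootstrap_diff, [2.5, 97.5])
print('95% CI for the difference in callback rates: [{}, {}]'.format(ci_low, ci_high))
```

The resulting interval should land close to the one printed above, with small differences due to random resampling.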


## Frequentist Method

For the frequentist method, a z-test can be used as sample sizes are sufficiently large. As above, the null hypothesis is that the difference between the white- and black-sounding callback rates is zero, and the alternative hypothesis is that the difference is not equal to zero.

First, the white and black callback rates, the variance of each sample proportion, and the variance and standard deviation of the difference between the callback rates are calculated.

In [36]:
# Calculate the mean white and black callback rates and the difference between the two
w_mean = np.mean(w.call)
b_mean = np.mean(b.call)
w_b_diff = w_mean - b_mean

# Calculate the variance of each sample proportion: p * (1 - p) / n
w_var = (w_mean * (1 - w_mean))/len(w)
b_var = (b_mean * (1 - b_mean))/len(b)

# The variance of the difference of two independent sample proportions is the sum of their variances
w_b_var_diff = w_var + b_var

# The standard deviation (standard error) of the difference is the square root of its variance
w_b_sd_diff = np.sqrt(w_b_var_diff)

Next, a margin of error is computed for the difference in callback rates. The value 1.96 is the number of standard deviations from the mean of a normal distribution that bounds the central 95% of its area; multiplying it by the standard deviation of the difference gives the 95% margin of error.

In [82]:
margin_of_error = 1.96 * w_b_sd_diff

In [81]:
print('difference in means: {}\nmargin of error: {}'.format(w_b_diff, margin_of_error))
print('95% confidence interval: [{}, {}]'.format(w_b_diff - margin_of_error, w_b_diff + margin_of_error))

difference in means: 0.03203285485506058
margin of error: 0.015255406348684322
95% confidence interval: [0.016777448506376254, 0.0472882612037449]
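
The critical value 1.96 does not have to be hard-coded; it can be derived from the normal distribution with `scipy.stats.norm.ppf`. The sketch below recomputes the margin of error that way. The counts are assumptions back-calculated from this notebook's outputs rather than read from the data file.

```python
import numpy as np
from scipy import stats

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
confidence = 0.95
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Assumed counts, back-calculated from outputs shown in this notebook
n = 2435
p_w, p_b = 235 / n, 157 / n

# Standard error of the difference between two independent sample proportions
se_diff = np.sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)
margin = z_crit * se_diff

print('critical value: {}'.format(z_crit))
print('margin of error: {}'.format(margin))
```

Changing `confidence` to another level, such as 0.99, gives the corresponding critical value and margin of error.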


Next, the p-value must be calculated. The z-score is the number of standard deviations that the difference in means, ~.032, is away from zero.

In [88]:
z_score = w_b_diff/w_b_sd_diff
p_value = stats.norm.sf(abs(z_score))*2
print('p-value: {}'.format(p_value))

p-value: 3.86256381290969e-05
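
The calculation above uses separate variance estimates for each group. A common textbook variant of the two-proportion z-test instead uses a pooled callback rate for the standard error, since the null hypothesis states that both groups share one rate. The sketch below shows that variant for comparison; it is not the calculation used above, and the counts are assumptions back-calculated from this notebook's outputs.

```python
import numpy as np
from scipy import stats

# Assumed counts, back-calculated from outputs shown in this notebook
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b

# Under the null hypothesis both groups share one callback rate,
# estimated by pooling all callbacks over all resumes.
p_pooled = (calls_w + calls_b) / (n_w + n_b)
se_pooled = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_w + 1 / n_b))

z_pooled = (p_w - p_b) / se_pooled
p_value_pooled = 2 * stats.norm.sf(abs(z_pooled))  # two-sided
print('pooled z-score: {}'.format(z_pooled))
print('pooled p-value: {}'.format(p_value_pooled))
```

With samples this large the pooled and unpooled versions give very similar results, so the conclusion does not depend on which one is used.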


In [90]:
w_mean

0.09650924056768417

<div class="span5 alert alert-success">
<p>4. The extremely low p-value means that, if callback rates were truly equal for the two groups, a difference at least as large as the one observed would almost never arise by chance alone. Because names were randomly assigned to the resumes, this is strong evidence that resumes with black-sounding names are less likely to receive a callback than those with white-sounding names. The observed difference in callback rates, 3.2 percentage points, is also practically significant: the white callback rate is 9.65%, so a resume with a black-sounding name is roughly one-third less likely to receive a call than one with a white-sounding name.
    </p>
    <p>
5. This does not suggest that race/name is the most important factor in callback success. Several factors were untested in this analysis; it is likely that educational attainment, job experience, or other factors are also significant predictors of callback success. This analysis suggests that race/name is an important factor in callback success, but says nothing about its importance relative to other factors.
To test whether race/name is the most important factor, a similar analysis would need to be repeated for the other factors. The impact that each of those factors has on callback rates would then need to be compared to the impact of race/name to determine whether race/name is the most important.</p>
</div>