# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
len(data)

4870

In [6]:
data['race']

0       w
1       w
2       b
3       b
4       w
5       w
6       w
7       b
8       b
9       b
10      b
11      w
12      b
13      w
14      b
15      w
16      w
17      b
18      w
19      b
20      b
21      w
22      w
23      w
24      w
25      b
26      b
27      w
28      b
29      b
       ..
4840    b
4841    b
4842    b
4843    w
4844    b
4845    w
4846    w
4847    w
4848    b
4849    b
4850    b
4851    w
4852    w
4853    b
4854    w
4855    w
4856    b
4857    b
4858    b
4859    b
4860    w
4861    w
4862    w
4863    w
4864    b
4865    b
4866    b
4867    w
4868    b
4869    w
Name: race, dtype: object

In [7]:
data['call']

0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
5       0.0
6       0.0
7       0.0
8       0.0
9       0.0
10      0.0
11      0.0
12      0.0
13      0.0
14      0.0
15      0.0
16      0.0
17      0.0
18      0.0
19      0.0
20      0.0
21      0.0
22      0.0
23      0.0
24      0.0
25      0.0
26      0.0
27      0.0
28      0.0
29      0.0
       ... 
4840    0.0
4841    0.0
4842    0.0
4843    1.0
4844    0.0
4845    0.0
4846    1.0
4847    1.0
4848    1.0
4849    0.0
4850    0.0
4851    0.0
4852    0.0
4853    0.0
4854    0.0
4855    0.0
4856    0.0
4857    0.0
4858    0.0
4859    1.0
4860    0.0
4861    1.0
4862    0.0
4863    0.0
4864    0.0
4865    0.0
4866    0.0
4867    0.0
4868    0.0
4869    0.0
Name: call, dtype: float32

In [14]:
black = data[data.race=='b'].call
white = data[data.race=='w'].call
black_arr = np.array(black)
white_arr = np.array(white)

In [27]:
# print(black, white, black_arr, white_arr)
# print(len(black == 0))
print(black.value_counts())
print(white.value_counts())

0.0    2278
1.0     157
Name: call, dtype: int64
0.0    2200
1.0     235
Name: call, dtype: int64


## 1. What test is appropriate for this problem? Does CLT apply?

Because the outcomes are binary (each résumé either did or did not receive a call for interview), each observation is a Bernoulli trial, and the number of callbacks in each group follows a binomial distribution. We can therefore compare the two population proportions either with a 95% confidence interval or with a hypothesis test (a two-sample z-test for proportions). The Central Limit Theorem does apply: the names were assigned to résumés at random, and each group has far more than 10 callbacks and 10 non-callbacks (157 and 2278 for black-sounding names, 235 and 2200 for white-sounding names), so over repeated random samples the sample proportion of callbacks would be approximately normally distributed.
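
As a quick check on the claim that the normal approximation is reasonable, the sketch below applies the common "at least 10 successes and 10 failures" rule of thumb to the counts printed by `value_counts()` above. The helper name `normal_approx_ok` and the threshold of 10 are illustrative assumptions, not part of the original exercise.

```python
# Counts copied from the value_counts() output earlier in the notebook.
n_per_group = 2435
callbacks = {"b": 157, "w": 235}


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb: need at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for race, k in callbacks.items():
    print(race, "callbacks:", k, "non-callbacks:", n_per_group - k,
          "normal approximation ok:", normal_approx_ok(k, n_per_group))
```

Both groups pass comfortably, which is why a z-based approach is used in the rest of the analysis.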


## 2. What are the null and alternate hypotheses?

The null hypothesis, $ H_0 $, is that there is NO difference between the way that white-sounding and black-sounding names are treated, which is to say that the difference between the callback proportions of the two populations is 0: 

$$ P_1 - P_2 = 0 $$

which would also entail that the sampling distribution of the difference between the two sample proportions is centred on 0, so that on average:

$$ \bar P_1 - \bar P_2 = 0 $$ 

The alternate hypothesis is that there IS a difference, such that:

$$ P_1 - P_2 \neq 0 $$

We want to test this at a significance level, $ \alpha $, of 5%. If we assume $ H_0 $, this means we want to figure out the probability of a difference at least as large as the one in our sample occurring given that $ H_0 $ is true. This probability is known as the P-value, and we reject $ H_0 $ if it is below 5%:

$$ P (|\bar P_1 - \bar P_2| \geq \text{observed difference} \, | \, H_0) < 5\% $$
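
To make the idea of "the probability of our result given $ H_0 $" concrete, here is a minimal simulation sketch. It assumes that under $ H_0 $ the 392 callbacks are spread across the 4870 résumés without regard to name, so the number landing in the black-sounding group is hypergeometric (equivalent to reshuffling the race labels). The seed and the number of simulations are arbitrary choices, and this is an illustration rather than the method used in the rest of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility

n_per_group = 2435
calls_b, calls_w = 157, 235          # observed callbacks per group
total_calls = calls_b + calls_w
total_resumes = 2 * n_per_group
observed_gap = calls_w - calls_b     # gap in callback counts (78)

# Under H0, callbacks land in the black-sounding group purely by chance.
sim_calls_b = rng.hypergeometric(
    ngood=total_calls,
    nbad=total_resumes - total_calls,
    nsample=n_per_group,
    size=100_000,
)
sim_gap = total_calls - 2 * sim_calls_b  # simulated (white - black) counts

# Two-sided: how often is a simulated gap at least as large as the observed one?
p_value_sim = np.mean(np.abs(sim_gap) >= observed_gap)
print("simulated P-value:", p_value_sim)
```

A simulated P-value far below 0.05 would point the same way as a z-test: a gap this large is very unlikely if names made no difference.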


## 3. Compute margin of error, confidence interval, and p-value.

### Margin of error

The margin of error will be similar for both samples, as they have the same sample size. Let's look at the margin of error for the sample of black-sounding names.

If we score an interview callback with a 1, and a non-callback with a zero, in our sample of 2435 resumes we have 157 ones and 2278 zeros. The estimated probability of getting a callback is then: ((157 x 1) + (2278 x 0)) / 2435 = 0.064 = 6.4%

The sample variance $ S^2 $ of this sample is: (157(1 - 0.064)$^2$ + 2278(0 - 0.064)$^2$) / (2435 - 1) = 0.0603

The standard deviation $ S $ of the sample is the square root of this number, which is 0.25.
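
The hand calculation above can be checked with NumPy. This sketch rebuilds the 0/1 outcomes for black-sounding names from the counts (157 ones, 2278 zeros) rather than reading the data file, so it runs on its own; `ddof=1` gives the sample variance with the $ n - 1 $ denominator used in the text.

```python
import numpy as np

# Rebuild the 0/1 callback outcomes for black-sounding names from the counts.
black_calls = np.concatenate([np.ones(157), np.zeros(2278)])

p_hat = black_calls.mean()             # sample proportion of callbacks
sample_var = black_calls.var(ddof=1)   # sample variance, divides by n - 1
sample_std = black_calls.std(ddof=1)   # sample standard deviation

print(f"proportion = {p_hat:.3f}")
print(f"variance   = {sample_var:.3f}")
print(f"std dev    = {sample_std:.3f}")
```

In the notebook itself, the same quantities could be obtained from `black_arr` defined earlier.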

Because the sample is large, we can assume that the standard deviation of the sample is approximately equal to the standard deviation of the actual population. In line with the Central Limit Theorem, we can then estimate the standard deviation of the sampling distribution of the sample mean:

$$ \sigma_{\bar x} = \frac{\sigma}{\sqrt{2435}} \approx \frac{S}{\sqrt{2435}} \approx \frac{0.25}{49.35} \approx 0.005 $$

To find the margin of error of the mean of our sample $ \bar x $ we need to ask ourselves: within what margin or interval of the sample mean can we be reasonably confident, which is to say 95.4% confident, that the true mean $\mu$ of the actual population will fall? We know that 95.4% of the values in a normal distribution fall within two standard deviations of the mean, so using the CLT, we are able to say that:

$$ p ( \mu \text{ is within } 2\sigma_{\bar x} \text{ of } \bar x) = 95.4\% $$

As $ \sigma_{\bar x} = 0.005 $, $ 2\sigma_{\bar x} = 0.01 $, which is 1 percentage point. So our margin of error is 1 percentage point. Thus we can be 95.4% confident that the true population mean falls within 1 percentage point of 6.4%, that is to say:

$$ 5.4\% \, \lt \, \mu \, \lt \, 7.4\% $$
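
The margin-of-error steps above can be sketched in a few lines. The code follows the text in using two standard errors (roughly 95.4% confidence) and the sample standard deviation with the $ n - 1 $ denominator; the variable names are illustrative.

```python
import math

n = 2435
p_hat = 157 / n                                     # callback rate, black-sounding names

# Sample standard deviation of 0/1 data, with the n - 1 denominator.
s = math.sqrt(p_hat * (1 - p_hat) * n / (n - 1))
std_error = s / math.sqrt(n)                        # std dev of the sample mean

margin = 2 * std_error                              # two standard errors (~95.4%)
low, high = p_hat - margin, p_hat + margin

print(f"standard error  = {std_error:.4f}")
print(f"margin of error = {margin:.3f}")
print(f"interval        = {low:.3f} to {high:.3f}")
```

Swapping the factor of 2 for 1.96 would give the conventional 95% interval, which is marginally narrower.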


### Confidence Interval

Clearly we used a confidence interval to compute the margin of error, above (a margin of error is in fact just another way of expressing a confidence interval). But what we're really interested in here is using a confidence interval to express the degree to which we can be certain that there is a significant difference between the outcomes for our two sample groups of black-sounding and white-sounding names. 

In other words, with what level of confidence can we confirm or reject the null hypothesis outlined above that there is no significant difference between the outcomes of the two groups?

$$ P_1 - P_2 = 0 $$

As affirmed above, this hypothesis entails that the mean of the sampling *proportion* of samples of the two groups should also equal 0:

$$ \bar P_1 - \bar P_2 = 0 $$

As this is a hypothesis about a sample distribution of a sample mean (albeit a sample of proportions, rather than absolute sizes) we can calculate a confidence interval much as we did for the margin of error, above. 

Right now, the observed difference between the sample proportions is not zero: the two callback rates are 0.0645 (black-sounding) and 0.0965 (white-sounding), a gap of 0.0965 - 0.0645 = 0.032. So what is our 95% confidence interval here? What's the interval $ d $ such that we can be 95% confident that the true difference between the two groups will fall within $ d $ of 0.032 (or 3.2 percentage points)?

To find this interval we can use a $ z $ table. This is a two-tailed interval (i.e. it stretches out to both sides from the mean), so we want to find the 97.5% confidence value in a normal distribution, and apply it to both sides of our mean. Looking up 97.5% on the table gives us a $ z $ score of 1.96, so:

$$ d = 1.96 \, * \, \sigma(\bar p_1 - \bar p_2) $$

where $ \sigma(\bar p_1 - \bar p_2) $ is the standard deviation of the sampling distribution of the difference between the two sample proportions. To calculate *that* we first calculate the variance of each sample proportion using the formula for the variance of a proportion:

$$ \sigma_{\bar p}^2 = \frac{P_x (1-P_x)}{2435} $$

then add the result for $\bar p_1$ to the result for $\bar p_2 $ (the variance of a difference of two independent quantities is the sum of their variances) to give us the variance of the sampling distribution of the difference, and then take the square root of that figure to give us the standard deviation. As our sample size is quite large, we can estimate the true population proportions $P_1$ and $P_2$ using our sample proportions (one of which we already calculated above when looking at the margin of error):

$$ \sigma(\bar p_1 - \bar p_2) \approx \sqrt{\frac{P_1 (1 - P_1)}{2435} + \frac{P_2 (1 - P_2)}{2435}} $$

If we plug in the numbers, we get a result of about 0.0078, and so we can now calculate $ d $:

$$ d \, = \, 1.96 \, * \, 0.0078 \, \approx \, 0.015 $$

So we have our answer: our 95% confidence interval extends 0.015 either side of the observed difference, for a total width of about 0.030. So we can be 95% confident that the true difference between the callback proportions of the two groups of black-sounding and white-sounding names is within 1.5 percentage points of 3.2%, which is to say that: 

$$ 1.7\% \, \lt \, |P_1 - P_2| \, \lt \, 4.7\% $$

As zero lies outside this range, we can reject the null hypothesis at the 5% significance level.
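
As a cross-check on the interval, the sketch below recomputes it directly from the callback counts, using `scipy.stats.norm.ppf` in place of a printed $ z $ table. Because it works from unrounded values, its figures may differ in the last digit or so from hand-rounded arithmetic. The variable names are illustrative.

```python
import math
from scipy import stats

n = 2435
p_b = 157 / n   # callback rate, black-sounding names
p_w = 235 / n   # callback rate, white-sounding names
diff = p_w - p_b

# Variances of the two sample proportions add for their difference.
se_diff = math.sqrt(p_b * (1 - p_b) / n + p_w * (1 - p_w) / n)

z_crit = stats.norm.ppf(0.975)   # two-tailed 95% critical value, about 1.96
margin = z_crit * se_diff
ci = (diff - margin, diff + margin)

print(f"difference     = {diff:.4f}")
print(f"standard error = {se_diff:.4f}")
print(f"95% CI         = ({ci[0]:.4f}, {ci[1]:.4f})")
```

If the lower bound stays above zero, the interval excludes "no difference", which is the basis for rejecting $ H_0 $.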

### P-Value

Flipping this round, we can calculate the P-value of getting a difference between two samples of at least 3.2 percentage points by plugging our result into the formula for a $ z $ statistic and consulting the $ z $ table again:

$$ z = \frac{\text{statistic} - \text{hypothesized value}}{\text{estimated standard error of the statistic}} $$

which is: 

$$ z = \frac{(\bar p_1 - \bar p_2) - \mu_p}{\sigma(\bar p_1 - \bar p_2)} $$

where $ \mu_p $ is the hypothesized difference under $ H_0 $, namely 0. With an observed difference of 0.032 and a standard error of about 0.0078, this is: 

$$ z = \frac{0.032 \, - \, 0}{0.0078} \approx 4.1 $$

We already know that our critical $ z $ value for a 95% confidence interval is 1.96. 4.1 is way beyond this, so extreme that it is off the end of most printed $ z $ tables, so we can confidently say that the P-value, the probability of getting a difference between the two samples of at least 3.2 percentage points if the null hypothesis were true, is less than 5% (and in fact less than 1%), and we can again reject the null hypothesis.
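
Rather than reading the P-value off a table, it can be computed with `scipy.stats.norm.sf`. The sketch below uses the unrounded counts, so its $ z $ value may differ slightly from hand-rounded arithmetic. It also shows the pooled standard error, a common variant for hypothesis tests on two proportions; that variant is an addition for comparison, not something the text above uses.

```python
import math
from scipy import stats

n = 2435
p_b, p_w = 157 / n, 235 / n
diff = p_w - p_b

# Unpooled standard error, as in the confidence-interval calculation.
se_unpooled = math.sqrt(p_b * (1 - p_b) / n + p_w * (1 - p_w) / n)
z = diff / se_unpooled
p_value = 2 * stats.norm.sf(abs(z))          # two-tailed P-value

# Pooled variant: under H0 both groups share one callback rate.
p_pool = (157 + 235) / (2 * n)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z_pooled = diff / se_pooled
p_value_pooled = 2 * stats.norm.sf(abs(z_pooled))

print(f"z (unpooled) = {z:.2f}, P-value = {p_value:.1e}")
print(f"z (pooled)   = {z_pooled:.2f}, P-value = {p_value_pooled:.1e}")
```

Both versions lead to the same decision here, since either P-value is far below 0.05.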


## 4. Write a story describing the statistical significance in the context of the original problem.

Now that all the above calculations have been done, we can write our story. When we examined the effect of racially inflected names on the interview callback rate for identical resumés submitted under black-sounding and white-sounding names, we found that the black-sounding names had a callback rate of 6.45%, and the white-sounding names had a callback rate of 9.65%. If race had no effect on callbacks, the chance of getting a difference of this magnitude between our two samples (with their substantial sample sizes of 2,435 resumés each) would be less than 1%. We can in addition be 95% confident that resumés with white-sounding names are between roughly 1.7 and 4.7 percentage points more likely to receive an interview callback than resumés with black-sounding names, all other things being equal.


## 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The analysis suggests that race/name is a factor in callback success, but it does not show that it is the most important factor. Even at the upper end of our 95% confidence interval, the difference it makes to the callback rate is less than 5 percentage points, and we have not compared it with any other factor. To get a clearer picture of the other factors at play and their relative influence, we would need to do a series of similar comparative analyses isolating the other factors in the dataset, which include education, location, age, gender, military experience and previous employment experience.
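
One way to start on those comparisons is a small helper that tabulates the callback rate for each level of a chosen column. The sketch below is only an illustration: `callback_rate_by` is a hypothetical helper, and the tiny `toy` frame is made-up data standing in for the real dataset so that the example runs on its own. In the notebook, the same function could be called on `data` with a column such as `'military'` or `'education'`.

```python
import pandas as pd


def callback_rate_by(df, column, outcome="call"):
    """Callback rate and group size for each value of `column`."""
    return df.groupby(column)[outcome].agg(rate="mean", n="size")


# Made-up toy data, for illustration only (not taken from the study).
toy = pd.DataFrame({
    "military": [0, 0, 0, 0, 1, 1],
    "call":     [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

rates = callback_rate_by(toy, "military")
print(rates)
```

Comparing such tables across factors, or fitting a logistic regression with several factors at once, would give a sense of their relative influence on callbacks.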
