# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

### Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set()
sns.set_style("whitegrid")
palette = sns.diverging_palette(220, 20, sep = 20, n = 4)
sns.set_palette(palette)

In [2]:
df = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
df = df[['race', 'call']]

### Question 1: What test is appropriate for this problem? Does CLT apply?

According to the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution:
> The Bernoulli distribution is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p, that is, the probability distribution of any single experiment that asks a yes–no question; the question results in a boolean-valued outcome, a single bit of information whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q.


According to Binomial distribution: https://en.wikipedia.org/wiki/Binomial_distribution
> The binomial distribution, denoted as B(n, p), with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own boolean-valued outcome: a random variable containing single bit of information: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p). A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

> Also, since B(n, p) is a sum of n independent, identically distributed Bernoulli variables with parameter p. This fact is the basis of a hypothesis test, a "proportion z-test", for the value of p using x/n, the sample proportion and estimator of p, in a common test statistic.

According to Central limit theorem: https://en.wikipedia.org/wiki/Central_limit_theorem
> When independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

According to Z Test: Definition & Two Proportion Z-Test: http://www.statisticshowto.com/z-test/
> This tests for a difference in proportions. A two proportion z-test allows you to compare two proportions to see if they are the same.
* The null hypothesis (H0) for the test is that the proportions are the same.
* The alternate hypothesis (H1) is that the proportions are not the same.


In this question:
1. Each resume's callback is a Bernoulli random variable that takes the value 1 or 0.
2. The number of callbacks among n independent resumes is binomially distributed.
3. Although that count is binomially distributed, the central limit theorem says that the properly normalized sum of callbacks tends toward a normal distribution (informally a "bell curve") even though the individual variables are not normally distributed.

**So yes, the CLT applies, and a two-proportion Z-test is appropriate for this problem.**
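
A common rule of thumb for the normal approximation is that each group should contain at least about 10 successes and 10 failures. The following is a minimal sketch of that check, assuming the callback counts hard-coded in the frequentist cell of this notebook (235 and 157 callbacks out of 2435 resumes per group). The helper name `clt_conditions_met` and the threshold of 10 are illustrative choices, not part of the original analysis.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Return True when both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Counts assumed from the hard-coded proportions 235/2435 and 157/2435.
groups = {"w": (235, 2435), "b": (157, 2435)}
for race, (calls, n) in groups.items():
    print(race, clt_conditions_met(calls, n))
```

Both groups comfortably exceed the threshold, which supports using a z-test here.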

### Question 2: What are the null and alternate hypotheses?
#### The null hypothesis for the test is that the callback proportions for black-sounding and white-sounding names are the same.


#### The alternate hypothesis for the test is that the callback proportions for black-sounding and white-sounding names are **not** the same.
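
The two hypotheses can also be written in symbols. Here $p_w$ and $p_b$ denote the callback probabilities for white-sounding and black-sounding names; this notation is introduced only for illustration.

$$
H_0:\; p_w - p_b = 0 \qquad\qquad H_1:\; p_w - p_b \neq 0
$$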

### Question 3: Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [4]:
df_w = df[df.race == 'w']
df_w = df_w.drop(['race'], axis = 1)
df_b = df[df.race == 'b']
df_b = df_b.drop(['race'], axis = 1)

#### Bootstrapping Approach (permutation test):

In [5]:
def permutation_sample(data1, data2):
    """Pool both samples, shuffle, and split back into the original sizes."""
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

def draw_perm_reps(data_1, data_2, func, size = 1):
    """Compute `func` on `size` permutation samples of the two data sets."""
    perm_replicates = np.empty(size)

    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

def diff_of_proportion(data_1, data_2):
    """Difference between the callback proportions of two 0/1 samples."""
    return np.sum(data_1) / len(data_1) - np.sum(data_2) / len(data_2)

empirical_diff_proportion = diff_of_proportion(df_w['call'], df_b['call'])

# Pass the 1-D 'call' arrays (not the DataFrames) so each replicate is a scalar.
perm_replicates = draw_perm_reps(df_w['call'].values, df_b['call'].values,
                                 diff_of_proportion, size = 1000)

# One-sided p-value: share of permuted differences at least as large as the observed one.
p = np.sum(perm_replicates >= empirical_diff_proportion) / len(perm_replicates)
print('Bootstrapping Approach p value:', p)

Bootstrapping Approach p value: 0.0


In [6]:
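# Need to revisit: these percentiles come from the permutation replicates, so they
# describe the 95% range of differences expected under the null hypothesis,
# not a confidence interval around the observed difference.
conf_int = np.percentile(perm_replicates, [2.5, 97.5])
print('Bootstrapping Approach confidence interval:', conf_int)

Bootstrapping Approach confidence interval: [-0.01478439  0.01480492]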
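
The interval above is built from permutation replicates, which are centered on zero. A percentile bootstrap instead resamples each group with replacement, so its interval is centered near the observed difference. The sketch below is one possible way to do that, not the notebook's original method. The function name `bootstrap_diff_ci`, the fixed seed, and the synthetic 0/1 arrays (built from the counts 235/2435 and 157/2435 that appear in the frequentist cell) are assumptions made for illustration.

```python
import numpy as np

def bootstrap_diff_ci(calls_1, calls_2, size=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(calls_1) - mean(calls_2)."""
    rng = np.random.default_rng(seed)
    calls_1 = np.asarray(calls_1, dtype=float)
    calls_2 = np.asarray(calls_2, dtype=float)
    replicates = np.empty(size)
    for i in range(size):
        # Resample each group with replacement, keeping its original size.
        sample_1 = rng.choice(calls_1, size=len(calls_1), replace=True)
        sample_2 = rng.choice(calls_2, size=len(calls_2), replace=True)
        replicates[i] = sample_1.mean() - sample_2.mean()
    percentiles = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    lower, upper = np.percentile(replicates, percentiles)
    return lower, upper

# Synthetic 0/1 arrays standing in for df_w['call'] and df_b['call'].
white = np.r_[np.ones(235), np.zeros(2435 - 235)]
black = np.r_[np.ones(157), np.zeros(2435 - 157)]
print(bootstrap_diff_ci(white, black))
```

With the notebook's data loaded, the same function could be called as `bootstrap_diff_ci(df_w['call'], df_b['call'])`.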


#### Frequentist approach using a two-proportion Z-test: http://www.statisticshowto.com/z-test/
#### References for calculating margin of error & confidence interval for two proportions: http://www.stat.wmich.edu/s216/book/node85.html

In [7]:
# Calculate margin of error and confidence interval
z_value = 1.96  # critical value for a 95% confidence level

# white-sounding
w_p = sum(df_w.call) / len(df_w.call)
prop_white_cb = 235/2435  # same proportion from the raw counts (not used below)
w_std_err = np.sqrt((w_p * (1 - w_p) / len(df_w.call)))

# black-sounding
b_p = sum(df_b.call) / len(df_b.call)
b_std_err = np.sqrt((b_p * (1 - b_p) / len(df_b.call)))
prop_black_cb = 157/2435  # same proportion from the raw counts (not used below)

# Standard error of the difference between two independent proportions
std_err_diff = np.sqrt((w_std_err ** 2 + b_std_err ** 2))
margin_of_err_diff = z_value * std_err_diff
print('Frequentist statistical test Margin of Error:', margin_of_err_diff)

p_diff = w_p - b_p

conf_int = (p_diff - margin_of_err_diff, p_diff + margin_of_err_diff)
print('Frequentist statistical test confidence interval:', conf_int)

Frequentist statistical test Margin of Error: 0.0152554063499
Frequentist statistical test confidence interval: (0.016777447859559147, 0.047288260559332024)
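
The cell above corresponds to the usual unpooled formula for the difference of two independent proportions. Written out, with $\hat p_w, \hat p_b$ for the sample callback rates, $n_w, n_b$ for the group sizes, and $z^{*} = 1.96$ for the 95% level (symbols introduced here for illustration):

$$
\text{ME} = z^{*}\sqrt{\frac{\hat p_w(1-\hat p_w)}{n_w} + \frac{\hat p_b(1-\hat p_b)}{n_b}},
\qquad
\text{CI} = (\hat p_w - \hat p_b) \pm \text{ME}
$$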


In [8]:
# Calculate the Z-score for the two-proportion test
# Step 1: Find the two proportions:
p_w_success = sum(df_w.call) / df_w.shape[0]
p_b_success = sum(df_b.call) / df_b.shape[0]

# Step 2: Find the overall (pooled) sample proportion.
p_success = (sum(df_w.call) + sum(df_b.call)) / (df_w.shape[0] + df_b.shape[0])

# Step 3: Compute Z score
Z = (p_w_success - p_b_success - 0) / \
    (( p_success * (1 - p_success) * ( 1 / df_w.shape[0] + 1 / df_b.shape[0])) ** (1 / 2))
print('Z-score:', Z)

# Step 4: For confidence level 95%, the critical z value is 1.96.
# As the Z score is larger than 1.96, we can reject the null hypothesis.

Z-score: 4.108412152434346
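
The cell above stops at the Z-score. As a small sketch, the two-sided p-value can be obtained from the standard normal survival function, assuming the Z-score printed above. `scipy.stats.norm.sf` is a real SciPy function; the variable names are illustrative.

```python
from scipy import stats

z = 4.108412152434346  # Z-score printed by the cell above

# Two-sided p-value: probability mass in both tails beyond |z|.
p_value = 2 * stats.norm.sf(abs(z))
print('Two-sided p-value:', p_value)
```

This should agree with the p-value that statsmodels reports for the same test.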


#### Statsmodels also provides a function for the two-proportion Z-test: http://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html

In [9]:
import statsmodels.api as sm

z_score, p_value = sm.stats.proportions_ztest([sum(df_w.call), sum(df_b.call)], [df_w.shape[0], df_b.shape[0]])
print('Two proportional Z-Test Z-score:', z_score)
print('Two proportional Z-Test p-value:', p_value)

Two proportional Z-Test Z-score: 4.10841215243
Two proportional Z-Test p-value: 3.98388683759e-05



### Question 4: Write a story describing the statistical significance in the context of the original problem &
### Question 5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

#### Because the p-value is very small in both the bootstrapping test and the frequentist statistical test, we can reject the null hypothesis. We can conclude that the callback proportions for black-sounding and white-sounding names are **not** the same.
#### The story here is that, based on this analysis, racial discrimination continues to be a significant issue in the United States labor market.
#### This analysis alone does not show that race/name is the most important factor in callback success, because it compares callback rates by race only and does not examine the other resume attributes in the dataset.
