# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.


### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


##  1. What test is appropriate for this problem? Does CLT apply?

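Both variables are categorical (race and callback), so the appropriate test is a hypothesis test comparing two proportions (a two-proportion z-test).

__Does CLT apply?__

The CLT applies when two conditions hold:

* Independence: the race labels were assigned to the resumes at random, so the observations can be treated as independent.
* Sample size: each group should have at least 10 expected successes and 10 expected failures, i.e. np >= 10 and n(1-p) >= 10.

The first condition is met by the design of the experiment, so let's focus on the second. Because the null hypothesis assumes the two proportions are equal, we use the pooled proportion:

$\hat{p}_{pool} = \frac{\text{total successes}}{\text{total } n} = \frac{\text{successes}_1 + \text{successes}_2}{n_1 + n_2} = \frac{\text{calls}_w + \text{calls}_b}{n_1 + n_2}$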

In [4]:
rcd = data[['race', 'call']]
rcd_white = rcd[rcd.race=='w']
rcd_black = rcd[rcd.race=='b']

In [5]:
len(rcd_white)

2435

In [6]:
len(rcd_black)

2435

__Checking the number of calls received by each race__

In [7]:
rcdcall_white = sum(rcd_white.call)
rcdcall_black = sum(rcd_black.call)

__Calculating sample proportions__

* white_sp = sample proportion of resumes with white-sounding names that received a call
* black_sp = sample proportion of resumes with black-sounding names that received a call

In [8]:
white_sp = rcdcall_white/len(rcd_white)
black_sp = rcdcall_black/len(rcd_black)

In [9]:
print(white_sp, black_sp)

0.0965092402464 0.064476386037


In [10]:
propool = round((rcdcall_white+rcdcall_black)/(len(rcd_white)+len(rcd_black)),2)
propool

0.080000000000000002

Check that n1·p_pool >= 10 and n1·(1 - p_pool) >= 10:

In [11]:
np_w=len(rcd_white)*propool
nlp_w=len(rcd_white)*(1-propool)

And likewise that n2·p_pool >= 10 and n2·(1 - p_pool) >= 10:

In [12]:
np_b=len(rcd_black)*propool
nlp_b=len(rcd_black)*(1-propool)

In [13]:
print(np_w,nlp_w,np_b,nlp_b)

194.8 2240.2 194.8 2240.2


__All of these values are well above 10, so the sample-size condition is met.__

__So, the CLT can be applied.__
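
The pooled proportion above was rounded to two decimals before the check. As a minimal sketch, the same success/failure check can be run at full precision. The callback counts 235 and 157 are an assumption here: they are the counts implied by the printed sample proportions with n = 2435 per group, not values printed in the notebook.

```python
# Counts implied by the notebook output (assumption): 0.0965... * 2435 and 0.0644... * 2435
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

# Pooled proportion under H0, kept at full precision (no rounding)
p_pool = (calls_w + calls_b) / (n_w + n_b)

# Expected successes and failures in each group under H0
expected = {
    "white successes": n_w * p_pool,
    "white failures": n_w * (1 - p_pool),
    "black successes": n_b * p_pool,
    "black failures": n_b * (1 - p_pool),
}

# The sample-size condition requires every expected count to be at least 10
clt_ok = all(count >= 10 for count in expected.values())
print(round(p_pool, 4), clt_ok)
```

With equal group sizes the expected counts are the same for both groups, so the conclusion does not change: every count is far above 10.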

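## 2. What are the null and alternate hypotheses?

* H0: pw = pb (the callback rate is the same for white-sounding and black-sounding names)

* HA: pw $\neq$ pb (the callback rates differ)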


## 3. Compute margin of error, confidence interval, and p-value.

* ME = Z* · SE(pw - pb)

* Z = ((pw - pb) - Null) / SE(pw - pb)

We need the z-statistic, since we are comparing two proportions.

Before that, we need the sampling distribution of the difference under H0:

* $\hat{p}_{diff} = \hat{p}_w - \hat{p}_b \sim N(\text{mean} = 0,\ SE_{\hat{p}_{diff}})$

where $SE_{\hat{p}_{diff}}$ is equal to:

In [14]:
#SE uses p_pool as population proportion:
se_diff = round((((propool*(1-propool))/len(rcd_black))+((propool*(1-propool))/len(rcd_white)))**0.5,2)
se_diff

0.01

We need the Z-value to calculate the margin of error (ME):

Z-value = ($\hat{p}$w - $\hat{p}$b) / SE_diff

## Z-Value

In [15]:
Zvalue = round((white_sp - black_sp)/se_diff,2)
Zvalue

3.2000000000000002
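
The rounded values above (SE = 0.01, Z = 3.2) lose a good deal of precision. As a rough sketch, the same formulas can be evaluated without rounding. The counts 235 and 157 are an assumption: they are implied by the printed sample proportions with n = 2435 per group.

```python
import math

# Counts implied by the printed sample proportions (assumption)
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b
p_pool = (calls_w + calls_b) / (n_w + n_b)

# Pooled standard error of the difference under H0: p_w = p_b
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

# z statistic for the observed difference (null value is 0)
z = (p_w - p_b) / se_pooled
print(se_pooled, z)
```

At full precision the standard error is about 0.0078 and the statistic is about 4.11, in line with the value returned by `proportions_ztest` later in the notebook.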

## Margin of Error

In [16]:
Merror = round(Zvalue*se_diff,2)
Merror

0.029999999999999999

## Confidence Interval

In [17]:
u_ci = (white_sp - black_sp) + Merror
u_ci

0.062032854209445584

In [18]:
l_ci = (white_sp - black_sp) - Merror
l_ci

0.0020328542094455865

In [19]:
print("Confidence_Interval is (%.7f, %.7f)"%(l_ci,u_ci))

Confidence_Interval is (0.0020329, 0.0620329)
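
The interval above multiplies the SE by the observed Z-value, and both inputs were rounded to two decimals. A more conventional 95% interval uses the critical value z* ≈ 1.96 together with the unpooled standard error, which is built from each group's own proportion. A minimal sketch, assuming callback counts of 235 and 157 (implied by the printed proportions, not printed in the notebook):

```python
import math

# Counts implied by the printed sample proportions (assumption)
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b
p_diff = p_w - p_b

# Unpooled SE: a confidence interval does not assume p_w == p_b
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Critical value for a two-sided 95% interval under the normal model
z_star = 1.96
margin = z_star * se_unpooled

ci = (p_diff - margin, p_diff + margin)
print("margin of error: %.4f" % margin)
print("95%% CI: (%.4f, %.4f)" % ci)
```

Under these assumptions the margin of error is roughly 0.015 and the interval is roughly (0.017, 0.047), which still excludes 0.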


## P-Value

In [20]:
from statsmodels.stats.proportion import proportions_ztest as pz

In [21]:
cw = len(rcd_white[rcd_white.call == 1])
cb = len(rcd_black[rcd_black.call == 1])

In [22]:
p = pz(np.array([cw,cb]),np.array([len(rcd_white),len(rcd_black)]),value=0)
p

(4.1084121524343464, 3.9838868375850767e-05)

__The proportions z-test returns a p-value of about 0.00004 (3.98e-05).__

* This is far below the usual 0.05 significance level.
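
As a cross-check, the two-sided p-value can be recovered directly from the z statistic reported by `proportions_ztest`. The sketch below uses the standard-library identity P(|Z| > z) = erfc(z / √2) rather than a statistics package; `stats.norm.sf` from SciPy would be an equivalent route.

```python
import math

# z statistic as returned by proportions_ztest in the cell above
z = 4.1084121524343464

# Two-sided tail probability of a standard normal: P(|Z| > z) = erfc(z / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))
print(p_value)
```

The result is on the order of 4e-05, matching the second element of the tuple returned by the test.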

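## 4. Discuss statistical significance

The p-value (about 0.00004) is far below 0.05, so we reject the null hypothesis (H0) that the callback rates are equal.

In the context of the experiment, resumes that differed only in the name received callbacks at different rates: about 9.7% for white-sounding names versus about 6.4% for black-sounding names.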

## 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

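_The results of this study indicate that, all other things being equal, race is still an important factor in the American labor market. On average, a black-sounding name had a negative effect on an applicant's callback prospects: resumes with white-sounding names received about 50 percent more callbacks than those with black-sounding names._

_This analysis does not show, however, that race/name is the most important factor in callback success, because it tests only one variable. To compare its importance with the other resume attributes in the dataset (such as education or yearsexp), the analysis could be amended with a model that includes them together, for example a logistic regression of call on race and those attributes._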
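
The "50 percent more callbacks" figure can be checked from the sample proportions. A small sketch, assuming callback counts of 235 and 157 (implied by the printed proportions with n = 2435 per group):

```python
# Counts implied by the printed sample proportions (assumption)
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b

# Relative callback rate: white-sounding names versus black-sounding names
ratio = p_w / p_b
percent_more = (ratio - 1) * 100
print("ratio: %.3f, percent more: %.1f" % (ratio, percent_more))
```

The ratio comes out just under 1.5, so "about 50 percent more" is a fair summary.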