In [53]:
from __future__ import division, print_function
import numpy as np
import os
import pandas as pd

DATA LINK: https://github.com/fedhere/PUI2016_fb55/blob/master/HW4_fb55/effectivenesofNYCPost-PrisonEmploymentPrograms_solution.ipynb

# Z-TEST - Example 1, Does the new bus route improve commute time?

*Research Question: Is the new bus route for line X8 improving commute time (travel time at peak hours)?*

H0: the commute time with the new bus route is the same as or longer than it was before:

**TimeNew.mean >= TimeOld.mean, significance level = 0.05.**

Ha: the commute time with the new bus route is shorter than it was before:

**TimeNew.mean < TimeOld.mean**

In [7]:
os.system("curl -O https://raw.githubusercontent.com/fedhere/PUI2016_fb55/master/Lab3_fb55/times.txt")
os.system("mv times.txt " + os.getenv("PUIDATA"))

0

In [9]:
times = pd.read_csv(os.getenv('PUIDATA') + '/' + 'times.txt', names=['durations'], header = None)
times.head()

Unnamed: 0,durations
0,31.622239
1,32.821376
2,30.229101
3,31.413766
4,39.01055


In [10]:
(type(times))

pandas.core.frame.DataFrame

In [13]:
#times.durations.values
df = times.durations.values

In [20]:
mean_pop = 36  # given
sd_pop = 6  #given
mean_sample = df.mean()
sd_sample = df.std()
N = len(times)
alpha = 0.05

In [21]:
z = (mean_pop - mean_sample) / (sd_pop / np.sqrt(N))
z

2.5563971861666701

In [22]:
# OPTIONAL

print("Z score for Bus X8 rerouting: {0:.1f}".format(z))
print("This means we are {0:.1f} standard deviations away".format(z))
print("from the mean of the old trip duration\n\n")
if z > 1.645:  # one-sided critical value for alpha = 0.05
    result = "IS REJECTED"
else: 
    result = "CANNOT BE REJECTED"

print("The Null Hypothesis that the new route does not improve commuting")
print("{0:s} at the {1:.1f}% confidence level".format(result, 100*(1-alpha)))

Z score for Bus X8 rerouting: 2.6
This means we are 2.6 standard deviations away
from the mean of the old trip duration


The Null Hypothesis that the new route does not improve commuting
IS REJECTED at the 95.0% confidence level


## Interpretation of the Z-statistic

This z-statistic of 2.56 means that the sample mean commute time on the new route is 2.56 standard errors (sd_pop / sqrt(N)) below the mean commute time of the old route. If the null hypothesis were true, a sample mean this far below the old mean would occur by chance less than 5% of the time: the one-sided critical value at alpha = 0.05 is z = 1.645, and the familiar "2 standard deviations covers about 95% of the data" is the two-sided rule of thumb. Because 2.56 exceeds both thresholds, we can reject the null hypothesis and conclude that the new bus route does reduce commute time.
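
The table lookup can also be done numerically. The following is a minimal sketch, assuming Python 3.8+ so that `statistics.NormalDist` is available; the names `std_normal`, `z_crit` and `p_value` are illustrative and not part of the original notebook.

```python
from statistics import NormalDist

z = 2.5563971861666701   # z-statistic printed by the notebook above
alpha = 0.05

std_normal = NormalDist()                # standard normal: mean 0, sd 1
z_crit = std_normal.inv_cdf(1 - alpha)   # one-sided critical value
p_value = 1 - std_normal.cdf(z)          # P(Z > z) if H0 were true

print("critical z: {0:.3f}".format(z_crit))
print("one-sided p-value: {0:.4f}".format(p_value))
print("reject H0:", p_value < alpha)
```

Comparing `z` with `z_crit` and comparing `p_value` with `alpha` are two views of the same decision, so they always agree.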

# Z-TEST & CHISQUARE - Example 2, Felony Data



## CEO Transitional Jobs
Null Hypothesis: The percentage of former prisoners employed in CEO transitional jobs after release is the same or lower for candidates who participated in the program than for the control group, significance level alpha = 0.05.

H0 Equation: proportion employed in CEO jobs in program group <= proportion employed in CEO jobs in control group

$H_0: P_0 - P_1 \geq 0$

Alternative Hypothesis: The percentage of former prisoners employed in CEO transitional jobs after release is higher for candidates who participated in the program than for the control group, significance level alpha = 0.05.

Ha Equation: proportion employed in CEO jobs in program group > proportion employed in CEO jobs in control group

$H_a: P_0 - P_1 < 0$
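
For reference, the test statistic computed in the cells below can be written out as follows. Here $P_0$, $n_0$ are the proportion and sample size of the control group and $P_1$, $n_1$ those of the program group, as in the code; the symbols $\hat{p}$ and $SE$ are notation introduced here for what the code calls `sp` and `sp_stdev`.

$$
\hat{p} = \frac{P_0 n_0 + P_1 n_1}{n_0 + n_1}, \qquad
SE = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}, \qquad
z = \frac{P_1 - P_0}{SE}
$$

With this sign convention, a large positive $z$ is evidence for $H_a$.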

In [23]:
# Significance level
alpha=0.05
# Proportion of control group employed in CEO job
p_0 = 3.5*0.01 
# Proportion of program group employed in CEO job
p_1 = 70.1*0.01

if p_0-p_1 >= 0:
    print ("the Null holds")
else:
    print ("we must assess the statistical significance")

# Sample size of control group
n_0 = 409
# Sample size of program group
n_1 = 564

# Count of employed in control group
Nt_0 = p_0 * n_0
# Count of employed in program group
Nt_1 = p_1 * n_1

we must assess the statistical significance


In [24]:
# Sample proportion (pooled proportion)
sp = (p_0 * n_0 + p_1 * n_1) / (n_1 + n_0)
print (sp)

0.4210472764645426


In [25]:
def sp_stdev(p, n):
    return(np.sqrt(p * (1 - p) / n[0] +  p * (1 - p) / n[1]))

sp_stdev_2y = sp_stdev(( Nt_0 + Nt_1) / (n_0 + n_1), [n_0, n_1])
print (p_0, n_0, n_1, sp_stdev_2y)

0.035 409 564 0.0320658086057


In [26]:
def zscore(p0, p1, s):
    return((p0 - p1) / s)

z_2y = zscore(p_1, p_0, sp_stdev_2y)
print (z_2y)

20.7697865408


## How to read a Z-Table

https://github.com/fedhere/UInotebooks/blob/master/HowToReadZandChisqTables.md

In [27]:
# The z-value of 20.8 is far beyond the range of the z-table, so we use the area for one of the largest tabulated values, 0.9998
# The true area is even closer to 1, so the p-value below is a conservative upper bound
p_2y = 1 - 0.9998


def report_result(p,a):
    print ('is the p value {0:.2f} smaller than the critical value {1:.2f}? '.format(p,a))
    if p < a:
        print ("YES!")
    else: 
        print ("NO!")
    
    print ('the Null hypothesis is {}'.format( 'rejected' if p < a  else 'not rejected') )

    
report_result(p_2y,alpha)

is the p value 0.00 smaller than the critical value 0.05? 
YES!
the Null hypothesis is rejected


### Z-Statistic Conclusion: CEO Transitional Job:
From our z-statistic, the z-table gives a p-value of at most 0.0002, which prints as 0.00. Because this is smaller than our alpha level of 0.05, we reject the null hypothesis and conclude that the percentage of former prisoners employed in CEO transitional jobs after release is higher for candidates who participated in the program than for the control group, at a significance level of alpha = 0.05.
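
When the z-score is off the end of the table, the tail area can be computed directly. The sketch below wraps the pooled two-proportion calculation in one function and uses the standard library's `math.erfc` for the upper tail. The helper names `two_proportion_z` and `upper_tail_p` are hypothetical and not part of the original notebook.

```python
import math

def two_proportion_z(p_ctrl, n_ctrl, p_prog, n_prog):
    """Pooled two-proportion z-statistic for (program - control)."""
    pooled = (p_ctrl * n_ctrl + p_prog * n_prog) / (n_ctrl + n_prog)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ctrl + 1 / n_prog))
    return (p_prog - p_ctrl) / se

def upper_tail_p(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# CEO transitional job figures used in the notebook
z_ceo = two_proportion_z(0.035, 409, 0.701, 564)
p_ceo = upper_tail_p(z_ceo)

print("z = {0:.2f}".format(z_ceo))
print("one-sided p-value = {0:.3g}".format(p_ceo))
```

`erfc` is used rather than `1 - cdf(z)` because for a z-score this large the subtraction would round to exactly zero in floating point.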

## Felony Conviction Tests

Hypotheses Statements: Convicted of a felony

Null Hypothesis: Those in the program group have the same or higher rates of felonies over three years after the program than those in the control group, alpha = 0.05.

H0 Equation: proportion felonies in program group >= proportion felonies in control group

Alternative Hypothesis: Those in the program group have lower rates of felonies over three years following the program than those in the control group.

Ha Equation: proportion felonies in program group < proportion felonies in control group

## Z-TEST

In [None]:
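# Proportions convicted of a felony: control group (0) and program group (1)
P_0_recid = 11.7 * 0.01
P_1_recid = 10.0 * 0.01
# Sample sizes with recidivism data
n_0_recid = 409
n_1_recid = 568

# Counts of felony convictions in each group
Nt_0_recid = P_0_recid * n_0_recid
Nt_1_recid = P_1_recid * n_1_recid

# Standard error of the difference, using the pooled proportion
sp_stdev_recid = sp_stdev((Nt_0_recid + Nt_1_recid) /
                          (n_0_recid + n_1_recid), [n_0_recid, n_1_recid])
print("test standard error: %.3f" % sp_stdev_recid)

z_recid = zscore(P_1_recid, P_0_recid, sp_stdev_recid)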

In [28]:
# Significance level
alpha=0.05
# Proportion of control group convicted of a felony
p_0 = 11.7*0.01
# Proportion of program group convicted of a felony
p_1 = 10.0*0.01

# H0 here is p_1 >= p_0 (program rate the same or higher), so the Null
# holds outright only if the program rate is not below the control rate
if p_1 - p_0 >= 0:
    print ("the Null holds")
else:
    print ("we must assess the statistical significance")

# Sample size of control group (with Recidivism data)
n_0 = 409
# Sample size of program group (with Recidivism data)
n_1 = 568

# Count of felony convicts in control group
Nt_0 = p_0 * n_0
# Count of felony convicts in program group
Nt_1 = p_1 * n_1

we must assess the statistical significance


In [29]:
# Sample proportion (pooled proportion)
sp = (p_0 * n_0 + p_1 * n_1) / (n_1 + n_0)
print (sp)

0.10711668372569089


In [31]:
def sp_stdev(p, n):
    return(np.sqrt(p * (1 - p) / n[0] +  p * (1 - p) / n[1]))

sp_stdev_2y = sp_stdev(( Nt_0 + Nt_1) / (n_0 + n_1), [n_0, n_1])
print (p_0, n_0, n_1, sp_stdev_2y)

0.11699999999999999 409 568 0.0200556791612


In [32]:
def zscore(p0, p1, s):
    return((p0 - p1) / s)

z_2y = zscore(p_1, p_0, sp_stdev_2y)
print (abs(z_2y))
# Absolute value used because our z-table only contains positive values
# Could have used the negative value and just not subtracted from 1 in the next step

0.84764020522


In [34]:
# Find area corresponding to z-score of 0.85
p_2y = 1 - 0.8023


def report_result(p,a):
    print ('is the p value {0:.2f} smaller than the critical value {1:.2f}? '.format(p,a))
    if p < a:
        print ("YES!")
    else: 
        print ("NO!")
    
    print ('the Null hypothesis is {}'.format( 'rejected' if p < a  else 'not rejected') )

    
report_result(p_2y,alpha)

is the p value 0.20 smaller than the critical value 0.05? 
NO!
the Null hypothesis is not rejected


## Z-Statistic Conclusion: Convicted of a felony:

From our z-statistic, we obtained a p-value of 0.20 (1 - 0.8023 = 0.1977) from the z-table. Because 0.20 is greater than our alpha level of 0.05, we fail to reject the null hypothesis that those in the program group have the same or higher rates of felonies over three years after the program than those in the control group.
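
As the code comment above notes, the absolute value is only needed because the printed table lists positive z-scores. A minimal sketch of the same lookup with the sign kept, assuming Python 3.8+ for `statistics.NormalDist` (the names `z_felony` and `p_lower` are illustrative):

```python
from statistics import NormalDist

z_felony = -0.84764020522   # (program - control) / standard error, sign kept
alpha = 0.05

# Ha is "program rate < control rate", so the p-value is the lower tail P(Z < z)
p_lower = NormalDist().cdf(z_felony)

print("one-sided p-value: {0:.3f}".format(p_lower))
print("reject H0:", p_lower < alpha)
```

The result agrees with the table-based value of about 0.20 up to the rounding of the z-score to 0.85.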

## CHISQUARE TEST

In [36]:
a = 568 * 0.1
b = 568 * (1 - 0.1)
c = 409 * 0.117
d = 409 * (1 - 0.117)
print(a, b, c, d)

56.800000000000004 511.2 47.853 361.147


In [41]:
col1tot = a + c
col2tot = b + d
Ntot = 977
col1tot

104.653

In [42]:
col2tot

872.347

**Observed:**

|Convicted felony|Yes|No|Total|
|:--------------:|:------:|:----------:|:---:|
|test sample|0.1*568 = 56.8|0.9*568 = 511.2|568|
|control sample|0.117*409 = 47.853|0.883*409 = 361.147|409|
|Total|104.653|872.347|977|

**Expected:**

|Convicted felony|Yes|No|Total|
|:--------------:|:------:|:----------:|:---:|
|test sample|(568*104.653)/977 = 60.84|(568*872.347)/977 = 507.16|568|
|control sample|(409*104.653)/977 = 43.81|(409*872.347)/977 = 365.19|409|
|Total|104.653|872.347|977|
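
The expected table and the chi-square statistic can be reproduced from the observed counts alone. The sketch below uses the textbook definition, the sum over cells of (observed - expected)^2 / expected, rather than the 2x2 shortcut formula used in the next cell; for a 2x2 table the two give the same value. Variable names are illustrative.

```python
# Observed counts from the table above: rows = (test, control), columns = (yes, no)
observed = [[0.1 * 568, 0.9 * 568],
            [0.117 * 409, 0.883 * 409]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count in each cell under independence: row total * column total / N
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

for exp_row in expected:
    print(["{0:.2f}".format(e) for e in exp_row])
print("chi-square = {0:.4f}".format(chi2))
```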

In [43]:
def chisqstat(N, values, expect_num):
    return(((values[0][0] * values[1][1] - values[0][1] * values[1][0])**2) * N / expect_num)

Ntot = 977
expected_num = 568 * 409 * 104.653 * 872.347
sample_values = [[0.1 * 568, 0.9 * 568], [0.117 * 409, 0.883 * 409]]
 

print (chisqstat(Ntot,  sample_values, expected_num))

0.7184939175052886


## How to read a Chisquare table

https://github.com/fedhere/UInotebooks/blob/master/HowToReadZandChisqTables.md

## Chisquare table

http://passel.unl.edu/Image/Namuth-CovertDeana956176274/chi-sqaure%20distribution%20table.PNG

In [54]:
# Degrees of freedom = 1: for a contingency table, df = (rows - 1) * (columns - 1) = (2 - 1) * (2 - 1)
# Our chi-square value of 0.72 falls between the table's critical values 0.455 and 1.32, so the p-value is between 0.25 and 0.5.
# For simplicity's sake, we'll take the midpoint of that range, 0.375, as a rough p-value.

# P-value
p_chi = 0.375


def report_result(p,a):
    print ('is the p value {0:.2f} smaller than the critical value {1:.2f}? '.format(p,a))
    if p < a:
        print ("YES!")
    else: 
        print ("NO!")
    
    print ('the Null hypothesis is {}'.format( 'rejected' if p < a  else 'not rejected') )

    
report_result(p_chi,alpha)

is the p value 0.38 smaller than the critical value 0.05? 
NO!
the Null hypothesis is not rejected


## Chi-Square Test Conclusion: Convicted of a felony

From our chi-square value, we obtained a p-value between 0.25 and 0.5 from the chi-square table. This entire range is greater than our alpha level of 0.05, so we fail to reject the null hypothesis that those in the program group have the same or higher rates of felonies over three years after the program than those in the control group.
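
The table only brackets the p-value, but for one degree of freedom it can be computed exactly with the standard library, because a chi-square variable with df = 1 is the square of a standard normal. A minimal sketch (the name `p_chi_exact` is illustrative and not part of the original notebook):

```python
import math

chi2 = 0.7184939175052886   # chi-square statistic printed above
alpha = 0.05

# For df = 1: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_chi_exact = math.erfc(math.sqrt(chi2 / 2))

print("p-value (df = 1): {0:.3f}".format(p_chi_exact))
print("reject H0:", p_chi_exact < alpha)
```

The chi-square test is two-sided, and here the statistic equals the square of the z-score (0.8476^2 is about 0.7185), so half of this p-value matches the one-sided z-test p-value of about 0.20.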