# Week 4: Statistical Analysis and Project

# Hypothesis Testing in Python 

Hypothesis testing is a core data analysis technique for experimentation, and it has become more important with the rise of big data and web commerce. For example, to study which advertising method is more effective, show half of your customers one version of a website and the other half a version with the same content displayed differently. Record how often people in each group buy the product, and see which version performs better. This is sometimes called "A/B testing."

A/B testing raises some interesting ethical questions, which we will come back to later.

A hypothesis is a statement we can test. We will test the following hypothesis: students who sign up for a MOOC soon after it launches are more enthusiastic about it, and therefore perform better in the class.

In [1]:
import pandas as pd
import numpy as np
import os

In [5]:
df = pd.read_csv('../data/grades.csv')
print(df.columns.values)
df.head()

['student_id' 'assignment1_grade' 'assignment1_submission'
 'assignment2_grade' 'assignment2_submission' 'assignment3_grade'
 'assignment3_submission' 'assignment4_grade' 'assignment4_submission'
 'assignment5_grade' 'assignment5_submission' 'assignment6_grade'
 'assignment6_submission']


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


We will set the cutoff between *early* and *late* participants at having completed assignment 1 by the end of December 2015.

In [11]:
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
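Note that the comparison above works on strings, so a submission stamped `2015-12-31 10:00:00` sorts after `'2015-12-31'` and lands in the late group. The following is a minimal sketch of one alternative, assuming the intent is to count all of 31 December as early: parse the column with `pd.to_datetime` and compare against the start of 2016. The small frame `demo` and the names `early_demo` and `late_demo` are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook reads grades.csv.
demo = pd.DataFrame({
    "student_id": ["a", "b", "c"],
    "assignment1_submission": [
        "2015-11-02 06:55:34.282000000",
        "2015-12-31 10:00:00.000000000",  # submitted on the cutoff day itself
        "2016-01-09 05:36:02.389000000",
    ],
})

# Parse the strings into timestamps so comparisons are chronological.
submitted = pd.to_datetime(demo["assignment1_submission"])
cutoff = pd.Timestamp("2016-01-01")

early_demo = demo[submitted < cutoff]    # anything before 1 January 2016
late_demo = demo[submitted >= cutoff]    # 1 January 2016 or later

print(len(early_demo), len(late_demo))
```

Whether the boundary day belongs in the early or late group is a design choice; the point is only that the string and timestamp comparisons can disagree on it.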

In [12]:
# Note: .size counts cells (rows x columns), not rows; use len(early) for the number of students.
print(early.size)
print(late.size)

16328
13767


In [13]:
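# numeric_only=True skips the ID and timestamp columns, which recent pandas versions refuse to average.
early.mean(numeric_only=True)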

assignment1_grade    74.972741
assignment2_grade    67.252190
assignment3_grade    61.129050
assignment4_grade    54.157620
assignment5_grade    48.634643
assignment6_grade    43.838980
dtype: float64

In [14]:
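# numeric_only=True skips the ID and timestamp columns, which recent pandas versions refuse to average.
late.mean(numeric_only=True)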

assignment1_grade    74.017429
assignment2_grade    66.370822
assignment3_grade    60.023244
assignment4_grade    54.058138
assignment5_grade    48.599402
assignment6_grade    43.844384
dtype: float64

The two groups differ by about a point on the first three assignments and by much less on the later ones. Are these differences statistically significant?

Choose a significance level $\alpha$ before testing: the probability of a false positive (rejecting a null hypothesis that is actually true) that you are willing to accept. In the social sciences $\alpha$ is generally 0.1, 0.05, or 0.01. In physics, it could be as low as $10^{-5}$.

You can use a t-test to compare the means of two different populations.

In [15]:
alpha = 0.05

In [16]:
from scipy import stats

In [17]:
stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])

Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)

The p-value (about 0.16) is well above $\alpha=0.05$, so we cannot reject the null hypothesis, which is that the two populations have the same mean. Another way to phrase it is to say that there is no statistically significant difference between the two populations on this assignment.
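To make the test less of a black box, the sketch below computes the pooled-variance t statistic by hand and checks it against `stats.ttest_ind`. It uses randomly generated scores (`group_a`, `group_b`) rather than the course data, so the numbers are illustrative only, and it assumes the default equal-variance form of the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two groups of grades (not the course data).
group_a = rng.normal(75.0, 15.0, size=1200)
group_b = rng.normal(74.0, 15.0, size=1000)
n_a, n_b = len(group_a), len(group_b)

# Pooled sample variance; ddof=1 gives the unbiased variance of each group.
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic: difference in means divided by its standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

result = stats.ttest_ind(group_a, group_b)  # equal variances assumed by default

alpha = 0.05
decision = "reject" if result.pvalue < alpha else "fail to reject"
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f} -> {decision} the null")
```

The two lines should agree to within floating-point precision; the final comparison against `alpha` is the whole decision rule.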

In [18]:
stats.ttest_ind(early['assignment2_grade'], late['assignment2_grade'])

Ttest_indResult(statistic=1.3239868220912567, pvalue=0.18563824610067967)

In [19]:
stats.ttest_ind(early['assignment3_grade'], late['assignment3_grade'])

Ttest_indResult(statistic=1.7116160037010733, pvalue=0.087101516341556676)

In [20]:
stats.ttest_ind(early['assignment4_grade'], late['assignment4_grade'])

Ttest_indResult(statistic=0.16232182017140787, pvalue=0.87106661104475747)

In [21]:
stats.ttest_ind(early['assignment5_grade'], late['assignment5_grade'])

Ttest_indResult(statistic=0.060639738799428348, pvalue=0.95165136357928726)
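The cells above repeat the same call with a different column each time. A loop is one way to condense them. The sketch below assumes the `assignmentN_grade` column naming from `grades.csv`, covers all six grade columns, and uses randomly generated frames `early_demo` and `late_demo` (hypothetical stand-ins) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
grade_columns = [f"assignment{i}_grade" for i in range(1, 7)]

# Hypothetical stand-ins for the early and late frames (random, not course data).
early_demo = pd.DataFrame(rng.normal(60, 15, size=(200, 6)), columns=grade_columns)
late_demo = pd.DataFrame(rng.normal(60, 15, size=(150, 6)), columns=grade_columns)

# Run one independent two-sample t-test per assignment and keep the p-values.
p_values = {}
for column in grade_columns:
    result = stats.ttest_ind(early_demo[column], late_demo[column])
    p_values[column] = result.pvalue

for column, p in p_values.items():
    print(f"{column}: p = {p:.3f}")
```

Collecting the p-values in one place also makes it easier to apply a multiple-testing correction afterwards.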

When we set $\alpha$ to 0.05, we accept that a test will report a significant difference about 5% of the time purely by chance, even when no real difference exists. As we run more and more t-tests, we become more likely to find a positive result simply because we are running more tests. Running many tests until one turns out to be statistically significant is sometimes called "p-hacking."
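To see how quickly chance findings accumulate, suppose the tests are independent and every null hypothesis is true. The probability of at least one false positive among $m$ tests is then $1 - (1 - \alpha)^m$. Independence is a simplifying assumption here; grades on successive assignments are probably correlated. The function name `familywise_error_rate` is chosen for this illustration.

```python
alpha = 0.05

def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

for m in [1, 5, 20]:
    print(f"{m:>2} tests: {familywise_error_rate(alpha, m):.3f}")
```

Under these assumptions, five tests at $\alpha = 0.05$ already carry roughly a 23% chance of at least one spurious result, and twenty tests about 64%.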

To handle p-hacking:

* Bonferroni correction: Tighten $\alpha$ based on the number of tests you are running. If you chose $\alpha = 0.05$ but are now running 3 tests, use $\alpha = 0.05 / 3 \approx 0.0167$ for each test. This is very conservative.
* Hold-out sets: Hold out some data for testing to see how generalizable your conclusions are. This is heavily used in machine learning, where a related technique is called cross-validation.
* Investigation pre-registration: Outline what you expect to find and why, and describe the tests that would support it. Register this plan with a third party, such as an academic journal. Then run the study and report the results, regardless of whether they were positive.
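As a small worked example of the Bonferroni correction, the sketch below applies it to the five p-values reported by the t-tests earlier in this section (rounded to four decimal places). The helper name `bonferroni_reject` is made up for this illustration.

```python
# p-values from the five t-tests in this section, rounded to four decimals.
p_values = {
    "assignment1_grade": 0.1615,
    "assignment2_grade": 0.1856,
    "assignment3_grade": 0.0871,
    "assignment4_grade": 0.8711,
    "assignment5_grade": 0.9517,
}

def bonferroni_reject(p_values, alpha=0.05):
    """Return the corrected threshold and, per test, whether the null is rejected."""
    threshold = alpha / len(p_values)  # divide alpha by the number of tests
    decisions = {name: p < threshold for name, p in p_values.items()}
    return threshold, decisions

threshold, decisions = bonferroni_reject(p_values)
print(f"corrected alpha = {threshold:.3f}")
for name, rejected in decisions.items():
    print(f"{name}: {'reject' if rejected else 'fail to reject'}")
```

With five tests the corrected threshold is 0.01. None of the p-values falls below it, and none fell below the uncorrected 0.05 either, so the conclusion is unchanged here.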