### Basic statistical testing

In this lecture we will review the basics of statistical testing in Python. We will talk about hypothesis testing and statistical significance, and we will use SciPy to run Student's t-test.

In [10]:
# In data science we use statistics in many different ways. Let's refresh our knowledge
# of hypothesis testing, which is a core data analysis activity behind experimentation.
# The goal of hypothesis testing is to determine whether, for example, the two different conditions we have
# in an experiment have resulted in different impacts.

# Let's import our usual libraries
import numpy as np
import pandas as pd

# Now let's bring in something new: the stats module from SciPy
from scipy import stats

In [2]:
# SciPy is an interesting collection of libraries for data science, and we have already used many of them.
# The broader SciPy ecosystem includes pandas and NumPy, as well as plotting libraries such as matplotlib,
# and the scipy package itself provides a number of scientific library functions as well.

In [11]:
# When we do hypothesis testing, we really have two statements of interest:
# the first is our actual explanation, which we call the alternative hypothesis, and the second is
# that the explanation we have is not sufficient, which we call the null hypothesis.
# Our actual testing method is to determine whether the data let us reject the null hypothesis. If we find that there are
# differences between groups, we can reject the null hypothesis in favor of our alternative.


# Let's look at an example using some grade data
df = pd.read_csv('resources/week-4/datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [4]:
# If we look at our DataFrame, we see that there are six different assignments. Let's look at some
# summary statistics for this DataFrame, starting with its shape

print('There are {} rows and {} columns'.format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [12]:
# Let's segment this population into two pieces. We will say that those who finished the first assignment
# by the end of December 2015 are early finishers, and those who finished it some time later are late finishers

early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016'] # Convert the column to datetime before comparing
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [13]:
# There are several ways to get the late finishers. For example, we could run the same kind of query but ask
# for dates >= '2016'. In this case, since early_finishers keeps the index labels from df, we are really looking for
# the students who are not in early_finishers

late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [8]:
# Another way would be to join df with early_finishers: a left join keeps every row of the left DataFrame,
# so the rows with no match in early_finishers are the late finishers. You could also write a function that
# determines whether someone is early or late, then call .apply() and add a new column to the DataFrame.
# There are a fair number of ways to do this
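
As a hedged illustration of the "add a new column" idea above, here is a minimal sketch that labels each student with a single vectorized comparison instead of `.apply()`. The `toy` DataFrame is a hypothetical four-row stand-in shaped like `grades.csv`, and the column name `finisher` is an assumption made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for grades.csv (illustration only)
toy = pd.DataFrame({
    'assignment1_grade': [92.7, 86.8, 85.5, 86.0],
    'assignment1_submission': [
        '2015-11-02 06:55:34', '2015-11-29 14:57:44',
        '2016-01-09 05:36:02', '2016-04-30 06:50:39',
    ],
})

# Parse the timestamps once, then label every row in one vectorized step
submitted = pd.to_datetime(toy['assignment1_submission'])
toy['finisher'] = np.where(submitted < pd.Timestamp('2016-01-01'), 'early', 'late')

# The two populations are now simple filters on the new column
early = toy[toy['finisher'] == 'early']
late = toy[toy['finisher'] == 'late']
print(toy['finisher'].value_counts())
```

Keeping the label as a column means both groups stay in one DataFrame, which can be convenient for later `groupby` summaries.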

In [9]:
# The pandas DataFrame object has a variety of statistical functions associated with it. We can call
# these functions directly on the DataFrame. Let's look at the mean for each of our two populations

In [14]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065


In [15]:
# They look very similar. But are they the same? What do we mean by similar? This is where
# the t-test comes in. It lets us form an alternative hypothesis ('These are different') as well as a
# null hypothesis ('These are the same') and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for
# how much chance we are willing to accept. The significance level is usually called alpha. For example,
# we might set alpha to 0.05, or 5%. This is a commonly used value, but it is somewhat
# arbitrary.


# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis
# testing in Python. We will use ttest_ind(), which runs an independent t-test (meaning the populations are not related to one another).
# The result of ttest_ind() is a t-statistic and a p-value. The p-value is the one that
# matters most to us: it is the probability (between 0 and 1) of seeing a difference at least as large as
# the one we observed if the null hypothesis were true.

# Let's import the ttest_ind function
from scipy.stats import ttest_ind

# Now let's run this function with our two populations
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)
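
To make the two numbers returned by `ttest_ind()` less opaque, the sketch below recomputes them by hand using the pooled-variance form of Student's t-test, which is what `ttest_ind()` uses with its default `equal_var=True`. The samples `a` and `b` are synthetic stand-ins generated for this example, not the course grades, and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two grade samples (not the course data)
a = rng.normal(loc=75.0, scale=15.0, size=200)
b = rng.normal(loc=74.0, scale=15.0, size=300)

# Pooled variance: a weighted average of the two sample variances
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)

# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# Compare against SciPy's result
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))

# The decision rule: compare the p-value with the chosen alpha
alpha = 0.05
print('reject the null' if p_scipy <= alpha else 'fail to reject the null')
```

If the two groups may have unequal variances, `ttest_ind(a, b, equal_var=False)` runs Welch's t-test instead; the manual formula above would no longer match in that case.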

In [16]:
# Here we see that the p-value is about 0.186, which is above our alpha of 0.05. This means that
# we cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we do not have
# enough certainty in our evidence (because the p-value is greater than alpha) to conclude otherwise.
# This does not mean we have shown that the populations are the same, either.

In [17]:
# Why don't we look at the other assignment grades?

print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.21088896270044244)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.1067999810222786)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352609489, pvalue=0.9075854011989859)
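
The five near-identical calls above can also be written as a loop over the column names. This is only a sketch: `early` and `late` are synthetic DataFrames that borrow the notebook's column naming, standing in for `early_finishers` and `late_finishers`, and `results` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
cols = ['assignment{}_grade'.format(i) for i in range(1, 7)]

# Synthetic stand-ins for early_finishers and late_finishers (not the course data)
early = pd.DataFrame(rng.normal(70, 10, size=(50, 6)), columns=cols)
late = pd.DataFrame(rng.normal(70, 10, size=(60, 6)), columns=cols)

# Run one independent t-test per assignment and keep the results by column name
results = {}
for col in cols:
    stat, pval = ttest_ind(early[col], late[col])
    results[col] = (stat, pval)
    print('{}: statistic={:.3f}, pvalue={:.3f}'.format(col, stat, pval))
```

Collecting the results in a dictionary makes it easy to compare every p-value against alpha afterwards.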


In [18]:
# Well, from what we see in this data we do not have enough evidence to suggest that the populations differ
# with respect to grade. Let's look at these p-values for a moment. For example, one of the assignments, assignment 3,
# has a p-value of around 0.1. This means that if we accepted an alpha of 11%, this would be
# considered statistically significant. As a research matter, this could suggest that there is something here
# worth following up on. For example, if we had a small number of participants (which we do not),
# or if there was something unique about this assignment as it relates to the experiment (whatever that is),
# there may be follow-up experiments we could run

In [19]:
# P-values have come under fire recently for being insufficient to tell us about the interactions that are
# happening, and other techniques, such as confidence intervals and Bayesian analyses, are being used more
# regularly.

# One issue with p-values is that as you run more tests, you are likely to get
# a statistically significant value just by chance. Let's look at a small simulation of this.


# Let's create a DataFrame of 100 columns, each with 100 numbers
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.631622,0.980429,0.541378,0.675257,0.594209,0.126157,0.677277,0.733248,0.542124,0.375118,...,0.119582,0.211202,0.842628,0.698714,0.58857,0.76674,0.519179,0.304661,0.327297,0.356673
1,0.141092,0.293658,0.956149,0.982746,0.281,0.346991,0.652229,0.282293,0.068041,0.089381,...,0.736282,0.449924,0.179042,0.946815,0.181816,0.783499,0.021885,0.245215,0.603233,0.255904
2,0.383116,0.898514,0.43271,0.422635,0.889698,0.136309,0.71712,0.834116,0.869712,0.559002,...,0.263236,0.759099,0.94363,0.833461,0.475799,0.610442,0.799343,0.989965,0.496995,0.232954
3,0.857676,0.790472,0.307499,0.653475,0.012058,0.662111,0.909895,0.398625,0.787891,0.384,...,0.488357,0.099418,0.258302,0.52244,0.03374,0.390086,0.180167,0.473175,0.085052,0.034355
4,0.497272,0.816769,0.888165,0.779664,0.883219,0.22529,0.250772,0.447332,0.015132,0.412379,...,0.581744,0.488824,0.380911,0.234609,0.853153,0.423854,0.4892,0.573071,0.287947,0.154285


In [22]:
# Let's create a second DataFrame

df2 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.798359,0.824411,0.16351,0.679847,0.130451,0.961485,0.897837,0.226593,0.106009,0.285651,...,0.706896,0.458154,0.868299,0.537835,0.52992,0.090139,0.442255,0.084874,0.397607,0.027577
1,0.905959,0.806044,0.795189,0.689599,0.589827,0.57666,0.07899,0.83891,0.851161,0.839004,...,0.806992,0.322622,0.546312,0.543335,0.294068,0.094001,0.956558,0.972448,0.896301,0.570958
2,0.900753,0.968876,0.875826,0.681239,0.209147,0.032033,0.279006,0.051622,0.940853,0.818897,...,0.71843,0.27441,0.122726,0.250765,0.528643,0.237594,0.419881,0.15831,0.060842,0.25518
3,0.024976,0.27703,0.617675,0.460331,0.892588,0.90359,0.417025,0.610786,0.434334,0.289827,...,0.20082,0.25012,0.156291,0.530534,0.653799,0.893017,0.48038,0.070557,0.373636,0.323213
4,0.136186,0.314427,0.670367,0.321676,0.069062,0.874041,0.217379,0.650205,0.001485,0.863905,...,0.689687,0.589834,0.04555,0.215566,0.660558,0.370395,0.39811,0.633917,0.042203,0.251415


In [27]:
# So, are these two DataFrames the same? For a given column in df1, is it the same as the matching column in df2?

# Let's find out. Say our critical value is 0.1, an alpha of 10%. We will compare
# each column in df1 with the same-numbered column in df2 and report the p-value
# when it is below 10%, which would mean we have enough evidence to say the columns
# are different.

# Let's write this as a function called test_columns

def test_columns(alpha=0.1):
    # Keep track of how many columns differ
    num_diff = 0
    # Iterate over the columns
    for col in df1.columns:
        # Run ttest_ind on the matching columns of the two DataFrames
        teststat, pval = ttest_ind(df1[col], df2[col])

        # Check the p-value against alpha
        if pval <= alpha:
            # Report the column as different and increment num_diff
            print('Col {} is statistically significantly different at alpha={}, pval={}'.format(col, alpha, pval))
            num_diff = num_diff + 1

    # Finally, print some summary stats
    print('Total number diferrent was {}, which is {}%'.format(num_diff, float(num_diff) / len(df1.columns) * 100))

# Now let's run the code
test_columns()

Col 19 is statistically significantly different at alpha=0.1, pval=0.05084279351359974
Col 41 is statistically significantly different at alpha=0.1, pval=0.06859568063257189
Col 47 is statistically significantly different at alpha=0.1, pval=0.05089053080235727
Col 60 is statistically significantly different at alpha=0.1, pval=0.055192627504264284
Col 69 is statistically significantly different at alpha=0.1, pval=0.0911505695191257
Col 70 is statistically significantly different at alpha=0.1, pval=0.006981454838530834
Col 76 is statistically significantly different at alpha=0.1, pval=0.05463199062992337
Col 78 is statistically significantly different at alpha=0.1, pval=0.0735626502049387
Col 83 is statistically significantly different at alpha=0.1, pval=0.07212426894988395
Col 87 is statistically significantly different at alpha=0.1, pval=0.02219707331086324
Col 88 is statistically significantly different at alpha=0.1, pval=0.09415341863524902
Col 95 is statistically significantly diffe

In [28]:
# Interesting. We see that a handful of columns are flagged as different. In fact, that number looks
# a lot like the alpha value we chose. What is going on? Shouldn't all the columns be the same?
# Remember that the t-test checks whether two sets differ at a given significance level, in our
# case 10%. The more random comparisons you make, the more of them will appear different just by chance. In this example
# we checked 100 columns, so we would expect roughly 10 of them to be flagged if alpha is 0.1.

# We can try another alpha value as well
test_columns(0.05)

Col 70 is statistically significantly different at alpha=0.05, pval=0.006981454838530834
Col 87 is statistically significantly different at alpha=0.05, pval=0.02219707331086324
Col 95 is statistically significantly different at alpha=0.05, pval=0.0010441104480407234
Total number diferrent was 3, which is 3.0%
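
The "roughly alpha percent of columns" intuition is easier to see with many more columns. The following sketch is a seeded variation on the simulation above, not the notebook's own code: it draws 2000 pairs of columns from the same uniform distribution and uses the `axis` argument of `ttest_ind()` to run one test per column. The sizes and the seed are arbitrary choices made for this example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_rows, n_cols = 100, 2000

# Both arrays come from the same uniform distribution, so the null hypothesis is true everywhere
x = rng.random((n_rows, n_cols))
y = rng.random((n_rows, n_cols))

# axis=0 runs one independent t-test per column and returns one p-value per column
_, pvals = ttest_ind(x, y, axis=0)

# The share of columns flagged as "different" should sit close to alpha
for alpha in (0.1, 0.05):
    rate = np.mean(pvals <= alpha)
    print('alpha={}: {:.1%} of columns flagged'.format(alpha, rate))
```

Every flagged column here is a false positive by construction, which is why the flagged share tracks whatever alpha you pick.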


In [30]:
# So keep this in mind when running statistical tests, such as the t-test, that produce a p-value.
# Understand that the p-value is not magic: it comes with a threshold that you use when reporting results and trying to
# answer your hypothesis. What is a reasonable threshold? That depends on the question, and you need
# domain expertise to better understand what counts as significant.

# For fun, let's recreate the second DataFrame using a non-normal distribution, such as chi-squared

df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=0.0002227949102324349
Col 1 is statistically significantly different at alpha=0.1, pval=0.0008907084452604087
Col 2 is statistically significantly different at alpha=0.1, pval=9.175068898692099e-05
Col 3 is statistically significantly different at alpha=0.1, pval=0.00029501341217386535
Col 4 is statistically significantly different at alpha=0.1, pval=0.0009243885762726112
Col 5 is statistically significantly different at alpha=0.1, pval=0.0013708906467274967
Col 6 is statistically significantly different at alpha=0.1, pval=0.0003550034535231731
Col 7 is statistically significantly different at alpha=0.1, pval=2.6142125306450425e-05
Col 8 is statistically significantly different at alpha=0.1, pval=0.0003864194590648553
Col 9 is statistically significantly different at alpha=0.1, pval=0.0003303901536382445
Col 10 is statistically significantly different at alpha=0.1, pval=0.0006713039636976327
Col 11 is statistically signi

In [31]:
# Here we see that all or most of the columns are flagged as statistically significant at the 10% level
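
One way to see why nearly every column is flagged is to compare the means of the two distributions: values from `np.random.random` are uniform on [0, 1) with mean 0.5, while a chi-squared distribution with one degree of freedom has mean 1. The sketch below checks this empirically with large samples; the sample size and seed are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Large samples from the two distributions used for df1 and the new df2
uniform_sample = rng.random(100_000)
chisq_sample = rng.chisquare(df=1, size=100_000)

# The sample means land near the theoretical means of 0.5 and 1.0
print('uniform mean:', uniform_sample.mean())
print('chi-squared mean:', chisq_sample.mean())
```

Because the t-test compares means, a true gap of about 0.5 between the populations is large enough to show up in almost every 100-value column.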


In this lecture we discussed some of the basics of hypothesis testing in Python. We introduced the SciPy library, which you can use to run t-tests. There is much more to learn about hypothesis testing: for example, there are different tests to use depending on the shape of your data, and different ways to report results beyond p-values, such as confidence intervals and Bayesian analyses. But this should give you a basic idea of how to start comparing two populations for differences, which is a common task in data science.