# Permutation Testing

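Permutation testing is a way of checking whether a label is associated with an outcome: if the label carries no information, shuffling it should leave the statistic of interest roughly unchanged. For example, in the Titanic data, suppose we wanted to check whether `class` has an effect on survival rate. We could do something like this: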

1. Find a statistic we're interested in, say 
$$
T = \frac{P(\text{survived}|\text{first class})}{P(\text{survived}|\text{third class})}  
$$
2. Compute this for the actual data
3. Recompute it for many random permutations of the `class` label. These values form the distribution of the statistic under the null hypothesis.
4. See where our observed value falls relative to that null distribution.
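The four steps above can be sketched on synthetic data before touching the Titanic file. This is a minimal illustration, not the notebook's own solution: the two groups, their survival probabilities, and the helper name `survival_ratio` are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1 = survived, 0 = did not survive.
group_a = rng.binomial(1, 0.6, size=200)   # plays the role of "first class"
group_b = rng.binomial(1, 0.25, size=500)  # plays the role of "third class"

def survival_ratio(a, b):
    # Step 1: the statistic T, a ratio of two survival rates
    return a.mean() / b.mean()

# Step 2: T on the actual (here: synthetic) data
observed = survival_ratio(group_a, group_b)

# Step 3: recompute T after shuffling which outcomes belong to which group
pooled = np.concatenate([group_a, group_b])
null_ratios = np.empty(2000)
for i in range(2000):
    shuffled = rng.permutation(pooled)
    null_ratios[i] = survival_ratio(shuffled[:len(group_a)], shuffled[len(group_a):])

# Step 4: fraction of shuffles at least as large as the observation
p_value = (null_ratios >= observed).mean()
print(observed, p_value)
```

Shuffling the pooled outcomes and re-splitting them at the original group sizes is equivalent to shuffling the labels while keeping the outcomes fixed.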

In [1]:
# imports a library 'pandas', names it as 'pd'
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# enables inline plots, without it plots don't show up in the notebook
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
# %config InlineBackend.figure_format = 'png'
# mpl.rcParams['figure.dpi']= 300

In [2]:
df = pd.read_csv("../../week01-benson/02-git_viz/data/titanic.csv")

## Exercises

1. Implement a permutation test to answer the question above
2. Time your algorithm, and see how fast you can get it running
3. Explore some other Titanic hypotheses (in real life you'd need to watch out for false discovery rate)
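On the false discovery rate mentioned in exercise 3: when several hypotheses are tested on the same data, one commonly used correction is the Benjamini-Hochberg procedure. The sketch below is an illustration only; the function name `benjamini_hochberg` and the p-values are made up for the example and are not results from this notebook.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    # Returns a boolean array: True where the hypothesis is rejected
    # while controlling the false discovery rate at `alpha`.
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # The k-th smallest p-value is compared against alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
        rejected[order[:k + 1]] = True
    return rejected

# Made-up p-values, for illustration only
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30]))
```

Each p-value here would come from a separate permutation test, one per hypothesis explored.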

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [15]:
df_third = df[df['pclass'] == 3].reset_index()
df_first = df[df['pclass'] == 1].reset_index()

In [49]:
third_surv = df_third.survived.mean()

In [50]:
first_surv = df_first.survived.mean()

In the observed data, first-class passengers survived about 2.6 times as often as third-class passengers.

In [51]:
obs = first_surv / third_surv

In [58]:
def get_pooled_classes(class_1, class_2):
    # returns pooled survival distribution of class_1 and class_2 and the number of passengers
    # in each class, and the observed ratio of survival between the two classes
    df_class_1 = df[df['pclass'] == class_1].reset_index()
    df_class_2 = df[df['pclass'] == class_2].reset_index()
    
    obs = df_class_1.survived.mean() / df_class_2.survived.mean()
    
    pooled_classes = np.array(pd.concat([df_class_1, df_class_2])['survived'])
    
    return pooled_classes, len(df_class_1), len(df_class_2), obs

In [77]:
pool, num_1, num_2, obs = get_pooled_classes(1, 2)

In [79]:
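def permutation_test(pool, num_first):
    # shuffle the pooled outcomes, which is equivalent to shuffling the class labels
    pool = np.random.permutation(pool)
    
    # the first num_first outcomes stand in for the first group, the rest for the second
    simulated_first = pool[:num_first]
    simulated_second = pool[num_first:]
    
    return simulated_first.mean() / simulated_second.mean()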

In [95]:
permutation_test(pool, num_1)

1.14775828460039

In [45]:
pooled_classes = np.array(pd.concat([df_third, df_first])['survived'])

In [53]:
ratios = []

for _ in range(10000):
    ratios.append(permutation_test(pooled_classes, len(df_first)))
    
ratios = np.array(ratios)

In [97]:
def run_full_test(class1, class2, iters=10000):
    pool, num_first, num_second, obs = get_pooled_classes(class1, class2)
    
    ratios = []
    
    for _ in range(iters):
        ratios.append(permutation_test(pool, num_first))
        
    ratios = np.array(ratios)
    
    p_value = (ratios >= obs).mean()
    
    return p_value

In [101]:
run_full_test(2, 3)

0.0
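A result of exactly 0.0 only says that none of the 10000 shuffles reached the observed ratio; the test cannot resolve p-values smaller than about one over the number of permutations. A common convention, which the cells above do not use, counts the observed labeling as one of the permutations so that the estimate is never zero. A minimal sketch, with `corrected_p_value` as a hypothetical helper name and made-up inputs:

```python
import numpy as np

def corrected_p_value(null_stats, observed):
    # Hypothetical helper: (b + 1) / (m + 1), where b is the number of
    # permuted statistics at least as large as the observed one and m is
    # the number of permutations.
    null_stats = np.asarray(null_stats)
    b = np.sum(null_stats >= observed)
    return (b + 1) / (len(null_stats) + 1)

# With no permuted ratio reaching the observation, the estimate is
# 1 / (m + 1) instead of 0.
example_nulls = np.array([0.9, 1.0, 1.1, 1.2])
print(corrected_p_value(example_nulls, observed=2.6))
```

Applied to `run_full_test`, this would mean replacing `(ratios >= obs).mean()` with the corrected count.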

In [63]:
%timeit run_full_test(1, 2)

263 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
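One possible way to speed this up is to drop the Python loop and shuffle all permutations in a single NumPy call. The sketch below is an assumption about how that could look, not a measured improvement: `permutation_ratios_vectorized` is a hypothetical name, the pool is synthetic, and it relies on `Generator.permuted`, which needs a reasonably recent NumPy.

```python
import numpy as np

def permutation_ratios_vectorized(pool, num_first, iters=10000, seed=None):
    # Hypothetical vectorized variant: shuffle `iters` copies of the pool
    # at once instead of looping in Python.
    rng = np.random.default_rng(seed)
    tiled = np.tile(np.asarray(pool), (iters, 1))
    shuffled = rng.permuted(tiled, axis=1)  # each row is shuffled independently
    first_mean = shuffled[:, :num_first].mean(axis=1)
    second_mean = shuffled[:, num_first:].mean(axis=1)
    return first_mean / second_mean

# Synthetic stand-in for the pooled 0/1 survival outcomes
demo_pool = np.array([1] * 120 + [0] * 280)
ratios = permutation_ratios_vectorized(demo_pool, num_first=150, iters=1000, seed=0)
print(ratios.shape, ratios.mean())
```

The trade-off is memory: the tiled array holds `iters` full copies of the pool. Whether it is actually faster on a given machine is best checked with `%timeit`, as above.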


## Explore more hypotheses

In [64]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [68]:
def encode_adult_male(x):
    if x == True:
        return 1
    else:
        return 0

In [69]:
df['adult_male'] = df['adult_male'].apply(encode_adult_male)

In [72]:
df[df['adult_male'] == 0]['survived'].mean()

0.7175141242937854

In [73]:
df[df['adult_male']== 1]['survived'].mean()

0.16387337057728119

In [None]:
def get_pooled_parameters(param_1, param_2):
    # returns the pooled survival outcomes of the passengers with pclass == param_1 and
    # pclass == param_2, the number of passengers in each group, and the observed ratio
    # of survival rates between the two groups
    df_class_1 = df[df['pclass'] == param_1].reset_index()
    df_class_2 = df[df['pclass'] == param_2].reset_index()
    
    obs = df_class_1.survived.mean() / df_class_2.survived.mean()
    
    pooled_classes = np.array(pd.concat([df_class_1, df_class_2])['survived'])
    
    return pooled_classes, len(df_class_1), len(df_class_2), obs
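To test a label other than `pclass`, such as the `adult_male` column encoded above, the pooling step needs to take the column as an argument. The following is a sketch of one way to do that, not the notebook's own code: `get_pooled_groups` and its parameters are hypothetical, and the small frame is made up purely to show the call.

```python
import numpy as np
import pandas as pd

def get_pooled_groups(frame, column, value_1, value_2, outcome="survived"):
    # Hypothetical generalization of get_pooled_classes: `column` selects
    # the label to test (e.g. 'pclass' or 'adult_male').
    group_1 = frame.loc[frame[column] == value_1, outcome].to_numpy()
    group_2 = frame.loc[frame[column] == value_2, outcome].to_numpy()
    obs = group_1.mean() / group_2.mean()
    pooled = np.concatenate([group_1, group_2])
    return pooled, len(group_1), len(group_2), obs

# Tiny made-up frame, only to show the call; this is not the Titanic data.
toy = pd.DataFrame({
    "adult_male": [0, 0, 0, 0, 1, 1, 1, 1],
    "survived":   [1, 1, 1, 0, 1, 0, 0, 0],
})
pool, n_1, n_2, obs = get_pooled_groups(toy, "adult_male", 0, 1)
print(n_1, n_2, obs)
```

The returned tuple has the same layout as `get_pooled_classes`, so the pool and first group size can be passed to `permutation_test` unchanged.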