# The idea of permutation

## A mosquito problem

![](https://matthew-brett.github.io/cfd2019/images/mosquito_banner.png)

With thanks to John Rauser: [Statistics Without the Agonizing Pain](https://www.youtube.com/watch?v=5Dnw46eC-0o)

## The data

In [None]:
# Import Numpy library, rename as "np"
import numpy as np
# Import Pandas library, rename as "pd"
import pandas as pd

# Set up plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
# An extra tweak to make sure we always get the same random numbers.
np.random.seed(42)

Read in the data:

In [None]:
mosquitoes = pd.read_csv('mosquito_beer.csv')
mosquitoes.head()

Get the number of activated mosquitoes for the "after" treatment rows, separating the "beer" group and the "water" group.

In [None]:
# After treatment rows.
afters = mosquitoes[mosquitoes['test'] == 'after']
# After beer treatment rows.
beers = afters[afters['group'] == 'beer']
# The "activated" values for the after beer rows.
beer_activated = np.array(beers['activated'])
beer_activated

In [None]:
n_beer = len(beer_activated)
n_beer

In [None]:
# Same for the water group.
waters = afters[afters['group'] == 'water']
water_activated = np.array(waters['activated'])
water_activated

In [None]:
n_water = len(water_activated)
n_water

## The permutation way

* Calculate difference in means
* Pool
* Repeat many times:
    * Shuffle
    * Split
    * Recalculate difference in means
    * Store
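
The steps above can be gathered into a single function. This is a minimal sketch, not the notebook's own code: the name `permutation_diffs`, its parameters, and the toy arrays are hypothetical, and it assumes NumPy's `default_rng` generator instead of the global `np.random.seed` state used in the cells on this page.

```python
import numpy as np


def permutation_diffs(group_a, group_b, n_iters=10000, seed=None):
    """Return the actual difference in means and the differences from shuffling.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    n_a = len(group_a)
    # Calculate the difference in means for the real groups.
    actual = np.mean(group_a) - np.mean(group_b)
    # Pool the two groups into one array.
    pooled = np.concatenate([group_a, group_b])
    fake_diffs = np.zeros(n_iters)
    for i in range(n_iters):
        # Shuffle: a new random ordering of the pooled values.
        shuffled = rng.permutation(pooled)
        # Split into fake groups of the original sizes.
        fake_a = shuffled[:n_a]
        fake_b = shuffled[n_a:]
        # Recalculate the difference in means, and store it.
        fake_diffs[i] = np.mean(fake_a) - np.mean(fake_b)
    return actual, fake_diffs


# Made-up toy values, only to show the call; these are not the mosquito data.
toy_a = [20, 25, 30, 22]
toy_b = [18, 21, 19]
toy_actual, toy_fake = permutation_diffs(toy_a, toy_b, n_iters=1000, seed=0)
print(toy_actual)
# Proportion of fake differences at least as large as the actual one.
print(np.mean(toy_fake >= toy_actual))
```

The sections below walk through the same steps one cell at a time, using the real `beer_activated` and `water_activated` arrays.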

## On balls

![](https://matthew-brett.github.io/cfd2019/images/just_balls.png)

## The difference in means

![](https://matthew-brett.github.io/cfd2019/images/beer_mean.png)

In [None]:
print(np.mean(beer_activated))
print(np.mean(water_activated))

## The difference in means

![](https://matthew-brett.github.io/cfd2019/images/water_mean.png)

In [None]:
actual_diff = np.mean(beer_activated) - np.mean(water_activated)
actual_diff

## Pool

In [None]:
pooled = np.append(beer_activated, water_activated)
pooled

## Shuffle

![](https://matthew-brett.github.io/cfd2019/images/fake_balls1.png)

In [None]:
np.random.shuffle(pooled)
pooled
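
`np.random.shuffle` rearranges the array in place and returns `None`, so after the call `pooled` holds the same values in a new random order. A small sketch of that behavior, using made-up numbers in place of the mosquito data (the names `values`, `before`, and `result` are only for illustration):

```python
import numpy as np

# Made-up values for illustration; not the mosquito data.
values = np.array([1, 2, 3, 4, 5, 6])
before = values.copy()

# shuffle changes `values` itself, and returns None.
result = np.random.shuffle(values)
print(result)

# The values are the same as before, only the order may differ.
print(np.array_equal(np.sort(values), np.sort(before)))
```

Because the shuffle happens in place, each later call to `np.random.shuffle(pooled)` starts from the previous shuffled order. That does not matter here, because every ordering is equally likely after a shuffle.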

## A difference if the null is true

![](https://matthew-brett.github.io/cfd2019/images/fake_beer_mean1.png)

## One difference under the null

![](https://matthew-brett.github.io/cfd2019/images/fake_water_mean1.png)

In [None]:
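# Split the shuffled values into fake groups of the original sizes.
fake_beer = pooled[:n_beer]
fake_water = pooled[n_beer:]
# Difference in means for the fake groups.
fake_diff = np.mean(fake_beer) - np.mean(fake_water)
fake_diff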

## And again

![](https://matthew-brett.github.io/cfd2019/images/fake_beer_mean2.png)

In [None]:
np.random.shuffle(pooled)
fake_beer = pooled[:n_beer]
fake_water = pooled[n_beer:]
fake_diff = np.mean(fake_beer) - np.mean(fake_water)
fake_diff

## Another difference under the null

![](https://matthew-brett.github.io/cfd2019/images/fake_water_mean2.png)

In [None]:
np.random.shuffle(pooled)
fake_beer = pooled[:n_beer]
fake_water = pooled[n_beer:]
fake_diff = np.mean(fake_beer) - np.mean(fake_water)
fake_diff

## And so on, 10000 times

In [None]:
# Make 10000 fake differences, one for each shuffle of the pooled values.
fake_differences = np.zeros(10000)
for i in np.arange(10000):
    np.random.shuffle(pooled)
    fake_beer = pooled[:n_beer]
    fake_water = pooled[n_beer:]
    fake_diff = np.mean(fake_beer) - np.mean(fake_water)
    fake_differences[i] = fake_diff
# Show the distribution of the fake differences.
plt.hist(fake_differences);

In [None]:
# Number of fake differences greater than or equal to the actual difference.
n_ge_actual = np.count_nonzero(fake_differences >= actual_diff)
n_ge_actual

In [None]:
# Proportion of fake differences greater than or equal to the actual difference.
p_ge_actual = n_ge_actual / len(fake_differences)
p_ge_actual
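
The proportion calculated above is the count of `True` values in a Boolean array divided by the number of trials, which is the same as the mean of that Boolean array. A short sketch of the equivalence, with made-up stand-in values (the `toy_` names and numbers are hypothetical, not results from the mosquito data):

```python
import numpy as np

# Made-up stand-ins for illustration; not results from the mosquito data.
toy_fake_differences = np.array([-2.0, 0.5, 1.0, 3.0, 4.5])
toy_actual_diff = 3.0

# Boolean array: True where the fake difference is at least as large.
is_ge = toy_fake_differences >= toy_actual_diff
# Counting the True values, then dividing by the number of trials ...
p_by_count = np.count_nonzero(is_ge) / len(toy_fake_differences)
# ... gives the same answer as taking the mean of the Boolean array.
p_by_mean = np.mean(is_ge)
print(p_by_count, p_by_mean)
```

A small proportion means that differences as large as the actual difference are rare when the beer and water labels are assigned at random.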