### Hypothesis testing
Elements of Data Science

## Hypothesis Testing Learning Goals
Develop and test a hypothesis
- Hypothesis
    - testable hypothesis
    - statistic
- Simulation: Empirical distribution
    - Repeat and collect outcomes
    - Iteration: 
        `for i in np.arange(samples)`
- Examine resulting distribution of outcomes
    - Probability distribution
    - Uncertainty
- p-value
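
The repeat-and-collect pattern in the goals above can be sketched as follows. This is only an illustration under stated assumptions, not the course's own code: the coin-flip statistic, the names `one_simulated_statistic` and `repetitions`, and the fixed seed are all chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so reruns match

def one_simulated_statistic(flips=100):
    """Simulate `flips` fair coin tosses and return the number of heads."""
    return np.count_nonzero(rng.random(flips) < 0.5)

repetitions = 1000  # hypothetical number of repetitions
outcomes = np.zeros(repetitions, dtype=int)

# Repeat the simulation and collect one statistic per repetition
for i in np.arange(repetitions):
    outcomes[i] = one_simulated_statistic()

# The collected outcomes form an empirical distribution of the statistic
print(outcomes.mean(), outcomes.min(), outcomes.max())
```

The spread of `outcomes` around 50 gives a sense of how much the statistic varies by chance alone.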

In [1]:
import numpy as np
from datascience import *

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots: restore the collections.Iterable alias,
# which was removed from the collections module in Python 3.10
import collections
import collections.abc
collections.Iterable = collections.abc.Iterable

## Slicing from the left and right ends of an array

```
         0  1  2  3  4    index counting from left to right
data = [ 2, 3, 9, 5, 4]
        -5 -4 -3 -2 -1    index counting from right to left
        
One way to think of slicing is to think of the indices as cutting points.

-6  -5  -4  -3  -2  -1
 +---+---+---+---+---+---+
 | p | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6

 seq = list('python')
 seq[0:3] yields ['p', 'y', 't']
 seq[-3:-1] yields ['h', 'o']

```

In [31]:
seq = list('python')
seq

['p', 'y', 't', 'h', 'o', 'n']

In [32]:
seq[0:3]

['p', 'y', 't']

In [46]:
seq[-3:-1]

['h', 'o']

In [42]:
seq[-1]

'n'

In [41]:
seq[-1:6]

['n']

In [43]:
seq[-5:-3]

['y', 't']

In [44]:
# Negative indices work on any list, not just a list of characters
ice_cream_flavors = ['Vanilla', 'Chocolate', 'Strawberry', 'Mint Chip', 'Peach']
ice_cream_flavors[2]

'Strawberry'

In [45]:
ice_cream_flavors[-3]

'Strawberry'

**Students: Try slicing ice_cream_flavors using positive and negative indices**

## Hypothesis Testing

#### Test statistic differences


In [52]:
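def diff_n(values, n):
    '''Return the differences between values n steps apart (requires n >= 1)'''
    values = np.array(values)
    # values[n:] drops the first n items; values[:-n] drops the last n items
    return values[n:] - values[:-n]

diff_n(make_array(1.0, 1.5, 1.4, 1.2, 2.0, 1.8), 2)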

array([ 0.4, -0.3,  0.6,  0.6])

After all of the examples of slicing above, you should be able to figure out how the function diff_n() works!
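
As a check on your reasoning, the two slices inside `diff_n()` can be looked at one at a time. The sketch below uses a plain Python list instead of an array; the names `later` and `earlier` are introduced here for illustration, and the rounding is only there to hide floating-point noise.

```python
values = [1.0, 1.5, 1.4, 1.2, 2.0, 1.8]
n = 2

later = values[n:]     # drops the first n items: [1.4, 1.2, 2.0, 1.8]
earlier = values[:-n]  # drops the last n items:  [1.0, 1.5, 1.4, 1.2]

# Pair each value with the one n steps before it and subtract
differences = [round(late - early, 10) for late, early in zip(later, earlier)]
print(differences)  # [0.4, -0.3, 0.6, 0.6]
```

Note that `n` must be at least 1: `values[:-0]` is the same as `values[:0]`, which is empty.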

In [49]:
positive = np.count_nonzero(diff_n(make_array(1.0, 1.5,1.4, 1.2, 2.0, 1.8), 2) > 0)
positive

3

In [50]:
negative = np.count_nonzero(diff_n(make_array(1.0, 1.5,1.4, 1.2, 2.0, 1.8), 2) < 0)
negative

1

In [None]:
differences = diff_n(rates, years)

## Creating and sampling a distribution for the null hypothesis
The null hypothesis is that the temperature is equally likely to go up or down: there is no systematic global warming, and any apparent trend is just the result of random fluctuations.

### Inference and Climate Change

In [60]:
# Create a table with the null hypothesis probabilities
uniform = Table().with_columns(
    "Change", make_array('Increase', 'Decrease'),
    "Chance", make_array(0.5,        0.5))
uniform

Change,Chance
Increase,0.5
Decrease,0.5


In [57]:
# Use the Table.sample_from_distribution() method to simulate the null hypothesis
sample = uniform.sample_from_distribution('Chance', 100)
sample

Change,Chance,Chance sample
Increase,0.5,51
Decrease,0.5,49


In [58]:
# Compare the total positive and negative instances
increases = sample.column("Chance sample").item(0)  
decreases = sample.column("Chance sample").item(1)  
print("+", increases,"-",decreases)

+ 51 - 49


## Where this is going...
As you can probably guess, we are going to look at the number of temperature increases and decreases in the data for different countries around the globe. We will undoubtedly see more increases than decreases, but how likely is that to be a matter of random chance? Is the difference **significant**?
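
One way the question might be approached is sketched below, using NumPy's binomial sampler rather than the `datascience` table shown earlier. The observed count here is a made-up placeholder and not taken from any data set; the seed and the number of repetitions are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (an assumption) for repeatability

n_changes = 100          # changes per simulated sample, as in the cell above
observed_increases = 60  # HYPOTHETICAL value for illustration, not real data
repetitions = 10_000     # hypothetical number of simulated samples

# Under the null hypothesis each change is an increase with chance 0.5
simulated_increases = rng.binomial(n_changes, 0.5, size=repetitions)

# Fraction of simulated samples with at least as many increases as observed
p_value = np.count_nonzero(simulated_increases >= observed_increases) / repetitions
print(p_value)
```

A small fraction would suggest that a count as large as the observed one rarely arises from random fluctuations alone.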