### Hypothesis testing
Elements of Data Science

## Hypothesis Testing Learning Goals
Develop and test a hypothesis
- Hypothesis
    - testable hypothesis
    - statistic
- Simulation: Empirical distribution
    - Repeat and collect outcomes
    - Iteration: 
        `for i in np.arange(samples)`
- Examine resulting distribution of outcomes
    - Probability distribution
    - Uncertainty
- p-value
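
The simulation steps listed above can be sketched as follows. This is a minimal illustration rather than the course's own code: the fair-coin model, the names `tosses`, `samples`, and `outcomes`, and the fixed seed are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption) so reruns match

tosses = 10       # size of one simulated experiment (illustrative)
samples = 5000    # number of repetitions (illustrative)

# Repeat the experiment and collect one statistic per repetition
outcomes = np.zeros(samples, dtype=int)
for i in np.arange(samples):
    flips = rng.integers(0, 2, size=tosses)   # 0 = tails, 1 = heads
    outcomes[i] = flips.sum()                 # statistic: number of heads

# Empirical distribution: proportion of repetitions that gave each value
heads, counts = np.unique(outcomes, return_counts=True)
proportions = counts / samples
for h, p in zip(heads, proportions):
    print(h, round(p, 3))
```

The spread of `outcomes` around 5 heads is what the "Uncertainty" item refers to: even a fair coin rarely gives exactly half heads.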

In [1]:
import numpy as np
from datascience import *

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Slicing from the left and right ends of an array

```
         0  1  2  3  4    index counting from left to right
data = [ 2, 3, 9, 5, 4]
        -5 -4 -3 -2 -1    index counting from right to left
        
One way to think of slicing is to think of the indices as cutting points.

 +---+---+---+---+---+---+
 | p | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1

 seq = list('python')
 seq[0:3]   yields ['p', 'y', 't']
 seq[-3:-1] yields ['h', 'o']

```
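
As a small check of the cutting-point picture, the sketch below converts negative indices to their positive equivalents by adding `len(seq)`. The helper name `to_positive` is hypothetical and is introduced only for this illustration.

```python
seq = list('python')

def to_positive(index, length):
    """Map a negative index or cut point to its positive equivalent."""
    return index + length if index < 0 else index

n = len(seq)

# seq[-3:-1] cuts at the same points as seq[3:5]
start, stop = to_positive(-3, n), to_positive(-1, n)
print(start, stop)                          # 3 5
print(seq[-3:-1] == seq[start:stop])        # True

# A single negative index picks the same element as its positive twin
print(seq[-1] == seq[to_positive(-1, n)])   # True
```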

In [31]:
seq = list('python')
seq

['p', 'y', 't', 'h', 'o', 'n']

In [32]:
seq[0:3]

['p', 'y', 't']

In [46]:
seq[-3:-1]

['h', 'o']

In [42]:
seq[-1]

'n'

In [41]:
seq[-1:6]

['n']

In [43]:
seq[-5:-3]

['y', 't']

In [44]:
# Negative indices work on any list, e.g. a list of strings
ice_cream_flavors = ['Vanilla', 'Chocolate', 'Strawberry', 'Mint Chip', 'Peach']
ice_cream_flavors[2]

'Strawberry'

In [45]:
ice_cream_flavors[-3]

'Strawberry'

**Students: Try slicing ice_cream_flavors using positive and negative indices**

## Hypothesis Testing

#### Test statistic differences


In [8]:
values = make_array(1.0, 1.5, 1.7, 2.0, 1.8)
values[1:]

array([ 1.5,  1.7,  2. ,  1.8])

In [7]:
values = make_array(1.0, 1.5, 1.7, 2.0, 1.8)
values[1:] - values[:-1]

array([ 0.5,  0.2,  0.3, -0.2])
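
The shifted-slice subtraction above computes the difference between each element and the one before it. NumPy's `np.diff` computes the same quantity, and the sketch below checks that the two agree. It uses `np.array` instead of `make_array` so that it runs without the `datascience` package; that substitution is made only for this example.

```python
import numpy as np

values = np.array([1.0, 1.5, 1.7, 2.0, 1.8])

by_slicing = values[1:] - values[:-1]   # each element minus the one before it
by_np_diff = np.diff(values)            # NumPy's built-in first difference

print(by_slicing)
print(np.allclose(by_slicing, by_np_diff))   # True
```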

In [4]:
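def diff_n(values, n):
    """Return values[i + n] - values[i] for every valid i (requires n >= 1)."""
    values = np.array(values)   # convert once so plain lists work too
    return values[n:] - values[:-n]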

diff_n(make_array(1.0, 1.5, 1.7, 2.0, 1.8), 2)

array([ 0.7,  0.5,  0.1])
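
One caveat about `diff_n`: for `n = 0` the slice `[:-0]` is the same as `[:0]`, which is empty, so the subtraction fails instead of returning zeros. The sketch below shows one possible guard. The name `diff_n_checked` and the choice to raise `ValueError` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def diff_n_checked(values, n):
    """Like diff_n, but also valid for n == 0 (hypothetical variant)."""
    values = np.asarray(values)
    if n < 0 or n > len(values):
        raise ValueError("n must be between 0 and len(values)")
    # values[:len(values) - n] avoids the values[:-0] pitfall when n == 0
    return values[n:] - values[:len(values) - n]

print(diff_n_checked([1.0, 1.5, 1.7, 2.0, 1.8], 2))
print(diff_n_checked([1.0, 1.5, 1.7], 0))   # all zeros: each value minus itself
```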

In [None]:
positive = np.count_nonzero(diff_n(make_array(1.0, 1.5, 1.4, 1.2, 2.0, 1.8), 2) > 0)
positive

In [None]:
negative = np.count_nonzero(diff_n(make_array(1.0, 1.5, 1.4, 1.2, 2.0, 1.8), 2) < 0)
negative
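
The two cells above type the same array twice. As a sketch, the differences can be computed once and then counted by sign. The names `lag`, `differences`, and `unchanged` are illustrative, and plain NumPy is used in place of `make_array` so the example runs on its own.

```python
import numpy as np

values = np.array([1.0, 1.5, 1.4, 1.2, 2.0, 1.8])
lag = 2

# Same computation as diff_n(values, 2)
differences = values[lag:] - values[:-lag]

positive = np.count_nonzero(differences > 0)     # increases
negative = np.count_nonzero(differences < 0)     # decreases
unchanged = np.count_nonzero(differences == 0)   # ties count toward neither

print(positive, negative, unchanged)   # 3 1 0
```

Counting ties separately makes it visible when `positive + negative` is smaller than the number of differences.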

In [None]:
!pip install meteostat
from datetime import datetime
import matplotlib.pyplot as plt
from meteostat import Point, Daily

In [None]:
# Set time period
start = datetime(2021, 1, 1)
end = datetime(2021, 12, 31)
# Create Point for SERC, Philadelphia
location = Point(39.9816, -75.153, 70)
# Get daily data for 2021
data = Daily(location, start, end)
data = data.fetch()['tavg'].values  # daily average temperature in Celsius
data = (data * 9 / 5) + 32          # convert Celsius to Fahrenheit
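
The last line of the cell above applies the conversion F = C * 9/5 + 32. The sketch below wraps that formula in a function and tries it on a few hand-picked values instead of downloaded data. The function name `celsius_to_fahrenheit` and the sample readings are illustrative. The `nan` entry stands in for a missing reading, on the assumption that a downloaded series may contain gaps.

```python
import numpy as np

def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit; NaN (missing) values stay NaN."""
    return np.asarray(celsius, dtype=float) * 9 / 5 + 32

sample_c = np.array([0.0, 100.0, -40.0, np.nan])   # made-up readings
sample_f = celsius_to_fahrenheit(sample_c)
print(sample_f)   # 32, 212, -40, nan

# nanmean skips missing readings instead of returning nan
print(np.nanmean(sample_f))
```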

In [None]:
# Requires `rates` (an array of values) and `years` (an integer lag) to be defined first
differences = diff_n(rates, years)

### Inference and Climate Change

In [None]:
uniform = Table().with_columns(
    "Change", make_array('Increase', 'Decrease'),
    "Chance", make_array(0.5,        0.5))
uniform.sample_from_distribution('Chance', 100)

In [None]:
sample = uniform.sample_from_distribution('Chance', 100)
increases = sample.column("Chance sample").item(0)  
decreases = sample.column("Chance sample").item(1)  
print("+", increases,"-",decreases)

In [None]:
sample
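
To turn the 50/50 model above into a test, one option is to simulate the number of increases many times under that model and ask how often the simulated count is at least as large as an observed count. The sketch below does this with NumPy alone. The observed count of 60 is a made-up placeholder, not a result from the temperature data, and `repetitions` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed (an assumption) for repeatability

sample_size = 100          # matches the 100 draws used above
repetitions = 10_000       # number of simulated samples (illustrative)
observed_increases = 60    # placeholder value, NOT taken from real data

# Each entry is the number of 'Increase' outcomes in one simulated sample
simulated = rng.binomial(n=sample_size, p=0.5, size=repetitions)

# Empirical p-value: share of simulations at least as extreme as observed
p_value = np.count_nonzero(simulated >= observed_increases) / repetitions
print(p_value)
```

A small value means a count this large is rarely produced by the 50/50 model alone.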

### Testing differences

In [None]:
diff_n(make_array(1, 10, 100, 1000, 10000), 2)
