# Hypothesis Testing

## Scenarios

- Chemistry - do inputs from two different barley fields produce different
yields?
- Astrophysics - do star systems with near-orbiting gas giants have hotter
stars?
- Economics - demography, surveys, etc.
- Medicine - BMI vs. Hypertension, etc.
- Business - which ad is more effective given engagement?

![img1](./img/img1.png)

![img2](./img/img2.png)

### Null Hypothesis / Alternative Hypothesis Structure

<img src="img/img3.png" width=350>

### The Null Hypothesis

![gmork](https://vignette.wikia.nocookie.net/villains/images/2/2f/Ogmork.jpg/revision/latest?cb=20120217040244)  
There is NOTHING, **no** difference.

### The Alternative hypothesis

![difference](./img/giphy.gif)

### Error

- Type I: false positive (incorrectly rejecting a true null hypothesis)
- Type II: false negative (incorrectly failing to reject a false null hypothesis)
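
A minimal simulation sketch of the Type I error, assuming normally distributed data and a significance level of 0.05 (the seed, sample sizes, and variable names are illustrative choices, not part of the lesson). Every simulated experiment has a true null hypothesis, so each rejection is a false positive, and the share of rejections should land near the chosen significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05                     # assumed significance level
n_experiments, n = 2000, 20      # illustrative sizes

# Every row is one experiment in which the null hypothesis (mean = 0) is TRUE.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))

# One-sample t-test per row; any rejection here is a false positive.
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
type_1_rate = np.mean(p_values < alpha)

print(f"share of false positives: {type_1_rate:.3f} (alpha = {alpha})")
```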

### Choosing the right error rate

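- Alpha, α: the significance level, i.e. the Type I error rate you are willing to accept
- Sigma, σ: some fields, such as particle physics, state the threshold as a number of standard deviations (e.g. 5σ)
- Depends on the field of study, 0.00001 ≤ α ≤ 0.2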

### T-test

Why use it?
- Sometimes the population standard deviation is irrelevant, and sometimes it’s
unknown (we’ll get to the different types of t-test later).
- Sometimes a sample is too small to be confident that it’s an accurate representation of reality.

### T vs Z (again)

A t-test is like a modified z-test:
- Penalize for small sample size - “degrees of freedom”
- Use sample std. dev. s to estimate population σ
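
The small-sample penalty can be made concrete by comparing two-sided cutoffs. This is a sketch assuming a significance level of 0.05; the list of degrees of freedom is an arbitrary illustration. The t cutoff is always wider than the z cutoff and approaches it as the degrees of freedom grow.

```python
from scipy import stats

alpha = 0.05                             # assumed significance level
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided z cutoff, about 1.96

t_crits = {}
for df in (2, 5, 10, 30, 100):           # illustrative degrees of freedom
    t_crits[df] = stats.t.ppf(1 - alpha / 2, df)
    print(f"df={df:>3}  t cutoff={t_crits[df]:.3f}  z cutoff={z_crit:.3f}")
```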

<img src="img/img5.png" width=500>

### T and Z in detail
<img src="img/img4.png" width=500>

### T-value table

<img src="img/img6.png" width=500>

### P-Values
<img src="https://imgs.xkcd.com/comics/significant.png" width=500>

[Source](https://xkcd.com/882/)

### Language of Hypothesis Testing

If p < α : we *reject* the null hypothesis<br>
If p > α : we *fail to reject* the null hypothesis


Language is **important**

### What if the experiment fails?

- Don’t throw out failed experiments
- This methodology, with this data, does not produce significant results
 - More data
 - More time
 - More details

### T-test success recipe

Regardless of the type of t-test you are performing, there are 5 main steps:

- Set up the null and alternative hypotheses

- Choose a significance level

- Calculate the test statistic

- Determine the critical value or the p-value (find the rejection region)

- Compare the t-value with the critical t-value (or the p-value with α) to reject or fail to reject the null hypothesis.
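
The five steps can be sketched for a two-sided one-sample test as follows. The function name `one_sample_t_decision` and the default `alpha=0.05` are hypothetical choices for illustration; the data is the small sample used in Question 1.

```python
from statistics import mean, stdev
from scipy import stats

def one_sample_t_decision(data, mu, alpha=0.05):
    """Two-sided one-sample t-test decided via the rejection region."""
    # Steps 1-2: H0: mean == mu, Ha: mean != mu; alpha is passed in.
    n = len(data)
    df = n - 1
    # Step 3: test statistic.
    t = (mean(data) - mu) / (stdev(data) / n**0.5)
    # Step 4: critical value that bounds the two-sided rejection region.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Step 5: compare, and phrase the outcome carefully.
    decision = "reject H0" if abs(t) > t_crit else "fail to reject H0"
    return t, t_crit, decision

t_stat, t_crit, decision = one_sample_t_decision([90, 100, 110], mu=85)
print(f"t = {t_stat:.3f}, critical t = {t_crit:.3f}: {decision}")
```

With only three observations the cutoff is wide, so a t-value of about 2.6 is not enough to reject at this level.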

# Question 1
Is this sample any different from the population?
- Population mean = 85
- Sample = [90,100,110]

#### Using `scipy`

In [None]:
# H0 = there is no difference in our sample vs. population
# Ha = there is a difference between our sample and the population

In [1]:
from scipy.stats import ttest_1samp
data = [90, 100, 110]
ttest_1samp(data, 85)

Ttest_1sampResult(statistic=2.5980762113533156, pvalue=0.12168993434632014)

#### Manual implementation

In [2]:
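from statistics import mean, stdev

data = [90, 100, 110]
mu = 85              # population mean under H0
n = len(data)
s = stdev(data)      # sample standard deviation (n - 1 in the denominator)
df = n - 1           # degrees of freedom

# t = (sample mean - population mean) / standard error of the mean
t = (mean(data) - mu) / (s / n**0.5)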

In [3]:
print(t)
print(df)

2.5980762113533156
2
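
The manual version stops at `t` and `df`. As a sketch, the two-sided p-value can be recovered from them with the t distribution's survival function and checked against `ttest_1samp`; the names `p_manual` and `p_scipy` are introduced here only for illustration.

```python
from statistics import mean, stdev
from scipy import stats

data = [90, 100, 110]
mu = 85
n = len(data)
df = n - 1
t = (mean(data) - mu) / (stdev(data) / n**0.5)

# sf is the survival function (1 - cdf): the area in the upper tail beyond |t|.
# Doubling it gives the two-sided p-value.
p_manual = 2 * stats.t.sf(abs(t), df)
p_scipy = stats.ttest_1samp(data, mu).pvalue

print(p_manual, p_scipy)
```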


# Question 2

I'm buying jeans from store A and store B. I know nothing about their inventory other than prices. Should I go to just one store for a less expensive pair of jeans?
I'm pretty apprehensive about this big decision, so alpha = 0.10.

Try this both manually and with scipy

- [20,30,30,50,75,25,30,30,40,80]
- [60,30,70,90,60,40,70,40]

In [4]:
from scipy.stats import ttest_ind

In [5]:
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]

In [8]:
ttest_ind(store1, store2, equal_var=False).pvalue/2

0.05342518984181651

In [9]:
# one-sided p = 0.053 < alpha = 0.10, so we reject H0:
# store 1 is cheaper on average than store 2
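
For the manual half of the exercise, here is a sketch of Welch's unequal-variance t-test, which is what `equal_var=False` selects. The degrees of freedom use the Welch-Satterthwaite approximation, and the variable names are illustrative.

```python
from statistics import mean, variance
from scipy import stats

store1 = [20, 30, 30, 50, 75, 25, 30, 30, 40, 80]
store2 = [60, 30, 70, 90, 60, 40, 70, 40]
n1, n2 = len(store1), len(store2)

# Squared standard error contributed by each sample.
v1, v2 = variance(store1) / n1, variance(store2) / n2

t_welch = (mean(store1) - mean(store2)) / (v1 + v2) ** 0.5
# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Ha: store 1 is cheaper, so the p-value is the area in the lower tail.
p_one_sided = stats.t.cdf(t_welch, df_welch)
p_scipy_half = stats.ttest_ind(store1, store2, equal_var=False).pvalue / 2

print(t_welch, df_welch, p_one_sided, p_scipy_half)
```

Halving the two-sided p-value matches the lower-tail area only because the statistic is negative, that is, the difference points in the direction of the alternative.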

In [10]:
import numpy as np

In [11]:
from scipy import stats
# np.random.seed(12345678)
# Test with sample with identical means:
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)
stats.ttest_ind(rvs1, rvs2)

Ttest_indResult(statistic=1.0509128710979914, pvalue=0.2935530069735938)


# Question 3
Given the same data as in Question 1, how many more samples would you need to achieve p = 0.01, assuming the sample mean and sample std. dev. do not change?

In [13]:
data = [90,100,110]
mu = 85
n = len(data)
s = stdev(data)
df = n-1

t = (100-85)/(s/(n**.5))

In [14]:
print(t)

2.5980762113533156


In [15]:
# hold the sample mean (100) and s fixed; only n changes
for n in range(3, 10):
    df = n - 1
    t = (100 - 85) / (s / (n**.5))
    print(df, t)

2 2.5980762113533156
3 3.0
4 3.3541019662496843
5 3.674234614174767
6 3.968626966596886
7 4.242640687119286
8 4.5
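
The loop above lists t for each n but not the p-value that the question asks about. A sketch that adds the two-sided p-value, assuming (as the question does) that the sample mean and standard deviation stay fixed; `target_p` and `first_n` are names introduced for illustration.

```python
from statistics import mean, stdev
from scipy import stats

data = [90, 100, 110]
mu = 85
x_bar, s = mean(data), stdev(data)   # assumed to stay fixed as n grows
target_p = 0.01

first_n = None
for n in range(3, 10):
    df = n - 1
    t = (x_bar - mu) / (s / n**0.5)
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
    print(n, df, round(t, 3), round(p, 4))
    if first_n is None and p < target_p:
        first_n = n

print("smallest n with p < 0.01:", first_n, "->", first_n - len(data), "more samples")
```

Under these assumptions the two-sided p-value first drops below 0.01 at n = 7, i.e. four more samples.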


# Using t-tests to test hypotheses about means

In [17]:
import pandas as pd

In [18]:
df = pd.read_csv('../day-2-hypothesis-testing/data/WA_Fn-UseC_-Telco-Customer-Churn.csv')

[Link to the dataset](https://www.kaggle.com/blastchar/telco-customer-churn)

__Your Turn__

1. Find how many different values there are in the PaymentMethod column.


In [21]:
df.PaymentMethod.value_counts()

Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64

__Your Turn__

1. Select one of the categories above in PaymentMethod; we will investigate whether its data is statistically significantly different from the national data.

2. Suppose we know that the nationwide average monthly spending for the service is $70, but we don't know the standard deviation. Construct a hypothesis test for whether spending for the chosen PaymentMethod differs from the national average.
  - hint: use `scipy.stats.ttest_1samp`

3. In our case we will focus on PaymentMethod == `'Mailed check'`, but you can work with the others too.

$H_{a}$: Average spending by mailed check is different from the national average spending

$H_{0}$: There is no difference between average spending by mailed check and the national average

$\alpha$: 0.05


In [28]:
sample = df.loc[df.PaymentMethod == 'Mailed check'].MonthlyCharges

In [27]:
df.loc[df.PaymentMethod == 'Mailed check'].MonthlyCharges

1       56.95
2       53.85
7       29.75
10      49.95
16      20.65
        ...  
7019    20.15
7027    73.35
7030    20.05
7038    84.80
7041    74.40
Name: MonthlyCharges, Length: 1612, dtype: float64

In [29]:
ttest_1samp(sample, 70)

Ttest_1sampResult(statistic=-39.79616513656546, pvalue=8.78311106869432e-242)

In [30]:
sample.mean()

43.917059553349915

__Your Turn__

1. From the DataFrame `df`, get the rows with `SeniorCitizen == 1` and keep them in a variable called `seniors`.

2. Keep the other records in a variable called `others`.

3. Check how many observations we have in each sample.

In [40]:
seniors = df.loc[df.SeniorCitizen == 1].MonthlyCharges
len(seniors)

1142

In [41]:
others = df.loc[df.SeniorCitizen == 0].MonthlyCharges
len(others)

5901

__Your Turn__

1. Now we would like to compare the MonthlyCharges for seniors and others.

I hypothesize that seniors have a lower average MonthlyCharges than others.

2. Write a hypothesis test that checks this claim.


$H_{a}:$ seniors have a lower average MonthlyCharges than others

$H_{0}:$ seniors have an average MonthlyCharges higher than or equal to that of others

$\alpha:$ 0.05

Now we will test our hypothesis using a two-sample t-test (`scipy.stats.ttest_ind`).

In [42]:
ttest_ind(seniors, others, equal_var=False)

Ttest_indResult(statistic=22.288279118400933, pvalue=3.826212668910673e-98)

In [None]:
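# two-sided pvalue < alpha, but the statistic is positive (+22.29):
# the seniors' mean is higher than the others' mean, the opposite of Ha.
# For Ha "seniors < others" the one-sided p-value is 1 - pvalue/2, about 1,
# so we fail to reject H0: no evidence that seniors have lower MonthlyCharges.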

In [43]:
seniors.mean()

79.82035901926453

In [44]:
others.mean()

61.84744111167598
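
A directional hypothesis can also be tested directly instead of reading the sign of the statistic by hand. The sketch below assumes SciPy 1.6 or newer, where `ttest_ind` accepts an `alternative` argument, and uses synthetic stand-in data (not the Telco dataset) built so that the first group has the higher mean, mirroring the group means printed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins, NOT the Telco data: the first group is built
# to have the higher mean, as the group means above suggest.
seniors_demo = rng.normal(loc=80, scale=25, size=1142)
others_demo = rng.normal(loc=62, scale=30, size=5901)

# Two-sided: is there any difference between the means?
two_sided = stats.ttest_ind(seniors_demo, others_demo, equal_var=False)

# One-sided, Ha: mean(seniors_demo) < mean(others_demo).
less = stats.ttest_ind(
    seniors_demo, others_demo, equal_var=False, alternative="less"
)

print("two-sided p:", two_sided.pvalue)
print("one-sided (less) p:", less.pvalue)
```

The two-sided p-value is tiny while the one-sided p-value for "less" is close to 1: a significant difference in the wrong direction gives no support to the directional claim.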