# Hypothesis Testing

We would like to know if the effects we see in the sample (observed data) are likely to occur in the population. 

The way classical hypothesis testing works is by conducting a statistical test to answer the following question:
> Given the sample and an effect, what is the probability of seeing that effect just by chance?

The procedure has four steps:

1. Compute the test statistic
2. Define the null hypothesis
3. Compute the p-value
4. Interpret the result

If the p-value is very low (more often than not, below 0.05), the effect is considered statistically significant. That means the effect is unlikely to have occurred by chance. The inference? The effect is likely to be seen in the population too.

This process is very similar to the *proof by contradiction* paradigm. We first assume that the effect is not real. That's the null hypothesis. The next step is to compute the probability of obtaining an effect at least as large as the observed one under that assumption (the p-value). If the p-value is very low (< 0.05 as a rule of thumb), we reject the null hypothesis.
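
As a minimal sketch of these four steps, consider a hypothetical experiment that is not part of the dataset used below: a coin lands heads 60 times in 100 flips. The null hypothesis assumed here is that the coin is fair, the test statistic is the distance of the head count from 50, and the p-value is computed exactly from the binomial distribution with `scipy.stats.binom`.

```python
from scipy import stats

# Hypothetical experiment (illustration only): 60 heads in 100 flips
n_flips = 100
n_heads = 60

# Step 1: test statistic = how far the head count is from the expected 50
observed_effect = abs(n_heads - n_flips / 2)

# Step 2: null hypothesis = the coin is fair (probability of heads is 0.5)
null_dist = stats.binom(n_flips, 0.5)

# Step 3: p-value = probability, under the null hypothesis, of an effect
# at least as large as the observed one (two-sided: <= 40 or >= 60 heads)
lower_tail = null_dist.cdf(n_flips / 2 - observed_effect)
upper_tail = null_dist.sf(n_flips / 2 + observed_effect - 1)
p_value = lower_tail + upper_tail

# Step 4: interpret the result against the chosen significance level
alpha = 0.05
print("p-value:", p_value)
print("statistically significant:", p_value < alpha)
```

In this example the p-value comes out slightly above 0.05, so at that threshold the null hypothesis of a fair coin would not be rejected.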

Recommended reading:
https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

In [1]:
# Libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

## Data preparation

In [2]:
# Read the input file
# Data extracted from DGT (Direccion General de Trafico) statistics and indicators webpage
# https://www.dgt.es/es/seguridad-vial/estadisticas-e-indicadores/matriculaciones-definitivas/tablas-estadisticas/
df_cars_brand = pd.read_csv('data/matriculacions_turismes_2020.txt',sep='\t',encoding='latin-1')

In [3]:
df_cars_brand.head()

Unnamed: 0,PROVINCIAS,A_BRUNS_LINDER,A.M.C.,AC_CARS,ACURA,ADRIA,ALFA_ROMEO,ALLIED_VEHICLES_LTD,ALPINA,ALPINE,...,VOLKNER,VOLKSWAGEN,VOLKSWAGEN_AG,VOLKSWAGEN_V_W,VOLVO,VW-PORSCHE,WESTFIELD,WIESMANN,WILLYS_OVERLAND,WILLYS_VIASA
0,Araba/Álava,0,0,0,0,0,10,0,0,0,...,0,289,0,18,157,0,0,0,0,0
1,Albacete,0,0,0,0,0,142,0,0,0,...,0,442,0,9,98,0,0,0,0,0
2,Alicante/Alacant,0,0,0,1,1,58,0,0,0,...,0,2439,0,9,298,0,0,0,0,0
3,Almería,0,0,0,0,0,6,0,0,0,...,1,733,2,35,150,0,0,0,0,0
4,Ávila,0,0,0,0,0,24,0,0,0,...,0,10,0,1,77,0,0,0,0,0


## Hypothesis testing

### One sample t-test

The one-sample t-test determines whether the sample mean is statistically different from a known or hypothesised population mean. It is a parametric test.

- H0: the average sales equal 0
- H1: the average sales differ from 0

In [5]:
# We use the sales of one brand in the provinces whose name sorts up to 'Lleida' (string comparison)
data = df_cars_brand[df_cars_brand['PROVINCIAS']<='Lleida']['ADRIA']
#data = df_cars_brand[df_cars_brand['PROVINCIAS']<='Lleida']['VOLKSWAGEN']

In [6]:
data.describe()

count    28.000000
mean      0.285714
std       1.329359
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       7.000000
Name: ADRIA, dtype: float64

In [7]:
print(data)
data_mean = np.mean(data)
print(data_mean)
tstat, pval = stats.ttest_1samp(data, 0)
print("p-values",pval)
alpha = 0.05

if pval < alpha:    # alpha value is 0.05 or 5%
    print("We reject null hypothesis with confidence {}".format(1-alpha))
else:
    print("We don't reject null hypothesis with confidence {}".format(1-alpha))

0     0
1     0
2     1
3     0
5     0
6     0
7     7
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
32    0
38    0
47    0
50    0
Name: ADRIA, dtype: int64
0.2857142857142857
p-values 0.2654124768964152
We don't reject null hypothesis with confidence 0.95
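
The p-value above can be checked by hand. The sketch below rebuilds the 28 printed values (26 zeros, one 1 and one 7), computes the t statistic as the sample mean minus the hypothesised mean, divided by the standard error, and compares the result with `stats.ttest_1samp`. The names `t_manual` and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

# The 28 values printed above: 26 zeros, one 1 and one 7
data = np.array([0] * 26 + [1, 7])

n = len(data)
sample_mean = data.mean()
sample_std = data.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)

# t statistic for H0: population mean = mu0
mu0 = 0
t_manual = (sample_mean - mu0) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test with scipy
t_scipy, p_scipy = stats.ttest_1samp(data, mu0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both lines should agree, and the p-value should match the one printed by the cell above.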


### Two sample t-test

The independent samples t-test, or two-sample t-test, compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. It is a parametric test, also known as the independent t-test.

- H0: average sales of brand 1 = average sales of brand 2
- H1: average sales of brand 1 ≠ average sales of brand 2

In [17]:
# We use data from all 52 provinces
data1 = df_cars_brand['VOLVO']
data2 = df_cars_brand['ALFA_ROMEO']

In [13]:
print(data1.describe())
print(data2.describe())

count     28.000000
mean     147.678571
std      132.239696
min        6.000000
25%       73.500000
50%      114.500000
75%      191.250000
max      629.000000
Name: VOLVO, dtype: float64
count     52.000000
mean      56.019231
std      139.562832
min        0.000000
25%        2.000000
50%       17.500000
75%       33.500000
max      881.000000
Name: ALFA_ROMEO, dtype: float64


In [14]:
ttest,pval = stats.ttest_ind(data1,data2)
print("p-value",pval)
alpha = 0.05

if pval < alpha:    # alpha value is 0.05 or 5%
    print("We reject null hypothesis with confidence {}".format(1-alpha))
else:
    print("We don't reject null hypothesis with confidence {}".format(1-alpha))

p-value 0.005546748269408945
We reject null hypothesis with confidence 0.95
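
By default `stats.ttest_ind` assumes that both populations have the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The following sketch compares the two variants on synthetic samples, which are hypothetical and not taken from the DGT file.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical data, not taken from the DGT file)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=100, scale=10, size=30)
sample_b = rng.normal(loc=105, scale=40, size=50)

# Default: Student's t-test, assumes both populations share the same variance
t_pooled, p_pooled = stats.ttest_ind(sample_a, sample_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)

print("pooled:", t_pooled, p_pooled)
print("Welch: ", t_welch, p_welch)
```

The two p-values generally differ when the sample sizes and variances are unequal, as they are in this synthetic example.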


### Paired sample t-test

The paired sample t-test is also called the dependent sample t-test. It is a univariate test that checks for a significant difference between two related variables. An example would be measuring the blood pressure of the same individuals before and after some treatment, condition, or time point.

- H0: mean difference between two samples is 0
- H1: mean difference between two samples is not 0

In [19]:
# We use data from all 52 provinces
data1 = df_cars_brand['VOLVO']
data2 = df_cars_brand['ALFA_ROMEO']
print(data1.mean())
print(data2.mean())

236.40384615384616
56.01923076923077


In [20]:
ttest,pval = stats.ttest_rel(data1, data2)
print(pval)
alpha = 0.05

if pval < alpha:    # alpha value is 0.05 or 5%
    print("We reject null hypothesis with confidence {}".format(1-alpha))
else:
    print("We don't reject null hypothesis with confidence {}".format(1-alpha))

0.03358221107201263
We reject null hypothesis with confidence 0.95
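
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which is why H0 is stated in terms of the mean difference. A minimal sketch with synthetic before/after measurements (hypothetical data, in the spirit of the blood pressure example):

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements (hypothetical): the same 20 subjects, before and after
rng = np.random.default_rng(1)
before = rng.normal(loc=120, scale=15, size=20)
after = before - rng.normal(loc=3, scale=5, size=20)

# Paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(before, after)

# One-sample t-test on the differences, against a mean difference of 0
t_diff, p_diff = stats.ttest_1samp(before - after, 0)

print("paired:      ", t_paired, p_paired)
print("differences: ", t_diff, p_diff)
```

Both calls should print the same statistic and p-value, up to floating-point rounding.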


In [22]:
# Note: the paired t-test requires the two samples to have the same length
data1 = df_cars_brand[df_cars_brand['PROVINCIAS']<='Lleida']['VOLVO']
data2 = df_cars_brand['ALFA_ROMEO']

ttest,pval = stats.ttest_rel(data1, data2)
print(pval)
alpha = 0.05

if pval < alpha:    # alpha value is 0.05 or 5%
    print("We reject null hypothesis with confidence {}".format(1-alpha))
else:
    print("We don't reject null hypothesis with confidence {}".format(1-alpha))

ValueError: unequal length arrays
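
One way to avoid this error is to build the row filter once and apply it to both columns, so that the two samples stay aligned province by province. The sketch below uses a small hand-built frame containing only the five rows shown by `df_cars_brand.head()`, so the resulting p-value is illustrative only.

```python
import pandas as pd
from scipy import stats

# The first five rows shown by df_cars_brand.head(), rebuilt by hand
df = pd.DataFrame({
    'PROVINCIAS': ['Araba/Álava', 'Albacete', 'Alicante/Alacant', 'Almería', 'Ávila'],
    'ALFA_ROMEO': [10, 142, 58, 6, 24],
    'VOLVO': [157, 98, 298, 150, 77],
})

# Build the filter once; string comparison is by code point,
# so 'Ávila' (accented capital) sorts after 'Lleida' and is excluded
mask = df['PROVINCIAS'] <= 'Lleida'

# Apply the same filter to both columns so the samples have equal length
data1 = df.loc[mask, 'VOLVO']
data2 = df.loc[mask, 'ALFA_ROMEO']

tstat, pval = stats.ttest_rel(data1, data2)
print(len(data1), len(data2), pval)
```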

## Exercises

- How does the parameter alpha affect the results?

- One sample t-test
  * Find a brand for which the one sample test with 95% confidence gives a negative result (H0 cannot be rejected)
  * Find a brand for which the one sample test with 95% confidence gives a positive result (H0 is rejected)
  * Perform a one-sided one sample t-test. Hint: look at the documentation of the function at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html
  
- Two sample t-test
  * Find two brands for which the two sample t-test with 95% confidence gives a negative result (equality of means cannot be rejected)
  * Find two brands for which the two sample t-test with 95% confidence gives a positive result (equality of means is rejected)

- Paired sample t-test
  * Find two brands for which the paired sample t-test with 95% confidence gives a negative result (a mean difference of 0 cannot be rejected)
  * Find two brands for which the paired sample t-test with 95% confidence gives a positive result (a mean difference of 0 is rejected)
  
- Two sample t-test and paired sample t-test
  * What are the differences between the two sample test and the paired sample test?
  * Which one is more accurate?
  * Which one would you choose if possible?
  * Find two brands for which the two tests give the same result when comparing equality of means
  * Find two brands for which the two tests give a different result when comparing equality of means
  