# Hypothesis Testing

We would like to know if the effects we see in the sample (observed data) are likely to occur in the population. 

The way classical hypothesis testing works is by conducting a statistical test to answer the following question:
> Given the sample and an effect, what is the probability of seeing that effect just by chance?

The steps are:

1. Compute a test statistic
2. Define the null hypothesis
3. Compute the p-value
4. Interpret the result

If the p-value is very low (more often than not, below 0.05), the effect is considered statistically significant. That means the effect is unlikely to have occurred by chance. The inference? The effect is likely to be seen in the population too.

This process is very similar to the *proof by contradiction* paradigm. We first assume that the effect is not real. That is the null hypothesis. The next step is to compute the probability of obtaining an effect at least as large as the observed one under that assumption (the p-value). If the p-value is very low (< 0.05 as a rule of thumb), we reject the null hypothesis.

Recommended reading:
https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

In [1]:
# Libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

## Data preparation

In [2]:
# Read the input file
# Data extracted from DGT (Direccion General de Trafico) statistics and indicators webpage
# https://www.dgt.es/es/seguridad-vial/estadisticas-e-indicadores/matriculaciones-definitivas/tablas-estadisticas/
df_cars_brand = pd.read_csv('data/matriculacions_turismes_2020.txt',sep='\t',encoding='latin-1')

In [3]:
df_cars_brand.head()

Unnamed: 0,PROVINCIAS,A_BRUNS_LINDER,A.M.C.,AC_CARS,ACURA,ADRIA,ALFA_ROMEO,ALLIED_VEHICLES_LTD,ALPINA,ALPINE,...,VOLKNER,VOLKSWAGEN,VOLKSWAGEN_AG,VOLKSWAGEN_V_W,VOLVO,VW-PORSCHE,WESTFIELD,WIESMANN,WILLYS_OVERLAND,WILLYS_VIASA
0,Araba/Álava,0,0,0,0,0,10,0,0,0,...,0,289,0,18,157,0,0,0,0,0
1,Albacete,0,0,0,0,0,142,0,0,0,...,0,442,0,9,98,0,0,0,0,0
2,Alicante/Alacant,0,0,0,1,1,58,0,0,0,...,0,2439,0,9,298,0,0,0,0,0
3,Almería,0,0,0,0,0,6,0,0,0,...,1,733,2,35,150,0,0,0,0,0
4,Ávila,0,0,0,0,0,24,0,0,0,...,0,10,0,1,77,0,0,0,0,0


## Hypothesis testing

### One sample t-test

When working with a small sample (n < 30), the One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test.

- H0: average sales = 0
- H1: average sales ≠ 0
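
As a rough sketch of what the test computes: the statistic is t = (sample mean - hypothesised mean) / (sample standard deviation / sqrt(n)), and the two-sided p-value comes from the t distribution with n - 1 degrees of freedom. The snippet below checks this by hand against `stats.ttest_1samp`. The `sample` values are made up for illustration and are not taken from the DGT file.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, made up for illustration
sample = np.array([0, 0, 1, 0, 3, 0, 0, 2, 0, 1], dtype=float)
mu0 = 0.0  # population mean under H0

n = len(sample)
mean = sample.mean()
sd = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic computed by hand
t_manual = (mean - mu0) / (sd / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test through scipy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```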

In [9]:
# We use sales of one brand in the provinces whose name sorts at or before 'Lleida' (a small sample, n < 30)
data = df_cars_brand[df_cars_brand['PROVINCIAS']<='Lleida']['ADRIA']
#data = df_cars_brand[df_cars_brand['PROVINCIAS']<='Lleida']['VOLKSWAGEN']

In [11]:
data.describe()

count    28.000000
mean      0.285714
std       1.329359
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       7.000000
Name: ADRIA, dtype: float64

In [13]:
print(data)
data_mean = np.mean(data)
print(data_mean)
t_stat, pval = stats.ttest_1samp(data, 0)  # two-sided test of H0: mean = 0
print("p-values",pval)

if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we are rejecting null hypothesis")
else:
    print("we are not rejecting null hypothesis")

0     0
1     0
2     1
3     0
5     0
6     0
7     7
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
32    0
38    0
47    0
50    0
Name: ADRIA, dtype: int64
0.2857142857142857
p-values 0.2654124768964152
we are not rejecting null hypothesis


### Two-sample t-test

The independent samples t-test (or two-sample t-test) compares the means of two independent groups to determine whether there is statistical evidence that the associated population means differ. It is a parametric test, also known as the independent t-test.

- H0: average sales of brand 1 = average sales of brand 2
- H1: average sales of brand 1 ≠ average sales of brand 2

In [14]:
# We use data from all 52 provinces
data1 = df_cars_brand['ADRIA']
data2 = df_cars_brand['VOLKSWAGEN']

In [15]:
print(data1.describe())
print(data2.describe())

count    52.000000
mean      0.192308
std       0.990909
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       7.000000
Name: ADRIA, dtype: float64
count       52.000000
mean      1503.980769
std       4019.365141
min         10.000000
25%        286.750000
50%        603.500000
75%       1160.250000
max      28353.000000
Name: VOLKSWAGEN, dtype: float64


In [17]:
ttest,pval = stats.ttest_ind(data1,data2)
print("p-value",pval)
if pval <0.05:
    print("we reject null hypothesis")
else:
    print("we are not rejecting null hypothesis")

p-value 0.008166148033981843
we reject null hypothesis
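
The summary statistics above show very different spreads for the two brands (standard deviation of about 0.99 versus 4019.37), while `stats.ttest_ind` assumes equal population variances by default. One possible alternative is Welch's t-test, selected with `equal_var=False`. The sketch below compares both variants on synthetic data. The arrays `low_var` and `high_var` and their parameters are assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data, made up for illustration: different sizes and spreads
low_var = rng.normal(loc=0.0, scale=1.0, size=50)
high_var = rng.normal(loc=5.0, scale=40.0, size=20)

# Student's t-test: assumes equal population variances (the scipy default)
t_student, p_student = stats.ttest_ind(low_var, high_var)

# Welch's t-test: does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(low_var, high_var, equal_var=False)

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
```

On the notebook's data, the equivalent call would be `stats.ttest_ind(data1, data2, equal_var=False)`.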


### Paired sample t-test

The paired sample t-test is also called the dependent sample t-test. It is a univariate test for a significant difference between two related variables. An example would be measuring an individual's blood pressure before and after some treatment, condition, or time point.

- H0: mean difference between two samples is 0
- H1: mean difference between two samples is not 0
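
Because the hypotheses concern the mean of the pairwise differences, a paired t-test is equivalent to a one-sample t-test on those differences against 0. The sketch below illustrates this. The `before` and `after` values are hypothetical measurements made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same 8 subjects,
# made up for illustration
before = np.array([142, 138, 150, 145, 160, 155, 148, 152], dtype=float)
after = np.array([138, 135, 147, 146, 152, 150, 145, 149], dtype=float)

# Paired t-test on the two related samples
t_rel, p_rel = stats.ttest_rel(before, after)

# One-sample t-test on the pairwise differences, against a mean of 0
diff = before - after
t_one, p_one = stats.ttest_1samp(diff, 0.0)

print("paired:     ", t_rel, p_rel)
print("differences:", t_one, p_one)
```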

In [18]:
# We use data from all 52 provinces
data1 = df_cars_brand['ADRIA']
data2 = df_cars_brand['VOLKSWAGEN']

In [19]:
ttest,pval = stats.ttest_rel(data1, data2)
print(pval)
if pval<0.05:
    print("reject null hypothesis")
else:
    print("we are not rejecting null hypothesis")

0.009431386965847617
reject null hypothesis
