# Statistical Tests Intro

## Bivariate Student's T test
We may want to check whether two samples plausibly come from the same distribution or from different ones, that is, whether the difference between them is small enough to be explained by chance. Student's t-test does this by comparing the sample means.

This test's hypotheses are the following:
- H0 (null, or base, hypothesis): both datasets were sampled from the same distribution (in practice, from distributions with the same mean)
- H1 (alternative hypothesis): the datasets were sampled from distributions with different means

In [1]:
import pandas as pd

Read the daily temperatures measured in Rome from 1951 to 2009.

This is the content:

| column | meaning |
|--------|---------|
| SOUID | categorical: measurement source id |
| DATE | calendar day in YYYYMMDD format |
| TG | average temperature |
| Q_TG | categorical: quality tag, 9 = invalid |

In [2]:
roma = pd.read_csv("TG_SOUID100860.txt",skiprows=20)

The column names in this dataset include spaces, so we need to strip them.

In [3]:
roma.columns = list(map(str.strip,roma.columns))
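An equivalent way to strip the padding uses the pandas string accessor on the column index. The sketch below runs on a small in-memory CSV: the values are made up for illustration and only the column names mirror the dataset above.

```python
import io
import pandas as pd

# Made-up sample mimicking the padded header of the file; values are illustrative only.
raw = (
    "SOUID,    DATE,   TG, Q_TG\n"
    "100860,19510601,  215,    0\n"
    "100860,19510602,-9999,    9\n"
)
sample = pd.read_csv(io.StringIO(raw))

# Index.str.strip() removes leading and trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```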

In [4]:
roma.columns

Index(['SOUID', 'DATE', 'TG', 'Q_TG'], dtype='object')

In [5]:
roma.DATE = pd.to_datetime(roma.DATE,format="%Y%m%d")

In [6]:
roma["MONTH"] = roma.DATE.dt.month

In [7]:
roma["YEAR"] = roma.DATE.dt.year

In [8]:
roma_cleaned = roma.loc[roma.Q_TG != 9,:]

In [9]:
roma_giugno_1951 = roma_cleaned.loc[
    (roma_cleaned.YEAR == 1951) & (roma_cleaned.MONTH == 6),
    "TG"
]

In [10]:
roma_giugno_2009 = roma_cleaned.loc[
    (roma_cleaned.YEAR == 2009) & (roma_cleaned.MONTH == 6),
    "TG"
]

In [11]:
import scipy.stats

In [12]:
from scipy.stats import ttest_ind

In [13]:
ttest_ind(roma_giugno_1951,roma_giugno_2009)

TtestResult(statistic=np.float64(-2.167425930725216), pvalue=np.float64(0.03432071944797424), df=np.float64(58.0))
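The reported `df` of 58 is consistent with two samples of 30 days each (30 + 30 - 2), and a p-value of about 0.034 is below the common 0.05 threshold, so at that level H0 would be rejected. To see where these numbers come from, here is a minimal sketch that recomputes the pooled-variance t statistic by hand and compares it with `ttest_ind`. The two arrays are synthetic stand-ins (an assumption for illustration), not the Rome data.

```python
import numpy as np
from scipy.stats import ttest_ind, t as t_dist

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two monthly samples (assumed values, not real data).
a = rng.normal(loc=22.0, scale=2.0, size=30)
b = rng.normal(loc=23.0, scale=2.0, size=30)

n_a, n_b = len(a), len(b)
df = n_a + n_b - 2

# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df

# t statistic: difference of the means divided by its standard error.
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the Student's t distribution with df degrees of freedom.
p_value = 2 * t_dist.sf(abs(t_stat), df)

result = ttest_ind(a, b)  # default: equal variances assumed
print(t_stat, p_value)
print(result.statistic, result.pvalue)
```

`ttest_ind` assumes equal variances by default; passing `equal_var=False` switches to Welch's variant, which drops that assumption.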

## ANOVA
[Analysis of Variance](https://en.wikipedia.org/wiki/Analysis_of_variance) is a family of methods that extends Student's t-test; its goal is to test whether a factor makes a difference when it splits a set of measurements into groups.

It can be used, for example, to show whether a disease treatment is effective. The factor, or categorical independent variable, is the "input", while the health parameter is a continuous dependent variable, or "output".

We are going to show its simplest form, the one-way ANOVA with fixed effects.
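As a first taste, a one-way ANOVA can be run with `scipy.stats.f_oneway`. This is a minimal sketch on synthetic data: the three groups and their parameters are hypothetical and stand in for the outcome measured under three treatments.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical outcome measurements for three treatment groups (synthetic data).
placebo = rng.normal(loc=10.0, scale=2.0, size=40)
low_dose = rng.normal(loc=10.5, scale=2.0, size=40)
high_dose = rng.normal(loc=13.0, scale=2.0, size=40)

# H0: all groups share the same mean; a small p-value is evidence against it.
f_stat, p_value = f_oneway(placebo, low_dose, high_dose)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

With only two groups this test gives the same p-value as the equal-variance two-sample t-test, which is the sense in which ANOVA extends it.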

### (Optional) How it works
The main idea is to see whether a grouping is relevant in explaining the global variance.

To understand this test, we can derive a formula that splits the global variance into two contributions: the variance within each subgroup and the variance between the subgroups.

Let's start with the usual variance definition

\begin{equation}
var[X] := \frac{\sum_{x \in X}{(x - \bar{x})^2}}{n - 1}
\end{equation}

where $\bar{x} = E[X] = \frac{\sum_{x \in X}x}{card[X]}$

given a partition $\{X_i\}_{i \in G}$ of $X$, i.e. $\bigcup_{i \in G}{X_i} = X$ with pairwise disjoint $X_i$, let $n = card[X]$ and $n_i = card[X_i]$

so $\bar{x_i} = E[X_i] = \frac{\sum_{x \in X_i}x}{n_i}$

breaking the sum over the subgroups, and adding and subtracting $\bar{x_i}$ inside the square, we have

\begin{equation}
var[X] = \frac{\sum_{i \in G}{\sum_{x \in X_i}(x - \bar{x} + \bar{x_i} - \bar{x_i})^2}}{n - 1}
\end{equation}

expanding the square, grouped as $(x - \bar{x_i}) + (\bar{x_i} - \bar{x})$

\begin{equation}
var[X] = \frac{\sum_{i \in G}{\sum_{x \in X_i}\left[(x - \bar{x_i})^2 + (\bar{x_i} - \bar{x})^2 + 2(x - \bar{x_i})(\bar{x_i} - \bar{x})\right]}}{n - 1}
\end{equation}

the factors that do not depend on $x$ are constant within each subgroup, so we can bring them outside of the inner sum

\begin{equation}
var[X] = \frac{\sum_{i \in G}{\left( n_i (\bar{x_i} - \bar{x})^2 + \sum_{x \in X_i}(x - \bar{x_i})^2 + 2(\bar{x_i} - \bar{x})\sum_{x \in X_i}{(x - \bar{x_i})}\right)}}{n - 1}
\end{equation}

by definition of $\bar{x_i}$ we have

\begin{equation}
\forall_{i \in G}\sum_{x \in X_i}{(x - \bar{x_i})} = 0
\end{equation}
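Spelled out as a one-line check, using only $\sum_{x \in X_i}x = n_i \bar{x_i}$ from the definition of the subgroup mean:

\begin{equation}
\sum_{x \in X_i}{(x - \bar{x_i})} = \sum_{x \in X_i}{x} - n_i \bar{x_i} = n_i \bar{x_i} - n_i \bar{x_i} = 0
\end{equation}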

so the last addend vanishes; let's split the fraction

\begin{equation}
var[X] = \frac{\sum_{i \in G}{ n_i (\bar{x_i} - \bar{x})^2 }}{n - 1} + \frac{\sum_{i \in G}{\sum_{x \in X_i}(x - \bar{x_i})^2}}{n - 1}
\end{equation}

now we multiply and divide each term of the second sum by $n_i - 1$ so that it can be read as a variance

\begin{equation}
var[X] = \frac{\sum_{i \in G}{ n_i (\bar{x_i} - \bar{x})^2 }}{n - 1} + \sum_{i \in G}{\frac{(n_i - 1)}{n - 1}\frac{\sum_{x \in X_i}(x - \bar{x_i})^2}{n_i - 1}}
\end{equation}

now the variance of each subgroup appears explicitly

\begin{equation}
var[X] = \frac{\sum_{i \in G}{ n_i (\bar{x_i} - \bar{x})^2 }}{n - 1} + \sum_{i \in G}{\frac{(n_i - 1)}{n - 1}var[X_i]}
\end{equation}

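both addends are weighted sums: the first measures the variance between the subgroup means, the second averages the variances within the subgroups. Up to the weighting factors, this has the same structure as the law of total variance

\begin{equation}
var[X] \approx var[E[X_i]] + E[var[X_i]]
\end{equation}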
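The weighted decomposition derived above (the form with the $n_i$ and $n_i - 1$ factors) can be checked numerically. The sketch below uses synthetic groups whose sizes and parameters are arbitrary assumptions. It also rebuilds the one-way ANOVA F statistic from the same two sums of squares and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Hypothetical subgroups X_i of unequal size (synthetic data).
groups = [
    rng.normal(5.0, 1.0, size=12),
    rng.normal(7.0, 2.0, size=20),
    rng.normal(6.0, 1.5, size=8),
]
x = np.concatenate(groups)
n, k = x.size, len(groups)
grand_mean = x.mean()

# Left-hand side: sample variance of the whole set (denominator n - 1).
total_var = x.var(ddof=1)

# First addend: spread of the subgroup means around the global mean.
between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups) / (n - 1)

# Second addend: subgroup variances weighted by (n_i - 1) / (n - 1).
within = sum((g.size - 1) / (n - 1) * g.var(ddof=1) for g in groups)

print(total_var, between + within)

# F statistic: between-group mean square over within-group mean square.
ss_between = between * (n - 1)
ss_within = within * (n - 1)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_manual, f_oneway(*groups).statistic)
```

A large F means the between-group addend dominates, i.e. the grouping explains a sizeable share of the global variance.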


https://github.com/KenDaupsey/One-Way-Repeated-measures-ANOVA/blob/main/One_Way_Repeated_measures_ANOVA.ipynb

In [9]:
unemployment = pd.read_csv("unemployment_it.csv")

In [10]:
import statsmodels.api as sm

In [11]:
from statsmodels.formula.api import ols

In [7]:
ols('frequency ~ ()', data=data).fit()

AttributeError: module 'pysdmx' has no attribute 'fmr'