# Statistical Tests

## Two-sample Student's t-test
Given two sample datasets, we may want to check whether they plausibly come from the same distribution, or whether they differ enough to suggest two different distributions.

The hypotheses of this test are the following:
- H0 (null hypothesis): both datasets were sampled from the same distribution; for the t-test this means, specifically, that the two underlying means are equal
- H1 (alternative hypothesis): the datasets were sampled from distributions with different means
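To make the hypotheses more concrete, here is a minimal sketch of how the test statistic can be computed by hand, assuming equal variances in the two groups (the default behaviour of `scipy.stats.ttest_ind`). The arrays `a` and `b` are hypothetical toy values, not taken from the Rome dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy samples, for illustration only
a = np.array([20.1, 21.4, 19.8, 22.0, 20.7])
b = np.array([22.3, 23.1, 21.9, 24.0, 22.8])

n_a, n_b = len(a), len(b)

# Pooled variance: weighted average of the two unbiased sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic: difference of the means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# The same statistic as computed by scipy (equal_var=True is the default)
t_scipy = ttest_ind(a, b).statistic
print(t_manual, t_scipy)
```

The further the statistic is from zero, the less compatible the two samples are with H0.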

In [1]:
import pandas as pd

Read the daily temperatures measured in Rome from 1951 to 2009.

This is the content of the file:

| Column | Meaning |
|--------|---------|
| SOUID | categorical: measurement source id |
| DATE | calendar day in YYYYMMDD format |
| TG | average temperature |
| Q_TG | categorical: quality tag (9 = invalid) |

In [2]:
roma = pd.read_csv("TG_SOUID100860.txt",skiprows=20)

The column names in this dataset include leading spaces, so we need to strip them.

In [3]:
roma.columns = roma.columns.str.strip()

In [4]:
roma.columns

Index(['SOUID', 'DATE', 'TG', 'Q_TG'], dtype='object')
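The effect of stripping the column names can be checked on a small in-memory example. This is only a sketch: the `raw` text below is a hypothetical two-row excerpt that mimics a padded header, not the actual content of the file.

```python
import io
import pandas as pd

# Hypothetical excerpt with a space-padded header (values are made up)
raw = (
    "SOUID,    DATE,   TG, Q_TG\n"
    "100860,19510601,  215,    0\n"
    "100860,19510602,  220,    0\n"
)
demo = pd.read_csv(io.StringIO(raw))

# Before: the names keep the padding, so demo["TG"] would raise a KeyError
print(list(demo.columns))

# After: strip the whitespace from every column name
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

Without this step, attribute-style access such as `roma.DATE` would not find the column.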

In [5]:
roma.DATE = pd.to_datetime(roma.DATE,format="%Y%m%d")

In [6]:
roma["MONTH"] = roma.DATE.dt.month

In [7]:
roma["YEAR"] = roma.DATE.dt.year
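As a quick check of the date handling above, the following sketch parses two hypothetical YYYYMMDD values and extracts year and month through the `.dt` accessor. The values are converted to strings first, which is an assumption made here to keep the example independent of how the column was read.

```python
import pandas as pd

# Hypothetical dates in YYYYMMDD form, for illustration only
dates = pd.Series([19510601, 20090630])

# Parse with an explicit format so that no guessing is involved
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")

print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```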

In [8]:
roma_cleaned = roma.loc[roma.Q_TG != 9,:]

In [9]:
roma_giugno_1951 = roma_cleaned.loc[
    (roma_cleaned.YEAR == 1951) & (roma_cleaned.MONTH == 6),
    "TG"
]

In [10]:
roma_giugno_2009 = roma_cleaned.loc[
    (roma_cleaned.YEAR == 2009) & (roma_cleaned.MONTH == 6),
    "TG"
]
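The two selections above differ only in the year, so they could be wrapped in a small helper. The function `month_sample` and the miniature frame `demo` below are hypothetical names and data introduced for this sketch; they are not part of the original notebook.

```python
import pandas as pd

def month_sample(df, year, month, column="TG"):
    """Return the values of `column` for one calendar month of one year."""
    mask = (df["YEAR"] == year) & (df["MONTH"] == month)
    return df.loc[mask, column]

# Hypothetical miniature frame with the same column names as roma_cleaned
demo = pd.DataFrame({
    "YEAR":  [1951, 1951, 1951, 2009, 2009],
    "MONTH": [6, 6, 7, 6, 6],
    "TG":    [215, 220, 250, 230, 236],
})

june_1951 = month_sample(demo, 1951, 6)
june_2009 = month_sample(demo, 2009, 6)

# Sample size and mean are worth a look before running any test
print(len(june_1951), june_1951.mean())
print(len(june_2009), june_2009.mean())
```

On the real data, the equivalent call would be `month_sample(roma_cleaned, 1951, 6)`.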

In [11]:
import scipy.stats

In [12]:
from scipy.stats import ttest_ind

In [13]:
ttest_ind(roma_giugno_1951,roma_giugno_2009)

TtestResult(statistic=np.float64(-2.167425930725216), pvalue=np.float64(0.03432071944797424), df=np.float64(58.0))
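In the result above, the p-value (about 0.034) is below the commonly used 5% significance level, so H0 would be rejected at that level, although not at the stricter 1% level. The negative statistic indicates that the first sample (June 1951) has the lower mean, and `df=58` corresponds to 30 + 30 - 2 observations. The sketch below shows one way to turn such a result into a decision; the samples are synthetic, and the threshold `alpha` is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples (not the Rome data): two groups with different means
rng = np.random.default_rng(0)
a = rng.normal(loc=20.0, scale=2.0, size=30)
b = rng.normal(loc=25.0, scale=2.0, size=30)

alpha = 0.05  # assumed significance level

# Classic Student's t-test: assumes equal variances in the two groups
result = ttest_ind(a, b)
if result.pvalue < alpha:
    print("Reject H0: the means differ at the chosen level")
else:
    print("Cannot reject H0 at the chosen level")

# Welch's variant drops the equal-variance assumption
welch = ttest_ind(a, b, equal_var=False)
print(result.pvalue, welch.pvalue)
```

When the two groups may have different variances, Welch's variant (`equal_var=False`) is often the safer choice.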