# T Tests

In [38]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

### Dataset
#### Student Alcohol Consumption
Obtained from a survey of students in math and Portuguese language courses in secondary school. It contains social, gender and study information about the students.

- Dataset retrieved from: https://www.kaggle.com/uciml/student-alcohol-consumption

In [2]:
students_df = pd.read_csv('./students.csv')

In this notebook we will use `sex`, `age` and `Dalc` (Daily alcohol consumption).

In [78]:
students_df[['sex', 'age', 'Dalc']].sample(10)

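| | sex | age | Dalc |
|---|---|---|---|
| 198 | F | 16 | 1 |
| 438 | F | 17 | 1 |
| 347 | F | 17 | 1 |
| 439 | F | 15 | 1 |
| 208 | M | 16 | 1 |
| 588 | F | 17 | 1 |
| 511 | F | 17 | 1 |
| 166 | M | 19 | 1 |
| 522 | F | 16 | 1 |
| 5 | M | 16 | 1 |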


## T Tests

### One sample T test

Checks whether the mean of a sample differs from a known population mean.

#### Example

Let's draw a sample of 30 Daily Alcohol (`Dalc`) values from students whose `age` is equal to 18.

In [70]:
Dalc_18 = students_df.query('age == 18').sample(30)['Dalc'].to_list()

And compute the total population `Dalc` mean alongside the sample mean:

In [73]:
dalc_mean = students_df['Dalc'].mean()
dalc18_mean = np.mean(Dalc_18)
print(f'Total population Dalc mean: {dalc_mean}')
print(f'18 years old sample Dalc mean: {dalc18_mean}')

Total population Dalc mean: 1.50231124807396
18 years old sample Dalc mean: 1.4666666666666666


- **Null Hypothesis**: the mean `Dalc` of 18-year-old students is equal to the total population mean `Dalc`
- If **`p-value`** < `0.05`, reject the Null Hypothesis; otherwise we fail to reject it

Using the scipy stats function `ttest_1samp(a=SAMPLE, popmean=TOTAL_MEAN)`
- Yields the t value `statistic` and the `p-value`

In [82]:
stats.ttest_1samp(a=Dalc_18, popmean=dalc_mean) 

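Ttest_1sampResult(statistic=-0.26733436055470067, pvalue=0.7911038211772043)

The `p-value` (about 0.79) is well above `0.05`, so we fail to reject the Null Hypothesis: this sample gives no evidence that the mean `Dalc` of 18-year-old students differs from the total population mean.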
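As a rough sketch of what `ttest_1samp` computes, the statistic is `t = (sample mean - popmean) / (sample std / sqrt(n))`, and the two-sided `p-value` comes from a t distribution with `n - 1` degrees of freedom. The `sample` and `pop_mean` values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical values for illustration only (not from students.csv)
sample = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]
pop_mean = 1.5

n = len(sample)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - population mean) / standard error of the mean
t_manual = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(a=sample, popmean=pop_mean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both printed pairs should agree up to floating point error.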

### Two sample T test

Investigates whether the means of two independent data samples differ from one another. The null hypothesis is that the means of both groups are the same.

#### Example
Let's draw a sample of 30 Daily Alcohol consumption (`Dalc`) values for each of the female (`F`) and male (`M`) students.

In [85]:
female_Dalc = students_df[students_df.sex == 'F']['Dalc'].sample(30).to_list()
male_Dalc = students_df[students_df.sex == 'M']['Dalc'].sample(30).to_list()


- **Null Hypothesis**: the mean `Dalc` of female and male students is the same
- If **`p-value`** < `0.05`, reject the Null Hypothesis; otherwise we fail to reject it

Using the scipy stats function `ttest_ind(SAMPLE_A, SAMPLE_B)`
- Yields the t value `statistic` and the `p-value`

In [89]:
stats.ttest_ind(female_Dalc, male_Dalc)

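Ttest_indResult(statistic=-1.8990043750819299, pvalue=0.0625415543098149)

The `p-value` (about 0.063) is slightly above `0.05`, so we fail to reject the Null Hypothesis at the 5% level.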
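By default `ttest_ind` assumes both groups have equal variance and pools them. Below is a minimal sketch, using made-up samples rather than the dataset, that reproduces the pooled statistic by hand and also runs the `equal_var=False` (Welch) variant, which may be preferable when the equal-variance assumption is doubtful. The names `group_a` and `group_b` are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for illustration only (not from students.csv)
group_a = np.array([1, 1, 2, 1, 1, 3, 1, 2, 1, 1])
group_b = np.array([2, 1, 3, 1, 2, 4, 1, 2, 3, 1])

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: weighted average of the two sample variances
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# t = difference of means / standard error of that difference
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

pooled = stats.ttest_ind(group_a, group_b)                  # equal variances assumed (default)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(t_manual, pooled.statistic)
print(welch.statistic, welch.pvalue)
```

With equal group sizes, as here, the pooled and Welch statistics coincide; only the degrees of freedom, and therefore the `p-value`, can differ.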

Using the statsmodels function `sm.stats.ttest_ind(SAMPLE_A, SAMPLE_B)`
- Yields a 3-value tuple (`t`, `p-value`, `degrees of freedom`)

In [90]:
sm.stats.ttest_ind(female_Dalc, male_Dalc)

(-1.8990043750819294, 0.06254155430981496, 58.0)
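The degrees of freedom reported by statsmodels are `n_a + n_b - 2`, here `30 + 30 - 2 = 58`. As a sketch of how the `p-value` follows from the statistic and the degrees of freedom, the two-sided `p-value` is twice the upper-tail area of the t distribution beyond `|t|`. The variable names are illustrative; the numbers are the ones reported above.

```python
from scipy import stats

t_value = -1.8990043750819299  # statistic reported by ttest_ind above
n_a, n_b = 30, 30              # sample sizes used for each group
df = n_a + n_b - 2             # degrees of freedom for the pooled-variance test

# Two-sided p-value: probability of a |t| at least this large under the null hypothesis
p_value = 2 * stats.t.sf(abs(t_value), df)
print(df, p_value)
```

The printed `p-value` should match the one returned by both libraries, up to floating point error.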