## T-Tests
- Used on two sample data sets that are numeric
    - if you have more than two data sets, you must either a) run pairwise T-tests, or b) use a different test (e.g. ANOVA)
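
The two options for more than two data sets can be sketched as follows. This is a minimal illustration, not a recommended workflow: the three samples, the seed, and the Bonferroni-adjusted threshold (0.05 divided by the number of pairs) are assumptions made for the example. `stats.f_oneway` runs a one-way ANOVA, and `stats.ttest_ind` is applied to every pair.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Three hypothetical samples; the third has a shifted mean
samples = {
    "a": rng.normal(0.0, 1.0, size=50),
    "b": rng.normal(0.0, 1.0, size=50),
    "c": rng.normal(2.0, 1.0, size=50),
}

# Option b) one-way ANOVA: a single test of "all group means are equal"
f_stat, anova_p = stats.f_oneway(*samples.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {anova_p:.3g}")

# Option a) pairwise Welch T-tests, compared against a stricter threshold
# (Bonferroni adjustment, an assumption of this sketch) because several
# tests are run on the same data
pairs = list(combinations(samples, 2))
alpha = 0.05 / len(pairs)
pairwise_p = {}
for name1, name2 in pairs:
    _, p = stats.ttest_ind(samples[name1], samples[name2], equal_var=False)
    pairwise_p[(name1, name2)] = p
    print(f"{name1} vs {name2}: p = {p:.3g}, significant = {p < alpha}")
```

ANOVA only says whether at least one mean differs; the pairwise tests indicate which pairs differ, at the cost of running several tests.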

A T-test reports two things:
1. How **different** the two mean values (obtained from two sets of distributed data) are, relative to the spread of the data - T value
    - Values closer to zero indicate smaller differences

2. How **significant** this difference is (i.e. could it have occurred randomly/by chance) - P value
    - Smaller values indicate that the difference is less likely to be due to chance (i.e. P=0.05 means that, if the two true means were equal, a difference at least this large would be seen in 5% of such samples).


Random, Gaussian (i.e. normal) distribution of data points
- random.normal: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
    - creates a Gaussian distribution with a curve width based on a standard deviation value

In [None]:
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

In [None]:
## Three data sets - with means at 0.0 and 2.0, and standard deviations of 1.0 and 0.5
data1 = np.random.normal(0.0, 1.0, size=50)
data2 = np.random.normal(0.0, 0.5, size=50)
data3 = np.random.normal(2.0, 1.0, size=50)

Note: with std = 0, every data point would equal the mean, so the sample means would be exactly 0.0 and 2.0.

So, let's check what the means actually are (i.e. the effect of the std. dev.)

In [None]:
np.mean(data1)

In [None]:
np.mean(data2)

In [None]:
np.mean(data3)
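
How far a sample mean typically lands from the true mean can be estimated with the standard error of the mean, std / sqrt(n). The sketch below checks this by simulation; the seed and the number of repetitions (10,000) are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n, std = 50, 1.0

# Draw 10,000 independent samples of size n and record each sample mean
sample_means = rng.normal(0.0, std, size=(10_000, n)).mean(axis=1)

empirical = sample_means.std(ddof=1)   # observed spread of the sample means
theoretical = std / np.sqrt(n)         # standard error of the mean
print(f"empirical spread of the means: {empirical:.4f}")
print(f"std / sqrt(n):                 {theoretical:.4f}")
```

With n = 50 and std = 1.0, the means above should therefore deviate from 0.0 or 2.0 by roughly 0.14 on average, and by about half of that for the data set with std = 0.5.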

In [None]:
plt.figure()
plt.plot(data1, '-o')
plt.plot(data2, '-r')
plt.plot(data3, '-g')

plt.hlines(np.mean(data1), 0, 50, colors='blue')
plt.hlines(np.mean(data2), 0, 50, colors='red')
plt.hlines(np.mean(data3), 0, 50, colors='green')

plt.show()

Use seaborn to easily plot the smoothed distribution (kernel density estimate) of the data

In [None]:
import seaborn as sns

plt.figure()
# kdeplot draws a kernel density estimate; `fill` replaces the deprecated `shade`
sns.kdeplot(data1, color='blue', fill=True)
sns.kdeplot(data2, color='red', fill=True)
sns.kdeplot(data3, color='green', fill=True)
plt.title("Kernel Density Estimate of Data")
plt.show()

#### t-test
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

- "The test measures **whether the mean (expected) value differs significantly across samples**."

- If the p-value is **large** (e.g. larger than **0.05 or 0.1**), we cannot reject the null hypothesis that the averages are equal.

- "If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."

In [None]:
# equal_var=False runs Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(data1, data3, equal_var=False)

In [None]:
t_stat

In [None]:
p_value
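
To make the two returned numbers less of a black box, the sketch below recomputes Welch's T value and P value by hand and compares them with `stats.ttest_ind`. The two samples (means 0.0 and 0.5) and the seed are assumptions made for this illustration; the formulas are the standard Welch statistic and the Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(0.5, 1.0, size=50)

# Welch's t statistic: difference of the means divided by its standard error
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / a.size + var_b / b.size)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom
dof = se**4 / (
    (var_a / a.size) ** 2 / (a.size - 1) + (var_b / b.size) ** 2 / (b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"t: manual = {t_manual:.6f}, scipy = {t_scipy:.6f}")
print(f"p: manual = {p_manual:.6f}, scipy = {p_scipy:.6f}")
```

The T value is the difference of the means measured in units of its standard error, which is why noisier data (larger variances) or smaller samples pull it toward zero.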

---
#### Real-world example

Research task: Investigate the height, weight, gender and age of a population of people (e.g. Germans)
- We can't investigate the full population, so we must sample a random subset of people

In [None]:
len(np.random.random_sample((50,)))

In [None]:
import pandas as pd

np.random.seed(10)

height = np.random.uniform(0.5, 2.0, size=50)
weight = np.random.uniform(15.0, 90.0, size=50)
gender = np.random.choice(('Male', 'Female'), size=50, p=[0.4, 0.6]) ## With a 40:60 ratio of male:female
age = np.random.uniform(3.0, 80.0, size=50)

In [None]:
height

In [None]:
# Note: mixing numbers and strings in one array coerces every element to a string
np.array([height, weight, gender, age])

Create a dataframe that contains the information of our variables.
- 1 categorical variable: gender
- 3 numeric variables: height, weight and age

In [None]:
df = pd.DataFrame(list(zip(height, weight, gender, age)), columns=['height (m)', 'weight (kg)', 'gender', 'age'])

In [None]:
df

In [None]:
df[df["gender"] == "Male"].count()

In [None]:
df[df["gender"] == "Female"].count()

Now let's look at the distributions of our random data and think about correlations between the variables.

In [None]:
# histplot (with kde=True) replaces the deprecated distplot
hist = sns.FacetGrid(df, col="gender")
hist.map(sns.histplot, "height (m)", bins=20, kde=True)
hist.add_legend()

In [None]:
hist = sns.FacetGrid(df, col="gender")
hist.map(sns.histplot, "weight (kg)", bins=20, kde=True)
hist.add_legend()

In [None]:
hist = sns.FacetGrid(df, col="gender")
hist.map(sns.histplot, "age", bins=20, kde=True)
hist.add_legend()

Compute the mean and median as a function of the gender.
- use pandas groupby function

In [None]:
df.groupby(['gender']).mean()

In [None]:
df.groupby(['gender']).median()
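
Whether the difference between the two group means is significant can be checked with the same T-test used earlier. The sketch below rebuilds the synthetic sample (same seed and call order as above, so that it runs on its own) and compares the heights of the two groups. Since the heights were drawn independently of gender, a significant difference is not expected, but any single random sample can still produce one.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the synthetic sample with the same seed and call order as above
np.random.seed(10)
height = np.random.uniform(0.5, 2.0, size=50)
weight = np.random.uniform(15.0, 90.0, size=50)
gender = np.random.choice(('Male', 'Female'), size=50, p=[0.4, 0.6])
age = np.random.uniform(3.0, 80.0, size=50)
df = pd.DataFrame(
    {'height (m)': height, 'weight (kg)': weight, 'gender': gender, 'age': age}
)

# Split one numeric column by the categorical column
male_height = df.loc[df['gender'] == 'Male', 'height (m)']
female_height = df.loc[df['gender'] == 'Female', 'height (m)']

# Welch's t-test on the two groups
t_stat, p_value = stats.ttest_ind(male_height, female_height, equal_var=False)
print(f"n(male) = {len(male_height)}, n(female) = {len(female_height)}")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

The same pattern applies to `'weight (kg)'` and `'age'` by changing the column name.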

In [None]:
sns.lmplot(x='weight (kg)', y='age', data=df, hue='gender', fit_reg=True, legend=True)
plt.show()

In [None]:
sns.lmplot(x='height (m)', y='age', data=df, hue='gender', fit_reg=True, legend=True)
plt.show()

In [None]:
sns.lmplot(x='height (m)', y='weight (kg)', data=df, hue='gender', fit_reg=True, legend=True)
plt.show()

In [None]:
sns.jointplot(x='height (m)', y='age', data=df, kind='reg', color='b')
plt.show()

#### Research question
Is there a difference between the number of men and women in the population?
- Hypothesis Zero (aka Null Hypothesis): there is no difference
- Hypothesis One: there is a difference

In other words - is the difference seen in our sample an artifact of random sampling, or is it likely to be real?

In [None]:
sns.countplot(x="gender", data=df)

We must first choose a threshold (the significance level) for how small the P-value must be before we are confident enough to reject the Null Hypothesis. The P-value is the probability of seeing a result at least as extreme as ours if the Null Hypothesis were true.

Common convention is 5%
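
What a 5% threshold implies can be illustrated by simulation: when the Null Hypothesis is true, about 5% of tests still fall below the threshold by chance. The sketch below assumes 2,000 repeated experiments with two groups of 50 points drawn from the same normal distribution; these numbers and the seed are arbitrary choices for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
alpha = 0.05
n_experiments = 2000

# Both groups come from the same distribution, so the Null Hypothesis is true
group_a = rng.normal(0.0, 1.0, size=(n_experiments, 50))
group_b = rng.normal(0.0, 1.0, size=(n_experiments, 50))

# One Welch t-test per row (axis=1 compares row i of group_a with row i of group_b)
_, p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Fraction of experiments that would wrongly reject the Null Hypothesis
false_positive_rate = np.mean(p_values < alpha)
print(f"fraction with p < {alpha}: {false_positive_rate:.3f}")
```

Choosing a smaller threshold lowers this false-positive rate, but also makes real differences harder to detect.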

One-Sample Proportion Test

In [None]:
from scipy import stats

# zscore standardizes the values (subtract the mean, divide by the std. dev.)
test = np.array([60,40])
stats.zscore(test)
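
A one-sample proportion test can be sketched by hand with the normal approximation. The counts below (30 women in a sample of 50) are hypothetical and not taken from the dataframe; the Null Hypothesis is a true proportion of 0.5.

```python
import numpy as np
from scipy import stats

n_female, n_total = 30, 50   # hypothetical counts, for illustration only
p0 = 0.5                     # proportion under the Null Hypothesis

p_hat = n_female / n_total                # observed proportion
se = np.sqrt(p0 * (1 - p0) / n_total)     # standard error under the Null Hypothesis
z = (p_hat - p0) / se                     # z statistic

# Two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

For these counts the P-value is well above 5%, so a 60:40 split in a sample of only 50 people is not enough to reject the Null Hypothesis of an even split.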


In [None]:
df.groupby('gender')['age'].transform(stats.zscore)

In [None]:
x = df.groupby('gender')['age']

In [None]:
for i in x:
    print(i)
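
Iterating over a grouped column yields one `(group name, values)` pair per group, and `transform` applies a function to each group separately while keeping the original row order. The sketch below uses a small hypothetical dataframe (`df_demo`) so the result is easy to check by hand, and writes the z-score out explicitly instead of calling `stats.zscore`.

```python
import pandas as pd

# Small hypothetical dataframe, for illustration only
df_demo = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [30.0, 25.0, 41.0, 52.0],
})

grouped = df_demo.groupby('gender')['age']

# Each iteration yields the group key and the Series of values in that group
group_names = []
for name, ages in grouped:
    group_names.append(name)
    print(name, ages.tolist())

# z-score within each group: subtract the group mean, divide by the group
# standard deviation (ddof=0 matches the default of scipy.stats.zscore)
z = grouped.transform(lambda s: (s - s.mean()) / s.std(ddof=0))
print(z.tolist())
```

Unpacking the pair as `name, ages` is usually easier to read than printing the raw tuple.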