# T-Tests and P-Values

## Create Some Fake Data

In [1]:
import numpy as np
from scipy import stats

A = np.random.normal(25.0, 5.0, 10000)
B = np.random.normal(26.0, 5.0, 10000)

# ttest_ind
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
# "Calculate the T-test for the means of two independent samples of scores."

stats.ttest_ind(A, B)

TtestResult(statistic=-13.463795990211027, pvalue=3.8600385734513946e-41, df=19998.0)

## About T-Test
The t-statistic is a measure of the difference between the two sets expressed in units of standard error.   
Put differently, it's the **size of the difference relative to the variance in the data**.   
A high t value means there's probably a real difference between the two sets; you have "significance".   
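
To make "units of standard error" concrete, here is a minimal sketch that recomputes the t-statistic by hand and compares it with `stats.ttest_ind`. It assumes the default equal-variance (pooled) form of the test; the seeded `rng` generator and the names `pooled_var`, `std_err`, and `t_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
A = rng.normal(25.0, 5.0, 10000)
B = rng.normal(26.0, 5.0, 10000)

n_a, n_b = len(A), len(B)

# Pooled sample variance; ddof=1 gives the unbiased variance of each sample.
pooled_var = ((n_a - 1) * A.var(ddof=1) + (n_b - 1) * B.var(ddof=1)) / (n_a + n_b - 2)

# Standard error of the difference between the two sample means.
std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))

# Difference in means, expressed in units of that standard error.
t_manual = (A.mean() - B.mean()) / std_err
t_scipy = stats.ttest_ind(A, B).statistic

print(t_manual, t_scipy)
```

The two printed values should agree to floating-point precision, which suggests the statistic is simply the difference in means divided by its standard error.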

## About The P-Value
The p-value is the **probability of seeing a t-statistic at least as extreme as the one observed, assuming there is no real difference between the two sets** (the null hypothesis); so a low p-value also implies "significance."   
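
As a rough check on how the p-value relates to the t-statistic, the sketch below recomputes a two-sided p-value from the tails of the t-distribution. It assumes the default two-sided, equal-variance test with `len(A) + len(B) - 2` degrees of freedom; the seeded `rng` and the name `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed (an assumption) so the run is repeatable
A = rng.normal(25.0, 5.0, 10000)
B = rng.normal(25.0, 5.0, 10000)

result = stats.ttest_ind(A, B)
df = len(A) + len(B) - 2  # degrees of freedom for the pooled two-sample test

# Two-sided p-value: probability in both tails of the t-distribution beyond |t|.
# stats.t.sf is the survival function, 1 - CDF.
p_manual = 2.0 * stats.t.sf(abs(result.statistic), df)

print(result.pvalue, p_manual)
```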

## Statistical Significance
If you're looking for a "statistically significant" result, you want to see a very low p-value and a high t-statistic (well, a high absolute value of the t-statistic more precisely). In the real world, statisticians seem to put more weight on the p-value result.

In [2]:
# Here, B will be very similar to the previous "A"
B = np.random.normal(25.0, 5.0, 10000)

stats.ttest_ind(A, B)

TtestResult(statistic=-0.15680210388030358, pvalue=0.8754023976039887, df=19998.0)

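Now, our t-statistic is much closer to zero and our p-value is really high. This is consistent with the null hypothesis - that there is no real difference in behavior between these two sets. Strictly speaking, we have failed to reject the null hypothesis rather than proven it.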

## Impact of Sample Size On T-Test
Does the sample size make a difference? Let's do the same thing - where the null hypothesis is accurate - but with 10X as many samples:

In [3]:
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)

stats.ttest_ind(A, B)

TtestResult(statistic=-0.4652174724048546, pvalue=0.6417762340822187, df=199998.0)

Our p-value actually got a little lower, and the t-statistic a little larger in magnitude, but still not enough to declare a real difference. So, you could have reached the right decision with just 10,000 samples instead of 100,000. Even a million samples doesn't help, so if you were to keep running this A/B test for years, you'd never achieve the result you're hoping for:

In [4]:
A = np.random.normal(25.0, 5.0, 1000000)
B = np.random.normal(25.0, 5.0, 1000000)

stats.ttest_ind(A, B)

Ttest_indResult(statistic=-0.9330159426646413, pvalue=0.350811849590674)
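
For contrast, the following sketch runs the same comparison at several sample sizes, once with identical means and once with the one-unit shift used in the first cell. The helper `t_stat` and the `results` dictionary are hypothetical names introduced here, and the seed is an assumption; exact values will vary from run to run without it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed (an assumption) so the run is repeatable

def t_stat(mean_b, n):
    """Absolute t-statistic for A ~ N(25, 5) against B ~ N(mean_b, 5), n samples each."""
    A = rng.normal(25.0, 5.0, n)
    B = rng.normal(mean_b, 5.0, n)
    return abs(stats.ttest_ind(A, B).statistic)

# For each sample size, store (|t| with identical means, |t| with shifted mean).
results = {}
for n in (1_000, 10_000, 100_000):
    results[n] = (t_stat(25.0, n), t_stat(26.0, n))
    print(n, "same mean:", results[n][0], "shifted mean:", results[n][1])
```

When the means are identical, the statistic should stay small no matter how many samples are collected; when there is a real difference, it tends to grow as the sample size increases.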

If we compare the same set to itself, by definition we get a t-statistic of 0 and p-value of 1:

In [5]:
stats.ttest_ind(A, A)

Ttest_indResult(statistic=0.0, pvalue=1.0)

The threshold of significance on p-value is really just a judgment call. As everything is a matter of probabilities, you can never definitively say that an experiment's results are "significant". But you can use the t-test and p-value as a measure of significance, and look at trends in these metrics as the experiment runs to see if there might be something real happening between the two.
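
One way to see what a given threshold implies is to repeat an experiment with no real difference many times and count how often the p-value falls below the threshold anyway. This is a sketch under stated assumptions: the threshold `alpha = 0.05`, the 2,000 trials, and the 200 samples per group are arbitrary choices for illustration, not values from the original notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed (an assumption) so the run is repeatable
alpha = 0.05   # significance threshold (a judgment call)
trials = 2000  # number of simulated experiments
n = 200        # samples per group in each experiment

# Each column is one experiment; both groups share the same true mean.
A = rng.normal(25.0, 5.0, (n, trials))
B = rng.normal(25.0, 5.0, (n, trials))

# axis=0 runs one t-test per column, giving one p-value per experiment.
pvalues = stats.ttest_ind(A, B, axis=0).pvalue

# Fraction of experiments that look "significant" despite no real difference.
false_positive_rate = np.mean(pvalues < alpha)
print(false_positive_rate)
```

The printed fraction should land near `alpha`, so a stricter threshold means fewer false alarms at the cost of needing stronger evidence.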

## Activity

Experiment with different distributions for A and B, and see the effect this has on the t-test.