In [1]:
import scipy.stats
import random
import math
import numpy as np

# T-tests

This notebook explores some simple cases of hypothesis testing, in particular t-tests, and how to perform them in Python.

## Z-tests

Before moving on to t-tests, first consider the simpler, and probably more familiar, case of Z-tests.

### One-sample Z-test

This is perhaps the most familiar entry-level hypothesis test. It asks whether or not a sample is consistent with a known Normal distribution. Suppose we have a candidate distribution

$ X \sim N(\mu, \sigma^2) $

For the sake of being explicit, say $\mu = 10, \sigma=3$. If the sample we had was $x = 7.1$, then the "z-score" can be calculated as $(x - \mu ) / \sigma $

In [2]:
z_score = (7.1 - 10)/3
print(f"z-score = {z_score}")

z-score = -0.9666666666666668


One might perform a one- or two-tailed test based on the null and alternative hypotheses, e.g. if the test statistic comes from a distribution $X' \sim N(\mu', \sigma^2)$, then

$H_0: \mu' = \mu, H_1: \mu' \neq \mu$

would be a two-tailed test, and 

$H_0: \mu' = \mu, H_1: \mu' < \mu$

would be a one-tailed test. 

Since the candidate distribution is $N(\mu, \sigma^2)$, the null hypothesis implicitly assumes the z-score is distributed $N(0,1)$.

From the z-score, one can calculate the p-value, which is the probability, under the null hypothesis, of getting a value at least as extreme as the test value

In [3]:
p_value_2tail = scipy.stats.norm.sf(abs(z_score)) * 2
print(f"two-tailed p-value = {p_value_2tail}")

two-tailed p-value = 0.33371069574356604


So there is a 0.33 probability of a more extreme sample occurring. At a 10% significance level, the null hypothesis would be rejected if this probability were less than 0.1, since the test sample would then be too extreme to be plausible under the null hypothesis. In this case, the null hypothesis is not rejected.

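For the one-tailed test with $H_1: \mu' < \mu$, the p-value is the lower-tail probability of the z-score. Since the z-score here is negative, this equals the upper-tail probability of its absolute value, and is calculated as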

In [4]:
p_value_1tail = scipy.stats.norm.sf(abs(z_score))
print(f"one-tailed p-value = {p_value_1tail}")

one-tailed p-value = 0.16685534787178302
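The `abs(z_score)` shortcut in the cells above works because the observed z-score lies on the side named by $H_1$. A direction-aware version might look like the sketch below. The helper name `z_test_p_value` and its `alternative` argument are hypothetical, and it uses `statistics.NormalDist` from the standard library, which for this purpose should be interchangeable with `scipy.stats.norm`.

```python
from statistics import NormalDist


def z_test_p_value(z, alternative="two-sided"):
    """p-value of a z-score under N(0,1) for the given alternative hypothesis."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":       # H1: mu' < mu, lower tail
        return std_normal.cdf(z)
    if alternative == "greater":    # H1: mu' > mu, upper tail
        return std_normal.cdf(-z)
    if alternative == "two-sided":  # H1: mu' != mu, both tails
        return 2 * std_normal.cdf(-abs(z))
    raise ValueError(f"unknown alternative: {alternative}")


z_score = (7.1 - 10) / 3
for alt in ("less", "greater", "two-sided"):
    print(f"{alt}: p-value = {z_test_p_value(z_score, alt)}")
```

For this z-score, `"less"` should agree with the one-tailed p-value and `"two-sided"` with the two-tailed p-value computed above, while `"greater"` gives the complementary tail.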


### Multi-sample Z-test

Suppose that instead of one sample there are multiple samples, and one is concerned with whether or not the mean of those samples is consistent with the candidate distribution. This is essentially a one-sample test, but makes use of the fact that if

$ X, Y \sim N(\mu, \sigma^2) $

and are independent then

$ X + Y \sim N(2\mu, 2\sigma^2) $

and for a constant, $k$,

$ kX \sim N(k\mu, k^2\sigma^2) $

In the case of having multiple test samples, and assuming they are independent (and from the same candidate distribution), one can calculate the mean of the test samples, $\bar{x}$, which under the null hypothesis will have the distribution

$ \frac{1}{n} \sum_{i=1}^n X_i \sim N(\mu, \sigma^2 / n)$

In this case the z-score can be calculated as

$ \textrm{z score} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

and the same principles for one- and two-tailed tests can be applied. This test assumes $\sigma$ and $\mu$ are known. $\mu$ is often a constant related to the test hypothesis/business problem.
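As a minimal sketch of the multi-sample z-score, the cell below reuses $\mu = 10, \sigma = 3$ from the one-sample example. The five observations are hypothetical values chosen only for illustration, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist, mean

mu, sigma = 10, 3                      # known parameters of the candidate distribution
samples = [7.1, 9.4, 8.2, 11.0, 6.3]   # hypothetical observations, for illustration only

n = len(samples)
xbar = mean(samples)
z_score = (xbar - mu) / (sigma / math.sqrt(n))

# two-tailed p-value: under H0 the z-score is distributed N(0,1)
p_value_2tail = 2 * NormalDist().cdf(-abs(z_score))
print(f"sample mean = {xbar}")
print(f"z-score = {z_score}")
print(f"two-tailed p-value = {p_value_2tail}")
```

Note that the denominator is $\sigma / \sqrt{n}$ rather than $\sigma$, so the same deviation of the mean from $\mu$ becomes more significant as $n$ grows.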

### Difference between two sample sets Z-test
This is similar to the multi-sample Z-test, except now there are two sets of multi-sample data to be compared (e.g. test stats collected from A/B testing).

As before, one can calculate the sample mean of the two sample sets. This test is usually concerned with the difference between the two sample sets, e.g. a null hypothesis of the candidate distributions for each sample-set having the same mean.

For a pair of sample sets $\mathcal{X}_1, \mathcal{X}_2$, which contain i.i.d. samples coming from distributions

$ X_i \sim N(\mu_i, \sigma_i^2) $,

one can compute a z-score as

$ \textrm{z score} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

Due to the identity (for independent $X_1$ and $X_2$)

$ aX_1 + bX_2 \sim N(a\mu_1 + b\mu_2, a^2\sigma_1^2 + b^2\sigma_2^2) $

under the null hypothesis, the z-score will be distributed $N(0,1)$. One can then go ahead and perform hypothesis tests; the same principles for one- and two-tailed tests apply. This test assumes $\sigma_i$ and $\mu_i$ are known. The $\mu_i$ are constants often related to the test hypothesis/business problem. In the case of A/B testing, the null hypothesis is often $\mu_1 - \mu_2 = 0$.
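The two-sample-set z-score can be sketched as follows. The helper name `two_sample_z_score`, the A/B measurements, and the "known" standard deviations of 1.0 are all hypothetical, chosen only to illustrate the formula with the null hypothesis $\mu_1 - \mu_2 = 0$.

```python
import math
from statistics import NormalDist, mean


def two_sample_z_score(x1, x2, sigma1, sigma2, mu_diff=0.0):
    """z-score for the difference of two sample means with known standard deviations."""
    standard_error = math.sqrt(sigma1**2 / len(x1) + sigma2**2 / len(x2))
    return ((mean(x1) - mean(x2)) - mu_diff) / standard_error


# hypothetical A/B measurements; both sigmas are assumed known in advance
group_a = [10.2, 9.8, 11.1, 10.5]
group_b = [9.1, 9.9, 8.7, 9.4, 9.6]

z_score = two_sample_z_score(group_a, group_b, sigma1=1.0, sigma2=1.0)
p_value_2tail = 2 * NormalDist().cdf(-abs(z_score))
print(f"z-score = {z_score}")
print(f"two-tailed p-value = {p_value_2tail}")
```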

## T-tests

### Central limit theorems

Z-tests assume that the candidate distribution is Normal. When is this justified? In multi-sample scenarios, one often calls on a "Central Limit Theorem" to say that the mean of $n$ samples (whatever the original distribution, subject to conditions such as finite variance) asymptotically approaches a Normal distribution as $n \to \infty$. There are subtleties in the precise conditions, and in how quickly the sample mean distribution approaches a Normal distribution as a function of $n$. The rate of convergence depends on the underlying distribution, some examples of which are [here](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html). In practice, the original underlying distributions are unknown, so statisticians usually go along with the rule of thumb "The Central Limit Theorem holds when $n \geq 30$".
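A minimal simulation sketch of this behaviour, assuming $U[0,1)$ as the underlying distribution and $n = 30$. Both choices are illustrative, as are the seed and the number of trials. If the sample mean is approximately $N(\mu, \sigma^2/n)$, roughly 95% of simulated means should fall within 1.96 standard errors of $\mu$.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(0)  # local generator, so the notebook's global seed is untouched
n, trials = 30, 5000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of U[0,1)

# draw `trials` independent sample means, each over n uniform samples
sample_means = [mean([rng.random() for _ in range(n)]) for _ in range(trials)]
standard_error = sigma / math.sqrt(n)

# fraction of sample means within 1.96 standard errors of the true mean
inside = sum(1 for m in sample_means if abs(m - mu) <= 1.96 * standard_error)
coverage = inside / trials

print(f"std of sample means = {pstdev(sample_means)}, theory = {standard_error}")
print(f"fraction within 1.96 standard errors = {coverage}")
```

A uniform distribution is a friendly case; heavily skewed or heavy-tailed distributions may need a much larger $n$ before the approximation is this good.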

The assumptions of the above Z-tests can fail if
- a sample mean distribution is ill-defined
- the sample points are not independent
- the number of samples is not sufficiently large
- the variances of the underlying distributions are unknown 

[This article](https://bytepawn.com/beyond-the-central-limit-theorem.html) covers the failure cases in more detail. One can address the third and fourth cases by performing a t-test instead of a Z-test.

[This article](https://www.analyticsvidhya.com/blog/2020/06/statistics-analytics-hypothesis-testing-z-test-t-test/) covers the choice of test from a practical perspective.

### Multi-sample T-tests

In this case, instead of a z-score, one calculates a t-score. The null hypothesis assumes that the t-score is sampled from a [t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution).
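As a short sketch of how the t-distribution differs from the standard Normal, the cell below compares upper-tail probabilities. The tails of the t-distribution are heavier for small degrees of freedom and approach the Normal tails as the degrees of freedom grow. The score of 2.0 and the list of degrees of freedom are arbitrary choices for illustration.

```python
import scipy.stats

score = 2.0  # an arbitrary test-statistic value, for illustration only
p_normal = scipy.stats.norm.sf(score)

# upper-tail probability of the t-distribution for increasing degrees of freedom
t_tails = {df: scipy.stats.t.sf(score, df=df) for df in (2, 5, 14, 30, 1000)}
for df, p_t in t_tails.items():
    print(f"df = {df:>4}: t tail = {p_t:.4f}, normal tail = {p_normal:.4f}")
```

This is why using a Normal distribution with a small sample and an estimated standard deviation tends to understate the p-value.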

A t-score is calculated similarly to a z-score, but using the sample standard deviation

$ \textrm{t score} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

where

$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $

In the example below,

$H_0: \mu = 0.1$

$H_1: \mu > 0.1$

In [5]:
random.seed(42)
n = 15

samples = [random.random() for _ in range(n)]  # samples from U[0,1)
xbar = sum(samples) / n
s = math.sqrt(sum([(x-xbar)**2 for x in samples])/(n-1))
t_score = (xbar - 0.1) / (s / math.sqrt(n))
print(f"t-score = {t_score}")

t-score = 3.6387384717963607


This can be used to calculate a p-value for a one-tailed test. Since the t-distribution is symmetric, a two-tailed test would double the p-value. With $n$ independent input samples, the degrees of freedom is $n-1$

In [6]:
# H1 is mu > 0.1, so the p-value is the upper-tail probability of the t-score
p_value_1tail = scipy.stats.t.sf(t_score, df=n-1)
print(f"one-tailed p-value = {p_value_1tail}")

one-tailed p-value = 0.0013420587507718997


The probability of a more extreme sample is 0.0013; this is below the 10% significance level of 0.1, and so the null hypothesis is rejected.

scipy.stats can compute the statistic and p-value directly

In [7]:
scipy.stats.ttest_1samp(samples, 0.1, alternative="greater")

Ttest_1sampResult(statistic=3.6387384717963607, pvalue=0.0013420587507718997)

### Difference between two sample sets T-test - unequal variances

This follows the same principle as the Z-test example, except now the test statistic uses the sample standard deviation, as opposed to a pre-defined standard deviation

$ \textrm{t score} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

The nomenclature here is inconsistent: many descriptions of the two-sample-set T-test assume that the variances are the same (which needn't be the case). The version here may be called _Welch's T-test_ or the _unequal variances T-test_ in the literature. The variations one might encounter are explained [here](https://en.wikipedia.org/wiki/Student%27s_t-test).

In this case, there will be a pair of sample sets to compare/test against hypotheses, e.g. for A/B testing.

In [8]:
random.seed(43)
n_1 = 11
n_2 = 16
samples_1 = [random.random() for _ in range(n_1)]  # samples from U[0,1)
# samples from U[-0.5,1.5), so that the var is different from samples_1
samples_2 = [random.random()*2 - 0.5 for _ in range(n_2)]

The null and alternative hypotheses will typically be represented by assumptions on $\mu_i$, e.g.

$H_0: \mu_1 = \mu_2$

$H_1: \mu_1 \neq \mu_2$

As before, one can go ahead and compute the test statistic based on the null hypothesis

In [9]:
xbar_1 = sum(samples_1)/n_1
xbar_2 = sum(samples_2)/n_2

s_1_sqr = sum([(x-xbar_1)**2 for x in samples_1])/(n_1-1)
s_2_sqr = sum([(x-xbar_2)**2 for x in samples_2])/(n_2-1)

t_score = (xbar_1 - xbar_2) / math.sqrt( s_1_sqr/n_1 + s_2_sqr/n_2 )
print(f"t-score = {t_score}")

t-score = -1.2797528973559824


In the case of Welch's t-test, there is a subtlety: the test statistic is not quite t-distributed. In simpler forms of the two-sample-set T-test, the test statistic (under the null hypothesis) is t-distributed, and the degrees of freedom is $(n_1 - 1) + (n_2 - 1)$. In Welch's t-test, the distribution of the test statistic is approximated by a t-distribution with degrees of freedom

$ \textrm{d.o.f} =  \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1}+\frac{(s_2^2/n_2)^2}{n_2-1}}$

In [10]:
def get_welchs_ttest_dof(s1sqr, s2sqr, n1, n2):
    numerator = (s1sqr/n1 + s2sqr/n2) ** 2
    denominator = (s1sqr/n1) ** 2 / (n1 - 1) + (s2sqr/n2) ** 2 / (n2 - 1)
    return numerator / denominator

In [11]:
dof = get_welchs_ttest_dof(s_1_sqr, s_2_sqr, n_1, n_2)
print(f"degrees of freedom = {dof}")

degrees of freedom = 24.369240640703026


Note that this value is a little different to what $n_1 + n_2 -2$ would have given. 
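The Welch degrees of freedom always lie between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$. The sketch below checks this numerically, repeating the helper from above so that the cell is self-contained. The variance pairs are hypothetical, ranging from equal to extremely unbalanced.

```python
def get_welchs_ttest_dof(s1sqr, s2sqr, n1, n2):
    numerator = (s1sqr/n1 + s2sqr/n2) ** 2
    denominator = (s1sqr/n1) ** 2 / (n1 - 1) + (s2sqr/n2) ** 2 / (n2 - 1)
    return numerator / denominator


n_1, n_2 = 11, 16
lower, upper = min(n_1, n_2) - 1, n_1 + n_2 - 2

# hypothetical sample variances, for illustration only
variance_pairs = [(1.0, 1.0), (1.0, 4.0), (1.0, 1e6), (1e6, 1.0)]
dofs = [get_welchs_ttest_dof(v1, v2, n_1, n_2) for v1, v2 in variance_pairs]

for (v1, v2), dof in zip(variance_pairs, dofs):
    print(f"s_1^2 = {v1}, s_2^2 = {v2}: d.o.f = {dof} (bounds: {lower} to {upper})")
```

When one sample set dominates the combined variance, the degrees of freedom approach that set's own $n_i - 1$.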

Based on the hypotheses, this is a 2-tailed test

In [12]:
p_value_2tail = scipy.stats.t.sf(abs(t_score), df=dof) * 2
print(f"two-tailed p-value = {p_value_2tail}")

two-tailed p-value = 0.2126820106340286


scipy.stats can compute the statistic and p-value directly

In [13]:
# set equal_var to False, since the two sample sets do not have equal variance; this does a Welch's test
scipy.stats.ttest_ind(samples_1, samples_2, equal_var=False)

Ttest_indResult(statistic=-1.2797528973559833, pvalue=0.2126820106340284)

Going along with the null hypothesis, the probability of getting a more extreme result is 0.213; using a significance level of 10%, the resultant p-value is not extreme enough for the null hypothesis to be rejected.

### Difference between two sample sets T-test - equal variances

In the case of the two sample sets sharing the same variance, a test statistic that actually follows a t-distribution can be calculated, as [here](https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_variances_(1/2_%3C_sX1/sX2_%3C_2)).
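Written out explicitly (the symbol $s_p$ for the pooled standard deviation is notation introduced here), the test statistic is

$ \textrm{t score} = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

where

$ s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} $

Under the null hypothesis $\mu_1 = \mu_2$, this statistic follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.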

In [14]:
random.seed(43)
n_1 = 11
n_2 = 16
samples_1 = [random.random() for _ in range(n_1)]  # samples from U[0,1)
samples_2 = [random.random() for _ in range(n_2)]  # samples from U[0,1)

In [15]:
xbar_1 = sum(samples_1)/n_1
xbar_2 = sum(samples_2)/n_2

s_1_sqr = sum([(x-xbar_1)**2 for x in samples_1])/(n_1-1)
s_2_sqr = sum([(x-xbar_2)**2 for x in samples_2])/(n_2-1)

dof = n_1 + n_2 - 2
s_pooled = math.sqrt( ((n_1-1)*s_1_sqr + (n_2-1)*s_2_sqr)/dof)
t_score = (xbar_1 - xbar_2) / (s_pooled * math.sqrt(1.0/n_1 + 1.0/n_2))
print(f"t-score = {t_score}")

t-score = -1.3797269616522334


In [16]:
p_value_2tail = scipy.stats.t.sf(abs(t_score), df=dof) * 2
print(f"two-tailed p-value = {p_value_2tail}")

two-tailed p-value = 0.1798872270104553


which can be obtained from scipy.stats

In [17]:
scipy.stats.ttest_ind(samples_1, samples_2)

Ttest_indResult(statistic=-1.3797269616522334, pvalue=0.1798872270104553)

Note that Welch's test gives a slightly different answer on the same data, since it does not pool the variances and uses approximate degrees of freedom

In [18]:
# DANGER WILL ROBINSON - when the variances are the same, there is no need to use this approximation
scipy.stats.ttest_ind(samples_1, samples_2, equal_var=False)

Ttest_indResult(statistic=-1.3453442754218516, pvalue=0.1938014455473255)

### Difference between two sample sets T-test - dependent tests

This is a special case of two sample sets, where the sets are not independent groups (A-set versus B-set) but are the same size and represent the same population entities under different conditions (e.g. a class of students who sit an exam and get exam scores, and then the same set of students sitting an exam a year later).

Consider such a set of results, and the expectation that the increase in marks should be 5 points.

In [19]:
random.seed(44)
n = 15

results_1 = [int(random.random() * 100) for _ in range(n)]  # uniform random ints in [0,100)
results_2 = [int(random.random() * 100) for _ in range(n)]  # uniform random ints in [0,100)

In this artificially created dataset, the student at index "i" in results_1 is the same student at index "i" in results_2; the relevant entry in results_2 is meant to be the same student's result when taking the exam a year later.

Based on the expectation of a result increase, one can construct a hypothesis test

$H_0:$ the difference between results_2 and results_1 has mean 5

$H_1:$ the difference between results_2 and results_1 has mean < 5

The t-score can be calculated as done [here](https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples)

In [20]:
deltas = np.array(results_2) - np.array(results_1)
xbar = sum(deltas)/n
s = np.std(deltas, ddof=1)  # sample std deviation, courtesy of numpy
mu0 = 5  # the 5 from the null hypothesis

t_score = (xbar - mu0)/(s/math.sqrt(n))
print(f"t-score = {t_score}")

t-score = -0.23560440840463792


In [21]:
# H1 is mean < 5, so the p-value is the lower-tail probability of the t-score
p_value_1tail = scipy.stats.t.cdf(t_score, df=n-1)
print(f"one-tailed p-value = {p_value_1tail}")

one-tailed p-value = 0.4085756326449204


This can also be computed directly using scipy.stats. The `scipy.stats.ttest_rel` function has no parameter for $\mu_0$, so the value of $\mu_0$ has been subtracted from the second set of results, and the function "sees" a problem of $H_0: \mu=0, H_1: \mu < 0$.

In [22]:
scipy.stats.ttest_rel(np.array(results_2)-mu0, results_1, alternative="less")

Ttest_relResult(statistic=-0.23560440840463792, pvalue=0.4085756326449204)

At a 10% significance level, the p-value is not sufficiently extreme for the null hypothesis to be rejected.
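Since a paired test is a one-sample test on the differences, another route that avoids shifting the data is `scipy.stats.ttest_1samp`, whose `popmean` argument takes $\mu_0$ directly. The sketch below regenerates the data with a local generator using the same seed as above. The names `one_sample` and `paired` are illustrative.

```python
import random

import numpy as np
import scipy.stats

rng = random.Random(44)  # same seed as above, but a local generator
n = 15
results_1 = [int(rng.random() * 100) for _ in range(n)]
results_2 = [int(rng.random() * 100) for _ in range(n)]
mu0 = 5  # the 5 from the null hypothesis

# one-sample test on the per-student differences against mu0
deltas = np.array(results_2) - np.array(results_1)
one_sample = scipy.stats.ttest_1samp(deltas, popmean=mu0, alternative="less")

# paired test on the shifted data, as in the cell above
paired = scipy.stats.ttest_rel(np.array(results_2) - mu0, results_1, alternative="less")

print(f"ttest_1samp: statistic = {one_sample.statistic}, p-value = {one_sample.pvalue}")
print(f"ttest_rel:   statistic = {paired.statistic}, p-value = {paired.pvalue}")
```

The two calls should agree up to floating-point rounding, since both reduce to the same t-score on the differences.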

## References

[1] _Student's t-test_ https://en.wikipedia.org/wiki/Student%27s_t-test

[2] _Welch–Satterthwaite equation_ https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation

[3] _Beyond the Central Limit Theorem_ https://bytepawn.com/beyond-the-central-limit-theorem.html

[4] _Central Limit Theorem_ https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html

[5] _Hypothesis Testing and Z-Test vs. T-Test_ https://www.analyticsvidhya.com/blog/2020/06/statistics-analytics-hypothesis-testing-z-test-t-test/