# Student’s t-Test

The Student’s t-test is a statistical hypothesis test of whether two samples have the same mean, that is, whether they could plausibly have been drawn from the same population.

Assumptions:

* In working with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. 
* The test also assumes that the samples have the same variance and the same size, although there are corrections to the test if these assumptions do not hold. If you have two independent samples and cannot assume that they have equal variance, you can use Welch's t-test.

There are two main versions of Student’s t-test:

**Independent Samples** The case where the two samples are unrelated.

**Dependent Samples** The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

Both the independent and the dependent Student’s t-tests are available in Python via the SciPy functions `ttest_ind()` and `ttest_rel()`, respectively.
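As a minimal sketch of how the two SciPy functions are called, the example below runs both tests on the same pair of arrays. The data values and the names `before` and `after` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative data: the same 8 subjects measured twice.
before = np.array([12.1, 11.4, 13.2, 12.8, 11.9, 12.5, 13.0, 12.2])
after = np.array([12.6, 11.9, 13.1, 13.4, 12.3, 12.9, 13.6, 12.5])

# Independent-samples test: treats the two arrays as unrelated groups.
t_ind, p_ind = stats.ttest_ind(before, after)

# Dependent (paired) test: works on the per-subject differences.
t_rel, p_rel = stats.ttest_rel(before, after)

print('independent: t=%.3f, p=%.3f' % (t_ind, p_ind))
print('paired:      t=%.3f, p=%.3f' % (t_rel, p_rel))
```

Because the paired test removes the subject-to-subject variation, it will usually give a smaller p-value than the independent test when the measurements really are paired.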

In [3]:
from scipy import stats
import numpy as np
np.random.seed(12345678)

### Student’s t-Test for Independent Samples


In [None]:

# t-test for independent samples
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from scipy.stats import sem
from scipy.stats import t
 
# function for calculating the t-test for two independent samples
def independent_ttest(data1, data2, alpha):
	# calculate means
	mean1, mean2 = mean(data1), mean(data2)
	# calculate standard errors
	se1, se2 = sem(data1), sem(data2)
	# standard error on the difference between the samples
	sed = sqrt(se1**2.0 + se2**2.0)
	# calculate the t statistic
	t_stat = (mean1 - mean2) / sed
	# degrees of freedom
	df = len(data1) + len(data2) - 2
	# calculate the two-sided critical value, consistent with the two-sided p-value below
	cv = t.ppf(1.0 - alpha / 2.0, df)
	# calculate the p-value
	p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
	# return everything
	return t_stat, df, cv, p

In [14]:
 
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = independent_ttest(data1, data2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))

t=-2.262, df=198, cv=1.972, p=0.025


The cell above applies our own implementation to two samples whose true means differ by one.

The function returns the t statistic, the degrees of freedom, a critical value, and a p-value. We can compare the absolute value of the t statistic against the critical value, or the p-value against alpha, to decide whether the result is significant and the means differ, as we expect here.

In [15]:
# interpret via critical value
if abs(t_stat) <= cv:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')
# interpret via p-value
if p > alpha:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')

Reject the null hypothesis that the means are equal.
Reject the null hypothesis that the means are equal.
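As a sanity check, the manual t statistic can be compared with SciPy's `ttest_ind()`. The sketch below regenerates the same seeded data and repeats the standard-error calculation used in `independent_ttest`. It assumes equal sample sizes, in which case the unpooled standard error used here coincides with the pooled one that `ttest_ind()` uses by default.

```python
from math import sqrt
from numpy import mean
from numpy.random import seed, randn
from scipy.stats import sem, ttest_ind

# Regenerate the same seeded samples as in the cell above
seed(1)
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51

# Manual t statistic, computed the same way as in independent_ttest
sed = sqrt(sem(data1)**2 + sem(data2)**2)
t_manual = (mean(data1) - mean(data2)) / sed

# SciPy's pooled-variance (equal_var=True by default) independent t-test
t_scipy, p_scipy = ttest_ind(data1, data2)

print('manual t=%.3f, scipy t=%.3f, scipy p=%.3f' % (t_manual, t_scipy, p_scipy))
```

With unequal sample sizes the two statistics would generally differ, because the pooled estimate weights each sample's variance by its degrees of freedom.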


### Student’s t-Test for Dependent Samples

In [18]:
# t-test for dependent samples
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from scipy.stats import t

# function for calculating the t-test for two dependent samples
def dependent_ttest(data1, data2, alpha):
	# calculate means
	mean1, mean2 = mean(data1), mean(data2)
	# number of paired samples
	n = len(data1)
	# sum squared difference between observations
	d1 = sum([(data1[i]-data2[i])**2 for i in range(n)])
	# sum difference between observations
	d2 = sum([data1[i]-data2[i] for i in range(n)])
	# standard deviation of the paired differences
	sd = sqrt((d1 - (d2**2 / n)) / (n - 1))
	# standard error of the mean difference
	sed = sd / sqrt(n)
	# calculate the t statistic
	t_stat = (mean1 - mean2) / sed
	# degrees of freedom
	df = n - 1
	# calculate the two-sided critical value, consistent with the two-sided p-value below
	cv = t.ppf(1.0 - alpha / 2.0, df)
	# calculate the p-value
	p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
	# return everything
	return t_stat, df, cv, p

In [19]:
# seed the random number generator
seed(1)
# generate two independent samples (pretend they are dependent)
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51

# calculate the t test
alpha = 0.05
t_stat, df, cv, p = dependent_ttest(data1, data2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))

t=-2.372, df=99, cv=1.984, p=0.020


In [20]:
# interpret via critical value
if abs(t_stat) <= cv:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')
# interpret via p-value
if p > alpha:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')

Reject the null hypothesis that the means are equal.
Reject the null hypothesis that the means are equal.
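The paired test can also be viewed as a one-sample t-test on the per-pair differences against zero. The sketch below illustrates that equivalence with SciPy's `ttest_rel()` and `ttest_1samp()` on the same seeded data. It is a cross-check under that reading, not part of the implementation above.

```python
from numpy.random import seed, randn
from scipy.stats import ttest_rel, ttest_1samp

# Regenerate the same seeded samples as in the cell above
seed(1)
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51

# Paired t-test from SciPy
t_rel, p_rel = ttest_rel(data1, data2)

# Equivalent formulation: one-sample t-test of the differences against 0
t_diff, p_diff = ttest_1samp(data1 - data2, 0.0)

print('ttest_rel:   t=%.3f, p=%.3f' % (t_rel, p_rel))
print('ttest_1samp: t=%.3f, p=%.3f' % (t_diff, p_diff))
```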


### One-Sample t-Test

A one-sample t-test is a statistical test used to evaluate the null hypothesis that the mean of a 1D sample of independent observations is equal to the true mean of the population from which the data is sampled. In other words, our null hypothesis is that the sample mean and the population mean are equal.

For our t-test, we will use a significance level of 0.05. As a matter of doing ethical science, it is good practice to always state the chosen significance level for a test before conducting it. This ensures that the analyst does not adjust the significance level in order to achieve a desired outcome.
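A minimal sketch of a one-sample test using SciPy's `ttest_1samp()` follows. The sample, the seed, and the hypothesized mean `popmean` are hypothetical values chosen for illustration, and the significance level is fixed before the test is run, as recommended above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 draws from a normal distribution with true mean 5
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=50)

alpha = 0.05    # significance level, stated before running the test
popmean = 5.0   # hypothesized population mean under the null hypothesis

# Two-sided one-sample t-test of the sample mean against popmean
t_stat, p = stats.ttest_1samp(sample, popmean)
print('t=%.3f, p=%.3f' % (t_stat, p))

if p > alpha:
    print('Fail to reject the null hypothesis that the mean equals popmean.')
else:
    print('Reject the null hypothesis that the mean equals popmean.')
```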

### Independent Samples t-Test with SciPy

In [4]:
## Test with samples that have identical means:


rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)

## it returns a tuple with the t-statistic & the p-value
stats.ttest_ind(rvs1,rvs2),stats.ttest_ind(rvs1,rvs2, equal_var = False)


(Ttest_indResult(statistic=0.26833823296238857, pvalue=0.788494433695651),
 Ttest_indResult(statistic=0.26833823296238857, pvalue=0.7884945274950106))

In [5]:
## ttest_ind underestimates p for unequal variances:


rvs3 = stats.norm.rvs(loc=5, scale=20, size=500)
stats.ttest_ind(rvs1, rvs3),stats.ttest_ind(rvs1, rvs3, equal_var = False)


(Ttest_indResult(statistic=-0.46580283298287956, pvalue=0.6414582741343561),
 Ttest_indResult(statistic=-0.46580283298287956, pvalue=0.6414964624656874))

In [6]:
## When n1 != n2, the equal variance t-statistic is no longer equal to the unequal variance t-statistic:


rvs4 = stats.norm.rvs(loc=5, scale=20, size=100)
stats.ttest_ind(rvs1, rvs4),stats.ttest_ind(rvs1, rvs4, equal_var = False)


(Ttest_indResult(statistic=-0.9988253944278285, pvalue=0.3182832709103878),
 Ttest_indResult(statistic=-0.6971257058465435, pvalue=0.4871692772540187))
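To see why the two statistics diverge once the sample sizes differ, the sketch below computes the pooled (Student) and unpooled (Welch) statistics by hand. The samples and seed are hypothetical and do not reproduce the `rvs1` and `rvs4` values above. The degrees of freedom use the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with unequal sizes and unequal spread
rng = np.random.default_rng(42)
a = rng.normal(loc=5, scale=10, size=500)
b = rng.normal(loc=5, scale=20, size=100)

n1, n2 = len(a), len(b)
v1, v2 = a.var(ddof=1), b.var(ddof=1)
diff = a.mean() - b.mean()

# Student's t: pooled variance, weighted by each sample's degrees of freedom
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = diff / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

# Welch's t: each sample keeps its own variance estimate
se2 = v1 / n1 + v2 / n2
t_welch = diff / np.sqrt(se2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))

print('pooled t=%.3f (df=%d)' % (t_pooled, n1 + n2 - 2))
print('Welch  t=%.3f (df=%.1f)' % (t_welch, df_welch))
```

When `n1 == n2` the two standard errors are algebraically identical, which is why the statistics matched in the earlier equal-size examples and only the p-values differed slightly.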

In [7]:
## T-test with different means, variance, and n:


rvs5 = stats.norm.rvs(loc=8, scale=20, size=100)
stats.ttest_ind(rvs1, rvs5),stats.ttest_ind(rvs1, rvs5, equal_var = False)


(Ttest_indResult(statistic=-1.467966985449067, pvalue=0.14263895620529113),
 Ttest_indResult(statistic=-0.9436597361713308, pvalue=0.3474417033479409))