# Unit 7.1 Two sample t-test:
## Working with the equations to calculate the test statistic and the P-value 

Statistics textbooks and online resources often give a detailed description of the t-test. Things can get confusing when you work with more than one book or online resource, because the mathematical notation differs and the equations look slightly different.

In this exercise we will compare the results from Scipy's function _ttest_ind_ (in package _scipy.stats_) to the results from the equations given in Appendix A of John Townend's Practical Statistics book, and the equations that the Collaborative Statistics textbook presents in Chapter 10, where they discuss the two-sample t-test (reading the sections 10.1-10.3 is recommended).


In [1]:
import numpy as np
from scipy.stats import ttest_ind, norm
import matplotlib.pyplot as plt

sample1=np.array([100,107,115,96,102,110,109,103,99,106])
sample2=np.array([95,90,102,100,98,90,100,95,88,105])



## 1 Townend's book Appendix A: Unpaired t-test example data


Equations: 


(a) Pooled estimate of variance ('pooled variance')

$\Large s_p^2 = \frac{(n_1-1)\,s_1^2 + (n_2-1)\,s_2^2}{n_1+n_2-2}$


(b) Test statistic

$\Large t_{df}= \frac{|\,\bar{x_1} -\bar{x_2}\,|}{\sqrt{s_p^2\,\left(\frac{1}{n_1} +\frac{1}{n_2}\right)}}$

The symbol '| |' indicates the absolute value of the difference in the means (Python function _np.abs_).

(c) Degrees of Freedom

$ df = n_1+n_2-2$



In [2]:
m1, s1, n1 =np.mean(sample1) , np.std(sample1,ddof=1) , np.size(sample1)
m2, s2, n2 =np.mean(sample2) , np.std(sample2,ddof=1) , np.size(sample2)

print ("Summary statistics : mean,     variance,   sample size")
print (72*'-')
# see the note on formatting float and integer numbers at the end of this notebook.
print (f"sample data set 1  : {m1:8.4f}, {s1**2:8.4f} , {n1:4d}")
print (f"sample data set 2  : {m2:8.4f}, {s2**2:8.4f} , {n2:4d}")


Summary statistics : mean,     variance,   sample size
------------------------------------------------------------------------
sample data set 1  : 104.7000,  33.3444 ,   10
sample data set 2  :  96.3000,  32.2333 ,   10


In [None]:
# code for calculation of 
# (1) absolute difference in the mean
# (2) degrees of freedom
# (3) pooled variance
# (4) the test statistic (t-value)
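
One possible way to fill in this cell is sketched below. It follows equations (a) to (c) above directly; the variable names `diff`, `df`, `sp2`, and `tvalue` are our own choice and are not prescribed by Townend's book.

```python
import numpy as np

sample1 = np.array([100, 107, 115, 96, 102, 110, 109, 103, 99, 106])
sample2 = np.array([95, 90, 102, 100, 98, 90, 100, 95, 88, 105])

# summary statistics (ddof=1 gives the sample standard deviation)
m1, s1, n1 = np.mean(sample1), np.std(sample1, ddof=1), np.size(sample1)
m2, s2, n2 = np.mean(sample2), np.std(sample2, ddof=1), np.size(sample2)

# (1) absolute difference in the means
diff = np.abs(m1 - m2)
# (2) degrees of freedom, equation (c)
df = n1 + n2 - 2
# (3) pooled variance, equation (a)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
# (4) test statistic, equation (b)
tvalue = diff / np.sqrt(sp2 * (1 / n1 + 1 / n2))

print(f"difference in means : {diff:8.4f}")
print(f"degrees of freedom  : {df:4d}")
print(f"pooled variance     : {sp2:8.4f}")
print(f"t-value             : {tvalue:8.4f}")
```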

In [3]:
from scipy.stats import ttest_ind
from scipy.stats.distributions import t 

# code for calculation of the p-values
# and for using the scipy function ttest_ind

# we will need the t-distribution to look up the p-value for our t-value.
# t.cdf(tvalue, df) gives the probability below tvalue, so for a non-negative
# tvalue the two-sided p-value is pvalue = 2*(1 - t.cdf(tvalue, df))
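
A minimal sketch of this step, assuming the pooled-variance equations from Townend's Appendix A. Because the test statistic is an absolute value, the two-sided p-value is twice the upper-tail probability; here we use `t.sf`, the survival function, which equals `1 - t.cdf`. The variable names are our own.

```python
import numpy as np
from scipy.stats import t, ttest_ind

sample1 = np.array([100, 107, 115, 96, 102, 110, 109, 103, 99, 106])
sample2 = np.array([95, 90, 102, 100, 98, 90, 100, 95, 88, 105])

m1, s1, n1 = np.mean(sample1), np.std(sample1, ddof=1), np.size(sample1)
m2, s2, n2 = np.mean(sample2), np.std(sample2, ddof=1), np.size(sample2)

# pooled-variance t-test from the equations
df = n1 + n2 - 2
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
tvalue = np.abs(m1 - m2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# two-sided p-value: twice the probability above the (non-negative) t-value
pvalue = 2 * t.sf(tvalue, df)

# the same test with scipy (equal_var=True is the default: pooled variance)
result = ttest_ind(sample1, sample2)

print(f"equations : t = {tvalue:8.4f}, p = {pvalue:8.6f}")
print(f"ttest_ind : t = {result.statistic:8.4f}, p = {result.pvalue:8.6f}")
```

Note that `ttest_ind` returns a signed t-value (positive here because the first sample has the larger mean), while the equation above uses the absolute difference.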


## 2. Collaborative Statistics (Section 10.2)

(a) Estimate the standard error for the difference in the means (Eq 10.1)

$\Large se_{diff} = \sqrt{\frac{(s_1)^2}{n_1}+\frac{(s_2)^2}{n_2}} = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$

Note: This is the Gaussian error propagation equation that you may have encountered in other classes already (e.g. ENV327)

(b) Test statistic (Equation 10.2)

$\Large t_{df} = \frac{|\bar{x_1}-\bar{x_2}|}{\sqrt{\frac{s_1^2}{n_1} +\frac{s_2^2}{n_2}}}$

Note: The population means $\mu_1$ and $\mu_2$ that the authors use in their equation are assumed to be identical ($\mu_1-\mu_2 = 0$). That is the null hypothesis we make in the standard t-test, so the term drops out of the equation. Further, we conduct a two-sided test: we are looking for large differences, either negative or positive, so the absolute difference is used.

(c) The degrees of freedom estimate (Equation 10.3):

$\Large  df =  \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\, \left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\, \left(\frac{s_2^2}{n_2}\right)^2}$


This is a complicated-looking equation, but it only depends on the sample sizes and the standard deviations. 
Tip: use helpful variables to store partial terms of the calculation, for example calculate the numerator (top) and denominator (bottom) of the fraction separately and then finally divide. 

Inspection of the equations shows that the terms

$\frac{s_1^2}{n_1}$ and $\frac{s_2^2}{n_2}$ are repeatedly used.
We assign them to two variables _help1_ and _help2_.


In [None]:
# code for calculation of the degrees of freedom with the equations from 
# the collaborative statistics book 
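
The cell could be completed along the following lines. This is a sketch that follows Equations 10.1 to 10.3 as written above, using the helper variables _help1_ and _help2_; the names `numerator`, `denominator`, and `df_cs` are our own. We assume that calling `ttest_ind` with `equal_var=False` (the unequal-variance test) is the matching scipy calculation, and the comparison at the end checks that.

```python
import numpy as np
from scipy.stats import t, ttest_ind

sample1 = np.array([100, 107, 115, 96, 102, 110, 109, 103, 99, 106])
sample2 = np.array([95, 90, 102, 100, 98, 90, 100, 95, 88, 105])

m1, s1, n1 = np.mean(sample1), np.std(sample1, ddof=1), np.size(sample1)
m2, s2, n2 = np.mean(sample2), np.std(sample2, ddof=1), np.size(sample2)

# terms that appear repeatedly in the equations
help1 = s1**2 / n1
help2 = s2**2 / n2

# (a) standard error of the difference in the means (Eq 10.1)
se_diff = np.sqrt(help1 + help2)
# (b) test statistic (Eq 10.2)
tvalue = np.abs(m1 - m2) / se_diff
# (c) degrees of freedom (Eq 10.3): numerator and denominator separately
numerator = (help1 + help2) ** 2
denominator = help1**2 / (n1 - 1) + help2**2 / (n2 - 1)
df_cs = numerator / denominator

# two-sided p-value; the degrees of freedom need not be an integer
pvalue = 2 * t.sf(tvalue, df_cs)

# comparison with scipy's unequal-variance version of the test
result = ttest_ind(sample1, sample2, equal_var=False)

print(f"equations : t = {tvalue:8.4f}, df = {df_cs:8.4f}, p = {pvalue:8.6f}")
print(f"ttest_ind : t = {result.statistic:8.4f}, p = {result.pvalue:8.6f}")
```

For these data the degrees of freedom come out just below $n_1+n_2-2=18$, because the two sample variances are nearly equal.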

---
## Summary and Conclusion

- We can reproduce the results of the scipy.stats function _ttest_ind_. At large sample sizes and when the standard deviations in the two samples are nearly the same, the two versions of the t-test give essentially the same numerical results.

- It is important to remember that the p-value returned from the function _ttest_ind_ is for a two-sided test.
You can use it directly in comparison with your 2-sided alpha value (pvalue<0.05 to reject the null hypothesis).

- You can use the sample sizes, means, and standard deviations (6 numerical values) to calculate p-values even if you don't have the full data available. That would allow you to validate results published in research reports, for example.
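
As an illustration of the last point, scipy.stats also provides the function _ttest_ind_from_stats_, which takes the six summary values instead of the raw data. The sketch below computes the summary values from our samples only so that the result can be checked against _ttest_ind_; with a published report you would type in the reported numbers instead.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats

sample1 = np.array([100, 107, 115, 96, 102, 110, 109, 103, 99, 106])
sample2 = np.array([95, 90, 102, 100, 98, 90, 100, 95, 88, 105])

# the six numerical values: means, sample standard deviations, sample sizes
m1, s1, n1 = np.mean(sample1), np.std(sample1, ddof=1), np.size(sample1)
m2, s2, n2 = np.mean(sample2), np.std(sample2, ddof=1), np.size(sample2)

# t-test from summary statistics only (pooled variance by default)
from_stats = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2)

# t-test from the full data, for comparison
from_data = ttest_ind(sample1, sample2)

print(f"from summary stats : t = {from_stats.statistic:8.4f}, p = {from_stats.pvalue:8.6f}")
print(f"from full data     : t = {from_data.statistic:8.4f}, p = {from_data.pvalue:8.6f}")
```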


---
### Appendix: f-string formatting of real and integer numbers in strings

We use f-strings to combine text with numbers in strings.
The syntax is f"string with number x={x:fmt}", where
x is a variable (type float or integer) and :fmt
is the format specification that controls how the number is printed.
For floating point numbers you can specify the rounding precision.

#### Examples
_{x:.4f}_  creates a real number representation of the value in x rounded to four decimals

_{x:10.4f}_  creates a real number representation of the value in x rounded to four decimals
and total character length of 10 (e.g., *'___19.1234'*). 

For integers you can use _{i:4}_ or _{i:04}_ to control the width of the printed value (the second form pads with leading zeros); _{i:4d}_ and _{i:04d}_ behave the same way.

In [None]:
#Examples:
x=19.1234131
print(f"x={x:.4f}")
print(f"x={x:10.4f}")
i=14
print(f"i={i:4}")
print(f"i={i:04}")