<a href="https://colab.research.google.com/github/justinphan/Student-T-Test/blob/master/T_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference: https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/more-standard-deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance

https://www.statisticshowto.datasciencecentral.com/p-value/

The t-test (Student's t-test) is one of the most popular procedures in statistics. 
It compares two averages (means) and tells you how different they are, relative to the variation in the data.

Example 1: A drug company wants to test its new cancer drug to see how effective it is at improving life expectancy. The drug is tested on two groups. The first group, the control group (which takes a placebo), has an average life expectancy of 5+, and the second group, which is given the new drug, has an average life expectancy of 6+. The scientists use Student's t-test to estimate how likely it is that a difference this large arose by chance.

Example 2: The next time you have a cold, you take a medicine. You survey your friends, and they all say they recovered sooner after taking the medicine. You want to know whether the effect happened by chance or whether it will be repeated.

**T-score**

$t\text{-score}=\frac{\text{difference between the two groups}}{\text{difference within the groups}}$

The higher the t-score, the larger the difference between the groups relative to the variation within them; the lower the t-score, the more similar the groups are.
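
As a rough illustration of this ratio, the sketch below compares two made-up pairs of groups that have the same difference in means but a different spread within each group. The data and the helper name `t_score` are assumptions for illustration only; the denominator uses the standard-error form from the steps listed later in this notebook.

```python
import numpy as np

def t_score(x, y):
    """Difference between the group means divided by the variation within the groups."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    between = y.mean() - x.mean()
    # Standard error of the difference, built from the unbiased sample variances
    within = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return between / within

# Hypothetical data: in both pairs the means differ by 1.0
tight_x, tight_y = [5.0, 5.1, 4.9, 5.0, 5.0], [6.0, 6.1, 5.9, 6.0, 6.0]
wide_x, wide_y = [3.0, 7.0, 5.0, 4.0, 6.0], [4.0, 8.0, 6.0, 5.0, 7.0]

t_tight = t_score(tight_x, tight_y)  # little variation within the groups
t_wide = t_score(wide_x, wide_y)     # a lot of variation within the groups
print("tight groups: t =", t_tight)
print("wide groups:  t =", t_wide)
```

The same difference in means gives a much larger t-score when the values inside each group are tightly clustered.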

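Every t-value is paired with a p-value. The **p-value** is written as a decimal. It is the probability of seeing a difference at least as large as the one observed if there were no real difference between the groups (the null hypothesis). The lower the p-value, the stronger the evidence that the result did not occur by chance. The usual threshold is 0.05: results with a p-value below it are considered statistically significant. For example, a p-value of 0.02 means that, if there were no real difference, a result at least this extreme would occur only 2% of the time.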
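
To show how a t-value maps to a p-value, here is a minimal sketch using the t-distribution from `scipy.stats`. The helper name `two_sided_p_value` and the t-values are hypothetical; the degrees of freedom (18) match two groups of 10 elements each.

```python
from scipy import stats

def two_sided_p_value(t, df):
    """Probability of a t-value at least as extreme as `t`, in either direction,
    when there is no real difference between the groups."""
    # sf is the survival function (1 - cdf); doubling it covers both tails
    return 2 * stats.t.sf(abs(t), df)

# Hypothetical t-values, each with 18 degrees of freedom
for t in (0.5, 2.101, 5.0):
    print("t =", t, "p =", two_sided_p_value(t, df=18))
```

With 18 degrees of freedom, a t-value of about 2.1 sits right at the usual 0.05 threshold; larger t-values give smaller p-values.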

There are three types of t-test:

1. **Independent samples t-test**: compares the means of two separate groups.
2. **Paired sample t-test**: compares the means of the same group at different times (say, one month apart).
3. **One sample t-test**: compares the mean of a single group against a known mean.
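
Each of the three types has a counterpart in `scipy.stats`. The sketch below runs all three on made-up data; the variable names, the seed, and the numbers (including the "known mean" of 5.0) are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

# Hypothetical measurements
group_x = rng.normal(loc=5.0, scale=1.0, size=12)
group_y = rng.normal(loc=6.0, scale=1.0, size=12)
before = rng.normal(loc=70.0, scale=5.0, size=12)
after = before + rng.normal(loc=-1.0, scale=2.0, size=12)

# 1. Independent samples: two separate groups
independent = stats.ttest_ind(group_x, group_y)
# 2. Paired samples: the same group measured at two different times
paired = stats.ttest_rel(before, after)
# 3. One sample: a single group against a known mean (assumed to be 5.0 here)
one_sample = stats.ttest_1samp(group_x, popmean=5.0)

for name, result in [("independent", independent),
                     ("paired", paired),
                     ("one sample", one_sample)]:
    print(name, "t =", result.statistic, "p =", result.pvalue)
```

The paired test gives the same result as a one sample test on the differences `before - after` against a mean of 0, which is why the pairing of measurements matters.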

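Steps:

1.   Choose the significance level $\alpha$. A typical value is 0.05.
2.   Determine the degrees of freedom:
$df=n_x+n_y-2$
3.   Calculate the t-statistic:
$t=\frac{M_x-M_y}{\sqrt{S_x^2/n_x+S_y^2/n_y}}$ 
$M$: sample mean; $n$: number of elements; $S^2$: unbiased sample variance (so $S$ is the sample standard deviation). With equal group sizes, as in the example below, this is the same as the pooled-variance t-statistic that goes with the degrees of freedom in step 2.
4.   Look up the critical t-value for $\alpha$ and $df$ (calculated internally in the example below) and calculate the p-value.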
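
The steps above can be sketched as a single function that also makes the critical t-value explicit through `stats.t.ppf`. This is a minimal sketch under stated assumptions: the function name `two_sample_t_test` and the data are hypothetical, and both groups are assumed to have the same size, so that the formula in step 3 matches the degrees of freedom in step 2.

```python
import numpy as np
from scipy import stats

def two_sample_t_test(x, y, alpha=0.05):
    """Two-sided independent t-test for two groups of equal size."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    # Step 2: degrees of freedom
    df = len(x) + len(y) - 2
    # Step 3: t-statistic from the means and the unbiased variances (ddof=1)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    t = (x.mean() - y.mean()) / se
    # Step 4: critical t-value for a two-sided test, and the p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, t_crit, p

# Hypothetical data
x = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
y = [4.1, 4.5, 3.9, 4.8, 4.2, 4.4]

# Step 1: significance level
alpha = 0.05
t, df, t_crit, p = two_sample_t_test(x, y, alpha=alpha)
reject = abs(t) > t_crit  # the same decision as p < alpha
print("t =", t, "df =", df, "critical t =", t_crit, "p =", p)
print("reject the null hypothesis:", reject)
```

Comparing `abs(t)` with the critical t-value and comparing `p` with `alpha` are two ways of making the same decision.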












In [5]:
## Import the packages
import numpy as np
from scipy import stats


## Define 2 random distributions
#Sample Size
N = 10
#Gaussian distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
#Gaussian distributed data with mean = 0 and var = 1
b = np.random.randn(N)


## Calculate the Standard Deviation
#Calculate the variance to get the standard deviation

#For an unbiased estimate of the variance we have to divide by N-1 instead of N,
#and therefore the parameter ddof = 1
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

#pooled standard deviation (both samples have the same size N)
s = np.sqrt((var_a + var_b)/2)



## Calculate the t-statistics
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))

print(t)

## Compare with the critical t-value
#Degrees of freedom
df = 2*N - 2
print(df)

#p-value after comparison with the t 
print(stats.t.cdf(t,df=df))
#one-tailed p-value; abs(t) keeps it valid when t is negative
p = 1 - stats.t.cdf(abs(t),df=df)


print("t = " + str(t))
print("p = " + str(2*p))
### After comparing the t statistic with the critical t value (computed internally),
## we get a very small two-sided p value (about 6.5e-05 in the run shown below;
## the exact value changes from run to run because the data is random).
## It is far below 0.05, so we reject the null hypothesis: the difference between
## the means of the two distributions is statistically significant.


## Cross Checking with the internal scipy function
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))

5.168290171760298
18
0.9999677040156773
t = 5.168290171760298
p = 6.459196864549988e-05
t = 5.168290171760298
p = 6.45919686454931e-05
