# Lab | Inferential statistics - T-test & P-value

### Instructions
1. We will work through another simple example of a two-sample t-test (pooled, i.e. when the variances are equal). This time, however, it is a one-sided t-test.

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming that the conditions for conducting the t-test are met, does the data provide sufficient evidence to show that one machine is better than the other?

In [2]:
# importing dependencies 
import pandas as pd
import numpy as np
import math as m
import scipy.stats as stats

In [3]:
# importing the data
machine_data = pd.read_csv("machine.txt", encoding='utf-16', sep='\t')

machine_data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


**Setting up our experiment & hypotheses**

* H0: The two packing machines perform the same (μ_New_Machine = μ_Old_Machine).
* H1: The new packing machine performs better than the old machine. In this case that means the new machine is faster, so its mean packing time is smaller (μ_New_Machine < μ_Old_Machine).
* We define control = Old Machine and treatment = New Machine.
* We define significance level `a = 0.05`.

**Assumptions**

* We assume that there is sufficient evidence to conduct the t-test, as the instructions of the lab state. This means the following:
<br>
    
     * The two sample groups are independent.
     * The data in both samples follow a normal distribution.
     * Homogeneity assumption (the two samples have similar variances).
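
The lab takes these assumptions as given. As an optional sanity check that is not part of the original lab, a minimal sketch could run a Shapiro-Wilk test on each sample and Levene's test across both. The array names `old` and `new` are placeholders chosen here, with values copied from the lab data.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the lab data
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])
new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
shapiro_old = stats.shapiro(old)
shapiro_new = stats.shapiro(new)

# Levene: H0 = both samples come from populations with equal variances
levene = stats.levene(old, new)

print("Shapiro-Wilk p (old):", round(shapiro_old.pvalue, 4))
print("Shapiro-Wilk p (new):", round(shapiro_new.pvalue, 4))
print("Levene p:", round(levene.pvalue, 4))
```

With only ten observations per group these tests have little power, so a large p-value here is weak support for the assumptions rather than proof of them.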
     
**Choose the appropriate test**

* Our samples are independent and small in size. Additionally, the assumptions of normality and homogeneity of variance are taken as given per the instructions of the lab. Thus, an independent one-sided t-test seems to be the best choice here.

    * We select `equal_var=True` due to our assumption that there are equal population variances
    * We choose `alternative="greater"` since we want to check whether the mean of the first sample (*control*) is larger than the mean of the second sample (*treatment*).
    * We compare `p_value` directly with the significance level: with `alternative="greater"`, `ttest_ind` already returns the one-sided p-value, so it must not be halved.
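
To make the choice of test concrete, the pooled statistic can be computed by hand and compared with `scipy`. This is a minimal sketch, not part of the original lab: the names `old`, `new`, `t_manual` and `p_manual` are chosen here for illustration, and the values are copied from the lab data.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the lab data
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])  # control
new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])  # treatment

n_old, n_new = len(old), len(new)
df = n_old + n_new - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_old - 1) * old.var(ddof=1) + (n_new - 1) * new.var(ddof=1)) / df
std_err = np.sqrt(pooled_var * (1 / n_old + 1 / n_new))

t_manual = (old.mean() - new.mean()) / std_err
# One-sided p-value for H1: mean(old) > mean(new) is the upper-tail probability
p_manual = stats.t.sf(t_manual, df)

t_scipy, p_scipy = stats.ttest_ind(old, new, equal_var=True, alternative="greater")
print("manual:", round(t_manual, 4), round(p_manual, 5))
print("scipy: ", round(t_scipy, 4), round(p_scipy, 5))
```

Both lines should agree, which shows that the value returned for `alternative="greater"` is the upper-tail probability of the pooled statistic.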

In [5]:
machine_data.columns

Index(['New machine', '  Old machine'], dtype='object')
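
The second column name carries leading spaces, which is why it has to be indexed as `'  Old machine'`. One possible way to avoid this, sketched here on a small stand-in frame (`demo` is a hypothetical name, and only the first two rows of the data are used), is to strip whitespace from the column names right after loading.

```python
import pandas as pd

# Stand-in for the loaded data, with the same padded column name
demo = pd.DataFrame({"New machine": [42.1, 41.0], "  Old machine": [42.7, 43.6]})

# Remove leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

After stripping, the column can be selected as `demo["Old machine"]`.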

In [10]:
# creating control and treatment arrays for the t test
control = np.array(machine_data['  Old machine'])
treatment = np.array(machine_data['New machine'])

control, treatment #sanity check

(array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44. , 44.1]),
 array([42.1, 41. , 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]))

In [34]:
# setting up the t test
# alternative="greater" tests H1: mean(control) > mean(treatment),
# so the returned p-value is already one-sided and must not be halved
ttest, p_value = stats.ttest_ind(control, treatment, equal_var=True, alternative="greater")
print("one-sided pvalue: ", round(p_value, 5))

if p_value < 0.05:
    print("Reject H0: the mean of the old machine sample is larger than the mean of the new machine sample")
else:
    print("No evidence to reject the null hypothesis")

one-sided pvalue:  0.00161
Reject H0: the mean of the old machine sample is larger than the mean of the new machine sample
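
As a check on how the tails relate, the sketch below compares the p-value from `alternative="two-sided"` with the one from `alternative="greater"` on the same data. The variable names are chosen here for illustration. When the t statistic is positive, the one-sided value is half of the two-sided one.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the lab data
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])
new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

two_sided = stats.ttest_ind(old, new, equal_var=True, alternative="two-sided")
one_sided = stats.ttest_ind(old, new, equal_var=True, alternative="greater")

print("two-sided p:", round(two_sided.pvalue, 5))
print("one-sided p:", round(one_sided.pvalue, 5))
print("half of two-sided:", round(two_sided.pvalue / 2, 5))
```

Halving is therefore only needed when the test was run two-sided; the result of a call with `alternative="greater"` can be compared with the significance level as it is.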


## Conclusions

Our null hypothesis was that the two machines perform the same, i.e. that their mean packing times are equal. Our alternative hypothesis was that the new machine packs faster, so that the older machine has the larger mean packing time. To test this, we used the packing times recorded for both machines.

After running our test we can conclude that we **do have** sufficient statistical evidence to reject the null hypothesis (one-sided *p* = 0.00161 < *a* = 0.05). The t-test supports the alternative that the mean packing time of our control group (Old Machine) is larger than that of the treatment group (New Machine).

In the context of this experiment we would conclude that this is a desirable result, since it seems that the new machine we bought to pack jars is actually faster than our older machine. Therefore, it was a good idea to invest in a less time-consuming machine for our factory.

# Second Problem

2. An additional problem (not mandatory): In this case we can't assume that the population variances are equal. Hence in this case we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. Data is provided in the file `files_for_lab/student_gpa.txt`. At the 5% significance level, does the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

* Test statistics can be calculated as: link to the image - Test statistics calculation for Unpooled Variance Case

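* The lab gives the degrees of freedom as `(n1-1)+(n2-1)`. Note that this is the pooled-variance value; for unpooled variances, `scipy.stats.ttest_ind` with `equal_var=False` uses the Welch-Satterthwaite approximation instead.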
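
For reference, the standard unpooled (Welch) test statistic and the Welch-Satterthwaite degrees of freedom can be written as follows. The lab's own image is not reproduced here, so the notation is an assumption: $\bar{x}_i$, $s_i^2$ and $n_i$ denote the sample mean, sample variance and size of group $i$, and $\nu$ the approximate degrees of freedom.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

In general $\nu$ is not an integer and never exceeds $(n_1-1)+(n_2-1)$.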

In [17]:
# importing the data
student_data = pd.read_csv("student_gpa.txt", sep='\t')

student_data

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65


**Setting up our experiment & hypotheses**

* H0: The mean GPAs of sophomores and juniors at the university do not differ (μ_Sophomores = μ_Juniors)
* H1: The mean GPAs of sophomores and juniors at the university do differ. In this case we won't assume directionality (μ_Sophomores ≠ μ_Juniors).
* We define `groupS` = Sophomores and `groupJ` = Juniors.
* We define significance level `a = 0.05`.

**Assumptions**

* We assume that there is sufficient evidence to conduct the t-test, as the instructions of the lab state. However, we need to keep in mind that we can't pool the variances. This means the following:
<br>
    
     * The two sample groups are independent.
     * The data in both samples follow a normal distribution.
     
**Choose the appropriate test**

* Just like in the first problem, we will go for a t-test. Our samples are independent here too, but we cannot assume homogeneity of variance this time. As a result we will perform a Welch's t-test, since it does not assume equal population variances. This time we will also go for a two-sided t-test.

    * We select `equal_var=False` for a Welch's t test.
    * We choose `alternative="two-sided"`.
    * We choose `p_value` as our main metric, without halving it, since this is a two-sided t-test.
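
To show what `equal_var=False` does, the sketch below computes Welch's statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, then compares them with `scipy`. It is an illustration under stated assumptions rather than part of the original lab: the names `soph`, `jun`, `se2_s` and `se2_j` are chosen here, and the GPA values are copied from the lab data.

```python
import numpy as np
from scipy import stats

# GPA values copied from the lab data (17 sophomores, 13 juniors)
soph = np.array([3.04, 1.71, 3.30, 2.88, 2.11, 2.60, 2.92, 3.60, 2.28, 2.82,
                 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27])
jun = np.array([2.56, 2.77, 2.70, 3.00, 2.98, 3.47, 3.26, 3.20, 3.19, 2.65,
                3.00, 3.39, 2.58])

# Squared standard error of each sample mean (variances are NOT pooled)
se2_s = soph.var(ddof=1) / len(soph)
se2_j = jun.var(ddof=1) / len(jun)

t_manual = (soph.mean() - jun.mean()) / np.sqrt(se2_s + se2_j)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (se2_s + se2_j) ** 2 / (
    se2_s ** 2 / (len(soph) - 1) + se2_j ** 2 / (len(jun) - 1)
)

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

t_scipy, p_scipy = stats.ttest_ind(soph, jun, equal_var=False, alternative="two-sided")
print("manual:", round(t_manual, 4), round(df_welch, 2), round(p_manual, 5))
print("scipy: ", round(t_scipy, 4), round(p_scipy, 5))
```

The approximate degrees of freedom come out below the pooled value of 28, which is the price paid for not assuming equal variances.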

In [18]:
student_data.columns

Index(['Sophomores', '  Juniors'], dtype='object')

In [23]:
# creating the sophomore and junior arrays for the t test
groupS = np.array(student_data['Sophomores'])
groupJ = np.array(student_data['  Juniors'])

groupJ = groupJ[~np.isnan(groupJ)] #removing nans from array so the test can run

groupS, groupJ #sanity check

(array([3.04, 1.71, 3.3 , 2.88, 2.11, 2.6 , 2.92, 3.6 , 2.28, 2.82, 3.03,
        3.13, 2.86, 3.49, 3.11, 2.13, 3.27]),
 array([2.56, 2.77, 2.7 , 3.  , 2.98, 3.47, 3.26, 3.2 , 3.19, 2.65, 3.  ,
        3.39, 2.58]))
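
The `NaN` values appear because the two groups have different sizes, so the shorter column is padded when both are loaded into one frame. An alternative to masking with `np.isnan`, sketched here on a small stand-in frame (`demo` is a hypothetical name holding only a few of the values), is to drop the missing entries in pandas before converting to an array.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded data: the shorter column is padded with NaN
demo = pd.DataFrame({
    "Sophomores": [3.04, 1.71, 3.30],
    "  Juniors": [2.56, 2.77, np.nan],
})

# dropna() removes the padding, to_numpy() gives a plain NumPy array
juniors = demo["  Juniors"].dropna().to_numpy()

print(juniors)
```

Either approach leaves each group with only its real observations, which is what the t-test needs.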

In [33]:
# setting up the t test
ttest, p_value = stats.ttest_ind(groupS, groupJ, equal_var=False, alternative="two-sided")
print("pvalue: ", round(p_value, 5))

print("Since our hypothesis is two-sided, pvalue two-sided", p_value)
if p_value < 0.05:
    print("Reject H0: mean sample of Sophomores' GPA is different than the mean sample of Juniors' GPA")
else:
    print("No evidence to reject the null hypothesis")

pvalue:  0.36422
Since our hypothesis is two-sided, pvalue two-sided 0.3642180675348571
No evidence to reject the null hypothesis


## Conclusions

Our null hypothesis was that the mean GPAs of the two groups in question (sophomore and junior college students) do not differ. Our alternative hypothesis was that the two means are different. To test this, we used GPA scores sampled from both groups of students.

After running our test we can conclude that we **do not** have sufficient statistical evidence to reject the null hypothesis, i.e. we cannot claim that the means of the two groups are different (*p* = 0.36422 > *a* = 0.05).

In the context of this experiment we would conclude that this is an interesting result, since we may have expected that grades would, in general terms, get better as students progress beyond their first years. Further investigation would potentially be useful in order to explore how different the GPA scores of junior and sophomore students are (which one is greater?) and why (talk to teachers and collect more data from students).