# lab-t-tests-p-values

#### Instructions

#### 1. This is another simple example of a two-sample t-test (pooled, for when the variances are equal), but this time the test is one-sided.

#### In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that the new machine packs faster than the old one?

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the tab-separated, UTF-16 encoded file into Python as a dataframe
machine_df = pd.read_csv("machine.txt", encoding='utf-16', sep='\t')
machine_df

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
# Standardizing the column names: lowercase letters only, spaces replaced with underscores
machine_df.columns = [col.lower().replace('    ', '').replace(' ', '_') for col in machine_df.columns]
list(machine_df.columns)

['new_machine', 'old_machine']

In [4]:
# the observations per machine in 2 separate variables:
new_machine = machine_df['new_machine']
old_machine = machine_df['old_machine']

In [5]:
# Let's describe the data and note the mean and standard deviation of each machine.
machine_df.describe()
# describe() reports the sample standard deviation (ddof=1)
# The new machine's mean is 42.14 seconds and its standard deviation is approximately 0.68
# The old machine's mean is 43.23 seconds and its standard deviation is approximately 0.75

Unnamed: 0,new_machine,old_machine
count,10.0,10.0
mean,42.14,43.23
std,0.683455,0.749889
min,41.0,41.7
25%,41.8,42.8
50%,42.2,43.4
75%,42.625,43.75
max,43.2,44.1


#### Step 1: Define the null hypothesis (the opposite of what we want to show)
#### The goal in classical inferential statistics is to check whether the data give enough evidence to reject the null hypothesis
#### null hypothesis -> H0: μ >= μ0, which means that the new machine does not pack faster on average than the machine currently used
#### here μ is the population mean packing time of the new machine (sample mean: 42.14 seconds) and μ0 is that of the old machine (sample mean: 43.23 seconds)
#### equivalently, the null hypothesis can be written as H0: μ - μ0 >= 0

In [6]:
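# n, mean and standard deviation of the first sample (new_machine)
# note: np.std uses ddof=0 by default, so this value is smaller than the sample standard deviation from describe()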
print('n:', len(new_machine),', mean:', np.mean(new_machine), ', standard deviation: ',np.std(new_machine))


n: 10 , mean: 42.14 , standard deviation:  0.6483826030978941


In [7]:
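# n, mean and standard deviation of the second sample (old_machine)
# note: np.std uses ddof=0 by default, so this value is smaller than the sample standard deviation from describe()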
print('n:', len(old_machine), ', mean:', np.mean(old_machine), ', standard deviation: ',np.std(old_machine))

n: 10 , mean: 43.230000000000004 , standard deviation:  0.7114070564732956


In [8]:
print("New machine's mean is {:.2f}".format(np.mean(new_machine)))
print("Old machine's mean is {:.2f}".format(np.mean(old_machine)))
print("New machine's standard deviation is {:.2f}".format(np.std(new_machine)))
print("Old machine's standard deviation is {:.2f}".format(np.std(old_machine)))

New machine's mean is 42.14
Old machine's mean is 43.23
New machine's standard deviation is 0.65
Old machine's standard deviation is 0.71
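The standard deviations printed here (0.65 and 0.71) differ from the ones reported by `describe()` (0.68 and 0.75) because of the divisor: `np.std` divides by n unless told otherwise, while `describe()` divides by n - 1. A minimal sketch of the difference, re-entering the new machine's ten times from the table above as a plain array (the variable names are illustrative):

```python
import numpy as np

# Packing times in seconds for the new machine, copied from the table above
new_times = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

# ddof=0 (NumPy's default): divide by n, the "population" formula
sd_population = np.std(new_times)

# ddof=1: divide by n - 1, the sample standard deviation used in the t-test
sd_sample = np.std(new_times, ddof=1)

print("ddof=0: {:.4f}".format(sd_population))
print("ddof=1: {:.4f}".format(sd_sample))
```

The pooled t-test is defined in terms of the sample (ddof=1) standard deviations.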


#### Step 2: Define the alternative hypothesis
#### This is the claim that the data would support if the null hypothesis is rejected.
#### alternative hypothesis -> H1 (or Ha): μ < μ0, which means that the new machine packs faster on average than the machine currently used
#### here again μ is the population mean packing time of the new machine (sample mean: 42.14 seconds) and μ0 is that of the old machine (sample mean: 43.23 seconds)
#### equivalently, the alternative hypothesis can be written as H1: μ - μ0 < 0

#### Step 3: Determine whether it is a one-tailed or a two-tailed test: here it is a one-tailed test.
#### In this case we are checking whether the new machine packs faster on average than the machine currently used,
#### that is, whether the new machine's mean time (42.14 seconds) is significantly smaller than the old machine's mean time (43.23 seconds),
#### based on the samples of 10 measurements each.
#### At first glance, the new machine's sample mean (42.14 seconds) is below the old machine's sample mean (43.23 seconds),
#### so the new machine is faster than the old one in this sample.
#### But is it significantly faster at a given significance level α?
#### Let's check.

#### Step 4: Decide on a test statistic based on the information available. Assuming the data are normally distributed, the number of observations is less than 30, and the population variance is unknown (we can only estimate it from the samples), we will use a t-test. This test is based on Student's t-distribution, which is very similar to the standard normal distribution except that it has heavier tails; the difference shrinks as the degrees of freedom grow.
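To see how the t-distribution relates to the standard normal, one can compare their upper 5% critical values for a few degrees of freedom. This is only an illustrative sketch; the chosen degrees of freedom (other than 18, which matches this lab) are arbitrary.

```python
import scipy.stats as st

alpha = 0.05

# Upper critical value of the standard normal distribution
z_crit = st.norm.ppf(1 - alpha)

# Upper critical values of the t-distribution for several degrees of freedom
t_crits = {df: st.t.ppf(1 - alpha, df=df) for df in (5, 18, 30, 1000)}

print("normal: {:.4f}".format(z_crit))
for df, value in t_crits.items():
    print("t, df={}: {:.4f}".format(df, value))
```

The t critical values are always larger than the normal one and approach it as the degrees of freedom increase, which is why small samples need stronger evidence to reject the null hypothesis.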

#### Step 5: Level of significance: this defines the rejection region (critical region);
#### it is the probability of rejecting the null hypothesis when it is actually true (a Type I error).
#### we will use α = 0.05
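The meaning of α can be checked with a small simulation: when both machines truly have the same mean, a one-sided test at α = 0.05 should reject in roughly 5% of repeated experiments. The sketch below uses simulated data only; the common mean (43.0), standard deviation (0.7), seed and number of repetitions are arbitrary assumptions, not values from the lab.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
alpha = 0.05
n_sim = 5000   # number of simulated experiments (arbitrary)
n = 10         # observations per machine, as in the lab

# Both samples come from the SAME distribution, so H0 is true by construction
sample_a = rng.normal(loc=43.0, scale=0.7, size=(n_sim, n))
sample_b = rng.normal(loc=43.0, scale=0.7, size=(n_sim, n))

# One pooled, left-tailed t-test per simulated experiment (one per row)
p_values = st.ttest_ind(sample_a, sample_b, axis=1, alternative='less').pvalue

# Share of experiments in which a true H0 is (wrongly) rejected
rejection_rate = np.mean(p_values < alpha)
print("Observed Type I error rate: {:.3f}".format(rejection_rate))
```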

#### Step 6: Calculate the test statistic based on the given information
#### we will use t-test
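For reference, the pooled statistic can be written out as follows. The subscripts 1 (new machine) and 2 (old machine) are labels introduced here; s_1 and s_2 are the sample standard deviations from `describe()` (0.683455 and 0.749889), and the test has n_1 + n_2 - 2 = 18 degrees of freedom.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
      = \frac{9(0.4671) + 9(0.5623)}{18} \approx 0.5147

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
  = \frac{42.14 - 43.23}{\sqrt{0.5147 \times 0.2}}
  \approx \frac{-1.09}{0.3208} \approx -3.40
```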

In [9]:
# the pooled standard deviation, built from the sample variances (ddof=1):
n_new, n_old = len(new_machine), len(old_machine)
SDpooled_numerator = (n_new - 1) * new_machine.var(ddof=1) + (n_old - 1) * old_machine.var(ddof=1)
SDpooled_no_root = SDpooled_numerator / (n_new + n_old - 2)
SDpooled = np.sqrt(SDpooled_no_root)

In [10]:
# the t statistic: difference in sample means divided by its pooled standard error
t = (np.mean(new_machine) - np.mean(old_machine)) / np.sqrt(SDpooled**2 / len(new_machine) + SDpooled**2 / len(old_machine))

In [11]:
print("The t statistic is: {:.2f}".format(t))

The t statistic is: -3.40


In [12]:
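# Critical value from the percent point function (inverse CDF) of the t-distribution with n1 + n2 - 2 = 18 degrees of freedom
# For this left-tailed test we reject H0 when t < -t_crit
t_crit = st.t.ppf(1 - 0.05, df=len(new_machine) + len(old_machine) - 2)
t_crit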

1.7340636066175354
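Because the alternative hypothesis is μ < μ0, the rejection region lies in the left tail, so the observed t is compared with the negative of the critical value printed above. A minimal sketch of that decision rule, with the t statistic typed in from this notebook's output:

```python
import scipy.stats as st

alpha = 0.05
df = 10 + 10 - 2          # n1 + n2 - 2 for the two samples of ten
t_stat = -3.3972          # t statistic from this notebook, rounded

# Left-tail critical value: 5% of the t-distribution lies below it
t_crit_left = st.t.ppf(alpha, df=df)

# Reject H0 when the statistic falls inside the left-tail rejection region
reject_h0 = t_stat < t_crit_left

print("critical value: {:.4f}".format(t_crit_left))
print("reject H0:", reject_h0)
```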

In [13]:
# or, more simply, let scipy do the work: equal_var=True (the default) gives the pooled test,
# and alternative='less' makes it a one-tailed (left-tailed) test
from scipy.stats import ttest_ind
ttest_ind(new_machine, old_machine, equal_var=True, alternative='less')

Ttest_indResult(statistic=-3.3972307061176026, pvalue=0.0016055712503872579)
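The p-value reported by `ttest_ind` for a left-tailed test is the area under the t-distribution (18 degrees of freedom) to the left of the observed statistic. A short sketch reproducing it from the statistic printed above:

```python
import scipy.stats as st

t_stat = -3.3972307061176026   # statistic reported by ttest_ind above
df = 18                        # n1 + n2 - 2

# Left-tail probability: P(T <= t_stat) under the null hypothesis
p_value = st.t.cdf(t_stat, df=df)

print("p-value: {:.6f}".format(p_value))
```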

In [14]:
# the p-value (about 0.0016) is smaller than α = 0.05
# So we reject the null hypothesis in favour of the alternative hypothesis
# H1 (or Ha): μ < μ0, i.e. the data provide sufficient evidence that the new machine packs faster on average than the machine currently used

### An additional problem (not mandatory): In this case we can't assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

In [15]:
# # Reading the file into python as dataframe
# student_gpa_df = pd.read_csv("student_gpa.txt", sep= '\t')
# student_gpa_df

In [16]:
# # Standardizing header-column names in the dataframe by using only lowercase letters and by replacing the spaces with underscores
# student_gpa_df.columns = [student_gpa_df.columns[i].lower().replace('  ','').replace(' ', '_') for i in range(len(student_gpa_df.columns))]
# list(student_gpa_df.columns)

In [17]:
# # the observations per sample in 2 separate variables (dropna removes the NaN padding of the shorter column):
# sophomores = student_gpa_df['sophomores'].dropna()
# juniors = student_gpa_df['juniors'].dropna()

In [18]:
# # n , mean and standard deviation of the first sample (Sophomores)
# print('n:', len(student_gpa_df['sophomores']),', mean:', np.mean(student_gpa_df['sophomores']), ', standard deviation: ',np.std(student_gpa_df['sophomores']))

In [19]:
# # n, mean and standard deviation of the second sample (Juniors)
# print('n:', student_gpa_df['juniors'].count(),', mean:', np.mean(student_gpa_df['juniors']), ', standard deviation: ',np.std(student_gpa_df['juniors']))
# # len() would report 17 here because the shorter juniors column is padded with NaN values; count() ignores them
# # the NaN values must be dropped before the t-test, otherwise ttest_ind returns nan
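When two samples of different sizes are stored as columns of one dataframe, the shorter column is padded with NaN. The sketch below shows the effect on a tiny made-up dataframe (the column names and values are hypothetical, not the contents of student_gpa.txt):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second column has only two real observations
demo_df = pd.DataFrame({
    "group_a": [3.0, 2.5, 3.6, 2.9],
    "group_b": [3.1, 3.2, np.nan, np.nan],
})

n_with_nan = len(demo_df["group_b"])      # counts every row, including NaN
n_valid = demo_df["group_b"].count()      # counts only non-missing values
group_b_clean = demo_df["group_b"].dropna()

print("len():", n_with_nan)
print("count():", n_valid)
print("after dropna():", len(group_b_clean))
```

Using `dropna()` (or `count()` for the sample size) keeps the padding from inflating n.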

In [20]:
# null hypothesis -> H0: μ = μ0, which means that the mean GPAs of sophomores and juniors at the university are equal
# here μ is the population mean GPA of sophomores (sample mean: 2.84) and μ0 is that of juniors (sample mean: 2.98)
# alternative hypothesis -> H1 (or Ha): μ ≠ μ0, which means that the mean GPAs of sophomores and juniors at the university differ
# It is a two-tailed test and, because the variances are not pooled, we will use Welch's t-test
# as level of significance we will use α = 0.05

In [21]:
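# # or, more simply: equal_var=False gives Welch's t-test (variances not pooled); dropna() removes the NaN padding
# from scipy.stats import ttest_ind
# ttest_ind(student_gpa_df['sophomores'].dropna(), student_gpa_df['juniors'].dropna(), equal_var=False)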
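When the variances are not pooled, the usual approach is Welch's t-test, which uses each sample's own variance and the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below computes it by hand and compares the result with `ttest_ind(..., equal_var=False)`. The two arrays are made-up values for illustration only, not the GPA data from the file.

```python
import numpy as np
import scipy.stats as st

# Hypothetical samples (NOT the student_gpa.txt data)
group_a = np.array([3.0, 2.5, 3.6, 2.9, 3.3, 2.1])
group_b = np.array([3.1, 3.2, 3.0, 3.3, 3.1])

# Squared standard error of each sample mean, from the sample variances (ddof=1)
se2_a = group_a.var(ddof=1) / len(group_a)
se2_b = group_b.var(ddof=1) / len(group_b)

# Welch t statistic: the variances are combined without pooling
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_welch = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
)

# Two-tailed p-value: area in both tails beyond |t|
p_manual = 2 * st.t.sf(abs(t_manual), df=df_welch)

# The same test through scipy
result = st.ttest_ind(group_a, group_b, equal_var=False)

print("manual: t = {:.4f}, p = {:.4f}".format(t_manual, p_manual))
print("scipy:  t = {:.4f}, p = {:.4f}".format(result.statistic, result.pvalue))
```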