In [1]:
import pandas as pd 
from scipy import stats as st


In [2]:
!ls files_for_lab

anova_lab_data.xlsx machine.txt         student_gpa.txt
lab_data.png        pokemon.csv


## One-tailed t-test
In a packing plant, a machine packs cartons with jars. It is hypothesized that a new machine packs faster on average than the machine currently in use. To test this, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are in the file `files_for_lab/machine.txt`.
Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

In [3]:
machine_df = pd.read_csv("files_for_lab/machine.txt", sep='\t', encoding='UTF16')
machine_df.head()

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5


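Claim under test: the new machine packs faster than the old machine, i.e. its mean packing time is lower. The test below is run with `alternative='greater'`, so in terms of mean packing time its hypotheses are:

H0: $\mu^{new} \le \mu^{old}$

H1: $\mu^{new} > \mu^{old}$

The rejection region is in the right tail. A direct test of the claim would instead put $\mu^{new} < \mu^{old}$ in H1 and use the left tail.

Reject the null hypothesis if the p-value is less than $\alpha$.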

In [4]:
# cleaning column names
machine_df.columns
machine_df = machine_df.rename(columns={'    Old machine':'old_machine', 'New machine':'new_machine'})
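The rename above depends on matching the raw header exactly, including its leading spaces. A minimal sketch of a more forgiving alternative normalizes every header in one pass. The frame `raw` is a hypothetical stand-in that only mimics the headers of the real file.

```python
import pandas as pd

# Hypothetical frame that mimics the raw headers seen above (not the full file)
raw = pd.DataFrame({"New machine": [42.1, 41.0], "    Old machine": [42.7, 43.6]})

cleaned = raw.copy()
# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores
cleaned.columns = (
    cleaned.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
print(list(cleaned.columns))
```

This avoids silent no-ops: `rename` ignores keys that do not match any column, so a miscounted space would leave the header unchanged without raising an error.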

In [5]:
sig_level = 0.05
print("Sample mean (new machine) is: ")
print(machine_df['new_machine'].mean())
print("new machine standard deviation is {}".format(machine_df['new_machine'].std()))
print("Population mean (old machine) is: ")
print(machine_df['old_machine'].mean())
print("old machine standard deviation is {}".format(machine_df['old_machine'].std()))


t, pval = st.ttest_1samp(machine_df['new_machine'], popmean=machine_df['old_machine'].mean(),
                         alternative='greater')
print("statistic is {}".format(t))
print("pvalue is {}".format(pval))


Sample mean (new machine) is: 
42.14
new machine standard deviation is 0.6834552736727638
Population mean (old machine) is: 
43.230000000000004
old machine standard deviation is 0.7498888806572157
statistic is -5.04331853503833
pvalue is 0.999651681196162
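As a cross-check, the statistic and both one-sided p-values can be reproduced from the summary statistics printed above. This is a sketch that assumes n = 10 observations per machine, as stated in the problem description. The variable names are illustrative.

```python
import math
from scipy import stats as st

# Summary statistics copied from the printed output above; n = 10 is taken from the problem text
n = 10
mean_new, sd_new = 42.14, 0.6834552736727638
mean_old = 43.230000000000004

# One-sample t statistic of the new machine's times against the old machine's mean
t_stat = (mean_new - mean_old) / (sd_new / math.sqrt(n))

p_greater = st.t.sf(t_stat, df=n - 1)   # right tail: P(T >= t)
p_less = st.t.cdf(t_stat, df=n - 1)     # left tail:  P(T <= t)

print("t = {:.4f}".format(t_stat))
print("right-tailed p = {:.6f}, left-tailed p = {:.6f}".format(p_greater, p_less))
```

The two one-sided p-values always sum to 1, so a right-tailed p-value close to 1 corresponds to a left-tailed p-value close to 0.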


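The p-value is 0.9997, which is greater than $\alpha$ (0.05), so the null hypothesis cannot be rejected: there is no evidence that the new machine's mean packing time is greater than the old machine's. Failing to reject is not by itself evidence that the new machine is faster. However, the t statistic is strongly negative (-5.04), and the p-value for the opposite tail (H1: $\mu^{new} < \mu^{old}$) is 1 - 0.9997 ≈ 0.0003, which is below $\alpha$. This supports the claim that the new machine is on average faster than the old machine.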
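The one-sample test treats the old machine's sample mean as a known population mean. Since both machines were only sampled, a two-sample test is arguably a better fit. The sketch below runs one from the printed summary statistics with `scipy.stats.ttest_ind_from_stats`. It assumes 10 observations per machine, and the choice of Welch's variant (`equal_var=False`) is an assumption made here, not something the lab prescribes.

```python
from scipy import stats as st

# Summary statistics copied from the printed output above; nobs = 10 is taken from the problem text
res = st.ttest_ind_from_stats(
    mean1=42.14, std1=0.6834552736727638, nobs1=10,               # new machine
    mean2=43.230000000000004, std2=0.7498888806572157, nobs2=10,  # old machine
    equal_var=False,     # Welch's t-test: no equal-variance assumption
    alternative="less",  # H1: mean time of new machine < mean time of old machine
)
print("t = {:.3f}, p = {:.4f}".format(res.statistic, res.pvalue))
```

With the full data loaded, `st.ttest_ind(machine_df['new_machine'], machine_df['old_machine'], equal_var=False, alternative='less')` would be the equivalent call on the raw columns.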

In [6]:
poke_df = pd.read_csv("files_for_lab/pokemon.csv")
poke_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


---

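H0: Mean Attack and mean Defense scores are equal (the mean paired difference is zero)

H1: Mean Attack and mean Defense scores are not equal

Two-tailed paired test: each Pokemon contributes both an Attack and a Defense score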

In [7]:
t, pval = st.ttest_rel(poke_df['Attack'], poke_df['Defense'], alternative='two-sided')
print("t_statistic is {:.2f}".format(t))
print("pvalue is {:.4f}".format(pval))

t_statistic is 4.33
pvalue is 0.0000


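Since the p-value is below 0.0001 and therefore less than $\alpha$ = 0.05, it falls in the rejection area, and the null hypothesis can be rejected. The two-sided p-value returned by `ttest_rel` already covers both tails, so it is compared with $\alpha$ directly rather than with $\alpha/2$.

Therefore, there is a significant difference between the mean Attack and mean Defense scores of Pokemon. The positive t statistic (4.33) indicates that Attack is higher than Defense on average.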
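A paired t-test is equivalent to a one-sample t-test on the per-row differences against zero. The sketch below illustrates this using only the five rows shown by `poke_df.head()`, so its numbers are not the results for the full dataset.

```python
from scipy import stats as st

# First five rows shown by poke_df.head() above; illustrative only, not the full dataset
attack = [49, 62, 82, 100, 52]
defense = [49, 63, 83, 123, 43]

# Per-Pokemon difference: Attack minus Defense
diffs = [a - d for a, d in zip(attack, defense)]

paired = st.ttest_rel(attack, defense)
one_sample = st.ttest_1samp(diffs, popmean=0)

print("paired:     t = {:.4f}, p = {:.4f}".format(paired.statistic, paired.pvalue))
print("one-sample: t = {:.4f}, p = {:.4f}".format(one_sample.statistic, one_sample.pvalue))
```

Both calls report the same statistic and p-value, which is why the sign of t reflects the direction of Attack minus Defense.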

--- 
# Inferential Statistics - ANOVA

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

In [11]:
import openpyxl
data = pd.read_excel("files_for_lab/anova_lab_data.xlsx")
data.head()

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


In [19]:
data.columns
data = data.rename(columns={'Power ':'power', 'Etching Rate':'etching_rate'})
data.head()

Unnamed: 0,power,etching_rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


In [33]:
# sort the distinct power levels so the group order is deterministic
powers = sorted(data['power'].unique())

sample1 = data[data['power'] == powers[0]]['etching_rate']
sample2 = data[data['power'] == powers[1]]['etching_rate']
sample3 = data[data['power'] == powers[2]]['etching_rate']

print("{} has a mean of {:.2f}".format(powers[0], sample1.mean()))
print("{} has a mean of {:.2f}".format(powers[1], sample2.mean()))
print("{} has a mean of {:.2f}".format(powers[2], sample3.mean()))

160 W has a mean of 5.79
180 W has a mean of 6.24
200 W has a mean of 8.32
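The per-group selection above can also be expressed with `groupby`, which scales to any number of power levels without indexing each one by hand. This sketch uses only the five rows shown by `data.head()`, so its means differ from the full-data means printed above.

```python
import pandas as pd

# First five rows shown by data.head() above; illustrative only, not the full dataset
df = pd.DataFrame({
    "power": ["160 W", "180 W", "200 W", "160 W", "180 W"],
    "etching_rate": [5.43, 6.24, 8.79, 5.71, 6.71],
})

# Mean etching rate per power level
group_means = df.groupby("power")["etching_rate"].mean()

# One Series of etching rates per power level, ready to pass to st.f_oneway(*samples)
samples = [rates for _, rates in df.groupby("power")["etching_rate"]]

print(group_means)
```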


Null hypothesis (H0): Changing the power of the plasma beam has no effect on the etching rate by the machine --> The samples for differing powers have the same mean etching rates

Alternate hypothesis (H1): Changing the power of the plasma beam has an effect on the etching rate by the machine. --> At least one power level has a mean etching rate that differs from the others

Significance level is 0.05

In [41]:
signif_level = 0.05
F_statistic, pvalue = st.f_oneway(sample1, sample2, sample3)
print("F statistic is {:.2f}".format(F_statistic))
print("P value is {}".format(pvalue))

F statistic is 36.88
P value is 7.506584272358903e-06
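To make clear what `f_oneway` computes, the F statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below works this out by hand on hypothetical measurements (not the lab data) and compares the result with scipy.

```python
import numpy as np
from scipy import stats as st

# Hypothetical measurements for three groups (NOT the lab data)
groups = [
    np.array([5.4, 5.7, 6.0, 5.9]),
    np.array([6.2, 6.7, 6.1, 6.4]),
    np.array([8.8, 8.1, 8.3, 8.0]),
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = st.f.sf(f_manual, dfn=k - 1, dfd=n - k)

f_scipy, p_scipy = st.f_oneway(*groups)
print("manual: F = {:.4f}, p = {:.3g}".format(f_manual, p_manual))
print("scipy:  F = {:.4f}, p = {:.3g}".format(f_scipy, p_scipy))
```

A large F means the group means are spread far apart relative to the scatter inside each group.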


In [38]:
# degrees of freedom. d1 = k - 1. d2 = n - k
d1 = data['power'].nunique() - 1
d2 = len(data) - data['power'].nunique()
d1, d2

(2, 12)

In [42]:
1 - st.f.cdf(F_statistic, dfn=d1, dfd=d2)
# 1 - cdf(F_statistic) is the same as the pvalue. 

7.5065842723986975e-06
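The F-table lookup can be done with the inverse CDF instead: the critical value is the point that leaves $\alpha$ in the right tail. This sketch reuses the degrees of freedom (2, 12) computed above and the F statistic as printed, rounded to two decimals. `st.f.sf` is the survival function, i.e. 1 - cdf, and is numerically more stable for very small tail probabilities.

```python
from scipy import stats as st

alpha = 0.05
dfn, dfd = 2, 12       # degrees of freedom computed above
f_observed = 36.88     # F statistic printed above, rounded to two decimals

# Critical value: the F value with alpha probability to its right
f_critical = st.f.ppf(1 - alpha, dfn=dfn, dfd=dfd)

# Right-tail probability of the observed statistic (same quantity as 1 - cdf)
p_upper = st.f.sf(f_observed, dfn=dfn, dfd=dfd)

print("critical F = {:.2f}".format(f_critical))
print("reject H0:", f_observed > f_critical)
```

Comparing the observed F with the critical value and comparing the p-value with $\alpha$ are two views of the same decision.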

In [44]:
# ANOVA's F-test is right-tailed: reject H0 when the p-value is below the significance level
if pvalue < signif_level:
    print("There is statistically significant difference between the samples.")
else:
    print("There is not sufficient evidence to suggest a statistically significant difference between the samples.")

There is statistically significant difference between the samples.
