

# Lab | Inferential statistics - T-test & P-value

## Part 1

1. *One tailed t-test* - In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently in use. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file `files_for_lab/ttest_machine.xlsx`.
   Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

2. *Matched Pairs Test* - In this challenge we will compare dependent samples of data describing our Pokemon (file `files_for_lab/pokemon.csv`). Our goal is to see whether there is a significant difference between each Pokemon's different stats. Our hypothesis is that the stats of legendary/non-legendary, generation 1/generation 2, and single-type/dual-type Pokemon are all the same.


# Inferential statistics - ANOVA

Note: The following lab is divided into two sections.

## Part 2.1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps needed to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps you need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table
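
To make the "test statistic" step concrete, here is a minimal sketch of how the one-way ANOVA F-statistic is built from between-group and within-group sums of squares. The function name `one_way_anova_f` and the toy groups are hypothetical illustrations, not part of the lab data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)                                  # number of groups
    n_total = sum(len(g) for g in groups)            # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

The resulting F value is then compared with the F table entry for (`df_between`, `df_within`) at the chosen significance level, or converted into a p-value.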
    


### Context

In this challenge, we will return to the Pokemon dataset (file `files_for_lab/pokemon.csv`). We want to understand whether the Total value differs significantly among the various Pokemon types, i.e. Grass vs Poison vs Fire vs Dragon... There are many types of Pokemon, which makes this a good use case for ANOVA.
First, let's obtain the unique values of the Pokemon types.
Second, we will create a list named `pokemon_totals` that contains the Total values for each unique type of Pokemon.
Third, we run the ANOVA test on `pokemon_totals`.
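
The three steps above can be sketched as follows. This is a minimal illustration that assumes only `Type 1` defines a Pokemon's type. `pokemon_demo` is a hypothetical miniature stand-in for the real file, built from a few Total values, so the printed numbers say nothing about the full dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical miniature stand-in for files_for_lab/pokemon.csv
pokemon_demo = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Grass", "Fire", "Fire", "Fire", "Water", "Water"],
    "Total": [318, 405, 525, 309, 405, 534, 314, 405],
})

# First: unique type values
unique_types = pokemon_demo["Type 1"].unique()

# Second: one array of Total values per type (groups may differ in size)
pokemon_totals = [
    pokemon_demo.loc[pokemon_demo["Type 1"] == t, "Total"].to_numpy()
    for t in unique_types
]

# Third: one-way ANOVA across the groups
f_stat, p_value = stats.f_oneway(*pokemon_totals)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Because each group is passed as its own array, groups of different sizes need no padding.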


- State the null hypothesis
- State the alternate hypothesis
- What is the significance level?
- What are the degrees of freedom of the model, the error terms, and the total?
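
For the degrees-of-freedom question, a small helper shows the standard one-way ANOVA formulas: with k groups and N observations in total, the model has k - 1 degrees of freedom, the error term N - k, and the total N - 1. The function name and the group sizes below are hypothetical.

```python
def anova_degrees_of_freedom(group_sizes):
    """Return (df_model, df_error, df_total) for a one-way ANOVA."""
    k = len(group_sizes)        # number of groups
    n_total = sum(group_sizes)  # total number of observations
    df_model = k - 1
    df_error = n_total - k
    df_total = n_total - 1      # always equals df_model + df_error
    return df_model, df_error, df_total


# Hypothetical group sizes: three groups with 3, 4 and 5 observations
df_model, df_error, df_total = anova_degrees_of_freedom([3, 4, 5])
print(df_model, df_error, df_total)  # 2 9 11
```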



## Part 2.2


- What conclusions can you draw from the experiment and why?
- Interpret the ANOVA test result. Is the difference significant?


PART 1

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
ttest_machine= pd.read_csv('files_for_lab/ttest_machine.txt')

In [11]:
ttest_machine

Unnamed: 0,New_machine,Old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [16]:
ttest_machine['New_machine'].mean()

42.14

In [19]:
ttest_machine[' Old_machine'].mean()

43.230000000000004

In [32]:
from scipy.stats import ttest_ind

# Null hypothesis (H0): the mean packing time of the new machine is greater than or equal to that of the old machine.

# Alternative hypothesis (H1): the mean packing time of the new machine is less than that of the old machine (the new machine packs faster).

# Significance level: 5%

# The column name ' Old_machine' carries a leading space in the source file
old_machine_times = ttest_machine[' Old_machine']
new_machine_times = ttest_machine['New_machine']

# Perform a one-tailed t-test for independent samples.
# "Faster" means a smaller packing time, so H1 is mean(new) < mean(old): alternative='less'
tstat, pvalue = ttest_ind(new_machine_times, old_machine_times, alternative='less')

# Print the results
print(f"T-statistic: {tstat:.4f}")
print(f"P-value: {pvalue:.4f}")

T-statistic: -3.3972
P-value: 0.0016


The p-value (about 0.0016) is below the 5% significance level, so we reject the null hypothesis. The data provide sufficient evidence that the new machine packs faster on average than the old one (mean 42.14 s versus 43.23 s).
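
As a cross-check on the test statistic, the pooled two-sample t-statistic can be computed by hand from the ten timings per machine listed above. This is a minimal sketch using only the standard library. Instead of a p-value it compares the statistic with the one-tailed critical value for 18 degrees of freedom at the 5% level (1.734, taken from a standard t table), with H1 stated as "the new machine's mean time is lower".

```python
from math import sqrt
from statistics import mean, variance

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

n1, n2 = len(new_machine), len(old_machine)
df = n1 + n2 - 2

# Pooled sample variance (equal-variance assumption, the ttest_ind default)
pooled_var = ((n1 - 1) * variance(new_machine)
              + (n2 - 1) * variance(old_machine)) / df
std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(new_machine) - mean(old_machine)) / std_err

T_CRITICAL = 1.734  # one-tailed, alpha = 0.05, 18 degrees of freedom
print(f"t = {t_stat:.4f}, df = {df}")
print("Reject H0 (new machine is faster):", t_stat < -T_CRITICAL)
```

A statistic further below zero than the negative critical value supports the claim that the new machine's mean packing time is lower.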

In [34]:
# MATCHED PAIRS TEST
pokemon= pd.read_csv('files_for_lab/pokemon.txt')


In [35]:
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


Our hypothesis is that the stats of legendary/non-legendary, generation 1/generation 2, and single-type/dual-type Pokemon are all the same.

In [44]:
# Legendary/non legendary

legendary_pokemon = pokemon[pokemon['Legendary'] == True]['Total']
nonlegendary_pokemon = pokemon[pokemon['Legendary'] == False]['Total']

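# Match the lengths of the arrays, because ttest_rel requires equal-length inputs.
# NOTE: truncation pairs up unrelated Pokemon by row position. The two groups are
# independent samples, so an independent-samples test (st.ttest_ind) would suit them better.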
min_length = min(len(legendary_pokemon), len(nonlegendary_pokemon))
legendary_pokemon = legendary_pokemon[:min_length]
nonlegendary_pokemon = nonlegendary_pokemon[:min_length]


# Perform a paired t-test
t_statistic, p_value = st.ttest_rel(legendary_pokemon, nonlegendary_pokemon)

# Print the results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

T-statistic: 14.921985191746089
P-value: 1.6288518887793974e-22
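
Legendary and non-legendary Pokemon are different individuals, so the two groups are independent rather than matched, and they do not need to have equal lengths. A minimal sketch of Welch's unequal-variance t-statistic for that situation follows. The helper `welch_t` and the two samples are hypothetical illustrations, not the lab data. In SciPy the same test is available as `ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample
    se_sq_a = variance(sample_a) / n_a
    se_sq_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical samples of different sizes
group_a = [600, 680, 700, 580]
group_b = [318, 405, 525, 309, 405, 534]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```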


In [45]:
# Generation

gen1_pokemon = pokemon[pokemon['Generation'] == 1]['Total']
gen2_pokemon = pokemon[pokemon['Generation'] == 2]['Total']

# Truncate both groups to the same length, because ttest_rel requires equal-length inputs
min_length = min(len(gen1_pokemon), len(gen2_pokemon))
gen1_pokemon = gen1_pokemon[:min_length]
gen2_pokemon = gen2_pokemon[:min_length]

# Perform a paired t-test
t_statistic, p_value = st.ttest_rel(gen1_pokemon, gen2_pokemon)

# Print the results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

T-statistic: -1.1044975280214377
P-value: 0.2719019944282383
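
A matched-pairs test compares two measurements taken on the same subject, for example each Pokemon's Attack against its own Defense. The sketch below computes the paired t-statistic by hand for the five Pokemon visible at the top of the table above. Five rows are far too few for a real conclusion, so treat it only as an illustration of the mechanics. `st.ttest_rel(attack, defense)` performs the same calculation.

```python
from math import sqrt
from statistics import mean, stdev

# Attack and Defense of the first five Pokemon shown in the table above
attack = [49, 62, 82, 100, 52]
defense = [49, 63, 83, 123, 43]

# A paired test works on the per-subject differences
differences = [a - d for a, d in zip(attack, defense)]
n = len(differences)
mean_diff = mean(differences)

# t = mean difference divided by its standard error, with n - 1 degrees of freedom
t_stat = mean_diff / (stdev(differences) / sqrt(n))
print(f"mean difference = {mean_diff}, t = {t_stat:.4f}, df = {n - 1}")
```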


Inferential statistics - ANOVA

In [99]:
unique_types =pokemon['Type 1'].unique()

In [100]:
unique_types

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [101]:
# Create a dictionary to store the sum of Total values for each type
type_totals = {}

# Iterate over unique types
for pokemon_type in unique_types:
    # Filter the DataFrame for the current type in either 'Type 1' or 'Type 2'
    type_mask = (pokemon['Type 1'] == pokemon_type) | (pokemon['Type 2'] == pokemon_type)
    
    # Sum the 'Total' values for the current type
    total_sum = pokemon.loc[type_mask, 'Total'].sum()
    
    # Store the result in the dictionary
    type_totals[pokemon_type] = total_sum

In [102]:
type_totals


{'Grass': 39703,
 'Fire': 29895,
 'Water': 54066,
 'Bug': 27326,
 'Normal': 41011,
 'Poison': 24657,
 'Electric': 22242,
 'Ground': 29552,
 'Fairy': 16637,
 'Fighting': 24916,
 'Psychic': 42938,
 'Rock': 26050,
 'Ghost': 20096,
 'Ice': 17763,
 'Dragon': 27088,
 'Dark': 23506,
 'Steel': 23843,
 'Flying': 45837}

In [75]:
type_totals_df = pd.DataFrame(list(type_totals.items()), columns=['Type', 'Total_Sum']).set_index('Type')

In [77]:
type_totals_df = type_totals_df.T


In [92]:
# Running position of each row within its 'Type 1' group, used as the pivot index
pokemon['type'] = pokemon.groupby('Type 1').cumcount()

type_totals_df = pokemon.pivot(index='type', columns='Type 1', values='Total')
type_totals_df.columns = ['Type'+str(x) for x in type_totals_df.columns.values]
type_totals_df

type,TypeBug,TypeDark,TypeDragon,TypeElectric,TypeFairy,TypeFighting,TypeFire,TypeFlying,TypeGhost,TypeGrass,TypeGround,TypeIce,TypeNormal,TypePoison,TypePsychic,TypeRock,TypeSteel,TypeWater
0,195.0,525.0,300.0,320.0,323.0,305.0,309.0,580.0,310.0,318.0,300.0,455.0,251.0,288.0,310.0,300.0,510.0,314.0
1,205.0,405.0,420.0,485.0,483.0,455.0,405.0,580.0,405.0,405.0,450.0,580.0,349.0,438.0,400.0,390.0,610.0,405.0
2,395.0,430.0,600.0,325.0,218.0,305.0,534.0,245.0,500.0,525.0,265.0,250.0,479.0,275.0,500.0,495.0,465.0,530.0
3,195.0,330.0,490.0,465.0,245.0,405.0,634.0,535.0,600.0,625.0,405.0,450.0,579.0,365.0,590.0,385.0,380.0,630.0
4,205.0,500.0,590.0,330.0,405.0,505.0,634.0,,435.0,320.0,320.0,330.0,253.0,505.0,328.0,355.0,480.0,320.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,,,,,,,,,,,,,,,,,,314.0
108,,,,,,,,,,,,,,,,,,405.0
109,,,,,,,,,,,,,,,,,,530.0
110,,,,,,,,,,,,,,,,,,330.0


In [93]:
# Pad the shorter columns with a numeric 0 so every type has the same number of rows.
# NOTE: f_oneway treats each filled 0 as a real observation, which distorts the group
# means and variances; passing each type's non-missing values (dropna) avoids this.
type_totals_df = type_totals_df.fillna(0)

In [96]:
type_totals_df

type,TypeBug,TypeDark,TypeDragon,TypeElectric,TypeFairy,TypeFighting,TypeFire,TypeFlying,TypeGhost,TypeGrass,TypeGround,TypeIce,TypeNormal,TypePoison,TypePsychic,TypeRock,TypeSteel,TypeWater
0,195.0,525.0,300.0,320.0,323.0,305.0,309.0,580.0,310.0,318.0,300.0,455.0,251.0,288.0,310.0,300.0,510.0,314.0
1,205.0,405.0,420.0,485.0,483.0,455.0,405.0,580.0,405.0,405.0,450.0,580.0,349.0,438.0,400.0,390.0,610.0,405.0
2,395.0,430.0,600.0,325.0,218.0,305.0,534.0,245.0,500.0,525.0,265.0,250.0,479.0,275.0,500.0,495.0,465.0,530.0
3,195.0,330.0,490.0,465.0,245.0,405.0,634.0,535.0,600.0,625.0,405.0,450.0,579.0,365.0,590.0,385.0,380.0,630.0
4,205.0,500.0,590.0,330.0,405.0,505.0,634.0,0.0,435.0,320.0,320.0,330.0,253.0,505.0,328.0,355.0,480.0,320.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,314.0
108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,405.0
109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,530.0
110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,330.0
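
Filling the missing cells with zeros turns every gap into an observed Total of 0, which can change the ANOVA result considerably. The toy comparison below, with hypothetical numbers and a hand-written `f_statistic` helper, computes the F-statistic once on two groups of unequal size and once after padding the shorter group with zeros.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical groups of unequal size with similar means
group_a = [300, 320, 310, 330]
group_b = [305, 315]

f_clean = f_statistic([group_a, group_b])             # groups as observed
f_padded = f_statistic([group_a, group_b + [0, 0]])   # shorter group padded with zeros

print(f"F without padding: {f_clean:.3f}")
print(f"F with zero padding: {f_padded:.3f}")
```

In this toy case the padded groups look far more different than the observed ones, purely because of the artificial zeros.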


In [112]:
for col in type_totals_df:
    print('type_totals_df.'+ col)

type_totals_df.TypeBug
type_totals_df.TypeDark
type_totals_df.TypeDragon
type_totals_df.TypeElectric
type_totals_df.TypeFairy
type_totals_df.TypeFighting
type_totals_df.TypeFire
type_totals_df.TypeFlying
type_totals_df.TypeGhost
type_totals_df.TypeGrass
type_totals_df.TypeGround
type_totals_df.TypeIce
type_totals_df.TypeNormal
type_totals_df.TypePoison
type_totals_df.TypePsychic
type_totals_df.TypeRock
type_totals_df.TypeSteel
type_totals_df.TypeWater


In [121]:
# H0: The mean Total value is the same for every Pokemon type.
# H1: At least one Pokemon type has a mean Total value that differs from the others.

# The significance level is 0.05

# f_oneway returns the F-statistic (kept under the existing name t_statistic) and the p-value
t_statistic, p_value= st.f_oneway(type_totals_df.TypeBug,
type_totals_df.TypeDark,
type_totals_df.TypeDragon,
type_totals_df.TypeElectric,
type_totals_df.TypeFairy,
type_totals_df.TypeFighting,
type_totals_df.TypeFire,
type_totals_df.TypeFlying,
type_totals_df.TypeGhost,
type_totals_df.TypeGrass,
type_totals_df.TypeGround,
type_totals_df.TypeIce,
type_totals_df.TypeNormal,
type_totals_df.TypePoison,
type_totals_df.TypePsychic,
type_totals_df.TypeRock,
type_totals_df.TypeSteel,
type_totals_df.TypeWater)

# Check if the null hypothesis is rejected
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the various types of Pokemon's total values.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the various types of Pokemon's total values.")


Reject the null hypothesis: There is a significant difference between the various types of Pokemon's total values.


In [119]:
t_statistic

28.674566584818066

In [120]:
p_value

2.254687684597403e-82

In [122]:
# N is the total number of observations (one Total value per Pokemon)
N = len(pokemon)

# Number of groups (types of Pokemon)
k = len(type_totals_df.columns)

# Calculate degrees of freedom for a one-way ANOVA
DoF_total = N - 1
DoF_between = k - 1   # model: 18 types give 17 degrees of freedom
DoF_within = N - k    # error: equals DoF_total - DoF_between

print("Degrees of Freedom (Total):", DoF_total)
print("Degrees of Freedom (Between):", DoF_between)
print("Degrees of Freedom (Within):", DoF_within)
