# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np


In [16]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
48,Oddish,Grass,Poison,45,50,55,75,65,30,1,False
...,...,...,...,...,...,...,...,...,...,...,...
783,Pumpkaboo Super Size,Ghost,Grass,59,66,70,44,55,41,6,False
784,Gourgeist Average Size,Ghost,Grass,65,90,122,58,75,84,6,False
785,Gourgeist Small Size,Ghost,Grass,55,85,122,58,75,99,6,False
786,Gourgeist Large Size,Ghost,Grass,75,95,122,58,75,69,6,False


In [49]:
# Create two separate dataframes (based on "Dragon" and "Grass")

df_dragon = df[(df["Type 1"] == 'Dragon') | (df["Type 2"] == 'Dragon')]

df_grass = df[(df["Type 1"] == 'Grass') | (df["Type 2"] == 'Grass')]

df_dragon_hp_mean = df_dragon['HP'].mean()
print(f"Average Hit Points of Dragon Pokémon : {round(df_dragon_hp_mean,1)}")

df_grass_hp_mean = df_grass['HP'].mean()
print(f"Average Hit Points of Grass Pokémon  : {round(df_grass_hp_mean,1)}")


Average Hit Points of Dragon Pokémon : 82.9
Average Hit Points of Grass Pokémon  : 66.1
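Because both filters use an OR across `Type 1` and `Type 2`, any Pokémon that lists both Dragon and Grass ends up in both samples, which weakens the independence assumption of a two-sample test. A minimal sketch of how the overlap could be counted, using a made-up four-row table (the rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the Pokémon table (made-up rows, not the real data).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 1": ["Dragon", "Grass", "Grass", "Fire"],
    "Type 2": ["Flying", "Poison", "Dragon", None],
    "HP": [90, 60, 70, 50],
})

# Same masks as in the cell above, kept as boolean Series so they can be combined.
is_dragon = (toy["Type 1"] == "Dragon") | (toy["Type 2"] == "Dragon")
is_grass = (toy["Type 1"] == "Grass") | (toy["Type 2"] == "Grass")

# Rows that satisfy both masks would be counted in both samples.
overlap = toy[is_dragon & is_grass]
print(f"Rows in both groups: {len(overlap)}")
print(overlap["Name"].tolist())
```

Running the same two masks against `df` would show whether the real data has such rows and whether they should be dropped from one group before testing.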


In [22]:
# Two independent samples taken from the same dataset (Dragon vs. Grass)
# Compare mean HP -> two-sample t-test (two-sided)

# H0: mean HP (Dragon) = mean HP (Grass)
# H1: mean HP (Dragon) != mean HP (Grass)

alpha = 0.05

In [47]:
# Run t-test

result_hp = st.ttest_ind(df_dragon['HP'], df_grass['HP'], equal_var=False, alternative="two-sided")

print(f"T-statistic: {result_hp.statistic}, P-value: {result_hp.pvalue}")

if result_hp.pvalue < alpha:
    print("Reject H0: There is significant evidence that Dragon Pokémon and Grass Pokémon have different average HP.")
else:
    print("Fail to reject H0: Not enough evidence to conclude that Dragon Pokémon and Grass Pokémon have different average HP.")

T-statistic: 4.097528915272702, P-value: 0.00010181538122353851
Reject H0: There is significant evidence that Dragon Pokémon and Grass Pokémon have different average HP.


#### Interpretation

- A t-statistic of approximately 4.10 means the Dragon sample mean lies about 4.1 standard errors above the Grass sample mean.
- The p-value (about 0.0001) is far below alpha = 0.05, so the difference in mean HP between Dragon and Grass Pokémon is statistically significant.
- Therefore, we reject the null hypothesis.
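With `equal_var=False`, `st.ttest_ind` runs Welch's t-test, whose statistic is the difference in sample means divided by the unpooled standard error. The sketch below recomputes that statistic by hand and compares it with SciPy. The samples are synthetic (the means, spreads, and sizes are assumptions chosen for illustration, not the Pokémon data).

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic samples standing in for the two HP columns (assumed values).
a = rng.normal(loc=83, scale=24, size=50)
b = rng.normal(loc=66, scale=19, size=95)

# Welch's t: difference in means divided by the unpooled standard error.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

# SciPy's Welch test should give the same statistic.
t_scipy = st.ttest_ind(a, b, equal_var=False).statistic
print(t_manual, t_scipy)
```

The statistic therefore grows with the gap between the means and shrinks with the variability and small size of either sample.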

In [51]:
# Two independent samples taken from the same dataset (Dragon vs. Grass)
# Compare mean HP -> two-sample t-test
# (right-sided / right-tailed t-test)

# H0: mean HP (Dragon) <= mean HP (Grass)
# H1: mean HP (Dragon) > mean HP (Grass)

alpha = 0.05

In [55]:
# Run t-test

result = st.ttest_ind(df_dragon['HP'], df_grass['HP'], equal_var=False, alternative='greater')

print(f"T-statistic: {result.statistic}, P-value: {result.pvalue}")

# Deciding based on p-value
if result.pvalue < alpha:
    print("Reject H0: There is significant evidence that Dragon Pokémon have greater HP on average than Grass Pokémon.")
else:
    print("Fail to reject H0: Not enough evidence to conclude Dragon Pokémon have greater HP on average than Grass Pokémon.")

T-statistic: 4.097528915272702, P-value: 5.0907690611769255e-05
Reject H0: There is significant evidence that Dragon Pokémon have greater HP on average than Grass Pokémon.
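The two tests above share the same t-statistic and differ only in how the p-value is read off the t-distribution: when the statistic is positive, the `alternative="greater"` p-value is half the two-sided one. A small sketch that checks this against the two p-values printed above, and then against synthetic samples (the synthetic values are assumptions, not the Pokémon data):

```python
import math

import numpy as np
import scipy.stats as st

# p-values printed by the two-sided and right-tailed cells above.
p_two_sided = 0.00010181538122353851
p_greater = 5.0907690611769255e-05
matches_report = math.isclose(p_two_sided / 2, p_greater, rel_tol=1e-9)
print(matches_report)

# Same relationship on synthetic samples (assumed values).
rng = np.random.default_rng(1)
a = rng.normal(loc=83, scale=24, size=50)
b = rng.normal(loc=66, scale=19, size=95)

two = st.ttest_ind(a, b, equal_var=False, alternative="two-sided")
one = st.ttest_ind(a, b, equal_var=False, alternative="greater")

# For a positive statistic the right tail is half the two-sided p-value;
# for a negative one it is the complement of that half.
expected = two.pvalue / 2 if two.statistic > 0 else 1 - two.pvalue / 2
print(one.pvalue, expected)
```

Since the original question is directional ("more HP"), the right-tailed version is the one that matches the hypothesis being posed.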


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [35]:
df_legendary = df[df['Legendary'] == True]
df_non_legendary = df[df['Legendary'] == False]

df_legendary.shape
df_non_legendary.shape

(735, 11)

In [None]:
# Two independent samples taken from the same dataset (Legendary vs. Non-Legendary)
# Two-sample, two-sided t-test, run once per stat

# H0: mean stat (Legendary) = mean stat (Non-Legendary)
# H1: mean stat (Legendary) != mean stat (Non-Legendary)

alpha = 0.05

In [87]:
# Run t-test

results_stats = st.ttest_ind(df_legendary[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']], 
             df_non_legendary[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']], 
             equal_var=False, alternative="two-sided")
print(results_stats)

# Compare each p-value with alpha first, then require that all comparisons hold
if (results_stats.pvalue < alpha).all():
    print("Reject H0: There is significant evidence that on average, Legendary Pokemons have different stats when comparing with Non-Legendary")
else:
    print("Fail to reject H0: Not enough evidence to conclude that Legendary Pokemons have different stats when comparing with Non-Legendary.")

TtestResult(statistic=array([ 8.98137048, 10.43813354,  7.63707816, 13.41744998, 10.01569661,
       11.47504445]), pvalue=array([1.00269117e-13, 2.52037245e-16, 4.82699849e-11, 1.55146141e-21,
       2.29493279e-15, 1.04901631e-18]), df=array([79.52467831, 75.88324448, 77.71095111, 74.24631138, 73.25892564,
       81.62110996]))
Reject H0: There is significant evidence that on average, Legendary Pokemons have different stats when comparing with Non-Legendary


#### Interpretation

- The t-statistics are all positive, so for every stat the Legendary sample mean is higher than the Non-Legendary sample mean.
- The t-statistics range from about 7.6 to 13.4, so each difference in means is large relative to its standard error.
- All six p-values are far below 0.05 (the largest is about 4.8e-11), so there is sufficient evidence to reject the null hypothesis for each stat.
- We conclude that the means of HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed all differ significantly between Legendary and Non-Legendary Pokémon.
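Because `ttest_ind` returns one p-value per column here, the order of operations in the decision matters: each p-value has to be compared with alpha before the results are combined with `.all()`. A minimal sketch using the six p-values printed above (the `stats` list simply repeats the column order passed to the test):

```python
import numpy as np

alpha = 0.05
# p-values copied from the TtestResult printed above, in column order.
pvalues = np.array([1.00269117e-13, 2.52037245e-16, 4.82699849e-11,
                    1.55146141e-21, 2.29493279e-15, 1.04901631e-18])
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Compare first, then reduce: True only if every p-value is below alpha.
all_significant = (pvalues < alpha).all()

# Reducing first collapses the array to a single boolean (True behaves like 1),
# and 1 < 0.05 is False regardless of how small the p-values are.
reduced_first = pvalues.all() < alpha

for name, p in zip(stats, pvalues):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name:8s} p = {p:.2e} -> {decision}")
print(all_significant, reduced_first)
```

Printing the decision per stat also makes it easier to report which stats differ if only some of them were to pass the threshold.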

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [18]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
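As a worked example of the distance rule, the sketch below applies it to the first row of the table above (longitude -114.31, latitude 34.19). The distances are plain Euclidean distances in degrees of longitude/latitude, as the exercise specifies, not kilometres. The variable names are illustrative.

```python
import numpy as np

school = np.array([-118.0, 37.0])
hospital = np.array([-122.0, 34.0])

# First row of the housing table above: (longitude, latitude).
house = np.array([-114.31, 34.19])

# Euclidean distance: square root of the summed squared coordinate differences.
dist_school = np.linalg.norm(house - school)
dist_hospital = np.linalg.norm(house - hospital)

# "Close" means within 0.50 of either location.
is_close = (dist_school < 0.5) or (dist_hospital < 0.5)

print(f"school: {dist_school:.2f}, hospital: {dist_hospital:.2f}, close: {is_close}")
```

This neighborhood is roughly 4.64 from the school and 7.69 from the hospital, so it falls in the "far" group.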
 

In [93]:
school_coordinates = np.array((-118, 37))
hospital_coordinates = np.array((-122, 34))

def euclid_distance(df, school_coordinates, hospital_coordinates):
    df['dist_school'] = np.linalg.norm(df[['longitude', 'latitude']].values - school_coordinates, axis=1)
    df['dist_hospital'] = np.linalg.norm(df[['longitude', 'latitude']].values - hospital_coordinates, axis=1)

    # Default distance to 'far'
    df['distance'] = 'far'
    
    # Update distance to 'close' based on proximity to either school or hospital
    df.loc[df['dist_school'] < 0.5, 'distance'] = 'close'
    df.loc[df['dist_hospital'] < 0.5, 'distance'] = 'close'
    
    return df

In [91]:
df_distance = euclid_distance(df, school_coordinates, hospital_coordinates)

df_distance_far = df_distance[df_distance['distance'] == 'far']
df_distance_close = df_distance[df_distance['distance'] == 'close']

df_distance_far.shape # (16995, 12)
df_distance_close.shape # (5, 12)

(433, 12)

# H0: median_house_value_close <= median_house_value_far
# H1: median_house_value_close > median_house_value_far

alpha = 0.05

In [105]:
# H1 is one-sided (close > far), so the test uses alternative="greater"
result_distance = st.ttest_ind(df_distance_close['median_house_value'], df_distance_far['median_house_value'], equal_var=False, alternative="greater")

print(f"T-statistic: {result_distance.statistic}, P-value: {result_distance.pvalue}")

if result_distance.pvalue < alpha:
    print("Reject H0: There is significant evidence that house prices close to schools and hospitals are higher than house prices in other areas.")
else:
    print("Fail to reject H0: Not enough evidence to conclude that house prices close to schools and hospitals are higher than house prices in other areas.")

T-statistic: 67.83481332243952, P-value: 0.0
Reject H0: There is significant evidence that house prices close to schools and hospitals are higher than house prices in other areas.


- The positive t-statistic of 67.83 supports the hypothesis that houses close to a school or hospital are, on average, more expensive than houses in other areas.
- The magnitude of the t-statistic shows that the difference in means is very large relative to its standard error.
- The p-value is effectively zero, so at the 5% significance level we reject the null hypothesis: the difference in average house value between the two groups is statistically significant.