# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokemons of type Dragon have, on average, more HP stats than Grass. Choose the propper test and, with 5% significance, comment your findings.

In [33]:
#Two sample t-test
#H0: HP_Grass >= HP_Drag
#H1: HP_Grass < HP_Drag

#Filter out pokemon that satisfy both conditions
Grass_Drag_Pokemon = df.loc[
    ((df['Type 1'] == 'Grass') & (df['Type 2'] == 'Dragon')) |
    ((df['Type 1'] == 'Dragon') & (df['Type 2'] == 'Grass'))
]

df_minus_grass_drag = df[df['Name'] != "Mega Sceptile"] #Remove pokemon that are both dragon and grass

#Create two groups
HP_Grass = df_minus_grass_drag.loc[(df_minus_grass_drag['Type 1'] == 'Grass') | (df_minus_grass_drag['Type 2'] == 'Grass'), 'HP']
Drag_Grass = df_minus_grass_drag.loc[(df_minus_grass_drag['Type 1'] == 'Dragon')| (df_minus_grass_drag['Type 2'] == 'Dragon'), 'HP']

#Manual method 
meanG, meanD = np.mean(HP_Grass), np.mean(Drag_Grass)
stdG, stdD = np.std(HP_Grass, ddof=1), np.std(Drag_Grass, ddof=1)  
n1, n2 = len(HP_Grass), len(Drag_Grass)

t_stat_manual = (meanG - meanD) / np.sqrt((stdG**2 / n1) + (stdD**2 / n2))

print(f"Manual T-statistic from manual: {t_stat_manual:.4f}")

#formula method
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(HP_Grass, Drag_Grass, equal_var=False)  

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

print("We reject the null hypothesis and see that the HP of dragon pokemon is significantly higher than Grass")

Manual T-statistic from manual: -4.1049
T-statistic: -4.1049
P-value: 0.0001
We reject the null hypothesis and see that the HP of dragon pokemon is significantly higher than Grass


- We posit that Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) when comparing with Non-Legendary. Choose the propper test and, with 5% significance, comment your findings.


In [43]:
#H0: Stats_Non_legend = Stats_Legend
#H1: Stats_Non_legend != Stats_Legend

#Two sample t-test

from scipy.stats import ttest_ind

# Select only numeric columns (excluding categorical columns like 'Type 1', 'Type 2', etc.)
numerical_columns = df.select_dtypes(include=['number']).columns

numerical_columns = numerical_columns.drop('Generation')

numerical_columns

Index(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'], dtype='object')

In [47]:
# Separate Legendary and Non-Legendary Pokémon
Non_legend = df[df['Legendary'] == False]
Legend = df[df['Legendary'] == True]

# Perform t-tests for all numerical columns
results = {}
for col in numerical_columns:
    t_stat, p_value = ttest_ind(Non_legend[col], Legend[col], equal_var=False) 
    results[col] = {'T-statistic': t_stat, 'P-value': p_value}

# Convert results to a DataFrame 
results_df = pd.DataFrame(results).T 

display(results_df)
print("We see that for all stats columns, there is a significant difference in values between legendary and non-legendary")

Unnamed: 0,T-statistic,P-value
HP,-8.98137,1.002691e-13
Attack,-10.438134,2.520372e-16
Defense,-7.637078,4.826998e-11
Sp. Atk,-13.41745,1.551461e-21
Sp. Def,-10.015697,2.294933e-15
Speed,-11.475044,1.049016e-18


We see that for all stats columns, there is a significant difference in values between legendary and non-legendary


**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [49]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the propper test and, with 5% significance, comment your findings.
 

In [57]:
school_loc = {'longitude': -122, 'latitude': 37}
hospital_loc = {'longitude': -118, 'latitude': 34}

df['dist_to_hospital'] = np.sqrt(
    (df['longitude'] - hospital_loc['longitude'])**2 +
    (df['latitude'] - hospital_loc['latitude'])**2
)

df['dist_to_school'] = np.sqrt(
    (df['longitude'] - school_loc['longitude'])**2 +
    (df['latitude'] - school_loc['latitude'])**2
)

df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,dist_to_hospital,dist_to_school
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,3.694888,8.187319
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,3.552591,7.966235
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,3.453940,8.143077
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,3.448840,8.154416
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,3.456848,8.183508
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,9.082070,4.233675
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,9.168915,4.332320
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,10.057614,5.358694
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,10.026465,5.322593


In [59]:
df_close = df.loc[
    (df['dist_to_school'] <= 0.5 ) |
    (df['dist_to_hospital'] <= 0.5)
]

df_far = df.loc[
    (df['dist_to_school'] > 0.5 ) |
    (df['dist_to_hospital'] > 0.5)
]

In [67]:
#using a two sample t-test, larger
#H0: mean_house_value_close <= mean_house_value_far
#H1: mean_house_value_close > mean_house_value_far

t_stat, p_value = ttest_ind(df_close['median_house_value'], df_far['median_house_value'], equal_var=False, alternative='greater')  

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

print("We reject the null hypothesis, the mean value of houses close to schools or hospitals is higher")

T-statistic: 24.4756
P-value: 0.0000
We reject the null hypothesis, the mean value of houses close to schools or hospitals is higher
