# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [7]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokemons of type Dragon have, on average, more HP stats than Grass. Choose the propper test and, with 5% significance, comment your findings.

In [8]:
'''
I posit that the mean HP of pokemon with Dragon as either their type one or type two will be higher than the mean of pokemon with Grass as one of their types.

Set the hypothesis
H0: mu grass_pkn >= mu dragon_pkm
H1: mu grass_pkn < mu dragon_pkm

'''
alpha = 0.05

#calculating averages to compare
dragon_pkm = df[(df["Type 1"]=="Dragon") | (df["Type 2"]=="Dragon")]["HP"]
grass_pkm = df[(df["Type 1"]=="Grass") | (df["Type 2"]=="Grass")]["HP"]
dragon_pkm_mean = dragon_pkm.mean()
grass_pkm_mean = grass_pkm.mean()

shapiro_dragon = st.shapiro(dragon_pkm)
shapiro_grass = st.shapiro(grass_pkm)

print("Shapiro-Wilk Test Results:")
print(f"Dragon Pokémon HP normality p-value: {shapiro_dragon.pvalue}")
print(f"Grass Pokémon HP normality p-value: {shapiro_grass.pvalue}")

# Perform the t-test
# Use independent t-test assuming unequal variance (Welch's t-test)
t_stat, p_value = st.ttest_ind(dragon_pkm, grass_pkm, alternative='greater', equal_var=False)

print("\nT-Test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Decision
if p_value < alpha:
    print("\nReject the null hypothesis (H0).")
    print("Conclusion: The mean HP of Dragon-type Pokémon is significantly higher than that of Grass-type Pokémon.")
else:
    print("\nFail to reject the null hypothesis (H0).")
    print("Conclusion: There is not enough evidence to support that the mean HP of Dragon-type Pokémon is higher than that of Grass-type Pokémon.")


Shapiro-Wilk Test Results:
Dragon Pokémon HP normality p-value: 0.1185675564327498
Grass Pokémon HP normality p-value: 0.04023071193635741

T-Test Results:
T-statistic: 4.097528915272702
P-value: 5.0907690611769255e-05

Reject the null hypothesis (H0).
Conclusion: The mean HP of Dragon-type Pokémon is significantly higher than that of Grass-type Pokémon.


- We posit that Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) when comparing with Non-Legendary. Choose the propper test and, with 5% significance, comment your findings.


In [15]:
'''
H0​: The mean stat for Legendary Pokémon is equal to Non-Legendary Pokémon.
H1H1​: The mean stat for Legendary Pokémon is different from Non-Legendary Pokémon.

Significance Level: 0.05

'''
alpha = 0.05

import scipy.stats as stats

# Grouping stats by Legendary and Non-Legendary Pokémon
legendary_stats = df[df["Legendary"] == True][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]
non_legendary_stats = df[df["Legendary"] == False][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]

# Function to perform t-test
def compare_stats_ttest(legendary, non_legendary, alpha=0.05):
    results = {}
    for stat in legendary.columns:
        # Normality Test
        p_leg = stats.shapiro(legendary[stat]).pvalue
        p_non_leg = stats.shapiro(non_legendary[stat]).pvalue

        # Variance Test
        p_var = stats.levene(legendary[stat], non_legendary[stat]).pvalue

        # Perform t-test (Welch's if variances differ)
        t_stat, p_value = stats.ttest_ind(
            legendary[stat], non_legendary[stat], 
            equal_var=p_var > 0.05
        )
        
        # Store results
        results[stat] = {
            "normality_legendary_p": p_leg,
            "normality_non_legendary_p": p_non_leg,
            "variance_p": p_var,
            "t_statistic": t_stat,
            "p_value": p_value,
            "significant": p_value < alpha
        }
    return results

# Run the comparison
results = compare_stats_ttest(legendary_stats, non_legendary_stats)

# Print results
for stat, result in results.items():
    print(f"Stat: {stat}")
    print(f"  Normality p-value (Legendary): {result['normality_legendary_p']:.3f}")
    print(f"  Normality p-value (Non-Legendary): {result['normality_non_legendary_p']:.3f}")
    print(f"  Variance p-value (Levene's test): {result['variance_p']:.3f}")
    print(f"  T-Test Statistic: {result['t_statistic']:.3f}")
    print(f"  P-Value: {result['p_value']:.3f}")
    if result['significant']:
        print(f"  Conclusion: Significant difference at 5% level")
    else:
        print(f"  Conclusion: No significant difference at 5% level")
    print()


Stat: HP
  Normality p-value (Legendary): 0.009
  Normality p-value (Non-Legendary): 0.000
  Variance p-value (Levene's test): 0.576
  T-Test Statistic: 8.036
  P-Value: 0.000
  Conclusion: Significant difference at 5% level

Stat: Attack
  Normality p-value (Legendary): 0.038
  Normality p-value (Non-Legendary): 0.000
  Variance p-value (Levene's test): 0.994
  T-Test Statistic: 10.397
  P-Value: 0.000
  Conclusion: Significant difference at 5% level

Stat: Defense
  Normality p-value (Legendary): 0.004
  Normality p-value (Non-Legendary): 0.000
  Variance p-value (Levene's test): 0.320
  T-Test Statistic: 7.181
  P-Value: 0.000
  Conclusion: Significant difference at 5% level

Stat: Sp. Atk
  Normality p-value (Legendary): 0.624
  Normality p-value (Non-Legendary): 0.000
  Variance p-value (Levene's test): 0.370
  T-Test Statistic: 14.191
  P-Value: 0.000
  Conclusion: Significant difference at 5% level

Stat: Sp. Def
  Normality p-value (Legendary): 0.008
  Normality p-value (Non-Le

**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [16]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the propper test and, with 5% significance, comment your findings.
 

In [None]:
import math

def distance_hospital(long, lat):
    distance_hospital = math.sqrt((long + 122)**2 + (lat - 34)**2)
    return distance_hospital

def distance_school(long, lat):
    distance_school = math.sqrt((long + 118)**2 + (lat - 37)**2)
    return distance_school

df['distance_to_hospital'] = df.apply(lambda row: distance_hospital(row['longitude'], row['latitude']), axis=1)
df['distance_to_school'] = df.apply(lambda row: distance_school(row['longitude'], row['latitude']), axis=1)

df_close = df[(df['distance_to_hospital'] < 0.5) | (df['distance_to_school'] < 0.5)]["median_house_value"]
df_far = df[(df['distance_to_hospital'] >= 0.5) | (df['distance_to_school'] >= 0.5)]["median_house_value"]

4523     74200.0
5596    108300.0
5597    104700.0
6776     94500.0
6904     80600.0
Name: median_house_value, dtype: float64

In [58]:
'''
We posit that the mean value of houses close to a hospital or school will be higher than the mean value of houses far from a hospital or school

Set the hypothesis
H0: mu df_far >= mu df_close
H1: mu df_far < mu df_close

'''
alpha = 0.05

shapiro_close = st.shapiro(df_close)
shapiro_far = st.shapiro(df_far)

print("Shapiro-Wilk Test Results:")
print(f"Houses close normality p-value: {shapiro_close.pvalue}")
print(f"Houses far normality p-value: {shapiro_far.pvalue}")

# Perform the t-test
# Use independent t-test assuming unequal variance (Welch's t-test)
t_stat, p_value = st.ttest_ind(df_close, df_far, alternative='greater', equal_var=False)

print("\nT-Test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Decision
if p_value < alpha:
    print("\nReject the null hypothesis (H0).")
    print("Conclusion: The mean value of houses close to hospitals or schools is significantly higher than those far away.")
else:
    print("\nFail to reject the null hypothesis (H0).")
    print("Conclusion: There is not enough evidence to support that houses closer to a school or hospital have a higher value.")


Shapiro-Wilk Test Results:
Houses close normality p-value: 0.5485392407287775
Houses far normality p-value: 1.3435407157576457e-70

T-Test Results:
T-statistic: -17.169161722319306
P-value: 0.9999738667251178

Fail to reject the null hypothesis (H0).
Conclusion: There is not enough evidence to support that houses closer to a school or hospital have a higher value.


  res = hypotest_fun_out(*samples, **kwds)
