# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret the results.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), and comparing dependent samples (Paired Sample T-Test). We also look at Analysis of Variance (ANOVA), which compares means across more than two groups.

So grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels.api as sm
import seaborn as sns

from statsmodels.multivariate.manova import MANOVA
from scipy.stats import pearsonr
from scipy import stats
from scipy.stats import boxcox

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
df['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [4]:
def show_statistical_test(statistic: float, alpha: float, n: int, distribution: str="t-student", alternative: str="two-sided"):
    # Plot the null distribution, shade the rejection region, and mark the observed statistic.
    # distribution is "t-student" or "normal"; alternative is "two-sided", "lower", or "greater".

    if distribution not in ["t-student","normal"]:
        raise ValueError("Sorry, only 't-student' and 'normal' distributions are accepted")

    if alternative not in ["two-sided","lower","greater"]:
        raise ValueError("Sorry, only 'two-sided', 'lower', and 'greater' are accepted values for the alternative")

    if not isinstance(statistic, float):
        raise TypeError("Sorry, the data type for the statistic must be float")

    if not isinstance(alpha, float):
        raise TypeError("Sorry, the data type for alpha must be float")

    if not isinstance(n, int):
        raise TypeError("Sorry, the data type for n must be int")

    x_values = np.linspace(-3, 3)

    if distribution == "t-student":

        y_values = st.t.pdf(x_values, df=n-1)

        if alternative == "two-sided": # Computing the critical values

            lower_critical_value = st.t.ppf(alpha/2, df=n-1)
            upper_critical_value = st.t.ppf(1-(alpha/2), df=n-1)

            x_values1 = np.linspace(-3, lower_critical_value)
            y_values1 = st.t.pdf(x_values1, df=n-1)

            x_values2 = np.linspace(upper_critical_value, 3)
            y_values2 = st.t.pdf(x_values2, df=n-1)

        elif alternative == "lower":

            critical_value = st.t.ppf(alpha, df=n-1)

            x_values1 = np.linspace(-3, critical_value)
            y_values1 = st.t.pdf(x_values1, df=n-1)

        elif alternative == "greater":

            critical_value = st.t.ppf(1-alpha, df=n-1)

            x_values2 = np.linspace(critical_value, 3)
            y_values2 = st.t.pdf(x_values2, df=n-1)

    elif distribution == "normal":

        y_values = st.norm.pdf(x_values)

        if alternative == "two-sided": # Computing the critical values

            lower_critical_value = st.norm.ppf(alpha/2)
            upper_critical_value = st.norm.ppf(1-(alpha/2))

            x_values1 = np.linspace(-3, lower_critical_value)
            y_values1 = st.norm.pdf(x_values1)

            x_values2 = np.linspace(upper_critical_value, 3)
            y_values2 = st.norm.pdf(x_values2)

        elif alternative == "lower":

            critical_value = st.norm.ppf(alpha)

            x_values1 = np.linspace(-3, critical_value)
            y_values1 = st.norm.pdf(x_values1)

        elif alternative == "greater":

            critical_value = st.norm.ppf(1-alpha)

            x_values2 = np.linspace(critical_value, 3)
            y_values2 = st.norm.pdf(x_values2)

    df = pd.DataFrame({"x": x_values, "pdf": y_values})

    title = f"{distribution} Probability Density Function"

    fig = px.line(df, x="x", y="pdf", title=title)

    if alternative == "two-sided":

        fig.add_vline(x=lower_critical_value, line_color="red")
        fig.add_vline(x=upper_critical_value, line_color="red")

        fig.add_annotation(x=lower_critical_value,y=0,text=f"Lower critical value {lower_critical_value: .2f}",xref="x",yref="paper",yanchor="bottom")
        fig.add_annotation(x=upper_critical_value,y=0,text=f"Upper critical value {upper_critical_value: .2f}",xref="x",yref="paper",yanchor="bottom")

        fig.add_scatter(x=x_values1, y=y_values1,fill='tozeroy', mode='none' , fillcolor='red')
        fig.add_scatter(x=x_values2, y=y_values2,fill='tozeroy', mode='none' , fillcolor='red')

    elif alternative == "lower":

        fig.add_vline(x=critical_value, line_color="red")
        fig.add_annotation(x=critical_value,y=0,text=f"Critical value {critical_value: .2f}",xref="x",yref="paper",yanchor="bottom")

        fig.add_scatter(x=x_values1, y=y_values1,fill='tozeroy', mode='none' , fillcolor='red')

    elif alternative == "greater":

        fig.add_vline(x=critical_value, line_color="red")
        fig.add_annotation(x=critical_value,y=0,text=f"Critical value {critical_value: .2f}",xref="x",yref="paper",yanchor="bottom")

        fig.add_scatter(x=x_values2, y=y_values2,fill='tozeroy', mode='none' , fillcolor='red')

    fig.add_vline(x=statistic)
    fig.add_annotation(x=statistic,y=0,text=f"Statistic {statistic: .2f}",xref="x",yref="paper",yanchor="bottom")

    fig.update_layout(title_text=f'{distribution} Probability Density Function', title_x=0.5)

    fig.update_layout(showlegend=False)

    fig.show()

- We posit that Dragon-type Pokemon have, on average, more HP than Grass-type Pokemon. Choose the proper test and, with 5% significance, comment on your findings.

In [5]:
#code here

# Null hypothesis (H₀): μ_Dragon ≤ μ_Grass (mean HP of Dragon-type is less than or equal to that of Grass-type.)
# Alternative hypothesis (H₁): μ_Dragon > μ_Grass (mean HP of Dragon-type is greater than that of Grass-type.)

alpha = 0.05 # 5% significance level

dragon_hp = df[df["Type 1"]=='Dragon']["HP"]
grass_hp = df[df["Type 1"]=='Grass']["HP"]

dragon_hp_mean = dragon_hp.mean()
grass_hp_mean = grass_hp.mean()

dragon_hp_std = dragon_hp.std(ddof=1)
grass_hp_std = grass_hp.std(ddof=1)

n1 = len(dragon_hp)
n2 = len(grass_hp)

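# Welch's t statistic: difference in means divided by the unpooled standard error
statistic = (dragon_hp_mean - grass_hp_mean) / np.sqrt( (dragon_hp_std**2/n1) + (grass_hp_std**2/n2))
print(statistic)
print()
# H₁ is one-sided (Dragon > Grass), so shade only the upper rejection region
show_statistical_test(statistic, alpha, n1+n2-2, distribution="t-student", alternative="greater")
print()
# Two-sided Welch test from scipy; the one-sided p-value is derived in the next cell
st.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative="two-sided")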

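3.3349632905124063

[Figure: t-student Probability Density Function, with the rejection region shaded in red and the test statistic marked]
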
TtestResult(statistic=np.float64(3.3349632905124063), pvalue=np.float64(0.0015987219490841197), df=np.float64(50.83784116232685))

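Comment: Welch's t-test gives t ≈ 3.33 with a two-sided p-value of about 0.0016. Because the statistic is positive, the one-sided p-value for H₁: μ_Dragon > μ_Grass is half of that, about 0.0008. Since this is below 0.05, we reject the null hypothesis: there is statistically significant evidence, at the 5% level, that Dragon-type Pokemon have higher mean HP than Grass-type Pokemon.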
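
To make the arithmetic behind the statistic above more visible, here is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The two samples are made-up numbers, not the Pokemon data, and `welch_t` is a hypothetical helper name introduced for illustration.

```python
import math
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    # variance() uses the n-1 denominator, i.e. std(ddof=1) squared
    se1 = variance(sample1) / n1
    se2 = variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical HP-like samples, for illustration only
group_a = [80, 95, 70, 110, 85, 90]
group_b = [60, 65, 70, 55, 75, 62, 68]

t, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The degrees of freedom from this approximation always fall between the smaller of `n1 - 1` and `n2 - 1` and the pooled value `n1 + n2 - 2`, which is why scipy reports a non-integer `df` for `equal_var=False`.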

In [6]:
# Perform independent t-test (one-sided, Dragon > Grass)
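# scipy's ttest_ind is two-sided by default. For a one-sided p-value, halve the two-sided p-value when the
# statistic points in the direction of H₁ (passing alternative="greater" gives the same result directly).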
t_stat, p_value_two_sided = st.ttest_ind(dragon_hp, grass_hp, equal_var=False)

# Adjust for one-sided test
if t_stat > 0:
    p_value_one_sided = p_value_two_sided / 2
else:
    p_value_one_sided = 1 - (p_value_two_sided / 2)

print(f"T-statistic: {t_stat:.4f}")
print(f"Two-sided p-value: {p_value_two_sided:.4f}")
print(f"One-sided p-value: {p_value_one_sided:.4f}")

T-statistic: 3.3350
Two-sided p-value: 0.0016
One-sided p-value: 0.0008
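
The adjustment in the cell above can be wrapped in a small helper. This is a sketch, assuming a test statistic with a symmetric null distribution (as with the t-test); `one_sided_p` is a hypothetical name, and the inputs 3.3350 and 0.0016 are the rounded values printed above.

```python
def one_sided_p(t_stat, p_two_sided, alternative="greater"):
    """Convert a two-sided p-value into a one-sided one.

    Assumes the null distribution of the statistic is symmetric around zero.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = p_two_sided / 2
    # Does the observed statistic point in the direction claimed by H1?
    if alternative == "greater":
        in_direction = t_stat > 0
    else:
        in_direction = t_stat < 0
    return half if in_direction else 1 - half


p_greater = one_sided_p(3.3350, 0.0016, alternative="greater")
p_less = one_sided_p(3.3350, 0.0016, alternative="less")
print(f"H1 greater: {p_greater:.4f}")
print(f"H1 less:    {p_less:.4f}")
```

When the statistic points away from H₁, the one-sided p-value is close to 1, so the sign check matters as much as the division by 2.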


- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokemon. Choose the proper test and, with 5% significance, comment on your findings.


In [14]:
#code here

# Null hypothesis (H₀): μ_Legendary = μ_Non-Legendary (tested separately for each stat)
# Alternative hypothesis (H₁): μ_Legendary != μ_Non-Legendary

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")

# alpha = 0.05 # 5% significance level

def compare_stats_ttest(df1, df2, cols=["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"], alpha=0.05):
    results = []
    for col in cols:
        # Drop missing values once and reuse the cleaned series below
        group1 = df1[col].dropna()
        group2 = df2[col].dropna()

        group1_mean = group1.mean()
        group2_mean = group2.mean()

        group1_std = group1.std(ddof=1)
        group2_std = group2.std(ddof=1)

        n1 = len(group1)
        n2 = len(group2)

        # Manual Welch statistic, printed as a cross-check against scipy
        stat = (group1_mean - group2_mean) / np.sqrt( (group1_std**2/n1) + (group2_std**2/n2))
        print(f"Stat {col} {stat}")
        #show_statistical_test(stat, alpha, n1+n2-2, distribution="t-student", alternative="two-sided")

        # Two-sided Welch t-test (unequal variances)
        t_stat, p_val = st.ttest_ind(group1, group2, equal_var=False, alternative="two-sided")

        conclusion = "Reject H₀ (difference)" if p_val < alpha else "Fail to reject H₀ (no difference)"
        results.append({
            "Group": col,
            "Mean Legendary": group1_mean,
            "Mean Non-Legendary": group2_mean,
            "T-stat": t_stat,
            "p-value": p_val,
            "Conclusion (alpha=0.05)": conclusion
        })
    return pd.DataFrame(results)

legendary_df = df[df["Legendary"] == True]
nonlegendary_df = df[df["Legendary"] == False]

# Run the function
summary = compare_stats_ttest(legendary_df, nonlegendary_df)
print(summary)

Stat HP 8.981370483625048
Stat Attack 10.438133539322205
Stat Defense 7.637078164784619
Stat Sp. Atk 13.417449984138461
Stat Sp. Def 10.015696613114878
Stat Speed 11.475044446314431
     Group  Mean Legendary  Mean Non-Legendary     T-stat       p-value  \
0       HP       92.738462           67.182313   8.981370  1.002691e-13   
1   Attack      116.676923           75.669388  10.438134  2.520372e-16   
2  Defense       99.661538           71.559184   7.637078  4.826998e-11   
3  Sp. Atk      122.184615           68.454422  13.417450  1.551461e-21   
4  Sp. Def      105.938462           68.892517  10.015697  2.294933e-15   
5    Speed      100.184615           65.455782  11.475044  1.049016e-18   

  Conclusion (alpha=0.05)  
0  Reject H₀ (difference)  
1  Reject H₀ (difference)  
2  Reject H₀ (difference)  
3  Reject H₀ (difference)  
4  Reject H₀ (difference)  
5  Reject H₀ (difference)  


In [11]:
alpha = 0.05 # 5% significance level

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")

leg_hp = df[df["Legendary"] == True]["HP"]
nonleg_hp = df[df["Legendary"] == False]["HP"]

leg_hp_mean = leg_hp.mean()
nonleg_hp_mean = nonleg_hp.mean()

leg_hp_std = leg_hp.std(ddof=1)
nonleg_hp_std = nonleg_hp.std(ddof=1)

n1 = len(leg_hp)
n2 = len(nonleg_hp)

statistic = (leg_hp_mean - nonleg_hp_mean) / np.sqrt( (leg_hp_std**2/n1) + (nonleg_hp_std**2/n2))
print(statistic)
print()
show_statistical_test(statistic, alpha, n1+n2-2, distribution="t-student", alternative="two-sided")
print()
st.ttest_ind(leg_hp, nonleg_hp, equal_var=False, alternative="two-sided")

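8.981370483625048

[Figure: t-student Probability Density Function, with the rejection region shaded in red and the test statistic marked]
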
TtestResult(statistic=np.float64(8.981370483625046), pvalue=np.float64(1.0026911708035284e-13), df=np.float64(79.52467830799894))

Comment:
Using independent two-sample t-tests with unequal variances (Welch) for each stat (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed), we compared Legendary and Non-Legendary Pokemon. Every p-value in the summary table is far below 0.05 (the largest is about 4.8e-11, for Defense), so we reject H₀ for all six stats. In each case the Legendary mean is higher than the Non-Legendary mean.
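
Because six tests are run at the 5% level, one optional extra check is a multiple-comparison correction. The sketch below applies the Bonferroni and Holm procedures to the p-values copied from the summary table above. The helper names `bonferroni` and `holm` are hypothetical, and this check is an addition for illustration rather than part of the lab's required solution.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for a test only if its p-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


def holm(p_values, alpha=0.05):
    """Holm step-down procedure: compare sorted p-values to growing thresholds."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1)
        if still_rejecting and p <= alpha / (m - rank):
            decisions[name] = True
        else:
            # Once one test fails, all tests with larger p-values fail too
            still_rejecting = False
            decisions[name] = False
    return decisions


# p-values taken from the summary table above
table_p_values = {
    "HP": 1.002691e-13,
    "Attack": 2.520372e-16,
    "Defense": 4.826998e-11,
    "Sp. Atk": 1.551461e-21,
    "Sp. Def": 2.294933e-15,
    "Speed": 1.049016e-18,
}

print(bonferroni(table_p_values))
print(holm(table_p_values))
```

With p-values this small, both procedures still reject H₀ for all six stats, so the conclusion does not depend on the correction.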

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [15]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment on your findings.
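
Before vectorizing over the whole DataFrame, the first two hints can be tried on single points. This is a minimal standard-library sketch: the first point is the first row of the dataset shown above, the second point is made up for illustration, and `is_close` is a hypothetical helper name. Distances are in the same units as the coordinates (degrees), as the lab statement implies.

```python
import math

SCHOOL = (-118, 34)
HOSPITAL = (-122, 37)
THRESHOLD = 0.50  # "close" means distance lower than 0.50


def euclidean_distance(lon1, lat1, lon2, lat2):
    """Straight-line distance between two (longitude, latitude) points."""
    return math.hypot(lon1 - lon2, lat1 - lat2)


def is_close(lon, lat):
    """True if the point is within THRESHOLD of the school or the hospital."""
    dist_school = euclidean_distance(lon, lat, *SCHOOL)
    dist_hospital = euclidean_distance(lon, lat, *HOSPITAL)
    return dist_school < THRESHOLD or dist_hospital < THRESHOLD


first_row = (-114.31, 34.19)   # first row of the dataset
made_up_point = (-118.2, 34.3)  # hypothetical point near the school

print(is_close(*first_row))
print(is_close(*made_up_point))
```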
 

In [25]:
# Null Hypothesis (H₀): Houses close to a school or hospital have the same mean price as houses farther away.
# Alternative Hypothesis (H₁): Houses close to a school or hospital have a higher mean price than houses farther away.

# 1. Euclidean Distance Function

# We need to calculate, for each house, the distance to the school and hospital:
# - School: (-118, 34)
# - Hospital: (-122, 37)

def euclidean_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon1 - lon2)**2 + (lat1 - lat2)**2)

# 2. Calculate Distances and Define 'Close' Houses

# Add two columns: distance to school and to hospital.  
# A house is "close" if **either** distance is less than 0.5.

school_coords = (-118, 34)
hospital_coords = (-122, 37)

df["dist_school"] = euclidean_distance(df["longitude"], df["latitude"], school_coords[0], school_coords[1])
df["dist_hospital"] = euclidean_distance(df["longitude"], df["latitude"], hospital_coords[0], hospital_coords[1])

# We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.
df["is_close"] = (df["dist_school"] < 0.5) | (df["dist_hospital"] < 0.5)

# Let's check how many are close/far
print(df["is_close"].value_counts())

# 3. Exploratory Analysis

close_prices = df[df["is_close"] == True]["median_house_value"]
far_prices = df[df["is_close"] == False]["median_house_value"]

print(f"Close to school/hospital: n = {len(close_prices)} - Mean price = {close_prices.mean()}")
print(f"Far from school/hospital: n = {len(far_prices)} - Mean price = {far_prices.mean()}")

close_prices_std = close_prices.std(ddof=1)
far_prices_std = far_prices.std(ddof=1)

t_stat, p_val = st.ttest_ind(close_prices, far_prices, equal_var=False)
print(t_stat,p_val)

# 4. Hypothesis Test

alpha = 0.05 # 5% significance level

# Use the sizes of the two groups compared here, not n1/n2 left over from an earlier cell
n_close = len(close_prices)
n_far = len(far_prices)

# Manual Welch statistic, printed as a cross-check against scipy
statistic = (close_prices.mean() - far_prices.mean()) / np.sqrt( (close_prices_std**2/n_close) + (far_prices_std**2/n_far))
print(f"{statistic:.4f}")
print()
show_statistical_test(statistic, alpha, n_close+n_far-2, distribution="t-student", alternative="greater")
print()

# H₁ is one-sided (close > far), so base the decision on the one-sided Welch test
p_val_one_sided = st.ttest_ind(close_prices, far_prices, equal_var=False, alternative="greater").pvalue
if p_val_one_sided < alpha:
    print("Result: Reject H₀. Houses close to school or hospital are significantly more expensive.")
else:
    print("Result: Fail to reject H₀. No significant price difference detected.")


is_close
False    10171
True      6829
Name: count, dtype: int64
Close to school/hospital: n = 6829 - Mean price = 246951.98213501245
Far from school/hospital: n = 10171 - Mean price = 180678.44105790975
37.992330214201516 3.0064957768592614e-301
37.9923

[Figure: t-student Probability Density Function, with the rejection region shaded in red and the test statistic marked]

Result: Reject H₀. Houses close to school or hospital are significantly more expensive.


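Comment: We reject H₀. Houses (neighborhoods) within 0.50 of the school or the hospital have a mean value of about 246,952, versus about 180,678 for those farther away. Welch's t-test gives t ≈ 37.99 with a p-value far below 0.05, so at the 5% significance level houses close to a school or hospital are significantly more expensive.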