# Lab | Hypothesis Testing

**Objective**

This lab applies hypothesis testing to real datasets. The tests in scope are the one-sample t-test (the mean of a single sample), the two-sample t-test (differences between independent groups), the paired-sample t-test (differences within dependent samples), and Analysis of Variance (ANOVA), which compares means across more than two groups.

For each challenge, state your hypotheses, choose the appropriate test, and interpret the result at the given significance level.

**Challenge 1**

In this challenge, we will work with Pokemon data, which can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [14]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [15]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokemon have, on average, more HP than Grass-type Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [16]:
# Filter the dataset for Dragon and Grass type Pokemon
dragon_hp = df[df['Type 1'] == 'Dragon']['HP']
grass_hp = df[df['Type 1'] == 'Grass']['HP']

# Perform an independent two-sample t-test (one-tailed: Dragon mean > Grass mean)
t_stat, p_value = st.ttest_ind(dragon_hp, grass_hp, alternative='greater')

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. Dragon-type Pokemon have, on average, more HP than Grass-type Pokemon.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to say that Dragon-type Pokemon have more HP than Grass-type Pokemon.")

T-statistic: 3.590444254130357
P-value: 0.0002567969150153481
Reject the null hypothesis. Dragon-type Pokemon have, on average, more HP than Grass-type Pokemon.


Based on the results of the hypothesis test, we can draw the following conclusions:

1. **T-statistic**: The t-statistic is 3.590. It measures how many standard errors the observed difference in sample means lies from the difference assumed under the null hypothesis (zero in this case).

2. **P-value**: The p-value is 0.000257, which is much smaller than the significance level of 0.05.

3. **Decision**: Since the p-value is less than 0.05, we reject the null hypothesis.

4. **Conclusion**: There is strong evidence for the alternative hypothesis that Dragon-type Pokemon have, on average, more HP than Grass-type Pokemon. A difference this large would be very unlikely to arise from sampling variation alone if the two types had the same mean HP. Note that the groups are defined by `Type 1` only, so Pokemon whose secondary type is Dragon or Grass are not included.
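
To make the t-statistic concrete, the sketch below recomputes the pooled two-sample statistic by hand and compares it with `st.ttest_ind`. The two samples are synthetic (drawn with a fixed seed) and are not the Pokemon data, and the variable names are illustrative. It assumes the default `equal_var=True` setting, which is what the cell above relies on.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples for illustration only (not the Pokemon data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=20, size=30)
group_b = rng.normal(loc=65, scale=20, size=70)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: weighted average of the two sample variances
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# Standard error of the difference in means
std_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means measured in standard errors
t_manual = (group_a.mean() - group_b.mean()) / std_error

# One-tailed p-value: upper-tail probability of the t-distribution
p_manual = st.t.sf(t_manual, df=n_a + n_b - 2)

# Same test through SciPy
result = st.ttest_ind(group_a, group_b, alternative='greater')

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.6f}")
```

If the two groups have clearly different variances, passing `equal_var=False` to `st.ttest_ind` runs Welch's t-test instead, which does not pool the variances.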

- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) than Non-Legendary Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [17]:
# Filter the dataset for Legendary and Non-Legendary Pokemon
legendary = df[df['Legendary']]
non_legendary = df[~df['Legendary']]

# List of stats to compare
stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Perform an independent two-sample t-test for each stat
results = {}
for stat in stats:
    t_stat, p_value = st.ttest_ind(legendary[stat], non_legendary[stat], alternative='two-sided')
    results[stat] = {'t_stat': t_stat, 'p_value': p_value}

# Print the results
for stat, result in results.items():
    print(f"{stat} - T-statistic: {result['t_stat']}, P-value: {result['p_value']}")

# Interpret the results
alpha = 0.05
for stat, result in results.items():
    if result['p_value'] < alpha:
        print(f"Reject the null hypothesis for {stat}. Legendary Pokemon have different {stat} compared to Non-Legendary Pokemon.")
    else:
        print(f"Fail to reject the null hypothesis for {stat}. There is not enough evidence to say that Legendary Pokemon have different {stat} compared to Non-Legendary Pokemon.")

HP - T-statistic: 8.036124405043928, P-value: 3.3306476848461913e-15
Attack - T-statistic: 10.397321023700622, P-value: 7.827253003205333e-24
Defense - T-statistic: 7.181240122992339, P-value: 1.5842226094427259e-12
Sp. Atk - T-statistic: 14.191406210846289, P-value: 6.314915770427265e-41
Sp. Def - T-statistic: 11.03775106120522, P-value: 1.8439809580409597e-26
Speed - T-statistic: 9.765234331931898, P-value: 2.3540754436898437e-21
Reject the null hypothesis for HP. Legendary Pokemon have different HP compared to Non-Legendary Pokemon.
Reject the null hypothesis for Attack. Legendary Pokemon have different Attack compared to Non-Legendary Pokemon.
Reject the null hypothesis for Defense. Legendary Pokemon have different Defense compared to Non-Legendary Pokemon.
Reject the null hypothesis for Sp. Atk. Legendary Pokemon have different Sp. Atk compared to Non-Legendary Pokemon.
Reject the null hypothesis for Sp. Def. Legendary Pokemon have different Sp. Def compared to Non-Legendary Pokem

Based on the results of the hypothesis tests for each stat (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed), we can draw the following conclusions:

1. **HP**:
   - **T-statistic**: 8.036
   - **P-value**: 3.33e-15
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different HP compared to Non-Legendary Pokemon.

2. **Attack**:
   - **T-statistic**: 10.397
   - **P-value**: 7.83e-24
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different Attack compared to Non-Legendary Pokemon.

3. **Defense**:
   - **T-statistic**: 7.181
   - **P-value**: 1.58e-12
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different Defense compared to Non-Legendary Pokemon.

4. **Sp. Atk**:
   - **T-statistic**: 14.191
   - **P-value**: 6.31e-41
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different Sp. Atk compared to Non-Legendary Pokemon.

5. **Sp. Def**:
   - **T-statistic**: 11.038
   - **P-value**: 1.84e-26
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different Sp. Def compared to Non-Legendary Pokemon.

6. **Speed**:
   - **T-statistic**: 9.765
   - **P-value**: 2.35e-21
   - **Conclusion**: The p-value is much smaller than the significance level of 0.05. We reject the null hypothesis and conclude that Legendary Pokemon have different Speed compared to Non-Legendary Pokemon.

### Overall Conclusion:
For all six stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed), the p-values are far below the significance level of 0.05, so we reject the null hypothesis for each stat. Running six tests raises the chance of at least one false positive above 5%, but the p-values are small enough that the conclusion also holds under a Bonferroni-adjusted threshold of 0.05 / 6 ≈ 0.0083. Legendary Pokemon therefore differ significantly from Non-Legendary Pokemon on every stat, and the positive t-statistics show that the Legendary mean is the higher one in each case.
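
Because six tests are run on the same two groups, it is worth checking whether the results survive a multiple-comparison adjustment. The sketch below applies a Bonferroni correction to the p-values printed in the output above. Bonferroni is one common, conservative choice and is an addition here, not something the lab prescribes.

```python
# P-values copied from the two-sided t-test output above
p_values = {
    "HP": 3.3306476848461913e-15,
    "Attack": 7.827253003205333e-24,
    "Defense": 1.5842226094427259e-12,
    "Sp. Atk": 6.314915770427265e-41,
    "Sp. Def": 1.8439809580409597e-26,
    "Speed": 2.3540754436898437e-21,
}

alpha = 0.05

# Bonferroni: divide the significance level by the number of tests
alpha_adjusted = alpha / len(p_values)

# Check each stat against the stricter threshold
still_significant = {stat: p < alpha_adjusted for stat, p in p_values.items()}

print(f"Adjusted alpha: {alpha_adjusted:.4f}")
for stat, significant in still_significant.items():
    print(f"{stat}: significant after correction = {significant}")
```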

**Challenge 2**

In this challenge, we will work with California housing data, which can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [19]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to that point, computed directly on the longitude/latitude values, is less than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
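
The first two hints can be tried on a few made-up coordinates before applying them to the full dataset. In this sketch the function name `euclidean_distance` and the three sample points are illustrative and not part of the lab data; only the school and hospital coordinates and the 0.50 threshold come from the task.

```python
import numpy as np

def euclidean_distance(lon, lat, point):
    """Euclidean distance from each (lon, lat) pair to a fixed point."""
    return np.hypot(lon - point[0], lat - point[1])

# Reference points and threshold from the task description
school = (-118, 37)
hospital = (-122, 34)
threshold = 0.50

# Made-up neighborhoods: near the school, near the hospital, near neither
lon = np.array([-118.2, -121.9, -119.5])
lat = np.array([37.1, 34.3, 35.0])

dist_school = euclidean_distance(lon, lat, school)
dist_hospital = euclidean_distance(lon, lat, hospital)

# A neighborhood counts as close if it is within the threshold of either point
is_close = (dist_school < threshold) | (dist_hospital < threshold)
print(is_close)
```

The resulting boolean array can be used as a mask to split any value column into a close group and a far group.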
 

In [20]:
# Define the distance function
def get_distance(coord1, coord2):
    return np.sqrt((coord1[0] - coord2[0])**2 + (coord1[1] - coord2[1])**2)

# Define coordinates for school and hospital
school_coords = (-118, 37)
hospital_coords = (-122, 34)

# Calculate distances to school and hospital
# np.sqrt works element-wise, so get_distance can take whole columns at once
df['distance_to_school'] = get_distance((df['longitude'], df['latitude']), school_coords)
df['distance_to_hospital'] = get_distance((df['longitude'], df['latitude']), hospital_coords)

# Define proximity to school or hospital
threshold = 0.50
df['close_to_school_or_hospital'] = (df['distance_to_school'] < threshold) | (df['distance_to_hospital'] < threshold)

# Divide dataset into houses close and far from either a school or hospital
close_houses = df.loc[df['close_to_school_or_hospital'], 'median_house_value']
far_houses = df.loc[~df['close_to_school_or_hospital'], 'median_house_value']

# Perform an independent two-sample t-test (one-tailed: close mean > far mean)
t_stat, p_value = st.ttest_ind(close_houses, far_houses, alternative='greater')

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. Houses close to either a school or a hospital are more expensive.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to say that houses close to either a school or a hospital are more expensive.")

T-statistic: -2.2146147257665834
P-value: 0.9866001334644356
Fail to reject the null hypothesis. There is not enough evidence to say that houses close to either a school or a hospital are more expensive.


Based on the results of the hypothesis test, we can draw the following conclusions:

1. **T-statistic**: The t-statistic is -2.215. The negative sign means that the sample mean price of houses close to a school or hospital is lower than that of houses far from both, which is the opposite of the hypothesized direction.

2. **P-value**: The p-value is 0.987, which is much larger than the significance level of 0.05.

3. **Decision**: Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

4. **Conclusion**: There is not enough evidence to support the alternative hypothesis that houses close to either a school or a hospital are more expensive. This one-sided test only asks whether close houses are more expensive; it does not show that the two groups have equal prices. In this sample the difference points the other way: close houses are cheaper on average.

### Overall Conclusion:
At the 5% level, the data do not support the claim that proximity to a school or hospital raises house prices. This is not the same as finding no difference. A p-value of 0.987 for the `'greater'` alternative corresponds to a two-sided p-value of about 0.027, so a two-sided test would flag a significant difference, with close houses being cheaper. Since that direction was not the hypothesis posed in advance, it is best treated as an observation to follow up rather than a confirmed finding.
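
Because the t-distribution is symmetric, the one-sided p-value reported in the output above can be converted to the p-values for the other alternatives without rerunning the test. The sketch below does that arithmetic, which helps separate "not more expensive" from "no difference". The variable names are illustrative.

```python
# One-sided p-value (alternative='greater') copied from the output above
p_greater = 0.9866001334644356

# The opposite one-sided alternative ('less') uses the other tail
p_less = 1 - p_greater

# The two-sided p-value doubles the smaller of the two tails
p_two_sided = 2 * min(p_greater, p_less)

alpha = 0.05
print(f"p (greater):   {p_greater:.4f}")
print(f"p (less):      {p_less:.4f}")
print(f"p (two-sided): {p_two_sided:.4f}")
print(f"Two-sided test significant at 5%: {p_two_sided < alpha}")
```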