# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab we apply hypothesis tests to several scenarios and interpret what the results say about the data.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), and comparing dependent samples (Paired Sample T-Test). We also look at Analysis of Variance (ANOVA), which compares means across more than two groups.

So, grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [33]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [34]:
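#code here
# Select HP for Pokémon whose primary or secondary type is Dragon / Grass
df_dragon = df[(df["Type 1"] == "Dragon") | (df["Type 2"] == "Dragon")]["HP"]
df_grass = df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Grass")]["HP"]

# Set the hypotheses

# H0: mean HP dragon <= mean HP grass
# H1: mean HP dragon > mean HP grass

# significance level = 0.05

# Two independent samples with a one-sided alternative: Welch's t-test (no equal-variance assumption)
st.ttest_ind(df_dragon, df_grass, equal_var=False, alternative="greater")
# JAV: p-value << 0.05, so we reject H0; the data support Dragon having a higher mean HP than Grass.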

TtestResult(statistic=4.097528915272702, pvalue=5.090769061176926e-05, df=77.58086781513519)
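The relationship between the one-sided test used above and the default two-sided test can be checked with a minimal sketch. The two samples here are synthetic and purely illustrative (they are not the Pokémon data); the names `group_a` and `group_b` are assumptions made for this example.

```python
import numpy as np
import scipy.stats as st

# Synthetic, hypothetical samples (NOT the Pokémon data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=25, size=50)
group_b = rng.normal(loc=65, scale=18, size=95)

# Welch's t-test (equal_var=False), one-sided: H1 is mean(group_a) > mean(group_b)
one_sided = st.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

# Same test with the default two-sided alternative
two_sided = st.ttest_ind(group_a, group_b, equal_var=False)

# The statistic is identical; only the p-value changes.
# When the statistic is positive, the "greater" p-value is half the two-sided one.
print(one_sided.statistic, one_sided.pvalue, two_sided.pvalue)
```

Choosing `alternative="greater"` only makes sense when the direction of the hypothesis was fixed before looking at the data, as it is in the exercise statement.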

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [35]:
df['Legendary'].dtypes

dtype('bool')

In [37]:
#code here
# Rename the columns first so every name in `attributes` exists in the DataFrame
df.rename(columns={'Sp. Atk': 'Sp_Atk', 'Sp. Def': 'Sp_Def'}, inplace=True)
attributes = ['HP', 'Attack', 'Defense', 'Speed', 'Sp_Atk', 'Sp_Def']

# Set the hypotheses

# For each attribute in attributes:
# H0: mean attribute non_legendary = mean attribute legendary
# H1: mean attribute non_legendary != mean attribute legendary

alpha = 0.05

for attribute in attributes:
    print(f'We are looking at attribute: {attribute}')

    # Legendary is a boolean column, so it can be used directly as a mask
    df_legend = df[df["Legendary"]][attribute]
    df_non_legend = df[~df["Legendary"]][attribute]

    # Two-sided Welch's t-test; the p-value does not depend on the order of the groups
    p_value = st.ttest_ind(df_non_legend, df_legend, equal_var=False).pvalue

    print(f' p_value = {p_value}')
    if p_value > alpha:
        print("We are not able to reject the null hypothesis")
    else:
        print("We reject the null hypothesis")

# Conclusion: H0 is rejected for every attribute, so Legendary and Non-Legendary Pokémon differ in mean for all six stats.


We are looking at attribute: HP
 p_value = 1.0026911708035284e-13
We reject the null hypothesis
We are looking at attribute: Attack
 p_value = 2.520372449236646e-16
We reject the null hypothesis
We are looking at attribute: Defense
 p_value = 4.8269984949193316e-11
We reject the null hypothesis
We are looking at attribute: Speed
 p_value = 1.049016311882451e-18
We reject the null hypothesis
We are looking at attribute: Sp_Atk
 p_value = 1.5514614112239812e-21
We reject the null hypothesis
We are looking at attribute: Sp_Def
 p_value = 2.2949327864052826e-15
We reject the null hypothesis
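Because six tests are run at once, one optional extra check is a Bonferroni correction, which compares each p-value against `alpha` divided by the number of tests. The sketch below is not part of the original exercise; it simply reuses the six p-values printed above, and the names `p_values` and `bonferroni_alpha` are introduced here for illustration.

```python
# p-values copied from the output above
p_values = {
    "HP": 1.0026911708035284e-13,
    "Attack": 2.520372449236646e-16,
    "Defense": 4.8269984949193316e-11,
    "Speed": 1.049016311882451e-18,
    "Sp_Atk": 1.5514614112239812e-21,
    "Sp_Def": 2.2949327864052826e-15,
}

alpha = 0.05
# Bonferroni: split the overall significance level across the six tests
bonferroni_alpha = alpha / len(p_values)

# True means H0 is still rejected under the stricter threshold
rejected = {name: p < bonferroni_alpha for name, p in p_values.items()}

for name, is_rejected in rejected.items():
    print(f"{name}: reject H0 = {is_rejected}")
```

Every p-value here is many orders of magnitude below the corrected threshold, so the conclusion does not change.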


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (longitude, latitude): (-118, 37)
- Hospital coordinates (longitude, latitude): (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to it, measured in coordinate degrees, is less than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
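The first two hints can be sketched on a tiny, made-up frame. This is only an illustration under stated assumptions: the three rows in `houses` are hypothetical, the helper name `euclidean_distance` is introduced here, and points are written as (latitude, longitude), which is the order used in the code cells of this lab.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the same column names as the housing data
houses = pd.DataFrame(
    {
        "latitude": [37.0, 37.3, 34.19],
        "longitude": [-118.0, -117.8, -114.31],
    }
)

def euclidean_distance(frame, point):
    """Euclidean distance (in coordinate degrees) from every row to `point`."""
    lat, lon = point  # assumed order: (latitude, longitude)
    return np.sqrt((frame["latitude"] - lat) ** 2 + (frame["longitude"] - lon) ** 2)

school = (37, -118)
hospital = (34, -122)

houses["dist_school"] = euclidean_distance(houses, school)
houses["dist_hospital"] = euclidean_distance(houses, hospital)

# Close to EITHER facility; everything else is far
houses["is_close"] = (houses["dist_school"] < 0.5) | (houses["dist_hospital"] < 0.5)
print(houses)
```

Working on whole columns avoids a Python-level loop over the 17,000 rows of the real dataset.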
 

In [3]:
# !pip install geopy

In [4]:
# Info from chatGPT:

from geopy.distance import geodesic

# Coordinates of the school and hospital
school_coords = (37, -118)  # Latitude, Longitude
hospital_coords = (34, -122)  # Latitude, Longitude

# Calculate the distance
distance = geodesic(school_coords, hospital_coords).kilometers
print("Distance between school and hospital:", round(distance,2), "km")


Distance between school and hospital: 492.35 km
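As a cross-check that does not need `geopy`, the same distance can be approximated with the haversine formula. This is a sketch, not what `geodesic` does internally: it assumes a spherical Earth with a mean radius of 6371 km, so the result should land near the geodesic figure above without matching it exactly. The function name `haversine_km` is introduced here.

```python
import math

def haversine_km(point_a, point_b, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula on a sphere of radius `radius_km` (an assumption)
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

school_coords = (37, -118)    # Latitude, Longitude
hospital_coords = (34, -122)  # Latitude, Longitude

approx_distance = haversine_km(school_coords, hospital_coords)
print("Approximate distance:", round(approx_distance, 2), "km")
```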


In [5]:
[df['latitude'][0],df['longitude'][0]]

[34.19, -114.31]

In [8]:
# Pair each row's coordinates as [latitude, longitude]
df['pair_loc'] = df[['latitude', 'longitude']].values.tolist()


In [9]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,pair_loc
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,"[34.19, -114.31]"
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,"[34.4, -114.47]"
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,"[33.69, -114.56]"
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,"[33.64, -114.57]"
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,"[33.57, -114.57]"
...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,"[40.58, -124.26]"
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,"[40.69, -124.27]"
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,"[41.84, -124.3]"
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,"[41.8, -124.3]"


In [11]:
# Geodesic distance (geopy Distance objects, in km) from each neighborhood to the school and the hospital
df['dist_school'] = df['pair_loc'].apply(lambda x: geodesic(school_coords, x))
df['dist_hospital'] = df['pair_loc'].apply(lambda x: geodesic(hospital_coords, x))

In [None]:
df['dist_school'].sort_values()

6904     28.396168218663806 km
6776      32.67155475393458 km
4523      40.19903358161046 km
5596      42.00394782466469 km
5597     44.089773899885806 km
                 ...          
16991      752.110988645483 km
16987     752.3896357626232 km
16998     760.1536300500362 km
16939     761.9100273072842 km
16997     763.1587856199142 km
Name: dist_school, Length: 17000, dtype: object

In [60]:
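# NOTE: dist_school and dist_hospital are geodesic distances in km, so 0.5 here means 0.5 km,
# not 0.5 coordinate degrees as in the exercise statement.
close = (df['dist_school'] < 0.5) | (df['dist_hospital'] < 0.5)
far = ~close

df_close = df[close]['median_house_value']
df_far = df[far]['median_house_value']

# Set the hypotheses

# H0: mean median_house_value close <= mean median_house_value far
# H1: mean median_house_value close > mean median_house_value far

st.ttest_ind(df_close, df_far, equal_var=False, alternative="greater")
# The nearest neighborhood is about 28 km from the school, so `close` is probably empty, which would explain the nan result.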

TtestResult(statistic=nan, pvalue=nan, df=nan)
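A `nan` result like this usually means one of the groups had too few observations. One way to catch that early is to check the group sizes before calling the test. The helper below is a hypothetical sketch (the name `welch_greater` and the `min_size` parameter are not from the lab); it returns `None` instead of running the test on an undersized group.

```python
import numpy as np
import scipy.stats as st

def welch_greater(sample_a, sample_b, min_size=2):
    """One-sided Welch's t-test (H1: mean(a) > mean(b)), or None if a group is too small."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # A variance estimate needs at least two observations per group
    if a.size < min_size or b.size < min_size:
        return None
    return st.ttest_ind(a, b, equal_var=False, alternative="greater")

# An empty "close" group is reported explicitly instead of yielding nan
empty_case = welch_greater([], [1.0, 2.0, 3.0])
print(empty_case)

# With enough observations the test runs as usual
valid_case = welch_greater([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
print(valid_case.pvalue)
```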

In [14]:
# Alternative approach: Euclidean distance in coordinate degrees

# Coordinates of the school and hospital
school = (37, -118)  # Latitude, Longitude
hospital = (34, -122)  # Latitude, Longitude

def distance(pointA, pointB):
    """Euclidean distance between two (latitude, longitude) points."""
    return np.sqrt((pointA[0]-pointB[0])**2 + (pointA[1]-pointB[1])**2)




In [15]:
distance(school, hospital)

5.0

In [26]:
df['new_dist_school'] = df['pair_loc'].apply(lambda x: distance(school, x))
df['new_dist_hospital'] = df['pair_loc'].apply(lambda x: distance(hospital, x))

# Close to EITHER facility; every other neighborhood is far
closer = (df['new_dist_school'] < 0.5) | (df['new_dist_hospital'] < 0.5)
farther = ~closer

df_closer = df[closer]['median_house_value']
df_farther = df[farther]['median_house_value']

# Set the hypotheses

# H0: mean median_house_value close <= mean median_house_value far
# H1: mean median_house_value close > mean median_house_value far

# One-sided Welch's t-test; alternative="greater" matches H1 above
result = st.ttest_ind(df_closer, df_farther, equal_var=False, alternative="greater")

alpha = 0.05
print('H0: median_house_value close <= median_house_value far')
print(f' result = {result}')

if result.pvalue > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")
# Conclusion: the p-value is close to 1, so we cannot reject H0; there is no evidence that closer houses are more expensive.
# The negative statistic means the close group actually has the lower sample mean.


H0: median_house_value close <= median_house_value far
 result = TtestResult(statistic=-17.174167998688404, pvalue=0.9999738999071939, df=4.145382282040222)
We are not able to reject the null hypothesis


In [27]:
df[['median_house_value', 'median_income','new_dist_school', 'new_dist_hospital']].sort_values('new_dist_school')[:30]

Unnamed: 0,median_house_value,median_income,new_dist_school,new_dist_hospital
6904,80600.0,2.325,0.315753,4.718019
6776,94500.0,2.75,0.344819,4.872258
4523,74200.0,2.0357,0.363456,4.75101
5596,108300.0,2.1212,0.393573,5.080837
5597,104700.0,2.2574,0.411461,4.637812
7771,105800.0,1.8325,0.537587,4.938522
7849,116100.0,1.9722,0.538145,4.924388
8010,180400.0,4.375,0.546717,4.902948
8009,54400.0,1.7388,0.553173,4.909786
8254,139700.0,4.337,0.58258,4.894834
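Only the first five rows in the table above fall under the 0.5 threshold for the school, and none of the rows shown is anywhere near the hospital. A small sketch, using only the ten rows printed above (copied by hand, so it says nothing about the rest of the dataset), makes the size and mean of that nearby group explicit. The names `rows` and `near_school` are introduced here for illustration.

```python
import pandas as pd

# The ten rows shown above, copied from the output (not the full dataset)
rows = pd.DataFrame(
    {
        "median_house_value": [80600.0, 94500.0, 74200.0, 108300.0, 104700.0,
                               105800.0, 116100.0, 180400.0, 54400.0, 139700.0],
        "new_dist_school": [0.315753, 0.344819, 0.363456, 0.393573, 0.411461,
                            0.537587, 0.538145, 0.546717, 0.553173, 0.582580],
    },
    index=[6904, 6776, 4523, 5596, 5597, 7771, 7849, 8010, 8009, 8254],
)

# Neighborhoods within 0.5 degrees of the school
near_school = rows[rows["new_dist_school"] < 0.5]
n_close = len(near_school)
mean_close = near_school["median_house_value"].mean()

print(f"close group size: {n_close}, mean median_house_value: {mean_close}")
```

A group this small is consistent with the low degrees of freedom reported by the test, and it is a reason to read the result with some caution.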


In [22]:
df_short = df[['median_house_value', 'median_income','new_dist_school', 'new_dist_hospital']]