# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

We will test the mean of a single sample (One Sample T-Test), investigate differences between independent groups (Two Sample T-Test), and compare means within dependent samples (Paired Sample T-Test). Furthermore, we'll venture into Analysis of Variance (ANOVA), which compares means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [4]:
df_p = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df_p

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, with 5% significance, comment on your findings.

Hypothesis

H0: mean(d_hp) ≤ mean(g_hp)

H1: mean(d_hp) > mean(g_hp)

In [1]:
# significance level

alpha = .05

In [14]:
# collecting the data 

d_hp = df_p[df_p['Type 1'] == 'Dragon']['HP']
d_hp.shape

(32,)

In [15]:
g_hp = df_p[df_p['Type 1'] == 'Grass']['HP']
g_hp

0       45
1       60
2       80
3       80
48      45
      ... 
718     56
719     61
720     88
740     66
741    123
Name: HP, Length: 70, dtype: int64

In [17]:
# One-sample t-test of Dragon HP against the Grass mean HP
# (the Grass mean is treated as a fixed value; the test is two-sided by default)
stat, p_value = st.ttest_1samp(d_hp, g_hp.mean())
print(f'stat:{stat:.4f}')
print('p_value:',p_value)

stat:3.8134
p_value: 0.0006118359957847815


# Conclusion

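The p_value (about 0.0006) is less than alpha (0.05), so we reject H0. The test statistic is positive, so the data support the claim that Dragon-type Pokémon have higher average HP than Grass-type Pokémon.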
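Both group means here are estimated from data, so a two-sample test is another way to check the same claim. The following is a minimal sketch, assuming Welch's t-test (`equal_var=False`) with a one-sided alternative. The helper name `one_sided_welch` and the HP values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import scipy.stats as st

def one_sided_welch(sample_a, sample_b, alpha=0.05):
    """Test H1: mean(sample_a) > mean(sample_b) using Welch's t-test."""
    stat, p_value = st.ttest_ind(
        sample_a, sample_b, equal_var=False, alternative="greater"
    )
    return stat, p_value, p_value < alpha

# Illustrative values only (hypothetical, not the real Pokémon data)
dragon_like = np.array([80, 95, 70, 105, 90, 85, 100, 75])
grass_like = np.array([45, 60, 80, 55, 65, 70, 50, 60])

stat, p_value, reject = one_sided_welch(dragon_like, grass_like)
print(f"stat: {stat:.4f}, p_value: {p_value:.4f}, reject H0: {reject}")
```

With the notebook's data, the call would be `one_sided_welch(d_hp, g_hp)`.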

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.


Hypothesis (tested separately for each stat)

H0: mean(l_stats) = mean(non_l_stats)

H1: mean(l_stats) ≠ mean(non_l_stats)

In [20]:
alpha_l = 0.05
poke_stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

In [49]:
l_stats = df_p[df_p['Legendary'] == True][poke_stats]
l_stats

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
156,90,85,100,95,125,85
157,90,90,85,125,90,100
158,90,100,90,125,85,90
162,106,110,90,154,90,130
163,106,190,100,154,100,130
...,...,...,...,...,...,...
795,50,100,150,100,150,50
796,50,160,110,160,110,110
797,80,110,60,150,130,70
798,80,160,60,170,130,80


In [50]:
non_l_stats = df_p[df_p['Legendary'] == False][poke_stats]
non_l_stats

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,45,49,49,65,65,45
1,60,62,63,80,80,60
2,80,82,83,100,100,80
3,80,100,123,122,120,80
4,39,52,43,60,50,65
...,...,...,...,...,...,...
787,85,100,122,58,75,54
788,55,69,85,32,35,28
789,95,117,184,44,46,28
790,40,30,35,45,40,55


In [51]:
st.ttest_ind(l_stats,non_l_stats, equal_var=False)

# We are assuming that variance between the legendary and 
# regular pokemon is NOT equal, because regular pokemon come in
# multiple levels of evolution, with higher evolutions being significantly 
# more powerful than lower evolutions. On the other hand, legendary pokemon 
# come in their highest form, and so we expect one legendary pokemon to be more 
# comparable to another in terms of power.

TtestResult(statistic=array([ 8.98137048, 10.43813354,  7.63707816, 13.41744998, 10.01569661,
       11.47504445]), pvalue=array([1.00269117e-13, 2.52037245e-16, 4.82699849e-11, 1.55146141e-21,
       2.29493279e-15, 1.04901631e-18]), df=array([79.52467831, 75.88324448, 77.71095111, 74.24631138, 73.25892564,
       81.62110996]))

# Conclusion 

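All six p_values are less than our significance level (alpha = 0.05), so we reject H0 (the null hypothesis) for every stat: the mean stats of Legendary and Non-Legendary Pokémon are different. The test statistics are all positive, so Legendary Pokémon have the higher mean in each case.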
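Because `st.ttest_ind` returns one p-value per column, it can help to pair each value with its stat name before comparing it to alpha. A small sketch of that bookkeeping, using the p-values printed in the `TtestResult` output above (the `decisions` dictionary is an illustrative name, not part of the original notebook):

```python
poke_stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# p-values copied from the TtestResult output, in the same column order
p_values = [1.00269117e-13, 2.52037245e-16, 4.82699849e-11,
            1.55146141e-21, 2.29493279e-15, 1.04901631e-18]
alpha_l = 0.05

# True means "reject H0" for that stat
decisions = {name: p < alpha_l for name, p in zip(poke_stats, p_values)}

for name, p in zip(poke_stats, p_values):
    verdict = "reject H0" if decisions[name] else "fail to reject H0"
    print(f"{name:8s} p_value={p:.2e} -> {verdict}")
```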


**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [53]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment on your findings.
 

# Alejandro's Function to calculate Euclidean distance
def euclidean_distance(x1, y1, x2, y2):
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)
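The function above works element-wise on arrays, so it can be applied to whole coordinate columns at once. The sketch below follows the hint literally (plain Euclidean distance on longitude and latitude, threshold 0.50). The neighborhood coordinates are hypothetical and serve only to show the close/far split.

```python
import numpy as np

def euclidean_distance(x1, y1, x2, y2):
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)

# Hypothetical neighborhood coordinates, for illustration only
longitude = np.array([-118.1, -118.0, -121.9, -120.0])
latitude = np.array([37.2, 37.6, 34.3, 36.0])

school = (-118, 37)     # (longitude, latitude) given in the task
hospital = (-122, 34)   # (longitude, latitude) given in the task

school_d = euclidean_distance(longitude, latitude, *school)
hosp_d = euclidean_distance(longitude, latitude, *hospital)

# "Close" means within 0.50 of either point; everything else is "far"
is_close = (school_d < 0.5) | (hosp_d < 0.5)
print(is_close)
```

The same expressions work unchanged when `longitude` and `latitude` are pandas columns such as `df['longitude']` and `df['latitude']`.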

In [62]:
# Define function to compute distance from school
def school_dist(lat1, lon1):
    "This function takes provided latitudinal and longitudinal coordinates and computes their euclidean distance from a school whose position is (-118,37)."
    # Scales longitude by cos(latitude) to approximate 'Cartesian coordinates' - ChatGPT recommended I do this.
    # Note: np.cos expects radians, while the latitudes here are in degrees.

    x1 = lon1 * np.cos(lat1)
    y1 = lat1
    school_lon = -118 * np.cos(37)  # school longitude (-118), scaled by the cosine of its latitude (37)
    school_lat = 37                 # school latitude

    # Computes euclidean distance from a given latitude and longitude to the school. 
    distance = np.sqrt((x1 - school_lon)**2 + (y1 - school_lat)**2)
    return distance

In [63]:
# Define function to compute distance from hospital
def hosp_dist(lat1, lon1):
    "This function takes provided latitudinal and longitudinal coordinates and computes their euclidean distance from a hospital whose position is (-122,34)."
    # Scales longitude by cos(latitude) to approximate 'Cartesian coordinates' - ChatGPT recommended I do this.
    # Note: np.cos expects radians, while the latitudes here are in degrees.

    x1 = lon1 * np.cos(lat1)
    y1 = lat1
    hosp_lon = -122 * np.cos(34)  # hospital longitude (-122), scaled by the cosine of its latitude (34)
    hosp_lat = 34                 # hospital latitude

    # Computes euclidean distance from a given latitude and longitude to the hospital.
    distance = np.sqrt((x1 - hosp_lon)**2 + (y1 - hosp_lat)**2)

    return distance
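If longitude is to be scaled by the cosine of latitude, the latitude needs to be converted from degrees to radians first, since `np.cos` works in radians. The following is a minimal sketch of one such variant, assuming the scaling is applied to the longitude difference at the mean latitude of the two points. The function name `scaled_distance` is hypothetical, and this is an alternative to the functions above rather than the computation used in this notebook.

```python
import numpy as np

def scaled_distance(lat, lon, ref_lat, ref_lon):
    """Distance in degrees, with longitude differences shrunk by cos(latitude)."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    mean_lat_rad = np.radians((lat + ref_lat) / 2)  # degrees -> radians
    dx = (lon - ref_lon) * np.cos(mean_lat_rad)     # east-west offset, scaled
    dy = lat - ref_lat                              # north-south offset
    return np.sqrt(dx**2 + dy**2)

# Same point, one degree north, and one degree east at latitude 60
print(scaled_distance(37.0, -118.0, 37, -118))
print(scaled_distance(38.0, -118.0, 37, -118))
print(scaled_distance(60.0, -117.0, 60, -118))
```

One degree of longitude at latitude 60 counts as roughly half a degree here, which is the effect the cosine scaling is meant to capture.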

In [64]:
# Creates a new column in df that contains each house's euclidean distance from the school. 

df['school_dist'] = school_dist(df['latitude'],df['longitude'])

close_school = df[df['school_dist']<0.5]

In [65]:
# Creates a new column in df that contains each house's euclidean distance from the hospital. 

df['hosp_dist'] = hosp_dist(df['latitude'],df['longitude'])

close_hosp = df[df['hosp_dist']<0.5]

In [68]:
# Creates subsets of data based on nearness to the school or hospital. 

close_house = df[(df['hosp_dist']<0.5) | (df['school_dist']<0.5)]['median_house_value']

far_house = df[(df['hosp_dist']>=0.5) & (df['school_dist']>=0.5)]['median_house_value']

In [69]:
# Runs a two sample t-test comparing prices of houses close to either the school or hospital with houses that are not close to either. 

# We assume that the variance of house prices in both categories is nearly equal because we have no indication that houses close to either or far from both bear any different qualities from each other. 

stat, p_value = st.ttest_ind(close_house,far_house)

print("Stat:", stat, " p_value:", p_value)

if p_value < 0.05:
    print("Reject null hypothesis. There is a statistically significant difference in house prices between houses close to either the hospital or the school and houses that are close to neither.")
    # Only interpret the direction of the difference when it is statistically significant.
    if stat > 0:
        print("Houses close to either the school or the hospital are more expensive than houses that are not.")
    else:
        print("Houses close to either the school or the hospital are less expensive than houses that are not.")
else:
    print("Fail to reject the null. We find no statistically significant difference between houses close to either the school or the hospital and houses that are close to neither.")

Stat: 2.190010046812842  p_value: 0.02853704837377385
Reject null hypothesis. There is a statistically significant difference in house prices between houses close to either the hospital or the school and houses that are close to neither.
Houses close to either the school or the hospital are more expensive than houses that are not.
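The claim is directional ("more expensive"), while `st.ttest_ind` reports a two-sided p-value by default. Because the t distribution is symmetric, a one-sided p-value can be derived from the two-sided one and the sign of the statistic. A short sketch of that conversion, using the values printed above (the helper name `one_sided_p` is hypothetical):

```python
def one_sided_p(stat, two_sided_p):
    """Convert a two-sided t-test p-value to the p-value for H1: mean_a > mean_b."""
    half = two_sided_p / 2
    # A positive statistic points in the direction of H1, a negative one against it
    return half if stat > 0 else 1 - half

stat = 2.190010046812842            # from the output above
p_two_sided = 0.02853704837377385   # from the output above

p_greater = one_sided_p(stat, p_two_sided)
print(f"one-sided p_value: {p_greater:.4f}")
```

Passing `alternative='greater'` to `st.ttest_ind` should give the same one-sided value directly.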
