# SQLAlchemy Challenge

## Bonus Temperature Analysis I

       Kate Spitzer
       
       
       In this notebook, we test whether the average observed temperature in Hawaii in June differs statistically
       from the average observed temperature in December.
       
       We read the Hawaii data from hawaii_measurements.csv into a Pandas DataFrame and converted the date column from
       a string to a datetime object.  The dataset was cleaned by dropping rows containing null data.  We then pulled all
       June data points into a second DataFrame and the December data points into a third DataFrame.
       
       The temperature data for June and December were then pulled into lists in preparation for our t-test.
       

In [17]:
# define environment
import pandas as pd
from datetime import datetime as dt
from statistics import mean

In [18]:
# read the CSV file into a DataFrame
df = pd.read_csv("Resources/hawaii_measurements.csv")

# display a sample of the dataframe
df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [19]:
# inspect columns and datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19550 entries, 0 to 19549
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   station  19550 non-null  object 
 1   date     19550 non-null  object 
 2   prcp     18103 non-null  float64
 3   tobs     19550 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 611.1+ KB


In [20]:
# Convert the date column format from string to datetime
df["date"] = pd.to_datetime(df["date"])

# inspect columns and datatypes again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19550 entries, 0 to 19549
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   station  19550 non-null  object        
 1   date     19550 non-null  datetime64[ns]
 2   prcp     18103 non-null  float64       
 3   tobs     19550 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 611.1+ KB


In [21]:
# set the date column as the DataFrame index
df.set_index("date", inplace=True)

In [22]:
# inspect columns and datatypes again
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 19550 entries, 2010-01-01 to 2017-08-23
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   station  19550 non-null  object 
 1   prcp     18103 non-null  float64
 2   tobs     19550 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 610.9+ KB


In [23]:
# display a sample of the current state of the dataframe
df.head()

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.0,63
2010-01-03,USC00519397,0.0,74
2010-01-04,USC00519397,0.0,76
2010-01-06,USC00519397,,73


In [24]:
# we want to clean the dataset further by dropping
# rows with any null data
clean_df = df.dropna(how="any")

# display table stats
clean_df.describe()

,prcp,tobs
count,18103.0,18103.0
mean,0.160644,72.994863
std,0.468746,4.512107
min,0.0,53.0
25%,0.0,70.0
50%,0.01,73.0
75%,0.11,76.0
max,11.53,87.0
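
The `info()` output above shows 19550 non-null `tobs` values but only 18103 non-null `prcp` values, so `dropna(how="any")` discards 1447 rows whose temperature reading is valid. The sketch below, which uses a small made-up DataFrame rather than the Hawaii data, contrasts that behavior with `dropna(subset=["tobs"])`. Restricting the check to `tobs` is an assumption about what the t-test needs, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Small made-up frame mirroring the notebook's columns (values are illustrative only)
sample_df = pd.DataFrame(
    {
        "station": ["USC00519397"] * 4,
        "prcp": [0.08, np.nan, 0.0, np.nan],
        "tobs": [65, 73, 74, 70],
    },
    index=pd.to_datetime(["2010-01-01", "2010-01-06", "2010-06-01", "2010-12-01"]),
)

# how="any" drops every row with a null in any column, including prcp
dropped_any = sample_df.dropna(how="any")

# subset=["tobs"] drops a row only when the temperature itself is missing
dropped_tobs_only = sample_df.dropna(subset=["tobs"])

print(len(dropped_any), len(dropped_tobs_only))
```

Which variant is appropriate depends on whether precipitation matters to the comparison; the results reported in this notebook come from the `how="any"` version.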


### Compare June and December data across all years 

In [25]:
from scipy import stats

In [26]:
# Filter data for june and display a sample
june_df = clean_df[clean_df.index.month == 6]
june_df.head()

date,station,prcp,tobs
2010-06-01,USC00519397,0.0,78
2010-06-02,USC00519397,0.01,76
2010-06-03,USC00519397,0.0,78
2010-06-04,USC00519397,0.0,76
2010-06-05,USC00519397,0.0,77


In [27]:
# Filter data for december and display a sample
dec_df = clean_df[clean_df.index.month == 12]
dec_df.head()

date,station,prcp,tobs
2010-12-01,USC00519397,0.04,76
2010-12-03,USC00519397,0.0,74
2010-12-04,USC00519397,0.0,74
2010-12-06,USC00519397,0.0,64
2010-12-07,USC00519397,0.0,64


In [28]:
# create collections of temperature data
june_temps = june_df["tobs"].to_list()
dec_temps = dec_df["tobs"].to_list()

In [29]:
# Identify the average temperature for June
print(round(mean(june_temps), 2))

74.89


In [30]:
# Identify the average temperature for December
print(round(mean(dec_temps), 2))

70.93
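
The two monthly means can also be computed without building intermediate lists, by grouping on the month of the `DatetimeIndex`. This is a minimal sketch on made-up temperatures (the `temps` frame is hypothetical), shown only to illustrate the grouping pattern.

```python
import pandas as pd

# Hypothetical temperatures indexed by date (not the Hawaii data)
temps = pd.DataFrame(
    {"tobs": [78, 76, 77, 74, 64, 72]},
    index=pd.to_datetime(
        ["2010-06-01", "2010-06-02", "2011-06-03",
         "2010-12-01", "2010-12-06", "2011-12-07"]
    ),
)

# Group rows by calendar month across all years, then average the temperatures
monthly_mean = temps.groupby(temps.index.month)["tobs"].mean()

june_mean = monthly_mean.loc[6]
dec_mean = monthly_mean.loc[12]
print(round(june_mean, 2), round(dec_mean, 2))
```

Applied to `clean_df`, the same `groupby` call would give the mean for every month at once, which makes it easy to see where June and December sit relative to the rest of the year.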


In [31]:
# Run independent (unpaired) t-test
t_val, p_val = stats.ttest_ind(june_temps, dec_temps)
print(f"t-stat: {t_val}, p_value: {p_val}")

t-stat: 30.865349991562194, p_value: 9.8415346259008e-182


### Analysis

    An independent t-test was run to compare the average temperatures in June vs. December.  An independent, or
    unpaired, test was chosen because the June and December temperatures form two separate groups of
    observations rather than matched pairs of measurements.
    
    The results of our t-test tell us that there is a statistically significant difference between these two
    means (74.89 in June vs. 70.93 in December).  Because the p-value (about 9.8e-182) is far below the
    conventional 0.05 threshold, a difference this large would be very unlikely to arise by chance if the two
    true means were equal.  The distance of the t-statistic (about 30.87) from 0 indicates how far our results
    are from the null hypothesis of equal means.
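
As a possible follow-up, the sketch below shows two checks that are sometimes run alongside an independent t-test: Welch's variant (`equal_var=False`), which does not assume the two groups share a variance, and Cohen's d, which expresses the difference between the means in units of standard deviation. The samples here are synthetic stand-ins generated with NumPy, not the Hawaii measurements, and the names `june_sample` and `dec_sample` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the June and December temperature lists (NOT the Hawaii data)
rng = np.random.default_rng(0)
june_sample = rng.normal(loc=75.0, scale=3.0, size=200)
dec_sample = rng.normal(loc=71.0, scale=4.0, size=180)

# Welch's t-test: independent samples, no equal-variance assumption
t_val, p_val = stats.ttest_ind(june_sample, dec_sample, equal_var=False)

# Cohen's d based on the pooled sample standard deviation
n1, n2 = len(june_sample), len(dec_sample)
pooled_sd = np.sqrt(
    ((n1 - 1) * june_sample.var(ddof=1) + (n2 - 1) * dec_sample.var(ddof=1))
    / (n1 + n2 - 2)
)
cohens_d = (june_sample.mean() - dec_sample.mean()) / pooled_sd

print(f"t-stat: {t_val:.2f}, p_value: {p_val:.3g}, Cohen's d: {cohens_d:.2f}")
```

With thousands of observations, even a small difference can produce a tiny p-value, so an effect size gives a complementary view of how large the June vs. December gap is in practical terms.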