In [157]:
#import packages
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import date 
import seaborn as sns
from matplotlib import pyplot as plt

import duckdb, sqlalchemy

%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


## Research Questions

1. Is there a significant difference between historical weather trends when moving about 100 miles away from a central location?
2. Is it possible to accurately model and predict the weather based on another location’s weather?
3. Which of latitude, longitude, or elevation has the most significant impact on temperature? On precipitation?
4. How does proximity to a lake impact historical trends? Is there a significant difference between the variation in temperature or precipitation for Erie and Ithaca compared to the other locations?



## Data Cleaning



We obtained our data through three data requests to weather.gov. First, we requested Ithaca's data, and then we decided to expand our analysis to include locations north, east, south, and west of Ithaca. Our second data pull included the following cities: Watertown (North), Bloomsburg (South), Cobleskill (East), and Avoca (West). The Avoca data did not contain any temperature observations, so we requested data for Erie as a replacement. The cell below reads in the CSV files obtained from weather.gov.

In [158]:
#read in csvs
ithaca = pd.read_csv("Ithaca.csv")
others = pd.read_csv("AdjacentCities.csv")
west = pd.read_csv("Erie.csv")
#print(ithaca.head())
#print(west.head())
others = others.dropna(axis=0,subset=['STATION'])
#print(others.head())



Since certain locations contained multiple weather stations, we created the column Location to group the entries as shown in the cell below.

In [159]:
#Add column to indicate location (relative to ithaca)
ithaca["Location"] = "Central"
west['Location'] = "West"
#print(others['NAME'].unique())
locations = []
for i in others['NAME']:
    if "WATERTOWN" in i:
        locations.append('North')
    elif "BLACK RIVER" in i:
        locations.append("North")
    elif "ESPY" in i:
        locations.append("South")
    elif "BLOOMSBURG" in i:
        locations.append("South")
    elif "COBLESKILL" in i:
        locations.append("East")
    else:
        locations.append('N/A')
others['Location'] = locations
print(others['Location'].unique())


['North' 'South' 'East']
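The same labeling can also be written as a keyword lookup, which keeps the station-to-direction mapping in one place. The following is a minimal sketch, not the code used in this notebook; the station names are made up for illustration, and `keyword_to_location` and `label_location` are hypothetical helper names.

```python
import pandas as pd

# Hypothetical station names for illustration only; the real values come from others['NAME'].
names = pd.Series([
    "WATERTOWN AIRPORT, NY US",
    "BLACK RIVER, NY US",
    "ESPY, PA US",
    "BLOOMSBURG, PA US",
    "COBLESKILL, NY US",
    "SOMEWHERE ELSE, NY US",
])

# Keyword -> direction relative to Ithaca, mirroring the if/elif chain above.
keyword_to_location = {
    "WATERTOWN": "North",
    "BLACK RIVER": "North",
    "ESPY": "South",
    "BLOOMSBURG": "South",
    "COBLESKILL": "East",
}

def label_location(name):
    """Return the direction for the first keyword found in the station name."""
    for keyword, location in keyword_to_location.items():
        if keyword in name:
            return location
    return "N/A"

labels = names.map(label_location)
print(labels.tolist())
```

Any station name that matches none of the keywords is labeled 'N/A', so checking the unique labels afterwards (as the cell above does) confirms that every station was assigned a direction.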


Certain cities had more data attributes available than others, so to keep the datasets consistent, the cell below identifies the columns shared by all three datasets. We then dropped any columns not shared by all three and concatenated the datasets to form one dataframe containing all relevant data.

In [160]:
#Check all columns match before concatenation
columns_to_keep = []

#Keep only the columns present in all three dataframes
others_cols = others.columns.tolist()
west_cols = west.columns.tolist()
for i in ithaca.columns.tolist():
    if (i in others_cols) and (i in west_cols):
        columns_to_keep.append(i)
ithaca_good = ithaca[columns_to_keep]
others_good = others[columns_to_keep]
west_good = west[columns_to_keep]

if (ithaca_good.columns.tolist() == others_good.columns.tolist()) and (ithaca_good.columns.tolist() == west_good.columns.tolist()):
    print("Columns match: proceed")
    final_df = pd.concat([ithaca_good,others_good,west_good])
else:
    print(ithaca.columns)
    print(others.columns)
    print(west.columns)
    
#print(final_df)

Columns match: proceed
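As a rough illustration of the column-matching step, the sketch below applies the same idea to three tiny made-up dataframes. The frames `a`, `b`, and `c` and their values are assumptions for the example, not the real station data.

```python
import pandas as pd

# Toy frames with partly overlapping columns (values are made up for illustration).
a = pd.DataFrame({"STATION": ["A"], "TMAX": [50], "AWND": [3.1]})
b = pd.DataFrame({"STATION": ["B"], "TMAX": [47], "WT01": [1]})
c = pd.DataFrame({"STATION": ["C"], "TMAX": [52], "AWND": [2.0]})

# Columns present in all three frames, in the order they appear in the first one.
common = [col for col in a.columns if col in b.columns and col in c.columns]

# Stack the frames on the shared columns; ignore_index avoids repeated row labels.
combined = pd.concat([a[common], b[common], c[common]], ignore_index=True)
print(common)
print(combined)
```

Passing `ignore_index=True` is a design choice for the sketch: without it, each source frame keeps its own 0-based row labels, so the combined index contains duplicates.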


In the cell below, we checked that none of the columns contained only null values. Every column contained at least one non-null datapoint, but the cell also includes code that would drop any entirely null column had one been found.

In [161]:
#Check for null values in columns
print(final_df.columns)
#If a column has only null values, drop the column
null_cols = []
for c in final_df.columns:
    if final_df[c].isnull().all():
        null_cols.append(c)
if len(null_cols) == 0:
    print("No null columns")
else:
    print('Null columns:', null_cols)
    final_df = final_df.drop(columns=null_cols)

Index(['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'DAPR',
       'DAPR_ATTRIBUTES', 'MDPR', 'MDPR_ATTRIBUTES', 'PRCP', 'PRCP_ATTRIBUTES',
       'SNOW', 'SNOW_ATTRIBUTES', 'SNWD', 'SNWD_ATTRIBUTES', 'TMAX',
       'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'WESD', 'WESD_ATTRIBUTES',
       'WESF', 'WESF_ATTRIBUTES', 'WT05', 'WT05_ATTRIBUTES', 'Location'],
      dtype='object')
No null columns
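For reference, the all-null check can be expressed without an explicit loop. This is a small sketch on made-up data, where `WESD` stands in for a column that happens to be entirely empty; it is not a statement about the real dataset, in which no such column was found.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "TMAX": [50.0, np.nan, 47.0],
    "WESD": [np.nan, np.nan, np.nan],
})

# Boolean mask over columns: True where every value in the column is null.
null_cols = sample.columns[sample.isnull().all()].tolist()

# Drop by column label; drop() without columns= would look for row labels instead.
cleaned = sample.drop(columns=null_cols)

# dropna with how="all" along the column axis gives the same result in one call.
same_result = sample.dropna(axis=1, how="all")
print(null_cols, cleaned.columns.tolist())
```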


Each attributes column in the weather.gov data holds a string of several attribute indicators concatenated together. "Trace" is an attribute that we believe will be relevant to our analysis, and it is represented by a "T" in the attributes columns. We used this to create new binary columns that indicate whether or not there was a trace of precipitation or snow on a given day.

In [162]:
#Create binary columns to indicate if there was trace of precipitation for snow and rain
precip_binary = []
snow_binary = []
for p in final_df['PRCP_ATTRIBUTES']:
    #print(type(p))
    if (type(p) == str):
        if 'T' in p:
            precip_binary.append(1)
        else:
            precip_binary.append(0)
    else:
        precip_binary.append(0)
        
for s in final_df['SNOW_ATTRIBUTES']:
    if (type(s) == str):
        if 'T' in s:
            snow_binary.append(1)
        else:
            snow_binary.append(0)
    else:
        snow_binary.append(0)
    
final_df['PrecipTrace'] = precip_binary
final_df['SnowTrace'] = snow_binary
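The trace flags can also be derived with pandas string methods instead of a loop. The sketch below assumes, as the loops above do, that any "T" in the attribute string marks a trace; the sample attribute strings are invented for illustration and may not match the exact format in the files.

```python
import numpy as np
import pandas as pd

# Invented attribute strings; NaN stands in for days with no attributes recorded.
attrs = pd.Series(["T,,N", ",,N", np.nan, "T,,7"])

# regex=False does a plain substring test; na=False treats missing values as "no trace".
trace = attrs.str.contains("T", regex=False, na=False).astype(int)
print(trace.tolist())
```

Like the loops, this treats a missing attribute string as 0, so a 0 means "no trace recorded" rather than "confirmed no trace".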

With the addition of the binary columns, we no longer need the initial attributes columns and drop them from the dataframe below.

In [163]:
#Drop attributes columns now that we have created the binary columns
final_df = final_df.drop(columns = ['PRCP_ATTRIBUTES', 'SNOW_ATTRIBUTES'])

In addition, we converted the DATE column to datetime.

In [164]:
#Convert date column to datetime
final_df['DATE'] = pd.to_datetime(final_df['DATE'], format = '%m/%d/%Y')
#print(final_df['DATE'])

In some cases, there were multiple stations within the same location category that recorded data on the same day. To account for this, we created an aggregated dataframe that groups by location and date and takes the average of the latitude, longitude, elevation, temperature, precipitation, and snowfall data. For the binary columns, the aggregated dataframe takes the maximum, so that a 1 indicates a trace recorded at any station within the location. We wanted to maintain the binary property of these columns in order to perform logistic regression in the final phase of the project.

In [167]:
#Aggregate dataframe to account for locations with multiple stations taking recordings on the same day
%sql agg_df << select DATE, Location, avg(LATITUDE) as Latitude, avg(LONGITUDE) as Longitude, avg(ELEVATION) as Elevation, AVG(TMAX) as MaxTemp, AVG(TMIN) as MinTemp,AVG(PRCP) as Precipitation, AVG(SNOW) as Snowfall, MAX(PrecipTrace) as PrecipTrace, MAX(SnowTrace) as SnowTrace from final_df group by Location, Date
print(agg_df)

Returning data to local variable agg_df
            DATE Location   Latitude  Longitude  Elevation  MaxTemp  MinTemp  \
0     2010-02-11     West  42.104302 -80.054233    271.325     25.0     19.0   
1     2010-02-12     West  42.104302 -80.054233    271.325     25.0     17.0   
2     2010-03-01     West  42.104302 -80.054233    271.325     34.0     29.0   
3     2010-03-02     West  42.104302 -80.054233    271.325     31.0     27.0   
4     2010-03-18     West  42.104302 -80.054233    271.325     59.0     34.0   
...          ...      ...        ...        ...        ...      ...      ...   
34889 2008-05-13    North  43.976100 -75.875300    151.500     68.0     41.0   
34890 2007-09-25    North  43.976100 -75.875300    151.500      NaN      NaN   
34891 2007-09-07    North  43.976100 -75.875300    151.500      NaN      NaN   
34892 2010-04-05    North  43.976100 -75.875300    151.500     63.0     45.0   
34893 2010-12-28    North  43.976100 -75.875300    151.500     19.0      0.0   
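For readers more familiar with pandas than SQL, the aggregation above corresponds roughly to a `groupby` with named aggregations. The sketch below uses a tiny made-up frame with only a few of the columns; it illustrates the mean/max pattern and is not the query the notebook actually ran.

```python
import pandas as pd

# Two stations reporting for "North" on the same day, plus one later day (made-up values).
sample = pd.DataFrame({
    "DATE": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02"]),
    "Location": ["North", "North", "North"],
    "TMAX": [30.0, 34.0, 28.0],
    "PRCP": [0.0, 0.2, None],
    "PrecipTrace": [0, 1, 0],
})

# Average the measurements, but take the max of the binary flag so that
# a trace at any station marks the whole location for that day.
agg = (
    sample.groupby(["Location", "DATE"], as_index=False)
          .agg(MaxTemp=("TMAX", "mean"),
               Precipitation=("PRCP", "mean"),
               PrecipTrace=("PrecipTrace", "max"))
)
print(agg)
```

Both SQL `AVG` and the pandas mean skip missing values, so a day on which only one of several stations reported still gets that station's value, and a day with no reports at all stays null.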


We also split the aggregated dataframe into smaller dataframes based on location.

In [169]:
#Create 5 separate data frames based on location for individual analyses when needed
#Note: west is rebound here from the raw Erie data to its aggregated rows
central = agg_df[agg_df['Location']=="Central"]
north = agg_df[agg_df['Location']=="North"]
south = agg_df[agg_df['Location']=="South"]
east = agg_df[agg_df['Location']=="East"]
west = agg_df[agg_df['Location']=="West"]
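An alternative to five separate variables is a dictionary keyed by location, built from a single `groupby`. This is only a sketch on made-up rows; the name `frames` is hypothetical and not used elsewhere in the notebook.

```python
import pandas as pd

# Made-up aggregated rows for illustration only.
sample = pd.DataFrame({
    "Location": ["Central", "North", "North", "West"],
    "MaxTemp": [40.0, 35.0, 33.0, 42.0],
})

# One dataframe per location, looked up by name.
frames = {name: group for name, group in sample.groupby("Location")}
print(sorted(frames))
print(frames["North"])
```

A dictionary makes it easy to loop over all locations with the same analysis, while named variables read more naturally when only one or two locations are compared.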

## Data Description



## Data Limitations

Missing values: Temperature, elevation, precipitation, and snowfall are the most consistently recorded quantitative observations across locations and dates. Many of the other columns contain a large share of null values. This may make those observations hard to incorporate, and relying on them could lead to a less accurate model.
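One way to quantify this limitation is to compute the share of null values in each column. The sketch below uses invented values, with `WT05` standing in for a sparsely populated column; the actual shares in the dataset are not shown here.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "TMAX": [50.0, 47.0, np.nan, 52.0],
    "PRCP": [0.0, 0.1, 0.0, 0.3],
    "WT05": [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column, largest first.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```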

Different length of time frames: Five cities are included in this analysis, but the amount of data available differs by city. Ideally, 20 years of history would be included for each city, but this was not available. As a result, there is less data to train a model for certain cities, which could affect the accuracy of the results.
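The difference in coverage can be summarized per location with the first date, last date, and number of distinct days on record. The following is a minimal sketch on made-up dates, not the actual coverage of the dataset.

```python
import pandas as pd

# Made-up dates for illustration only.
sample = pd.DataFrame({
    "Location": ["Central", "Central", "West", "West"],
    "DATE": pd.to_datetime(["2000-01-01", "2019-12-31", "2010-02-11", "2010-02-12"]),
})

# First date, last date, and number of distinct days per location.
coverage = sample.groupby("Location")["DATE"].agg(["min", "max", "nunique"])

# Length of the covered period in days.
coverage["span_days"] = (coverage["max"] - coverage["min"]).dt.days
print(coverage)
```

Comparing `nunique` with `span_days` also hints at gaps: a location whose span is long but whose count of distinct days is small has many missing days in between.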

Limited scope of cities: Only 5 cities are included in this dataset, so conclusions from this analysis are not necessarily applicable to other areas. In particular, the dataset is centered on Ithaca, which has a Northeastern climate, so the findings are unlikely to carry over to a very different setting such as a desert.

Attributes are not very descriptive: The attribute codes attached to the weather observations are not very descriptive. It is unclear what some of them mean, especially in the context of the research questions, which could make it difficult to incorporate the attribute columns into an analysis.


## Exploratory Data Analysis
