# Final Project

## Part 1: Data Collection 

We decided to analyze weather data by latitude and longitude for each date from 2016 to 2021. We use the data from 2016-2020 to build a model that predicts the weather on particular dates. We then generate predictions for every date in 2021 and compare them against the actual 2021 weather conditions to measure how accurate the model is.
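The split described above can be sketched as follows. This is a minimal illustration on made-up rows, assuming a `StartTime(UTC)` timestamp column like the one in the dataset loaded in Part 2. The names `events`, `train`, and `test` are placeholders for illustration only.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
events = pd.DataFrame({
    'StartTime(UTC)': ['2016-01-06 23:14:00', '2019-07-04 12:00:00',
                       '2020-12-31 23:59:00', '2021-03-15 08:30:00'],
    'Type': ['Snow', 'Rain', 'Snow', 'Fog'],
})

# Parse the timestamps so the year can be extracted
events['StartTime(UTC)'] = pd.to_datetime(events['StartTime(UTC)'])
year = events['StartTime(UTC)'].dt.year

# 2016-2020 is used to fit the model, 2021 is held out for comparison
train = events[year.between(2016, 2020)]
test = events[year == 2021]

print(len(train), len(test))
```

Splitting by year rather than at random keeps all of 2021 unseen by the model, which matches the comparison we want to make.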

The goal of this prediction model is to give people an easy way to see what the weather is likely to be throughout the year in a particular area, and so decide whether they want to move to that region. This is particularly useful for people with seasonal affective disorder, who may prefer some climates over others.

We found our data on Kaggle while searching for datasets that record the type of weather on each date at different locations.

## Part 2: Data Management/Representation

First we have to import the libraries that we need to load the dataset. We use pandas, numpy, matplotlib.pyplot, zipfile, and the exists function from os.path. Pandas provides the DataFrame object, which is an easy way to store tabular data. Numpy is used for its math functionality, and matplotlib.pyplot is used to plot graphs showing relationships between variables in our data. We use zipfile to unzip the archive that contains the data, and the exists function from os.path to check whether a file is already present in our directory.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile 
from os.path import exists

Since the data is in a CSV file inside "archive.zip", we have to unzip it and load it into a DataFrame using the pandas read_csv method. First we check whether the .csv file already exists in this directory so that we do not unzip and extract it again.

In [2]:
# unzip archive.zip only if the csv file is not already in the current directory
if not exists('./WeatherEvents_Jan2016-Dec2021.csv'):
    zipfile.ZipFile('./archive.zip', 'r').extractall('.')
    
weather_data = pd.read_csv('WeatherEvents_Jan2016-Dec2021.csv')

# display data
weather_data.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
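As an aside, pandas can also read a CSV member directly from a zip archive without extracting it to disk first. The sketch below is an alternative to the cell above, not what we ran. It builds a tiny stand-in archive so that it runs on its own, and the file names `example_archive.zip` and `example.csv` are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive so the sketch runs on its own
# (file names and contents here are made up)
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'example_archive.zip')
with zipfile.ZipFile(zip_path, 'w') as archive:
    archive.writestr('example.csv', 'Type,Severity\nSnow,Light\nRain,Heavy\n')

# Open the CSV member inside the archive and hand the file object to pandas
with zipfile.ZipFile(zip_path, 'r') as archive:
    with archive.open('example.csv') as csv_file:
        example_data = pd.read_csv(csv_file)

print(example_data)
```

This trades the one-time extraction for decompressing the file on every load, so extracting once (as we do) is likely the better choice for a file of this size.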


Now we have to clean the data. We first drop the columns that we do not need (e.g., EventId, StartTime(UTC)). Then we rearrange and rename the remaining columns so that the dataset is easier to read.

In [3]:
# drop the columns we do not need
weather_data = weather_data.drop(columns=['EventId', 'StartTime(UTC)', 'EndTime(UTC)', 'Precipitation(in)'])
# reorder the remaining columns: location first, then weather details
weather_data = weather_data[['City', 'County', 'State', 'ZipCode', 'LocationLat', 'LocationLng', 'Type', 'Severity', 'TimeZone', 'AirportCode']]
# rename the columns (names are assigned by position, in the order above)
weather_data.columns = ['City', 'County', 'State', 'Zipcode', 'Latitude', 'Longitude', 'Weather_Type', 'Severity', 'TimeZone', 'AirportCode']

weather_data.head()

Unnamed: 0,City,County,State,Zipcode,Latitude,Longitude,Weather_Type,Severity,TimeZone,AirportCode
0,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
1,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
2,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
3,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
4,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
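Assigning to `columns` renames by position, so it depends on the column order being exactly what we expect. A possible alternative is to rename by name with `DataFrame.rename`. The sketch below shows this on one made-up row; the `sample` DataFrame is hypothetical and only the four renamed columns are included.

```python
import pandas as pd

# One made-up row using the original column names
sample = pd.DataFrame({
    'LocationLat': [38.0972],
    'LocationLng': [-106.1689],
    'Type': ['Snow'],
    'ZipCode': [81149.0],
})

# Map old names to new names explicitly; the order of columns does not matter
sample = sample.rename(columns={
    'ZipCode': 'Zipcode',
    'LocationLat': 'Latitude',
    'LocationLng': 'Longitude',
    'Type': 'Weather_Type',
})

print(list(sample.columns))
```

Columns that are not listed in the mapping keep their original names.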


To deal with missing values, we drop any rows that contain NaN or "" values, since those rows cannot be used to make our prediction. The dataset is large enough to absorb the loss: dropping these rows removes 59,234 of 7,479,165 rows, which is under 1% of the data.

In [4]:
print(f"Previous # Rows: {len(weather_data.index)}\n")

# drop rows with NaN in any column
weather_data = weather_data.dropna()

# drop rows where any text column is an empty string
text_columns = ['City', 'County', 'State', 'Weather_Type', 'Severity', 'TimeZone', 'AirportCode']
weather_data = weather_data[(weather_data[text_columns] != "").all(axis=1)]

print(f"Current # Rows: {len(weather_data.index)}")

Previous # Rows: 7479165

Current # Rows: 7419931
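Before dropping rows, it can help to see which columns the missing values come from. The sketch below counts NaN and empty-string values per column on a made-up three-row DataFrame; `sample`, `nan_counts`, `empty_counts`, and `cleaned` are names introduced only for this illustration.

```python
import pandas as pd

# Made-up rows: one empty city name and one NaN zip code
sample = pd.DataFrame({
    'City': ['Saguache', '', 'Denver'],
    'Zipcode': [81149.0, 80014.0, None],
})

# NaN count per column
nan_counts = sample.isna().sum()

# Empty-string count per column
empty_counts = (sample == '').sum()

print(nan_counts)
print(empty_counts)

# Keep only rows with no NaN and no empty string
cleaned = sample.dropna()
cleaned = cleaned[(cleaned != '').all(axis=1)]
print(len(cleaned))
```

Running the same two counts on weather_data would show whether the dropped rows are concentrated in a single column such as Zipcode or City.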


## Part 3: Exploratory Data Analysis

## Part 4: Hypothesis Testing

## Part 5: Communication of Insights Attained