# Final Project

## Part 1: Data Collection 

We decided to analyze weather data in each longitude and lagitude for each date from 2016 to 2021. We are using the data from 2016-2020 to create a model that predicts future weather on particular dates *???for specific areas of location???*. We will be consolidating these predictions for all the dates in 2021 and then comparing them against the actual weather conditions in 2021 to see how accurate our model is. 

The relevance of this prediction model is to create a way for people to easily predict what the weather looks like throughout the year in particular areas and thus figure out if they want to move to that particular region or not. This is particularly effective for people who have seasonal effective disorders and would prefer particular climates over others. 

We got our data from Kaggle after looking for datasets that have types of weather for each date in different locations. 

## Part 2: Data Management/Representation

First we have to import the necessary libraries that we need to load the dataset. We are using pandas, numpy, matplotlib.pyplot, zipfile, and just one method exists from os.path. Pandas is used for the DataFrame object since that is an easy way to store tabular data. Numpy is used for its math functionality and mathplotlib.pyplot is used to plot graphs demonstrating relationships between variables in our data. We use the zipfile import to unzip our file with the data in it and lastly, we use the exists method from os.path to see if a file previously exists in our directory. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile 
from os.path import exists
import folium

Since the data is in a csv file inside the "archive.zip" file, we have to unzip it and load it into a DataFrame using the pandas read_csv method. First we check if the .csv file already exists in this directory so we do not need to unzip and extract it again.

In [2]:
# unzip archive.zip only if csv file is not already in the 
if (not exists('./WeatherEvents_Jan2016-Dec2021.csv')):
    zipfile.ZipFile('./archive.zip', 'r').extractall('.')
    
weather_data = pd.read_csv('WeatherEvents_Jan2016-Dec2021.csv')

# display data
weather_data.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


Now we have to clean the data. *First, we will extract the date of a weather event from the StartTime(UTC) column, which includes the starting date and time of the given event occurence. Time will not be so relevant to us for an event as will date, as we plan on using the date of the events (and the locations) to train our algorithm(s) on.* I am first going to drop columns that we do not need (ex: EventId, StartTime, etc). Then we rearranged and renamed the columns so that it was easier to read the dataset.

In [6]:
#weather_data.columns
weather_data.iloc[3]['StartTime(UTC)']
weather_data = weather_data.rename(columns={"StartTime(UTC)" : "StartTime"}) 

In [7]:
# Creating seperate date column without time aspect to it
weather_data['Date'] = weather_data.apply(lambda row: (row.StartTime.split(" ")[0]), axis=1)

In [8]:
weather_data.head()

Unnamed: 0,EventId,Type,Severity,StartTime,EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode,Date
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0,2016-01-06
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0,2016-01-07
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0,2016-01-07
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0,2016-01-08
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0,2016-01-08


In [9]:
weather_data = weather_data.drop(columns=['EventId', 'StartTime', 'EndTime(UTC)', 'Precipitation(in)'])
weather_data = weather_data[['Date','City', 'County', 'State', 'ZipCode', 'LocationLat', 'LocationLng', 'Type', 'Severity', 'TimeZone', 'AirportCode']]
weather_data.columns = ['Date','City', 'County', 'State', 'Zipcode', 'Latitude', 'Longitude', 'Weather_Type', 'Severity', 'TimeZone', 'AirportCode']

weather_data.head()

Unnamed: 0,Date,City,County,State,Zipcode,Latitude,Longitude,Weather_Type,Severity,TimeZone,AirportCode
0,2016-01-06,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
1,2016-01-07,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
2,2016-01-07,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
3,2016-01-08,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V
4,2016-01-08,Saguache,Saguache,CO,81149.0,38.0972,-106.1689,Snow,Light,US/Mountain,K04V


Now to deal with any missing variables, we are dropping any rows that have NaN or "" values since those rows cannot be used to make our prediction and so it would be easier for us to just not have them. We also have enough data to supplement the values that are going to be lost by dropping rows with missing values.

In [None]:
# I added dropping empty string values for date - AD

In [11]:
print("Previous # Rows: " + str(len(weather_data.index)) + "\n")

weather_data = weather_data.dropna()
weather_data = weather_data[weather_data.City != ""]
weather_data = weather_data[weather_data.Date != ""]
weather_data = weather_data[weather_data.County != ""]
weather_data = weather_data[weather_data.State != ""]
weather_data = weather_data[weather_data.Weather_Type != ""]
weather_data = weather_data[weather_data.Severity != ""]
weather_data = weather_data[weather_data.TimeZone != ""]
weather_data = weather_data[weather_data.AirportCode != ""]

print("Current # Rows: " + str(len(weather_data.index)))

#### RUN ALL CELLS AGAIN TO ENSURE THAT THE TWO NUMBERS BELOW ARE DIFFERENT - I ran mine in different orders
### to test things for the date extracting, so the numbers are the same below rn, but they shouldn't be.

Previous # Rows: 7419931

Current # Rows: 7419931


## Exploratory Data Analysis

### 1) Map Visualization of Most-Common Weather Type

As we can see above, we have almost 750000 rows of data that we can analyze and use, so let's explore the data. 
These are the columns of the dataset we will be working with:

In [12]:
weather_data.columns

Index(['Date', 'City', 'County', 'State', 'Zipcode', 'Latitude', 'Longitude',
       'Weather_Type', 'Severity', 'TimeZone', 'AirportCode'],
      dtype='object')

We can notice that there are State, Latitude/Longitude, City, and County columns - these can give us geographic insight into the spread of our data. We can also see that there's a column called Weather_Type - let's explore the values it contains.

In [13]:
weather_data['Weather_Type'].unique()

array(['Snow', 'Fog', 'Cold', 'Storm', 'Rain', 'Precipitation', 'Hail'],
      dtype=object)

Looking at the values of the Weather_Type column, we can see that noticeably there are various types of weather events: Snow, Fog, Cold, Storm, Rain, Precipitation, and Hail.

To get a view for the spread and number of occurences of different types of weather events, we decided to create a map to tag the different weather events and their placements on the map of US. However, since there are almost 750000 sets of latitude and longitude coordinates, our visualization of all those tagged weather events on a map of the USA will get cluttered almost immediately, and have no purpose to us for analysis. Let's first try analyze one of the United States Census Bureau regions: the North East region of USA, division 1 (New England) - which includes the following states: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont. (https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States#Census_Bureau-designated_regions_and_divisions)

However, before we start plotting, lets check how many events of weather occurred for each of these states to ensure that even this one region can be plotted clearly and be able to be interpreted properly. 

In [14]:
neweng_states = ["CO", "ME", "MA", "NH", "RI", "VT"]
for state in neweng_states:
    curr_state_count = weather_data[weather_data["State"] == state]["State"].count()
    print("{}'s number of occurred/tracked weather events is {}".format(state, curr_state_count))

CO's number of occurred/tracked weather events is 161208
ME's number of occurred/tracked weather events is 79042
MA's number of occurred/tracked weather events is 107782
NH's number of occurred/tracked weather events is 59317
RI's number of occurred/tracked weather events is 22399
VT's number of occurred/tracked weather events is 50141


We can see above that there are over 20,000+ recorded weather event occurences for each state in the New England region. Having over thousands of weather events for even just one state will clutter our interactive map for visualization, so let's narrow down to just one random state - say Maine.

First, we created an interactive map using the Folium library. The Folium library uses _PUT HERE_ to do _PUT HERE_. We centered the map around the geographic center of Maine (45.253333, -69.233333). (https://en.wikipedia.org/wiki/List_of_geographic_centers_of_the_United_States)

In [73]:
map_neweng = folium.Map(location = [45.2533, -69.2333], zoom_start = 6.5)
#map_neweng

In [None]:
## come back to https://deparkes.co.uk/2016/06/03/plot-lines-in-folium/ to add state lines thicker??

We decided to visualize the most common types of weather events that occurred in Maine 2018 for each airport in Maine (mid point of 2016 to 2020 for sake of not cluttering our map) with markers on our map for each most-common unique type of weather (Weather_Type). We will designate Snow to be a red marker, Fog to be a blue marker, Cold to be a green marker, Storm to be a beige marker, Rain to be a purple marker, Precipitation to be a orange marker, and Hail to be a pink marker. Let's see if there were any clusters of most common weather types of events that occurred in Maine 2018 throughout its airports.

In [74]:
# Marks the highest occurence type of weather event for each airport for the whole year of 2018 for Maine

airports = weather_data[weather_data["State"] == "ME"]["AirportCode"].unique()
# Only iterating over Maine's airports in all of 2018
for airport in airports:
    
    # values contains the counts of the given airport's types of weather events for 2018 in Maine
    values = weather_data[weather_data["AirportCode"] == airport]['Weather_Type'].value_counts() 
    # max_weathertype is the most common weather type in 2018 for that airport
    max_weathertype = values.idxmax() 
    
    lat = weather_data[weather_data["AirportCode"] == airport].iloc[0]['Latitude']
    long = weather_data[weather_data["AirportCode"] == airport].iloc[0]['Longitude']
    
    if max_weathertype == "Snow":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='red')).add_to(map_neweng)
    elif max_weathertype == "Fog":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='purple')).add_to(map_neweng)
    elif max_weathertype == "Cold":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='green')).add_to(map_neweng)
    elif max_weathertype == "Storm":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='beige')).add_to(map_neweng)
    elif max_weathertype == "Rain":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='blue')).add_to(map_neweng)
    elif max_weathertype == "Precipitation":
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='orange')).add_to(map_neweng)
    else: # max_weathertype == "Hail"
            folium.Marker(location=[lat, long],
                              icon=folium.Icon(color='pink')).add_to(map_neweng)


In [75]:
map_neweng

Analysis: Looking at the map, we can see that for 2018, Maine's airports, and hence mainly metropolitan areas (DOUBLE CHECK), have almost all most-common weather types as Rain, with only one airport experiencing general Precipitation as its most common weather type for 2018. COME BACK

**IT IS TOO MUCH TO DO ALL STATES VISUALIZATION - I NEED TO CHOOSE A COUPLE MORE STATES TO INDIVIDUALLY VISUALIZE FOR COMPARISON TO MAINE LIKE CALI AND KANSAS SO YOU CAN IGNORE THE CODE BELOW** <br> <br>
We can attempt to do a similar visualization for all of the 50 states of USA to see if the location (latitude and longitude) affects the most-common weather type. We began by creating an interactive map centered on the center coordinates of the United States.

In [70]:
map_usa = folium.Map(location = [39.8283, -98.5795], zoom_start = 4)
#map_usa

### 2) Another Data Analysis

### 3) And another

## Hypothesis testing

## Communication of Insights Attained