# Will it be an early Spring?

On February 2<sup>nd</sup> every year, Punxsutawney Phil predicts whether there will be an early Spring or whether Winter will continue for 6 more weeks (until about mid-March). He is, however, not very accurate (well, according to [The Inner Circle](https://www.groundhog.org/inner-circle) he is 100% correct, but the human handler may not interpret his response correctly). The overall goal is to predict whether it will be an early Spring.

For this project you must go through most steps in the checklist. You must write a response for every item, although sometimes the response will simply be "does not apply". Some of the parts are a bit more nebulous; for those, simply show that you have done things in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in a dedicated section of this document, with no slides or anything like that. It should, however, include the best summary plots/graphics/data points.

You are intentionally given very little information thus far. You must communicate with your client (me) for additional information as necessary. Make sure that your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you (this only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself).

Each group from 200-level and 300-level sections with the best results on the 10% of the data that I kept for myself will earn +5 extra credit (if multiple groups are close, points may be given to multiple groups).

Frame the Problem
----

**1. Define the objective in business terms.**  
    ACME Seed Company is trying to understand weather patterns for its new corn seed product. The company needs to know whether it can guarantee an early spring, because the farmers' product yield depends on it. Weather permitting, farmers can get 2 full harvests with the ACME corn seeds. Our objective is to predict which years will have an early spring and which will not.  
    
**2. How will your solution be used?**  
    If our model can successfully predict whether there will be an early spring (before March 15th), the company will be able to send out a guarantee with its seeds, which should make sales flourish.

**3. What are the current solutions/workarounds (if any)?**   
    Current ways of predicting an early spring are unreliable. Weather is constantly changing and hard to model. The options in use today are Farmer's Almanacs, meteorologists' models, and groundhogs.  

**4. How should you frame this problem (supervised/unsupervised, online/offline, ...)?**  
    This is a supervised, binary classification problem, most likely with an offline system. It is supervised because the historical data comes with labels (whether each year had an early spring), and it is classification rather than regression because the target is a yes/no outcome. For the moment we are keeping it offline because the model does not need a constant stream of new data to stay up to date.  

**5. How should performance be measured? Is the performance measure aligned with the business objective?**   
    Because ACME will guarantee an early spring based on our prediction, a false positive (predicting an early spring that does not happen) is the costliest mistake. Performance will therefore be measured by the precision of our model: of the years we predict as early springs, how many actually were. False negatives (missed early springs) are less of a concern, so recall is secondary. This is aligned with the business objective, because a guarantee that can be trusted is the best result for both the company and the farmers.  
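
To make the difference between the candidate metrics concrete, here is a minimal sketch using made-up labels (the `y_true` and `y_pred` lists and the `precision_recall` helper are purely illustrative and not taken from the project data). Precision asks "of the years we called early, how many really were?", while recall asks "of the real early springs, how many did we catch?".

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating True as 'early spring'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)        # predicted early, was early
    fp = sum((not t) and p for t, p in pairs)  # predicted early, was not
    fn = sum(t and (not p) for t, p in pairs)  # missed an early spring
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only (hypothetical, not project data).
y_true = [True, True, True, False, False, True]
y_pred = [True, False, False, False, False, True]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints: precision=1.00, recall=0.50
```

A cautious model like this one never issues a wrong guarantee (precision 1.0) but misses half of the early springs (recall 0.5). In scikit-learn the same numbers are available through `precision_score` and `recall_score` in `sklearn.metrics`.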

**6. What would be the minimum performance needed to reach the business objective?**  
    We need to predict early springs with very high certainty. There is a slight tolerance for missing a few early springs, but we do not want to tell the company there will be an early spring when winter actually continues. No explicit minimum performance has been specified, but we hold ourselves to a high standard.  

**7. What are comparable problems? Can you reuse experience or tools?**  
    Other weather-related machine learning problems, such as predicting precipitation patterns, could be useful, but beyond that there are not many direct comparisons.  

**8. Is human expertise available?**     
    Humans on their own have almost no ability to predict the weather. Meteorologists may know a bit more, but predicting a whole season is not an easy feat.  

**9. How would you solve the problem manually?**  
    This is definitely not a good problem to solve manually. You could dedicate your life to logging and understanding weather patterns, but machine learning is the best way to approach this problem.  
    
**10. List the assumptions you (or others) have made so far. Verify assumptions if possible.**  
    One assumption is that we need to wait until at least the beginning of February every year to make the prediction. We also assume that March 15th will be the guarantee date every year.  
    

In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingRegressor, StackingClassifier

from sklearn.metrics import accuracy_score, mean_squared_error

Get the Data
--

**1. List the data you need and how much you need**  
We need daily weather data from January 1st to February 2nd of each year, with as many weather-related features as possible. We also need to know which past years had an early spring. Our data should go back as far as possible.  

**2. Find and document where you can get that data**  
Done. Provided by an intern.  

**3. Get access authorizations**  
Done.  

**4. Create a workspace (with enough storage space)**  
Done. We are using Jupyter notebooks in Visual Studio Code.

**5. Get the data**  
Done.  

**6. Convert the data to a format you can easily manipulate (without changing the data itself)**

In [78]:
def load_weather_data():
    """
    Loads the CSV file which contains our data for weather.
    """
    return pd.read_csv('weather.csv')

In [79]:
def load_phil_data():
    """
    Loads the CSV file which contains our data for Phil's predictions.
    """
    return pd.read_csv('phil_pred.csv')

In [80]:
def load_spring_data():
    """
    Loads the CSV file which records whether each year actually had an early spring.
    """
    return pd.read_csv('early_spring.csv')

In [81]:
weather_data = load_weather_data()
phil_data = load_phil_data()
spring_data = load_spring_data()

**7. Ensure sensitive information is deleted or protected (e.g. anonymized)**   
Not needed.

**8. Check the size and type of data (time series, geographical, ...)**  
weather_data:
We have 7 features, 6 of which are floats. The date feature is a string. There are 2211 entries in total.

phil_data (groundhog's predictions):
There are 2 features. One is an int and the other is a bool. There are 60 entries in total.

spring_data (which years were early spring):
There are 2 features. One is an int and the other is a bool. There are 67 entries in total.
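
As a quick sanity check on these sizes, the arithmetic below shows that 2211 weather rows is consistent with one row per day from January 1st through February 2nd for 67 years. This is only a sketch: it assumes that every labeled year has a complete set of daily rows, which the counts suggest but do not prove.

```python
from datetime import date

# Days from January 1st through February 2nd, inclusive (any year works).
days_per_year = (date(2021, 2, 2) - date(2021, 1, 1)).days + 1

# 67 is the number of entries reported for spring_data.
n_years = 67

print(days_per_year)            # 33
print(days_per_year * n_years)  # 2211
```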


In [19]:
weather_data.info()
weather_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2211 entries, 0 to 2210
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           2211 non-null   object 
 1   max_temp       2167 non-null   float64
 2   min_temp       2170 non-null   float64
 3   avg_temp       2160 non-null   float64
 4   precipitation  2208 non-null   float64
 5   snowfall       2198 non-null   float64
 6   snowdepth      2174 non-null   float64
dtypes: float64(6), object(1)
memory usage: 121.0+ KB


| | max_temp | min_temp | avg_temp | precipitation | snowfall | snowdepth |
|---|---|---|---|---|---|---|
| count | 2167.0 | 2170.0 | 2160.0 | 2208.0 | 2198.0 | 2174.0 |
| mean | 36.19197 | 18.410138 | 27.284028 | 0.111475 | 0.463889 | 2.503059 |
| std | 12.252389 | 12.755078 | 11.704472 | 0.228599 | 1.196003 | 4.282987 |
| min | -6.0 | -26.0 | -15.0 | 0.0 | 0.0 | 0.0 |
| 25% | 28.0 | 10.0 | 19.375 | 0.0 | 0.0 | 0.0 |
| 50% | 36.0 | 20.0 | 28.0 | 0.01 | 0.0 | 1.0 |
| 75% | 44.0 | 28.0 | 35.125 | 0.12 | 0.5 | 3.0 |
| max | 75.0 | 51.0 | 60.0 | 2.56 | 21.0 | 30.0 |


In [20]:
weather_data['date'].apply(lambda x: type(x) == str).all()

True

In [21]:
phil_data.info()
phil_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        60 non-null     int64
 1   prediction  60 non-null     bool 
dtypes: bool(1), int64(1)
memory usage: 668.0 bytes


| | year |
|---|---|
| count | 60.0 |
| mean | 1984.266667 |
| std | 21.866984 |
| min | 1947.0 |
| 25% | 1966.75 |
| 50% | 1983.5 |
| 75% | 2003.25 |
| max | 2021.0 |


In [53]:
spring_data.info()
spring_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   year          67 non-null     int64
 1   early_spring  67 non-null     bool 
dtypes: bool(1), int64(1)
memory usage: 731.0 bytes


| | year |
|---|---|
| count | 67.0 |
| mean | 1983.059701 |
| std | 22.206934 |
| min | 1947.0 |
| 25% | 1964.5 |
| 50% | 1982.0 |
| 75% | 2002.5 |
| max | 2021.0 |


**9. Sample a test set, put it aside, and never look at it (no data snooping!)**  

In [82]:
# Convert the date strings to pandas datetimes
weather_data['date'] = pd.to_datetime(weather_data['date'])

In [83]:
def convert_dates_to_year():
    """
    Adds a 'year' column and a zero-based 'day_of_year' column
    (0 = January 1st, ..., 32 = February 2nd) to weather_data.
    """
    weather_data['year'] = weather_data['date'].dt.year
    # dayofyear is 1-based, so subtract 1 to number each year's 33 days as 0-32
    weather_data['day_of_year'] = weather_data['date'].dt.dayofyear - 1

In [84]:
convert_dates_to_year()

In [90]:
pivot_weather = weather_data.pivot(index='year', columns='day_of_year')
pivot_weather

| year | (date, 0) | (date, 1) | (date, 2) | (date, 3) | (date, 4) | (date, 5) | (date, 6) | (date, 7) | (date, 8) | (date, 9) | ... | (snowdepth, 23) | (snowdepth, 24) | (snowdepth, 25) | (snowdepth, 26) | (snowdepth, 27) | (snowdepth, 28) | (snowdepth, 29) | (snowdepth, 30) | (snowdepth, 31) | (snowdepth, 32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1947 | 1947-01-01 | 1947-01-02 | 1947-01-03 | 1947-01-04 | 1947-01-05 | 1947-01-06 | 1947-01-07 | 1947-01-08 | 1947-01-09 | 1947-01-10 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000 |
| 1948 | 1948-01-01 | 1948-01-02 | 1948-01-03 | 1948-01-04 | 1948-01-05 | 1948-01-06 | 1948-01-07 | 1948-01-08 | 1948-01-09 | 1948-01-10 | ... | 8.000 | 10.000 | 10.000 | 9.000 | 9.000 | 8.000 | 5.000 | 5.000 | 5.000 | 4.000 |
| 1949 | 1949-01-01 | 1949-01-02 | 1949-01-03 | 1949-01-04 | 1949-01-05 | 1949-01-06 | 1949-01-07 | 1949-01-08 | 1949-01-09 | 1949-01-10 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.025 | 0.025 | 2.000 | 2.000 | 2.000 |
| 1950 | 1950-01-01 | 1950-01-02 | 1950-01-03 | 1950-01-04 | 1950-01-05 | 1950-01-06 | 1950-01-07 | 1950-01-08 | 1950-01-09 | 1950-01-10 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | 0.000 |
| 1951 | 1951-01-01 | 1951-01-02 | 1951-01-03 | 1951-01-04 | 1951-01-05 | 1951-01-06 | 1951-01-07 | 1951-01-08 | 1951-01-09 | 1951-01-10 | ... | 0.000 | 0.025 | 1.000 | 1.000 | 0.025 | 0.000 | 0.000 | 0.025 | 0.025 | 0.025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2017 | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 | 2017-01-06 | 2017-01-07 | 2017-01-08 | 2017-01-09 | 2017-01-10 | ... | 0.025 | 0.000 | 0.000 | 0.025 | 2.000 | 2.000 | 2.000 | 4.000 | 1.000 | 0.000 |
| 2018 | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 | 2018-01-06 | 2018-01-07 | 2018-01-08 | 2018-01-09 | 2018-01-10 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.025 | 1.000 |
| 2019 | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | 2019-01-10 | ... | 0.000 | 1.000 | 0.025 | 1.000 | 0.025 | 0.025 | 0.025 | 0.000 | 1.000 | 2.000 |
| 2020 | 2020-01-01 | 2020-01-02 | 2020-01-03 | 2020-01-04 | 2020-01-05 | 2020-01-06 | 2020-01-07 | 2020-01-08 | 2020-01-09 | 2020-01-10 | ... | 0.000 | 0.000 | 0.000 | 0.025 | 0.000 | 0.025 | 0.025 | 0.000 | 0.000 | 0.025 |


In [91]:
pivot_weather.columns = ["_".join(str(x) for x in a) for a in pivot_weather.columns.to_flat_index()]
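
The flattening step above is easier to follow on a tiny example. The sketch below applies the same list comprehension to a few hand-written `(variable, day_of_year)` tuples, which is the shape `to_flat_index()` yields for the pivoted columns; the tuples and the `example_columns` name are illustrative only.

```python
# A few (variable, day_of_year) pairs, as produced by the pivot's column index.
example_columns = [("max_temp", 0), ("max_temp", 1), ("snowdepth", 32)]

# Same expression as in the notebook cell: join each tuple's parts with "_".
flat_names = ["_".join(str(x) for x in a) for a in example_columns]

print(flat_names)  # ['max_temp_0', 'max_temp_1', 'snowdepth_32']
```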

In [92]:
def merge_spring_and_weather_data():
    return pd.merge(pivot_weather, spring_data, on='year', how='inner')

In [98]:
data = merge_spring_and_weather_data()
data

| | year | date_0 | date_1 | date_2 | date_3 | date_4 | date_5 | date_6 | date_7 | date_8 | ... | snowdepth_24 | snowdepth_25 | snowdepth_26 | snowdepth_27 | snowdepth_28 | snowdepth_29 | snowdepth_30 | snowdepth_31 | snowdepth_32 | early_spring |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1947 | 1947-01-01 | 1947-01-02 | 1947-01-03 | 1947-01-04 | 1947-01-05 | 1947-01-06 | 1947-01-07 | 1947-01-08 | 1947-01-09 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000 | True |
| 1 | 1948 | 1948-01-01 | 1948-01-02 | 1948-01-03 | 1948-01-04 | 1948-01-05 | 1948-01-06 | 1948-01-07 | 1948-01-08 | 1948-01-09 | ... | 10.000 | 10.000 | 9.000 | 9.000 | 8.000 | 5.000 | 5.000 | 5.000 | 4.000 | False |
| 2 | 1949 | 1949-01-01 | 1949-01-02 | 1949-01-03 | 1949-01-04 | 1949-01-05 | 1949-01-06 | 1949-01-07 | 1949-01-08 | 1949-01-09 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.025 | 0.025 | 2.000 | 2.000 | 2.000 | True |
| 3 | 1950 | 1950-01-01 | 1950-01-02 | 1950-01-03 | 1950-01-04 | 1950-01-05 | 1950-01-06 | 1950-01-07 | 1950-01-08 | 1950-01-09 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | 0.000 | False |
| 4 | 1951 | 1951-01-01 | 1951-01-02 | 1951-01-03 | 1951-01-04 | 1951-01-05 | 1951-01-06 | 1951-01-07 | 1951-01-08 | 1951-01-09 | ... | 0.025 | 1.000 | 1.000 | 0.025 | 0.000 | 0.000 | 0.025 | 0.025 | 0.025 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62 | 2017 | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 | 2017-01-06 | 2017-01-07 | 2017-01-08 | 2017-01-09 | ... | 0.000 | 0.000 | 0.025 | 2.000 | 2.000 | 2.000 | 4.000 | 1.000 | 0.000 | False |
| 63 | 2018 | 2018-01-01 | 2018-01-02 | 2018-01-03 | 2018-01-04 | 2018-01-05 | 2018-01-06 | 2018-01-07 | 2018-01-08 | 2018-01-09 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.025 | 1.000 | False |
| 64 | 2019 | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | ... | 1.000 | 0.025 | 1.000 | 0.025 | 0.025 | 0.025 | 0.000 | 1.000 | 2.000 | True |
| 65 | 2020 | 2020-01-01 | 2020-01-02 | 2020-01-03 | 2020-01-04 | 2020-01-05 | 2020-01-06 | 2020-01-07 | 2020-01-08 | 2020-01-09 | ... | 0.000 | 0.000 | 0.025 | 0.000 | 0.025 | 0.025 | 0.000 | 0.000 | 0.025 | True |


In [96]:
copy_data = data.copy()
train_set, test_set = train_test_split(copy_data, test_size=0.2, random_state=250)
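
With only 67 labeled years, a purely random split can leave the test set with a noticeably different share of early springs than the training set. One common remedy is a stratified split, which samples each class separately. The sketch below shows the idea using only the standard library; the `stratified_split` helper and the 40/27 label mix are hypothetical and not taken from the project data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=250):
    """Return (train_idx, test_idx) with a similar label mix in both parts."""
    rng = random.Random(seed)

    # Group row positions by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 40 early springs and 27 late ones (67 years in total).
labels = [True] * 40 + [False] * 27
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In scikit-learn the same effect is available by passing `stratify=copy_data['early_spring']` to `train_test_split`.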

Explore the Data
--

**1. Copy the data for exploration, downsampling to a manageable size if necessary.**  
Downsampling is not necessary.

In [97]:
early_spring_data = data.copy()

**2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, ...); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, ...); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, ...)**  

In [99]:
early_spring_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67 entries, 0 to 66
Columns: 233 entries, year to early_spring
dtypes: bool(1), datetime64[ns](33), float64(198), int64(1)
memory usage: 122.0 KB
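
For the "% of missing values" part of this step, the per-column figures for the daily weather data can be worked out from the non-null counts that `weather_data.info()` printed earlier in this notebook. The sketch below simply redoes that arithmetic; the counts are copied from that output, and the dictionary names are illustrative.

```python
# Non-null counts copied from the earlier weather_data.info() output.
total_rows = 2211
non_null = {
    "max_temp": 2167,
    "min_temp": 2170,
    "avg_temp": 2160,
    "precipitation": 2208,
    "snowfall": 2198,
    "snowdepth": 2174,
}

# Missing count per column is the total minus the non-null count.
missing = {col: total_rows - count for col, count in non_null.items()}

for col, n_missing in missing.items():
    print(f"{col:<14}{n_missing:>4} missing ({n_missing / total_rows:.2%})")
```

On the DataFrame itself, `weather_data.isna().mean()` gives the same fractions directly.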


**3. For supervised learning tasks, identify the target attribute(s)**  

**4. Visualize the data**  

**5. Study the correlations between attributes**  

**6. Study how you would solve the problem manually**  

**7. Identify the promising transformations you may want to apply**  

**8. Identify extra data that would be useful (go back to “Get the Data”)**  

**9. Document what you have learned**  
We have 263 rows of data missing.

In [None]:
def read_temperature_data(filename):
    """
    Reads temperature data from the given file. M values are assumed to be
    missing values (returned as nan). T values are trace values and returned as
    0.0025 inches for precipitation and snowfall and 0.025 inches for snowdepth
    (see https://www.chicagotribune.com/news/weather/ct-wea-asktom-0415-20180413-column.html).
    """
    def convert_precipitation(raw):
        return 0.0025 if raw == 'T' else np.nan if raw == 'M' else pd.to_numeric(raw)
    def convert_depth(raw):
        return 0.025 if raw == 'T' else np.nan if raw == 'M' else pd.to_numeric(raw)
    return pd.read_csv(filename, na_values=['M'], parse_dates=[0],
        converters={
            "precipitation":convert_precipitation,
            "snowfall":convert_precipitation,
            "snowdepth":convert_depth,
        })