<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Instructions" data-toc-modified-id="Instructions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Instructions</a></span></li><li><span><a href="#Find-a-new-data-set" data-toc-modified-id="Find-a-new-data-set-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Find a new data set</a></span></li><li><span><a href="#Clean-the-data" data-toc-modified-id="Clean-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Clean the data</a></span><ul class="toc-item"><li><span><a href="#Dropping/replacing-non-numerical-values" data-toc-modified-id="Dropping/replacing-non-numerical-values-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Dropping/replacing non-numerical values</a></span></li></ul></li><li><span><a href="#Split-into-Train-and-Test" data-toc-modified-id="Split-into-Train-and-Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Split into Train and Test</a></span></li><li><span><a href="#RandomForestRegression" data-toc-modified-id="RandomForestRegression-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>RandomForestRegression</a></span></li><li><span><a href="#GradientBoostingRegression" data-toc-modified-id="GradientBoostingRegression-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>GradientBoostingRegression</a></span></li></ul></div>

***

# Week 10 Item 3

## Instructions

At this point in the class, I hope you are familiar enough with the methods of an sklearn estimator to use a new one, even one you have not seen before.

Create a Jupyter notebook where you do the following steps:

- [x] Find a new data set, appropriate for regression, that you have not seen before.


- [x] Clean the data by either dropping or replacing non-numerical observations.


- [x] Split the data into a training and testing set.


- [x] Fit either a RandomForestRegressor or a GradientBoostingRegressor to your training data. Assess the score of the estimator that you chose in Step #4 on both the training data and the testing data.

> I know the assignment only asked for one model or the other, but I wanted to practice setting up a GradientBoostingRegressor.


- [x] Submit the Jupyter notebook you created as an attachment to your submission below.

***

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## Find a new data set

Predicting Temps: [Weather Conditions in World War II](https://www.kaggle.com/smid80/weatherww2/data?select=Summary+of+Weather.csv)

In [2]:
weather_data = pd.read_csv("ww2weather.csv", low_memory=False)
weather_data.head()

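| | STA | Date | Precip | WindGustSpd | MaxTemp | MinTemp | MeanTemp | Snowfall | PoorWeather | YR | ... | FB | FTI | ITH | PGT | TSHDSBRSGF | SD3 | RHX | RHN | RVG | WTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | 1942-7-1 | 1.016 | | 25.555556 | 22.222222 | 23.888889 | 0 | | 42 | ... | | | | | | | | | | |
| 1 | 10001 | 1942-7-2 | 0.0 | | 28.888889 | 21.666667 | 25.555556 | 0 | | 42 | ... | | | | | | | | | | |
| 2 | 10001 | 1942-7-3 | 2.54 | | 26.111111 | 22.222222 | 24.444444 | 0 | | 42 | ... | | | | | | | | | | |
| 3 | 10001 | 1942-7-4 | 2.54 | | 26.666667 | 22.222222 | 24.444444 | 0 | | 42 | ... | | | | | | | | | | |
| 4 | 10001 | 1942-7-5 | 0.0 | | 26.666667 | 21.666667 | 24.444444 | 0 | | 42 | ... | | | | | | | | | | |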


***

## Clean the data

### Dropping/replacing non-numerical values

The data source shows that some columns are 100% null. We will drop those too.

In [3]:
# View data types
weather_data.dtypes

STA              int64
Date            object
Precip          object
WindGustSpd    float64
MaxTemp        float64
MinTemp        float64
MeanTemp       float64
Snowfall        object
PoorWeather     object
YR               int64
MO               int64
DA               int64
PRCP            object
DR             float64
SPD            float64
MAX            float64
MIN            float64
MEA            float64
SNF             object
SND            float64
FT             float64
FB             float64
FTI            float64
ITH            float64
PGT            float64
TSHDSBRSGF      object
SD3            float64
RHX            float64
RHN            float64
RVG            float64
WTE            float64
dtype: object

In [4]:
# drop columns with object dtypes
weather_data = weather_data.select_dtypes(exclude=[object])
weather_data.dtypes

STA              int64
WindGustSpd    float64
MaxTemp        float64
MinTemp        float64
MeanTemp       float64
YR               int64
MO               int64
DA               int64
DR             float64
SPD            float64
MAX            float64
MIN            float64
MEA            float64
SND            float64
FT             float64
FB             float64
FTI            float64
ITH            float64
PGT            float64
SD3            float64
RHX            float64
RHN            float64
RVG            float64
WTE            float64
dtype: object
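
Dropping every object column also discards columns such as `Precip`, whose first rows look numeric. As a rough sketch of the "replacing" alternative, the example below uses a made-up frame in which a hypothetical `"T"` marker stands in for a non-numeric entry (the markers in the real file are not shown here), and the fill value of `0.0` is likewise an assumption:

```python
import pandas as pd

# Made-up data for illustration; "T" is a hypothetical non-numeric marker.
toy = pd.DataFrame({"Precip": ["1.016", "0", "T", "2.54"],
                    "MaxTemp": [25.5, 28.9, 26.1, 26.7]})

# Convert text to numbers; anything unparseable becomes NaN instead of raising.
toy["Precip"] = pd.to_numeric(toy["Precip"], errors="coerce")

# Replace the resulting NaN with a chosen value (0.0 is an assumption here).
toy["Precip"] = toy["Precip"].fillna(0.0)

print(toy.dtypes)
```

Whether a fill value like this is sensible depends on what the non-numeric entries mean in the data set, so this is one option rather than a recommendation.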

In [5]:
# drop every column that contains at least one null (this also removes the all-null columns)
weather_data = weather_data.drop(weather_data.columns[weather_data.isna().any()], axis=1)
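
The `isna().any()` selection removes a column as soon as it has a single missing value, which is broader than removing only the columns that are entirely null. A small sketch on a made-up frame (the column names `a`, `b`, `c` are placeholders) shows the difference using `dropna`:

```python
import numpy as np
import pandas as pd

# Made-up frame: "a" is complete, "b" is partly missing, "c" is entirely missing.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                    "b": [1.0, np.nan, 3.0],
                    "c": [np.nan, np.nan, np.nan]})

# Drop only the columns where every value is missing.
only_all_null_dropped = toy.dropna(axis=1, how="all")

# Drop every column that has at least one missing value
# (same result as selecting columns with isna().any() and dropping them).
any_null_dropped = toy.dropna(axis=1, how="any")

print(list(only_all_null_dropped.columns))  # ['a', 'b']
print(list(any_null_dropped.columns))       # ['a']
```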

In [6]:
weather_data.head()

| | STA | MaxTemp | MinTemp | MeanTemp | YR | MO | DA |
|---|---|---|---|---|---|---|---|
| 0 | 10001 | 25.555556 | 22.222222 | 23.888889 | 42 | 7 | 1 |
| 1 | 10001 | 28.888889 | 21.666667 | 25.555556 | 42 | 7 | 2 |
| 2 | 10001 | 26.111111 | 22.222222 | 24.444444 | 42 | 7 | 3 |
| 3 | 10001 | 26.666667 | 22.222222 | 24.444444 | 42 | 7 | 4 |
| 4 | 10001 | 26.666667 | 21.666667 | 24.444444 | 42 | 7 | 5 |


***

## Split into Train and Test

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X = weather_data.MaxTemp.values.reshape(-1,1)
y = weather_data.MinTemp

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, test_size=.3)
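
Without a `random_state`, `train_test_split` shuffles differently on each run, so the scores further down will vary slightly between runs. A minimal sketch on synthetic stand-in data (not the weather data; `X_demo` and `y_demo` are placeholder names) showing that a fixed seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the weather data.
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

# The same random_state yields the same rows in each split on every call.
split_a = train_test_split(X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=1)
split_b = train_test_split(X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=1)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b, X_test_b = split_b[0], split_b[1]

print(np.array_equal(X_train_a, X_train_b))  # True
print(np.array_equal(X_test_a, X_test_b))    # True
```

The seed value `1` is arbitrary; any fixed integer gives a repeatable split.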

***

## RandomForestRegression

In [10]:
from sklearn.ensemble import RandomForestRegressor

In [11]:
forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=1, verbose=0, warm_start=False)

In [12]:
# train score
forest.score(X_train, y_train)

0.8166129716594113

In [13]:
# test score
forest.score(X_test, y_test)

0.8138039978851664
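
For sklearn regressors, `score` returns the coefficient of determination (R²) of the predictions, so both numbers above are R² values. A small sketch on synthetic data (a noisy line, not the weather data; `demo_forest` and the other names are placeholders) checks that `score` agrees with `r2_score` applied to `predict`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data: a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 40, size=(200, 1))
y_demo = 0.8 * X_demo[:, 0] - 5 + rng.normal(0, 3, size=200)

demo_forest = RandomForestRegressor(n_estimators=20, random_state=1)
demo_forest.fit(X_demo, y_demo)

# score(X, y) is the R^2 of predict(X) measured against y.
via_score = demo_forest.score(X_demo, y_demo)
via_metric = r2_score(y_demo, demo_forest.predict(X_demo))

print(np.isclose(via_score, via_metric))  # True
```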

***

## GradientBoostingRegression

In [14]:
from sklearn.ensemble import GradientBoostingRegressor

In [15]:
gb = GradientBoostingRegressor(n_estimators=100, random_state=1)
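# Note: fitting on the full X, y exposes the model to the test rows, so the
# test score below is not a held-out estimate; gb.fit(X_train, y_train) avoids this.
gb.fit(X, y)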

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=1, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [16]:
gb.score(X_train, y_train)

0.8165035385234691

In [17]:
gb.score(X_test, y_test)

0.8143305941708944
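
The cell that fits `gb` uses the full `X, y`, so the rows in `X_test` were part of the training data when the test score was computed. A sketch of the held-out version, fitting on the training split only, is shown here on synthetic data (not the weather data; `demo_gb`, `X_tr`, `X_te` and the other names are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 40, size=(300, 1))
y_demo = 0.8 * X_demo[:, 0] - 5 + rng.normal(0, 3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

demo_gb = GradientBoostingRegressor(n_estimators=100, random_state=1)
demo_gb.fit(X_tr, y_tr)  # training rows only, so X_te stays unseen

train_r2 = demo_gb.score(X_tr, y_tr)
test_r2 = demo_gb.score(X_te, y_te)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

With this setup the test score reflects data the model has not seen, which is what makes comparing it with the training score meaningful.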