Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
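
The class-balance check behind the metric choice can be sketched in a few lines. The helper names and the toy labels below are illustrative assumptions, not part of the assignment; the 50% to 70% band is the rule of thumb stated in the checklist.

```python
import pandas as pd


def majority_class_frequency(target: pd.Series) -> float:
    """Return the share of rows that belong to the most common class."""
    return float(target.value_counts(normalize=True).max())


def accuracy_is_reasonable(target: pd.Series, low: float = 0.50, high: float = 0.70) -> bool:
    """Rule of thumb: accuracy is fine when low <= majority share < high."""
    return low <= majority_class_frequency(target) < high


# Toy labels (hypothetical): 8 of 10 rows are "No".
labels = pd.Series(["No"] * 8 + ["Yes"] * 2)
share = majority_class_frequency(labels)
print(f"majority share: {share:.2f}, accuracy OK: {accuracy_is_reasonable(labels)}")
```

With an 80% majority class the helper returns `False`, which is the cue to look at precision, recall, or ROC AUC as well.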

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)


_______________________________________

*Unit 2, Sprint 3, Module 2*

# Wrangle ML datasets

- [ ] Continue to clean and explore your data.
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
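
For a classification target, one way to answer the "just by guessing" question is a majority-class baseline. This is a minimal sketch on made-up labels; `DummyClassifier` is scikit-learn's baseline estimator, and the 70/30 data here are an assumption for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 70 negatives and 30 positives.
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(baseline_accuracy)
```

Any first model should score above this number on the same metric before it is worth tuning.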
    
    
_______________________________________

# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


In [97]:
# Running Jupyter locally, so the dataset is read
# from a local path as well.

source_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'

import pandas as pd
import numpy as np
import pandas_profiling

weather = pd.read_csv('../../datasets/weatherAUS.csv') # Local
# weather = pd.read_csv('weatherAUS.csv') # Colab

def cel_to_far(x):
    '''Convert a temperature from Celsius to Fahrenheit.'''
    return float(x * 1.8 + 32)

In [98]:
original = weather.copy()

In [99]:
# pandas_profiling.ProfileReport(weather)

In [100]:
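# Convert the daily extremes to Fahrenheit.
# Note: Temp9am and Temp3pm are not converted and stay in Celsius.
weather['MinTemp'] = weather['MinTemp'].apply(cel_to_far)
weather['MaxTemp'] = weather['MaxTemp'].apply(cel_to_far)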

In [101]:
weather.describe()
# Several features have a high share of NaNs (compare the `count` row)
# and will need imputing, most likely with the column mean.
# Columns to impute: Evaporation, Sunshine, Cloud9am, Cloud3pm

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RISK_MM
count,141556.0,141871.0,140787.0,81350.0,74377.0,132923.0,140845.0,139563.0,140419.0,138583.0,128179.0,128212.0,88536.0,85099.0,141289.0,139467.0,142193.0
mean,53.93552,73.808212,2.349974,5.469824,7.624853,39.984292,14.001988,18.637576,68.84381,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.987509,21.687235,2.360682
std,11.525909,12.811713,8.465173,4.188537,3.781525,13.588801,8.893337,8.803345,19.051293,20.797772,7.105476,7.036677,2.887016,2.720633,6.492838,6.937594,8.477969
min,16.7,23.36,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4,0.0
25%,45.68,64.22,0.0,2.6,4.9,31.0,7.0,13.0,57.0,37.0,1012.9,1010.4,1.0,2.0,12.3,16.6,0.0
50%,53.6,72.68,0.0,4.8,8.5,39.0,13.0,19.0,70.0,52.0,1017.6,1015.2,5.0,5.0,16.7,21.1,0.0
75%,62.24,82.76,0.8,7.4,10.6,48.0,19.0,24.0,83.0,66.0,1022.4,1020.0,7.0,7.0,21.6,26.4,0.8
max,93.02,118.58,371.0,145.0,14.5,135.0,130.0,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7,371.0
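
The `count` row is what points to imputation. A short sketch of how the missing share per column could be ranked directly; the tiny frame is made up, and with the real data `df` would be `weather`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the weather data.
df = pd.DataFrame({
    "Sunshine": [np.nan, 9.2, np.nan, 0.0],
    "Evaporation": [np.nan, 2.4, 3.0, 7.4],
    "MaxTemp": [79.52, 67.10, 67.10, 68.18],
})

# Fraction of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```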


In [102]:
yes_rain = weather[weather['RainTomorrow'] == 'Yes']

In [103]:
yes_rain

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
8,2008-12-09,Albury,49.46,89.42,0.0,,,NNW,80.0,SE,...,9.0,1008.9,1003.6,,,18.3,30.2,No,1.4,Yes
10,2008-12-11,Albury,56.12,86.72,0.0,,,N,30.0,SSE,...,22.0,1011.8,1008.7,,,20.4,28.8,No,2.2,Yes
11,2008-12-12,Albury,60.62,71.06,2.2,,,NNE,31.0,NE,...,91.0,1010.5,1004.2,8.0,8.0,15.9,17.0,Yes,15.6,Yes
12,2008-12-13,Albury,60.62,65.48,15.6,,,W,61.0,NNW,...,93.0,994.3,993.0,8.0,8.0,17.4,15.8,Yes,3.6,Yes
15,2008-12-17,Albury,57.38,69.62,0.0,,,ENE,22.0,SSW,...,82.0,1012.2,1010.4,8.0,1.0,17.2,18.1,No,16.8,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142059,2017-02-10,Uluru,76.64,103.64,0.4,,,WNW,65.0,E,...,24.0,1007.0,1003.6,,,31.5,36.6,No,6.2,Yes
142124,2017-04-17,Uluru,66.74,75.92,0.0,,,W,35.0,ESE,...,91.0,1015.9,1013.9,8.0,8.0,21.3,18.5,No,6.8,Yes
142125,2017-04-18,Uluru,59.36,70.70,6.8,,,ENE,30.0,NE,...,65.0,1016.9,1015.3,3.0,8.0,19.0,21.2,Yes,12.6,Yes
142126,2017-04-19,Uluru,63.86,80.42,12.6,,,S,35.0,E,...,59.0,1018.1,1014.7,7.0,8.0,19.0,26.0,Yes,34.6,Yes


In [104]:
# Impute columns
from sklearn.impute import SimpleImputer
col_to_impute = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
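
The column list is defined here but not applied in this cell. A minimal sketch of mean-imputing only selected columns; the sample frame and its values are assumptions for illustration. In a real workflow the imputer would be fit on the training split only and then applied to validation and test.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample; only the listed columns are imputed.
sample = pd.DataFrame({
    "Evaporation": [2.0, np.nan, 4.0],
    "Sunshine": [np.nan, 8.0, 10.0],
    "MaxTemp": [70.0, np.nan, 80.0],  # not in the list, left untouched
})
cols = ["Evaporation", "Sunshine"]

imputer = SimpleImputer(strategy="mean")
imputed_sample = sample.copy()
# fit_transform returns an array with one column per entry in `cols`.
imputed_sample[cols] = imputer.fit_transform(sample[cols])
print(imputed_sample)
```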

In [105]:
# Basic exploration of dataset:
weather.describe(exclude='number')

Unnamed: 0,Date,Location,WindGustDir,WindDir9am,WindDir3pm,RainToday,RainTomorrow
count,142193,142193,132863,132180,138415,140787,142193
unique,3436,49,16,16,16,2,2
top,2013-04-21,Canberra,W,N,SE,No,No
freq,49,3418,9780,11393,10663,109332,110316


In [106]:
features = weather.columns

In [107]:
# Target variable, checking the baseline for classification:
# about 77.6% of days are labeled "No" for "Rain tomorrow?",
# so always guessing "No" would score roughly 0.776 accuracy.
weather['RainTomorrow'].value_counts(normalize=True)

No     0.775819
Yes    0.224181
Name: RainTomorrow, dtype: float64

In [108]:
weather['RISK_MM'].value_counts(normalize=True).sort_values()

74.4     0.000007
144.2    0.000007
11.1     0.000007
99.2     0.000007
148.6    0.000007
           ...   
0.8      0.014452
0.6      0.018222
0.4      0.026591
0.2      0.061620
0.0      0.640517
Name: RISK_MM, Length: 681, dtype: float64

In [109]:
# Flag rows where RISK_MM (the next day's rainfall in mm) is at least 0.3.
weather['Rained'] = weather['RISK_MM'] >= 0.3

In [110]:
weather['Rained'].value_counts(normalize=True)

False    0.703241
True     0.296759
Name: Rained, dtype: float64

In [111]:
pd.crosstab(weather['Rained'], weather['RainToday'])

RainToday,No,Yes
Rained,Unnamed: 1_level_1,Unnamed: 2_level_1
False,86342,13083
True,22990,18372


In [112]:
pd.crosstab(weather['Rained'], weather['RainTomorrow'])

RainTomorrow,No,Yes
Rained,Unnamed: 1_level_1,Unnamed: 2_level_1
False,99996,0
True,10320,31877


In [113]:
weather['RISK_MM'].nunique()

681

In [114]:
weather['RainToday'].value_counts(normalize=True)

No     0.776577
Yes    0.223423
Name: RainToday, dtype: float64

In [115]:
weather['Rainfall'].value_counts()

0.0      90275
0.2       8685
0.4       3750
0.6       2562
0.8       2028
         ...  
74.4         1
60.6         1
7.9          1
145.6        1
128.2        1
Name: Rainfall, Length: 679, dtype: int64

In [116]:

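# Encode the Yes/No columns as 1/0 (missing values stay NaN).
# Assigning the result back avoids a chained in-place replace on a column.
weather['RainTomorrow'] = weather['RainTomorrow'].map({'No': 0, 'Yes': 1})
weather['RainToday'] = weather['RainToday'].map({'No': 0, 'Yes': 1})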

In [130]:
weather['Rained'] = weather['Rained'].map({True: 1, False: 0})
weather

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow,Rained
0,2008-12-01,Albury,56.12,73.22,0.6,,,W,44.0,W,...,1007.7,1007.1,8.0,,16.9,21.8,0.0,0.0,0,0
1,2008-12-02,Albury,45.32,77.18,0.0,,,WNW,44.0,NNW,...,1010.6,1007.8,,,17.2,24.3,0.0,0.0,0,0
2,2008-12-03,Albury,55.22,78.26,0.0,,,WSW,46.0,W,...,1007.6,1008.7,,2.0,21.0,23.2,0.0,0.0,0,0
3,2008-12-04,Albury,48.56,82.40,0.0,,,NE,24.0,SE,...,1017.6,1012.8,,,18.1,26.5,0.0,1.0,0,1
4,2008-12-05,Albury,63.50,90.14,1.0,,,W,41.0,ENE,...,1010.8,1006.0,7.0,8.0,17.8,29.7,0.0,0.2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,2017-06-20,Uluru,38.30,71.24,0.0,,,E,31.0,ESE,...,1024.7,1021.2,,,9.4,20.9,0.0,0.0,0,0
142189,2017-06-21,Uluru,37.04,74.12,0.0,,,E,31.0,SE,...,1024.6,1020.3,,,10.1,22.4,0.0,0.0,0,0
142190,2017-06-22,Uluru,38.48,77.54,0.0,,,NNW,22.0,SE,...,1023.5,1019.1,,,10.9,24.5,0.0,0.0,0,0
142191,2017-06-23,Uluru,41.72,80.42,0.0,,,N,37.0,SE,...,1021.0,1016.8,,,12.5,26.1,0.0,0.0,0,0


In [131]:
# Classification model based on target:
target = 'Rained'

In [132]:
# Columns to exclude. RISK_MM and RainTomorrow describe tomorrow's rainfall,
# so they would leak the target; RainToday is dropped as well.
cols_to_drop = ['RainTomorrow', 'RISK_MM', 'RainToday']


In [119]:
# The target is imbalanced (about 70/30), so I will use precision and recall
# to evaluate the model's predictions rather than relying on accuracy alone.

from sklearn.metrics import precision_recall_curve
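
A small sketch of what precision and recall report on an imbalanced target. The labels and predictions below are made up for illustration; with the real data they would be `y_val` and the model's predictions.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Hypothetical true labels and predictions (1 = rain, 0 = no rain).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
print(f"precision={precision:.3f} recall={recall:.3f}")
print(classification_report(y_true, y_hat))
```

Here accuracy is 0.7, yet only half of the rainy days are caught, which is the kind of gap accuracy alone hides.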

In [120]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(weather, train_size=.8, stratify=weather[target], random_state=42)


In [121]:
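# Reassign `train` so the validation rows are truly held out; if the result were
# stored under another name, `train` would still contain every `val` row.
train, val = train_test_split(train, train_size=.8, stratify=train[target], random_state=42)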
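
The cells above use a random, stratified split. Because every row is a dated observation, a time-based split is the other option the assignment raises. A minimal sketch, assuming a `Date` column parsed as datetimes; the frame and the cutoff dates are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one row per day.
frame = pd.DataFrame({
    "Date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "Rained": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
})

# Cutoff dates are illustrative assumptions.
val_start = pd.Timestamp("2016-01-07")
test_start = pd.Timestamp("2016-01-09")

# Earlier rows train the model; later rows validate and test it.
train_part = frame[frame["Date"] < val_start]
val_part = frame[(frame["Date"] >= val_start) & (frame["Date"] < test_start)]
test_part = frame[frame["Date"] >= test_start]
print(len(train_part), len(val_part), len(test_part))
```

Splitting by date keeps the model from being evaluated on days that precede the ones it was trained on.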

In [133]:
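def wrangler(dataset):
    '''Return a copy without the leaky columns; safe to call more than once.'''
    dataset = dataset.copy()
    cols_to_drop = ['RISK_MM', 'RainToday', 'RainTomorrow']
    # errors='ignore' skips columns that are already gone, so re-running
    # this cell does not raise a KeyError.
    dataset = dataset.drop(columns=cols_to_drop, errors='ignore')
    return dataset

train = wrangler(train)
val = wrangler(val)
test = wrangler(test)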

In [27]:
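import category_encoders as ce
from sklearn.feature_selection import mutual_info_classif
# Ordinal-encode every column of `weather`. The target and the leaky columns
# are still present here, so only the scores of genuine features are meaningful.
tempencoder = ce.OrdinalEncoder()
tempencoded = tempencoder.fit_transform(weather)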


In [28]:
imputed = SimpleImputer(strategy='mean')
dfimputed = imputed.fit_transform(tempencoded)

In [29]:
scores = mutual_info_classif(dfimputed, weather[target])

In [30]:
data = {
    'features': weather.columns.tolist(),
    'scores': scores.tolist()
}

In [31]:
scoredf = pd.DataFrame.from_dict(data)

In [32]:
scoredf

Unnamed: 0,features,scores
0,Date,0.049304
1,Location,0.013132
2,MinTemp,0.006155
3,MaxTemp,0.013785
4,Rainfall,0.056359
5,Evaporation,0.011773
6,Sunshine,0.06192
7,WindGustDir,0.00702
8,WindGustSpeed,0.025648
9,WindDir9am,0.010498


In [129]:
train

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,Rained
58146,2013-01-14,Bendigo,44.42,79.52,0.6,,,SE,43.0,SE,...,13.0,56.0,25.0,1019.2,1017.5,,,14.4,24.3,False
22206,2013-08-06,NorfolkIsland,53.06,67.10,0.0,2.4,9.2,NE,30.0,E,...,19.0,73.0,66.0,1021.6,1019.3,2.0,2.0,16.9,19.1,False
27740,2012-06-18,Richmond,35.42,67.10,0.0,3.0,,W,22.0,,...,4.0,84.0,43.0,1022.6,1019.8,,,10.2,18.5,False
9960,2011-12-01,CoffsHarbour,67.28,68.18,0.0,7.4,0.0,SSW,61.0,SSW,...,33.0,78.0,91.0,1015.0,1016.8,8.0,8.0,19.9,16.2,True
135075,2015-02-22,AliceSprings,77.00,101.30,0.0,9.8,8.8,NNW,39.0,N,...,17.0,49.0,31.0,1008.1,1004.2,7.0,7.0,30.6,36.5,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123172,2015-08-07,SalmonGums,44.06,60.44,0.0,,,N,30.0,N,...,15.0,67.0,57.0,,,,,10.9,15.0,True
63806,2012-01-06,MelbourneAirport,50.36,74.12,0.0,6.6,10.8,SSW,43.0,SE,...,22.0,54.0,40.0,1019.0,1014.9,6.0,4.0,17.6,21.7,False
83610,2014-01-10,Brisbane,66.92,82.04,1.4,7.0,1.5,E,30.0,SSE,...,9.0,65.0,61.0,1020.4,1018.3,5.0,7.0,25.8,25.9,False
101897,2014-07-07,Nuriootpa,36.50,58.10,0.4,0.9,6.8,N,43.0,N,...,20.0,81.0,51.0,1016.1,1011.3,4.0,7.0,8.8,13.8,False


In [134]:
# Every column except the target is a feature.
features = train.columns.drop(target)
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [135]:
X_train

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
58146,2013-01-14,Bendigo,44.42,79.52,0.6,,,SE,43.0,SE,...,20.0,13.0,56.0,25.0,1019.2,1017.5,,,14.4,24.3
22206,2013-08-06,NorfolkIsland,53.06,67.10,0.0,2.4,9.2,NE,30.0,E,...,9.0,19.0,73.0,66.0,1021.6,1019.3,2.0,2.0,16.9,19.1
27740,2012-06-18,Richmond,35.42,67.10,0.0,3.0,,W,22.0,,...,0.0,4.0,84.0,43.0,1022.6,1019.8,,,10.2,18.5
9960,2011-12-01,CoffsHarbour,67.28,68.18,0.0,7.4,0.0,SSW,61.0,SSW,...,33.0,33.0,78.0,91.0,1015.0,1016.8,8.0,8.0,19.9,16.2
135075,2015-02-22,AliceSprings,77.00,101.30,0.0,9.8,8.8,NNW,39.0,N,...,22.0,17.0,49.0,31.0,1008.1,1004.2,7.0,7.0,30.6,36.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123172,2015-08-07,SalmonGums,44.06,60.44,0.0,,,N,30.0,N,...,13.0,15.0,67.0,57.0,,,,,10.9,15.0
63806,2012-01-06,MelbourneAirport,50.36,74.12,0.0,6.6,10.8,SSW,43.0,SE,...,22.0,22.0,54.0,40.0,1019.0,1014.9,6.0,4.0,17.6,21.7
83610,2014-01-10,Brisbane,66.92,82.04,1.4,7.0,1.5,E,30.0,SSE,...,7.0,9.0,65.0,61.0,1020.4,1018.3,5.0,7.0,25.8,25.9
101897,2014-07-07,Nuriootpa,36.50,58.10,0.4,0.9,6.8,N,43.0,N,...,11.0,20.0,81.0,51.0,1016.1,1011.3,4.0,7.0,8.8,13.8


In [145]:
# First model: ordinal-encode categoricals, mean-impute NaNs,
# then fit a shallow random forest.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=450, n_jobs=-1, criterion='entropy', max_depth=5, random_state=42)
)

In [146]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['Date', 'Location', 'WindGustDir',
                                      'WindDir9am', 'WindDir3pm'],
                                mapping=[...])),
                ...])

In [148]:
y_pred = pipeline.predict(X_val)

In [149]:
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

0.8108215023515449

In [152]:
transformer = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='mean')
)

X_train_transformed = transformer.fit_transform(X_train)
X_val_transformed = transformer.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [153]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
        model,
        scoring='accuracy',
        n_iter=5,
        random_state=42
)
permuter.fit(X_val_transformed, y_val);



PermutationImportance(estimator=RandomForestClassifier(n_jobs=-1,
                                                       random_state=42),
                      random_state=42, scoring='accuracy')

In [154]:
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

WindDir9am       0.004624
WindSpeed9am     0.005213
Evaporation      0.005573
WindSpeed3pm     0.007991
WindDir3pm       0.008808
WindGustDir      0.009309
Temp9am          0.009705
Date             0.009731
Location         0.011217
MaxTemp          0.012413
MinTemp          0.013186
Cloud9am         0.015797
Temp3pm          0.015850
Pressure9am      0.028728
Cloud3pm         0.032218
Humidity9am      0.039699
Pressure3pm      0.046231
Sunshine         0.056270
WindGustSpeed    0.058310
Rainfall         0.068727
Humidity3pm      0.159211
dtype: float64
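
The same idea is also available in scikit-learn itself as `sklearn.inspection.permutation_importance`. The sketch below is not the eli5 workflow used in this notebook; it runs on synthetic data (an assumption) where only the first column carries signal, so the expected ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic data (assumption): the label depends on the first column only.
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Fit on the first 300 rows, measure importances on the held-out 100.
X_fit, X_hold = X_demo[:300], X_demo[300:]
y_fit, y_hold = y_demo[:300], y_demo[300:]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_fit, y_fit)

# Shuffle one column at a time and record the drop in accuracy.
result = permutation_importance(
    forest, X_hold, y_hold, scoring="accuracy", n_repeats=5, random_state=0
)
print(result.importances_mean)
```

As in the cells above, importances are measured on held-out rows so they reflect what the model relies on when generalizing.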

In [155]:
eli5.show_weights(
    permuter,
    top=None,
    feature_names=feature_names
)

Weight,Feature
0.1592  ± 0.0030,Humidity3pm
0.0687  ± 0.0011,Rainfall
0.0583  ± 0.0021,WindGustSpeed
0.0563  ± 0.0018,Sunshine
0.0462  ± 0.0021,Pressure3pm
0.0397  ± 0.0016,Humidity9am
0.0322  ± 0.0013,Cloud3pm
0.0287  ± 0.0006,Pressure9am
0.0158  ± 0.0008,Temp3pm
0.0158  ± 0.0014,Cloud9am


In [161]:
import xgboost
from xgboost import XGBClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    XGBClassifier(n_estimators=200, random_state=42, n_jobs=-1, learning_rate=0.5)
)

pipeline.fit(X_train, y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['Date', 'Location', 'WindGustDir',
                                      'WindDir9am', 'WindDir3pm'],
                                mapping=[...])),
                ('xgbclassifier',
                 XGBClassifier(..., colsample_bytree=1, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.5,
                               max_delta_step=0, max_depth=6, ...))])

In [162]:
y_pred_xgb1 = pipeline.predict(X_val)
print('Validation Accuracy:', accuracy_score(y_val, y_pred_xgb1))

Validation Accuracy: 0.918553030636016


In [93]:
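from statsmodels.tsa.arima.model import ARIMA


In [None]:
# ARIMA models one numeric series, not a whole DataFrame.
# Assumption for illustration: MaxTemp for a single location (Canberra).
series = weather.loc[weather['Location'] == 'Canberra', 'MaxTemp'].dropna().reset_index(drop=True)
model = ARIMA(series, order=(1, 1, 0))
model_fit = model.fit()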

In [92]:
# from pandas.plotting import autocorrelation_plot

In [124]:
# autocorrelation_plot(weather)