<a href="https://colab.research.google.com/github/kevmanning/DS-Unit-2-Applied-Modeling/blob/master/DS_233_assignment_Kevin_Manning_dspt10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
!pip install eli5

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
import eli5
from eli5.sklearn import PermutationImportance
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
from google.colab import files
uploaded= files.upload()

In [None]:
import io
df= pd.read_csv(io.BytesIO(uploaded['nhl_17_18_reg_adv.csv']))

In [None]:
df.columns =['game', 'date', 'home_away', 'opponent', 'goals', 'goals_against',
       'win_loss', 'overtime', 'blank1', 'shots', 'penalty_mins', 'power_play_goals',
       'power_plays', 'short_handed', 'blank2', 'opp_shots', 'opp_penalty_mins',
       'opp_power_play_g', 'opp_power_plays', 'opp_short_handed', 'blank3', 'corsi_for',
       'corsi_against', 'corsi_for_%', 'fenwick_for', 'fenwick_against',
       'fenwick_%', 'face_off_win', 'face_off_loss', 'face_off_%',
       'off_zone_start', 'pdo']

In [None]:
# drop blank columns
df= df.drop(['blank1', 'blank2', 'blank3'], axis= 1)

In [None]:
# check majority class
df['win_loss'].describe()

In [None]:
# visually
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style= 'darkgrid')
ax = sns.countplot(x="win_loss", data=df)

In [None]:
# distribution of target

y= df['win_loss']
print(y.nunique())
print()
print(y.value_counts())
print()
y.value_counts(normalize= True)
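
A majority-class baseline follows directly from these normalized counts: always predicting the most frequent outcome gives an accuracy equal to that class's share of the rows. A minimal sketch on a made-up label series (the `toy_y` values are illustrative only, not the NHL data):

```python
import pandas as pd

# Hypothetical win/loss labels, for illustration only
toy_y = pd.Series(['W', 'L', 'W', 'W', 'L', 'W', 'L', 'W'])

# The majority class and its share of the labels
majority_class = toy_y.value_counts().idxmax()
baseline_accuracy = toy_y.value_counts(normalize=True).max()

# Always predicting the majority class reproduces that share as accuracy
baseline_pred = pd.Series(majority_class, index=toy_y.index)
assert (baseline_pred == toy_y).mean() == baseline_accuracy

print(majority_class, baseline_accuracy)
```

Any fitted model should be compared against this number on the same split.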

In [None]:
# drop unnecessary columns
df.columns

In [None]:
df= df.drop(columns= ['date', 'game'])

In [None]:
# begin to clean up the data
df.isna().sum().sort_values(ascending= False)

# looks like overtime and home/away are the only columns with issues

In [None]:
# fill home/away first
# home games are NaN -> label them 'home'
# away games are marked '@' -> label them 'away'
df.home_away.value_counts(dropna= False)

In [None]:
df.home_away= df.home_away.replace('@', 'away')
df.home_away= df.home_away.fillna('home')

In [None]:
df.home_away.value_counts()

In [None]:
# now do overtime
df.overtime.value_counts(dropna= False)

In [None]:
df.overtime= df.overtime.fillna('No')
df.overtime= df.overtime.replace('SO', 'Yes')
df.overtime= df.overtime.replace('OT', 'Yes')

In [None]:
df.overtime.value_counts()

In [None]:
# check to see if data cleaning worked
df.isnull().sum()

In [None]:
target= 'win_loss'
features= df.columns.drop([target])
X= df[features]
y= df[target]

In [None]:
# train/val/test split
# will split it train/test and then split val/test

X_train, X_test_val, y_train, y_test_val= train_test_split(X, y, test_size= .39, random_state= 99)
X_val, X_test, y_val, y_test= train_test_split(X_test_val, y_test_val, test_size= .5, random_state= 99)
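
Chaining two splits this way leaves roughly 61% of the rows for training and about 19.5% each for validation and test. A small sketch with placeholder data (`toy_X` and `toy_y` are made up) that prints the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 3 features, binary labels
rng = np.random.default_rng(99)
toy_X = rng.normal(size=(1000, 3))
toy_y = rng.integers(0, 2, size=1000)

# First split: hold out 39% of the rows
X_tr, X_rest, y_tr, y_rest = train_test_split(
    toy_X, toy_y, test_size=0.39, random_state=99)

# Second split: divide the held-out rows evenly into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=99)

for name, part in [('train', X_tr), ('val', X_va), ('test', X_te)]:
    print(name, len(part), round(len(part) / len(toy_X), 3))
```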

In [None]:
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

In [None]:
pipeline= make_pipeline(
    ce.OrdinalEncoder(),
    DecisionTreeClassifier(max_depth=7)
)

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

In [None]:
import graphviz
from sklearn.tree import export_graphviz

tree = pipeline.named_steps['decisiontreeclassifier']

dot_data = export_graphviz(
    tree, 
    out_file=None, 
    feature_names=list(X_train.columns), 
    class_names=tree.classes_.astype(str),  # must follow the tree's sorted class order
    filled=True, 
    impurity=False,
    proportion=True
)

graphviz.Source(dot_data)

In [None]:
# decision tree
dt = pipeline.named_steps['decisiontreeclassifier']
importances = pd.Series(dt.feature_importances_, X_train.columns)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

n = 20
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey');

In [None]:
pipeline2 = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline2.fit(X_train, y_train)
print('Validation Accuracy', pipeline2.score(X_val, y_val))

In [None]:
# random forest
rf = pipeline2.named_steps['randomforestclassifier']
importances2 = pd.Series(rf.feature_importances_, X_train.columns)

# Plot feature importances
%matplotlib inline
import matplotlib.pyplot as plt

n = 20
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances2.sort_values()[-n:].plot.barh(color='grey');

In [None]:
# decision tree ROC
from sklearn.metrics import roc_auc_score
y_pred_proba = pipeline.predict_proba(X_val)[:,-1] # probability for the last class 
roc_auc_score(y_val, y_pred_proba)
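
Taking the last column of `predict_proba` works here because scikit-learn orders those columns by `classes_`, which is sorted, and for a binary target `roc_auc_score` treats the greater label as the positive class. A minimal sketch on made-up data (the labels 'L' and 'W' are assumed to mirror `win_loss`):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one informative feature, string labels like win_loss
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 1))
toy_y = np.where(toy_X[:, 0] + rng.normal(scale=0.5, size=200) > 0, 'W', 'L')

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(toy_X, toy_y)

# Columns of predict_proba follow clf.classes_, which is sorted
print(clf.classes_)
proba_last = clf.predict_proba(toy_X)[:, -1]  # probability of clf.classes_[-1]

# Scoring against the string labels matches scoring against an explicit 0/1 target
auc_strings = roc_auc_score(toy_y, proba_last)
auc_binary = roc_auc_score((toy_y == clf.classes_[-1]).astype(int), proba_last)
print(auc_strings, auc_binary)
```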

In [None]:
y_pred= pipeline.predict(X_test)

In [None]:
y_pred

In [None]:
y_test

In [None]:
y_pred == y_test

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

In [None]:
X_train.columns

In [None]:
y_pred2= pipeline2.predict(X_test)

In [None]:
column= 'pdo'

pipeline3= make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy= 'median'),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline3.fit(X_train.drop(columns=column), y_train)
score_without = pipeline3.score(X_val.drop(columns=column), y_val)
print(f'Validation Accuracy without {column}: {score_without}')
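
The comparison above is a drop-column importance: the validation score with the feature minus the score of a model refit without it. A minimal sketch on synthetic data, where the helper name `drop_column_importance` and the `signal`/`noise` columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def drop_column_importance(model, X_tr, y_tr, X_va, y_va, column):
    """Validation score with all columns minus the score without `column`."""
    full = clone(model).fit(X_tr, y_tr)
    reduced = clone(model).fit(X_tr.drop(columns=column), y_tr)
    return full.score(X_va, y_va) - reduced.score(X_va.drop(columns=column), y_va)

# Synthetic data: 'signal' determines the label, 'noise' is unrelated to it
rng = np.random.default_rng(42)
toy = pd.DataFrame({'signal': rng.normal(size=400), 'noise': rng.normal(size=400)})
toy_y = (toy['signal'] > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(toy, toy_y, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
for col in toy.columns:
    print(col, drop_column_importance(rf, X_tr, y_tr, X_va, y_va, col))
```

This requires one refit per feature, which is the cost that permutation importance avoids.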

In [None]:
feature= 'pdo'
X_val_permuted = X_val.copy()
X_val_permuted[feature]= np.random.permutation(X_val[feature])

In [None]:
score_permuted= pipeline2.score(X_val_permuted, y_val)
print(f'Validation Accuracy with {feature} permuted: {score_permuted}')
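
The same shuffle can be repeated for every feature, and averaged over several shuffles, to rank features by how much the score drops when each one is scrambled. A minimal sketch on synthetic data, where the helper name `manual_permutation_importance` and the `signal`/`noise` columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X_va, y_va, n_repeats=5, seed=0):
    """Mean drop in validation score when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_score = model.score(X_va, y_va)
    drops = {}
    for col in X_va.columns:
        scores = []
        for _ in range(n_repeats):
            X_perm = X_va.copy()
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())
            scores.append(model.score(X_perm, y_va))
        drops[col] = base_score - np.mean(scores)
    return pd.Series(drops).sort_values(ascending=False)

# Synthetic data: 'signal' determines the label, 'noise' is unrelated to it
rng = np.random.default_rng(1)
toy = pd.DataFrame({'signal': rng.normal(size=400), 'noise': rng.normal(size=400)})
toy_y = (toy['signal'] > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(toy, toy_y, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
importances = manual_permutation_importance(rf, X_va, y_va)
print(importances)
```

The model is fit once and only scored repeatedly, so no retraining is needed.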

In [None]:
transformers= make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy= 'median')
)

X_train_transformed= transformers.fit_transform(X_train)
X_val_transformed= transformers.transform(X_val)

model4= RandomForestClassifier(n_estimators=50, random_state=99, n_jobs=-1)
model4.fit(X_train_transformed, y_train)
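
scikit-learn ships a comparable tool, `sklearn.inspection.permutation_importance`, which can stand in when eli5 is not available. A hedged sketch on synthetic data (this is an alternative to the eli5 `PermutationImportance` imported in this notebook, not a copy of it, and the `signal`/`noise` columns are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 'signal' determines the label, 'noise' is unrelated to it
rng = np.random.default_rng(99)
toy = pd.DataFrame({'signal': rng.normal(size=400), 'noise': rng.normal(size=400)})
toy_y = (toy['signal'] > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(toy, toy_y, random_state=99)

toy_model = RandomForestClassifier(n_estimators=50, random_state=99).fit(X_tr, y_tr)

# Shuffle each validation column n_repeats times and record the score drop
result = permutation_importance(
    toy_model, X_va, y_va,
    scoring='accuracy', n_repeats=5, random_state=99)

perm_importances = pd.Series(result.importances_mean, index=X_va.columns)
print(perm_importances.sort_values(ascending=False))
```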

In [None]:
permuter= PermutationImportance(
    model4,
    scoring= 'accuracy',
    n_iter= 5,
    random_state=99
)

permuter.fit(X_val_transformed, y_val)

In [None]:
features_names= X_val.columns.to_list()
pd.Series(permuter.feature_importances_, features_names).sort_values(ascending= False)

In [None]:
eli5.show_weights(
    permuter,
    top= None,
    feature_names= features_names
)

In [None]:
# remove features with zero importance
print('Shape before removing feature ', X_train.shape)

In [None]:
min_imp= 0
mask= permuter.feature_importances_ > min_imp
features= X_train.columns[mask]
X_train= X_train[features]
print('Shape AFTER removing feature ', X_train.shape)
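
The boolean mask keeps only the columns whose importance is strictly greater than the threshold, so features with zero or negative importance are dropped. A tiny sketch with hypothetical importances (`toy_X` and `toy_importances` are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical importances, one per column and in column order
toy_X = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})
toy_importances = np.array([0.12, 0.0, -0.01, 0.03])

# Keep only the columns whose importance is strictly above the threshold
min_imp = 0
mask = toy_importances > min_imp
kept = toy_X.columns[mask]
toy_X_selected = toy_X[kept]

print(list(kept))
```

The same column selection has to be applied to every split that the refit model will score or predict on.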

In [None]:
X_val= X_val[features]

pipeline5= make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy= 'median'),
    RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
)

pipeline5.fit(X_train, y_train)
print('Validation accuracy', pipeline5.score(X_val, y_val))

In [None]:
pipeline6= make_pipeline(
    ce.OrdinalEncoder(),
    XGBClassifier(n_estimators= 100, random_state= 99, n_jobs= -1)
)

pipeline6.fit(X_train, y_train)

In [None]:
y_pred3= pipeline6.predict(X_val)
print('Validation Accuracy: ', accuracy_score(y_val, y_pred3))
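
Recent XGBoost releases expect integer class labels (0, 1, ...) rather than strings such as 'W' and 'L', so the target may need encoding before fitting. Whether this applies depends on the installed version. A sketch using scikit-learn's `LabelEncoder` on made-up labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up string targets standing in for y_train and y_val
toy_y_train = np.array(['W', 'L', 'W', 'L', 'W'])
toy_y_val = np.array(['L', 'W'])

# Fit the encoder on the training labels only, then reuse it for validation
le = LabelEncoder()
y_train_encoded = le.fit_transform(toy_y_train)
y_val_encoded = le.transform(toy_y_val)

print(le.classes_)
print(y_train_encoded)
print(le.inverse_transform(y_val_encoded))
```

`inverse_transform` maps integer predictions back to the original labels for reporting.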

In [None]:
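# XGBoost with early stopping, monitored on the validation set
encoder= ce.OrdinalEncoder()
X_train_encoded= encoder.fit_transform(X_train)
X_val_encoded= encoder.transform(X_val)

model= XGBClassifier(
    n_estimators= 1000,
    max_depth=11,
    learning_rate= 0.5,
    n_jobs= -1
)

# eval_set must hold the same encoded columns the model is trained on
eval_set= [(X_val_encoded, y_val)]

# 'error' is the binary error rate; 'merror' is its multiclass counterpart
# newer xgboost versions take eval_metric and early_stopping_rounds in the constructor instead
model.fit(X_train_encoded, y_train,
          eval_set= eval_set,
          eval_metric= 'error',
          early_stopping_rounds=50)

In [None]:
# Passing the raw X_test in eval_set most likely caused the error here:
# it still has string columns and the features dropped above, so it does not match X_train_encoded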