Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import pandas as pd

# Continue to clean and explore your data. Make exploratory visualizations.

In [31]:
#See LS_DS_232_assignment
df = pd.read_csv('/content/drive/My Drive/Lambda/Unit 2/Basketball Data/eng_shots_log.csv')
df = df.drop(columns='Unnamed: 0')
df.head(1)

Unnamed: 0,GAME_ID,MATCHUP,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,GAME_CLOCK,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,player_name,player_id,MINUTES,SECONDS,SECONDS_IN_GAME
0,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,1,1,1:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,brian roberts,203148,1,9,2229


# Fit a model. Does it beat your baseline?

In [0]:
#Again, see LS_DS_232_assignment:
#baseline accuracy: 54.8%
#model 1 accuracy: 60.5%
#model 2 accuracy: 60.9%


In [28]:
target = 'SHOT_RESULT'
y = df[target]
y.value_counts(normalize=True)

missed    0.547861
made      0.452139
Name: SHOT_RESULT, dtype: float64
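
The largest share in the output above is the baseline: a model that always guesses `missed` would be right about 54.8% of the time. As a minimal sketch of that calculation, assuming a made-up toy series (`y_toy` is hypothetical and not part of the shots data):

```python
import pandas as pd

# Hypothetical toy target standing in for df['SHOT_RESULT']
y_toy = pd.Series(['missed', 'made', 'missed', 'missed', 'made'])

# The majority class is the most frequent label
majority_class = y_toy.mode()[0]

# Baseline accuracy: share of rows a constant majority-class guess gets right
baseline_accuracy = (y_toy == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any fitted model should be compared against this number rather than against 50%.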

In [32]:
#Can you make a fast, first model that beats guessing?
import numpy as np
from sklearn.model_selection import train_test_split

train = df

# Split df into train & test sets
train, test = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['SHOT_RESULT'], random_state=42)
train.shape, test.shape

((102455, 22), (25614, 22))

In [33]:
#Split train into train and validation sets
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['SHOT_RESULT'], random_state=42)
train.shape, val.shape

((81964, 22), (20491, 22))
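
Splitting twice with `train_size=0.80` leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing, which matches the shapes above (81964, 20491, and 25614 rows). A small sketch of the same two-step split, assuming a hypothetical 100-row frame (`toy` and its values are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for the shots data
toy = pd.DataFrame({
    'SHOT_DIST': range(100),
    'SHOT_RESULT': ['missed'] * 55 + ['made'] * 45,
})

# First split: hold out 20% as the test set
toy_train, toy_test = train_test_split(
    toy, train_size=0.80, test_size=0.20,
    stratify=toy['SHOT_RESULT'], random_state=42)

# Second split: hold out 20% of the remaining rows for validation
toy_train, toy_val = train_test_split(
    toy_train, train_size=0.80, test_size=0.20,
    stratify=toy_train['SHOT_RESULT'], random_state=42)

print(len(toy_train), len(toy_val), len(toy_test))
```

Stratifying on the target in both steps keeps the made/missed ratio similar across all three sets.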

In [0]:
# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
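
On the shots data this filter appears to keep `LOCATION` and `W` as the only non-numeric features and to drop high-cardinality text columns such as `player_name`. The sketch below repeats the same selection on a hypothetical toy frame (the column values are invented) so the effect of the `<= 50` cutoff is easy to see:

```python
import pandas as pd

n = 60
# Hypothetical frame: one numeric column, one low-cardinality and
# one high-cardinality text column
toy = pd.DataFrame({
    'SHOT_DIST': [float(i) for i in range(n)],
    'LOCATION': ['A', 'H'] * (n // 2),
    'player_name': [f'player {i}' for i in range(n)],
})

# Numeric columns are kept as they are
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Count unique values in each non-numeric column
toy_cardinality = toy.select_dtypes(exclude='number').nunique()

# Keep only the non-numeric columns with at most 50 unique values
low_cardinality_cols = toy_cardinality[toy_cardinality <= 50].index.tolist()

toy_features = numeric_cols + low_cardinality_cols
print(toy_features)
```

Here `player_name` has 60 unique values, so it is excluded, while `LOCATION` (2 unique values) is kept.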

In [0]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [25]:
#install category encoders
!pip install category_encoders



In [36]:
#Fit a model using a pipeline

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.one_hot.OneHotEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
)

pipeline.fit(X_train, y_train)
print(f'Validation accuracy: {pipeline.score(X_val, y_val)}')

Validation accuracy: 0.6085598555463374


# Get your model's permutation importances

In [0]:
%%capture
import sys

if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5

import eli5
from eli5.sklearn import PermutationImportance


In [41]:
transformers = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer()
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [42]:
permuter = PermutationImportance(
    model,
    scoring = 'accuracy',
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

PermutationImportance(cv='prefit',
                      estimator=RandomForestClassifier(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       max_samples=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fr

In [43]:
feature_names = X_val.columns.to_list()

pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

SHOT_DIST                     0.069318
TOUCH_TIME                    0.011810
CLOSE_DEF_DIST                0.011703
SHOT_CLOCK                    0.003972
PTS_TYPE                      0.003602
MINUTES                       0.002245
player_id                     0.000839
GAME_ID                       0.000068
PERIOD                        0.000029
DRIBBLES                     -0.000508
SHOT_NUMBER                  -0.000586
SECONDS_IN_GAME              -0.000713
LOCATION                     -0.000839
FINAL_MARGIN                 -0.001191
CLOSEST_DEFENDER_PLAYER_ID   -0.001825
W                            -0.002440
SECONDS                      -0.003563
dtype: float64

In [44]:
permuter.feature_importances_std_

array([0.00132777, 0.00156367, 0.00071657, 0.0006193 , 0.00087114,
       0.00119444, 0.00110081, 0.00171283, 0.00059161, 0.00100233,
       0.00096987, 0.00123398, 0.00162884, 0.00157654, 0.00111568,
       0.00047556, 0.00049864])
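
The array above is in the column order of `X_val`, not sorted by importance, so it is easier to read once paired with feature names. The `±` values in the `show_weights` table in the next cell look like twice these standard deviations. A quick arithmetic check, assuming the feature order inferred from the notebook (the name-to-value pairing below is that assumption, not something the output states directly):

```python
# Standard deviations copied from the array above; which feature each
# one belongs to is an assumption based on the column order of X_val
stds = {
    'TOUCH_TIME': 0.00110081,
    'SHOT_DIST': 0.00171283,
    'CLOSE_DEF_DIST': 0.00096987,
}

# Twice the standard deviation, rounded the way the table displays it
intervals = {name: round(2 * std, 4) for name, std in stds.items()}

for name, half_width in intervals.items():
    print(f'{name}: ± {half_width:.4f}')
```

In the notebook itself, `pd.Series(permuter.feature_importances_std_, feature_names)` would give the full labeled version.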

In [45]:
eli5.show_weights(permuter,
                  top=None, 
                  feature_names = feature_names)

Weight,Feature
0.0693  ± 0.0034,SHOT_DIST
0.0118  ± 0.0022,TOUCH_TIME
0.0117  ± 0.0019,CLOSE_DEF_DIST
0.0040  ± 0.0017,SHOT_CLOCK
0.0036  ± 0.0012,PTS_TYPE
0.0022  ± 0.0033,MINUTES
0.0008  ± 0.0025,player_id
0.0001  ± 0.0027,GAME_ID
0.0000  ± 0.0012,PERIOD
-0.0005  ± 0.0024,DRIBBLES
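
Each weight above is the average drop in validation accuracy when one column is shuffled and the fitted model predicts again. The sketch below shows that idea by hand with NumPy. It is not eli5's implementation, and `permutation_importances`, `toy_predict`, and the toy data are hypothetical names made up for illustration:

```python
import numpy as np

def permutation_importances(predict, X, y, n_iter=5, seed=42):
    """Mean drop in accuracy after shuffling each column of X."""
    rng = np.random.default_rng(seed)
    base_score = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(base_score - np.mean(predict(X_perm) == y))
        importances[col] = np.mean(drops)
    return importances

# Toy data: the target depends only on column 0; column 1 is noise
data_rng = np.random.default_rng(0)
X_toy = data_rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    # Stand-in for a fitted model: it looks at column 0 only
    return (X[:, 0] > 0).astype(int)

importances = permutation_importances(toy_predict, X_toy, y_toy)
print(importances)
```

The noise column gets an importance of exactly zero here because the stand-in model never reads it. With a real model, features it barely uses land near zero and can come out slightly negative by chance, as `SECONDS` and `W` do above.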


# Try xgboost.

In [46]:
from xgboost import XGBClassifier

pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['LOCATION', 'W'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'LOCATION',
                                          'data_type': dtype('O'),
                                          'mapping': A      1
H      2
NaN   -2
dtype: int64},
                                         {'col': 'W', 'data_type': dtype('O'),
                                          'mapping': W      1
L      2
NaN   -2
dtype: int64}],
                                return_df=True, verbose=0)),
                ('xgbclassifier',
                 XGBClas...ster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_ch

In [48]:
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_val)
print(f'Val accuracy score: {accuracy_score(y_val, y_pred)}')

Val accuracy score: 0.6223219950222049
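
To answer "does it beat your baseline?" in one place, the scores reported in this notebook can be lined up against the majority-class share. The numbers below are copied from the outputs above. The baseline was computed on the full dataframe rather than on the validation set, so treat the comparison as approximate:

```python
# Accuracies copied from the outputs in this notebook
scores = {
    'majority-class baseline': 0.547861,
    'random forest': 0.6085598555463374,
    'xgboost': 0.6223219950222049,
}

baseline = scores['majority-class baseline']

# Gain over the baseline, in percentage points
gains = {name: (acc - baseline) * 100 for name, acc in scores.items()}

for name, acc in scores.items():
    print(f'{name}: {acc:.3f} ({gains[name]:+.1f} points vs. baseline)')
```

Both models beat the baseline, and xgboost with default settings edges out the random forest by a little over one percentage point.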
