_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    ## Install packages in Colab
    !pip install --upgrade category_encoders eli5 pandas-profiling plotly

In [1]:
# Import required packages
import category_encoders as ce
import eli5
import pandas as pd
import sklearn

# Check package versions
from distutils.version import StrictVersion
assert StrictVersion(ce.__version__) >= StrictVersion('2.0.0')
assert StrictVersion(eli5.__version__) >= StrictVersion('0.9.0')
assert StrictVersion(pd.__version__) >= StrictVersion('0.24.2')
assert StrictVersion(sklearn.__version__) >= StrictVersion('0.21.3')

# Read data
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn or xgboost cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree, Random Forest, or Gradient Boosting** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get and plot your model's **feature importances.**

**8.** Get and display your model's **permutation importances.**

**9.** Get your model's **test accuracy.** (One time, at the end.)

### Stretch Goals
- Make 2+ visualizations to explore relationships between features and target.
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Use permutation importances for feature selection.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [4]:
import pandas_profiling
profile = df.profile_report()
rejected_variables = profile.get_rejected_variables(threshold=0.9)

In [37]:
df.head(2)

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0


In [5]:
profile



In [6]:
# Find out what to reject immediately unless time permits more analysis
rejected_variables

['period']

In [162]:
# Build feature view for early analysis
view = df.columns.tolist()
drop_set = set(['period', 'player_name', 'game_id', 'game_event_id'])
view = list(set(view)-drop_set)
view

['shot_zone_range',
 'shot_zone_area',
 'shot_made_flag',
 'minutes_remaining',
 'loc_y',
 'htm',
 'vtm',
 'shot_zone_basic',
 'season_type',
 'action_type',
 'shot_type',
 'scoremargin_before_shot',
 'game_date',
 'shot_distance',
 'seconds_remaining',
 'loc_x']

In [163]:
# Retype variables and report initial clean data
df_clean = df[view].copy()
df_clean['game_date'] = pd.to_datetime(df_clean['game_date'])
df_clean.dtypes

shot_zone_range                    object
shot_zone_area                     object
shot_made_flag                      int64
minutes_remaining                   int64
loc_y                               int64
htm                                object
vtm                                object
shot_zone_basic                    object
season_type                        object
action_type                        object
shot_type                          object
scoremargin_before_shot           float64
game_date                  datetime64[ns]
shot_distance                       int64
seconds_remaining                   int64
loc_x                               int64
dtype: object

## Majority Class Report
**If we were to predict the majority class it would be:**

In [164]:
df_clean.shot_made_flag.value_counts(normalize=True) 

0    0.527081
1    0.472919
Name: shot_made_flag, dtype: float64

The majority class is 0 (a miss), so always predicting a miss gives a baseline accuracy of about 52.7%.
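As a minimal sketch of how the baseline follows from the class shares above, the snippet below takes the two proportions reported by `value_counts(normalize=True)` and picks the larger one. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class shares copied from the value_counts output above.
class_share = pd.Series({0: 0.527081, 1: 0.472919}, name="shot_made_flag")

majority_class = class_share.idxmax()   # label that occurs most often
baseline_accuracy = class_share.max()   # accuracy of always guessing that label

print(f"Always predict {majority_class}: accuracy = {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this number on data it has not seen.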

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [165]:
df_clean.game_date.min()

Timestamp('2009-10-28 00:00:00')

In [166]:
from IPython.display import display

def split_train_test_custom(dataframe, date=pd.Timestamp(2018, 10, 1), target='shot_made_flag',
                            drop_list=['game_date', 'htm', 'vtm', 'shot_zone_range', 'shot_zone_area',
                                      'shot_type', 'shot_zone_basic', 'season_type']
                           ):
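    '''
    Returns X_train, y_train, X_test, y_test for supplied dataframe and target
    
    inputs:
    dataframe = pd.DataFrame, df to operate on
    date = pd.Timestamp, splits df on date
    target = str, name of the target column
    drop_list = list of str, columns dropped before modeling
    '''
    # NOTE: rows after `date` (the 2018-19 season) are assigned to `train` and
    # earlier rows to `test`, the reverse of the hold-out described above;
    # the shapes printed below reflect this assignment.
    train = dataframe[dataframe.game_date > date]
    test = dataframe[dataframe.game_date <= date]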
    train = train.drop(columns=drop_list)
    test = test.drop(columns=drop_list)
    X_train = train.drop(columns=target)
    y_train = train[target]
    X_test = test.drop(columns=target)
    y_test = test[target]
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = split_train_test_custom(df_clean)

display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1709, 7)

(1709,)

(12249, 7)

(12249,)
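One way to make a season-based hold-out explicit is to label every game with the year its season began and filter on that label. The sketch below assumes the October-to-June season described in the instructions; `season_start_year` and the toy rows are hypothetical and not part of the original notebook. Here the latest season goes to the test set and earlier seasons stay available for training.

```python
import pandas as pd

def season_start_year(dates: pd.Series) -> pd.Series:
    """Map each game date to the calendar year in which its season began."""
    # Assumption: a season runs from October through June.
    return dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)

# Tiny made-up frame, only to show the mechanics.
toy = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2017-11-02", "2018-05-30", "2018-10-16", "2019-06-05"]
    ),
    "shot_made_flag": [1, 0, 1, 0],
})
toy["season"] = season_start_year(toy["game_date"])

test = toy[toy["season"] == 2018]    # the 2018-19 season is held out
train = toy[toy["season"] < 2018]    # everything earlier is used for training
```

On the real data, checking `len(test)` against the 1,709 rows quoted in the instructions is a quick sanity test of the split direction.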

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?
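
The time-based and previous-shot ideas in the list above can be sketched as follows. This is an illustration on made-up rows, not the notebook's own feature code. The new column names are hypothetical, rows are assumed to be in chronological order within each game, and overtime periods are assumed to have no full periods left after them.

```python
import pandas as pd

# Made-up rows; the input column names follow the dataset shown earlier.
shots = pd.DataFrame({
    "game_id": [1, 1, 2],
    "period": [1, 4, 5],                # period 5 is overtime
    "minutes_remaining": [11, 0, 2],
    "seconds_remaining": [25, 30, 0],
    "shot_made_flag": [1, 0, 1],
})

# Seconds left in the current period.
shots["sec_left_period"] = (
    shots["minutes_remaining"] * 60 + shots["seconds_remaining"]
)

# Seconds left in the game: full 12-minute periods still to come,
# plus what is left of the current period.
periods_after = (4 - shots["period"]).clip(lower=0)
shots["sec_left_game"] = periods_after * 12 * 60 + shots["sec_left_period"]

# Result of the previous shot in the same game. The first shot of a game
# has no predecessor, which this sketch encodes as 0.
shots["made_prev_shot"] = (
    shots.groupby("game_id")["shot_made_flag"].shift(1).fillna(0).astype(int)
)
```

Note that this sketch reads `period` from the raw data, so that column would need to be kept rather than dropped before the features are built.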

    

In [167]:
# Engineer Homecourt Advantage, 'hmc_adv'
df_clean['hmc_adv'] = (df_clean.htm == 'GSW').astype(int)

In [168]:
# Engineer Opponent - Low importance
#df_clean['opp'] = (df_clean.htm + df_clean.vtm).str.split('GSW').str.join('')

In [169]:
# Engineer game year, month, and day of month from game_date
df_clean['year'] = df_clean.game_date.dt.year
df_clean['month'] = df_clean.game_date.dt.month
df_clean['day'] = df_clean.game_date.dt.day

In [170]:
# Engineer GameDay - Low Importance
#df_clean['day_name'] = df_clean.game_date.dt.day_name()

## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn or xgboost cross-validation method.
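
Two of the options above, the random 80/20 split and cross-validation, can be sketched with scikit-learn alone. The data here is synthetic and the decision tree is only a stand-in model; neither comes from the original notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Option: random 80/20 train/validate split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)

# Option: 5-fold cross-validation over all non-test rows.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=42),
    X, y, cv=5, scoring="accuracy",
)
print(val_accuracy, cv_scores.mean())
```

In both cases the held-out test season stays untouched until the final evaluation.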

In [171]:
# train/validate using cross_validation
X_train, y_train, X_test, y_test = split_train_test_custom(df_clean)
display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1709, 11)

(1709,)

(12249, 11)

(12249,)

## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree, Random Forest, or Gradient Boosting model.

In [172]:
cat_vars = X_train.dtypes[X_train.dtypes==object].index.tolist()
cat_vars

['action_type']

In [173]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import auc, accuracy_score, confusion_matrix
from category_encoders import OrdinalEncoder
import xgboost as xgb
import numpy as np

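# Pipeline: ordinal-encode the categorical column, then fit XGBoost.
# Note: the target is binary, for which objective="binary:logistic"
# (with no num_class) is the conventional XGBoost setting.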
pipeline = make_pipeline(
    OrdinalEncoder(cols=cat_vars),
    xgb.XGBClassifier(max_depth=3,
                      learning_rate=0.05,
                      verbosity=0,
                      n_estimators=100,
                      num_class=3,
                      objective="multi:softprob",
                      random_state=42)
)

pipeline.fit(X_train, y_train)
pipeline.steps

[('ordinalencoder', OrdinalEncoder(cols=['action_type'], drop_invariant=False,
                 handle_missing='value', handle_unknown='value',
                 mapping=[{'col': 'action_type', 'data_type': dtype('O'),
                           'mapping': Step Back Jump shot                 1
  Driving Floating Jump Shot          2
  Driving Layup Shot                  3
  Driving Reverse Layup Shot          4
  Jump Shot                           5
  Driving Floating Bank Jump Shot     6
  Cutting Layup Shot                  7
  Running Pull-Up Jump Shot           8
  Running Finger Roll Layup Sho...
  Fadeaway Jump Shot                 14
  Reverse Layup Shot                 15
  Cutting Finger Roll Layup Shot     16
  Running Jump Shot                  17
  Turnaround Fadeaway shot           18
  Turnaround Jump Shot               19
  Running Layup Shot                 20
  Jump Bank Shot                     21
  Turnaround Hook Shot               22
  Driving Hook Shot            

In [174]:
# Seeing how lightgbm handles this dataset
import lightgbm as lgb


# target needs to be an integer or float for some regressors

# LightGBM offers several boosting types: 'gbdt' (gradient-boosted decision
#   trees), 'dart' (dropout-regularized boosting, used here), and 'rf' (random forest)
pipeline_b = make_pipeline(
    OrdinalEncoder(cols=cat_vars),
    lgb.LGBMClassifier(
        boosting_type='dart',
        num_leaves=31,
        learning_rate=0.05,
        n_estimators=100, 
    )         
)
pipeline_b.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['action_type'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'action_type',
                                          'data_type': dtype('O'),
                                          'mapping': Step Back Jump shot                 1
Driving Floating Jump Shot          2
Driving Layup Shot                  3
Driving Reverse Layup Shot          4
Jump Shot                           5
Driving Floating Bank Jump Shot     6
Cutting Layup Shot                  7
Ru...
                 LGBMClassifier(boosting_type='dart', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.05, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min
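For comparison, the same encode-then-fit pattern can be sketched with scikit-learn alone. This is an illustration on made-up rows, not the notebook's pipeline: it swaps in scikit-learn's own `OrdinalEncoder` and a random forest, and it assumes scikit-learn 0.24 or later for the `handle_unknown="use_encoded_value"` option.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows with one categorical and one numeric column.
X = pd.DataFrame({
    "action_type": ["Jump Shot", "Driving Layup Shot",
                    "Jump Shot", "Step Back Jump shot"] * 5,
    "shot_distance": [26, 2, 18, 24] * 5,
})
y = pd.Series([0, 1, 0, 1] * 5)

encode = make_column_transformer(
    # Categories unseen during fit are mapped to -1 instead of raising.
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["action_type"]),
    remainder="passthrough",            # numeric columns pass through unchanged
)
pipe = make_pipeline(
    encode, RandomForestClassifier(n_estimators=50, random_state=42)
)
pipe.fit(X, y)

# "Hook Shot" never appeared in training, yet prediction still works.
new_shot = pd.DataFrame({"action_type": ["Hook Shot"], "shot_distance": [5]})
prediction = pipe.predict(new_shot)
```

Keeping the encoder inside the pipeline means the category mapping is learned only from the rows passed to `fit`, which avoids leaking information from validation or test rows.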

## 6. Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [175]:
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV


param_distributions_a = {
    'xgbclassifier__n_estimators': randint(50, 500), 
    'xgbclassifier__max_depth': randint(5,100),
}

param_distributions_b = {
    'lgbmclassifier__n_estimators': randint(50, 500), 
    'lgbmclassifier__num_leaves': randint(5,100),
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions_a, 
    n_iter=10, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=3
)
search_b = RandomizedSearchCV(
    pipeline_b, 
    param_distributions=param_distributions_b, 
    n_iter=10, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=3
)

search.fit(X_train, y_train);
#search_b.fit(X_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    8.5s
[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:   19.0s
[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   24.5s
[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:   36.2s
[Parallel(n_jobs=3)]: Done  30 out of  30 | elapsed:   46.1s finished


In [176]:
print('Best hyperparameters (XGBoost)', search.best_params_)
print('Cross-validation MAE (XGBoost)', -search.best_score_)

Best hyperparameters (XGBoost) {'xgbclassifier__max_depth': 78, 'xgbclassifier__n_estimators': 149}
Cross-validation MAE (XGBoost) 0.4148624926857812
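
The search above scores candidates with mean absolute error, while the task asks for accuracy. For labels and predictions that are all 0 or 1, every wrong prediction contributes exactly 1 to the error, so MAE is the misclassification rate and accuracy is `1 - MAE`. The short check below illustrates the identity on made-up labels and then applies it to the reported score, assuming every cross-validated prediction was 0 or 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])      # two of five predictions are wrong

mae = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# With 0/1 labels, MAE equals the error rate.
assert np.isclose(mae, 1 - acc)

# The same identity applied to the cross-validation MAE reported above.
reported_mae = 0.4148624926857812
implied_accuracy = 1 - reported_mae
print(round(implied_accuracy, 4))
```

Under that assumption the best candidate's cross-validated accuracy is roughly 0.585.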


In [177]:
# search_b.fit is commented out above, so no LightGBM search results exist;
# this cell re-reports the XGBoost search stored in `search`.
print('Best hyperparameters (XGBoost)', search.best_params_)
print('Cross-validation MAE (XGBoost)', -search.best_score_)

Best hyperparameters (XGBoost) {'xgbclassifier__max_depth': 78, 'xgbclassifier__n_estimators': 149}
Cross-validation MAE (XGBoost) 0.4148624926857812


In [178]:
pipeline_best = search.best_estimator_
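
A randomized search can also report validation accuracy directly by passing `scoring="accuracy"`. The sketch below is an assumption-laden illustration: it uses synthetic data and a plain decision tree rather than the notebook's XGBoost pipeline, and it tries 10 candidates over 3 hyperparameters.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                   # synthetic features
y = (X[:, 0] * X[:, 1] > 0).astype(int)         # synthetic binary target

search_acc = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={
        "max_depth": randint(2, 10),
        "min_samples_leaf": randint(1, 20),
        "min_samples_split": randint(2, 20),
    },
    n_iter=10,                  # number of sampled candidates
    cv=3,
    scoring="accuracy",         # best_score_ is a mean validation accuracy
    random_state=42,
)
search_acc.fit(X, y)
print(search_acc.best_params_, search_acc.best_score_)
```

With this scoring, `best_score_` can be compared with the majority-class baseline without any sign flip or conversion.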

## 7. Get and plot your model's feature importances.

In [179]:
importances = pd.DataFrame(X_train.columns)
importances['scores'] = pipeline_best.steps[1][1].feature_importances_
importances.sort_values(by='scores', ascending=False)

Unnamed: 0,0,scores
2,action_type,0.176653
7,hmc_adv,0.150094
1,loc_y,0.081904
6,loc_x,0.080645
4,shot_distance,0.07652
8,year,0.075553
0,minutes_remaining,0.075221
3,scoremargin_before_shot,0.074401
9,month,0.073561
5,seconds_remaining,0.070198


## 8. Get and display your model's permutation importances.

In [184]:
# Compute and display permutation importances
# (note: computed on the training rows, not on a held-out set)
from eli5.sklearn import PermutationImportance
import eli5
X_train = OrdinalEncoder(cols=cat_vars).fit_transform(X_train)
perm = PermutationImportance(pipeline_best).fit(X_train, y_train)
eli5.show_weights(perm, feature_names=X_train.columns.tolist())


Weight,Feature
0.0501  ± 0.0057,loc_x
0.0440  ± 0.0196,minutes_remaining
0.0429  ± 0.0114,loc_y
0.0322  ± 0.0133,day
0.0322  ± 0.0066,scoremargin_before_shot
0.0311  ± 0.0149,shot_distance
0.0290  ± 0.0071,seconds_remaining
0.0133  ± 0.0066,month
0.0042  ± 0.0070,year
0.0016  ± 0.0051,hmc_adv


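**Several of these features contribute little to the model and are candidates for removal.** For example, `hmc_adv` (0.0016 ± 0.0051) and `year` (0.0042 ± 0.0070) have weights that are within noise of zero.

See `drop_list` in `split_train_test_custom`.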
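
Permutation importances are usually most informative when computed on rows the model was not fitted on. The sketch below shows that on synthetic data. It swaps in scikit-learn's `permutation_importance` (available from scikit-learn 0.22) for eli5, and the rule of keeping features with positive mean importance is only an illustrative assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(400, 3)), columns=["signal", "noise_a", "noise_b"]
)
y = (X["signal"] > 0).astype(int)       # only "signal" carries information

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the validation rows and record
# how much the accuracy drops.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=5, random_state=42
)
perm_scores = pd.Series(result.importances_mean, index=X.columns)

# Illustrative selection rule: keep features with positive mean importance.
keep = perm_scores[perm_scores > 0].index.tolist()
print(perm_scores.sort_values(ascending=False), keep)
```

A list like `keep` could then be used to refit the model on fewer columns and compare validation accuracy before and after.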

## 9. Get your model's test accuracy

> (One time, at the end.)

In [181]:
# Check Accuracy - XGBoost Pipeline no tuning
y_pred = pipeline.predict(X_test)
display(confusion_matrix(y_test, y_pred), accuracy_score(y_test, y_pred))

array([[4425, 2020],
       [2638, 3166]], dtype=int64)

0.6197240591068659

In [182]:
# Check Accuracy - XGBoost Pipeline tuned via RandomizedSearchCV
y_pred = pipeline_best.predict(X_test)
display(confusion_matrix(y_test, y_pred), accuracy_score(y_test, y_pred))

array([[4001, 2444],
       [2397, 3407]], dtype=int64)

0.6047840640052249

In [183]:
# Check Accuracy - LightGBM (DART) Pipeline
y_pred = pipeline_b.predict(X_test)
display(confusion_matrix(y_test, y_pred), accuracy_score(y_test, y_pred)) 

array([[4282, 2163],
       [2496, 3308]], dtype=int64)

0.6196424197893705
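
The accuracy figures in this section can be read straight off the confusion matrices, and the same matrix gives the majority-class share of the evaluated rows for comparison. The sketch below does this for the matrix reported for the untuned XGBoost pipeline; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the untuned XGBoost pipeline, copied from the output
# above (rows are true classes, columns are predicted classes).
cm = np.array([[4425, 2020],
               [2638, 3166]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Share of the larger true class: the accuracy of always guessing it.
majority_share = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy={accuracy:.4f}, majority baseline={majority_share:.4f}")
```

On these rows the model's accuracy of about 0.620 sits roughly nine points above the 0.526 obtained by always predicting a miss.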