_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all of Steph Curry's NBA field goal attempts in regular season and playoff games, from October 28, 2009, through June 5, 2019.

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [160]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [214]:
# Read data
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

In [215]:
import category_encoders as ce
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [216]:
print(df.shape)
df.head(10)

(13958, 20)


Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0
5,20900015,277,Stephen Curry,2,0,34,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),Less Than 8 ft.,4,39,15,0,2009-10-28,GSW,HOU,Regular Season,4.0
6,20900015,413,Stephen Curry,4,10,26,Pullup Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-64,149,1,2009-10-28,GSW,HOU,Regular Season,-9.0
7,20900015,453,Stephen Curry,4,6,31,Pullup Jump shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,17,118,123,1,2009-10-28,GSW,HOU,Regular Season,-6.0
8,20900015,487,Stephen Curry,4,2,25,Pullup Jump shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,20,121,162,1,2009-10-28,GSW,HOU,Regular Season,-9.0
9,20900015,490,Stephen Curry,4,1,47,Pullup Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-125,134,1,2009-10-28,GSW,HOU,Regular Season,-7.0


In [217]:
df.dtypes

game_id                      int64
game_event_id                int64
player_name                 object
period                       int64
minutes_remaining            int64
seconds_remaining            int64
action_type                 object
shot_type                   object
shot_zone_basic             object
shot_zone_area              object
shot_zone_range             object
shot_distance                int64
loc_x                        int64
loc_y                        int64
shot_made_flag               int64
game_date                   object
htm                         object
vtm                         object
season_type                 object
scoremargin_before_shot    float64
dtype: object

In [218]:
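# Baseline: class proportions of the target. Always guessing the majority
# class (0 = missed) would be correct about 52.7% of the time.
df['shot_made_flag'].value_counts(normalize = True)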

0    0.527081
1    0.472919
Name: shot_made_flag, dtype: float64
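
The majority-class baseline can also be written as a small helper. This is a minimal sketch on made-up labels, not the shot data; the function name `majority_baseline_accuracy` and the toy list are assumptions for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy labels (made up for illustration): 1 = made, 0 = missed
toy_labels = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
label, acc = majority_baseline_accuracy(toy_labels)
print(label, acc)
```

Any model worth keeping should beat this number on the validation set, since the baseline needs no features at all.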

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [219]:
# Parse dates; infer_datetime_format is deprecated in pandas 2.0+ and ISO dates parse without it
df['game_date'] = pd.to_datetime(df['game_date'])
cutoff = pd.to_datetime('2018-10-01')
train = df[df.game_date < cutoff]
test  = df[df.game_date >= cutoff]
print(train.shape, test.shape)
test.tail()

(12249, 20) (1709, 20)


Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
13953,41800403,570,Stephen Curry,4,8,1,Pullup Jump shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,12,3,120,1,2019-06-05,GSW,TOR,Playoffs,-13.0
13954,41800403,573,Stephen Curry,4,7,16,Floating Jump shot,2PT Field Goal,Mid-Range,Right Side(R),8-16 ft.,11,114,-5,0,2019-06-05,GSW,TOR,Playoffs,-14.0
13955,41800403,602,Stephen Curry,4,5,27,Step Back Jump shot,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,26,-217,149,0,2019-06-05,GSW,TOR,Playoffs,-17.0
13956,41800403,608,Stephen Curry,4,4,50,Driving Floating Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),Less Than 8 ft.,7,59,49,0,2019-06-05,GSW,TOR,Playoffs,-16.0
13957,41800403,658,Stephen Curry,4,2,47,Jump Shot,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,24,-226,104,0,2019-06-05,GSW,TOR,Playoffs,-12.0
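
An alternative to a hard-coded cutoff date is to label every game with its season and filter on the label. The sketch below assumes, as the prompt states, that seasons begin in October and end in June; the helper name `nba_season` is hypothetical and not part of the notebook.

```python
from datetime import date

def nba_season(game_date):
    """Label a game date with its NBA season, e.g. '2018-19'.

    Assumes seasons start in October and end by June, so any month from
    July onward is assigned to the season that starts in that year.
    """
    start_year = game_date.year if game_date.month >= 7 else game_date.year - 1
    return f"{start_year}-{str(start_year + 1)[-2:]}"

first_game = nba_season(date(2009, 10, 28))  # first date in the dataset
last_game = nba_season(date(2019, 6, 5))     # last date in the dataset
print(first_game, last_game)
```

Because pandas timestamps also expose `.year` and `.month`, the same function could be applied to a parsed `game_date` column, which would make both the test split and a season-based validation split a matter of comparing labels.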


In [74]:
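# Time-based train/validate split (option 1), kept for reference but not used.
# The first attempt filtered `train` before selecting `val`, which is why the
# output below shows an empty validation frame (0, 20). Select `val` first:
# cutoff = pd.to_datetime('2017-10-01')
# val = train[train.game_date >= cutoff]
# train = train[train.game_date < cutoff]
# print(train.shape, val.shape)
# val.head()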

(11081, 20) (0, 20)


Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot


In [220]:
train, val = train_test_split(train, train_size=0.80, test_size=0.20,  
                              stratify=train['shot_made_flag'], random_state=42)
train.shape, val.shape

((9799, 20), (2450, 20))

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?
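
Three of the ideas in the list above (homecourt advantage, opponent, made previous shot) can be sketched as follows. The toy rows are made up, the function name `add_game_features` is hypothetical, and restricting "previous shot" to the same game is an assumption rather than something the prompt specifies.

```python
import pandas as pd

# Toy rows (made up) that reuse the column names of the shot data
toy = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'game_event_id': [4, 17, 53, 8, 30],
    'htm': ['GSW', 'GSW', 'GSW', 'MIL', 'MIL'],
    'vtm': ['HOU', 'HOU', 'HOU', 'GSW', 'GSW'],
    'shot_made_flag': [0, 1, 0, 1, 1],
})

def add_game_features(df):
    df = df.copy()
    # Homecourt advantage: 1 when Golden State is the home team
    df['homecourt'] = (df['htm'] == 'GSW').astype(int)
    # Opponent: the visiting team when GSW is home, otherwise the home team
    df['opponent'] = df['vtm'].where(df['htm'] == 'GSW', df['htm'])
    # Made previous shot: outcome of the preceding shot in the same game,
    # NaN for the first shot of each game (rows are sorted into shot order)
    df = df.sort_values(['game_id', 'game_event_id'])
    df['made_previous_shot'] = df.groupby('game_id')['shot_made_flag'].shift(1)
    return df

result = add_game_features(toy)
print(result[['homecourt', 'opponent', 'made_previous_shot']])
```

A feature that looks at the preceding row needs the shots in their original order, so with a random train/validate split it would have to be computed on the full frame before splitting.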

    

In [221]:
train.describe(exclude = 'number')

Unnamed: 0,player_name,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,game_date,htm,vtm,season_type
count,9799,9799,9799,9799,9799,9799,9799,9799,9799,9799
unique,1,51,2,7,6,5,712,32,32,2
top,Stephen Curry,Jump Shot,2PT Field Goal,Above the Break 3,Center(C),24+ ft.,2013-04-12 00:00:00,GSW,GSW,Regular Season
freq,9799,4742,5155,3884,4292,4576,27,4845,4954,8414
first,,,,,,,2009-10-28 00:00:00,,,
last,,,,,,,2018-06-08 00:00:00,,,


In [222]:
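def features(df):
    """Add time-remaining features, encode two categoricals, drop unused columns."""
    df = df.copy()

    # Combine minutes and seconds into one column: seconds left in the period.
    # Note: minutes_remaining is overwritten here and holds seconds afterwards.
    df['minutes_remaining'] = df['minutes_remaining'] * 60
    df['total_sec'] = df['minutes_remaining'] + df['seconds_remaining']

    # Seconds left in the game, assuming 4 periods of 12 minutes each
    # (overtime periods, 5 and up, are not handled separately).
    df['total_game_sec'] = (4 - df['period']) * 12 * 60 + df['total_sec']

    # Integer codes for shot type and shot range. The range codes are
    # arbitrary labels, not ordered by distance.
    df['shot_type'] = df['shot_type'].replace({'2PT Field Goal': 2, '3PT Field Goal': 3})
    df['shot_zone_range'] = df['shot_zone_range'].replace({'24+ ft.': 1, 'Less Than 8 ft.': 2,
                                                           '16-24 ft.': 3, '8-16 ft.': 4,
                                                           'Back Court Shot': 5})

    # A homecourt flag for htm == 'GSW' was drafted here but never applied,
    # so htm and vtm stay as raw team codes for the encoder.

    # Drop identifier and date columns that are not used as features
    irrelevant = ['game_id', 'game_event_id', 'player_name', 'game_date']
    df = df.drop(columns = irrelevant)
    return df

train = features(train)
val = features(val)
test = features(test)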

In [223]:
train.head()

Unnamed: 0,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,htm,vtm,season_type,scoremargin_before_shot,total_sec,total_game_sec
4723,3,60,32,Layup Shot,2,Restricted Area,Center(C),2,1,-13,-2,0,MIL,GSW,Regular Season,17.0,92,812
4213,1,600,47,Pullup Jump shot,3,Above the Break 3,Left Side Center(LC),1,27,-180,205,1,GSW,OKC,Regular Season,-1.0,647,2807
3356,4,360,43,Step Back Jump shot,2,Mid-Range,Left Side(L),4,15,-136,77,1,GSW,SAS,Regular Season,-5.0,403,403
4273,1,300,13,Pullup Bank shot,2,Mid-Range,Left Side Center(LC),3,16,-111,121,1,GSW,POR,Regular Season,-7.0,313,2473
8146,3,480,21,Jump Shot,2,In The Paint (Non-RA),Center(C),2,6,-2,62,0,DET,GSW,Regular Season,-17.0,501,1221


In [224]:
train['shot_zone_range'].value_counts()

1    4576
2    2413
3    1797
4     945
5      68
Name: shot_zone_range, dtype: int64
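
The `total_game_sec` formula assumes four 12-minute periods, so a shot taken in overtime (period 5 or higher, if the data contains any) would get a negative contribution from `(4 - period)`. A minimal sketch of one way to guard against that follows; the function name and the choice to count only the current period in overtime are assumptions.

```python
def seconds_remaining_in_game(period, minutes_remaining, seconds_remaining):
    """Seconds left in the game, treating overtime periods (5+) separately.

    Assumption: in overtime the game is taken to end with the current
    period, so only the time left in that period counts.
    """
    period_seconds = minutes_remaining * 60 + seconds_remaining
    if period > 4:
        return period_seconds
    return (4 - period) * 12 * 60 + period_seconds

# Period 1 with 10:47 left and period 3 with 1:32 left
print(seconds_remaining_in_game(1, 10, 47))
print(seconds_remaining_in_game(3, 1, 32))
# A hypothetical overtime shot with 2:00 left
print(seconds_remaining_in_game(5, 2, 0))
```

The first two calls reproduce the `total_game_sec` values 2807 and 812 shown in `train.head()` above, and the overtime call returns 120 rather than a negative number.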

## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [225]:
target = 'shot_made_flag'
X_train = train.drop(columns = target)
y_train = train[target]
X_val = val.drop(columns = target)
y_val = val[target]
X_test = test.drop(columns = target)
y_test = test[target]

## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

In [226]:

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names = True),
    SimpleImputer(strategy = 'mean'),
    RandomForestClassifier(random_state = 42, n_estimators = 200, n_jobs = -1)
)
pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(cols=['action_type', 'shot_zone_basic',
                                     'shot_zone_area', 'htm', 'vtm',
                                     'season_type'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='value', return_df=True,
                               use_cat_names=True, verbose=0)),
                ('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan,...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,


In [227]:
from sklearn.model_selection import cross_val_score
k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv = k, 
                        scoring = 'accuracy')
print(f'accuracy score {k} folds: ', scores)

accuracy score 3 folds:  [0.64432201 0.642376   0.65278628]


In [228]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
pipeline = make_pipeline(
    ce.OneHotEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state = 42)
)
param_distrib = {
    'simpleimputer__strategy': ['median', 'mean'],
    'randomforestclassifier__n_estimators' : randint(200, 500),
    'randomforestclassifier__max_depth' : [10, 15, None],
    'randomforestclassifier__max_features': uniform(0, 1),
}

In [229]:
search = RandomizedSearchCV(
    pipeline,
    param_distributions = param_distrib,
    n_iter = 5,
    cv = 4,
    scoring = 'accuracy',
    verbose = 10,
    return_train_score = True,
    n_jobs = 7
)
search.fit(X_train, y_train)

Fitting 4 folds for each of 5 candidates, totalling 20 fits


[Parallel(n_jobs=7)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done   4 tasks      | elapsed:   24.0s
[Parallel(n_jobs=7)]: Done  10 out of  20 | elapsed:   40.6s remaining:   40.6s
[Parallel(n_jobs=7)]: Done  13 out of  20 | elapsed:   52.2s remaining:   28.0s
[Parallel(n_jobs=7)]: Done  16 out of  20 | elapsed:  1.0min remaining:   15.0s
[Parallel(n_jobs=7)]: Done  20 out of  20 | elapsed:  1.1min finished


RandomizedSearchCV(cv=4, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('onehotencoder',
                                              OneHotEncoder(cols=None,
                                                            drop_invariant=False,
                                                            handle_missing='value',
                                                            handle_unknown='value',
                                                            return_df=True,
                                                            use_cat_names=False,
                                                            verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            fill_value=None,
 

In [230]:
print('param', search.best_params_)
print('score', search.best_score_)

param {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__max_features': 0.8648964150646811, 'randomforestclassifier__n_estimators': 296, 'simpleimputer__strategy': 'mean'}
score 0.6612905309122424
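
The stretch goal asks for 10+ candidates, and the search above samples 5 without a fixed seed. The sketch below shows the same kind of search with `n_iter=10` and a `random_state` on the search itself. It runs on synthetic data from `make_classification`, not the shot data, and the parameter ranges are illustrative assumptions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the shot data), small so the search runs fast
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

param_distributions = {
    'n_estimators': randint(20, 60),
    'max_depth': [3, 5, None],
    'max_features': uniform(0.1, 0.9),  # samples floats from [0.1, 1.0]
}

search_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # 10 sampled candidates
    cv=3,
    scoring='accuracy',
    random_state=42,    # makes the sampled candidates repeatable
)
search_demo.fit(X, y)
print(len(search_demo.cv_results_['params']))
print(search_demo.best_params_)
```

Inside a pipeline the keys would carry the step prefix, as in `randomforestclassifier__max_depth` above. Fixing `random_state` on the search means a rerun samples the same candidates, which makes the reported best score easier to reproduce.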


## 6. Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [231]:
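# Caution: cross_val_score refits a fresh copy of the untuned pipeline on
# folds of the validation set, so these scores do not measure the model
# fitted on X_train. The next cell scores the tuned model on X_val.
k = 4
scores = cross_val_score(pipeline, X_val, y_val, cv = k, 
                        scoring = 'accuracy')
print(f'accuracy score {k} folds: ', scores)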

accuracy score 4 folds:  [0.63458401 0.63784666 0.62908497 0.61437908]


In [232]:
from sklearn.metrics import accuracy_score
pipeline = search.best_estimator_
val_pred = pipeline.predict(X_val)
score = accuracy_score(y_val, val_pred)
print(score)

0.6510204081632653


## 7. Get your model's test accuracy

> (One time, at the end.)

In [234]:
from sklearn.metrics import accuracy_score
pipeline = search.best_estimator_
test_pred = pipeline.predict(X_test)
score = accuracy_score(y_test, test_pred)
print(score)

0.6278525453481568


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid">36</td>
  </tr>
</table>

### Calculate accuracy 

In [235]:
# Correct predictions sit on the diagonal: 85 true negatives + 36 true positives
correct_pred = 85 + 36
total_pred = 85 + 58 + 8 + 36
accuracy = correct_pred / total_pred
accuracy

0.6470588235294118

### Calculate precision

In [237]:
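# Precision: of everything predicted positive (58 false + 36 true), the share that is truly positive
true_pos = 36
pred_pos = 58 + 36
precision = true_pos / pred_pos
precision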

0.3829787234042553

### Calculate recall

In [238]:
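# Recall: of all actual positives (8 missed + 36 found), the share the model caught
# (true_pos is defined in the precision cell above)
act_pos = 8 + 36
recall = true_pos / act_pos
recall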

0.8181818181818182
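
The three calculations can be collected into one helper that takes the four cells of the confusion matrix. This is a minimal sketch; the function name `binary_metrics` is an assumption, and the inputs are the counts from the table above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from the cells of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total   # share of all predictions that are correct
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    recall = tp / (tp + fn)        # share of actual positives that are found
    return accuracy, precision, recall

# Cells read from the table: rows are actual, columns are predicted
acc, prec, rec = binary_metrics(tn=85, fp=58, fn=8, tp=36)
print(acc, prec, rec)
```

Naming the cells makes the trade-off visible: this hypothetical model finds most of the actual positives (high recall) but labels many negatives as positive (low precision).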