<a href="https://colab.research.google.com/github/pingao2019/DS-Unit-2-Kaggle-Challenge/blob/master/PAcopy_of_DS_Sprint_Challenge_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [0]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab (run this before importing category_encoders)
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [0]:
import numpy as np
import pandas as pd

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import cm

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [0]:
# Read data
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

In [0]:
df.head()


Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0


In [0]:
df.shape

(13958, 21)

In [0]:
df.isnull().sum()

game_id                    0
game_event_id              0
player_name                0
period                     0
minutes_remaining          0
seconds_remaining          0
action_type                0
shot_type                  0
shot_zone_basic            0
shot_zone_area             0
shot_zone_range            0
shot_distance              0
loc_x                      0
loc_y                      0
shot_made_flag             0
game_date                  0
htm                        0
vtm                        0
season_type                0
scoremargin_before_shot    0
dtype: int64

In [0]:
df.columns.to_list()

['game_id',
 'game_event_id',
 'player_name',
 'period',
 'minutes_remaining',
 'seconds_remaining',
 'action_type',
 'shot_type',
 'shot_zone_basic',
 'shot_zone_area',
 'shot_zone_range',
 'shot_distance',
 'loc_x',
 'loc_y',
 'shot_made_flag',
 'game_date',
 'htm',
 'vtm',
 'season_type',
 'scoremargin_before_shot']

In [0]:
df=df.drop(['game_event_id',
 'player_name', 'shot_zone_range'], axis=1)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



The target to predict is `shot_made_flag`. The cells below compute the baseline accuracy.

## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [0]:
# shot_made_flag is 0/1, so its mean is the share of shots made
made_rate = df['shot_made_flag'].mean()
# Accuracy of always guessing the majority class
baseline = max(made_rate, 1 - made_rate)
print(f'Share of shots made: {made_rate:,.3f}')
print(f'Baseline accuracy (always guess the majority class): {baseline:,.3f}')

Share of shots made: 0.473
Baseline accuracy (always guess the majority class): 0.527


### Only 47.3% of shots were made, so guessing the majority class (missed, `0`) for every prediction gives a baseline accuracy of about 52.7%, without fitting any model.
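The majority-class baseline can also be read directly from the class shares. The sketch below uses a small made-up target series (not the real dataset) and assumes only that the target is coded 0/1 like `shot_made_flag`.

```python
import pandas as pd

# Hypothetical toy target: 1 = made, 0 = missed (values are made up)
y = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], name='shot_made_flag')

# Share of each class; the largest share is the majority-class accuracy
class_shares = y.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
```

On the real data, replacing `y` with `df['shot_made_flag']` should give the same kind of summary.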

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?
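The listed features could be sketched roughly as follows. This is an illustration on made-up rows that reuse the dataset's column names. The helper name `engineer_features`, the new column names, the choice to treat overtime periods as having no regulation periods left, and the choice to look back only within the same game for the previous shot are all assumptions, not part of the original notebook. It also assumes rows are already in chronological order.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names (values are made up)
shots = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'period': [1, 2, 4, 1, 5],
    'minutes_remaining': [11, 0, 2, 6, 3],
    'seconds_remaining': [25, 30, 0, 2, 10],
    'htm': ['GSW', 'GSW', 'GSW', 'HOU', 'HOU'],
    'vtm': ['HOU', 'HOU', 'HOU', 'GSW', 'GSW'],
    'shot_made_flag': [0, 1, 1, 0, 1],
})

def engineer_features(df):
    """Return a copy of df with the five suggested features added."""
    df = df.copy()
    # Homecourt advantage: 1 when Golden State is the home team
    df['homecourt'] = (df['htm'] == 'GSW').astype(int)
    # Opponent: the visiting team at home games, otherwise the home team
    df['opponent'] = df['vtm'].where(df['htm'] == 'GSW', df['htm'])
    # Seconds remaining in the period
    df['seconds_remaining_in_period'] = (
        df['minutes_remaining'] * 60 + df['seconds_remaining'])
    # Seconds remaining in the game: 4 periods of 12 minutes each;
    # overtime periods (period > 4) are assumed to have no periods left
    periods_left = (4 - df['period']).clip(lower=0)
    df['seconds_remaining_in_game'] = (
        periods_left * 12 * 60 + df['seconds_remaining_in_period'])
    # Made previous shot: previous row within the same game, 0 for the first shot
    df['made_previous_shot'] = (
        df.groupby('game_id')['shot_made_flag'].shift(1).fillna(0).astype(int))
    return df

features = engineer_features(shots)
print(features[['homecourt', 'opponent', 'seconds_remaining_in_game',
                'made_previous_shot']])
```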

    

In [0]:
#only one player, so discard this column.
df['player_name'].value_counts()

Stephen Curry    13958
Name: player_name, dtype: int64

In [0]:
# New feature: total number of seconds remaining in the period.
df['total_secondsremaining_period'] = df['minutes_remaining'] * 60 + df['seconds_remaining']
df.shape



(13958, 18)

In [0]:
df= df.drop('minutes_remaining', axis= 1)


In [0]:
df= df.drop('seconds_remaining', axis= 1)


In [0]:
df.head()

Unnamed: 0,game_id,period,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot,total_secondsremaining_period
0,20900015,1,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0,685
1,20900015,1,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0,571
2,20900015,1,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0,362
3,20900015,2,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0,589
4,20900015,2,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0,139


In [0]:
df.columns.to_list()
 

['game_id',
 'period',
 'action_type',
 'shot_type',
 'shot_zone_basic',
 'shot_zone_area',
 'shot_distance',
 'loc_x',
 'loc_y',
 'shot_made_flag',
 'game_date',
 'htm',
 'vtm',
 'season_type',
 'scoremargin_before_shot',
 'total_secondsremaining_period']

## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [0]:
from datetime import datetime

In [0]:
type(df['game_date'])

pandas.core.series.Series

In [0]:
 
df['game_date'] = pd.to_datetime(df['game_date'])

In [0]:
df['game_date']

0       2009-10-28
1       2009-10-28
2       2009-10-28
3       2009-10-28
4       2009-10-28
           ...    
13953   2019-06-05
13954   2019-06-05
13955   2019-06-05
13956   2019-06-05
13957   2019-06-05
Name: game_date, Length: 13958, dtype: datetime64[ns]

#### Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season. You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.

The 2017–18 NBA season began on Tuesday, October 17, 2017 and ended on Sunday, June 17, 2018.

In [0]:
cutoff = pd.to_datetime('2017-10-01')
train = df[((df['game_date'] < cutoff)&
           (df['game_date'] >= '2009-10-01'))]
           
train.shape

(11081, 21)

In [0]:
val=df[((df['game_date'] >  cutoff)&
           (df['game_date'] < '2018-06-17'))]

In [0]:
val.shape

(1168, 21)
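A season-based split, including the 2018-19 test set from step 2, could be sketched by labelling each shot with the year its season started. The toy dates below are made up, and `season_start_year` is a hypothetical helper. It relies on the assignment's statement that NBA seasons begin in October and end in June.

```python
import pandas as pd

# Hypothetical toy rows (dates and outcomes are made up)
shots = pd.DataFrame({
    'game_date': pd.to_datetime([
        '2016-11-01', '2017-05-20', '2017-10-17',
        '2018-06-08', '2018-10-16', '2019-06-05']),
    'shot_made_flag': [1, 0, 1, 1, 0, 1],
})

def season_start_year(dates):
    """Map each date to the year its season started.

    October through December belong to that year's season;
    January through September belong to the previous year's season.
    """
    return dates.dt.year - (dates.dt.month < 10).astype(int)

season = season_start_year(shots['game_date'])

train = shots[season <= 2016]   # 2009-10 through 2016-17
val = shots[season == 2017]     # 2017-18
test = shots[season == 2018]    # 2018-19

print(train.shape, val.shape, test.shape)
```

Labelling by season keeps the three sets disjoint and avoids hand-written date boundaries for each split.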

### Train/validate/test split: random 80/20% train/validate split.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
train, test = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['game_date'], random_state=42)

train.shape, test.shape 

((8864, 21), (2217, 21))

In [0]:
#import seaborn as sns


In [0]:
# Arrange data into X features matrix and y target vector
# ('shot_zone_range' is omitted because it was dropped from df earlier)
features = [
    'period',
    'action_type',
    'shot_type',
    'shot_zone_basic',
    'shot_zone_area',
    'shot_distance',
    'loc_x',
    'loc_y',
    'htm',
    'vtm',
    'season_type',
    'scoremargin_before_shot',
    'total_secondsremaining_period',
]
target = 'shot_made_flag'
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]
 

## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

In [0]:
y_train.value_counts()

0    4691
1    4173
Name: shot_made_flag, dtype: int64

### Method 1: DecisionTreeClassifier

In [0]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

dt = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(), 
    DecisionTreeClassifier(random_state=42)
)

# Fit on train, score on val and on train
dt.fit(X_train, y_train)
score = dt.score(X_val, y_val)
print('Decision Tree, Validation Accuracy', score)
print('Train Accuracy', dt.score(X_train, y_train))

Decision Tree, Validation Accuracy 0.5419520547945206
Train Accuracy 1.0


In [0]:
y_pred = dt.predict(X_test)

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

In [0]:
# Plot the decision tree's feature importances
model = dt.named_steps['decisiontreeclassifier']
encoder = dt.named_steps['onehotencoder']
encoded_columns = encoder.transform(X_val).columns
importances = pd.Series(model.feature_importances_, encoded_columns)
plt.figure(figsize=(10,30))
importances.sort_values().plot.barh()

### Method 2: RandomForestClassifier

In [0]:
#import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(random_state=0, n_jobs=-1)
)

# Fit on train, then check accuracy on the training data
pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))

Train Accuracy 1.0


### Method 3: RandomizedSearchCV

In [0]:
import category_encoders as ce
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [0]:
encoder= ce.OneHotEncoder(use_cat_names=True)
x_train_encoded = encoder.fit_transform(X_train)


In [0]:
y_train.value_counts(normalize=True)

0    0.529219
1    0.470781
Name: shot_made_flag, dtype: float64

In [0]:
# Use a separate name so the fitted random forest `pipeline` above is not overwritten
search_pipeline = make_pipeline(
    StandardScaler(),
    SimpleImputer(), 
    RandomForestClassifier()
)

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__max_depth': [10, 20, 30, 40], 
    'randomforestclassifier__min_samples_leaf': [1,3,5]
}

search = RandomizedSearchCV(
    search_pipeline, 
    param_distributions=param_distributions, 
    n_iter=30, 
    cv=3, 
    scoring='accuracy',  # classification metric rather than a regression error
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(x_train_encoded, y_train);
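Once a search has been fitted, its results can be read from `best_params_`, `best_score_`, and `best_estimator_`. The sketch below is a minimal, self-contained illustration on synthetic data from `make_classification`, not the notebook's shot data. The smaller grid, `n_estimators=20`, and `n_iter=10` are assumptions chosen only to keep it quick. With `scoring='accuracy'`, `best_score_` is on the same scale as the validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the encoded training data (not the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=20, random_state=42),
)

# 2 * 3 * 3 = 18 possible combinations; 10 of them are sampled
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': [3, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 3, 5],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
search.fit(X, y)

print('Best hyperparameters:', search.best_params_)
print('Cross-validation accuracy:', search.best_score_)

# The best pipeline, refit on all of X and y, ready for .score() or .predict()
best_pipeline = search.best_estimator_
```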

## 6. Get your model's validation accuracy
 

In [0]:
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.586472602739726


### The random forest's validation accuracy is 0.586, better than the decision tree's 0.542.

## 7. Get your model's test accuracy

> (One time, at the end.)

In [0]:
# Score the pipeline that was fitted on the training data.
# Refitting on the test set would leak the answers and report a meaningless 1.0.
print('Test Accuracy', pipeline.score(X_test, y_test))


In [0]:
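# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

In [0]:
ConfusionMatrixDisplay.from_estimator(pipeline, X_val, y_val, values_format='.0f', xticks_rotation='vertical');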

## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 

In [0]:
total_predictions = 85 + 58 + 8 + 36 
total_predictions

187

In [0]:
# How many correct predictions were made? (true negatives + true positives)
correct_predictions = 85 + 36
correct_predictions

121

In [0]:
correct_predictions / total_predictions

0.6470588235294118

### Accuracy = 64.7%

### Calculate precision

In [0]:
from sklearn.metrics import classification_report
  

In [0]:
correct_predictions_positive =  36

In [0]:
total_predictions_positive =58+36
total_predictions_positive

94

In [0]:
# Precision: of all predicted positives, the share that were actually positive
precision = correct_predictions_positive / total_predictions_positive
precision

0.3829787234042553

### Precision = 38.3%

### Calculate recall

In [0]:
actual_positive = 8+36
actual_positive

44

In [0]:
# Recall: of all actual positives, the share that were predicted positive
recall_for_positive = correct_predictions_positive / actual_positive
recall_for_positive

0.8181818181818182

### Recall = 81.8%
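The three calculations can be collected into one small helper. This is a sketch: `classification_metrics` is a hypothetical function name, and the arguments follow the table's layout (rows are actual classes, columns are predicted classes).

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from binary confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,   # correct predictions / all predictions
        'precision': tp / (tp + fp),     # true positives / predicted positives
        'recall': tp / (tp + fn),        # true positives / actual positives
    }

# Counts taken from the confusion matrix above
metrics = classification_metrics(tn=85, fp=58, fn=8, tp=36)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
# accuracy: 0.647
# precision: 0.383
# recall: 0.818
```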