# Kaggle: Kobe Bryant Shot Selection (Model)

This data contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. Your task is to predict whether the basket went in (shot_made_flag).

We have removed 5000 of the shot_made_flags (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.

To avoid leakage, your method should only train on events that occurred prior to the shot for which you are predicting! Since this is a playground competition with public answers, it's up to you to abide by this rule.

https://www.kaggle.com/c/kobe-bryant-shot-selection/data

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import style; style.use('ggplot')

In [2]:
# load data and sort from earliest -> latest game
data = pd.read_csv('data.csv')
data = data.sort_values(by='game_date')
data.head()

| | action_type | combined_shot_type | game_event_id | game_id | lat | loc_x | loc_y | lon | minutes_remaining | period | ... | shot_type | shot_zone_area | shot_zone_basic | shot_zone_range | team_id | team_name | game_date | matchup | opponent | shot_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22901 | Jump Shot | Jump Shot | 102 | 29600027 | 33.9283 | -140 | 116 | -118.4098 | 0 | 1 | ... | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 1996-11-03 | LAL vs. MIN | MIN | 22902 |
| 22902 | Jump Shot | Jump Shot | 127 | 29600031 | 33.9473 | -131 | 97 | -118.4008 | 10 | 2 | ... | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 1996-11-05 | LAL @ NYK | NYK | 22903 |
| 22903 | Jump Shot | Jump Shot | 124 | 29600044 | 33.8633 | -142 | 181 | -118.4118 | 8 | 2 | ... | 3PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 1996-11-06 | LAL @ CHH | CHA | 22904 |
| 22904 | Jump Shot | Jump Shot | 144 | 29600044 | 34.0443 | 0 | 0 | -118.2698 | 6 | 2 | ... | 3PT Field Goal | Center(C) | Restricted Area | Less Than 8 ft. | 1610612747 | Los Angeles Lakers | 1996-11-06 | LAL @ CHH | CHA | 22905 |
| 22905 | Jump Shot | Jump Shot | 151 | 29600044 | 33.9063 | -10 | 138 | -118.2798 | 5 | 2 | ... | 2PT Field Goal | Center(C) | In The Paint (Non-RA) | 8-16 ft. | 1610612747 | Los Angeles Lakers | 1996-11-06 | LAL @ CHH | CHA | 22906 |


## Plan:
In visualization notebook:
- separate train and test data
    - use all labeled data for train, all unlabeled for test
    - remember to perform all preprocessing on BOTH train and test data together
- explore train data
    - determine trends
    - make visualizations

In model notebook:
- determine relevant features
- preprocess relevant features
    - BOTH train and test together
- determine model type
    - test various models using cross validation
        - break train set into train and validation sets
    - record best model
- make predictions on test set using best model

In [3]:
""" THIS SHOULD BE DONE AFTER PREPROCESSING ALL OF THE DATA """;
# # separate train and test data
# train = data[np.isnan(data.shot_made_flag) == False]
# test = data[np.isnan(data.shot_made_flag)]

# # remove shot_made_flag column from test (all nan)
# test = test.drop(['shot_made_flag'], axis=1)

In [4]:
list(data)

['action_type',
 'combined_shot_type',
 'game_event_id',
 'game_id',
 'lat',
 'loc_x',
 'loc_y',
 'lon',
 'minutes_remaining',
 'period',
 'playoffs',
 'season',
 'seconds_remaining',
 'shot_distance',
 'shot_made_flag',
 'shot_type',
 'shot_zone_area',
 'shot_zone_basic',
 'shot_zone_range',
 'team_id',
 'team_name',
 'game_date',
 'matchup',
 'opponent',
 'shot_id']

## Features to save:
- action_type/combined_shot_type: type of shot
    - yes, only use one
    - one-hot encode from string
- game_event_id/game_id: specific game when shot was taken
    - maybe, only use game_id
        - 1558 values: corresponds to total number of games Kobe played
    - one hot encode from int
- lat/lon: latitude and longitude
    - maybe -- is this where in the country the game occurred?
- loc_x/loc_y: x and y coordinates on court where shot was taken
    - yes
- minutes_remaining
    - yes
    - no preprocessing?
- period: quarter (or overtime, up to triple overtime (7))
    - yes
    - one hot encode from int
- playoffs
    - yes
    - already binary (0/1), so no encoding needed
- season: date range in years
    - yes
    - one-hot encode from string
- seconds_remaining
    - maybe -- ranges from 0-59, presumably the seconds left within the current minute of the period
- shot_distance: from basket
    - yes
    - no preprocessing
- shot_type: 2pt or 3pt
    - yes
    - one hot encode
- shot_zone_area: left, right, center
    - yes
    - one hot encode
- shot_zone_basic: paint, mid-range, etc.
    - yes
    - one hot encode
- shot_zone_range
    - maybe (watch for double counting with distance)
    - one hot encode
- team_id/team_name: Lakers (single value)
    - no
- game_date: year, month, day
    - no
- matchup/opponent
    - yes
    - matchup: opponent AND home (vs) or away (@)
    - opponent: just opponent
- shot_id: unique value for every shot
    - no
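
The matchup and seconds_remaining notes above could be turned into numeric features along these lines. This is only a sketch on made-up toy rows: the column names match data.csv, but `is_home` and `seconds_left_in_period` are hypothetical names, and it assumes seconds_remaining counts seconds within the current minute (consistent with its 0-59 range).

```python
import pandas as pd

# Toy rows with the same column names as data.csv (values are made up).
toy = pd.DataFrame({
    'matchup': ['LAL vs. MIN', 'LAL @ NYK', 'LAL @ CHH'],
    'minutes_remaining': [0, 10, 8],
    'seconds_remaining': [42, 5, 0],
})

# Home games are written 'LAL vs. XXX', away games 'LAL @ XXX'.
toy['is_home'] = toy.matchup.str.contains('vs', regex=False).astype(int)

# Assumption: seconds_remaining is the seconds part of the period clock,
# so the two columns combine into total seconds left in the period.
toy['seconds_left_in_period'] = toy.minutes_remaining * 60 + toy.seconds_remaining

print(toy[['is_home', 'seconds_left_in_period']])
```

With a home/away flag in place, the opponent column alone would cover the rest of what matchup encodes.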

## Label:
- shot_made_flag is the label column

## Preprocessing
Remember to apply preprocessing on ALL of the data.

In [5]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()

In [6]:
# combined_shot_type: one-hot encode from string
combined_shot_type = one_hot_encoder.fit_transform(data.combined_shot_type.values.reshape(-1, 1)).toarray()

In [7]:
# period: one-hot encode from int
period = one_hot_encoder.fit_transform(data.period.values.reshape(-1, 1)).toarray()

In [8]:
# season: one-hot encode from string
season = one_hot_encoder.fit_transform(data.season.values.reshape(-1, 1)).toarray()

In [9]:
# shot_type: one-hot encode from string
shot_type = one_hot_encoder.fit_transform(data.shot_type.values.reshape(-1, 1)).toarray()

In [10]:
# shot_zone_area: one-hot encode from string
shot_zone_area = one_hot_encoder.fit_transform(data.shot_zone_area.values.reshape(-1, 1)).toarray()

In [11]:
# shot_zone_basic: one-hot encode from string
shot_zone_basic = one_hot_encoder.fit_transform(data.shot_zone_basic.values.reshape(-1, 1)).toarray()

In [12]:
# put all relevant features together
X = np.hstack((combined_shot_type,
               period,
               data.playoffs.values.reshape(-1, 1),
               season,
               data.shot_distance.values.reshape(-1, 1),
               shot_type,
               shot_zone_area,
               shot_zone_basic))

# store labels (only the rows where shot_made_flag is present)
y = data.shot_made_flag.dropna().values

In [13]:
X.shape, y.shape

((30697, 50), (25697,))
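
The per-column encoding cells above could also be collapsed into a single call, since `OneHotEncoder` accepts several columns at once. A minimal sketch on made-up toy rows follows; the `categorical` and `numeric` lists are illustrative, and the resulting column order differs from the `np.hstack` order used in this notebook (all one-hot columns come first here).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows with a subset of the data.csv columns (values are made up).
toy = pd.DataFrame({
    'combined_shot_type': ['Jump Shot', 'Dunk', 'Jump Shot', 'Layup'],
    'period': [1, 2, 2, 4],
    'shot_type': ['2PT Field Goal', '2PT Field Goal',
                  '3PT Field Goal', '2PT Field Goal'],
    'playoffs': [0, 0, 1, 0],
    'shot_distance': [15, 0, 25, 2],
})

categorical = ['combined_shot_type', 'period', 'shot_type']
numeric = ['playoffs', 'shot_distance']

# One encoder handles every categorical column; each column gets its own
# block of indicator columns in the output.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[categorical]).toarray()

# Append the columns that need no encoding.
X_toy = np.hstack((encoded, toy[numeric].values))
print(X_toy.shape)
```

Keeping the fitted `encoder` around also makes `encoder.categories_` available for mapping columns back to category names.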

## Train/Validation/Test Split

In [14]:
# train/test split: labeled rows are train, unlabeled rows are test
labeled = data.shot_made_flag.notna().values
X_train = X[labeled]
X_test = X[~labeled]
y_train = y

In [15]:
# further split train into train/validation
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                                    test_size=0.2, 
                                                    shuffle=True)

In [16]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((20557, 50), (5140, 50), (20557,), (5140,))
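
A shuffled split mixes later shots into the training set, which sits uneasily with the leakage rule quoted at the top of this notebook. One possible alternative, sketched here on stand-in arrays, is a chronological split: because the rows were sorted by game_date in cell 2, passing `shuffle=False` keeps the earliest 80% for training and the latest 20% for validation. The `_demo` names are hypothetical, and this only approximates the per-shot rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; rows are assumed to already be in game_date order.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# shuffle=False keeps row order: first 80% -> train, last 20% -> validation.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          shuffle=False)

print(X_tr.shape, X_va.shape)
```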

## Model Comparison

In [30]:
# accuracy if predicting every shot as a miss
from sklearn.metrics import accuracy_score
accuracy_score(y_val, np.zeros(y_val.shape))

0.5577821011673152
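
The same kind of baseline can be expressed with scikit-learn's `DummyClassifier`, which learns the majority class from the training labels rather than hard-coding "miss". A small sketch on made-up labels (the arrays below are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; features are ignored by this baseline.
y_tr = np.array([0, 0, 0, 1, 1])
y_va = np.array([0, 1, 0, 0])
X_tr = np.zeros((len(y_tr), 1))
X_va = np.zeros((len(y_va), 1))

# Always predicts the most frequent class seen during fit (0 here).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
acc = accuracy_score(y_va, baseline.predict(X_va))
print(acc)
```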

In [31]:
# random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
accuracy_score(y_val, forest.predict(X_val))

0.5585603112840467

In [32]:
# svm
from sklearn.svm import SVC
svm = SVC(kernel='rbf', gamma='auto')
svm.fit(X_train, y_train)
accuracy_score(y_val, svm.predict(X_val))

0.6118677042801557

In [None]:
# neural net
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(hidden_layer_sizes=(500, 250, 125, 50, 25, 10, 5), max_iter=1000)
nn.fit(X_train, y_train)
accuracy_score(y_val, nn.predict(X_val))
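
The plan calls for comparing models with cross validation, whereas the cells above score each model on a single validation split. A hedged sketch of how that comparison might look with `cross_val_score`, run on synthetic stand-in data (the `_demo` arrays and the `models` dict are illustrative, not part of the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: the label depends mostly on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

models = {
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'svm': SVC(kernel='rbf', gamma='auto'),
}

# Mean 5-fold accuracy per model.
scores = {name: cross_val_score(model, X_demo, y_demo,
                                cv=5, scoring='accuracy').mean()
          for name, model in models.items()}

best = max(scores, key=scores.get)
print(scores, best)
```

Plain k-fold ignores time order; `TimeSeriesSplit` from `sklearn.model_selection` could be passed as `cv` if the folds should respect game_date.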