# Kobe Bryant Shot Probability
## Kevin Shain

In [4]:
# Imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import scipy
%matplotlib inline

from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

import sys
sys.path.append('/Users/kshain/Documents/Git')
from progressbar import ProgressBar
import time
import warnings
warnings.filterwarnings('ignore')

# Table of contents
* [Import data](#Import-data)
* [Feature engineering](#Feature-engineering)
    * [Location to polar coordinates](#Location-to-polar-coordinates)
    * [Remaining time](#Remaining-time)
    * [Season](#Season)
    * [Home or away](#Home-or-away)
    * [Dropping uninformative data](#Dropping-uninformative-data)
    * [Date to day](#Date-to-day)
    * [Categorical data to indicators](#Categorical-data-to-indicators)
* [Separating the submission data](#Separating-the-submission-data)
* [Log loss function](#Log-loss-function)
* [Modeling](#Modeling)
    * [Random forest classifier](#Random-forest-classifier)
        * [Visualizing the random forest](#Visualizing-the-random-forest)
    * [Extra trees classifier](#Extra-trees-classifier)
    * [Gradient boosting](#Gradient-boosting)
    * [Final model](#Final-model)
* [Submission](#Submission)

# Import data

In [2]:
filename= "data.csv"
rawdata = pd.read_csv(filename)

# Feature engineering

In [3]:
data = rawdata.copy()  # copy so the raw data stays in its original form

## Location to polar coordinates

Polar coordinates seem to make more sense than Cartesian ones, since the shooting area is roughly circularly symmetric around the basket. Computing the angle as an arctan of `loc_y`/`loc_x` would require treating `loc_x`=0 separately, since that means taking the arctan of infinity. This arctan is well defined to be $\frac{\pi}{2}$, but it's better to handle the degenerate case explicitly than to leave it to Python. The cell below therefore uses the arccos of `loc_x`/`dist`, which only needs special handling when `dist`=0.

In [5]:
data['dist'] = np.sqrt(data['loc_x']**2 + data['loc_y']**2)

loc_dist_zero = data['dist'] == 0
# default value covers dist == 0; the angle won't matter there anyway
data['angle'] = -np.pi / 2
# use .loc instead of chained indexing so the assignment reliably writes into `data`
data.loc[~loc_dist_zero, 'angle'] = np.arccos(data.loc[~loc_dist_zero, 'loc_x'] / data.loc[~loc_dist_zero, 'dist'])
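
As a hedged aside, `np.arctan2` sidesteps the special case entirely, because it takes the two coordinates separately and never divides by `loc_x`. The sketch below is an alternative to the arccos approach, not what this notebook uses, and the coordinates in the toy frame are made up for illustration. Unlike arccos, it also keeps the sign of `loc_y`.

```python
import numpy as np
import pandas as pd

# Toy coordinates (hypothetical values, not taken from data.csv)
toy = pd.DataFrame({'loc_x': [0, 0, 10, -10, 0],
                    'loc_y': [0, 50, 0, 0, -5]})

toy['dist'] = np.hypot(toy['loc_x'], toy['loc_y'])
# arctan2(y, x) handles x == 0 without dividing; arctan2(0, 0) returns 0
toy['angle'] = np.arctan2(toy['loc_y'], toy['loc_x'])

print(toy)
```

With this convention the angle runs from $-\pi$ to $\pi$, so a shot from straight in front of the basket (`loc_x`=0, `loc_y`>0) gets $\frac{\pi}{2}$ with no special handling.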

## Remaining time

We have two columns that together represent the same thing, the time remaining in the period. We just need to combine them into a single unit (seconds).

In [6]:
data['remaining_time'] = data['minutes_remaining'] * 60 + data['seconds_remaining']

## Season


The season is based on the calendar years in which the shots were taken. We want to keep the relative spacing of the years, but the absolute values don't matter. I'll therefore make a `year` column relative to 1996, Kobe's start in the league.

In [7]:
data['year'] = data['season'].apply(lambda x: int(x.split('-')[0])-1996)
data = data.drop('season', axis=1)
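
For illustration, the same conversion can be written with pandas string methods. This is a small sketch on made-up season labels, assuming the `season` column uses the 'YYYY-YY' format that the `split('-')` call above implies.

```python
import pandas as pd

# Hypothetical season labels in the 'YYYY-YY' format
seasons = pd.Series(['1996-97', '2000-01', '2015-16'])

# Take the starting calendar year and shift so that 1996 maps to 0
year = seasons.str.split('-').str[0].astype(int) - 1996

print(year.tolist())  # [0, 4, 19]
```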

## Home or away

This information is potentially very important and is encoded in the `matchup` data as the '@' for away games or 'vs.' for home games.

In [1]:
data['ishome'] = data['matchup'].str.contains('vs.', regex=False)  # True for home games

## Dropping uninformative data

Some columns now have uninformative data and will not be needed in the model. Some of these are uninformative since they always take on the same value, like `team_name`. Others have information that is entirely encoded in other variables, like `game_id`. We must be careful not to eliminate data too soon, but one can proceed without dropping these columns and experiment to see that they do not have predictive value.

In [None]:
# columns that are constant, redundant with the engineered features, or pure identifiers
drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
         'matchup', 'seconds_remaining', 'minutes_remaining',
         'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id']
data = data.drop(drops, axis=1)

## Date to day

It is hard to use the date as an input to a classifier when it is composed of a year, month, and day. Therefore, I convert the date to a day in the season, counted from day 300 of the calendar year (late October) and wrapped around the new year. Where one month ends and another begins is somewhat arbitrary, but there is likely to be value in knowing when during the season a game was played.

In [None]:
data['day_In_Season'] = data.game_date.apply(lambda x: time.strptime(x, "%Y-%m-%d").tm_yday-300)
data['day_In_Season'] = data['day_In_Season'].apply(lambda x: x if x >= 0 else x+365)
data = data.drop('game_date', axis=1)
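
To make the wrap-around concrete, here is a small sketch of the same arithmetic as a standalone function. The name `day_in_season` and the example dates are hypothetical and chosen only for illustration.

```python
import time

def day_in_season(date_string, season_start_yday=300):
    """Days since day 300 of the calendar year, wrapping around the new year."""
    yday = time.strptime(date_string, "%Y-%m-%d").tm_yday - season_start_yday
    return yday if yday >= 0 else yday + 365

print(day_in_season("2001-10-27"))  # 0: day 300 of a non-leap year
print(day_in_season("2002-01-01"))  # 66: 1 - 300 + 365
```

Because the offset is fixed at 365, dates in a leap year can be off by one day, which should be harmless for a feature this coarse.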

## Categorical data to indicators

First, I want to keep a non-dummy dataframe, since that is easier to work with outside of random forests.

In [None]:
data_preDummies = data.copy()  # independent copy taken before the indicator columns are added

Now, for each categorical variable in the data, I use the `pandas.get_dummies` function to make an indicator variable for each category, and afterward I drop the categorical column.

In [None]:
categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period']
for var in categorical_vars:
    data = pd.concat([data, pd.get_dummies(data[var], prefix=var)], axis=1)
    data = data.drop(var, axis=1)
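
As a quick illustration of what this loop produces, the sketch below runs the same steps on a three-row toy frame. The values are made up for the example; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values)
toy = pd.DataFrame({'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal'],
                    'period': [1, 4, 4]})

for var in ['shot_type', 'period']:
    # one indicator column per category, named '<var>_<category>'
    toy = pd.concat([toy, pd.get_dummies(toy[var], prefix=var)], axis=1)
    toy = toy.drop(var, axis=1)

print(toy.columns.tolist())
```

Each row ends up with exactly one active indicator per original categorical column, so a numeric column such as `period` is treated as a set of unordered categories rather than as a quantity.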

# Separating the submission data

I need to make a submission dataframe that contains only the shots that I am tasked with predicting. These are marked `null` in the dataset so I just find those.

In [None]:
submission = data[pd.isnull(data['shot_made_flag'])]
submission = submission.drop('shot_made_flag', axis=1)

I also need to separate the rest of the data, used for training, into features and targets, also known as x and y.

In [None]:
train_all = data[pd.notnull(data['shot_made_flag'])]
train_x = train_all.drop('shot_made_flag', axis=1)
train_y = train_all.shot_made_flag

# Log loss function

The log loss function is used by Kaggle to grade your predictions. It conveniently quantifies how confidently right or wrong the prediction was, since the goal is to predict a shot percentage, but each shot is either a make (1) or a miss (0).

In [None]:
def logloss(actual, pred):
    epsilon = 1e-15
    # clip predictions away from 0 and 1 so that the log loss isn't infinite
    pred = np.clip(pred, epsilon, 1 - epsilon)
    actual = np.asarray(actual)
    logl = np.sum(actual * np.log(pred) + (1 - actual) * np.log(1 - pred))
    return -logl / len(actual)
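
For $N$ shots with outcomes $y_i$ and predicted probabilities $p_i$, the quantity being computed is $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$. The sketch below restates it in a standalone helper (the name `logloss_np` is hypothetical) and compares three made-up sets of predictions, to show how heavily confident mistakes are penalized.

```python
import numpy as np

def logloss_np(actual, pred, epsilon=1e-15):
    # NumPy-only restatement of the log loss, for illustration
    pred = np.clip(pred, epsilon, 1 - epsilon)
    actual = np.asarray(actual, dtype=float)
    return -np.mean(actual * np.log(pred) + (1 - actual) * np.log(1 - pred))

actual = np.array([1, 0, 1, 0])

hedged = logloss_np(actual, np.full(4, 0.5))                            # always predict 50%
confident_right = logloss_np(actual, np.array([0.9, 0.1, 0.9, 0.1]))
confident_wrong = logloss_np(actual, np.array([0.1, 0.9, 0.1, 0.9]))

print(hedged, confident_right, confident_wrong)
```

Always predicting 50% scores $\log 2 \approx 0.693$, which is a useful reference point: a model is only adding information if its log loss is below that.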

# Modeling

## Random forest classifier

The random forest classifier is a great out-of-the-box classifier for this type of data. A random forest is built from decision trees in which each node splits on a single feature, so feature normalization is not needed as it is for an SVM. The real trick to using random forests is choosing the right features to include and, to a lesser extent, selecting the number of trees and the maximum depth. The following loops try a range of tree counts and depths and use k-fold cross-validation to find the best parameters. It turns out that these parameter values are not nearly as important as the features.

In [None]:
print('Finding best n_estimators for RandomForestClassifier...')
min_score = 100000
best_n = 0
scores_n = []
range_n = np.linspace(100,800,num=8).astype(int)
for n in range_n:
    print("the number of trees : {0}".format(n))
    t1 = time.time()
    
    rfc_score = 0.
    rfc = RandomForestClassifier(n_estimators=n)
    for train_k, test_k in KFold(len(train_x), n_folds=10, shuffle=True):
        rfc.fit(train_x.iloc[train_k], train_y.iloc[train_k])
        #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
        pred = rfc.predict_proba(train_x.iloc[test_k])
        rfc_score += logloss(train_y.iloc[test_k], pred[:,1]) / 10
    scores_n.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_n = n
        
    t2 = time.time()
    print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2-t1))
print(best_n, min_score)


# find best max_depth for RandomForestClassifier
print('Finding best max_depth for RandomForestClassifier...')
min_score = 100000
best_m = 0
scores_m = []
range_m = np.linspace(5,40,num=8).astype(int)
for m in range_m:
    print("the max depth : {0}".format(m))
    t1 = time.time()
    
    rfc_score = 0.
    rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
    for train_k, test_k in KFold(len(train_x), n_folds=10, shuffle=True):
        rfc.fit(train_x.iloc[train_k], train_y.iloc[train_k])
        #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
        pred = rfc.predict_proba(train_x.iloc[test_k])
        rfc_score += logloss(train_y.iloc[test_k], pred[:,1]) / 10
    scores_m.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_m = m
    
    t2 = time.time()
    print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2-t1))
print(best_m, min_score)
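
To make the cross-validation step concrete, here is a minimal NumPy-only sketch of 10-fold cross-validation scored with log loss. It is not the scikit-learn implementation used above: the helper names (`kfold_indices`, `logloss_np`), the synthetic labels, and the "model" (always predict the make rate of the training folds) are all assumptions made for illustration. The resulting score is a baseline that any useful forest should beat.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=0):
    # Shuffle the row indices once, then cut them into n_folds nearly equal parts
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

def logloss_np(actual, pred, epsilon=1e-15):
    pred = np.clip(pred, epsilon, 1 - epsilon)
    return -np.mean(actual * np.log(pred) + (1 - actual) * np.log(1 - pred))

# Synthetic make/miss labels with a make rate of about 45% (made up for illustration)
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.45).astype(float)

cv_score = 0.0
for train_idx, test_idx in kfold_indices(len(y), n_folds=10):
    base_rate = y[train_idx].mean()            # "model": the make rate of the training folds
    pred = np.full(len(test_idx), base_rate)   # same probability for every held-out shot
    cv_score += logloss_np(y[test_idx], pred) / 10

print(cv_score)
```

Every row is held out exactly once, so the averaged score estimates how the model would do on shots it has never seen, which is what the leaderboard measures.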

### Visualizing the random forest

A few simple plots show how the log loss score is affected by the parameter selection.

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.plot(range_n, scores_n)
plt.ylabel('score')
plt.xlabel('number of trees')

plt.subplot(122)
plt.plot(range_m, scores_m)
plt.ylabel('score')
plt.xlabel('max depth')

## Final model

The random forest classifier is remarkably easy to implement, yet very powerful. Once the best features and parameters are chosen, the classifier can be fit on the entire training set. Then the submission set is fed through the trees and the random forest classifier makes a probabilistic prediction.

In [None]:
rfc = RandomForestClassifier(n_estimators=500, max_depth=20)
rfc.fit(train_x, train_y)
pred = rfc.predict_proba(submission)

# Submission


Kaggle gives a sample submission file, so it is easiest to just read that into a DataFrame and modify the `shot_made_flag` with predictions from the random forest classifier.

In [None]:
sub = pd.read_csv("sample_submission.csv")
sub['shot_made_flag'] = pred[:,1]
sub.to_csv("real_submission.csv", index=False)
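
Before uploading, a few sanity checks on the submission frame can catch silent mistakes. The sketch below runs them on a small stand-in frame. The IDs and probabilities are made up, and the `shot_id` column is an assumption about the layout of the sample submission file.

```python
import pandas as pd

# Stand-in for the real submission frame (IDs and probabilities are made up)
sub_check = pd.DataFrame({'shot_id': [1, 8, 17],
                          'shot_made_flag': [0.31, 0.52, 0.44]})

probs = sub_check['shot_made_flag']
assert probs.notnull().all(), "every shot needs a prediction"
assert probs.between(0, 1).all(), "predictions must be probabilities"
assert sub_check['shot_id'].is_unique, "each shot should appear once"

print(probs.describe())
```

Note that `shot_id` was dropped from the features earlier, so the predictions are matched to the sample file purely by row order; it is worth confirming that the two files list the shots in the same order.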