# AdaBoost + Logistic Regression

In this notebook, I'll attempt to predict shots using the AdaBoost classifier with a logistic regression model as the base estimator.  

I'll proceed as follows:  
1. import the data processed in features_engineering.ipynb  
2. get a baseline performance by running AdaBoost with default settings  
3. select the best features using Recursive Feature Elimination (RFE) with a logistic regression model, testing the model performance for every possible number of features  
4. optimize the parameters of the base estimator using the selected features
5. optimize the parameters of the AdaBoost classifier using the optimized base estimator and the selected features  
6. predict the missing shots using the optimized AdaBoost classifier
  

In [2]:
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score

In [2]:
df_train = pd.read_pickle('../input/processed_train_data.pickle')

## Get baseline performance

In [3]:
df_train.dropna(subset=['MA200'], inplace=True)

X = np.array(df_train.drop(['game_date','shot_id','shot_made_flag'], axis=1))
y = np.array(df_train['shot_made_flag'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [4]:
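num_fold = 10
scoring = 'neg_log_loss'

# shuffle defaults to False, so the folds are deterministic and no
# random_state is needed (recent scikit-learn versions raise an error
# if random_state is set without shuffle=True)
kfold = KFold(n_splits=num_fold)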
cv_results = cross_val_score(ada, X, y, 
                             cv=kfold,
                             scoring=scoring)
score = cv_results.mean()
print(score)

-0.684209357786
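
For reference, `neg_log_loss` is the negated mean log loss, so values closer to zero are better. A handy yardstick, sketched below with NumPy only, is the loss of a model that always predicts a probability of 0.5: it equals ln 2 ≈ 0.693 whatever the labels are. The labels and the helper name `binary_log_loss` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p_pred)
                          + (1 - y_true) * np.log(1 - p_pred)))

# Illustrative labels (not the shot data)
y_demo = np.array([0, 1, 1, 0, 1])

# A constant 0.5 prediction always costs ln(2), independent of the labels
coin_flip_loss = binary_log_loss(y_demo, np.full(len(y_demo), 0.5))
print(coin_flip_loss, np.log(2))
```

Against that yardstick, the baseline score of -0.684 is only marginally better than always predicting 0.5.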


## Features Selection
  
I'll use a logistic regression model as the base estimator, so I also use it to select the features.
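
As a minimal illustration of what RFE returns, here is a sketch on synthetic data (the data and the `_demo` names are assumptions, not the shot dataset). RFE repeatedly fits the model and drops the feature with the weakest coefficient until the requested number of features remains.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the shot features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Ask RFE to keep 3 of the 8 features
rfe_demo = RFE(LogisticRegression(solver='liblinear'), n_features_to_select=3)
X_reduced = rfe_demo.fit_transform(X_demo, y_demo)

mask_demo = rfe_demo.get_support()   # boolean array, True for kept features
print(mask_demo)
print(X_reduced.shape)               # (200, 3)
```

The boolean mask can be reused to select the same columns from any other array with the same feature layout, for example `X_demo[:, mask_demo]`.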

In [5]:
from sklearn.feature_selection import RFE

def get_best_features(model, X, y):
    ''' find the n best features to train the model
    parameters
    ----------
        model    object
            a supervised learning estimator
        X    array
            training set
        y    array
            target values
    output
    ------
        X_best    array
            the array of best features
        mask    array
            an array of boolean values that can be used
            to transform other datasets
            
    todo
    ------
        - add an early stop parameter
    '''
    
    num_fold = 10
    scoring = 'neg_log_loss'
    best_cv = -np.inf    # log loss is unbounded, so start below any possible score
    best_mask = []
    
    # n is the number of features to select
    # we're testing the features selection script
    # for all possible numbers of features
    # the minimum is 1 
    # the maximum is the total number of features (X.shape[1])
    for n in range(1, X.shape[1] + 1):
        best_flag = ""
        
        # get the n best features
        rfe = RFE(model, n_features_to_select=n)
        X_new = rfe.fit_transform(X, y)

        # test the model with the n best features
        # (shuffle defaults to False, so the folds are deterministic
        # and no random_state is needed)
        kfold = KFold(n_splits=num_fold)

        cv_results = cross_val_score(model, X_new, y,
                                     cv = kfold,
                                     scoring = scoring)
        score = cv_results.mean()

        # If the test returns a best score (i.e. any score higher 
        # than the previous best score), the test score is set
        # as the new best score and the support is saved as the 
        # best mask
        if score > best_cv:
            best_flag = " => This is a best score"
            X_best = X_new
            best_mask = rfe.get_support(False)
            best_cv = score
        
        print('n=' + str(n) + ' : ' + str(score) + best_flag)
            
    print('=============================')
    print('best score = %f' % best_cv)
    print('best n = %d' % X_best.shape[1])  
    
    return X_best, best_mask
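
The early stop mentioned in the todo could take the form of a patience rule: stop once the score has failed to improve for a given number of consecutive values of n. The helper below is a hypothetical sketch, not part of the notebook; the name `stopping_index` and the `patience` and `min_delta` parameters are assumptions, and the scores are made up.

```python
def stopping_index(scores, patience=3, min_delta=0.0):
    """Return the index of the last score evaluated under a patience rule.

    scores: iterable of scores where higher is better (e.g. neg_log_loss).
    The loop stops after `patience` consecutive scores that fail to beat
    the best score so far by more than `min_delta`.
    """
    best = float('-inf')
    stale = 0
    last = -1
    for i, score in enumerate(scores):
        last = i
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return last

# Made-up scores: three stale values in a row, then a late improvement
demo_scores = [-0.62, -0.60, -0.57, -0.58, -0.575, -0.59, -0.50]
print(stopping_index(demo_scores, patience=3))
```

The trade-off is that a short patience can miss a late improvement. In the run recorded for the next cell, n=19 and n=20 do not improve on n=18 but n=21 does, so a patience of 2 would have stopped too early.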
    

In [6]:
lr = LogisticRegression()
X_best, mask = get_best_features(lr, X, y)

n=1 : -0.625256179649 => This is a best score
n=2 : -0.604807003924 => This is a best score
n=3 : -0.603600868771 => This is a best score
n=4 : -0.602390234368 => This is a best score
n=5 : -0.602360277344 => This is a best score
n=6 : -0.570243083067 => This is a best score
n=7 : -0.5694956604 => This is a best score
n=8 : -0.569167985706 => This is a best score
n=9 : -0.559615891417 => This is a best score
n=10 : -0.559487102743 => This is a best score
n=11 : -0.559406321205 => This is a best score
n=12 : -0.559295177838 => This is a best score
n=13 : -0.559244614224 => This is a best score
n=14 : -0.559204352139 => This is a best score
n=15 : -0.559092734409 => This is a best score
n=16 : -0.558937681536 => This is a best score
n=17 : -0.558916473584 => This is a best score
n=18 : -0.558896253575 => This is a best score
n=19 : -0.558900295397
n=20 : -0.558908066873
n=21 : -0.558887882338 => This is a best score
n=22 : -0.558687506113 => This is a best score
n=23 : -0.558249620388 =>

In [7]:
mask

array([False, False, False, False, False,  True,  True,  True,  True,
       False,  True, False, False,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True, False,  True,  True,  True,  True, False, False,
        True, False,  True,  True,  True, False,  True, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True,  True,  True,  True, False,  True, False,
        True,  True, False,  True, False,  True, False, False, False,
       False,  True,  True, False, False, False, False, False, False,
       False, False,

In [8]:
# for each feature, I get a True/False value. True means best feature
features_masks = pd.DataFrame({
        'feature': df_train.drop(['game_date','shot_id','shot_made_flag'], axis=1).columns,
        'best': mask
    })
features_masks.head()

    best         feature
0  False           loc_x
1  False           loc_y
2  False        playoffs
3  False   shot_distance
4  False  seconds_to_end


In [9]:
# Get the list of best features names
# We can use this list later to select those features
# from a dataframe
best_features = features_masks.loc[features_masks['best'], 'feature'].values
best_features

array(['MA10', 'MA20', 'MA50', 'MA100', 'action_type_Alley Oop Dunk Shot',
       'action_type_Cutting Layup Shot', 'action_type_Driving Bank shot',
       'action_type_Driving Dunk Shot',
       'action_type_Driving Finger Roll Layup Shot',
       'action_type_Driving Finger Roll Shot',
       'action_type_Driving Floating Bank Jump Shot',
       'action_type_Driving Floating Jump Shot',
       'action_type_Driving Jump shot', 'action_type_Driving Layup Shot',
       'action_type_Driving Reverse Layup Shot',
       'action_type_Driving Slam Dunk Shot', 'action_type_Dunk Shot',
       'action_type_Fadeaway Bank shot', 'action_type_Fadeaway Jump Shot',
       'action_type_Finger Roll Layup Shot',
       'action_type_Finger Roll Shot', 'action_type_Floating Jump shot',
       'action_type_Hook Bank Shot', 'action_type_Hook Shot',
       'action_type_Jump Bank Shot', 'action_type_Jump Hook Shot',
       'action_type_Jump Shot', 'action_type_Layup Shot',
       'action_type_Pullup Bank sho

In [3]:
# skip the features selection which takes way too long
best_features = ['MA10', 'MA20', 'MA50', 'MA100', 'action_type_Alley Oop Dunk Shot',
       'action_type_Cutting Layup Shot', 'action_type_Driving Bank shot',
       'action_type_Driving Dunk Shot',
       'action_type_Driving Finger Roll Layup Shot',
       'action_type_Driving Finger Roll Shot',
       'action_type_Driving Floating Bank Jump Shot',
       'action_type_Driving Floating Jump Shot',
       'action_type_Driving Jump shot', 'action_type_Driving Layup Shot',
       'action_type_Driving Reverse Layup Shot',
       'action_type_Driving Slam Dunk Shot', 'action_type_Dunk Shot',
       'action_type_Fadeaway Bank shot', 'action_type_Fadeaway Jump Shot',
       'action_type_Finger Roll Layup Shot',
       'action_type_Finger Roll Shot', 'action_type_Floating Jump shot',
       'action_type_Hook Bank Shot', 'action_type_Hook Shot',
       'action_type_Jump Bank Shot', 'action_type_Jump Hook Shot',
       'action_type_Jump Shot', 'action_type_Layup Shot',
       'action_type_Pullup Bank shot', 'action_type_Pullup Jump shot',
       'action_type_Putback Dunk Shot',
       'action_type_Putback Slam Dunk Shot',
       'action_type_Reverse Layup Shot',
       'action_type_Reverse Slam Dunk Shot',
       'action_type_Running Bank shot',
       'action_type_Running Finger Roll Layup Shot',
       'action_type_Running Finger Roll Shot',
       'action_type_Running Hook Shot', 'action_type_Running Jump Shot',
       'action_type_Running Reverse Layup Shot',
       'action_type_Running Tip Shot', 'action_type_Slam Dunk Shot',
       'action_type_Step Back Jump shot', 'action_type_Tip Shot',
       'action_type_Turnaround Fadeaway shot',
       'action_type_Turnaround Finger Roll Shot',
       'action_type_Turnaround Hook Shot',
       'action_type_Turnaround Jump Shot', 'combined_shot_type_Bank Shot',
       'combined_shot_type_Dunk', 'combined_shot_type_Hook Shot',
       'combined_shot_type_Jump Shot', 'combined_shot_type_Layup',
       'combined_shot_type_Tip Shot', 'period_2', 'period_3', 'period_4',
       'period_7', 'season_2015-16', 'shot_type_2PT Field Goal',
       'shot_type_3PT Field Goal', 'shot_zone_area_Back Court(BC)',
       'shot_zone_area_Center(C)', 'shot_zone_area_Left Side Center(LC)',
       'shot_zone_area_Right Side Center(RC)',
       'shot_zone_basic_Above the Break 3', 'shot_zone_basic_Backcourt',
       'shot_zone_basic_Left Corner 3', 'shot_zone_basic_Restricted Area',
       'shot_zone_range_Back Court Shot',
       'shot_zone_range_Less Than 8 ft.', 'year_2013', 'year_2015',
       'opponent_BKN', 'opponent_GSW', 'opponent_MIL', 'opponent_NJN',
       'opponent_OKC', 'opponent_SEA', 'opponent_TOR', 'venue_away',
       'venue_home']

In [10]:
X_best = np.array(df_train[best_features])
X_best.shape

(25538, 82)

## Parameters Optimization
  
Next, we find the best parameters for the base estimator, then for the AdaBoost classifier.  
In both cases, we'll use GridSearchCV.

In [11]:
from sklearn.model_selection import GridSearchCV

def get_best_params(model, X, y, param_grid, verbose=0):
    ''' get the parameters that return the best score
    parameters
    ----------
        model object
            a supervised learning estimator
        X    array
            training set
        y    array
            target values
        param_grid    dictionary
            the list of parameters and their possible values
        verbose    int
            verbosity level passed to GridSearchCV
    output
    ------
        best_params    dictionary
            the best value for each parameter in initial list
    '''
    
    scoring = 'neg_log_loss'
    cv = 10
    
    grid = GridSearchCV(model, 
                        param_grid,
                        scoring = scoring,
                        n_jobs = 3,
                        cv = cv,
                        verbose = verbose)
    
    grid.fit(X,y)
    
    print(grid.best_score_)
    print(grid.best_estimator_)
    
    # return the best parameters, as promised in the docstring
    return grid.best_params_

In [12]:
# First I get the parameters I'll use with 
# the base estimator
param_grid = {'solver': ['liblinear'],
              'penalty': ['l1','l2'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'fit_intercept': [True,False],
              'intercept_scaling':[0.001, 0.01, 0.1, 1, 10, 100, 1000]
              }

get_best_params(LogisticRegression(), X_best, y, param_grid, 10)

Fitting 10 folds for each of 196 candidates, totalling 1960 fits


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    0.7s
[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    1.1s
[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:    1.6s
[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:    2.0s
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:    2.4s
[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:    3.0s
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    3.6s
[Parallel(n_jobs=3)]: Done  55 tasks      | elapsed:    4.3s
[Parallel(n_jobs=3)]: Done  66 tasks      | elapsed:    4.9s
[Parallel(n_jobs=3)]: Done  79 tasks      | elapsed:    5.8s
[Parallel(n_jobs=3)]: Done  92 tasks      | elapsed:    6.7s
[Parallel(n_jobs=3)]: Done 107 tasks      | elapsed:    7.7s
[Parallel(n_jobs=3)]: Done 122 tasks      | elapsed:    8.7s
[Parallel(n_jobs=3)]: Done 139 tasks      | elapsed:    9.9s
[Parallel(n_jobs=3)]: Done 156 tasks      | elapsed:   10.8s
[Parallel(n_jobs=3)]: Done 175 tasks      | elapsed:   11.9s
[Parallel(n_jobs=3)]: Do

-0.556973046097
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1000, max_iter=100, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
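
The candidate count reported in the log above follows directly from the grid: it is the product of the number of values listed for each parameter. As a quick check, assuming only scikit-learn's `ParameterGrid` (the `_demo` name is illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one searched for the base estimator
param_grid_demo = {'solver': ['liblinear'],
                   'penalty': ['l1', 'l2'],
                   'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                   'fit_intercept': [True, False],
                   'intercept_scaling': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

n_candidates = len(ParameterGrid(param_grid_demo))   # 1 * 2 * 7 * 2 * 7
n_fits = n_candidates * 10                           # 10 folds per candidate
print(n_candidates, n_fits)                          # 196 1960
```

Every value added to one parameter multiplies the total, so the grid grows quickly. With the default `refit=True`, GridSearchCV also fits the best candidate once more on the full data, which the log's total does not include.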


In [13]:
lr = LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1000, max_iter=100, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [14]:
# Once I have the best parameters for the base estimator,
# I can use it to find the best parameters of the AdaBoost classifier
param_grid = {
    'n_estimators' : [10,50,100],
    'learning_rate' : [0.001, 0.01, 0.1, 1]
}

starttime = time.time()

get_best_params(AdaBoostClassifier(base_estimator = lr), X_best, y, param_grid, 10)

print('Ran in ', time.time()-starttime, 'seconds.')

Fitting 10 folds for each of 12 candidates, totalling 120 fits


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    1.5s
[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    3.4s
[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:    8.0s
[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:   18.6s
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:   39.5s
[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:   48.6s
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  1.0min
[Parallel(n_jobs=3)]: Done  55 tasks      | elapsed:  1.6min
[Parallel(n_jobs=3)]: Done  66 tasks      | elapsed:  1.8min
[Parallel(n_jobs=3)]: Done  79 tasks      | elapsed:  2.0min
[Parallel(n_jobs=3)]: Done  92 tasks      | elapsed:  2.6min
[Parallel(n_jobs=3)]: Done 107 tasks      | elapsed:  2.8min
[Parallel(n_jobs=3)]: Done 120 out of 120 | elapsed:  3.3min finished


-0.68582485882
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1000, max_iter=100, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          learning_rate=0.001, n_estimators=10, random_state=None)
Ran in  200.38646125793457 seconds.


In [15]:
# I now have the best settings for my model. 
# I can instantiate the classifier
ada = AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1000, max_iter=100, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          learning_rate=0.001, n_estimators=10, random_state=None)

## Predict Shots
  
To prevent leakage, each prediction must be made using only data from before the shot.
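
One way to enforce this, sketched here on toy data, is to refit the model for every missing shot on the labelled rows that precede it. This is an assumption about how the constraint could be implemented, not the notebook's actual code: the data is synthetic, rows are assumed to be in chronological order with a default integer index, and a plain logistic regression stands in for the tuned AdaBoost classifier to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy chronological data: NaN in the target marks a shot to predict (assumption)
rng = np.random.RandomState(0)
toy = pd.DataFrame({'feature': rng.normal(size=60)})
noise = rng.normal(scale=0.5, size=60)
toy['shot_made_flag'] = (toy['feature'] + noise > 0).astype(float)
toy.loc[[30, 45, 59], 'shot_made_flag'] = np.nan

predictions = {}
for idx in toy.index[toy['shot_made_flag'].isna()]:
    # Only labelled rows that come before the shot are used for training
    past = toy.iloc[:idx].dropna(subset=['shot_made_flag'])
    model = LogisticRegression(solver='liblinear')
    model.fit(past[['feature']], past['shot_made_flag'].astype(int))
    # Probability of class 1 (shot made) for the current shot
    proba = model.predict_proba(toy.loc[[idx], ['feature']])[0, 1]
    predictions[int(idx)] = float(proba)

print(predictions)
```

Refitting once per shot is the simplest way to honor the constraint but also the most expensive; refitting once per game would be a cheaper variant, provided that no shot from the same or a later game ends up in the training rows.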

In [None]:
# import the processed data
df = pd.read_pickle('../input/processed_data.pickle')

# remove game_date
df.drop(['game_date'], axis=1, inplace=True)