### Data Analysis - Basic Predictions + New Data Prep
* Given past stats, project positional contribution ('eff_ratio_pred') and pace ('pace_pred')
* assume games as the average over the last 3 years
* LEBRON_data.csv => LEBRON_data_feng.csv

### what we actually want to do to test our model
* re-prep EVERY year, get _pred by predicting on the 'B' mask
* re-prep when calculating team positional contribution, use predicted values on the 'B' mask, and failing that, the most recent value (or failing that, zero (rookies in 2018).
* save results (replace the _0_yr_ago with _pred)
* LEBRON_data.csv => LEBRON_data_feng.csv
* LEBRON_target.csv => LEBRON_target_feng.csv
* re-train the re-prep set with the 'D' mask
* compare to holdout set to see how it worked out
* use the whole trained 'D' set to predict 2018

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
# To plot matplotlib figures inline on the notebook
%matplotlib inline

from sklearn.model_selection import train_test_split
#from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.ensemble import RandomForestRegressor

from sklearn.cross_validation import cross_val_score, train_test_split, KFold
from sklearn.grid_search import GridSearchCV



In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [4]:
from luther_common import *

In [5]:
# categories to predict
pred_categories = ['pts_per_g',
         'fg_per_g','fga_per_g',
         'fg3_per_g','fg3a_per_g',
         'ft_per_g','fta_per_g',
         'trb_per_g','blk_per_g',
         'stl_per_g','ast_per_g',
         'tov_per_g'
        ]

In [6]:
# load our predictive and standardization models
from sklearn.externals import joblib
estimators = dict()
standardizers = dict()

for category in pred_categories:
    estimators[category]=joblib.load('naive_linreg_predictor_'+category+'.pkl')
    standardizers[category]=joblib.load('naive_linreg_standardizer_'+category+'.pkl')

In [7]:
# load our data:
X_df = pd.read_csv('LEBRON_data.csv', index_col=0)
y_df = pd.read_csv('LEBRON_target.csv', index_col=0)
pred_df = X_df.copy()

In [8]:
# needs to be mask by 'B'
# need to add predictions to X 

In [9]:
# predict another set of stats for all players: _pred
# assume 82 games
# 'pts_per_g','fg_per_g','fga_per_g','fg3_per_g','fg3a_per_g','ft_per_g','fta_per_g','trb_per_g','blk_per_g','stl_per_g','ast_per_g','tov_per_g'
# 'pace', 'eff_raw_yr', 'eff_ratio_yr'

In [10]:
# level = 'B'
# category = 'ft_per_g'
# #piggyback off the existing mask function to mask the X and y
# ready_X, ready_y = mask_data(category, level, X_df, y_df)
# std_ready_X = standardizers[category].transform(ready_X)
# y_predict = estimators[category].predict(std_ready_X)
# X_df[category+'_pred'] = y_predict

In [11]:
# basic prediction using 'B' mask:
level = 'B'
for category in pred_categories:
    #piggyback off the existing mask function to mask the X and y
    ready_X, ready_y = mask_data(category, level, X_df, y_df)
    std_ready_X = standardizers[category].transform(ready_X)
    y_predict = estimators[category].predict(std_ready_X)
    X_df[category+'_pred'] = y_predict

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  ready_X.drop(excluded_columns, axis=1, inplace=True)


In [12]:
# get a raw efficiency score for the X _pred's and for player_seasons_df:
#    (PTS + REB + AST + STL + BLK − ((FGA − FGM) + (FTA − FTM) + TO)) multiply by g to weight it
X_df['eff_raw_pred'] = (X_df['pts_per_g_pred'] +\
                    X_df['trb_per_g_pred'] +\
                    X_df['ast_per_g_pred'] +\
                    X_df['stl_per_g_pred'] +\
                    X_df['blk_per_g_pred'] -\
                   ((X_df['fga_per_g_pred'] - X_df['fg_per_g_pred']) +\
                    (X_df['fta_per_g_pred'] - X_df['ft_per_g_pred']) +\
                     X_df['tov_per_g_pred'])) * 72#X_df['g_pred'] (assume playing 72 games...)

In [13]:
#Note: we lost a bit of nuance in the data due to row combination, ignore it for the time being
# read in the performance of the player for each season (no predictions, broken down by season)
player_seasons_df = pd.read_csv('player_seasons_list_processed.csv', index_col=0)

In [14]:
%%time

#pre-process; re-label a couple of columns to use
player_seasons_df['season_year_prev'] = player_seasons_df['season_year'].apply(lambda x: x-1)

# now we 1) try to predict a NEW eff_ratio and 2) copy last year's pace over as a prediction of pace (at the end)
for index, player_season in X_df.iterrows():

    #get a df of all the people who played in the position on the same team that year and sum their contribution scores
    teammates_df = player_seasons_df[(player_seasons_df['season_year'] == player_season['season_year_0_ya']) &
                                   (player_seasons_df['team_id'] == player_season['team_id_0_ya']) &
                                   (player_seasons_df['poscat'] == player_season['poscat_0_ya'])
                                  ]

    # merge in raw predictions from X_df
    pred_teammates_df = pd.merge(teammates_df, 
                                 X_df.loc[:,['canonical','season_year_0_ya','eff_raw_pred']], 
                                 how='left', 
                                 left_on=['canonical','season_year'], 
                                 right_on=['canonical','season_year_0_ya'])
    # merge in numbers from last year
    pred_teammates_df = pd.merge(pred_teammates_df,
                                 player_seasons_df.loc[:,['canonical','season_year_prev','eff_raw']],
                                 how='left', 
                                 left_on=['canonical','season_year'], 
                                 right_on=['canonical','season_year_prev'])
    
    #eff_raw_pred (predicted), #eff_raw_x (this year -- should actually not be used), #eff_raw_y (last year)
    # if we didn't have the eff_raw_pred, replace it
    for teammate_index, teammate_season in pred_teammates_df.iterrows():
        if pd.isnull(teammate_season['eff_raw_pred']):
            if not pd.isnull(teammate_season['eff_raw_y']):
                pred_teammates_df.loc[teammate_index, 'eff_raw_pred'] = teammate_season['eff_raw_y']
            else:
                #this player didn't HAVE a previous season..
                pred_teammates_df.loc[teammate_index, 'eff_raw_pred'] = 0
    
    #finally, take contribution score and divide contribution score of position of the team
    num = player_season['eff_raw_pred']
    denom = sum(pred_teammates_df['eff_raw_pred'])
    #careful of division by zero or zero divided by zero...
    
    if denom == 0:
        X_df.loc[index, 'eff_ratio_pred'] = 1
    else:
        X_df.loc[index, 'eff_ratio_pred'] = num/denom
    
#     #some debug code
#     if X_df.loc[index, :].isnull().any():
#         print("NULL:")
#         print(index)
#         print("SEASON:")
#         print(player_season)
#         print("TEAMMATES:")
#         print(pred_teammates_df)
#         print("NUM:")
#         print(num)
#         print("DENOM:")
#         print(denom)



CPU times: user 2min 24s, sys: 3.37 s, total: 2min 27s
Wall time: 2min 29s


In [15]:
#copy over pace from 1 year ago
X_df['team_pace_pred'] = X_df['team_pace_1_ya']

In [18]:
#check for nulls
#X_df[X_df.isnull().any(axis=1)]
X_df.describe()

Unnamed: 0,age_0_ya,age_1_ya,age_2_ya,age_3_ya,ast_per_g_0_ya,ast_per_g_1_ya,ast_per_g_2_ya,ast_per_g_3_ya,blk_per_g_0_ya,blk_per_g_1_ya,...,ft_per_g_pred,fta_per_g_pred,trb_per_g_pred,blk_per_g_pred,stl_per_g_pred,ast_per_g_pred,tov_per_g_pred,eff_raw_pred,eff_ratio_pred,team_pace_pred
count,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,...,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0,7220.0
mean,29.122992,28.122992,27.122992,26.122992,2.244604,2.367124,2.421219,2.38746,0.481451,0.520964,...,1.879827,2.478299,4.170608,0.481451,0.769106,2.244604,1.407738,796.770822,0.384093,93.120701
std,3.588139,3.588139,3.588139,3.588139,2.0237,2.055423,2.083724,2.096636,0.547517,0.573324,...,1.384567,1.752537,2.373635,0.49505,0.415316,1.853734,0.707324,405.938848,0.233422,3.873774
min,21.0,20.0,19.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.481811,-0.528011,-0.048452,-0.06793,-0.019313,-0.419267,-0.164691,-52.843987,-0.053491,82.3
25%,26.0,25.0,24.0,23.0,0.8,0.9,0.9,0.9,0.1,0.177564,...,0.88338,1.213867,2.406289,0.157918,0.471895,0.906169,0.871375,495.451111,0.209581,90.4
50%,29.0,28.0,27.0,26.0,1.6,1.711089,1.8,1.8,0.3,0.3,...,1.530322,2.065164,3.565658,0.318393,0.697584,1.718396,1.277637,722.407998,0.32697,92.5
75%,32.0,31.0,30.0,29.0,3.0,3.2,3.2,3.2,0.6,0.7,...,2.554344,3.346214,5.434084,0.622641,0.98134,3.044303,1.82841,1041.794081,0.506795,95.4
max,43.0,42.0,41.0,40.0,14.5,14.5,14.5,14.5,4.6,4.6,...,8.820232,10.713793,15.351796,3.813393,2.868911,13.103977,4.115314,2599.271049,1.003366,113.7


In [17]:
X_df.to_csv('LEBRON_data_feng.csv')