## Outline 

* Setup 
* Create train, dev, & test sets
* Performance Metric: Precision
* Select a model - choose between RF, XGBoost, & LR 
* Report Generalization error
* Iterate - engineer new features, add more data. Possibly carry this out in a separate notebook

### Setup

In [15]:
import pandas as pd 
import numpy as np
import sklearn as sk

matches = pd.read_pickle('../data/interim/matches_interim.pkl')
matches['Year'] = pd.to_datetime(matches.datetime).dt.year
matches.info() 



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1317 entries, 0 to 1316
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   day            1317 non-null   int8          
 1   venue          1317 non-null   int8          
 2   opponent       1317 non-null   int8          
 3   formation      1317 non-null   int8          
 4   hour           1317 non-null   int8          
 5   datetime       1317 non-null   datetime64[ns]
 6   team           1317 non-null   object        
 7   gf_rolling     1317 non-null   float64       
 8   ga_rolling     1317 non-null   float64       
 9   sh_rolling     1317 non-null   float64       
 10  sot_rolling    1317 non-null   float64       
 11  dist_rolling   1317 non-null   float64       
 12  fk_rolling     1317 non-null   float64       
 13  pk_rolling     1317 non-null   float64       
 14  pkatt_rolling  1317 non-null   float64       
 15  target         1317 n

### Create train, dev, & test sets

In [3]:
# # Original 
# # Training set 

# matches2021 = matches.query('season == 2021')

# X_train = matches2021.drop(['target','datetime','team', 'season'], axis = 1)
# Y_train = matches2021['target']

# # Dev Set & Test set 

# matches2022 = matches.query('season == 2022')

# matches2022.drop(['datetime','team','season'], axis = 1, inplace = True)

# from sklearn.model_selection import train_test_split

# X_test, X_dev, Y_test, Y_dev =  train_test_split(matches2022.drop('target', axis = 1), matches2022['target'],
#                                                  test_size = 0.50, shuffle = True, random_state = 42)



# X_test, X_main_test, Y_test, Y_main_test =   train_test_split(matches2022.drop('target', axis = 1), matches2022['target'],
#                                                  test_size = 0.30, shuffle = True, random_state = 42)





In [16]:
# Note: this is an experiment

# Training set 

years = [ 2020, 2021]

matches2021 = matches.query('Year in @years')

X_train = matches2021.drop(['target','datetime','team', 'season','Year'], axis = 1)
Y_train = matches2021['target']

# Dev Set & Test set 

matches2022 = matches.query('Year == 2022')

# drop() without inplace returns a new frame, which avoids the SettingWithCopyWarning
# raised when dropping in place on the result of query()
matches2022 = matches2022.drop(['datetime','team','season', 'Year'], axis = 1)

from sklearn.model_selection import train_test_split

# With test_size = 0.50 the 2022 matches are halved: first half -> test, second half -> dev
X_test, X_dev, Y_test, Y_dev =  train_test_split(matches2022.drop('target', axis = 1), matches2022['target'],
                                                 test_size = 0.50, shuffle = True, random_state = 42)



# X_test, X_main_test, Y_test, Y_main_test =   train_test_split(matches2022.drop('target', axis = 1), matches2022['target'],
#                                                  test_size = 0.30, shuffle = True, random_state = 42)



In [7]:
# Storing the datasets 

import os

out_dir = '../data/interim/train_test_sets'
os.makedirs(out_dir, exist_ok=True)  # create the folder if it does not exist yet

X_train.to_pickle(f'{out_dir}/X_train.pkl')
X_dev.to_pickle(f'{out_dir}/X_dev.pkl')
X_test.to_pickle(f'{out_dir}/X_test.pkl')
# X_main_test / Y_main_test exist only when the commented-out 70/30 split has been run
X_main_test.to_pickle(f'{out_dir}/X_main_test.pkl')

Y_train.to_pickle(f'{out_dir}/Y_train.pkl')
Y_dev.to_pickle(f'{out_dir}/Y_dev.pkl')
Y_test.to_pickle(f'{out_dir}/Y_test.pkl')
Y_main_test.to_pickle(f'{out_dir}/Y_main_test.pkl')

In [17]:
### Select A Model 

# Training Random Forest 

from sklearn.ensemble import RandomForestClassifier 

rf = RandomForestClassifier(max_depth=10, random_state=42)
rf.fit(X_train, Y_train)


# Training XGBoost 

import xgboost as xgb 

xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, Y_train)


# Training Logistic Regression 

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter= 1000).fit(X_train, Y_train)



In [18]:
# Predicting & Comparing Performance 

# Performance Measure: Precision on dev set

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_dev_rg_pred = rf.predict(X_dev)
y_dev_xgb_pred = xgb_model.predict(X_dev)
y_dev_lr_pred = lr.predict(X_dev)

rg_precision = precision_score(Y_dev, y_dev_rg_pred)
xgb_precision = precision_score(Y_dev,y_dev_xgb_pred)
lr_precision = precision_score(Y_dev ,y_dev_lr_pred)

best_precision = max(rg_precision,xgb_precision,lr_precision)

rg_precision, xgb_precision, lr_precision, best_precision

(0.5909090909090909,
 0.5526315789473685,
 0.5384615384615384,
 0.5909090909090909)

In [19]:
# Performance Measure: Precision on train set 

y_train_rf_pred = rf.predict(X_train)
y_train_xgb_pred = xgb_model.predict(X_train)
y_train_lr_pred = lr.predict(X_train)


rg_precision_train = precision_score(Y_train, y_train_rf_pred)
xgb_precision_train = precision_score(Y_train, y_train_xgb_pred)
lr_precision_train = precision_score(Y_train, y_train_lr_pred)


rg_precision_train, xgb_precision_train, lr_precision_train

(1.0, 1.0, 0.5595238095238095)

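XGBoost and RF have a high-variance problem: both reach a precision of 1.0 on the train set but only 0.55 to 0.59 on the dev set. LR has a high-bias problem, with about 0.56 on train and 0.54 on dev. LR performed best on the dev set in an earlier run, but its precision there was only 66%. Training a single decision tree classifier may give better precision.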
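
The variance and bias readings above come from comparing train precision with dev precision. A minimal sketch of that comparison follows. It runs on synthetic data rather than the match data, and `precision_gap` and the `candidates` dictionary are hypothetical names introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match data (assumption: binary target, numeric features)
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           flip_y=0.3, random_state=42)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.3, random_state=42)


def precision_gap(model, X_tr, y_tr, X_dv, y_dv):
    """Fit the model and return (train precision, dev precision, train - dev)."""
    model.fit(X_tr, y_tr)
    p_train = precision_score(y_tr, model.predict(X_tr), zero_division=0)
    p_dev = precision_score(y_dv, model.predict(X_dv), zero_division=0)
    return p_train, p_dev, p_train - p_dev


candidates = {
    "rf": RandomForestClassifier(random_state=42),
    "lr": LogisticRegression(max_iter=1000, random_state=42),
}

gaps = {name: precision_gap(m, X_tr, y_tr, X_dv, y_dv)
        for name, m in candidates.items()}

for name, (p_train, p_dev, gap) in gaps.items():
    print(f"{name}: train={p_train:.3f} dev={p_dev:.3f} gap={gap:.3f}")
```

A large positive gap suggests high variance (overfitting), while low precision on both sets with a small gap suggests high bias.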

In [100]:
from sklearn.linear_model import SGDClassifier 

sgd_clf = SGDClassifier(random_state=42)

sgd_clf.fit(X_train, Y_train)

y_dev_sgd_pred =  sgd_clf.predict(X_dev)
y_train_sgd_pred = sgd_clf.predict(X_train)

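# precision_score expects (y_true, y_pred). The output recorded below was produced with the
# dev arguments swapped, so its first value is dev recall rather than dev precision.
sgd_precision = precision_score(Y_dev, y_dev_sgd_pred)
sgd_precision_train = precision_score(Y_train, y_train_sgd_pred)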

sgd_precision, sgd_precision_train

(0.044444444444444446, 0.5)

In [101]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)

dt_clf.fit(X_train, Y_train)

y_dev_dt_pred =  dt_clf.predict(X_dev)
y_train_dt_pred = dt_clf.predict(X_train)

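# precision_score expects (y_true, y_pred). The output recorded below was produced with the
# dev arguments swapped, so its first value is dev recall rather than dev precision.
dt_precision = precision_score(Y_dev, y_dev_dt_pred)
dt_precision_train = precision_score(Y_train, y_train_dt_pred)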

dt_precision, dt_precision_train

(0.4, 1.0)

SGD has both a high-bias and a high-variance problem (0.5 on train and far lower on dev); the decision tree has a high-variance problem (1.0 on train). Logistic Regression is the model to go with.
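
Because `precision_score` takes `(y_true, y_pred)` in that order, the dev scores for these two models are sensitive to argument order. A small sketch with made-up labels (not taken from the match data) shows that swapping the arguments returns recall instead of precision:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Correct order: 1 true positive out of 2 predicted positives
correct = precision_score(y_true, y_pred)

# Swapped order: 1 true positive out of 4 actual positives, which is recall
swapped = precision_score(y_pred, y_true)

recall = recall_score(y_true, y_pred)

print(correct, swapped, recall)
```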

### Model Selected - Logistic Regression 

#### Notes

* LR Precision on dev = 66%
* LR Precision on train = 52.3%
* Next: address the high bias.
* Optimize the feature vector for gradient descent, for example by scaling the features.


In [6]:
# Test data performance for Logistic Regression 

# Load models

import pickle

# Context managers close each file once the model is loaded
with open('../models/LogisticRegression_v1.pkl', 'rb') as f:
    lr = pickle.load(f)
with open('../models/XGBoost_v1.pkl', 'rb') as f:
    xgb_model = pickle.load(f)
with open('../models/RandomForest_v1.pkl', 'rb') as f:
    rf = pickle.load(f)


In [20]:
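# Running the model on test data 

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Use the selected Logistic Regression model. The value recorded below came from
# rf.predict(X_test), so it is the Random Forest's test precision; rerun to get LR's.
y_test_lr_predict = lr.predict(X_test)

lr_precision_main = precision_score(Y_test, y_test_lr_predict, average= 'binary')
lr_precision_main # Generalization performance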


0.5714285714285714

#### Next Steps

Data 

1.  Preserve the opponent column to enable filtering of predictions.
2.  Add the remaining numerical columns as predictors.
3.  Scale & normalize the data.

Model

1. Tune the hyperparameters of the newly trained model with RandomizedSearchCV to find the best parameters for the data.
2. Test the new model against the test set.
3. If the model performs better, use it and report the generalization performance using all of the 2022 data.
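
The scaling and tuning steps above could be combined roughly as follows. This is a sketch under stated assumptions, not the notebook's actual procedure: the data is synthetic, the search range for `C` is an arbitrary choice, and plain 5-fold cross-validation is used even though a time-aware splitter may suit match data better.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the match features (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

search = RandomizedSearchCV(
    pipe,
    # make_pipeline names each step after its lowercased class name
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="precision",  # same metric as the rest of the notebook
    cv=5,
    random_state=42,
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best pipeline once on the held-out test split
test_precision = precision_score(y_te, search.predict(X_te))
print(search.best_params_, round(test_precision, 3))
```

Keeping the scaler inside the pipeline prevents statistics from the validation folds leaking into training during the search.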



