## Solo Queue Modeling

This notebook starts my modeling process. The dataset covers LoL solo queue games from Diamond I to Masters, with each game's stats taken at the 10-minute mark, plus some features that Andre engineered.

In [1]:
import pandas as pd
import numpy as np

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
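# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, ConfusionMatrixDisplay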
from sklearn.linear_model import LogisticRegression

# imports for visualizations
import matplotlib.pyplot as plt
import matplotlib.pylab as pl
import xgboost

# I'm importing a functions.py file that I created so I can reuse its scoring function
import sys
functions_dir = 'Notebooks/Individual/jake'  # use one spelling for both the check and the append
if functions_dir not in sys.path:
    sys.path.append(functions_dir)
from functions import ScoreModel, FeatureImp  # FeatureImp is throwing errors about dfs not being defined

# the below lines are to play an alert sound for when the notebook finishes running
import IPython
sound_file = '../../../archive/sounds/puzzle_solved_jingle.wav'

# I wanted to be able to see all the columns when using .head()
pd.set_option('display.max_columns', None)

#### Reading in Andre's dataset and doing initial checks to make sure everything is there

In [2]:
df = pd.read_csv("../../../archive/with_rates.csv")

df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../../../archive/with_rates.csv'

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# checking class balance
df["blueWins"].value_counts()

#### Everything looks clear, so it's time to start modeling.  I've decided to use XGBClassifier for my modeling work, with GridSearchCV for hyperparameter tuning.

In [None]:
drop_col = ["blueWins", "gameId", "champions", "blueChamps", "redChamps"]
y = df["blueWins"]
X = df.drop(columns=drop_col)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=57)

boost_model = XGBClassifier(random_state=57, objective="reg:logistic")

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [2, 3, 4, 5],
    'min_child_weight': [1, 2, 3, 4, 5, 6],
    'subsample': [0.4, 0.5, 0.6, 0.7],
    'n_estimators': [30, 50, 100]
}

gridsearch = GridSearchCV(boost_model, param_grid, cv=3, scoring="accuracy", n_jobs=1)
gridsearch.fit(X_train, y_train)

# now that the grid search is fit, I'm printing the best parameters in case I need them later
best_parameters = gridsearch.best_params_
print("Best Parameters: ")
print(best_parameters)

#### XGBClassifier Scores

In [None]:
print("Training Scores")
print(ScoreModel(gridsearch, X_train, y_train))

print("Test Scores")
print(ScoreModel(gridsearch, X_test, y_test))

# as an aside, the Nones below most likely appear because ScoreModel prints its results and returns None,
# so wrapping the call in print() also prints that None

So our model is overfit. The only tuned parameter whose best value sits at an edge of its range in param_grid is n_estimators (100, the largest value tried), so larger values may be worth testing. Otherwise, a decent performance overall.
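
The stray None values in the score output can be avoided if the scoring helper returns its metrics instead of printing them. Below is a minimal sketch of that idea. `score_model` is a hypothetical stand-in, not the actual `ScoreModel` from functions.py (whose implementation isn't shown here), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def score_model(model, X, y):
    """Hypothetical stand-in for ScoreModel: return the metrics as a dict."""
    preds = model.predict(X)
    probs = model.predict_proba(X)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y, preds),
        "precision": precision_score(y, preds),
        "recall": recall_score(y, preds),
        "f1": f1_score(y, preds),
        "roc_auc": roc_auc_score(y, probs),
    }


# Synthetic data for illustration only; this is not the solo queue dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=57)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

demo_scores = score_model(demo_model, X_demo, y_demo)
for name, value in demo_scores.items():
    print(f"{name}: {value:.3f}")
```

Because the function returns a dict, the caller decides how to display it, and the training and test scores can be compared directly.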

#### After scoring, I was curious to see how a standard LogisticRegression model would perform in comparison to the gradient boosted model above

In [None]:
# I used the scaler below during testing; applying the scaled data to the logistic regression actually made the model worse

#ss = StandardScaler()
#X_train_ss = ss.fit_transform(X_train)
#X_test_ss = ss.transform(X_test)

logreg = LogisticRegression().fit(X_train, y_train)

print("Training Scores")
print(ScoreModel(logreg, X_train, y_train))

print("Test Scores")
print(ScoreModel(logreg, X_test, y_test))

#### The logistic regression performed better than I had anticipated, and is only ever so slightly overfit (0.7 difference in accuracy score)
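
For anyone repeating the scaling experiment mentioned in the cell above, one option is to put the scaler and the model in a single pipeline, so the scaler is fit on the training split only. This is a sketch on synthetic data, not a rerun of the experiment; `max_iter=1000` is my own assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; this is not the solo queue dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=57)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=57)

# The pipeline fits the scaler on the training split, then applies it before the model.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

test_accuracy = scaled_logreg.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```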

#### Next I want to analyze the feature importances for our XGBClassifier model

GridSearchCV itself has no feature_importances_ attribute (the refit model is available as best_estimator_), but this is why I printed the best parameters earlier: I can build a separate XGBClassifier with those parameters and read the feature importances from it.
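
As an alternative to rebuilding the model by hand, a fitted GridSearchCV exposes the winning model through `best_estimator_` when `refit=True` (the default). The sketch below shows this with scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier, a tiny made-up grid, and synthetic data; I'd expect the same attribute access to work with XGBClassifier, but that is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only; this is not the solo queue dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=57)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=57),  # stand-in for XGBClassifier
    {"max_depth": [2, 3], "n_estimators": [10, 20]},  # deliberately tiny grid
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# With refit=True, best_estimator_ is the best model refit on all data passed to fit().
best_model = search.best_estimator_
importances = best_model.feature_importances_

print(search.best_params_)
print(importances.round(4))
```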

In [None]:
# best parameters from the grid search:
# 'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 100, 'subsample': 0.6

boost_model2 = XGBClassifier(random_state=57, objective="reg:logistic",
                            learning_rate= 0.05, max_depth = 3, min_child_weight = 5,
                            n_estimators = 100, subsample = 0.6)

boost_model2.fit(X_train, y_train)

In [None]:
# I multiply the feature importances by 100 and round them to make them easier to read in the output
features = list(zip(X_train.columns, 100*(np.round(boost_model2.feature_importances_, 4))))
features

#### So we've got some really low feature importances here.  I've decided to remove any feature that scored below 2 when its opposing-team counterpart also scored below 2.

After removing those features I make a new XGBC model with gridsearch.
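
The removal rule above can be sketched as a small helper. This is only an illustration: `low_importance_pairs` is a hypothetical function, it assumes counterpart columns differ only in their `blue`/`red` prefix (as the column names in this notebook do), and the importance values are made up.

```python
def low_importance_pairs(importances, threshold=2.0):
    """Return features that score below threshold and whose
    opposing-team counterpart also scores below threshold."""
    to_drop = []
    for name, value in importances.items():
        if name.startswith("blue"):
            counterpart = "red" + name[len("blue"):]
        elif name.startswith("red"):
            counterpart = "blue" + name[len("red"):]
        else:
            continue  # no team prefix, so there is no counterpart to compare
        if (counterpart in importances
                and value < threshold
                and importances[counterpart] < threshold):
            to_drop.append(name)
    return to_drop


# Made-up importances (already multiplied by 100), for illustration only.
demo_importances = {
    "blueKills": 1.2, "redKills": 0.9,
    "blueDragons": 3.1, "redDragons": 1.5,
    "blueWardsPlaced": 0.4, "redWardsPlaced": 0.7,
}
dropped = low_importance_pairs(demo_importances)
print(dropped)
```

In this toy example `redDragons` is kept even though it scores below 2, because its counterpart `blueDragons` does not.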

In [None]:
nonimportant = ["blueTowersDestroyed", "redTowersDestroyed", "blueWardsPlaced", "blueWardsDestroyed",
               "blueKills", "blueAssists", "blueCSPerMin", "redWardsPlaced", "redWardsDestroyed",
               "redKills", "redAssists", "redCSPerMin"]

X_non = X.drop(columns=nonimportant)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_non, y, random_state=57)

boost_model_non = XGBClassifier(random_state=57, objective="reg:logistic")

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [2, 3, 4, 5],
    'min_child_weight': [1, 2, 3, 4, 5, 6],
    'subsample': [0.4, 0.5, 0.6, 0.7],
    'n_estimators': [30, 50, 100]
}

gridsearch_non = GridSearchCV(boost_model_non, param_grid, cv=3, scoring="accuracy", n_jobs=1)
gridsearch_non.fit(X_train2, y_train2)

# best parameters found by the search on the reduced feature set
best_parameters = gridsearch_non.best_params_

print("Best Parameters: ")
print(best_parameters)

##### Worth noting that our parameters here are nearly identical to the previous model's

In [None]:
print("Training Scores")
print(ScoreModel(gridsearch_non, X_train2, y_train2))

print("Test Scores")
print(ScoreModel(gridsearch_non, X_test2, y_test2))

#### This model is much better than our first XGBC and is better overall than our Logistic Regression model too!  We're no longer overfit, and our scores improved across the board.

We're going to follow up with another feature importance inspection.

In [None]:
boost_model_non = XGBClassifier(random_state=57, objective="reg:logistic",
                            learning_rate= 0.05, max_depth = 3, min_child_weight = 5,
                            n_estimators = 100, subsample = 0.6)
boost_model_non.fit(X_train2, y_train2)

features = list(zip(X_train2.columns, 100*(np.round(boost_model_non.feature_importances_, 4))))
features

## Modeling with new features

The new features ask: does either team have a particular impactful champion? (Checked for 5 different champions.)

In [None]:
new_df = pd.read_csv("../../../archive/with_rates_and_spec_gods.csv")
new_df.head()

In [None]:
drop_col = ["blueWins", "gameId", "blueChamps", "redChamps"]
X_new = new_df.drop(drop_col, axis=1)
X_new = X_new.drop(nonimportant, axis=1)
y_new = new_df["blueWins"]

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, random_state=57)

boost_model_new = XGBClassifier(random_state=57, objective="reg:logistic",
                            learning_rate= 0.05, max_depth = 3, min_child_weight = 5,
                            n_estimators = 100, subsample = 0.6)
boost_model_new.fit(X_train_new, y_train_new)



print("Training Scores")
print(ScoreModel(boost_model_new, X_train_new, y_train_new))

print("Test Scores")
print(ScoreModel(boost_model_new, X_test_new, y_test_new))

#### Here we have another decently performing model; however, we're still overfit to our training data.

Let's check out our feature importances again

In [None]:
features = list(zip(X_train_new.columns, 100*(np.round(boost_model_new.feature_importances_, 4))))
features

In [None]:
IPython.display.Audio(sound_file, autoplay=True, rate=1000)

## Summary

It seems that the first 10 minutes of a game have a significant impact on the outcome of a match.  My models consistently produce accuracy scores above 70%, with my best model at 74%.

Even though the first 10 minutes are impactful, our results are only 24 percentage points better than random guessing (which would be 50%).  I attribute this to the fact that games average around 30 minutes, so there's still plenty of time for the disadvantaged team to make a comeback, and in some rare cases the winning team is still at a disadvantage in terms of overall statistics.  If we were to take data from less skilled players, we could expect this variability in wins to be more apparent.
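
The gap over random guessing can be expressed two ways, which are easy to mix up. The short sketch below works through the arithmetic using the 50% baseline and 74% best accuracy quoted in this summary; the variable names are my own.

```python
baseline = 0.50  # accuracy from guessing at random between two outcomes
best = 0.74      # best accuracy reported in this summary

# Absolute gap, in percentage points.
point_gain = (best - baseline) * 100
# Gap relative to the baseline, as a percentage of the baseline.
relative_gain = (best - baseline) / baseline * 100

print(f"{point_gain:.0f} percentage points, "
      f"{relative_gain:.0f}% relative improvement")
```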

As for the features driving our model, team gold and experience leads are the most important.  A couple of features, like dragons and the average win rate of a team's selected champions, are less important but are not necessarily correlated with gold and experience leads.  Dragons, for example, provide team-wide buffs that aren't measured in gold or experience (though killing the dragon awards a small amount of gold and XP).

## Visuals for presentation

In [None]:
xgboost.plot_importance(boost_model_non)