
<img src="https://gamesmea.com/wp-content/uploads/2018/11/IG-810x400.jpg" style="float: right;" width="500" height="100" />



# Winrate Prediction!

# Exploratory Data Analysis  & Prediction!


# Table of contents
***

* [1. Introduction](#1)
* [2. Data set review & preparation](#2)
* [3. Exploratory Data Analysis](#3)
* [4. Outliers](#4)
* [5. Feature engineering](#5)
* [6. Model fitting and selection](#6)
* [7. Conclusion](#7)


# <font color="#00bfff"> 1. Introduction </font>
<a id="1"></a> 
***

**Context**
* League of Legends is a MOBA (multiplayer online battle arena) where 2 teams (blue and red) face off. There are 3 lanes, a jungle, and 5 roles. The goal is to take down the enemy Nexus to win the game.

**Glossary**
* Warding totem: An item that a player can put on the map to reveal the nearby area. Very useful for map/objectives control.
* Minions: NPCs that belong to both teams. They give gold when killed by players.
* Jungle minions: NPCs that belong to NO TEAM. They give gold and buffs when killed by players.
* Elite monsters: Monsters with high hp/damage that give a massive bonus (gold/XP/stats) when killed by a team.
* Dragons: Elite monster which gives team bonus when killed. The 4th dragon killed by a team gives a massive stats bonus. The 5th dragon (Elder Dragon) offers a huge advantage to the team.
* Herald: Elite monster which gives stats bonus when killed by the player. It helps to push a lane and destroys structures.
* Towers: Structures you have to destroy to reach the enemy Nexus. They give gold.
* Level: Champion level. Start at 1. Max is 18.

**We aim to accomplish the following for this study:**

**1. Identify and visualize which factors contribute to a blue-team win (`blueWins`)**

**2. Build a prediction model**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from mlxtend.preprocessing import minmax_scaling

# Show every row and column when displaying data frames
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Preprocessing and data splitting
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Metrics
from sklearn.metrics import precision_score, recall_score

# <font color="#00bfff"> 2. Data set review & preparation </font>
<a id="2"></a> 
***    

In [None]:
#Read data frame
df = pd.read_csv('../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv',delimiter=',')
df.shape

**The data frame has 9,879 rows and 40 attributes. We review it further to decide which attributes are needed and what data manipulation is required before exploratory analysis and predictive modelling.**

In [None]:
#Data cleaning
#Missing values
missing_values_count = df.isnull().sum()
missing_values_count

In [None]:
#Unique count
df.nunique()

In [None]:
#Drop unnecessary columns: gameId is only an identifier, and the red-team columns below (first blood, kills, elite monsters, etc.) are unnecessary or repeat information already held in the blue-team columns
df = df.drop(['gameId','redFirstBlood', 'redKills', 'redEliteMonsters', 'redDragons','redTotalMinionsKilled',
       'redTotalJungleMinionsKilled', 'redGoldDiff', 'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'redHeralds','redDeaths','redTotalGold','redTotalExperience','redAvgLevel'], axis = 1)

In [None]:
df.head()

In [None]:
df.info()

# <font color="#00bfff"> 3. Exploratory Data Analysis </font>
<a id="3"></a> 
***

**Here our main interest is to understand how the given attributes relate to `blueWins`.**

In [None]:
labels = 'Bluewins', 'Redwins'
sizes = [df.blueWins[df['blueWins'] == 1].count(), df.blueWins[df['blueWins'] == 0].count()]
colors = ['b','r']
explode = (0,0.1)
fig1,ax1 = plt.subplots(figsize = (7,7))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90,colors = colors)
ax1.axis('equal')
plt.title("Winrate", size = 20)
plt.show()

**Blue and red win rates are both close to 50%, so team assignment by itself does not appear to influence the win rate.**
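
As a numeric cross-check of the pie chart, the class balance can also be read off with `value_counts`. The sketch below is only an illustration: `toy_wins` is a made-up stand-in for `df['blueWins']`, not the real data.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['blueWins'] (1 = blue win, 0 = red win)
toy_wins = pd.Series([1, 0, 1, 0, 0, 1, 1, 0], name="blueWins")

# normalize=True returns the share of each class instead of raw counts
win_share = toy_wins.value_counts(normalize=True)
print(win_share)
```

On the real data frame the same call would be `df['blueWins'].value_counts(normalize=True)`.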

In [None]:
#Sort the features by their correlation with the blueWins column and drop the negatively correlated features
plt.figure(figsize=(15,6))
dfw = df.corr()['blueWins'].drop(['blueWins','redWardsDestroyed','redWardsPlaced','redTowersDestroyed','redAssists','blueDeaths'])
dfw = dfw.sort_values(ascending=False)
sns.barplot(y=dfw.index, x=dfw)
plt.show()

In [None]:
#Create ranking based on correlation for each feature on scale 0-10, where most important feature gets 10 points.
dfw.apply(lambda x: round(round(20/dfw.max()*x)/2, 1))

From the above analysis, gold and experience are the two most influential features affecting win rate, closely followed by kills, assists and minions killed.
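
To make the 0-10 ranking formula easier to follow, here is a minimal sketch on made-up correlation values (the feature names exist in the dataset, but the numbers are hypothetical). The strongest feature is scaled to 20, rounded to an integer and halved, which gives scores in steps of 0.5.

```python
import pandas as pd

# Hypothetical correlations with blueWins (made-up values for illustration)
toy_corr = pd.Series({"blueGoldDiff": 0.50, "blueKills": 0.34, "blueWardsPlaced": 0.02})

# Scale so the strongest feature maps to 20, round to an integer, then halve:
# the result is a 0-10 score in steps of 0.5
toy_rank = toy_corr.apply(lambda x: round(round(20 / toy_corr.max() * x) / 2, 1))
print(toy_rank)
```

With these toy numbers the strongest feature scores 10, the middle one 7 and the weakest 0.5.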

In [None]:
#correlation matrix
plt.figure(figsize=(17, 12))
sns.heatmap(df.drop('blueWins', axis=1).corr(), annot=True, fmt='.2f', vmin=0);

In [None]:
#Based on the correlation matrix, let's clean the dataset a little more to avoid collinearity
df = df.drop(['redAssists','blueGoldPerMin','redTowersDestroyed'], axis = 1)

# <font color="#00bfff"> 4. Outliers </font>
<a id="4"></a> 
***

In [None]:
#We note some outliers in the columns below. We will remove them if judged not relevant.
# blueWardsPlaced
# blueWardsDestroyed
# blueDeaths
# blueTowersDestroyed

In [None]:
#Copy
df1 = df.copy()

**BlueWardsPlaced**

In [None]:
sns.displot(df1['blueWardsPlaced'],kind="ecdf")

We can see that in some games the blue team placed more than 100 wards within the first 10 minutes, which is unusual. Some players decide the game is already lost and place wards while waiting for it to end, because they cannot surrender within the first 10 minutes.
We remove any game with more than 100 wards placed.


In [None]:
#Remove games where blueWardsPlaced is more than 100
df1 = df1.loc[df1['blueWardsPlaced'] <= 100]

**blueWardsDestroyed**

In [None]:
sns.displot(df1['blueWardsDestroyed'],kind="ecdf")

For the same reason as above, players on the winning side can destroy wards freely once the match goes into garbage time.\
We remove any value above the 99th percentile.
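
The percentile cut can be sketched on a small made-up column as follows. The column name matches the dataset, but the values are hypothetical: 100 ordinary games plus one extreme one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 100 ordinary games plus one extreme value
toy = pd.DataFrame({"blueWardsDestroyed": [0, 1, 2, 3, 4] * 20 + [30]})

# 99th percentile of the column, used as the upper cut-off
cutoff = np.quantile(toy["blueWardsDestroyed"], q=0.99)

# Keep only the rows at or below the cut-off
trimmed = toy.loc[toy["blueWardsDestroyed"] <= cutoff]
print(cutoff, len(toy), len(trimmed))
```

Only the single extreme row is dropped here. A percentile cut always removes roughly the top 1% of rows, whether or not they are true outliers, so the threshold is a judgment call.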


In [None]:
#Remove everything above the 99th percentile
df1 = df1.loc[df1['blueWardsDestroyed'] <= np.quantile(df1['blueWardsDestroyed'],q=0.99)]

**Blue Deaths**

In [None]:
sns.displot(df1['blueDeaths'],kind="ecdf")

In [None]:
df1['blueDeaths'].loc[df1['blueDeaths'] >= 20]

Based on my personal game experience, 22 deaths in 10 minutes is not that rare. Some players get into a bad mood and give away free kills after they are camped or solo-killed. Besides, if your lane opponent is a smurf, you are likely to give up many kills. Teammates are an essential part of the game and we cannot guarantee that every player plays perfectly, so we do not remove this outlier.\
*camp: To repeatedly gank the same lane\
*smurf: An experienced player who creates a new account for the purposes of being matched against inexperienced players for easy wins.

In [None]:
# How many games do we remove?
print("We've removed {} outliers".format(df.shape[0] - df1.shape[0]))

In [None]:
df1.head()

**Since there is no second Dragon or Herald within the first 10 minutes, we can treat blueDragons and blueHeralds as categorical variables.**

We repeat the analysis of the relationship between win rate and features, this time using count plots and box plots to compare games that blue won with games that red won.

In [None]:
# Review the relation between blueWins and the categorical variables, plus blueTowersDestroyed
fig, ax2 = plt.subplots(2,2,figsize = (10,10))
sns.countplot(x='blueFirstBlood', hue = 'blueWins',data = df,palette="Set1", ax=ax2[0][0])
sns.countplot(x='blueDragons', hue = 'blueWins',data = df, palette="Set1", ax=ax2[0][1])
sns.countplot(x='blueHeralds', hue = 'blueWins',data = df, palette="Set1", ax=ax2[1][0])
sns.countplot(x='blueTowersDestroyed', hue = 'blueWins',data = df,palette="Set1",  ax=ax2[1][1])

The blue team has a higher win rate when it has secured these objectives (first blood, a dragon, a Herald, or a tower).

In [None]:
# Relations based on the continuous data attributes
fig,ax3 = plt.subplots(3,2, figsize = (10,10))
sns.boxplot(y='blueKills',x = 'blueWins', hue = 'blueWins',data = df,palette="Set1", ax = ax3[0][0])
sns.boxplot(y='blueDeaths',x = 'blueWins', hue = 'blueWins',data = df,palette="Set1", ax = ax3[0][1])
sns.boxplot(y='blueWardsPlaced',x = 'blueWins', hue = 'blueWins',data = df, palette="Set1",ax = ax3[1][0])
sns.boxplot(y='blueGoldDiff',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[1][1])
sns.boxplot(y='blueExperienceDiff',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[2][0])
sns.boxplot(y='blueTotalJungleMinionsKilled',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[2][1])

# <font color="#00bfff"> 5. Feature engineering </font>
<a id="5"></a> 
***

In [None]:
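# Replace 0 deaths with 0.5 so that KDA can be computed (cast to float first so 0.5 fits the column)
df1['blueDeaths'] = df1['blueDeaths'].astype(float).replace(0, 0.5)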

**Zero deaths reflects a significant advantage, so we replace 0 deaths with 0.5: halving the denominator doubles the KDA compared with a game with one death.**\
*KDA = (Kills + Assists) / Deaths
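
To make the effect of the 0.5 substitution concrete, here is a minimal sketch on a made-up three-game table. The column names match the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy games; only the column names come from the dataset
toy = pd.DataFrame({
    "blueKills":   [5, 5, 3],
    "blueAssists": [4, 4, 3],
    "blueDeaths":  [0, 1, 3],
})

# Replace 0 deaths with 0.5 so the division is defined
deaths = toy["blueDeaths"].astype(float).replace(0, 0.5)

# KDA = (Kills + Assists) / Deaths
toy["KDA"] = (toy["blueKills"] + toy["blueAssists"]) / deaths
print(toy)
```

The first two games have the same kills and assists, and the deathless one ends up with twice the KDA of the one-death game (18 versus 9).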

In [None]:
df1['KDA'] = ((df1.blueKills +df1.blueAssists)/df1.blueDeaths)

In [None]:
# The blue ward retention ratio can represent map control
df1 ['bluewardretentionratio'] = (df1.blueWardsPlaced - df1.redWardsDestroyed)/df1.blueWardsPlaced

In [None]:
df_copy = df1.copy()
X_features = df_copy.loc[:, df_copy.columns != 'blueWins']
y_target = df_copy.blueWins
X_train, X_test, y_train, y_test = train_test_split(X_features,y_target, test_size=0.3, random_state=0, stratify=y_target)

In [None]:
X_train.head()

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train.columns if X_train[cname].nunique() < 10 and 
                        X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

In [None]:
#Data Standardization
scaler = StandardScaler()
scaler.fit(X_train[numerical_cols])
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

In [None]:
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_test = X_test[my_cols].copy()

# <font color="#00bfff"> 6. Model fitting and selection </font>
<a id="6"></a> 
***

In [None]:
# Use Cross-validation.
from sklearn.model_selection import cross_val_score

# Logistic Regression
log_reg = LogisticRegression()
log_scores = cross_val_score(log_reg, X_train, y_train, cv=3)
log_reg_mean = log_scores.mean()

# SVC
svc_clf = SVC()
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=3)
svc_mean = svc_scores.mean()

# KNearestNeighbors
knn_clf = KNeighborsClassifier()
knn_scores = cross_val_score(knn_clf, X_train, y_train, cv=3)
knn_mean = knn_scores.mean()

# Decision Tree
tree_clf = tree.DecisionTreeClassifier()
tree_scores = cross_val_score(tree_clf, X_train, y_train, cv=3)
tree_mean = tree_scores.mean()

# Gradient Boosting Classifier
grad_clf = GradientBoostingClassifier()
grad_scores = cross_val_score(grad_clf, X_train, y_train, cv=3)
grad_mean = grad_scores.mean()

# Random Forest Classifier
rand_clf = RandomForestClassifier(n_estimators=18)
rand_scores = cross_val_score(rand_clf, X_train, y_train, cv=3)
rand_mean = rand_scores.mean()

# NeuralNet Classifier
neural_clf = MLPClassifier(alpha=1)
neural_scores = cross_val_score(neural_clf, X_train, y_train, cv=3)
neural_mean = neural_scores.mean()

# Naives Bayes
nav_clf = GaussianNB()
nav_scores = cross_val_score(nav_clf, X_train, y_train, cv=3)
nav_mean = nav_scores.mean()

# Create a Dataframe with the results.
d = {'Classifiers': ['Logistic Reg.', 'SVC', 'KNN', 'Dec Tree', 'Grad B CLF', 'Rand FC', 'Neural Classifier', 'Naives Bayes'], 
    'Crossval Mean Scores': [log_reg_mean, svc_mean, knn_mean, tree_mean, grad_mean, rand_mean, neural_mean, nav_mean]}

result_df = pd.DataFrame(data=d)

In [None]:
# All our models perform well but I will go with Logistic Regression.
result_df = result_df.sort_values(by=['Crossval Mean Scores'], ascending=False)
result_df

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(log_reg, X_train, y_train, cv=3)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
# The scores below are computed on the cross-validated predictions of log_reg (y_train_pred)
print("Logistic Regression Classifier accuracy is %2.2f" % accuracy_score(y_train, y_train_pred))
print('Precision Score: ', precision_score(y_train, y_train_pred))
print('Recall Score: ', recall_score(y_train, y_train_pred))
print('F1 Score: ', f1_score(y_train, y_train_pred))

In [None]:
log_y_scores = cross_val_predict(log_reg, X_train, y_train, cv=3, method="decision_function")
neural_y_scores = cross_val_predict(neural_clf, X_train, y_train, cv=3, method="predict_proba")
naives_y_scores = cross_val_predict(nav_clf, X_train, y_train, cv=3, method="predict_proba")

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, threshold = precision_recall_curve(y_train, log_y_scores)

In [None]:
# Named differently from sklearn's precision_recall_curve so the import is not overwritten
def plot_precision_recall(precisions, recalls, thresholds):
    fig, ax = plt.subplots(figsize=(12,8))
    plt.plot(thresholds, precisions[:-1], "r--", label="Precisions")
    plt.plot(thresholds, recalls[:-1], "#424242", label="Recalls")
    plt.title("Precision and Recall \n Tradeoff", fontsize=18)
    plt.ylabel("Level of Precision and Recall", fontsize=16)
    plt.xlabel("Thresholds", fontsize=16)
    plt.legend(loc="best", fontsize=14)
    plt.xlim([-2, 4.7])
    plt.ylim([0, 1])
    plt.axvline(x=0, linewidth=3, color="#0B3861")
    plt.annotate('Precision and \n Recall Balance ', xy=(0, 0.73), xytext=(55, -40),
             textcoords="offset points",
            arrowprops=dict(facecolor='black', shrink=0.05),
                fontsize=12, 
                color='k')
    
plot_precision_recall(precisions, recalls, threshold)
plt.show()

In [None]:
# hack to work around issue #9589 introduced in Scikit-Learn 0.19.0
if log_y_scores.ndim == 2:
    log_y_scores = log_y_scores[:, 1]

if neural_y_scores.ndim == 2:
    neural_y_scores = neural_y_scores[:, 1]
    
if naives_y_scores.ndim == 2:
    naives_y_scores = naives_y_scores[:, 1]

In [None]:
from sklearn.metrics import roc_curve
# Logistic RegressionClassifier
# Neural Classifier
# Naives Bayes Classifier
log_fpr, log_tpr, thresold = roc_curve(y_train, log_y_scores)
neu_fpr, neu_tpr, neu_threshold = roc_curve(y_train, neural_y_scores)
nav_fpr, nav_tpr, nav_threshold = roc_curve(y_train, naives_y_scores)

In [None]:
def graph_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.figure(figsize=(10,6))
    plt.title('ROC Curve \n Logistic Regression Classifier', fontsize=18)
    plt.plot(false_positive_rate, true_positive_rate, label=label)
    plt.plot([0, 1], [0, 1], '#0C8EE0')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('ROC Score of 80.69% ', xy=(0.5, 0.9), xytext=(0.6, 0.85),
            arrowprops=dict(facecolor='#F75118', shrink=0.05),
            )
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
                arrowprops=dict(facecolor='#F75118', shrink=0.05),
                )
    
    
# The third argument is a plot label, so the threshold array is not passed here
graph_roc_curve(log_fpr, log_tpr)
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score

print('Logistic Regression Classifier Score: ', roc_auc_score(y_train, log_y_scores))
print('Neural Classifier Score: ', roc_auc_score(y_train, neural_y_scores))
print('Naives Bayes Classifier: ', roc_auc_score(y_train, naives_y_scores))

In [None]:
def graph_roc_curve_multiple(log_fpr, log_tpr, neu_fpr, neu_tpr, nav_fpr, nav_tpr):
    plt.figure(figsize=(8,6))
    plt.title('ROC Curve \n Top 3 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier (Score = 80.69%)')
    plt.plot(neu_fpr, neu_tpr, label='Neural Classifier (Score = 80.26%)')
    plt.plot(nav_fpr, nav_tpr, label='Naives Bayes Classifier (Score = 79.17%)')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
                arrowprops=dict(facecolor='#6E726D', shrink=0.05),
                )
    plt.legend()
    
graph_roc_curve_multiple(log_fpr, log_tpr, neu_fpr, neu_tpr, nav_fpr, nav_tpr)
plt.show()

A voting classifier is a machine learning estimator that trains several base models and predicts by aggregating their outputs. The aggregation is a vote across the base estimators, and the voting can be of two types:

* Hard Voting: Voting is calculated on the predicted output class.
* Soft Voting: Voting is calculated on the predicted probability of the output class.
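
The difference between the two voting types can be sketched with plain NumPy. The three probabilities below are hypothetical outputs of three base classifiers for a single game, chosen so that the two rules disagree.

```python
import numpy as np

# Hypothetical P(blue wins) from three base classifiers for one game
probs = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts a class vote (threshold 0.5), majority wins
votes = (probs >= 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities, then threshold the mean
soft_prediction = int(probs.mean() >= 0.5)

print(hard_prediction, soft_prediction)
```

Here hard voting predicts a red win (two of three votes), while soft voting predicts a blue win because one classifier is very confident. In scikit-learn, `voting='soft'` requires every base estimator to implement `predict_proba`.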

In [None]:
# Our three classifiers are log_reg, nav_clf and neural_clf
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[('log', log_reg),  ('nav', nav_clf), ('neural', neural_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_reg, nav_clf, neural_clf, voting_clf):
    clf.fit(X_train, y_train)
    predict = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, predict))

# <font color="#00bfff"> 7. Conclusion </font>
<a id="7"></a> 
***

# Logistic Regression Wins!
Our aim was to predict the winner of a game from the first 10 minutes of in-game data.
From the review of the models above, Logistic Regression provides a decent balance of recall and precision on the training set. Although accuracy on the test data is lower, it could be improved by retraining the model with more data over time. Likewise, as a game goes on and more in-game data becomes available, the prediction accuracy should rise gradually.
