
<img src="https://gamesmea.com/wp-content/uploads/2018/11/IG-810x400.jpg" style="float: right;" width="500" height="100" />



# Winrate Prediction!

# Exploratory Data Analysis  & Prediction!


# Table of contents
***

* [1. Introduction](#1)
* [2. Data set review & preparation](#2)
* [3. Exploratory Data Analysis](#3)
* [4. Outliers](#4)
* [5. Feature engineering](#5)
* [6. Data preparation for model fitting](#6)
* [7. Model fitting and selection](#7)
* [8. Conclusion](#8)


# <font color="#00bfff"> 1. Introduction </font>
<a id="1"></a> 
***

**Context**
* League of Legends is a MOBA (multiplayer online battle arena) where 2 teams (blue and red) face off. There are 3 lanes, a jungle, and 5 roles. The goal is to take down the enemy Nexus to win the game.

**Glossary**
* Warding totem: An item that a player can put on the map to reveal the nearby area. Very useful for map/objectives control.
* Minions: NPC that belong to both teams. They give gold when killed by players.
* Jungle minions: NPC that belong to NO TEAM. They give gold and buffs when killed by players.
* Elite monsters: Monsters with high hp/damage that give a massive bonus (gold/XP/stats) when killed by a team.
* Dragons: Elite monster which gives team bonus when killed. The 4th dragon killed by a team gives a massive stats bonus. The 5th dragon (Elder Dragon) offers a huge advantage to the team.
* Herald: Elite monster which gives stats bonus when killed by the player. It helps to push a lane and destroys structures.
* Towers: Structures you have to destroy to reach the enemy Nexus. They give gold.
* Level: Champion level. Start at 1. Max is 18.

**We aim to accomplish the following for this study:**

**1.Identify and visualize which factors contribute to bluewins**

**2.Build a prediction model**

In [190]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from mlxtend.preprocessing import minmax_scaling
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# <font color="#00bfff"> 2. Data set review & preparation
<a id="2"></a> 
***    

In [191]:
#Read data frame
df = pd.read_csv('../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv',delimiter=',')
df.shape

**The df has 9879 rows with 40 attributes. We review this further to identify what attributes will be necessary and what data manipulation needs to be carried out before Exploratory analysis and prediction modelling**

In [192]:
#Date cleaning
#Missing value 
missing_values_count = df.isnull().sum()
missing_values_count

In [193]:
#unqiue count
df.nunique()

In [194]:
#Drop some unnecessary columns. e.g. gameId,blue and red team firstblood, blue and red team EliteMonsters etc are unnecessary and repeated
df = df.drop(['gameId','redFirstBlood', 'redKills', 'redEliteMonsters', 'redDragons','redTotalMinionsKilled',
       'redTotalJungleMinionsKilled', 'redGoldDiff', 'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'redHeralds','redDeaths','redTotalGold','redTotalExperience','redAvgLevel'], axis = 1)

In [195]:
df.head()

In [196]:
df.info()

# <font color="#00bfff"> 3. Exploratory Data Analysis
<a id="3"></a> 
***

**Here our main interest is to get an understanding as to how the given attributes relate to the 'bluewins' status**

In [197]:
labels = 'Bluewins', 'Redwins'
sizes = [df.blueWins[df['blueWins'] == 1].count(), df.blueWins[df['blueWins'] == 0].count()]
colors = ['b','r']
explode = (0,0.1)
fig1,ax1 = plt.subplots(figsize = (7,7))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90,colors = colors)
ax1.axis('equal')
plt.title("Winrate", size = 20)
plt.show()

**Blue and Red winrates are both close to 50%,hence team assign will not influence winrate**

In [198]:
#let's sort correlation of features with blueWins column and drop negetively correlated furture
plt.figure(figsize=(15,6))
dfw = df.corr()['blueWins'].drop(['blueWins','redWardsDestroyed','redWardsPlaced','redTowersDestroyed','redAssists','blueDeaths'])
dfw = dfw.sort_values(ascending=False)
sns.barplot(y=dfw.index, x=dfw)
plt.show()

In [199]:
#Create ranking based on correlation for each feature on scale 0-10, where most important feature gets 10 points.
dfw.apply(lambda x: round(round(20/dfw.max()*x)/2, 1))

From the above analysis, gold and experience are two influential features affecting winrate. Kills, Assists and minionsKilled are following closely.

In [200]:
#correlation matrix
plt.figure(figsize=(17, 12))
sns.heatmap(df.drop('blueWins', axis=1).corr(), annot=True, fmt='.2f', vmin=0);

In [201]:
#Based on the correlation matrix, let's clean the dataset a little bit more to avoid colinearity
df = df.drop(['redAssists','blueGoldPerMin','redTowersDestroyed'], axis = 1)

# <font color="#00bfff"> 4. Outliers
<a id="4"></a> 
***

In [202]:
#We note some outliers. We will remove them if judged not relevant.
# blueWardsPlaced
# blueWardsDestroyed
# blueDeath
# blueTowerDestroyed

In [203]:
#Copy
df1 = df.copy()

**BlueWardsPlaced**

In [204]:
sns.displot(df1['blueWardsPlaced'],kind="ecdf")

We can see in some games,blue players has placed more than 100 within 10 minutes, which is not an usual action. Some players think the game are already lost, placing wards in to wait for the game end due to they cannot surrender within 10 minutes.
We remove any wardplaced more than 100.


In [205]:
#Remove bluewardsplaced more than 100
d1f = df1.loc[df1['blueWardsPlaced'] <= 100]

**blueWardsDestroyed**

In [206]:
sns.displot(df1['blueWardsDestroyed'],kind="ecdf")

For the same reason above, winning player can have free wards from the losing side once the match goes into garbage time.\
We remove any value over 99%.


In [207]:
#Remove everything above 99%
df1 = df1.loc[df1['blueWardsDestroyed'] <= np.quantile(df1['blueWardsDestroyed'],q=0.99)]

**Blue Deaths**

In [208]:
sns.displot(df1['blueDeaths'],kind="ecdf")

In [209]:
df1['blueDeaths'].loc[df1['blueDeaths'] >= 20]

Based on my personal game experience, 22 deaths in 10 mins is not too rare. Some players are in dark mood they may give free kills after they are camped or solokilled. Besides, if you lane opponent is smurf, you are likely to give many kills. Teammate is an esssential part of the game,we cannot guarrenty every player is perfect, so we do not remove this oultlier.\
*camp: To repeatedly gank the same lane\
*smurf: An experienced player who creates a new account for the purposes of being matched against inexperienced players for easy wins.

In [210]:
df1.loc[df1['blueDeaths'] == 0] = 0.5

Since 0 death reflect a significant advantage, we can double this KDA by half the denominator.\
*KDA = (Kill + Assist)/Death

In [211]:
# How many games do we remove?
print("We've removed {} outliers".format(df.shape[0] - df1.shape[0]))

In [212]:
df1.head()

**Since there is no second Dragon or Herald in 10 mins, we classify bluedragons and blueheralds as categorical varibales**

Same ananlysis for the relationship between winrate and furtures but we use countplot and boxplot to compare the bule winrate change to red winrate.

In [213]:
 # We review the 'Status' relation with categorical variables and TowerDestroyed
fig, ax2 = plt.subplots(2,2,figsize = (10,10))
sns.countplot(x='blueFirstBlood', hue = 'blueWins',data = df,palette="Set1", ax=ax2[0][0])
sns.countplot(x='blueDragons', hue = 'blueWins',data = df, palette="Set1", ax=ax2[0][1])
sns.countplot(x='blueHeralds', hue = 'blueWins',data = df, palette="Set1", ax=ax2[1][0])
sns.countplot(x='blueTowersDestroyed', hue = 'blueWins',data = df,palette="Set1",  ax=ax2[1][1])

Blue team have higher winrate when these categorical variables are true.

In [214]:
# Relations based on the continuous data attributes
fig,ax3 = plt.subplots(3,2, figsize = (10,10))
sns.boxplot(y='blueKills',x = 'blueWins', hue = 'blueWins',data = df,palette="Set1", ax = ax3[0][0])
sns.boxplot(y='blueDeaths',x = 'blueWins', hue = 'blueWins',data = df,palette="Set1", ax = ax3[0][1])
sns.boxplot(y='blueWardsPlaced',x = 'blueWins', hue = 'blueWins',data = df, palette="Set1",ax = ax3[1][0])
sns.boxplot(y='blueGoldDiff',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[1][1])
sns.boxplot(y='blueExperienceDiff',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[2][0])
sns.boxplot(y='blueTotalJungleMinionsKilled',x = 'blueWins',hue = 'blueWins',data = df,palette="Set1", ax=ax3[2][1])

# <font color="#00bfff"> 5. Feature engineering
<a id="5"></a> 
***
We split the train and test sets

In [215]:
# Split Train, test data
# x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4)
df_train = df1.sample(frac = 0.8, random_state = 0)
df_test = df1.drop(df_train.index)

print(len(df_train))
print(len(df_test))

In [216]:
# blueward retention ratio can repesent map control
df_train ['bluewardretentionratio'] = (df_train.blueWardsPlaced - df_train.redWardsDestroyed)/df_train.blueWardsPlaced
sns.boxplot(y='bluewardretentionratio',x = 'blueWins', hue = 'blueWins',data = df_train,palette="Set1")
plt.ylim(0, 1.1)

In [217]:
# KDA is kills,assist to death ratio 
df_train['KDA'] = ((df_train.blueKills +df_train.blueAssists)/df_train.blueDeaths)
sns.boxplot(y='KDA',x = 'blueWins', hue = 'blueWins',data = df_train,palette="Set1")

# <font color="#00bfff"> 6. Data preparation for model fitting
<a id="6"></a> 
***

In [218]:
# Arrange columns by data type for easier manipulation
continuous_vars = ['KDA','blueExperienceDiff','blueGoldDiff','bluewardretentionratio']
categorical_vars = ['blueFirstBlood','blueDragons','blueHeralds']
df_train = df_train[['blueWins'] + continuous_vars + categorical_vars ]
df_train.head()

In [219]:
#For the one hot variables, we change 0 to -1 so that the models can capture a negative relation where the attribute in inapplicable instead of 0
df_train.loc[df_train.blueFirstBlood == 0, 'blueFirstBlood'] = -1
df_train.loc[df_train.blueDragons == 0, 'blueDragons'] = -1
df_train.loc[df_train.blueHeralds == 0, 'blueHeralds'] = -1

In [220]:
#Data Standardization
scaler = StandardScaler()
scaler.fit(df_train[continuous_vars])
df_train[continuous_vars] = scaler.transform(df_train[continuous_vars])

In [221]:
# data prep pipeline for test data
def Preppipeline(df_predict,df_train_Cols):
    # Add new features
    df_predict['KDA'] =((df_predict.blueKills +df_predict.blueAssists)/df_predict.blueDeaths)
    df_predict['bluewardretentionratio'] = (df_predict.blueWardsPlaced - df_predict.redWardsDestroyed)/df_predict.blueWardsPlaced
 
    # Reorder the columns
    continuous_vars = ['KDA','blueExperienceDiff','blueGoldDiff','bluewardretentionratio']
    categorical_vars = ['blueFirstBlood','blueDragons','blueHeralds']
    df_predict = df_predict[['blueWins'] + continuous_vars + categorical_vars]
    # Change the 0 in categorical variables to -1
    df_predict.loc[df_predict.blueFirstBlood == 0, 'blueFirstBlood'] = -1
    df_predict.loc[df_predict.blueDragons == 0, 'blueDragons'] = -1
    df_predict.loc[df_predict.blueHeralds == 0, 'blueHeralds'] = -1
    
    # Ensure that all one hot encoded variables that appear in the train data appear in the subsequent data
    L = list(set(df_train_Cols) - set(df_predict.columns))
    for l in L:
        df_predict[str(l)] = -1  
    #Data Standardization
    scaler = StandardScaler()
    scaler.fit(df_predict[continuous_vars])
    df_predict[continuous_vars] = scaler.transform(df_predict[continuous_vars])
    # Ensure that The variables are ordered in the same way as was ordered in the train set
    df_predict = df_predict[df_train_Cols]
    le = LabelEncoder()
    df_predict.blueWins = le.fit_transform(df_predict.blueWins)
    return df_predict

# <font color="#00bfff"> 7. Model fitting and selection
<a id="7"></a> 
***

In [227]:
# Fit models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
le = LabelEncoder()
df_train.blueWins = le.fit_transform(df_train.blueWins)

In [223]:
# Fit primal logistic regression
log_primal = LogisticRegression()
log_primal.fit(df_train.loc[:, df_train.columns != 'blueWins'],df_train.blueWins)

In [224]:
# Fit Random Forest classifier
RF = RandomForestClassifier(bootstrap=True,max_depth=8, max_features=6, max_leaf_nodes=None)
RF.fit(df_train.loc[:, df_train.columns != 'blueWins'],df_train.blueWins)

In [225]:
# Fit XGB
XGB = XGBClassifier()
XGB.fit(df_train.loc[:, df_train.columns != 'blueWins'],df_train.blueWins)

In [228]:
print(classification_report(df_train.blueWins, log_primal.predict(df_train.loc[:, df_train.columns != 'blueWins'])))

In [229]:
print(classification_report(df_train.blueWins,  RF.predict(df_train.loc[:, df_train.columns != 'blueWins'])))

In [230]:
print(classification_report(df_train.blueWins,  XGB.predict(df_train.loc[:, df_train.columns != 'blueWins'])))

**Test model prediction accuracy on test data**

In [231]:
df_test = Preppipeline(df_test,df_train.columns)
df_test = df_test.mask(np.isinf(df_test))
df_test = df_test.dropna()
df_test.shape

In [232]:
print(classification_report(df_test.blueWins,  XGB.predict(df_test.loc[:, df_test.columns != 'blueWins'])))

In [233]:
print(classification_report(df_test.blueWins, log_primal.predict(df_test.loc[:, df_train.columns != 'blueWins'])))

In [234]:
print(classification_report(df_test.blueWins,  RF.predict(df_test.loc[:, df_test.columns != 'blueWins'])))

# <font color="#00bfff"> 8. Conclusion
<a id="8"></a> 
***

From the test result, our aim is to predict the winner of a game according to the the first 10min in-game data. 
From the review of the models above, the XGB model provide a decent balance of the recall and precision for training set. Although the test data is  lower with regard to predicting, the accuracy could be imprved by providing retraining the model with more data over time. Hence,as the game goes on, the prediction accuracy will rise gradually.\
**BUT I am wondering why bluwine have 3 classes in the final result of train and test?**