# League of Legends Competitive Match Data
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict if a team will win or lose a game.
    * Predict which role (top-lane, jungle, support, etc.) a player played given their post-game data.
    * Predict how long a game will take before it happens.
    * Predict which team will get the first Baron.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
This project follows up on our original questions: "Is there any correlation between the champion picked and the result of the game? Does visionscore have any connection with the result of the game?" Here we predict a player's visionscore from features such as result, position, kills, deaths, and assists. We use a decision tree regressor (`DecisionTreeRegressor`) and evaluate its predictions on training and testing splits.

### Baseline Model
Using the selected columns without any feature engineering, we fit a `DecisionTreeRegressor` to predict visionscore. We then compared the actual visionscore with the predicted visionscore and computed the root mean squared error (RMSE) to measure the size of the prediction error.

### Final Model
For the final model, we standardized each player's damage-to-champions value within their position using a custom `StdScalerByGroup` transformer, then fit the same `DecisionTreeRegressor` as in the baseline model. We again compared the actual visionscore with the predicted visionscore using RMSE. The final model's RMSE was nearly identical to the baseline's (8.403 versus 8.396 on the full dataset), so the added standardization did not meaningfully reduce the error. The model produces a predicted visionscore for every player.

### Fairness Evaluation
We decided to test our model's fairness with respect to the result of the game. We used a permutation test to answer the question: does our model's accuracy differ according to the result of the game? Our null hypothesis is that the model's accuracy won't differ significantly between winning and losing games. Our alternative hypothesis is that the model's accuracy will differ significantly between winning and losing games.

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

Columns that we assume are helpful for predicting visionscore:

- `'position'`: one-hot encoded (categorical)
- `'champion'`: one-hot encoded (categorical)
- `'gamelength'`: passed through unchanged
- `'result'`: passed through unchanged
- `'kills'`: passed through unchanged
- `'deaths'`: passed through unchanged
- `'assists'`: passed through unchanged
- `'damagetochampions'`: passed through unchanged
- `'visionscore'`: the target we predict

In [2]:
columns = [
    "position",
    "champion",
    "gamelength",
    "result",
    "kills", 
    "deaths", 
    "assists", 
    "damagetochampions",
    "visionscore"]

In [20]:
league = pd.read_csv('2022_LoL_esports_match_data_from_OraclesElixir_20221108.csv')
league = league[columns]
league = league.dropna()
league.head()



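|   | position | champion | gamelength | result | kills | deaths | assists | damagetochampions | visionscore |
|---|---|---|---|---|---|---|---|---|---|
| 0 | top | Renekton | 1713 | 0 | 2 | 3 | 2 | 15768.0 | 26.0 |
| 1 | jng | Xin Zhao | 1713 | 0 | 2 | 5 | 6 | 11765.0 | 48.0 |
| 2 | mid | LeBlanc | 1713 | 0 | 2 | 2 | 3 | 14258.0 | 29.0 |
| 3 | bot | Samira | 1713 | 0 | 2 | 4 | 2 | 11106.0 | 25.0 |
| 4 | sup | Leona | 1713 | 0 | 1 | 5 | 6 | 3663.0 | 69.0 |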


### Baseline Model

In [4]:
# Standard Scaler by Group Processor Made by Jiyeon Song

from sklearn.base import BaseEstimator, TransformerMixin

class StdScalerByGroup(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        df = pd.DataFrame(X)

        self.grps_ = {}
        cols = list(df.columns)
        groups_arr = df[cols[0]].unique()
        for group in groups_arr:
            only_group_dict = {}
            only_group_df = df[df[cols[0]] == group]
            for col in cols[1:]:
                only_group_dict[col] = [np.mean(only_group_df[col]), np.std(only_group_df[col], ddof = 1)]
            self.grps_[group] = only_group_dict

        return self

    def transform(self, X, y=None):
        # Group statistics only exist after fit() has been called.
        if not hasattr(self, "grps_"):
            raise RuntimeError("You must fit the transformer before transforming the data!")

        def look_group_only(df, column, group):
            # Standardize one column for the rows of a single group,
            # using the mean and standard deviation stored by fit().
            get_group_info = self.grps_[group]
            df_group = df[df[df.columns[0]] == group]
            return (df_group[column] - get_group_info[column][0]) / get_group_info[column][1]

        df = pd.DataFrame(X)

        groups_arr = df[df.columns[0]].unique()
        col_dict = {}
        for col in df.columns[1:]:
            col_arr = np.array([])
            # Caution: values are appended group by group, so the output rows
            # follow group order rather than the original row order of X.
            for group in groups_arr:
                col_arr = np.append(col_arr, look_group_only(df, col, group))
            col_dict[col] = col_arr

        standardized_df = pd.DataFrame(col_dict)
        return standardized_df

In [5]:
# The baseline uses no group-wise scaling: one-hot encode the categorical
# columns and pass the remaining columns through unchanged.

preproc = ColumnTransformer(transformers=[
    ('one-hot', OneHotEncoder(), ['position', 'champion'])
], remainder='passthrough')

pl = Pipeline([
    ('preproc', preproc),
    ('regression', DecisionTreeRegressor(max_depth=20))
])

In [6]:
pl.fit(league.drop(columns=['visionscore']), league['visionscore'])

Pipeline(steps=[('preproc',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('one-hot', OneHotEncoder(),
                                                  ['position', 'champion'])])),
                ('regression', DecisionTreeRegressor(max_depth=20))])

In [7]:
def rmse(actual, pred):
    """Root mean squared error between actual and predicted values."""
    return np.sqrt(np.mean((actual - pred)**2))

In [8]:
initial_model_preds = pl.predict(league.drop(columns=['visionscore']))

In [9]:
rmse(league['visionscore'], initial_model_preds)

8.396319077741198
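
As a small worked illustration of the RMSE computation, the sketch below applies the same formula to the first five rows of the actual-versus-predicted table shown later in this notebook. The variable names are chosen here for illustration only, and five rows are far too few to say anything about the model as a whole.

```python
import numpy as np

# First five rows of the notebook's actual-versus-predicted table.
actual = np.array([26.0, 48.0, 29.0, 25.0, 69.0])
predicted = np.array([23.468109, 43.666667, 29.333333, 26.184, 70.0])

# Square each error, average the squares, then take the square root.
squared_errors = (actual - predicted) ** 2
rmse_value = np.sqrt(squared_errors.mean())
print(round(rmse_value, 2))
```

Because the errors are squared before averaging, a single large miss (such as the second row) contributes more to the RMSE than several small ones.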

##### Testing for overfitting

In [10]:
X = league.drop('visionscore', axis=1)
y = league['visionscore']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, random_state=0)
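# Fit on the training split and predict on the same rows (in-sample error).
pl.fit(X_train, y_train)
preds1 = pl.predict(X_train)
# Caution: this refits the pipeline on the test split, so preds2 are also
# in-sample predictions rather than held-out predictions.
pl.fit(X_test, y_test)
preds2 = pl.predict(X_test)
rmse(y_train, preds1), rmse(y_test, preds2)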

(8.154360029356587, 6.3050517004139)

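Both values above are in-sample errors: the pipeline is refit on the test split before `preds2` is computed, so the second number (6.31) is the training RMSE of a model fit on the smaller split, not a held-out RMSE. This comparison therefore cannot tell us whether the model overfits. A held-out check requires fitting on `X_train` only and then predicting on `X_test`.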
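
A held-out check of this kind can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the values, the `toy` frame, and the relationship between `gamelength` and `visionscore` are all made up), not a rerun of the notebook. It also assumes `handle_unknown="ignore"` on the encoder so that categories absent from the training split do not raise an error at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "position": rng.choice(["top", "jng", "mid", "bot", "sup"], size=n),
    "kills": rng.integers(0, 10, size=n),
    "gamelength": rng.integers(1200, 2400, size=n),
})
toy["visionscore"] = toy["gamelength"] / 40 + rng.normal(0, 5, size=n)

X = toy.drop(columns=["visionscore"])
y = toy["visionscore"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("preproc", ColumnTransformer(
        [("one-hot", OneHotEncoder(handle_unknown="ignore"), ["position"])],
        remainder="passthrough",
    )),
    ("regression", DecisionTreeRegressor(max_depth=20, random_state=0)),
])


def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))


# Fit once, on the training split only.
pipeline.fit(X_train, y_train)
train_rmse = rmse(y_train, pipeline.predict(X_train))
# The test rows were never seen during fitting, so this is a held-out error.
test_rmse = rmse(y_test, pipeline.predict(X_test))
print(train_rmse, test_rmse)
```

A training RMSE far below the held-out RMSE is the usual sign of overfitting, and a deep tree such as `max_depth=20` is prone to it.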

### Final Model

For our final model, we decided to standardize the `'damagetochampions'` column within each player's position.

Our reasoning is that players in different positions occupy different parts of the map and prefer different types of champions, so the amount of damage they deal to champions is likely to depend on their position.

For example, a player in a defensive position may encounter fewer champions than a player in an attacking position, which makes the attacking player more likely to land attacks. Standardizing within position lets us identify players who deal more damage than is typical for their role, which we hoped would help the model predict visionscore more accurately.
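
The idea of standardizing within position can be sketched with pandas alone. This is a minimal illustration on made-up damage values, not the `StdScalerByGroup` transformer used in this notebook; the `damage_z` column name is hypothetical. It relies on `groupby(...).transform`, which returns values aligned with the original row order.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "position": ["top", "sup", "top", "sup", "top", "sup"],
    "damagetochampions": [15000.0, 3000.0, 17000.0, 5000.0, 19000.0, 4000.0],
})

grouped = toy.groupby("position")["damagetochampions"]
# Subtract the position mean and divide by the position standard deviation
# (pandas uses ddof=1 by default, matching the notebook's np.std(..., ddof=1)).
toy["damage_z"] = (
    toy["damagetochampions"] - grouped.transform("mean")
) / grouped.transform("std")
print(toy)
```

In this toy data a support dealing 5000 damage gets the same standardized score as a top-laner dealing 19000, because each is one standard deviation above the mean for their own position. In a real pipeline the group means and standard deviations would be learned from the training split only.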

In [11]:
# ColumnTransformer clones and refits this transformer on the columns it is
# given, so it does not need to be fit separately here.
std = StdScalerByGroup()

preproc = ColumnTransformer(transformers=[
    ('std-group', std, ['position' , "damagetochampions"]),
    ('one-hot', OneHotEncoder(), ['position', 'champion'])
], remainder='passthrough')

pl = Pipeline([
    ('preproc', preproc),
    ('regression', DecisionTreeRegressor(max_depth=20))
])

In [12]:
pl.fit(league.drop(columns=['visionscore']), league['visionscore'])

Pipeline(steps=[('preproc',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('std-group',
                                                  StdScalerByGroup(),
                                                  ['position',
                                                   'damagetochampions']),
                                                 ('one-hot', OneHotEncoder(),
                                                  ['position', 'champion'])])),
                ('regression', DecisionTreeRegressor(max_depth=20))])

In [13]:
final_model_preds = pl.predict(league.drop(columns=['visionscore']))

In [14]:
rmse(league['visionscore'], final_model_preds)

8.40346608526479

DataFrame of actual and predicted visionscore (the predictions shown are `initial_model_preds`, from the baseline model)

In [15]:
df = pd.DataFrame()
df['Actual'] = league['visionscore']
df['Predicted'] = initial_model_preds
df

|   | Actual | Predicted |
|---|---|---|
| 0 | 26.0 | 23.468109 |
| 1 | 48.0 | 43.666667 |
| 2 | 29.0 | 29.333333 |
| 3 | 25.0 | 26.184000 |
| 4 | 69.0 | 70.000000 |
| ... | ... | ... |
| 146501 | 46.0 | 48.000000 |
| 146502 | 51.0 | 54.263158 |
| 146503 | 40.0 | 47.554430 |
| 146504 | 63.0 | 66.000000 |


##### Testing for overfitting

In [16]:
X = league.drop('visionscore', axis=1)
y = league['visionscore']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, random_state=0)
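# Fit on the training split and predict on the same rows (in-sample error).
pl.fit(X_train, y_train)
preds1 = pl.predict(X_train)
# Caution: this refits the pipeline on the test split, so preds2 are also
# in-sample predictions rather than held-out predictions.
pl.fit(X_test, y_test)
preds2 = pl.predict(X_test)
rmse(y_train, preds1), rmse(y_test, preds2)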

(8.284328477416942, 6.302381723072112)

##### Model comparison

In [17]:
rmse(league['visionscore'], initial_model_preds), rmse(league['visionscore'], final_model_preds)

(8.396319077741198, 8.40346608526479)

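The two models perform almost identically on the full dataset: the final model's RMSE (8.403) is marginally higher than the baseline's (8.396), so standardizing `'damagetochampions'` by position did not improve accuracy.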

### Fairness Evaluation

We test the model's accuracy on winning games versus losing games to assess the model's fairness.

- Null hypothesis: our model is fair; there is no statistically significant difference in accuracy between winning and losing games.
- Alternative hypothesis: our model is unfair; there is a statistically significant difference in accuracy between winning and losing games.

Test statistic: the absolute difference between the RMSE of the predictions for the two sub-groups.

In [18]:
win_league = league[league['result'] == 1]
lose_league = league[league['result'] == 0]

pl.fit(league.drop('visionscore', axis=1), league['visionscore'])

win_preds = pl.predict(win_league.drop('visionscore', axis=1))
lose_preds = pl.predict(lose_league.drop('visionscore', axis=1))

actual = np.abs(rmse(win_league['visionscore'], win_preds) - rmse(lose_league['visionscore'], lose_preds))
actual

0.6513294669250023

In [19]:
stats = np.array([])
perm_league = league.copy()

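for i in np.arange(1000):
    # Shuffle the win/loss labels. Note that 'result' is also a model feature,
    # so this changes the predictions as well as the group membership.
    perm_league['result'] = np.random.permutation(perm_league['result'])

    win_league = perm_league[perm_league['result'] == 1]
    lose_league = perm_league[perm_league['result'] == 0]

    win_preds = pl.predict(win_league.drop('visionscore', axis=1))
    lose_preds = pl.predict(lose_league.drop('visionscore', axis=1))

    # Same test statistic as the observed value: absolute difference in RMSE.
    stat = np.abs(rmse(win_league['visionscore'], win_preds) - rmse(lose_league['visionscore'], lose_preds))
    stats = np.append(stats, stat)

# p-value: share of permuted statistics larger than the observed one.
(stats > actual).mean()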

0.0

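The p-value is 0.0: none of the 1,000 permuted statistics exceeded the observed one, so we reject the null hypothesis. There is a statistically significant difference in accuracy between the two sub-groups. However, the observed difference in RMSE is small (about 0.65).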
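
One alternative way to set up this permutation test is to compute the model's predictions once and shuffle only the group labels, so that the shuffled `result` values never feed back into the model. The sketch below illustrates that design on synthetic stand-in data. The arrays `result`, `actual`, and `pred` and the helper `rmse_gap` are hypothetical, and the noise levels are invented so that the two groups genuinely differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins (hypothetical): group labels, true values, predictions.
result = rng.integers(0, 2, size=n)
actual = rng.normal(50, 15, size=n)
# Prediction errors are made noisier for result == 1 than for result == 0.
pred = actual + rng.normal(0, np.where(result == 1, 9.0, 8.0))


def rmse(a, p):
    return np.sqrt(np.mean((a - p) ** 2))


def rmse_gap(labels):
    """Absolute difference in RMSE between the label == 1 and label == 0 rows."""
    win = labels == 1
    return abs(rmse(actual[win], pred[win]) - rmse(actual[~win], pred[~win]))


observed = rmse_gap(result)
# Shuffle only the group labels; actual values and predictions stay fixed.
simulated = np.array([rmse_gap(rng.permutation(result)) for _ in range(1000)])
p_value = np.mean(simulated >= observed)
print(observed, p_value)
```

Keeping the predictions fixed means the simulated statistics reflect only how the errors would split between two arbitrary groups of the same sizes, which is what the null hypothesis describes.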