# Data Feature Engineering

After some basic data exploration, we have:
* noticed that we need to engineer our existing features to gain more insight into the GameDuration variable we are trying to predict
* observed possible skewness in the continuous variables and a need for feature scaling
* spotted some outliers in the continuous variables

We address these issues in this first step of feature engineering (I. Basic Feature Engineering).

In [2]:
# Necessary libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## I. Basic Feature Engineering

In [5]:
# Import our dataset and drop the leftover index column written by a previous to_csv
df = pd.read_csv('./data/league_data_cleaned_10min.csv')
df = df.drop(columns='Unnamed: 0')
# Preview 10 random rows
df.sample(10)

In [8]:
df.columns

Index(['gameId', 'blueWins', 'blueWardsPlaced', 'blueWardsDestroyed',
       'blueFirstBlood', 'blueKills', 'blueDeaths', 'blueAssists',
       'blueEliteMonsters', 'blueDragons', 'blueHeralds',
       'blueTowersDestroyed', 'blueTotalGold', 'blueAvgLevel',
       'blueTotalExperience', 'blueTotalMinionsKilled',
       'blueTotalJungleMinionsKilled', 'blueGoldDiff', 'blueExperienceDiff',
       'blueCSPerMin', 'blueGoldPerMin', 'redWardsPlaced', 'redWardsDestroyed',
       'redFirstBlood', 'redKills', 'redDeaths', 'redAssists',
       'redEliteMonsters', 'redDragons', 'redHeralds', 'redTowersDestroyed',
       'redTotalGold', 'redAvgLevel', 'redTotalExperience',
       'redTotalMinionsKilled', 'redTotalJungleMinionsKilled', 'redGoldDiff',
       'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'GameDuration'],
      dtype='object')

Let's start by removing features that (based on my knowledge of the game) won't help in predicting or analyzing game duration:
* gameId
* blueWins

In [10]:
df_fe = df.drop(columns=['gameId', 'blueWins'])


### 1. Transforming existing features into new, more insightful ones

We are interested in the difference in performance between the two teams (the greater the difference, the shorter we can expect the game to be).
We therefore want to replace each blue__ / red__ pair of variables with a single teamDiff__ variable (e.g. replace redKills and blueKills
with teamDiffKills = abs(redKills - blueKills)).

In [11]:
# blueWardsPlaced / redWardsPlaced -> teamDiffWardsPlaced
# blueWardsDestroyed / redWardsDestroyed -> teamDiffWardsDestroyed
# blueKills / redKills -> teamDiffKills
# blueEliteMonsters / redEliteMonsters -> teamDiffEliteMonsters
# etc.

In [18]:
df_fe = df_fe.reindex(sorted(df_fe.columns), axis=1)

In [22]:
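# Build lists of the blue and red variables so we can create the teamDiff__ variables by iteration.
# The columns were sorted in the previous cell, so both lists follow the same order.
blue_var = [x for x in df_fe.columns if x.startswith('blue')]
red_var = [x for x in df_fe.columns if x.startswith('red')]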
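The pairing in the next cell relies on the blue and red lists lining up position by position, which only holds if every blue column has a red counterpart and the columns are sorted. A minimal sketch of a sanity check for that assumption is shown here on a hypothetical handful of column names (`columns`, `blue_cols`, `red_cols` and `pairs_aligned` are illustrative names, not part of the notebook):

```python
# Toy column names: a hypothetical subset of the dataset's columns, sorted
# the same way the notebook sorts df_fe.columns.
columns = sorted(['blueKills', 'redKills', 'blueDragons', 'redDragons',
                  'GameDuration', 'blueTotalGold', 'redTotalGold'])

blue_cols = [c for c in columns if c.startswith('blue')]
red_cols = [c for c in columns if c.startswith('red')]

# Strip the team prefix so the two lists can be compared suffix by suffix.
blue_suffixes = [c[len('blue'):] for c in blue_cols]
red_suffixes = [c[len('red'):] for c in red_cols]

# True only if zip(blue_cols, red_cols) would pair matching features.
pairs_aligned = blue_suffixes == red_suffixes
print(pairs_aligned)
```

If the check fails on the real data, zipping the two lists would silently subtract unrelated features, so it is worth running before creating the teamDiff__ columns.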

In [26]:
for b, r in zip(blue_var, red_var):
    name_vr = 'teamDiff' + b[4:]  # build the new name from whatever follows 'blue' in the original feature name
    df_fe[name_vr] = abs(df_fe[b] - df_fe[r])
df_fe.head()

| | GameDuration | blueAssists | blueAvgLevel | blueCSPerMin | blueDeaths | blueDragons | blueEliteMonsters | blueExperienceDiff | blueFirstBlood | blueGoldDiff | ... | teamDiffGoldPerMin | teamDiffHeralds | teamDiffKills | teamDiffTotalExperience | teamDiffTotalGold | teamDiffTotalJungleMinionsKilled | teamDiffTotalMinionsKilled | teamDiffTowersDestroyed | teamDiffWardsDestroyed | teamDiffWardsPlaced |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.0 | 11 | 6.6 | 19.5 | 6 | 0 | 0 | -8 | 1 | 643 | ... | 64.3 | 0 | 3 | 8 | 643 | 19 | 2 | 0 | 4 | 13 |
| 1 | 19.0 | 5 | 6.6 | 17.4 | 5 | 0 | 0 | -1173 | 0 | -2908 | ... | 290.8 | 1 | 0 | 1173 | 2908 | 9 | 66 | 1 | 0 | 0 |
| 2 | 32.0 | 4 | 6.4 | 18.6 | 11 | 1 | 1 | -1033 | 0 | -1172 | ... | 117.2 | 0 | 4 | 1033 | 1172 | 18 | 17 | 0 | 3 | 0 |
| 3 | 23.0 | 5 | 7.0 | 20.1 | 5 | 0 | 1 | -7 | 0 | -1321 | ... | 132.1 | 1 | 1 | 7 | 1321 | 8 | 34 | 0 | 1 | 28 |
| 4 | 27.0 | 6 | 7.0 | 21.0 | 6 | 0 | 0 | 230 | 0 | -1004 | ... | 100.4 | 0 | 0 | 230 | 1004 | 10 | 15 | 0 | 2 | 58 |


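This procedure also created a few columns that carry no useful information, which we need to drop:
- teamDiffExperienceDiff: the ExperienceDiff columns are already differences between the teams, so this adds nothing beyond teamDiffTotalExperience
- teamDiffFirstBlood: FirstBlood is a binary flag, not a quantity to compare between teams
- teamDiffGoldDiff: same issue as teamDiffExperienceDiff, adding nothing beyond teamDiffTotalGold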

In [36]:
df_fe = df_fe.drop(columns=['teamDiffExperienceDiff', 'teamDiffFirstBlood', 'teamDiffGoldDiff'])
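One way to see why a column such as teamDiffGoldDiff adds nothing is to rebuild it on toy data. The sketch below assumes that each team's GoldDiff is its own total gold minus the opponent's, so that redGoldDiff is simply the negative of blueGoldDiff; the values are made up for illustration and `toy` and `is_redundant` are hypothetical names.

```python
import pandas as pd

# Made-up gold totals for three games (not taken from the dataset).
toy = pd.DataFrame({
    'blueTotalGold': [16000, 14500, 15200],
    'redTotalGold':  [15400, 17400, 15200],
})

# Assumption: GoldDiff = own total minus the opponent's total.
toy['blueGoldDiff'] = toy['blueTotalGold'] - toy['redTotalGold']
toy['redGoldDiff'] = -toy['blueGoldDiff']

# The two columns the blue/red loop would produce.
toy['teamDiffTotalGold'] = (toy['blueTotalGold'] - toy['redTotalGold']).abs()
toy['teamDiffGoldDiff'] = (toy['blueGoldDiff'] - toy['redGoldDiff']).abs()

# Under the assumption above, one column is exactly twice the other.
is_redundant = bool((toy['teamDiffGoldDiff'] == 2 * toy['teamDiffTotalGold']).all())
print(is_redundant)
```

Under that assumption the two columns differ only by a constant factor, so keeping both would give a model the same signal twice.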

In [38]:
# Now we can drop the original red/blue features, which provide no more useful information than the teamDiff variables.
# We only keep blueFirstBlood / redFirstBlood in order to engineer an 'EarlyGameLead' feature later.
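The cell above describes this step only in comments. A minimal sketch of one possible way to do it is shown here on a hypothetical toy frame; `toy`, `keep`, `to_drop` and `toy_reduced` are illustrative names, and the notebook's actual code may differ.

```python
import pandas as pd

# Toy frame mimicking the layout of df_fe after the teamDiff step (made-up values).
toy = pd.DataFrame({
    'GameDuration': [32.0, 19.0],
    'blueKills': [9, 5], 'redKills': [6, 5],
    'blueFirstBlood': [1, 0], 'redFirstBlood': [0, 1],
    'teamDiffKills': [3, 0],
})

# Columns to keep for a later 'EarlyGameLead' feature.
keep = {'blueFirstBlood', 'redFirstBlood'}

# Every other column starting with 'blue' or 'red' is dropped.
to_drop = [c for c in toy.columns
           if c.startswith(('blue', 'red')) and c not in keep]
toy_reduced = toy.drop(columns=to_drop)
print(list(toy_reduced.columns))
```

Matching on the prefix with `str.startswith` rather than a substring test avoids accidentally dropping a column that merely contains "red" somewhere in its name.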