## Lab Assignment Five: Wide and Deep Network

### Luis Garduno

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Brief Business Understanding

#### <ins>`About League of Legends`<ins>

Developed by Riot Studios, League of Legends, or "LoL", is an online multiplayer
video-game that is available to Windows/MacOS users. LoL consists 2 teams ('Blue
&amp; 'Red') facing each other, where the main objective is to destroy the opposing
teams 'Nexus', or home base, while facing obstacles like destroying damage dealing
towers &amp; eliminating players throughout the way. Perks &amp; gold are able to be
obtained by players/teams through completing tasks such as eliminating players,
enemy creeps, or dragons. Players then spend the gold to purchase items that help
raise the power of their abilities.

League of Legends offers different game modes, such as ranked. In this game mode,
players are given a rank based off of the number of wins + the number of games
played. "Diamond" is one of the highest ranks a player may obtain and is known
to be extremely competitive. A ranked game on average lasts 30-45 minutes. The
dataset we will be using contains the first 10 minute analytics of each team
for different diamond ranked matches.

#### <ins>`Measure of Success`<ins>

Once the data is analyzed, third parties, or teams/players, would be able to conceptualize the level
of priority different attributes have during early stages of diamond ranked matches. With the first
ten minutes of each game being critical, they could then use this information to adjust their strategy
to one proven to win matches. In order for this data to be useful and trusted by third parties in
specific situations such as playing at professional level, the data would have to render at least a 80%
accuracy. The reason for it being 80% and not any higher is because as mentioned this data only include
the first 10 minutes of a game (average full game: 30-45 minutes). We leave a 20% error gap for any
changes of pace the winning team might have for the remaining time of the game (~67%).

Additionally, players who are accustomed to playing as the 'jungle' role (a player
role that focuses on obtaining objective eliminations within the jungle areas of
the map) can use this analyzed data to better understand the impact elite monsters
have on winning games.

-------------------------------------

Dataset [Kaggle]: <a href="https://www.kaggle.com/bobbyscience/league-of-legends-diamond-ranked-games-10-min" target="_top"><b>First 10 minutes of diamond ranked League of Legends matches</b></a>

Question Of Interest : As of the first 10 minutes, which team will win?

## 1. Preparation

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 Loading Data & Adjustments (10%)
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1.1 Data Description

In [48]:
import numpy as np
import pandas as pd

# Load in the dataset into dataframe
dataset = pd.read_csv('https://raw.githubusercontent.com/luisegarduno/MachineLearning_Projects/master/Datasets/high_diamond_ranked_10min.csv')

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   gameId                        9879 non-null   int64  
 1   blueWins                      9879 non-null   int64  
 2   blueWardsPlaced               9879 non-null   int64  
 3   blueWardsDestroyed            9879 non-null   int64  
 4   blueFirstBlood                9879 non-null   int64  
 5   blueKills                     9879 non-null   int64  
 6   blueDeaths                    9879 non-null   int64  
 7   blueAssists                   9879 non-null   int64  
 8   blueEliteMonsters             9879 non-null   int64  
 9   blueDragons                   9879 non-null   int64  
 10  blueHeralds                   9879 non-null   int64  
 11  blueTowersDestroyed           9879 non-null   int64  
 12  blueTotalGold                 9879 non-null   int64  
 13  blu

---------------------------------

Printing out the information about the dataframe we are able to see that there are a
total of 9,879 instances, and 39 attributes.

Additionally we are able to see that there are 19 of the same attributes for each
the blue & red team (columns 1-19 are the same as 20-38).

Attributes for each team includes :
- Wards placed & destroyed
- Total number of kills, deaths, & assists
- First Bloods (1st elimination of the game)
- Total : towers destroyed, gold, experience
- Average : level, CS per minute, & gold per minute
- Difference in gold & experience between the teams
- Objective eliminations : elite monsters(dragons, heralds), minions, & jungle minions

Attributes such as total gold, experience, objectives eliminations, towers destroyed, etc.
will be of type integer (int64) because they will always be whole numbers. Attributes involving
averages such as cs per minute, gold per minute, & level, should be the only of double-precision floating-point
format (float64).

The data type for "blueWins" and "first bloods" could be changed to be of type boolean, but because we are wanting to
visualize these attributes later on, optimally it is best to keep these as integer data types. As a result,
the data types presented for each attribute are correct and should not be changed.

Below is a brief description of some of the key attributes.

In [49]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# describe dataframe
dataset.describe()

Unnamed: 0,gameId,blueWins,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
count,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,...,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0
mean,4500084000.0,0.499038,22.288288,2.824881,0.504808,6.183925,6.137666,6.645106,0.549954,0.36198,...,0.043021,16489.041401,6.925316,17961.730438,217.349226,51.313088,-14.414111,33.620306,21.734923,1648.90414
std,27573280.0,0.500024,18.019177,2.174998,0.500002,3.011028,2.933818,4.06452,0.625527,0.480597,...,0.2169,1490.888406,0.305311,1198.583912,21.911668,10.027885,2453.349179,1920.370438,2.191167,149.088841
min,4295358000.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,11212.0,4.8,10465.0,107.0,4.0,-11467.0,-8348.0,10.7,1121.2
25%,4483301000.0,0.0,14.0,1.0,0.0,4.0,4.0,4.0,0.0,0.0,...,0.0,15427.5,6.8,17209.5,203.0,44.0,-1596.0,-1212.0,20.3,1542.75
50%,4510920000.0,0.0,16.0,3.0,1.0,6.0,6.0,6.0,0.0,0.0,...,0.0,16378.0,7.0,17974.0,218.0,51.0,-14.0,28.0,21.8,1637.8
75%,4521733000.0,1.0,20.0,4.0,1.0,8.0,8.0,9.0,1.0,1.0,...,0.0,17418.5,7.2,18764.5,233.0,57.0,1585.5,1290.5,23.3,1741.85
max,4527991000.0,1.0,250.0,27.0,1.0,22.0,22.0,29.0,2.0,1.0,...,2.0,22732.0,8.2,22269.0,289.0,92.0,10830.0,9333.0,28.9,2273.2


In [50]:
dataset_describe = pd.DataFrame({'Features' : ['blueWins','WardsPlaced / WardsDestroyed','FirstBlood','Kills / Deaths / Assists',
                                         'TowersDestroyed','TotalGold','AvgLevel','TotalExperience','CSPerMin','GoldPerMin']})

dataset_describe['Description'] = ['whether blue team won or not','number of total wards placed or destroyed by team','team with the first kill of game',
                             'total number of kills, deaths, or assists of team','total number of towers destroyed by team','total gold obtained by team',
                             'average level of all players on team','total experience points accumulated by team','average creep score per minute','average gold obtained per minute']

dataset_describe['Feature type'] = ['Discrete','Continuous','Discrete','Continuous','Continuous','Continuous','Continuous','Continuous','Continuous','Continuous']

dataset_describe['Attribute Type'] = ['nominal','ratio','nominal','ratio','ratio','ratio','ratio','ratio','ratio','ratio']

dataset_describe['Range'] = ['0: red team won; 1: blue team won','placed: 5 - 250;destroyed: 0 - 27','0: did not get first kill; 1: team obtained first kill',
                       'kills: 0-22;deaths: 0-22;assists: 0-29','0 - 2','11,000 - 25,000','4.5 - 8.5','10,000 - 24,000','10.0 - 30.0','1,100.0 - 2,000.0']
dataset_describe

Unnamed: 0,Features,Description,Feature type,Attribute Type,Range
0,blueWins,whether blue team won or not,Discrete,nominal,0: red team won; 1: blue team won
1,WardsPlaced / WardsDestroyed,number of total wards placed or destroyed by team,Continuous,ratio,placed: 5 - 250;destroyed: 0 - 27
2,FirstBlood,team with the first kill of game,Discrete,nominal,0: did not get first kill; 1: team obtained fi...
3,Kills / Deaths / Assists,"total number of kills, deaths, or assists of team",Continuous,ratio,kills: 0-22;deaths: 0-22;assists: 0-29
4,TowersDestroyed,total number of towers destroyed by team,Continuous,ratio,0 - 2
5,TotalGold,total gold obtained by team,Continuous,ratio,"11,000 - 25,000"
6,AvgLevel,average level of all players on team,Continuous,ratio,4.5 - 8.5
7,TotalExperience,total experience points accumulated by team,Continuous,ratio,"10,000 - 24,000"
8,CSPerMin,average creep score per minute,Continuous,ratio,10.0 - 30.0
9,GoldPerMin,average gold obtained per minute,Continuous,ratio,"1,100.0 - 2,000.0"


#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1.2 Normalizing the Dataset

In [51]:
# We need to use proper variable representations (int, float, one-hot, etc.).
# Before we begin making adjustments to the dataframe, we should normalize the given values

from sklearn import preprocessing



col_names = dataset.columns
dataset_normalized = preprocessing.normalize(dataset)
df = pd.DataFrame(dataset_normalized, columns=col_names)
df.head()

TypeError: cannot do slice indexing on RangeIndex with these indexers [0       0
1       0
2       0
3       0
4       0
       ..
9874    1
9875    1
9876    0
9877    0
9878    1
Name: blueWins, Length: 9879, dtype: int64] of type Series


---------------

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1.3 Data Quality

Using the `missingno` package, we are able to additionally confirm that all the data is complete
and there is no missing entries with the dataset. If there was missing data, we could impute the
missing values by using the k-nearest neighbor. But if an instance was missing a majority of its
attributes, it would be removed from the dataset.

The number of unique values in the column "gameId" is printed to verify that all instances
are weighted equally.

In [None]:
import missingno as mn

mn.matrix(df)

# Count unique values in column 'gameId' of the dataframe
print('Number of unique values in column "gameId" : ', df['gameId'].nunique())

dup_df = df.replace(to_replace=-1,value=np.nan)

dup_df = dup_df.duplicated()
print('Duplicates : ', len(df[dup_df]))


------------------------------

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1.4 Cleaning the Dataset

After confirming there are no duplicates in the data, the "gameId" column can be removed since it
will have no impact on the results.

Using the correlation feature from the `pandas` package, for each team we find the names of
attributes that correlate most with winning (correlation >= 7%). The names of these attributes
are stored in a array for later use.

Lastly, two dataframes are created to hold the attributes at instances when blue team wins, and
when blue team loses.

In [None]:
del df['gameId']

red_col = df.corr()[df.corr()['blueWins'] <= -0.07].index.values
blue_col = df.corr()[df.corr()['blueWins'] >= 0.07].index.values

# Create dataframes for the 2 possible outcomes :
df_win  = df[df["blueWins"]==1.0]     # Blue Team Win  /  Red Team Lost
df_lose = df[df["blueWins"]==0.0]     # Red Team Win   /  Blue Team Lost


### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 Finding & Creating Cross-Product Features (20%)

In [None]:
# - Identify groups of features in my data that should be combined into cross-product features.

- Provide justification for why these features should be crossed (or why some features shouldn't be crossed).


### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 Measuring Algorithm Performance (30%)

In [None]:
ax = sns.countplot(x="blueWins", data=df, palette=['red', 'blue'])
ax.set_title('Win Rate by Team')
ax.set_xlabel('Teams')
ax.set_xticks([0,1])
ax.set_xticklabels(['Red', 'Blue'])
ax.set_ylabel('Frequency')

- Choose & explain what metric(s) I will use to evaluate your algorithm’s performance.
- I should give a detailed argument for why this (these) metric(s) are appropriate on my data.
- That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task).
- Note: rarely is accuracy the best evaluation metric to use.Think deeply about an appropriate measure of performance.


-------------------

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 Splitting the Dataset (40%)
Using Scikit-learn's
<a href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_top"><b>cross-validation modules</b></a>
we are able to split our dataset for training and testing purposes.

In [None]:
from sklearn.model_selection import train_test_split

# Create X data & y target dataframe's
if 'blueWins' in df:
    y = df['blueWins'].values
    del df['blueWins']
    X = df.to_numpy()


# Divide the data: 80% Training & 20% Testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, test_size=0.1, random_state=0)

print("Training Set", "\n   - Data Shape:",X_train.shape,"\n   - Target Shape:",y_train.shape)
print("\nTesting Set","\n   - Data Shape:",X_test.shape ,"\n   - Target Shape:",y_test.shape)

-------------------

We perform a split within our dataset: 90% will be used for training, and 10% for testing. The 80/20 split is appropriate for
the dataset because recall that the end goal is for users to be able to determine the probabilities of them winning their
on-going game, or in other words we will only be predicting the win probability of __ONE__ game.

Additionally if a 95/5 split was applied it would also be appropriate to use as well. With League of Legends being a
strategy based game, our prediction algorithm essentially uses the training data to find which combination of
objectives/attributes have the biggest impact/correlation withing winning games. These game winning objectives/attributes could
be found quite early on during training, but we need to account that these objectives/attributes can be wrong in certain
instances due to the fact of the dataset only containing attributes for the first 10 minutes. So as the size of the training
set increases, the amount of fine-tunning performed increases, thus rendering a higher accuracy when predicting through the
testing dataset.

In [None]:

# - Choose the method I will use for dividing my data into training & testing



- Explain what i'm using & why (i.e Stratified 10-fold cross validation? Shuffle splits?)
- Explain why my chosen method is appropriate or use more than one method as appropriate.
- Argue why my cross validation method is a realistic mirroring of how an algorithm would be used in practice. 



--------------------------

## 2. Modeling


### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 Creating Wide & Deep Networks (60%)

In [None]:

# - Create at least 3 combined wide & deep networks to classify your data using Keras.


In [None]:

# - Visualize the performance of the network on the training data & validation data in:
# - the same plot
# - vs.
# - training iterations.
# - Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data.



### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 Investigating Generalization Performance (80%)

In [None]:

# - Investigate generalization performance by altering the number of layers in the deep branch of the network.
# - Try at least 2 different number of layers.


In [None]:

# - Use the method of cross validation & evaluation metric I chose at the start of the lab to help select
#   the # of layers that performs superiorly. 



### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 Comparing Performaces (90%)

- Use proper statistical method to compare the performance of different models.  

In [None]:

# - Compare the performance of my best wide & deep network to a standard multi-layer perceptron (MLP).


In [None]:

# - For classification tasks, use the receiver operating characteristic & area under the curve.


In [None]:

#- For regression tasks, use Bland-Altman plots & residual variance calculations.



--------------------------------------------------

## 3. t-SNE Dimensionality Reduction

Reference
