# Your Title Here

**Name**: Kliment Ho

**Website Link**: (your website link)

In [217]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, f1_score
from scipy.stats import chi2_contingency, pointbiserialr


import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

In [111]:
# Question: Which objectives were the most impactful over the course of the 2023 season?
'''
In a match of League of Legends, teams constantly decide whether to take, or "secure", objectives. For the sake of this project, we define "objectives" as
the jungle monsters ("creeps") and structures that offer a team-wide stat increase or gold. Notable objectives include Baron, Rift Herald, Drake, and Tower.
Hypothesis: I believe the team that takes the first mid tower is the more likely team to win.

Dataset Overview
The dataset used for this project is derived from professional League of Legends (LoL) esports matches from 2023, compiled by Oracle's Elixir. The dataset includes a rich collection of gameplay statistics and outcomes, such as objective control (e.g., first dragon, first baron), player performance metrics (kills, deaths, assists), and team results (win/loss). These features are essential for understanding the dynamics of each match and for developing models that can predict match outcomes based on the in-game actions of teams.

Research Question
The central research question guiding this analysis is: How do different in-game objectives influence the likelihood of winning a League of Legends match? The analysis will focus on understanding the relationship between securing specific objectives (e.g., dragons, barons, towers) and the match outcome. This question is crucial because it directly relates to strategic decision-making in professional play, where teams aim to optimize their chances of winning by prioritizing certain objectives.

Significance of the Study
Understanding the impact of objectives on match outcomes can help teams refine their strategies and improve their performance. By identifying which objectives are most strongly associated with winning, teams can prioritize those objectives during gameplay. Additionally, this analysis can inform coaching decisions and contribute to the broader esports community's understanding of the game.

Relevant Columns in the Dataset
gameid: Unique identifier for each match.
side: The team affiliation (Blue/Red) of the players.
result: Indicates the match outcome for the team (1 for win, 0 for loss).
firstdragon, dragons, firstbaron, barons, towers: Variables representing key objectives secured during the match.
teamkills, teamdeaths: Team-level performance metrics.
This study will explore how these variables, particularly the objectives, correlate with the likelihood of a team winning a match, forming the basis for predictive modeling in later stages.
'''


## Step 2: Data Cleaning and Exploratory Data Analysis

In [112]:
# Univariate Insights:

# Objectives: Most teams secure between X and Y of each objective.
# Team Kills/Deaths: Average team kills are A, with a standard deviation of B. Notable outliers exist in team deaths indicating some matches are significantly one-sided.
# Bivariate Insights:

# Objectives vs. Outcome: Securing more objectives generally correlates with higher win rates. For example, teams securing the first dragon have a win rate of Z%.
# Team Kills vs. Outcome: Higher team kills are strongly associated with wins, indicating aggressive playstyles might be beneficial.
# Correlations: Objectives like 'dragons' and 'barons' show the highest correlation with match outcomes, suggesting their strategic importance.
# Aggregate Insights:

# Objective Combinations: Teams securing both first dragon and first baron have a significantly higher win rate compared to securing either alone.
# Side-Based Performance: Teams on the Blue side have a slightly higher win rate, potentially due to map advantages or starting positions.
# Patch Influence: Certain patches show higher win rates, possibly due to balance changes favoring specific strategies or objectives.
# Visual Patterns:

# Heatmaps and Faceted Plots: Reveal that securing higher counts of dragons and barons dramatically increases win probabilities.
# Violin Plots: Show that winning teams not only secure more objectives but also maintain a healthier balance between team kills and deaths.

'''
Initial Data Inspection
The dataset was initially inspected for missing values, inconsistencies, and potential outliers. 
The objectives data (e.g., first dragon, first baron) had missing values where a team did not secure the objective, 
which were initially represented as NaN. These missing values were filled with zeros to accurately represent that the objective was not secured.

Data Filtering
Only relevant columns that pertain directly to the research question were selected. 
These include columns related to game outcomes (result), objectives (firstdragon, barons, etc.), 
and team performance metrics (teamkills, teamdeaths).

Grouping Data
To ensure each row represented a unique team within each match, the data was grouped by gameid and side. 
This allows for analysis at the team level, where each team's performance and objective control could be directly linked to the match outcome.

Handling Missing Data
As mentioned, missing data related to objectives was handled by imputing zeros. 
This decision was made because the absence of an objective capture (e.g., not securing the first dragon) is a meaningful data point in this context, 
indicating that the team did not achieve that objective.

Feature Selection
For the baseline model, the initial feature set includes the following:

Objective-based features: firstdragon, firsttower, firstbaron, etc.
Outcome feature: result
These features are central to understanding the relationship between in-game objectives and match outcomes.
This preprocessing step ensures that the dataset is clean, relevant, 
and ready for further analysis and model development. The next step involves exploratory data analysis to uncover patterns 
and relationships within this cleaned dataset.
'''
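The cleaning steps described above (zero-filling missing objectives, then grouping by `gameid` and `side`) can be sketched on a toy frame. This is a minimal illustration, not the project's actual pipeline: the rows and values are hypothetical, and it assumes a missing objective really means "not secured", which is the same assumption the write-up makes.

```python
import numpy as np
import pandas as pd

# Toy team-level rows (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "gameid": ["G1", "G1", "G2", "G2"],
    "side": ["Blue", "Red", "Blue", "Red"],
    "result": [1, 0, 0, 1],
    "firstdragon": [1.0, np.nan, np.nan, 1.0],
    "barons": [2.0, np.nan, 1.0, 1.0],
})

objective_cols = ["firstdragon", "barons"]

# Assumption: a missing objective value means the objective was not secured
cleaned = toy.copy()
cleaned[objective_cols] = cleaned[objective_cols].fillna(0)

# One row per team per match, indexed by (gameid, side)
team_level = cleaned.groupby(["gameid", "side"])[["result"] + objective_cols].max()
print(team_level)
```

Filling only the objective columns, rather than the whole frame, keeps unrelated columns from being silently zeroed.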


In [113]:
# Loading 2023 Match Data
filepath = Path('data') / '2023_LoL_esports_match_data_from_OraclesElixir.csv'
lol_stats = pd.read_csv(filepath)
print(lol_stats.describe())
lol_stats.info()  # info() prints its report directly and returns None


Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.



                year       playoffs           game          patch  \
count  130764.000000  130764.000000  130764.000000  130644.000000   
mean     2023.034505       0.209874       1.649353      13.085001   
std         0.182523       0.407220       0.940911       0.056031   
min      2023.000000       0.000000       1.000000      13.010000   
25%      2023.000000       0.000000       1.000000      13.040000   
50%      2023.000000       0.000000       1.000000      13.100000   
75%      2023.000000       0.000000       2.000000      13.130000   
max      2024.000000       1.000000       5.000000      13.240000   

       participantid     gamelength         result          kills  \
count  130764.000000  130764.000000  130764.000000  130764.000000   
mean       29.583333    1877.762045       0.500000       4.712016   
std        57.650688     335.342468       0.500002       5.772584   
min         1.000000     201.000000       0.000000       0.000000   
25%         3.750000    1643.0000
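The `Columns (2) have mixed types` warning names a column by position. Counting from 0, position 2 would be `url` in this dataset. A minimal sketch of the two remedies the warning itself suggests, using a tiny in-memory CSV with hypothetical values instead of the real file:

```python
import io
import pandas as pd

# Hypothetical miniature of the match file: `url` is empty for some rows
csv_text = "gameid,url,patch\nG1,,13.01\nG2,https://example.com/g2,13.05\n"

# dtype pins the ambiguous column to strings;
# low_memory=False stops pandas from inferring types chunk by chunk
df = pd.read_csv(io.StringIO(csv_text), dtype={"url": "string"}, low_memory=False)
print(df.dtypes)
```

Either option alone is enough to avoid mixed-type inference. Pinning the `dtype` is the more explicit of the two.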

In [185]:
# Display all columns
for i in lol_stats.columns:
    print(i)

# Filter relevant columns
filtered_df = lol_stats[lol_stats['champion'].isna()].loc[:, [
    "gameid", 
    'league',
    "game", 
    "patch", 
    "side", 
    "position", 
    'gamelength',
    "result", 
    "teamkills", 
    "teamdeaths",
    "team kpm",
    "totalgold",
    ] +
    lol_stats.loc[:1, "firstdragon":"opp_inhibitors"].columns.to_list()
    ]
# Parentheses are required: `&` binds more tightly than `>`
filtered_df[filtered_df['void_grubs'].notna() & (filtered_df['void_grubs'] > 0)]['void_grubs'].head()

# prev_match = lol_stats.loc[lol_stats['gameid'] ==  'ESPORTSTMNT06_2753012'].to_html(classes='table table-striped', border=0, index=True)
# file_path = 'assets/firstmatch_data.html'
# with open(file_path, 'w') as f:
#     f.write(prev_match)

gameid
datacompleteness
url
league
year
split
playoffs
date
game
patch
participantid
side
position
playername
playerid
teamname
teamid
champion
ban1
ban2
ban3
ban4
ban5
pick1
pick2
pick3
pick4
pick5
gamelength
result
kills
deaths
assists
teamkills
teamdeaths
doublekills
triplekills
quadrakills
pentakills
firstblood
firstbloodkill
firstbloodassist
firstbloodvictim
team kpm
ckpm
firstdragon
dragons
opp_dragons
elementaldrakes
opp_elementaldrakes
infernals
mountains
clouds
oceans
chemtechs
hextechs
dragons (type unknown)
elders
opp_elders
firstherald
heralds
opp_heralds
void_grubs
opp_void_grubs
firstbaron
barons
opp_barons
firsttower
towers
opp_towers
firstmidtower
firsttothreetowers
turretplates
opp_turretplates
inhibitors
opp_inhibitors
damagetochampions
dpm
damageshare
damagetakenperminute
damagemitigatedperminute
wardsplaced
wpm
wardskilled
wcpm
controlwardsbought
visionscore
vspm
totalgold
earnedgold
earned gpm
earnedgoldshare
goldspent
gspd
gpr
total cs
minionkills
monsterkills
mon

Series([], Name: void_grubs, dtype: float64)
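When a boolean mask is combined with a comparison, operator precedence matters: Python evaluates `&` before `>`, so `a & b > 0` is read as `(a & b) > 0`. The sketch below shows the effect with plain integers and then builds a "present and positive" mask on a hypothetical toy Series.

```python
import numpy as np
import pandas as pd

# Plain Python shows the precedence rule
unparenthesized = 1 & 2 > 0    # parsed as (1 & 2) > 0, i.e. 0 > 0
parenthesized = 1 & (2 > 0)    # parsed as 1 & True

# Hypothetical toy column
s = pd.Series([np.nan, 0.0, 2.0, 5.0], name="void_grubs")

# Intended mask: value is present AND greater than zero
mask = s.notna() & (s > 0)
print(s[mask].tolist())
```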

In [115]:
fig = px.histogram(filtered_df, x='towers', nbins=20, title='Distribution of Towers Destroyed',
                   labels={'towers': 'Number of Towers Destroyed'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Towers Destroyed',
    yaxis_title='Frequency',
    template='plotly_dark'
)
fig.write_html('assets/dist_towers.html', include_plotlyjs='cdn')
fig

In [116]:
fig = px.histogram(filtered_df[filtered_df['result'] == 1], x='towers', nbins=20, title='Distribution of Towers Destroyed by Winning Teams',
                   labels={'towers': 'Number of Towers Destroyed by Winners'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Towers Destroyed',
    yaxis_title='Frequency',
    template='plotly_dark'
)
fig.write_html('assets/win_dist_towers.html', include_plotlyjs='cdn')
fig

In [117]:
major_objectives = ['barons', 'dragons', 'heralds', 'void_grubs']
obj_count_df = filtered_df.copy()
# Sum the counts of each objective across rows
obj_count_df['major_objectives_count'] = obj_count_df[major_objectives].sum(axis=1)
# Display the new DataFrame
print(obj_count_df.head())

fig = px.histogram(obj_count_df, x='major_objectives_count', nbins=20, 
                   title='Distribution of Major Objectives Count',
                   labels={'major_objectives_count': 'Major Objectives Count'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Major Objectives Taken',
    yaxis_title='Frequency',
    template='plotly_dark'
)

# Save the plot as an HTML file
fig.write_html('assets/major_objectives_distribution.html', include_plotlyjs='cdn')
fig

                   gameid league  game  patch  side position  result  \
10  ESPORTSTMNT06_2753012   LFL2     1  13.01  Blue     team       1   
11  ESPORTSTMNT06_2753012   LFL2     1  13.01   Red     team       0   
22  ESPORTSTMNT06_2754023   LFL2     1  13.01  Blue     team       0   
23  ESPORTSTMNT06_2754023   LFL2     1  13.01   Red     team       1   
34  ESPORTSTMNT06_2755035   LFL2     1  13.01  Blue     team       1   

    teamkills  teamdeaths  team kpm  ...  firsttower  towers  opp_towers  \
10         13           7    0.2986  ...         1.0    11.0         2.0   
11          7          13    0.1608  ...         0.0     2.0        11.0   
22         20          16    0.4926  ...         0.0     5.0        11.0   
23         16          20    0.3941  ...         1.0    11.0         5.0   
34         20           7    0.6061  ...         0.0     7.0         4.0   

    firstmidtower  firsttothreetowers  turretplates  opp_turretplates  \
10            1.0                 1.0
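The `major_objectives_count` column relies on how `DataFrame.sum` treats missing values. By default NaN is skipped, so a row whose objectives are all NaN still sums to 0. A small sketch with hypothetical values shows the default next to `min_count=1`, which keeps such rows as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last row has no objective data at all
toy = pd.DataFrame({
    "barons": [1.0, np.nan, np.nan],
    "dragons": [3.0, 2.0, np.nan],
    "void_grubs": [np.nan, np.nan, np.nan],
})

# Default (skipna=True): NaN is ignored, an all-NaN row sums to 0.0
default_sum = toy.sum(axis=1)

# min_count=1: at least one non-NaN value is required, otherwise the result is NaN
strict_sum = toy.sum(axis=1, min_count=1)

print(default_sum.tolist())
print(strict_sum.tolist())
```

Which behavior is right depends on whether an all-NaN row means "took nothing" or "was not recorded".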

In [118]:
fig = px.histogram(obj_count_df[obj_count_df['result'] == 1], x='major_objectives_count', nbins=20, 
                   title='Distribution of Winners Major Objectives Count',
                   labels={'major_objectives_count': 'Major Objectives Count'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Major Objectives Taken',
    yaxis_title='Frequency',
    template='plotly_dark'
)

# Save the plot as an HTML file
fig.write_html('assets/win_major_objectives_distribution.html', include_plotlyjs='cdn')
fig

In [119]:
# Group matches so that each row represents one team in one game

grouped_df = filtered_df.fillna(0).groupby(['gameid', 'side']).max().loc[:, 
['result', "teamkills", "teamdeaths"] + filtered_df.loc[:, 'firstdragon':'opp_inhibitors'].columns.to_list()]


In [120]:
#correlation matrix
correlation_matrix = grouped_df.corr()
relevant_correlations = correlation_matrix.loc[
    ['result', 'teamkills', 'teamdeaths'],
    grouped_df.columns[grouped_df.columns.str.startswith('first') | grouped_df.columns.str.startswith('opp')]
]

print(relevant_correlations)

            firstdragon  opp_dragons  opp_elementaldrakes  opp_elders  \
result         0.203462    -0.614756            -0.482659   -0.114585   
teamkills      0.180140    -0.426800            -0.324150   -0.005718   
teamdeaths    -0.150218     0.556212             0.453087    0.130965   

            firstherald  opp_heralds  opp_void_grubs  firstbaron  opp_barons  \
result         0.143287    -0.240959             NaN    0.569740   -0.638985   
teamkills      0.131432    -0.194477             NaN    0.462868   -0.356137   
teamdeaths    -0.101405     0.228897             NaN   -0.419880    0.556437   

            firsttower  opp_towers  firstmidtower  firsttothreetowers  \
result        0.339589   -0.889475       0.392519            0.499866   
teamkills     0.292062   -0.600457       0.350364            0.418675   
teamdeaths   -0.261857    0.716710      -0.320213           -0.388201   

            opp_turretplates  opp_inhibitors  
result             -0.276630       -0.753096  

In [121]:
#Distribution of wins based on objectives
objectives = [
    'firstdragon', 'dragons', 'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes',
    'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 'hextechs',
    'dragons (type unknown)', 'elders', 'opp_elders', 'firstherald', 'heralds', 
    'opp_heralds', 'void_grubs', 'opp_void_grubs', 'firstbaron', 'barons', 
    'opp_barons', 'firsttower', 'towers', 'opp_towers', 'firstmidtower', 
    'firsttothreetowers', 'turretplates', 'opp_turretplates', 'inhibitors', 
    'opp_inhibitors'
]

# for obj in objectives:
#     fig = px.histogram(grouped_df, x=obj, nbins=10, 
#                        title=f'Distribution of {obj.capitalize()}',
#                        labels={obj: obj.capitalize()})
#     fig.show()
# Distribution of outcomes (wins vs losses)
# fig = px.histogram(grouped_df, x='result', nbins=2, 
#                    title='Distribution of Match Outcomes (Win/Loss)',
#                    labels={'result': 'Match Outcome'})
# fig.show()


In [122]:
# Group by the number of towers destroyed and calculate aggregates
by_towers_df = obj_count_df.groupby('towers').agg(
    num_wins=('result', 'sum'),
    total_games=('result', 'count'),
    win_loss_ratio=('result', lambda x: x.sum() / x.count()),
    avg_teamkills=('teamkills', 'mean'),
    avg_teamdeaths=('teamdeaths', 'mean'),
    avg_totalgold=('totalgold', 'mean')
).reset_index()

# Calculate additional columns if needed
by_towers_df['win_proportion'] = by_towers_df['num_wins'] / by_towers_df['total_games']
by_towers_df.head(10)


|   | towers | num_wins | total_games | win_loss_ratio | avg_teamkills | avg_teamdeaths | avg_totalgold | win_proportion |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1 | 1143 | 0.000875 | 5.910761 | 19.391951 | 40251.934383 | 0.000875 |
| 1 | 1.0 | 0 | 1934 | 0.0 | 6.843847 | 19.066184 | 44142.608066 | 0.0 |
| 2 | 2.0 | 0 | 2423 | 0.0 | 7.600495 | 18.676847 | 48310.250929 | 0.0 |
| 3 | 3.0 | 0 | 2243 | 0.0 | 9.279091 | 19.043691 | 53022.637093 | 0.0 |
| 4 | 4.0 | 0 | 1283 | 0.0 | 10.738893 | 19.354638 | 57146.473889 | 0.0 |
| 5 | 5.0 | 13 | 811 | 0.01603 | 12.44143 | 19.437731 | 61844.991369 | 0.01603 |
| 6 | 6.0 | 108 | 587 | 0.183986 | 14.364566 | 17.775128 | 64555.148211 | 0.183986 |
| 7 | 7.0 | 933 | 1211 | 0.770438 | 17.161024 | 12.383154 | 61299.076796 | 0.770438 |
| 8 | 8.0 | 1780 | 1947 | 0.914227 | 18.561376 | 10.629173 | 61219.237288 | 0.914227 |
| 9 | 9.0 | 2992 | 3084 | 0.970169 | 18.998054 | 9.300908 | 61689.851816 | 0.970169 |
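Note that `win_loss_ratio` in the aggregation above is computed as wins divided by total games, so it is a win rate and matches `win_proportion` exactly. A literal win/loss ratio would divide wins by losses. Because `result` is 0/1, the same win rate can be obtained with a plain `mean`, as this sketch on hypothetical toy rows shows:

```python
import pandas as pd

# Hypothetical toy rows: towers destroyed and the 0/1 result
toy = pd.DataFrame({
    "towers": [3, 3, 9, 9, 9, 11],
    "result": [0, 0, 1, 1, 0, 1],
})

by_towers = toy.groupby("towers").agg(
    num_wins=("result", "sum"),
    total_games=("result", "count"),
    win_rate=("result", "mean"),  # mean of a 0/1 column is wins / total games
).reset_index()

print(by_towers)
```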


In [123]:
# Group by the number of major objectives taken and calculate aggregates
by_major_objectives_df = obj_count_df.groupby('major_objectives_count').agg(
    num_wins=('result', 'sum'),
    total_games=('result', 'count'),
    win_loss_ratio=('result', lambda x: x.sum() / x.count()),
    avg_teamkills=('teamkills', 'mean'),
    avg_teamdeaths=('teamdeaths', 'mean'),
    avg_totalgold=('totalgold', 'mean')
).reset_index()

# Calculate additional columns if needed
by_major_objectives_df['win_proportion'] = by_major_objectives_df['num_wins'] / by_major_objectives_df['total_games']
by_major_objectives_df


|   | major_objectives_count | num_wins | total_games | win_loss_ratio | avg_teamkills | avg_teamdeaths | avg_totalgold | win_proportion |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | 1672 | 0.001794 | 6.000598 | 19.232057 | 42286.542464 | 0.001794 |
| 1 | 1.0 | 38 | 2546 | 0.014925 | 7.393166 | 19.060487 | 46408.57502 | 0.014925 |
| 2 | 2.0 | 317 | 2902 | 0.109235 | 9.430737 | 18.005169 | 51123.73501 | 0.109235 |
| 3 | 3.0 | 922 | 2827 | 0.326141 | 12.798373 | 15.945525 | 56176.487089 | 0.326141 |
| 4 | 4.0 | 1950 | 3113 | 0.626405 | 16.366849 | 12.850948 | 59249.806617 | 0.626405 |
| 5 | 5.0 | 2820 | 3428 | 0.822637 | 18.249708 | 10.808051 | 61393.306884 | 0.822637 |
| 6 | 6.0 | 2545 | 2848 | 0.89361 | 19.041081 | 9.999298 | 63323.331461 | 0.89361 |
| 7 | 7.0 | 1561 | 1670 | 0.934731 | 19.349701 | 9.84012 | 65617.610778 | 0.934731 |
| 8 | 8.0 | 595 | 631 | 0.942948 | 19.790808 | 10.866878 | 70340.63233 | 0.942948 |
| 9 | 9.0 | 120 | 130 | 0.923077 | 19.676923 | 11.730769 | 76984.476923 | 0.923077 |


In [124]:
#Correlation
# from scipy.stats import pointbiserialr

# correlations = {}
# for obj in objectives:
#     corr, _ = pointbiserialr(grouped_df[obj], grouped_df['result'])
#     correlations[obj] = corr

# correlations_df = pd.DataFrame(list(correlations.items()), columns=['Objective', 'Correlation with Result'])

# # Plot correlations
# fig = px.bar(correlations_df, x='Objective', y='Correlation with Result', 
#              title='Correlation between Objectives and Match Outcome')
# fig.show()
# fig.write_html('assets/correlation_obj_outcome.html', include_plotlyjs='cdn')

# We can see below that the "towers" objective has a strong correlation with winning. This makes sense,
# as a game of League requires a team to destroy at least 5 towers to reach the final objective and
# win a match. The only case where a win/loss occurs before the destruction of 5 towers is
# a forfeit, which is rare in a professional esports match.

# Notably, the number of dragons acquired and the slaying of Baron both have a strong correlation with
# the match outcome. Surprisingly, dragons seem to have more impact on the outcome than Baron
# in this correlation bar graph.
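The point-biserial correlation referenced in this cell is the Pearson correlation applied to one continuous and one 0/1 variable, so `DataFrame.corr()` against `result` gives the same numbers. A short numeric check on hypothetical toy values, using only NumPy, compares the textbook point-biserial formula with `np.corrcoef`:

```python
import numpy as np

# Hypothetical toy data: towers taken and the 0/1 match result
towers = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 11.0, 10.0, 4.0])
result = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Point-biserial formula: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2),
# where s_n is the population standard deviation (ddof=0)
m1 = towers[result == 1].mean()
m0 = towers[result == 0].mean()
n1 = (result == 1).sum()
n0 = (result == 0).sum()
n = len(result)
r_pb = (m1 - m0) / towers.std(ddof=0) * np.sqrt(n1 * n0 / n**2)

# Pearson correlation of the same two arrays
r_pearson = np.corrcoef(towers, result)[0, 1]

print(r_pb, r_pearson)
```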

In [125]:
#Pairwise correlation vs. objectives
# fig = px.imshow(grouped_df[objectives].corr(), 
#                 title='Pairwise Correlations between Objectives')
# fig.show()
# fig.write_html('assets/pairwise_correlation.html', include_plotlyjs='cdn')
# This shows the correlation between objectives. Namely, there is a strong negative correlation between each objective and its "opp"-prefixed counterpart.
# The "opp" prefix marks the opponent's value for the same objective.


In [126]:
# Proportions of Win/Lose for Objective
# objective_proportions = {}

# for obj in objectives:
#     win_loss_counts = grouped_df.groupby([obj, 'result']).size().unstack(fill_value=0)
#     win_loss_proportions = win_loss_counts.div(win_loss_counts.sum(axis=1), axis=0)
    
#     objective_proportions[obj] = win_loss_proportions



In [127]:
#Stacked Bar Graph for Proportional Objective wins
# for obj in objectives:
#     proportions_df = objective_proportions[obj].reset_index()

#     fig = px.bar(proportions_df, 
#                  x=obj, 
#                  y=[0, 1], 
#                  title=f'Proportion of Wins/Losses for {obj.capitalize()}',
#                  labels={obj: f'Number of {obj.capitalize()}', 'value': 'Proportion'},
#                  barmode='stack')
    
#     fig.update_layout(yaxis=dict(tickformat=".0%"))
#     fig.show()
#     fig.write_html(f'assets/{obj.capitalize()}_prop_bar.html', include_plotlyjs='cdn')


In [128]:
# Specify the objectives to analyze
selected_objectives = ['barons', 'dragons', 'firstdragon', 'towers', 'void_grubs', 'elders', 'inhibitors']

objective_proportions = {}

# Calculate win/loss proportions for each selected objective
for obj in selected_objectives:
    win_loss_counts = obj_count_df.groupby([obj, 'result']).size().unstack(fill_value=0)
    win_loss_proportions = win_loss_counts.div(win_loss_counts.sum(axis=1), axis=0)
    
    # Add the objective proportions to the dictionary
    objective_proportions[obj] = win_loss_proportions

for obj in selected_objectives:
    proportions_df = objective_proportions[obj].reset_index()

    # Create a bar plot for each objective
    fig = px.bar(proportions_df, 
                 x=obj, 
                 y=[0, 1], 
                 title=f'Proportion of Wins/Losses for {obj.capitalize()}',
                 labels={obj: f'Number of {obj.capitalize()}', 'value': 'Proportion'},
                 barmode='stack')

    # Update layout for better readability
    fig.update_layout(
        yaxis=dict(tickformat=".0%"),
        template='plotly_dark'
        )
    fig.show()
    
    # Save the plot as an HTML file
    fig.write_html(f'assets/{obj.capitalize()}_prop_bar.html', include_plotlyjs='cdn')
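The count, unstack, and divide steps used above to get win/loss proportions can also be expressed with `pd.crosstab(..., normalize='index')`. The sketch below checks on hypothetical toy rows that the two routes agree. It is offered as an alternative, not as a correction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: barons taken and the 0/1 result
toy = pd.DataFrame({
    "barons": [0, 0, 0, 1, 1, 2],
    "result": [0, 0, 1, 1, 0, 1],
})

# Route 1: count each (barons, result) pair, then divide each row by its total
counts = toy.groupby(["barons", "result"]).size().unstack(fill_value=0)
proportions = counts.div(counts.sum(axis=1), axis=0)

# Route 2: crosstab with row-wise normalization
crosstab_props = pd.crosstab(toy["barons"], toy["result"], normalize="index")

print(proportions)
```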

In [129]:
win_proportion_frames = []

for obj in objectives:
    # Calculate the win proportion for each count of the objective
    win_loss_counts = grouped_df.groupby([obj, 'result']).size().unstack(fill_value=0)
    # Guard against an objective that only ever appears with one outcome
    win_loss_counts = win_loss_counts.reindex(columns=[0, 1], fill_value=0)
    win_proportion = win_loss_counts[1] / win_loss_counts.sum(axis=1)

    # Filter out zero win proportions, but keep the objective count as the x value
    win_proportion = win_proportion[win_proportion > 0]
    win_proportion_frames.append(pd.DataFrame({
        'Objective': obj,
        'Number of Objectives': win_proportion.index,
        'Win Proportion': win_proportion.values,
    }))

# Long-form DataFrame for plotting: one row per (objective, count) pair
win_proportions_long = pd.concat(win_proportion_frames, ignore_index=True)

# Create the line plot without zero points
fig = px.line(win_proportions_long,
              x='Number of Objectives',
              y='Win Proportion',
              color='Objective',
              title='Win Proportion by Number of Objectives Secured (Non-Zero)')

fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.show()




In [132]:
# Export the DataFrames to HTML with custom dark CSS so they are easy to read in the "hacker" theme
custom_css = """
<style>
    body {
        background-color: #2b2b2b;
        color: #f0f0f0;
    }
    table {
        width: 100%;
        border-collapse: collapse;
        margin: 25px 0;
        font-size: 1.2em;
        font-family: Arial, sans-serif;
        text-align: left;
        border-radius: 5px 5px 0 0;
        overflow: hidden;
    }
    thead tr {
        background-color: #333;
        color: #ffffff;
        text-align: left;
        font-weight: bold;
    }
    th, td {
        padding: 12px 15px;
        color: #f0f0f0;
    }
    tbody tr {
        border-bottom: 1px solid #444;
    }
    tbody tr:nth-of-type(even) {
        background-color: #3a3a3a;
    }
    tbody tr:nth-of-type(odd) {
        background-color: #2b2b2b;
    }
    tbody tr:hover {
        background-color: #555;
        color: #ffffff;
    }
    th, td {
        border: 1px solid #444;
    }
</style>
"""

# Convert the by_towers_df DataFrame to HTML with the custom CSS
towers_html = custom_css + by_towers_df.to_html(classes='table table-striped', border=0, index=True)

# Convert the by_major_objectives_df DataFrame to HTML with the custom CSS
objectives_html = custom_css + by_major_objectives_df.to_html(classes='table table-striped', border=0, index=True)

# Save the HTML outputs to files
with open('assets/by_towers_df.html', 'w') as f:
    f.write(towers_html)

with open('assets/by_major_objectives_df.html', 'w') as f:
    f.write(objectives_html)


In [133]:
# from sklearn.preprocessing import StandardScaler

# grouped_df.fillna(0, inplace=True)
# scaler = StandardScaler()
# grouped_df[objectives] = scaler.fit_transform(grouped_df[objectives])
# grouped_df['dragon_x_tower'] = grouped_df['firstdragon'] * grouped_df['firsttower']

In [134]:
# from sklearn.model_selection import train_test_split

# X = grouped_df[objectives]
# y = grouped_df['result']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [135]:
# Interesting Aggregates
# Goal: Summarize and group data to understand collective behaviors or patterns.
# Actions:
# Calculate win/loss proportions for different counts of objectives (e.g., how often teams win when they secure 2 dragons vs. 3 dragons).
# Visualize these aggregates using bar plots to see how securing different numbers of objectives impacts win rates.

## Step 3: Assessment of Missingness

In [136]:
# For all objectives, every NaN that was filled corresponds to a team taking zero of that objective. Therefore, we believe these objective columns are all Not Missing At Random (NMAR):
# the missingness shows no trend beyond the value itself, which depends on team-to-team decisions. Here is the original grouped DataFrame with the NaN values.

In [137]:
# Missingness Dependency: What about matches where an objective column is NaN for the entire match? We build a new grouped DataFrame to look for
# objectives that are entirely NaN. As we can see, NaN values occur in the "firstdragon" column even when the "dragons" column has a valid value. Here, we
# investigate whether the missingness of "firstdragon" depends on the "league" column, since a groupby on "league" aggregated with .max() shows NaN values for some leagues.
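A dependency question like this is commonly checked with a permutation test on the total variation distance (TVD). The sketch below is one way to set it up, on hypothetical toy data rather than the real matches. It assumes the TVD is taken between the league distribution of rows where `firstdragon` is missing and the league distribution of rows where it is present. The helper name `tvd_of_league` is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: league "C" is missing firstdragon far more often
toy = pd.DataFrame({
    "league": ["A"] * 40 + ["B"] * 40 + ["C"] * 20,
    "firstdragon_missing": (
        [False] * 38 + [True] * 2 + [False] * 39 + [True] * 1 + [False] * 2 + [True] * 18
    ),
})

def tvd_of_league(df, missing_col):
    """TVD between the league distributions of missing and non-missing rows."""
    # normalize="columns": each column (False / True) sums to 1 across leagues
    dist = pd.crosstab(df["league"], df[missing_col], normalize="columns")
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed = tvd_of_league(toy, "firstdragon_missing")

# Shuffle the missingness labels to simulate "missingness is unrelated to league"
n_permutations = 500
simulated = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = toy.assign(
        shuffled_missing=rng.permutation(toy["firstdragon_missing"].to_numpy())
    )
    simulated[i] = tvd_of_league(shuffled, "shuffled_missing")

p_value = (simulated >= observed).mean()
print(observed, p_value)
```

A small p-value here would suggest the missingness depends on `league`, which points toward Missing At Random rather than NMAR for that column.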

In [138]:
# Per-league maximum of every objective column; a league whose values are all NaN stays NaN
nangrouped2_df = filtered_df.groupby(['league']).max().loc[:, 
filtered_df.loc[:, 'firstdragon':'opp_inhibitors'].columns.to_list()]
nangrouped2_df.head(10)

| league | firstdragon | dragons | opp_dragons | elementaldrakes | opp_elementaldrakes | infernals | mountains | clouds | oceans | chemtechs | ... | opp_barons | firsttower | towers | opp_towers | firstmidtower | firsttothreetowers | turretplates | opp_turretplates | inhibitors | opp_inhibitors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AL | 1.0 | 6.0 | 6.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 4.0 | ... | 3.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 12.0 | 12.0 | 7.0 | 7.0 |
| CBLOL | 1.0 | 5.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 4.0 | ... | 4.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 11.0 | 11.0 | 6.0 | 6.0 |
| CBLOLA | 1.0 | 6.0 | 6.0 | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 | 4.0 | 3.0 | ... | 4.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 11.0 | 11.0 | 8.0 | 8.0 |
| CDF | 1.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 2.0 | 4.0 | 2.0 | 2.0 | ... | 4.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 15.0 | 15.0 | 8.0 | 8.0 |
| CT | 1.0 | 6.0 | 6.0 | 4.0 | 4.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | ... | 3.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 10.0 | 10.0 | 4.0 | 4.0 |
| DCup | NaN | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.0 | NaN | 11.0 | 11.0 | NaN | NaN | NaN | NaN | 4.0 | 4.0 |
| DDH | 1.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 2.0 | ... | 3.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 13.0 | 13.0 | 5.0 | 5.0 |
| EBL | 1.0 | 6.0 | 6.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | ... | 4.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 13.0 | 13.0 | 6.0 | 6.0 |
| EL | 1.0 | 6.0 | 6.0 | 4.0 | 4.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | ... | 4.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 12.0 | 12.0 | 5.0 | 5.0 |
| EM | 1.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 | ... | 3.0 | 1.0 | 11.0 | 11.0 | 1.0 | 1.0 | 15.0 | 15.0 | 5.0 | 5.0 |


In [139]:
# FirstDragon Dependency on League
lol_stats_with_missing = filtered_df[['league', 'firstdragon']].copy()  # copy so the new column is not set on a slice

# Add a new column to indicate missingness for 'firstdragon'
lol_stats_with_missing['firstdragon_missing'] = lol_stats_with_missing['firstdragon'].isna()

# Preview the new dataframe
lol_stats_with_missing.head(5)

league_distribution = lol_stats_with_missing.groupby(['league', 'firstdragon_missing']).size().unstack(fill_value=0)

# Normalize the distribution
league_distribution_norm = league_distribution.div(league_distribution.sum(axis=1), axis=0)

# Display the normalized distribution
print(league_distribution_norm)


observed_tvd = league_distribution_norm.diff(axis=1).iloc[:, -1].abs().sum()
# Permutation test
n_permutations = 1000
permuted_tvds = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing['firstdragon_missing'].sample(frac=1).values
    permuted_counts = lol_stats_with_missing.assign(shuffled_missing=shuffled).groupby(['league', 'shuffled_missing']).size().unstack(fill_value=0)
    permuted_counts_norm = permuted_counts.div(permuted_counts.sum(axis=1), axis=0)
    permuted_tvd = permuted_counts_norm.diff(axis=1).iloc[:, -1].abs().sum()
    permuted_tvds.append(permuted_tvd)

# Calculate p-value
p_value = np.mean(np.array(permuted_tvds) >= observed_tvd)
print(f"Observed TVD: {observed_tvd}")
print(f"P-value: {p_value:.4f}")

fig = px.histogram(permuted_tvds, nbins=50, title="Permutation Test: TVD between 'firstdragon' Missingness and 'league'",
                   labels={'value': 'TVD'}, marginal="box")
fig.add_vline(x=observed_tvd, line_width=3, line_dash="dash", line_color="lime", annotation_text="Observed TVD")
fig.update_layout(yaxis_title="Frequency", xaxis_title="TVD", template='plotly_dark')
fig.write_html('assets/firstdragon_league_tvd.html', include_plotlyjs='cdn')
fig







firstdragon_missing     False      True
league                                 
AL                   1.000000  0.000000
CBLOL                1.000000  0.000000
CBLOLA               1.000000  0.000000
CDF                  1.000000  0.000000
CT                   1.000000  0.000000
DCup                 0.000000  1.000000
DDH                  1.000000  0.000000
EBL                  1.000000  0.000000
EL                   1.000000  0.000000
EM                   1.000000  0.000000
EPL                  1.000000  0.000000
ESLOL                1.000000  0.000000
GL                   1.000000  0.000000
GLL                  1.000000  0.000000
HC                   1.000000  0.000000
HM                   1.000000  0.000000
IC                   1.000000  0.000000
LAS                  1.000000  0.000000
LCK                  1.000000  0.000000
LCKC                 1.000000  0.000000
LCO                  1.000000  0.000000
LCS                  1.000000  0.000000
LDL                  0.000000  1.000000


In [140]:
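# The table above shows firstdragon missing for every row in DCup and LDL and for no row in the other leagues listed, so its missingness appears to depend on the league column.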

# We also suspected an underlying meaning: a NaN firstdragon could simply mean that no dragon was slain in that game.
# Therefore, we next test whether the missingness of firstdragon depends on the match result.

# FirstDragon Dependency on Number of Wins
lol_stats_with_missing = filtered_df[['result', 'firstdragon']].copy()  # copy so the new column is not set on a slice

# Add a new column to indicate missingness for 'firstdragon'
lol_stats_with_missing['firstdragon_missing'] = lol_stats_with_missing['firstdragon'].isna()

# Count the number of wins (result = 1) for missing and non-missing firstdragon
win_counts = lol_stats_with_missing.groupby('firstdragon_missing')['result'].sum()

# Calculate the observed difference in the number of wins
observed_diff = win_counts.diff().iloc[-1]

# Permutation test
n_permutations = 1000
permuted_diffs = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing['firstdragon_missing'].sample(frac=1).values
    permuted_counts = lol_stats_with_missing.assign(shuffled_missing=shuffled).groupby('shuffled_missing')['result'].sum()
    permuted_diff = permuted_counts.diff().iloc[-1]
    permuted_diffs.append(permuted_diff)

# Calculate p-value
p_value = np.mean(np.array(permuted_diffs) >= observed_diff)
print(f"Observed Difference: {observed_diff}")
print(f"P-value: {p_value:.4f}")

# Plotting the permutation test results
fig = px.histogram(permuted_diffs, nbins=50, title="Permutation Test: Difference in Wins between 'firstdragon' Missingness",
                   labels={'value': 'Difference in Wins'}, marginal="box")
fig.add_vline(x=observed_diff, line_width=3, line_dash="dash", line_color="lime", annotation_text="Observed Difference")
fig.update_layout(yaxis_title="Frequency", xaxis_title="Difference in Wins", template='plotly_dark')
fig.write_html('assets/firstdragon_result_diff.html', include_plotlyjs='cdn')
fig







Observed Difference: -7567.0
P-value: 0.5060


In [141]:
# Missingness of firstdragon does not appear to depend on the result column.

# With a p-value of 0.5060, which is greater than the significance level of 0.05, we fail to reject the null hypothesis.

In [142]:
# Permutation Test 3: Missingness of void_grubs vs. result using TVD
# VoidGrubs Dependency on Result
lol_stats_with_missing = filtered_df[['result', 'void_grubs']].copy()  # copy so the new column is not set on a slice

# Add a new column to indicate missingness for 'void_grubs'
lol_stats_with_missing['void_grubs_missing'] = lol_stats_with_missing['void_grubs'].isna()

# Calculate the distribution of result based on missingness
result_distribution = lol_stats_with_missing.groupby(['result', 'void_grubs_missing']).size().unstack(fill_value=0)

# Normalize the distribution
result_distribution_norm = result_distribution.div(result_distribution.sum(axis=1), axis=0)

# Calculate the observed TVD
observed_tvd = result_distribution_norm.diff(axis=1).iloc[:, -1].abs().sum()

# Permutation test
n_permutations = 1000
permuted_tvds = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing['void_grubs_missing'].sample(frac=1).values
    permuted_counts = lol_stats_with_missing.assign(shuffled_missing=shuffled).groupby(['result', 'shuffled_missing']).size().unstack(fill_value=0)
    permuted_counts_norm = permuted_counts.div(permuted_counts.sum(axis=1), axis=0)
    permuted_tvd = permuted_counts_norm.diff(axis=1).iloc[:, -1].abs().sum()
    permuted_tvds.append(permuted_tvd)

# Calculate p-value
p_value = np.mean(np.array(permuted_tvds) >= observed_tvd)
print(f"Observed TVD: {observed_tvd}")
print(f"P-value: {p_value:.4f}")

# Plotting the permutation test results
fig = px.histogram(permuted_tvds, nbins=50, title="Permutation Test: TVD between 'void_grubs' Missingness and 'result'",
                   labels={'value': 'TVD'}, marginal="box")
fig.add_vline(x=observed_tvd, line_width=3, line_dash="dash", line_color="lime", annotation_text="Observed TVD")
fig.update_layout(yaxis_title="Frequency", xaxis_title="TVD", template='plotly_dark')
fig.write_html('assets/void_grubs_result_tvd.html', include_plotlyjs='cdn')
fig.show()







Observed TVD: 1.9941268238964853
P-value: 0.6460
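
A total variation distance lies between 0 and 1, so the observed value of about 1.99 printed above is not a TVD in the usual sense: the cell normalizes each `result` row and then compares the missing and non-missing proportions within it. The sketch below shows one possible alternative, assuming the intent is to compare the distribution of a column (here `result`) between rows where `void_grubs` is missing and rows where it is present. The helper names `missingness_tvd` and `permutation_pvalue` and the toy data are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missingness_tvd(df, col, by):
    """TVD between the distribution of `by` when `col` is missing vs. present."""
    missing = df[col].isna()
    # Rows: categories of `by`; columns: missing == False / True.
    counts = pd.crosstab(df[by], missing)
    # Normalize each column so it is a distribution over the categories of `by`.
    dists = counts / counts.sum(axis=0)
    # Half the summed absolute differences between the two distributions.
    return dists.diff(axis=1).iloc[:, -1].abs().sum() / 2


def permutation_pvalue(df, col, by, n_permutations=1000, seed=0):
    """Shuffle `by` to simulate the null that missingness is unrelated to it."""
    rng = np.random.default_rng(seed)
    observed = missingness_tvd(df, col, by)
    shuffled = df[[col, by]].copy()
    simulated = []
    for _ in range(n_permutations):
        shuffled[by] = rng.permutation(df[by].to_numpy())
        simulated.append(missingness_tvd(shuffled, col, by))
    return observed, float(np.mean(np.array(simulated) >= observed))


# Toy data (hypothetical): both missing rows belong to winning teams.
toy = pd.DataFrame({
    "result": [1, 1, 1, 0, 0, 0],
    "void_grubs": [np.nan, np.nan, 1.0, 1.0, 2.0, 0.0],
})
observed_tvd, p_value = permutation_pvalue(toy, "void_grubs", "result", n_permutations=200)
print(f"Observed TVD: {observed_tvd:.4f}")
print(f"P-value: {p_value:.4f}")
```

Passing an explicit seed keeps the p-value reproducible between runs. The same helper could be pointed at `league` instead of `result`.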


In [143]:
# Missingness of void_grubs does not appear to depend on the result column.

# With a p-value of 0.6460, which is greater than the significance level of 0.05, we fail to reject the null hypothesis.

In [144]:
# Example: Testing dependency of missingness in 'firstdragon' on 'result'
contingency_table = pd.crosstab(grouped_df['firstdragon'].isnull(), grouped_df['result'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f'Chi-Square Test for Missingness in First Dragon vs Result: p-value = {p}')


Chi-Square Test for Missingness in First Dragon vs Result: p-value = 1.0


## Step 4: Hypothesis Testing

In [145]:
'''
Null Hypothesis (H₀): Securing a specific objective (e.g., first dragon, first tower, or baron) does not significantly affect the probability of winning the game.

Alternative Hypothesis (H₁): Securing a specific objective (e.g., first dragon, first tower, or baron) significantly increases the probability of winning the game.

Test Statistics: We will use the absolute difference in win proportions between teams that secured the objective and those that did not.

Significance Level: 5%
'''


In [159]:
# ALTER BELOW TO SELECT COLUMN
objective_count = 'major_objectives_count'

# Calculate the observed difference in win proportions
win_proportions = obj_count_df.groupby(objective_count)['result'].mean()
observed_diff = win_proportions.diff().iloc[-1]

# Permutation test
n_permutations = 1000
permuted_diffs = []

for _ in range(n_permutations):
    shuffled = obj_count_df[objective_count].sample(frac=1).values  # .values so assign() does not re-align on the index
    permuted_win_proportions = obj_count_df.assign(shuffled=shuffled).groupby('shuffled')['result'].mean()
    permuted_diff = permuted_win_proportions.diff().iloc[-1]
    permuted_diffs.append(permuted_diff)

# Calculate p-value
p_value = np.mean(np.array(permuted_diffs) >= observed_diff)
print(f"Observed Difference: {observed_diff}")
print(f"P-value: {p_value:.4f}")

# Plotting the permutation test results
fig = px.histogram(permuted_diffs, nbins=50, title="Permutation Test: Impact of Major Objectives Count on Winning",
                   labels={'value': 'Difference in Win Proportions'}, marginal="box")
fig.add_vline(x=observed_diff, line_width=3, line_dash="dash", line_color="lime", annotation_text="Observed Difference")
fig.update_layout(yaxis_title="Frequency", xaxis_title="Difference in Win Proportions", template='plotly_dark')
fig.write_html('assets/major_objectives_win_diff.html', include_plotlyjs='cdn')
fig.show()

Observed Difference: 0.045454545454545414
P-value: 0.4660
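
The test statistic named in the hypotheses for this step (the absolute difference in win proportions between teams that secured an objective and those that did not) can be sketched for a single binary objective as follows. This is a minimal illustration on made-up data, not the notebook's actual test: the function names `win_rate_gap` and `permutation_test` and the toy rows are assumptions.

```python
import numpy as np
import pandas as pd


def win_rate_gap(results, secured):
    """Absolute difference in win proportion: secured vs. not secured."""
    results = np.asarray(results, dtype=float)
    secured = np.asarray(secured, dtype=bool)
    return abs(results[secured].mean() - results[~secured].mean())


def permutation_test(results, secured, n_permutations=1000, seed=0):
    """Shuffle the 'secured' labels to simulate the null hypothesis."""
    rng = np.random.default_rng(seed)
    results = np.asarray(results, dtype=float)
    secured = np.asarray(secured, dtype=bool)
    observed = win_rate_gap(results, secured)
    simulated = np.array([
        win_rate_gap(results, rng.permutation(secured))
        for _ in range(n_permutations)
    ])
    # Share of shuffled gaps at least as large as the observed gap.
    return observed, float(np.mean(simulated >= observed))


# Toy data (hypothetical): 3 of 4 teams with first tower won, 1 of 4 without.
toy = pd.DataFrame({
    "firsttower": [1, 1, 1, 1, 0, 0, 0, 0],
    "result":     [1, 1, 1, 0, 0, 0, 0, 1],
})
observed_gap, p_value = permutation_test(toy["result"], toy["firsttower"] == 1)
print(f"Observed gap: {observed_gap:.4f}")
print(f"P-value: {p_value:.4f}")
```

Because the statistic is an absolute difference, the comparison `simulated >= observed` gives a two-sided test.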


In [147]:
# for obj in significant_results['Objective']:
#     contingency_table = pd.crosstab(grouped_df[obj], grouped_df['result'])
#     fig = px.bar(contingency_table, barmode='group', 
#                  title=f'Proportion of Wins/Losses by Securing {obj.capitalize()}',
#                  labels={'index': f'{obj.capitalize()} Secured', 'value': 'Count'})
#     fig.show()


In [148]:
# # Filter out objectives that contain "opp"
# filtered_results = results_df[~results_df['Objective'].str.contains('opp')]

# # Sort the objectives by p-value
# sorted_filtered_results = filtered_results.sort_values(by='p-value').head(5)

# # Display the top 5 most influential objectives without "opp"
# print("Top 5 Most Influential Objectives (excluding 'opp'):")
# print(sorted_filtered_results)


In [165]:
# Sort by Chi-Square Statistic to see which objectives have the strongest association
sorted_by_chi2 = results_df.sort_values(by='Chi-Square Statistic', ascending=False)
print(sorted_by_chi2.head())


NameError: name 'results_df' is not defined
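
The `NameError` above arises because `results_df` is never built in the cells shown. One way such a table could be assembled, assuming it holds one chi-square test per objective with the column names used in this step (`Objective`, `Chi-Square Statistic`, `p-value`), is sketched below. The helper `chi_square_by_objective`, the choice to binarize each objective as "secured at least once", and the toy data are all assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_by_objective(df, objectives, target="result"):
    """Run one chi-square test of independence per objective column."""
    rows = []
    for obj in objectives:
        # Assumption: NaN means the objective was not taken, and any count
        # above zero counts as "secured".
        secured = df[obj].fillna(0) > 0
        table = pd.crosstab(secured, df[target])
        if table.shape != (2, 2):
            continue  # objective never or always secured, so no test is possible
        chi2, p, _, _ = chi2_contingency(table)
        rows.append({"Objective": obj, "Chi-Square Statistic": chi2, "p-value": p})
    return pd.DataFrame(rows, columns=["Objective", "Chi-Square Statistic", "p-value"])


# Toy data (hypothetical)
toy = pd.DataFrame({
    "firsttower":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "firstdragon": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "result":      [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
})
results_df = chi_square_by_objective(toy, ["firsttower", "firstdragon"])
sorted_by_chi2 = results_df.sort_values(by="Chi-Square Statistic", ascending=False)
print(sorted_by_chi2.head())
```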

## Step 5: Framing a Prediction Problem

In [None]:
# Predict the outcome of a League of Legends match (win/loss) based on in-game objective statistics, even if some objective data is missing.

'''
Prediction Problem Definition
The prediction problem involves determining the likelihood that a team will win a League of Legends match based on in-game objectives. 
This is a binary classification problem where the target variable (result) has two possible outcomes: win (1) or loss (0). 
The features used for prediction include various objectives such as firstdragon, firstbaron, firsttower, etc.

Justification for the Prediction Problem
Predicting match outcomes based on in-game objectives is highly relevant in the context of esports strategy. 
Understanding how specific objectives contribute to the overall likelihood of winning can help teams refine their 
strategies during matches. This prediction problem also ties directly into the earlier analysis, 
which identified the most influential objectives in determining match outcomes.

Accuracy: Chosen as the primary metric because the dataset is relatively balanced, and accuracy provides 
a straightforward measure of how often the model correctly predicts the match outcome.

F1-Score: Considered if the dataset shows any imbalance, as it balances precision and recall.

Type: Binary Classification

Response Variable: result (1 for win, 0 for loss)

Justification: This problem aligns with understanding the impact of objectives on match outcomes, building on earlier analysis, and can help in strategizing gameplay.

Evaluation Metric:

Accuracy: For balanced datasets.
F1-score: If the dataset is imbalanced, as it balances precision and recall.
'''
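
The choice between accuracy and F1-score described above depends on class balance. A quick check, sketched here on hypothetical labels rather than the real `result` column, is to inspect the class proportions and compute both metrics side by side.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted outcomes (1 = win, 0 = loss)
y_true = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])

# Proportion of each class; values near 0.5 indicate a balanced target.
class_balance = y_true.value_counts(normalize=True)
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(class_balance)
print(f"Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
```

When the classes are evenly split, as in this toy example, the two metrics tend to tell a similar story.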


In [245]:
selected_columns = [
    'firstdragon',
    'elementaldrakes',
    'elders',
    'heralds',
    'barons',
    'firsttower',
    'towers',
    'inhibitors'
]

# Creating a filtered DataFrame with only the relevant columns
filtered_df_prediction = filtered_df[selected_columns]

html_string = custom_css + filtered_df_prediction.head(10).to_html(classes='table table-striped', border=0, index=False)

# Save the HTML to a file
with open('assets/filtered_df_prediction.html', 'w') as f:
    f.write(html_string)

## Step 6: Baseline Model

The features selected for this prediction include all relevant in-game objectives:

- **ElementalDrakes**: These provide significant permanent buffs that can drastically alter a team's strength and strategy throughout the match, making them a critical factor in determining the game's outcome.
- **Elders**: This late-game objective grants a temporary but powerful execution effect, often turning the tide of the game in favor of the team that secures it.
- **Barons**: The Baron buff is a major late-game objective that provides a substantial temporary enhancement, boosting a team's push potential and often leading to decisive moments in the game.
- **Heralds**: Secured early in the game, the Rift Herald helps a team take down towers, providing both strategic advantage and significant gold, contributing to early game dominance.
- **FirstTower**: This binary indicator (0 or 1) shows whether a team has secured the first tower, a key event that often snowballs into further advantages, providing extra gold and momentum.
- **Towers**: The number of towers destroyed is a crucial metric as teams must destroy at least five towers to reach the Nexus and win the game, making it a direct indicator of progress towards victory.


In [None]:
'''
Feature Selection
For the baseline model, two key features were selected:

firstdragon: Indicates whether the team secured the first dragon.
firsttower: Indicates whether the team secured the first tower.
These features were chosen based on the earlier exploratory analysis and 
hypothesis testing, which indicated that these objectives were significant in determining match outcomes.

Model Choice
A simple Logistic Regression model was selected as the baseline model due to 
its interpretability and effectiveness in binary classification problems.

Pipeline Setup
A scikit-learn pipeline was created to streamline the preprocessing and 
model training process. The pipeline included:

StandardScaler: Applied to numerical features to standardize them.
LogisticRegression: The classifier used for predicting match outcomes.

The baseline model achieved an accuracy of 66.07%, providing a reference point for further model improvements. 
This performance, while moderate, sets the stage for enhancing the model by incorporating more features and 
fine-tuning hyperparameters in the next steps.
'''
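
The baseline described above (a `StandardScaler` followed by `LogisticRegression` on `firstdragon` and `firsttower`) could look roughly like the sketch below. It runs on synthetic data generated inside the block, so the accuracy it prints says nothing about the real dataset; the win probabilities used to simulate `result` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the team-level rows (hypothetical data)
rng = np.random.default_rng(42)
n_rows = 400
toy = pd.DataFrame({
    "firstdragon": rng.integers(0, 2, n_rows),
    "firsttower": rng.integers(0, 2, n_rows),
})
# Made-up relationship: each objective raises the chance of winning.
win_prob = 0.3 + 0.15 * toy["firstdragon"] + 0.25 * toy["firsttower"]
toy["result"] = (rng.random(n_rows) < win_prob).astype(int)

X = toy[["firstdragon", "firsttower"]].fillna(0)
y = toy["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy (synthetic data): {baseline_accuracy:.4f}")
```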


In [252]:
winner_df = filtered_df[filtered_df['result'] == 1][['gameid', 'gamelength', 'totalgold', 'firstdragon', 'dragons', 'barons', 'towers']]
loser_df = filtered_df[filtered_df['result'] == 0][['gameid', 'totalgold', 'firstdragon', 'dragons', 'barons', 'towers']]

merged_df = winner_df.merge(loser_df, on='gameid', suffixes=('_winner', '_loser'))

html_string = custom_css + merged_df.head(10).to_html(classes='table table-striped', border=0, index=False)

# Save the HTML file
html_file_path = 'assets/merged_df.html'
with open(html_file_path, 'w') as f:
    f.write(html_string)

X = merged_df[['gamelength', 'firstdragon_winner', 'dragons_winner', 'barons_winner', 'towers_winner']]
y = merged_df['totalgold_loser'] 

# Imputation: replace all NaN with zero. After inspecting the DataFrame, NaN values indicate that the condition was not met (e.g., first dragon not taken)
X = X.fillna(0)
y = y.fillna(0)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale the features
    ('regressor', RandomForestRegressor(random_state=42))  # Random Forest Regressor
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Model Mean Squared Error: {mse:.4f}')
print(f'Model R-squared: {r2:.4f}')

Model Mean Squared Error: 9905658.8615
Model R-squared: 0.9234


## Step 7: Final Model

In [None]:
'''
Feature Engineering
To improve upon the baseline model, two new features were engineered:

total_objectives: Sum of all objectives secured by the team (e.g., first dragon, first tower, first baron).
objective_ratio: Ratio of objectives secured by the team compared to the opponent.
These features were chosen to capture the cumulative impact of securing multiple objectives and to measure a team's performance relative to their opponent.

Hyperparameter Tuning
A RandomForestClassifier was selected as the final model, given its ability to handle complex interactions between features. 
To optimize the model, a GridSearchCV was performed to tune the following hyperparameters:

n_estimators: Number of trees in the forest.
max_depth: Maximum depth of the tree.
min_samples_split: Minimum number of samples required to split an internal node.
min_samples_leaf: Minimum number of samples required to be at a leaf node.

Pipeline and Model Training
The final model was constructed using a scikit-learn pipeline, which included preprocessing steps and the Random Forest classifier. 
The same train-test split from the baseline model was used to ensure consistency in evaluation.

As a result, the final model achieved an accuracy of 66.07%, with the best hyperparameters 
being {max_depth: None, min_samples_leaf: 1, min_samples_split: 2, n_estimators: 50}. 
This matches the 66.07% accuracy reported for the baseline model, so the additional engineered features 
and tuning did not measurably improve accuracy over the baseline.
'''
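
The two engineered features described above could be computed along the following lines. This is a sketch under assumptions: the notebook does not show which columns go into `total_objectives`, so the column list here is a guess, and `objective_ratio` is implemented as the team's share of all objectives taken by both sides (rather than team divided by opponent) to avoid dividing by zero. The helper name `add_objective_features` is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed set of objective count columns and their opponent counterparts
TEAM_COLS = ["dragons", "barons", "towers", "heralds", "inhibitors"]
OPP_COLS = ["opp_" + col for col in TEAM_COLS]


def add_objective_features(df):
    """Return a copy of df with total_objectives and objective_ratio added."""
    out = df.copy()
    team = out[TEAM_COLS].fillna(0).sum(axis=1)
    opp = out[OPP_COLS].fillna(0).sum(axis=1)
    out["total_objectives"] = team
    total = team + opp
    # Team share of all objectives; 0.5 when neither side took any.
    out["objective_ratio"] = np.where(total > 0, team / total.where(total > 0, 1), 0.5)
    return out


# Toy rows (hypothetical): one active game and one with no recorded objectives
toy = pd.DataFrame({
    "dragons": [3, np.nan], "barons": [1, np.nan], "towers": [9, np.nan],
    "heralds": [1, np.nan], "inhibitors": [2, np.nan],
    "opp_dragons": [1, np.nan], "opp_barons": [0, np.nan], "opp_towers": [3, np.nan],
    "opp_heralds": [0, np.nan], "opp_inhibitors": [0, np.nan],
})
engineered = add_objective_features(toy)
print(engineered[["total_objectives", "objective_ratio"]])
```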

In [None]:
winning_teams_df = filtered_df[filtered_df['result'] == 1]

# Features and target variable
X = winning_teams_df[selected_columns]  # objective features defined in Step 5
y = winning_teams_df['gamelength']

# Handle missing values
X = X.fillna(0)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale the features
    ('regressor', RandomForestRegressor(random_state=42))  # Random Forest Regressor
])

# Fit the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Model Mean Squared Error: {mse:.4f}')
print(f'Model R-squared: {r2:.4f}')

Final Model Accuracy: 0.6607
Best Parameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 50}


## Step 8: Fairness Analysis

In [None]:
# Get predictions
y_pred = best_model.predict(X_test)

# Group X: Teams that secured first dragon
group_x_mask = X_test['firstdragon'] == 1
group_x_accuracy = accuracy_score(y_test[group_x_mask], y_pred[group_x_mask])

# Group Y: Teams that did not secure first dragon
group_y_mask = X_test['firstdragon'] == 0
group_y_accuracy = accuracy_score(y_test[group_y_mask], y_pred[group_y_mask])

print(f"Accuracy for Group X (firstdragon=1): {group_x_accuracy:.4f}")
print(f"Accuracy for Group Y (firstdragon=0): {group_y_accuracy:.4f}")


Accuracy for Group X (firstdragon=1): 0.6758
Accuracy for Group Y (firstdragon=0): 0.6491


In [None]:
# Calculate observed difference in accuracy
observed_diff = group_x_accuracy - group_y_accuracy

# Perform permutation test
n_permutations = 1000
diffs = []

for _ in range(n_permutations):
    shuffled = np.random.permutation(group_x_mask)
    shuffled_x_accuracy = accuracy_score(y_test[shuffled], y_pred[shuffled])
    shuffled_y_accuracy = accuracy_score(y_test[~shuffled], y_pred[~shuffled])
    diffs.append(shuffled_x_accuracy - shuffled_y_accuracy)

# Calculate p-value
p_value = np.mean(np.abs(diffs) >= np.abs(observed_diff))

print(f"Observed difference in accuracy: {observed_diff:.4f}")
print(f"Permutation test p-value: {p_value:.4f}")


Observed difference in accuracy: 0.0267
Permutation test p-value: 0.0810


In [None]:
# We fail to reject the null hypothesis at the 5% significance level.

'''
Objective of Fairness Analysis
The goal of this fairness analysis is to evaluate whether the final model performs differently 
for teams that secured a specific in-game objective compared to those that did not. 
Specifically, we want to see if the model's accuracy is consistent across these groups, ensuring that 
no team is disadvantaged based on their in-game performance in securing objectives.

Groups for Comparison
Group X: Teams that secured the first dragon (firstdragon = 1).
Group Y: Teams that did not secure the first dragon (firstdragon = 0).

Null and Alternative Hypotheses
Null Hypothesis (H₀): The model's accuracy is the same for both groups (teams that secured the first dragon and those that didn't), and any observed difference is due to random chance.
Alternative Hypothesis (H₁): The model's accuracy is different for these groups, suggesting potential bias.

Procedure
Model Accuracy Calculation: The accuracy was calculated separately for Group X and Group Y.
Permutation Test: A permutation test was conducted to determine if the observed difference in accuracy is statistically significant.

Results
Observed Difference in Accuracy: The difference in model accuracy between Group X and Group Y was found to be 0.0267.
Permutation Test P-value: The p-value from the permutation test was 0.0810.

Conclusion
Given the p-value of 0.0810, which is above the 0.05 threshold, 
we fail to reject the null hypothesis. This indicates that there is no strong evidence of bias in the model's 
performance between teams that secured the first dragon and those that did not. However, the p-value is not far above the threshold, 
suggesting a potential area for further investigation to ensure fairness.
'''

In [None]:
# Features: the objective columns in grouped_df; target: match result
X = grouped_df[['firstdragon', 'firstbaron', 'firsttower', 'dragons', 'towers', 
                'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes', 
                'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 
                'hextechs', 'elders', 'opp_elders', 'firstherald', 'heralds', 
                'opp_heralds', 'void_grubs', 'opp_void_grubs', 'barons', 
                'opp_barons', 'firstmidtower', 'firsttothreetowers', 
                'turretplates', 'opp_turretplates', 'inhibitors', 
                'opp_inhibitors']]
y = grouped_df['result']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Set the final model
final_model = grid_search.best_estimator_

def objective_win(**kwargs):
    # Default values for every objective column, with 'first' objectives as booleans
    default_values = {
        'firstdragon': False,
        'dragons': 0,
        'opp_dragons': 0,
        'elementaldrakes': 0,
        'opp_elementaldrakes': 0,
        'infernals': 0,
        'mountains': 0,
        'clouds': 0,
        'oceans': 0,
        'chemtechs': 0,
        'hextechs': 0,
        'dragons (type unknown)': 0,
        'elders': 0,
        'opp_elders': 0,
        'firstherald': False,
        'heralds': 0,
        'opp_heralds': 0,
        'void_grubs': 0,
        'opp_void_grubs': 0,
        'firstbaron': False,
        'barons': 0,
        'opp_barons': 0,
        'firsttower': False,
        'towers': 0,
        'opp_towers': 0,
        'firstmidtower': False,
        'firsttothreetowers': False,
        'turretplates': 0,
        'opp_turretplates': 0,
        'inhibitors': 0,
        'opp_inhibitors': 0,
    }
    
    # Update the default values with the provided arguments
    for key, value in kwargs.items():
        if key in default_values:
            default_values[key] = value
    
    # Convert boolean 'first' values to 1 (True) or 0 (False)
    for key in default_values:
        if 'first' in key:
            default_values[key] = 1 if default_values[key] else 0
    
    # Ensure the order of features matches the training set
    ordered_features = [default_values[col] for col in X.columns]  # X.columns should match the training data columns
    
    # Build a one-row DataFrame so the feature names match those seen during fitting
    feature_frame = pd.DataFrame([ordered_features], columns=X.columns)
    
    # Predict the probability of winning
    win_probability = final_model.predict_proba(feature_frame)[0, 1]
    
    return f"Chance of winning: {win_probability:.2%}"

# Example usage:
print(objective_win(firstdragon=True, firstbaron=False, firsttower=True))

Chance of winning: 51.24%




