# Your Title Here

**Name**: Kliment Ho

**Website Link**: (your website link)

In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import chi2_contingency, pointbiserialr


import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

In [3]:
# Question: Which objectives were the most impactful over the course of the 2023 season?
'''
In a match of League of Legends, teams' decisions are often shaped by whether to take, or "secure," objectives. For the sake of this project, we define "objectives" as
jungle monsters ("creeps") and structures that offer a team-wide stat increase or gold. Notable objectives include Baron, Rift Herald, Drake, and Tower.
Hypothesis: I believe the team that takes the first mid tower is more likely to win.

Dataset Overview
The dataset used for this project is derived from professional League of Legends (LoL) esports matches from 2023, compiled by Oracle's Elixir. The dataset includes a rich collection of gameplay statistics and outcomes, such as objective control (e.g., first dragon, first baron), player performance metrics (kills, deaths, assists), and team results (win/loss). These features are essential for understanding the dynamics of each match and for developing models that can predict match outcomes based on the in-game actions of teams.

Research Question
The central research question guiding this analysis is: How do different in-game objectives influence the likelihood of winning a League of Legends match? The analysis will focus on understanding the relationship between securing specific objectives (e.g., dragons, barons, towers) and the match outcome. This question is crucial because it directly relates to strategic decision-making in professional play, where teams aim to optimize their chances of winning by prioritizing certain objectives.

Significance of the Study
Understanding the impact of objectives on match outcomes can help teams refine their strategies and improve their performance. By identifying which objectives are most strongly associated with winning, teams can prioritize those objectives during gameplay. Additionally, this analysis can inform coaching decisions and contribute to the broader esports community's understanding of the game.

Relevant Columns in the Dataset
gameid: Unique identifier for each match.
side: The team affiliation (Blue/Red) of the players.
result: Indicates the match outcome for the team (1 for win, 0 for loss).
firstdragon, dragons, firstbaron, barons, towers: Variables representing key objectives secured during the match.
teamkills, teamdeaths: Team-level performance metrics.
This study will explore how these variables, particularly the objectives, correlate with the likelihood of a team winning a match, forming the basis for predictive modeling in later stages.
'''


## Step 2: Data Cleaning and Exploratory Data Analysis

In [4]:
# Univariate Insights:

# Objectives: Most teams secure between X and Y of each objective.
# Team Kills/Deaths: Average team kills are A, with a standard deviation of B. Notable outliers exist in team deaths indicating some matches are significantly one-sided.
# Bivariate Insights:

# Objectives vs. Outcome: Securing more objectives generally correlates with higher win rates. For example, teams securing the first dragon have a win rate of Z%.
# Team Kills vs. Outcome: Higher team kills are strongly associated with wins, indicating aggressive playstyles might be beneficial.
# Correlations: Objectives like 'dragons' and 'barons' show the highest correlation with match outcomes, suggesting their strategic importance.
# Aggregate Insights:

# Objective Combinations: Teams securing both first dragon and first baron have a significantly higher win rate compared to securing either alone.
# Side-Based Performance: Teams on the Blue side have a slightly higher win rate, potentially due to map advantages or starting positions.
# Patch Influence: Certain patches show higher win rates, possibly due to balance changes favoring specific strategies or objectives.
# Visual Patterns:

# Heatmaps and Faceted Plots: Reveal that securing higher counts of dragons and barons dramatically increases win probabilities.
# Violin Plots: Show that winning teams not only secure more objectives but also maintain a healthier balance between team kills and deaths.

'''
Initial Data Inspection
The dataset was initially inspected for missing values, inconsistencies, and potential outliers. 
The objectives data (e.g., first dragon, first baron) had missing values where a team did not secure the objective, 
which were initially represented as NaN. These missing values were filled with zeros to accurately represent that the objective was not secured.

Data Filtering
Only relevant columns that pertain directly to the research question were selected. 
These include columns related to game outcomes (result), objectives (firstdragon, barons, etc.), 
and team performance metrics (teamkills, teamdeaths).

Grouping Data
To ensure each row represented a unique team within each match, the data was grouped by gameid and side. 
This allows for analysis at the team level, where each team's performance and objective control could be directly linked to the match outcome.

Handling Missing Data
As mentioned, missing data related to objectives was handled by imputing zeros. 
This decision was made because the absence of an objective capture (e.g., not securing the first dragon) is a meaningful data point in this context, 
indicating that the team did not achieve that objective.

Feature Selection
For the baseline model, the initial feature set includes the following:

Objective-based features: firstdragon, firsttower, firstbaron, etc.
Outcome feature: result These features are central to understanding the relationship between in-game objectives and match outcomes.
This preprocessing step ensures that the dataset is clean, relevant, 
and ready for further analysis and model development. The next step involves exploratory data analysis to uncover patterns 
and relationships within this cleaned dataset.
'''

In [5]:
# Loading 2023 Match Data
filepath = Path('data') / '2023_LoL_esports_match_data_from_OraclesElixir.csv'
lol_stats = pd.read_csv(filepath)
print(lol_stats.describe())
print(lol_stats.info())



                year       playoffs           game          patch  \
count  130764.000000  130764.000000  130764.000000  130644.000000   
mean     2023.034505       0.209874       1.649353      13.085001   
std         0.182523       0.407220       0.940911       0.056031   
min      2023.000000       0.000000       1.000000      13.010000   
25%      2023.000000       0.000000       1.000000      13.040000   
50%      2023.000000       0.000000       1.000000      13.100000   
75%      2023.000000       0.000000       2.000000      13.130000   
max      2024.000000       1.000000       5.000000      13.240000   

       participantid     gamelength         result          kills  \
count  130764.000000  130764.000000  130764.000000  130764.000000   
mean       29.583333    1877.762045       0.500000       4.712016   
std        57.650688     335.342468       0.500002       5.772584   
min         1.000000     201.000000       0.000000       0.000000   
25%         3.750000    1643.0000

In [6]:
# Display all columns
for i in lol_stats.columns:
    print(i)

# Filter relevant columns
filtered_df = lol_stats[lol_stats['champion'].isna()].loc[:, [
    "gameid", 
    "game", 
    "patch", 
    "side", 
    "position", 
    "result", 
    "teamkills", 
    "teamdeaths"
    ] +
    lol_stats.loc[:1, "firstdragon":"opp_inhibitors"].columns.to_list()
    ]

# prev_match = lol_stats.loc[lol_stats['gameid'] ==  'ESPORTSTMNT06_2753012'].to_html(classes='table table-striped', border=0, index=True)
# file_path = 'assets/firstmatch_data.html'
# with open(file_path, 'w') as f:
#     f.write(prev_match)

gameid
datacompleteness
url
league
year
split
playoffs
date
game
patch
participantid
side
position
playername
playerid
teamname
teamid
champion
ban1
ban2
ban3
ban4
ban5
pick1
pick2
pick3
pick4
pick5
gamelength
result
kills
deaths
assists
teamkills
teamdeaths
doublekills
triplekills
quadrakills
pentakills
firstblood
firstbloodkill
firstbloodassist
firstbloodvictim
team kpm
ckpm
firstdragon
dragons
opp_dragons
elementaldrakes
opp_elementaldrakes
infernals
mountains
clouds
oceans
chemtechs
hextechs
dragons (type unknown)
elders
opp_elders
firstherald
heralds
opp_heralds
void_grubs
opp_void_grubs
firstbaron
barons
opp_barons
firsttower
towers
opp_towers
firstmidtower
firsttothreetowers
turretplates
opp_turretplates
inhibitors
opp_inhibitors
damagetochampions
dpm
damageshare
damagetakenperminute
damagemitigatedperminute
wardsplaced
wpm
wardskilled
wcpm
controlwardsbought
visionscore
vspm
totalgold
earnedgold
earned gpm
earnedgoldshare
goldspent
gspd
gpr
total cs
minionkills
monsterkills
mon

In [10]:
fig = px.histogram(filtered_df, x='towers', nbins=20, title='Distribution of Towers Destroyed',
                   labels={'towers': 'Number of Towers Destroyed'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Towers Destroyed',
    yaxis_title='Frequency',
    template='plotly_dark'
)
fig.write_html('assets/dist_towers.html', include_plotlyjs='cdn')
fig

In [16]:
fig = px.histogram(filtered_df[filtered_df['result'] == 1], x='towers', nbins=20, title='Distribution of Towers Destroyed by Winning Teams',
                   labels={'towers': 'Number of Towers Destroyed'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Towers Destroyed',
    yaxis_title='Frequency',
    template='plotly_dark'
)
fig.write_html('assets/win_dist_towers.html', include_plotlyjs='cdn')
fig

In [25]:
major_objectives = ['barons', 'dragons', 'heralds', 'void_grubs']
obj_count_df = filtered_df.copy()
# Sum the counts of each objective across rows
obj_count_df['major_objectives_count'] = obj_count_df[major_objectives].sum(axis=1)
# Display the new DataFrame
print(obj_count_df.head())

fig = px.histogram(obj_count_df, x='major_objectives_count', nbins=20, 
                   title='Distribution of Major Objectives Count',
                   labels={'major_objectives_count': 'Major Objectives Count'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Major Objectives Taken',
    yaxis_title='Frequency',
    template='plotly_dark'
)

# Save the plot as an HTML file
fig.write_html('assets/major_objectives_distribution.html', include_plotlyjs='cdn')
fig

                   gameid  game  patch  side position  result  teamkills  \
10  ESPORTSTMNT06_2753012     1  13.01  Blue     team       1         13   
11  ESPORTSTMNT06_2753012     1  13.01   Red     team       0          7   
22  ESPORTSTMNT06_2754023     1  13.01  Blue     team       0         20   
23  ESPORTSTMNT06_2754023     1  13.01   Red     team       1         16   
34  ESPORTSTMNT06_2755035     1  13.01  Blue     team       1         20   

    teamdeaths  firstdragon  dragons  ...  firsttower  towers  opp_towers  \
10           7          0.0      4.0  ...         1.0    11.0         2.0   
11          13          1.0      3.0  ...         0.0     2.0        11.0   
22          16          0.0      3.0  ...         0.0     5.0        11.0   
23          20          1.0      4.0  ...         1.0    11.0         5.0   
34           7          0.0      4.0  ...         0.0     7.0         4.0   

    firstmidtower  firsttothreetowers  turretplates  opp_turretplates  \
10     

In [17]:
fig = px.histogram(obj_count_df[obj_count_df['result'] == 1], x='major_objectives_count', nbins=20, 
                   title='Distribution of Winners Major Objectives Count',
                   labels={'major_objectives_count': 'Major Objectives Count'})

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Number of Major Objectives Taken',
    yaxis_title='Frequency',
    template='plotly_dark'
)

# Save the plot as an HTML file
fig.write_html('assets/win_major_objectives_distribution.html', include_plotlyjs='cdn')
fig

In [None]:
# Group matches so that each row represents one team in one game

grouped_df = filtered_df.fillna(0).groupby(['gameid', 'side']).max().loc[:, 
['result', "teamkills", "teamdeaths"] + filtered_df.loc[:, 'firstdragon':'opp_inhibitors'].columns.to_list()]
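
The filtering, zero-filling, and grouping steps above can be sketched on a tiny made-up table. The `toy` frame and its values are invented for illustration; only the column names come from the dataset. Team rows are selected here with `position == "team"`, which is an assumption made for the toy data (the notebook filters on a missing `champion` instead).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data: one game, with one player row
# and one team row per side. All values are invented.
toy = pd.DataFrame({
    "gameid": ["G1", "G1", "G1", "G1"],
    "side": ["Blue", "Blue", "Red", "Red"],
    "position": ["top", "team", "top", "team"],
    "result": [1, 1, 0, 0],
    "firstdragon": [np.nan, 1.0, np.nan, np.nan],
    "barons": [np.nan, 2.0, np.nan, np.nan],
})

# Keep team-level rows only, treat a missing objective as "not secured",
# and index by (gameid, side) so each row is one team in one match.
team_rows = toy[toy["position"] == "team"]
clean = (
    team_rows.drop(columns="position")
    .fillna(0)
    .groupby(["gameid", "side"])
    .max()
)
print(clean)
```

Because each (gameid, side) pair has exactly one team row, the `.max()` aggregation simply carries that row through; it only matters if duplicate rows exist.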


In [None]:
#correlation matrix
correlation_matrix = grouped_df.corr()
relevant_correlations = correlation_matrix.loc[
    ['result', 'teamkills', 'teamdeaths'],
    grouped_df.columns[grouped_df.columns.str.startswith('first') | grouped_df.columns.str.startswith('opp')]
]

print(relevant_correlations)

            firstdragon  opp_dragons  opp_elementaldrakes  opp_elders  \
result         0.203462    -0.614756            -0.482659   -0.114585   
teamkills      0.180140    -0.426800            -0.324150   -0.005718   
teamdeaths    -0.150218     0.556212             0.453087    0.130965   

            firstherald  opp_heralds  opp_void_grubs  firstbaron  opp_barons  \
result         0.143287    -0.240959             NaN    0.569740   -0.638985   
teamkills      0.131432    -0.194477             NaN    0.462868   -0.356137   
teamdeaths    -0.101405     0.228897             NaN   -0.419880    0.556437   

            firsttower  opp_towers  firstmidtower  firsttothreetowers  \
result        0.339589   -0.889475       0.392519            0.499866   
teamkills     0.292062   -0.600457       0.350364            0.418675   
teamdeaths   -0.261857    0.716710      -0.320213           -0.388201   

            opp_turretplates  opp_inhibitors  
result             -0.276630       -0.753096  
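
The correlation between `firstmidtower` and `result` shown above could also be checked with a chi-square test of independence, since both columns are binary. The sketch below uses invented counts rather than the real data, so the numbers it prints say nothing about the 2023 season. Note also that the two team rows of a game mirror each other, so on real data the rows are not fully independent.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up team-level rows: whether the team took the first mid tower,
# and whether it won. The counts are invented for illustration.
toy = pd.DataFrame({
    "firstmidtower": [1] * 70 + [1] * 30 + [0] * 30 + [0] * 70,
    "result":        [1] * 70 + [0] * 30 + [1] * 30 + [0] * 70,
})

# 2x2 contingency table of counts
table = pd.crosstab(toy["firstmidtower"], toy["result"])

# Chi-square test of independence
# (SciPy applies Yates' continuity correction by default for 2x2 tables)
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
```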

In [None]:
#Distribution of wins based on objectives
objectives = [
    'firstdragon', 'dragons', 'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes',
    'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 'hextechs',
    'dragons (type unknown)', 'elders', 'opp_elders', 'firstherald', 'heralds', 
    'opp_heralds', 'void_grubs', 'opp_void_grubs', 'firstbaron', 'barons', 
    'opp_barons', 'firsttower', 'towers', 'opp_towers', 'firstmidtower', 
    'firsttothreetowers', 'turretplates', 'opp_turretplates', 'inhibitors', 
    'opp_inhibitors'
]

# for obj in objectives:
#     fig = px.histogram(grouped_df, x=obj, nbins=10, 
#                        title=f'Distribution of {obj.capitalize()}',
#                        labels={obj: obj.capitalize()})
#     fig.show()
# Distribution of outcomes (wins vs losses)
# fig = px.histogram(grouped_df, x='result', nbins=2, 
#                    title='Distribution of Match Outcomes (Win/Loss)',
#                    labels={'result': 'Match Outcome'})
# fig.show()


In [21]:
#Correlation
# from scipy.stats import pointbiserialr

# correlations = {}
# for obj in objectives:
#     corr, _ = pointbiserialr(grouped_df[obj], grouped_df['result'])
#     correlations[obj] = corr

# correlations_df = pd.DataFrame(list(correlations.items()), columns=['Objective', 'Correlation with Result'])

# # Plot correlations
# fig = px.bar(correlations_df, x='Objective', y='Correlation with Result', 
#              title='Correlation between Objectives and Match Outcome')
# fig.show()
# fig.write_html('assets/correlation_obj_outcome.html', include_plotlyjs='cdn')

# We can see below that the objective "towers" has a strong correlation with winning. This makes sense,
# as a game of League requires a team to destroy at least 5 towers to reach the final objective and
# win a match. The only case where a win/loss occurs before the destruction of 5 towers is
# a forfeit, which is rare in a professional esports match.

# Notably, the number of dragons acquired and the slaying of "Baron" both have a strong correlation with
# the match outcome. Surprisingly, dragons seem to have more impact on the outcome than Baron
# in this correlation bar graph.
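
As a minimal sketch of what the point-biserial correlation in the commented-out cell measures, the example below uses synthetic data (the `towers` values are generated, not taken from the dataset). It also shows that, for a binary outcome, the point-biserial coefficient coincides with the ordinary Pearson correlation.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Synthetic data: a binary outcome and a count-like feature that is
# higher on average when the outcome is 1. Values are invented.
result = rng.integers(0, 2, size=500)
towers = 4 + 5 * result + rng.normal(0, 2, size=500)

# Point-biserial correlation between the binary and the continuous variable
r_pb, p_value = pointbiserialr(result, towers)

# For a 0/1 variable this equals the Pearson correlation coefficient
r_pearson = np.corrcoef(result, towers)[0, 1]

print(f"point-biserial r = {r_pb:.3f}, Pearson r = {r_pearson:.3f}")
```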

In [20]:
#Pairwise correlation vs. objectives
# fig = px.imshow(grouped_df[objectives].corr(), 
#                 title='Pairwise Correlations between Objectives')
# fig.show()
# fig.write_html('assets/pairwise_correlation.html', include_plotlyjs='cdn')
# This shows the correlation between objectives. Namely, there is a strong negative correlation between each objective and its "opp" counterpart.
# "opp" denotes the opponent's data for the respective objective.


In [22]:
# Proportions of Win/Lose for Objective
# objective_proportions = {}

# for obj in objectives:
#     win_loss_counts = grouped_df.groupby([obj, 'result']).size().unstack(fill_value=0)
#     win_loss_proportions = win_loss_counts.div(win_loss_counts.sum(axis=1), axis=0)
    
#     objective_proportions[obj] = win_loss_proportions



In [18]:
#Stacked Bar Graph for Proportional Objective wins
# for obj in objectives:
#     proportions_df = objective_proportions[obj].reset_index()

#     fig = px.bar(proportions_df, 
#                  x=obj, 
#                  y=[0, 1], 
#                  title=f'Proportion of Wins/Losses for {obj.capitalize()}',
#                  labels={obj: f'Number of {obj.capitalize()}', 'value': 'Proportion'},
#                  barmode='stack')
    
#     fig.update_layout(yaxis=dict(tickformat=".0%"))
#     fig.show()
#     fig.write_html(f'assets/{obj.capitalize()}_prop_bar.html', include_plotlyjs='cdn')


In [27]:
# Specify the objectives to analyze
selected_objectives = ['barons', 'dragons', 'firstdragon', 'towers', 'void_grubs', 'elders', 'inhibitors']

objective_proportions = {}

# Calculate win/loss proportions for each selected objective
for obj in selected_objectives:
    win_loss_counts = obj_count_df.groupby([obj, 'result']).size().unstack(fill_value=0)
    win_loss_proportions = win_loss_counts.div(win_loss_counts.sum(axis=1), axis=0)
    
    # Add the objective proportions to the dictionary
    objective_proportions[obj] = win_loss_proportions

for obj in selected_objectives:
    proportions_df = objective_proportions[obj].reset_index()

    # Create a bar plot for each objective
    fig = px.bar(proportions_df, 
                 x=obj, 
                 y=[0, 1], 
                 title=f'Proportion of Wins/Losses for {obj.capitalize()}',
                 labels={obj: f'Number of {obj.capitalize()}', 'value': 'Proportion'},
                 barmode='stack')

    # Update layout for better readability
    fig.update_layout(
        yaxis=dict(tickformat=".0%"),
        template='plotly_dark'
        )
    fig.show()
    
    # Save the plot as an HTML file
    fig.write_html(f'assets/{obj.capitalize()}_prop_bar.html', include_plotlyjs='cdn')


In [None]:
win_rows = []

for obj in objectives:
    # Calculate the win proportion for each count of the objective
    win_loss_counts = grouped_df.groupby([obj, 'result']).size().unstack(fill_value=0)
    # Make sure both outcome columns exist even if one outcome never occurs
    win_loss_counts = win_loss_counts.reindex(columns=[0, 1], fill_value=0)
    win_proportion = win_loss_counts[1] / (win_loss_counts[0] + win_loss_counts[1])

    # Filter out zero win proportions, keeping the objective count each value belongs to
    # (resetting the index here would misalign counts once rows are dropped)
    win_proportion = win_proportion[win_proportion > 0]
    win_rows.append(pd.DataFrame({
        'Objective': obj,
        'Number of Objectives': win_proportion.index,
        'Win Proportion': win_proportion.values,
    }))

# Stack into one long DataFrame for plotting
win_proportions_melted = pd.concat(win_rows, ignore_index=True)

# Create the line plot without zero points, with the objective count on the x-axis
fig = px.line(win_proportions_melted,
              x='Number of Objectives',
              y='Win Proportion',
              color='Objective',
              title='Win Proportion by Number of Objectives Secured (Non-Zero)')

fig.update_layout(yaxis=dict(tickformat=".0%"))
fig.show()



In [None]:
# from sklearn.preprocessing import StandardScaler

# grouped_df.fillna(0, inplace=True)
# scaler = StandardScaler()
# grouped_df[objectives] = scaler.fit_transform(grouped_df[objectives])
# grouped_df['dragon_x_tower'] = grouped_df['firstdragon'] * grouped_df['firsttower']

In [None]:
# from sklearn.model_selection import train_test_split

# X = grouped_df[objectives]
# y = grouped_df['result']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Interesting Aggregates
# Goal: Summarize and group data to understand collective behaviors or patterns.
# Actions:
# Calculate win/loss proportions for different counts of objectives (e.g., how often teams win when they secure 2 dragons vs. 3 dragons).
# Visualize these aggregates using bar plots to see how securing different numbers of objectives impacts win rates.

### Univariate Analysis
- **Objective Distribution**: Histograms were generated for each objective-related feature (e.g., `firstdragon`, `firstbaron`) to examine the distribution of these key variables. This analysis revealed how frequently each objective was secured by teams, highlighting the importance of specific objectives in matches.
  
- **Outcome Distribution**: The match outcomes (`result`) were also analyzed to understand the balance between wins and losses across the dataset. This step ensured that the dataset was not skewed heavily toward one outcome, which could affect model performance.
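
As a small illustration of the two univariate checks above, the sketch below computes the outcome balance and how often a binary objective was secured. The six rows are invented; only the column names match the dataset.

```python
import pandas as pd

# Invented team-level rows for illustration
toy = pd.DataFrame({
    "result":     [1, 0, 1, 0, 0, 1],
    "firstbaron": [1, 0, 0, 1, 0, 1],
})

# Share of losses (0) and wins (1); values near 0.5 indicate a balanced outcome
outcome_share = toy["result"].value_counts(normalize=True).sort_index()

# For a 0/1 column, the mean is the fraction of teams that secured the objective
firstbaron_rate = toy["firstbaron"].mean()

print(outcome_share)
print(f"firstbaron secured in {firstbaron_rate:.0%} of team rows")
```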

### Bivariate Analysis
- **Objective vs. Outcome**: The relationship between each objective and match outcomes was explored using correlation matrices. This analysis provided insight into which objectives were most strongly associated with winning or losing a match, guiding feature selection for modeling.
  
- **Pairwise Correlations**: Pairwise correlations between objectives were analyzed to detect multicollinearity. Identifying highly correlated objectives is crucial for avoiding redundancy in the model and ensuring that each feature contributes unique information.
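
One way to act on the pairwise correlations is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below uses synthetic columns, and the 0.8 threshold is an arbitrary assumption rather than a value used in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic features: opp_towers is constructed as a mirror of towers,
# while dragons is generated independently. All values are invented.
towers = rng.integers(0, 12, size=300)
toy = pd.DataFrame({
    "towers": towers,
    "opp_towers": 11 - towers,
    "dragons": rng.integers(0, 5, size=300),
})

corr = toy.corr()

# Keep only the upper triangle so each pair is listed once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical cutoff for "highly correlated"
threshold = 0.8
high = pairs[pairs.abs() > threshold]
print(high)
```

Pairs flagged this way are candidates for dropping one member or combining the two into a single feature.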

### Interesting Aggregates
- **Win Proportion by Objective Count**: Aggregated statistics were calculated to understand how the number of objectives secured by a team impacted their likelihood of winning. For instance, teams that secured three or more dragons had a higher win rate compared to those that secured fewer or none.
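
The aggregate described above amounts to a grouped mean of the 0/1 `result` column. A minimal sketch on invented rows (the win rates it prints are not results from the dataset):

```python
import pandas as pd

# Invented team-level rows: dragons secured and match outcome
toy = pd.DataFrame({
    "dragons": [0, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "result":  [0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
})

# result is 0/1, so its mean within each group is the win rate;
# the group size shows how many rows support each estimate
win_rate = toy.groupby("dragons")["result"].agg(win_rate="mean", games="size")
print(win_rate)
```

Reporting the group size next to the win rate helps avoid over-reading rates that rest on only a few matches.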


## Step 3: Assessment of Missingness

In [None]:
# For all objectives, the NaN values we filled correspond to a team taking zero of that objective. Therefore, we believe the objective columns are all Not Missing At Random (NMAR): the
# missingness follows no external trend but depends on the value itself, which comes down to team-by-team decisions. Here is the original grouped DataFrame with the NaN values.

In [None]:
# Missingness Dependency: However, what about matches where an objective column is NaN for the entire match? We will show a new grouped DataFrame that includes only the matches in which
# an objective is completely NaN. As we can see, NaN values occur in the "firstdragon" column even when the "dragons" column has a valid value. Here, we will
# investigate whether the missingness of "firstdragon" depends on the "league" column, since the groupby aggregation with .max() on "league" shows NaN values.

In [None]:
objective_cols = filtered_df.loc[:, 'firstdragon':'opp_inhibitors'].columns.to_list()

# Team-level rows per game and side, keeping NaN values (no fillna) so missingness stays visible.
# Columns are selected before .max() so non-numeric columns are never aggregated.
nangrouped_df = filtered_df.groupby(['gameid', 'side'])[['result', 'teamkills', 'teamdeaths'] + objective_cols].max()

# Per-league maximum of each objective column; a column that is entirely NaN for a league stays NaN
nangrouped2_df = lol_stats.groupby('league')[objective_cols].max()
nangrouped_df.head(5)
nangrouped2_df.head(10)





league,firstdragon,dragons,opp_dragons,elementaldrakes,opp_elementaldrakes,infernals,mountains,clouds,oceans,chemtechs,...,opp_barons,firsttower,towers,opp_towers,firstmidtower,firsttothreetowers,turretplates,opp_turretplates,inhibitors,opp_inhibitors
AL,1.0,6.0,6.0,4.0,4.0,4.0,4.0,4.0,2.0,4.0,...,3.0,1.0,11.0,11.0,1.0,1.0,12.0,12.0,7.0,7.0
CBLOL,1.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,...,4.0,1.0,11.0,11.0,1.0,1.0,11.0,11.0,6.0,6.0
CBLOLA,1.0,6.0,6.0,4.0,4.0,4.0,3.0,3.0,4.0,3.0,...,4.0,1.0,11.0,11.0,1.0,1.0,11.0,11.0,8.0,8.0
CDF,1.0,4.0,4.0,4.0,4.0,3.0,2.0,4.0,2.0,2.0,...,4.0,1.0,11.0,11.0,1.0,1.0,15.0,15.0,8.0,8.0
CT,1.0,6.0,6.0,4.0,4.0,2.0,3.0,3.0,3.0,3.0,...,3.0,1.0,11.0,11.0,1.0,1.0,10.0,10.0,4.0,4.0
DCup,,4.0,4.0,,,,,,,,...,3.0,,11.0,11.0,,,,,4.0,4.0
DDH,1.0,5.0,5.0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,...,3.0,1.0,11.0,11.0,1.0,1.0,13.0,13.0,5.0,5.0
EBL,1.0,6.0,6.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,...,4.0,1.0,11.0,11.0,1.0,1.0,13.0,13.0,6.0,6.0
EL,1.0,6.0,6.0,4.0,4.0,2.0,2.0,3.0,3.0,3.0,...,4.0,1.0,11.0,11.0,1.0,1.0,12.0,12.0,5.0,5.0
EM,1.0,5.0,5.0,4.0,4.0,3.0,3.0,4.0,3.0,3.0,...,3.0,1.0,11.0,11.0,1.0,1.0,15.0,15.0,5.0,5.0


In [None]:
# FirstDragon Dependency on League
lol_stats_with_missing = lol_stats[['league', 'firstdragon']].copy()

# Add a new column to indicate missingness for 'firstdragon'
lol_stats_with_missing['firstdragon_missing'] = lol_stats_with_missing['firstdragon'].isna()

# Preview the new dataframe
lol_stats_with_missing.head(5)

league_distribution = lol_stats_with_missing.groupby(['league', 'firstdragon_missing']).size().unstack(fill_value=0)

# Normalize the distribution
league_distribution_norm = league_distribution.div(league_distribution.sum(axis=1), axis=0)

# Display the normalized distribution
print(league_distribution_norm)


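# Observed statistic (labelled TVD here): sum over leagues of |P(missing | league) - P(not missing | league)|
observed_tvd = league_distribution_norm.diff(axis=1).iloc[:, -1].abs().sum()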
# Permutation test
n_permutations = 1000
permuted_tvds = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing['firstdragon_missing'].sample(frac=1).values
    permuted_counts = lol_stats_with_missing.assign(shuffled_missing=shuffled).groupby(['league', 'shuffled_missing']).size().unstack(fill_value=0)
    permuted_counts_norm = permuted_counts.div(permuted_counts.sum(axis=1), axis=0)
    permuted_tvd = permuted_counts_norm.diff(axis=1).iloc[:, -1].abs().sum()
    permuted_tvds.append(permuted_tvd)

# Calculate p-value
p_value = np.mean(np.array(permuted_tvds) >= observed_tvd)
print(f"Observed TVD: {observed_tvd}")
print(f"P-value: {p_value:.4f}")






firstdragon_missing     False      True
league                                 
AL                   0.166667  0.833333
CBLOL                0.166667  0.833333
CBLOLA               0.166667  0.833333
CDF                  0.166667  0.833333
CT                   0.166667  0.833333
DCup                 0.000000  1.000000
DDH                  0.166667  0.833333
EBL                  0.166667  0.833333
EL                   0.166667  0.833333
EM                   0.166667  0.833333
EPL                  0.166667  0.833333
ESLOL                0.166667  0.833333
GL                   0.166667  0.833333
GLL                  0.166667  0.833333
HC                   0.166667  0.833333
HM                   0.166667  0.833333
IC                   0.166667  0.833333
LAS                  0.166667  0.833333
LCK                  0.166667  0.833333
LCKC                 0.166667  0.833333
LCO                  0.166667  0.833333
LCS                  0.166667  0.833333
LDL                  0.000000  1.000000


In [None]:
# Missingness of firstdragon does not depend on league column.

# However, we suspect that there is an underlying meaning. A NaN firstdragon could simply mean that no dragon
# was slain at all in that game. Therefore, we run the same test on "firsttower".

# FirstTower Dependency on League
lol_stats_with_missing_tower = lol_stats[['league', 'firsttower']].copy()

# Add a new column to indicate missingness for 'firsttower'
lol_stats_with_missing_tower['firsttower_missing'] = lol_stats_with_missing_tower['firsttower'].isna()
lol_stats_with_missing_tower.head()

# Calculate the distribution of leagues based on the missingness of 'firsttower'
league_distribution_tower = lol_stats_with_missing_tower.groupby(['league', 'firsttower_missing']).size().unstack(fill_value=0)
league_distribution_tower_norm = league_distribution_tower.div(league_distribution_tower.sum(axis=1), axis=0)

# Display normalized distr.
print(league_distribution_tower_norm)

# Calculate the observed statistic (labelled TVD): sum over leagues of |P(missing | league) - P(not missing | league)|
observed_tvd_tower = league_distribution_tower_norm.diff(axis=1).iloc[:, -1].abs().sum()

# Permutation test
n_permutations = 1000
permuted_tvds_tower = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing_tower['firsttower_missing'].sample(frac=1).values
    permuted_counts_tower = lol_stats_with_missing_tower.assign(shuffled_missing=shuffled).groupby(['league', 'shuffled_missing']).size().unstack(fill_value=0)
    permuted_counts_tower_norm = permuted_counts_tower.div(permuted_counts_tower.sum(axis=1), axis=0)
    permuted_tvd_tower = permuted_counts_tower_norm.diff(axis=1).iloc[:, -1].abs().sum()
    permuted_tvds_tower.append(permuted_tvd_tower)

# Calculate p-value
p_value_tower = np.mean(np.array(permuted_tvds_tower) >= observed_tvd_tower)
print(f"Observed TVD: {observed_tvd_tower}")
print(f"P-value: {p_value_tower:.4f}")


firsttower_missing     False      True
league                                
AL                  0.166667  0.833333
CBLOL               0.166667  0.833333
CBLOLA              0.166667  0.833333
CDF                 0.166667  0.833333
CT                  0.166667  0.833333
DCup                0.000000  1.000000
DDH                 0.166667  0.833333
EBL                 0.166667  0.833333
EL                  0.166667  0.833333
EM                  0.166667  0.833333
EPL                 0.166667  0.833333
ESLOL               0.166667  0.833333
GL                  0.166667  0.833333
GLL                 0.166667  0.833333
HC                  0.166667  0.833333
HM                  0.166667  0.833333
IC                  0.166667  0.833333
LAS                 0.166667  0.833333
LCK                 0.166667  0.833333
LCKC                0.166667  0.833333
LCO                 0.166667  0.833333
LCS                 0.166667  0.833333
LDL                 0.000000  1.000000
LEC                 0.166
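
The statistic computed above comes from row-normalized proportions, that is, the share of missing rows within each league. A more conventional total variation distance compares the distribution of league among rows where the value is missing with its distribution among rows where it is present, and halves the summed absolute differences so the result lies between 0 and 1. The following is a minimal sketch of that variant on toy data. The names missingness_tvd and missingness_permutation_test and the toy frame are hypothetical and are not part of the analysis in this notebook.

```python
import numpy as np
import pandas as pd


def missingness_tvd(df, group_col, flag_col):
    # Distribution of group_col within each missingness class (each column sums to 1).
    dist = pd.crosstab(df[group_col], df[flag_col], normalize='columns')
    if dist.shape[1] < 2:
        return 0.0  # nothing missing (or everything missing): no distance to measure
    return float(dist.iloc[:, 0].sub(dist.iloc[:, 1]).abs().sum() / 2)


def missingness_permutation_test(df, group_col, value_col, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)  # seeded so the p-value is reproducible
    work = pd.DataFrame({
        group_col: df[group_col].to_numpy(),
        'is_missing': df[value_col].isna().to_numpy(),
    })
    observed = missingness_tvd(work, group_col, 'is_missing')
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle the missingness flags to break any link with the group column.
        shuffled = work.assign(is_missing=rng.permutation(work['is_missing'].to_numpy()))
        simulated[i] = missingness_tvd(shuffled, group_col, 'is_missing')
    return observed, float(np.mean(simulated >= observed))


# Toy data (an assumption for illustration): values are missing only in league "C".
toy = pd.DataFrame({
    'league': ['A'] * 40 + ['B'] * 40 + ['C'] * 40,
    'firsttower': [1.0] * 80 + [np.nan] * 40,
})
observed, p_value = missingness_permutation_test(toy, 'league', 'firsttower')
print(f"Observed TVD: {observed:.4f}")
print(f"P-value: {p_value:.4f}")
```

Because the toy missingness is concentrated entirely in one league, the observed distance is at its maximum and the shuffled statistics should essentially never reach it.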

In [None]:
# Missingness of firsttower does not depend on the league column.

# We fail to reject the null hypothesis for both firsttower and firstdragon: in each case the p-value is greater than the significance level of 0.05.

In [None]:
# Inhibitors Dependency on League
lol_stats_with_missing_inhibitors = lol_stats[['league', 'inhibitors']].copy()

# Add a new column to indicate missingness for 'inhibitors'
lol_stats_with_missing_inhibitors['inhibitors_missing'] = lol_stats_with_missing_inhibitors['inhibitors'].isna()

# Preview the new dataframe
lol_stats_with_missing_inhibitors.head()

# Calculate the distribution of leagues based on the missingness of 'inhibitors'
league_distribution_inhibitors = lol_stats_with_missing_inhibitors.groupby(['league', 'inhibitors_missing']).size().unstack(fill_value=0)

# Normalize the distribution
league_distribution_inhibitors_norm = league_distribution_inhibitors.div(league_distribution_inhibitors.sum(axis=1), axis=0)

# Display the normalized distribution
print(league_distribution_inhibitors_norm)

# Calculate the observed test statistic: the sum over leagues of |P(missing) - P(not missing)|, used here as a TVD-style statistic
observed_tvd_inhibitors = league_distribution_inhibitors_norm.diff(axis=1).iloc[:, -1].abs().sum()

# Permutation test
n_permutations = 1000
permuted_tvds_inhibitors = []

for _ in range(n_permutations):
    shuffled = lol_stats_with_missing_inhibitors['inhibitors_missing'].sample(frac=1).values
    permuted_counts_inhibitors = lol_stats_with_missing_inhibitors.assign(shuffled_missing=shuffled).groupby(['league', 'shuffled_missing']).size().unstack(fill_value=0)
    permuted_counts_inhibitors_norm = permuted_counts_inhibitors.div(permuted_counts_inhibitors.sum(axis=1), axis=0)
    permuted_tvd_inhibitors = permuted_counts_inhibitors_norm.diff(axis=1).iloc[:, -1].abs().sum()
    permuted_tvds_inhibitors.append(permuted_tvd_inhibitors)

# Calculate p-value
p_value_inhibitors = np.mean(np.array(permuted_tvds_inhibitors) >= observed_tvd_inhibitors)
print(f"Observed TVD: {observed_tvd_inhibitors}")
print(f"P-value: {p_value_inhibitors:.4f}")


inhibitors_missing     False      True
league                                
AL                  1.000000  0.000000
CBLOL               1.000000  0.000000
CBLOLA              1.000000  0.000000
CDF                 1.000000  0.000000
CT                  1.000000  0.000000
DCup                0.166667  0.833333
DDH                 1.000000  0.000000
EBL                 1.000000  0.000000
EL                  1.000000  0.000000
EM                  1.000000  0.000000
EPL                 1.000000  0.000000
ESLOL               1.000000  0.000000
GL                  1.000000  0.000000
GLL                 1.000000  0.000000
HC                  1.000000  0.000000
HM                  1.000000  0.000000
IC                  1.000000  0.000000
LAS                 1.000000  0.000000
LCK                 1.000000  0.000000
LCKC                1.000000  0.000000
LCO                 1.000000  0.000000
LCS                 1.000000  0.000000
LDL                 0.166667  0.833333
LEC                 1.000

In [None]:
# Missingness of inhibitors does depend on the league column.

# We reject the null hypothesis for inhibitors: the p-value is less than the significance level of 0.05.

In [None]:
# Example: Testing dependency of missingness in 'firstdragon' on 'result'
contingency_table = pd.crosstab(grouped_df['firstdragon'].isnull(), grouped_df['result'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f'Chi-Square Test for Missingness in First Dragon vs Result: p-value = {p}')


Chi-Square Test for Missingness in First Dragon vs Result: p-value = 1.0


## Step 4: Hypothesis Testing

In [None]:
# Null Hypothesis (H₀): Securing a specific objective does not significantly affect the probability of winning the game.
# Alternative Hypothesis (H₁): Securing a specific objective (e.g., first dragon) significantly increases the probability of winning the game.

'''
Formulating Hypotheses
The central hypotheses revolve around the impact of securing key objectives on match outcomes:

Null Hypothesis (H₀): Securing a specific objective (such as dragon or baron) does not significantly affect the probability of winning the game.
Alternative Hypothesis (H₁): Securing a specific objective significantly increases the probability of winning the game.

These hypotheses are tested using chi-square tests for independence to determine whether the presence or absence of 
these objectives is statistically related to match outcomes.

Hypothesis Testing Process
Chi-Square Test: For each objective, a chi-square test was conducted to compare the distribution of wins and losses 
based on whether the objective was secured. This test assessed whether the observed differences in match outcomes 
could be attributed to securing the objective or whether they were due to random chance.

P-Values and Interpretation: The p-values from these tests were used to determine statistical significance. 
If the p-value was below the standard threshold of 0.05, the null hypothesis was rejected, indicating that 
securing the objective likely had a significant impact on the match outcome.

Key Findings
Most Influential Objectives: The objectives with the smallest p-values were identified as the most influential 
in determining match outcomes. For example, securing the first dragon is strongly associated with winning 
(chi-square statistic of 901.38, p-value of about 4.9e-198).

Fairness Consideration: A fairness analysis was conducted to ensure the model's performance was consistent 
across different groups. This analysis checked whether the model performed similarly for teams that secured certain 
objectives versus those that didn't, ensuring fairness in predictions.
'''

In [None]:
results = []

for obj in objectives:
    contingency_table = pd.crosstab(grouped_df[obj], grouped_df['result'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    results.append((obj, chi2, p))

# Convert results to DataFrame for easy interpretation
results_df = pd.DataFrame(results, columns=['Objective', 'Chi-Square Statistic', 'p-value'])

# Display results
print(results_df)

# Optionally, filter for significant results
significant_results = results_df[results_df['p-value'] < 0.05]
print("Significant Objectives:", significant_results)


                 Objective  Chi-Square Statistic        p-value
0              firstdragon            901.380248  4.917808e-198
1                  dragons           8396.327557   0.000000e+00
2              opp_dragons           8396.327557   0.000000e+00
3          elementaldrakes           5764.258614   0.000000e+00
4      opp_elementaldrakes           5764.258614   0.000000e+00
5                infernals            893.006370  5.456657e-192
6                mountains            858.508867  1.625151e-184
7                   clouds            813.646004  8.501331e-175
8                   oceans            864.493210  8.211450e-186
9                chemtechs            706.128451  1.641486e-151
10                hextechs            909.913467  1.185018e-195
11  dragons (type unknown)            851.002339  6.872429e-183
12                  elders            296.747216   3.648890e-65
13              opp_elders            296.747216   3.648890e-65
14             firstherald            44
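
To make the test in the loop above concrete, here is a small worked example on a single binary objective. The counts are made up for illustration (they are not taken from the dataset): 100 toy games where the first dragon was secured, 70 of them wins, and 100 where it was not, 30 of them wins.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (an assumption for illustration, not the real match data).
toy = pd.DataFrame({
    'firstdragon': [1] * 100 + [0] * 100,
    'result': [1] * 70 + [0] * 30 + [1] * 30 + [0] * 70,
})

# Rows: objective secured or not. Columns: loss (0) or win (1).
table = pd.crosstab(toy['firstdragon'], toy['result'])
chi2, p, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.2e}")
# Under independence every cell would hold 50 games, because both the
# row totals and the column totals are 100 out of 200.
print(expected)
```

The test compares the observed table with the expected counts. The further the observed cells sit from the expected ones, the larger the statistic and the smaller the p-value. For 2x2 tables, chi2_contingency applies Yates' continuity correction by default.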

In [None]:
for obj in significant_results['Objective']:
    contingency_table = pd.crosstab(grouped_df[obj], grouped_df['result'])
    fig = px.bar(contingency_table, barmode='group', 
                 title=f'Number of Wins/Losses by {obj.capitalize()}',
                 labels={obj: obj.capitalize(), 'value': 'Count'})
    fig.show()


In [None]:
# Filter out objectives that contain "opp"
filtered_results = results_df[~results_df['Objective'].str.contains('opp')]

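# Sort the objectives by p-value
# Note: many p-values underflow to 0.0, so ties among them are not ordered by strength of association
sorted_filtered_results = filtered_results.sort_values(by='p-value').head(5)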

# Display the top 5 most influential objectives without "opp"
print("Top 5 Most Influential Objectives (excluding 'opp'):")
print(sorted_filtered_results)


Top 5 Most Influential Objectives (excluding 'opp'):
             Objective  Chi-Square Statistic  p-value
29          inhibitors          18980.154280      0.0
1              dragons           8396.327557      0.0
3      elementaldrakes           5764.258614      0.0
26  firsttothreetowers           5443.550079      0.0
25       firstmidtower           3356.246277      0.0


In [None]:
# Sort by Chi-Square Statistic to see which objectives have the strongest association
sorted_by_chi2 = results_df.sort_values(by='Chi-Square Statistic', ascending=False)
print(sorted_by_chi2.head())


         Objective  Chi-Square Statistic  p-value
24      opp_towers          19332.679170      0.0
23          towers          19332.679170      0.0
30  opp_inhibitors          18980.154280      0.0
29      inhibitors          18980.154280      0.0
21      opp_barons          11134.015046      0.0
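
The raw chi-square statistic grows with both the sample size and the number of levels an objective can take, so a count column such as towers is not directly comparable with a binary column such as firstdragon. One common way to put the associations on a shared 0 to 1 scale is Cramér's V. The sketch below is a hedged illustration on made-up tables. The helper name cramers_v is hypothetical, and this ranking was not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table):
    # Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), using the uncorrected statistic.
    observed = np.asarray(table, dtype=float)
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy contingency tables (assumptions for illustration only).
strong = pd.DataFrame([[70, 30], [30, 70]])  # objective clearly linked to the result
none = pd.DataFrame([[50, 50], [50, 50]])    # objective unrelated to the result

v_strong = cramers_v(strong)
v_none = cramers_v(none)
print(f"Cramér's V (linked): {v_strong:.2f}")
print(f"Cramér's V (unrelated): {v_none:.2f}")
```

In the notebook, the same helper could be applied to pd.crosstab(grouped_df[obj], grouped_df['result']) for each objective, assuming those tables have at least two rows and two columns.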


## Step 5: Framing a Prediction Problem

In [None]:
# Predict the outcome of a League of Legends match (win/loss) based on in-game objective statistics, even if some objective data is missing.

'''
Prediction Problem Definition
The prediction problem involves determining the likelihood that a team will win a League of Legends match based on in-game objectives. 
This is a binary classification problem where the target variable (result) has two possible outcomes: win (1) or loss (0). 
The features used for prediction include various objectives such as firstdragon, firstbaron, firsttower, etc.

Justification for the Prediction Problem
Predicting match outcomes based on in-game objectives is highly relevant in the context of esports strategy. 
Understanding how specific objectives contribute to the overall likelihood of winning can help teams refine their 
strategies during matches. This prediction problem also ties directly into the earlier analysis, 
which identified the most influential objectives in determining match outcomes.

Type: Binary Classification

Response Variable: result (1 for win, 0 for loss)

Evaluation Metric
Accuracy: Chosen as the primary metric because the dataset is relatively balanced, and accuracy provides 
a straightforward measure of how often the model correctly predicts the match outcome.
F1-Score: Considered as a fallback if the dataset shows any imbalance, as it balances precision and recall.
'''
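
The metric choice above rests on the classes being roughly balanced. A quick way to check that, and to see both metrics side by side, is sketched below on a handful of made-up labels. The arrays y_true and y_pred are hypothetical and do not come from the dataset.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (assumptions for illustration): 1 = win, 0 = loss.
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0])

# Share of each class: values near 0.5 indicate a balanced target.
class_balance = y_true.value_counts(normalize=True)
print(class_balance)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # F1 for the positive class (win)
print(f"Accuracy: {accuracy:.2f}")
print(f"F1-score: {f1:.2f}")
```

On a balanced target the two metrics tend to tell a similar story. They diverge mainly when one class dominates, which is when F1 becomes the more informative choice.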


## Step 6: Baseline Model

In [None]:
'''
Feature Selection
For the baseline model, two key features were selected:

firstdragon: Indicates whether the team secured the first dragon.
firsttower: Indicates whether the team secured the first tower.
These features were chosen based on the earlier exploratory analysis and 
hypothesis testing, which indicated that these objectives were significant in determining match outcomes.

Model Choice
A simple Logistic Regression model was selected as the baseline model due to 
its interpretability and effectiveness in binary classification problems.

Pipeline Setup
A scikit-learn pipeline was created to streamline the preprocessing and 
model training process. The pipeline included:

StandardScaler: Applied to numerical features to standardize them.
LogisticRegression: The classifier used for predicting match outcomes.

The baseline model achieved an accuracy of 66.07%, providing a reference point for further model improvements. 
This performance, while moderate, sets the stage for enhancing the model by incorporating more features and 
fine-tuning hyperparameters in the next steps.
'''


In [None]:
# Select features and target
features = ['firstdragon', 'firsttower']
X = grouped_df[features]
y = grouped_df['result']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing: Numerical features scaling
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), features)
    ])

# Create pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression())])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline Model Accuracy: {accuracy:.4f}")


Baseline Model Accuracy: 0.6607
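
An accuracy figure is easier to read next to a trivial reference point. The sketch below compares a logistic regression with a majority-class DummyClassifier on a small synthetic dataset with two binary features. The data is made up for illustration, and both models are scored on the training rows only to keep the toy example deterministic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic rows (assumptions): ((firstdragon, firsttower), result, number of copies).
patterns = [
    ((1, 1), 1, 100), ((0, 0), 0, 100),
    ((1, 0), 1, 50), ((0, 1), 0, 50),
    ((1, 1), 0, 30), ((0, 0), 1, 30),  # noisy rows that no model can get right
]
X_toy = np.array([feats for feats, _, count in patterns for _ in range(count)])
y_toy = np.array([label for _, label, count in patterns for _ in range(count)])

# Majority-class reference: always predicts the most frequent label.
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
dummy_acc = accuracy_score(y_toy, dummy.predict(X_toy))

# Logistic regression on the same two binary features.
logit = LogisticRegression().fit(X_toy, y_toy)
logit_acc = accuracy_score(y_toy, logit.predict(X_toy))

print(f"Majority-class accuracy: {dummy_acc:.4f}")
print(f"Logistic regression accuracy: {logit_acc:.4f}")
```

In the notebook, the same comparison could be run on X_train and X_test from the cell above to see how far the baseline sits above always guessing one outcome.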


## Step 7: Final Model

In [None]:
'''
Feature Engineering
To improve upon the baseline model, one new feature was engineered and a second was planned:

total_objectives: Sum of the objectives secured by the team (in the code below, firstdragon and firsttower).
objective_ratio: Ratio of objectives secured by the team compared to the opponent (described here but not computed in the pipeline below).
These features were chosen to capture the cumulative impact of securing multiple objectives and to measure a team's performance relative to their opponent.

Hyperparameter Tuning
A RandomForestClassifier was selected as the final model, given its ability to handle complex interactions between features. 
To optimize the model, a GridSearchCV was performed to tune the following hyperparameters:

n_estimators: Number of trees in the forest.
max_depth: Maximum depth of the tree.
min_samples_split: Minimum number of samples required to split an internal node.
min_samples_leaf: Minimum number of samples required to be at a leaf node.
Pipeline and Model Training

The final model was constructed using a scikit-learn pipeline, which included preprocessing steps and the Random Forest classifier. 
The same train-test split from the baseline model was used to ensure consistency in evaluation.

The final model achieved an accuracy of 66.07%, identical to the baseline, with the best hyperparameters 
being {max_depth: None, min_samples_leaf: 1, min_samples_split: 2, n_estimators: 50}. 
Because the inputs are still only firstdragon and firsttower (plus their sum), the Random Forest has no new 
information to work with, so the engineered feature and the tuning did not improve on the baseline.
'''

In [None]:
# New Feature Engineering
def total_objectives(X):
    return X.sum(axis=1).values.reshape(-1, 1)

features = ['firstdragon', 'firsttower']
X = grouped_df[features]
y = grouped_df['result']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the new features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), features),
        ('total_objectives', FunctionTransformer(total_objectives, validate=False), features)
    ])

# Define the model
rf = RandomForestClassifier(random_state=42)

# Create the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', rf)])

# Define the hyperparameters to tune
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Perform grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and model evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Model Accuracy: {accuracy:.4f}")
print(f"Best Parameters: {grid_search.best_params_}")


Final Model Accuracy: 0.6607
Best Parameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 50}
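
The objective_ratio feature is described in the write-up but is not part of the pipeline above. A minimal sketch of how it might be computed follows, assuming the team and opponent counts live in columns such as dragons and opp_dragons. The column lists, the helper name objective_ratio, and the choice of 0.5 when neither side has secured anything are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical column choices; adjust to the objectives actually used.
TEAM_COLS = ['dragons', 'barons', 'towers']
OPP_COLS = ['opp_dragons', 'opp_barons', 'opp_towers']


def objective_ratio(X):
    # Share of all secured objectives that belong to the team, in [0, 1].
    team = X[TEAM_COLS].sum(axis=1).to_numpy(dtype=float)
    opp = X[OPP_COLS].sum(axis=1).to_numpy(dtype=float)
    total = team + opp
    # Fall back to 0.5 (an even split) when neither side has any objective.
    ratio = np.divide(team, total, out=np.full_like(team, 0.5), where=total > 0)
    return ratio.reshape(-1, 1)


ratio_transformer = FunctionTransformer(objective_ratio, validate=False)

# Toy rows (assumptions for illustration only).
toy = pd.DataFrame({
    'dragons': [3, 0, 0], 'barons': [1, 0, 0], 'towers': [8, 0, 2],
    'opp_dragons': [1, 0, 4], 'opp_barons': [0, 0, 1], 'opp_towers': [3, 0, 9],
})
ratios = ratio_transformer.fit_transform(toy)
print(ratios.ravel())
```

Note that end-of-game counts such as towers are recorded after the result is effectively decided, so a ratio built from them may describe the outcome more than predict it.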


## Step 8: Fairness Analysis

In [None]:
# Get predictions
y_pred = best_model.predict(X_test)

# Group X: Teams that secured first dragon
group_x_mask = X_test['firstdragon'] == 1
group_x_accuracy = accuracy_score(y_test[group_x_mask], y_pred[group_x_mask])

# Group Y: Teams that did not secure first dragon
group_y_mask = X_test['firstdragon'] == 0
group_y_accuracy = accuracy_score(y_test[group_y_mask], y_pred[group_y_mask])

print(f"Accuracy for Group X (firstdragon=1): {group_x_accuracy:.4f}")
print(f"Accuracy for Group Y (firstdragon=0): {group_y_accuracy:.4f}")


Accuracy for Group X (firstdragon=1): 0.6758
Accuracy for Group Y (firstdragon=0): 0.6491


In [None]:
# Calculate observed difference in accuracy
observed_diff = group_x_accuracy - group_y_accuracy

# Perform permutation test
n_permutations = 1000
diffs = []

for _ in range(n_permutations):
    shuffled = np.random.permutation(group_x_mask)
    shuffled_x_accuracy = accuracy_score(y_test[shuffled], y_pred[shuffled])
    shuffled_y_accuracy = accuracy_score(y_test[~shuffled], y_pred[~shuffled])
    diffs.append(shuffled_x_accuracy - shuffled_y_accuracy)

# Calculate p-value
p_value = np.mean(np.abs(diffs) >= np.abs(observed_diff))

print(f"Observed difference in accuracy: {observed_diff:.4f}")
print(f"Permutation test p-value: {p_value:.4f}")


Observed difference in accuracy: 0.0267
Permutation test p-value: 0.0810
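
The permutation test above draws from NumPy's global random state, so its p-value changes from run to run. The sketch below wraps the same idea in a seeded function so the result can be reproduced. The function name accuracy_gap_permutation_test and the toy labels are hypothetical, and this is an illustration rather than the exact procedure used above.

```python
import numpy as np


def accuracy_gap_permutation_test(y_true, y_pred, group_mask, n_permutations=1000, seed=42):
    rng = np.random.default_rng(seed)  # fixed seed for a reproducible p-value
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    mask = np.asarray(group_mask, dtype=bool)

    # Observed accuracy gap between the two groups.
    observed = correct[mask].mean() - correct[~mask].mean()

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle group membership while keeping the group sizes fixed.
        shuffled = rng.permutation(mask)
        diffs[i] = correct[shuffled].mean() - correct[~shuffled].mean()

    # Two-sided p-value: how often a shuffled gap is at least as large as the observed one.
    p_value = float(np.mean(np.abs(diffs) >= abs(observed)))
    return float(observed), p_value


# Toy example (an assumption): the model is always right for one group and always wrong for the other.
y_true_toy = np.array([1] * 20 + [0] * 20)
y_pred_toy = np.array([1] * 40)
mask_toy = np.array([True] * 20 + [False] * 20)

observed_gap, p_gap = accuracy_gap_permutation_test(y_true_toy, y_pred_toy, mask_toy)
print(f"Observed difference in accuracy: {observed_gap:.4f}")
print(f"Permutation test p-value: {p_gap:.4f}")
```

In the notebook, the call would take y_test, y_pred, and group_x_mask from the cells above. Reusing the same seed then returns the same p-value on every run.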


In [None]:
# We fail to reject the null hypothesis at the 5% significance level.

'''
Objective of Fairness Analysis
The goal of this fairness analysis is to evaluate whether the final model performs differently 
for teams that secured a specific in-game objective compared to those that did not. 
Specifically, we want to see if the model's accuracy is consistent across these groups, ensuring that 
no team is disadvantaged based on their in-game performance in securing objectives.

Groups for Comparison
Group X: Teams that secured the first dragon (firstdragon = 1).
Group Y: Teams that did not secure the first dragon (firstdragon = 0).

Null and Alternative Hypotheses
Null Hypothesis (H₀): The model's accuracy is the same for both groups (teams that secured the first dragon and those that didn't), and any observed difference is due to random chance.
Alternative Hypothesis (H₁): The model's accuracy is different for these groups, suggesting potential bias.

Procedure
Model Accuracy Calculation: The accuracy was calculated separately for Group X and Group Y.
Permutation Test: A permutation test was conducted to determine if the observed difference in accuracy is statistically significant.

Results
Observed Difference in Accuracy: The difference in model accuracy between Group X and Group Y was found to be 0.0267.
Permutation Test P-value: The p-value from the permutation test was 0.0810 in the run shown above. 
The test is not seeded, so this value varies slightly between runs.

Conclusion
Given the p-value of 0.0810, which is above the 0.05 threshold, 
we fail to reject the null hypothesis. This indicates that there is no strong evidence of bias in the model's 
performance between teams that secured the first dragon and those that did not. However, the p-value is not far 
above the threshold, suggesting a potential area for further investigation to ensure fairness.
'''

In [None]:
# Retrain a Random Forest on the full set of objective features for the interactive predictor below
X = grouped_df[['firstdragon', 'firstbaron', 'firsttower', 'dragons', 'towers', 
                'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes', 
                'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 
                'hextechs', 'elders', 'opp_elders', 'firstherald', 'heralds', 
                'opp_heralds', 'void_grubs', 'opp_void_grubs', 'barons', 
                'opp_barons', 'firstmidtower', 'firsttothreetowers', 
                'turretplates', 'opp_turretplates', 'inhibitors', 
                'opp_inhibitors']]
y = grouped_df['result']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Set the final model
final_model = grid_search.best_estimator_

def objective_win(**kwargs):
    # Default values for every objective column, with 'first' objectives as booleans
    default_values = {
        'firstdragon': False,
        'dragons': 0,
        'opp_dragons': 0,
        'elementaldrakes': 0,
        'opp_elementaldrakes': 0,
        'infernals': 0,
        'mountains': 0,
        'clouds': 0,
        'oceans': 0,
        'chemtechs': 0,
        'hextechs': 0,
        'dragons (type unknown)': 0,
        'elders': 0,
        'opp_elders': 0,
        'firstherald': False,
        'heralds': 0,
        'opp_heralds': 0,
        'void_grubs': 0,
        'opp_void_grubs': 0,
        'firstbaron': False,
        'barons': 0,
        'opp_barons': 0,
        'firsttower': False,
        'towers': 0,
        'opp_towers': 0,
        'firstmidtower': False,
        'firsttothreetowers': False,
        'turretplates': 0,
        'opp_turretplates': 0,
        'inhibitors': 0,
        'opp_inhibitors': 0,
    }
    
    # Update the default values with the provided arguments
    for key, value in kwargs.items():
        if key in default_values:
            default_values[key] = value
    
    # Convert boolean 'first' values to 1 (True) or 0 (False)
    for key in default_values:
        if 'first' in key:
            default_values[key] = 1 if default_values[key] else 0
    
    # Ensure the order of features matches the training set
    ordered_features = [default_values[col] for col in X.columns]  # X.columns should match the training data columns
    
    # Build a one-row DataFrame so the pipeline receives the feature names it was fitted with
    feature_frame = pd.DataFrame([ordered_features], columns=X.columns)
    
    # Predict the probability of winning
    win_probability = final_model.predict_proba(feature_frame)[0, 1]
    
    return f"Chance of winning: {win_probability:.2%}"

# Example usage:
print(objective_win(firstdragon=True, firstbaron=False, firsttower=True))

Chance of winning: 51.24%

