# Can You Predict a Win from the First 10 Minutes?
## Predicting League of Legends match outcomes using early-game data  
### Simone Dolcecanto 

## Overview
This notebook explores whether the outcome of a *League of Legends* match can be predicted using only information from the first 10 minutes of gameplay.

The dataset contains early-game metrics from ~10,000 high-ranked matches. For each match, we have:
- **38 features** (19 per team) describing the game state at minute 10  
- a binary target label: **`blueWins`** (1 if the blue team wins, 0 otherwise)

The workflow is:
1. Data preprocessing  
2. Exploratory analysis (correlations and distributions)  
3. Dimensionality reduction (PCA)  
4. Clustering in PCA space  
5. Supervised classification (Random Forest) and evaluation


In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sea  # mainly for the pairplot and the heatmap
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve, silhouette_score
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,  # split into train and test
)
from sklearn.preprocessing import StandardScaler  # scale the data for PCA

In [None]:
# Load the dataset
folder_path = ' '  # set this to the folder that contains the CSV
file_name = 'high_diamond_ranked_10min.csv'
file_path = os.path.join(folder_path, file_name)
lol_total = pd.read_csv(file_path)

print(lol_total.describe())

## Pairplot (sanity check for linear separability)

A Seaborn pairplot is a quick way to inspect whether the classes look linearly separable in low-dimensional projections.

Here, winning and losing matches overlap strongly across most feature combinations, suggesting that **a simple linear separator is unlikely to perform well** without additional feature engineering or non-linear models.


In [None]:
# hue colors the points by match outcome; the full grid can take several minutes to render
sea.pairplot(lol_total.sample(1000, random_state=42), hue="blueWins")

# Data Preprocessing


In [None]:
# Use the first column (gameId) as the row index

lol_total.set_index('gameId', inplace=True)  # inplace=True modifies lol_total directly
print(lol_total) 



In [None]:
# Missing values per column and number of fully duplicated rows (both are zero here)
print(lol_total.isnull().sum(), lol_total.duplicated().sum())

## Data quality checks

The dataset contains **no missing values** and **no duplicate rows**.

Next, the data is split into **training** and **test** sets to support fair evaluation of downstream models.
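
The split in this notebook is a plain random split. As a hedged sketch on synthetic data (not the match dataset), the variant below adds `stratify`, an optional `train_test_split` argument that the notebook does not use, to show how the class ratio can be kept identical in both partitions. The arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match table: 1,000 rows, 3 features, 50/50 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0, 1] * 500)

# stratify=y keeps the proportion of each class the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With a target that is already close to balanced, a plain random split usually gives similar proportions, so stratification is a safeguard rather than a requirement.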


In [None]:
# Hold out 30% of the matches for testing; the remaining ~70% is used for training
lol_train, lol_test = train_test_split(lol_total, test_size=0.3, random_state=42)
# random_state=42 fixes the seed, so the split is the same on every run
print(lol_train)

In [None]:
# Z-score standardization, fitted on the training set only
features = lol_train.drop('blueWins', axis=1)  # drop the target column (axis=1 means columns)
target_var = lol_train["blueWins"]
lol_zp = StandardScaler().fit_transform(features)  # returns a NumPy array of scaled features
lol_z = pd.DataFrame(lol_zp, columns=features.columns, index=features.index)  # back to a DataFrame
lol_z["blueWins"] = target_var  # add the target back to the scaled DataFrame
print(lol_z)
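
As a small worked example of what the scaling step computes, the sketch below applies the z-score formula by hand with NumPy. The arrays are made up for illustration; the point is that the mean and standard deviation come from the training rows only and are then reused for any other rows.

```python
import numpy as np

# Toy data: three training rows and one held-out row, two features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

train_z = (train - mu) / sigma  # each column now has mean 0 and std 1
test_z = (test - mu) / sigma    # held-out rows reuse the training statistics

print(train_z)
print(test_z)
```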

# Data Exploration
## Correlation with `blueWins`

A correlation matrix helps highlight which early-game variables are most associated with the final outcome (`blueWins`) and also reveals groups of features that are strongly intercorrelated (multicollinearity).
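
To read the target column of a correlation matrix more easily, the features can be ranked by the absolute value of their correlation with the label. The sketch below does this on synthetic data; the column names `goldDiff` and `wardsPlaced` are hypothetical stand-ins, not necessarily the dataset's column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
gold_diff = rng.normal(size=n)  # hypothetical informative feature
wards = rng.normal(size=n)      # hypothetical uninformative feature
wins = (gold_diff + rng.normal(size=n) > 0).astype(int)

df = pd.DataFrame({"goldDiff": gold_diff, "wardsPlaced": wards, "blueWins": wins})

# Correlation of every feature with the target, ranked by magnitude.
target_corr = df.corr()["blueWins"].drop("blueWins")
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```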


In [None]:
# Correlation matrix with blueWins as the target variable
lol_z_corr = lol_z.corr()

plt.figure(figsize=(25,25))
sea.heatmap(lol_z_corr, annot=True, cmap='Purples', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

## Interpretation

As expected, the features most associated with `blueWins` capture **relative advantage** between the two teams in the first 10 minutes.

In particular, differences in **gold** and **experience** show moderate positive correlation with blue victory. Additionally, **`redDeaths`** correlates positively with `blueWins`, indicating that early combat outcomes often translate into a meaningful advantage.


In [None]:
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
lol_notarg = lol_z.drop(columns="blueWins")
lol_z_pca = pca.fit_transform(lol_notarg)  # fit PCA and project the scaled features onto the components


## Principal Component Analysis (PCA)

PCA is used to reduce dimensionality and make the structure of the data easier to visualize and interpret.

A scree plot (explained variance by component) is used with the **elbow method** to select a reasonable number of principal components.
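
The quantities shown on a scree plot can be derived directly from the eigenvalues of the covariance matrix. The sketch below does this with NumPy on synthetic data (three correlated features plus two independent ones, all invented for illustration) and counts how many components are needed to reach 95% cumulative variance.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
# Three noisy copies of one latent factor, plus two independent features.
X = np.hstack(
    [latent + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)]
    + [rng.normal(size=(500, 2))]
)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column

cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # sorted from largest to smallest
ratio = eigenvalues / eigenvalues.sum()      # proportion of variance per component
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(ratio.round(3), n_keep)
```

The elbow method looks at where `ratio` stops dropping sharply, while a variance threshold such as 95% looks at `cumulative`; the two criteria can suggest different numbers of components.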


In [None]:

exp_var = pca.explained_variance_ratio_
plt.plot(range(1, len(exp_var) + 1), exp_var, 'o-')  # scree plot
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Explained Variance')
plt.title('Scree Plot')
plt.show()  # the elbow is not sharp here, so the numerical values are printed below


In [None]:
cum_exp_var = exp_var.cumsum()
for i, var in enumerate(exp_var):
    print(f"Variance explained by Principal Component {i+1}: {var}")
    print(f"Cumulative explained variance after Principal Component {i+1}: {cum_exp_var[i]}")
    

In [None]:
eigenvalues = pca.explained_variance_

for i, eigenvalue in enumerate(eigenvalues):
    print(f"Eigenvalue of Principal component {i+1}: {eigenvalue}")

## PCA results

After scaling the features and running PCA, **4 principal components** are retained using the elbow method.

These components explain **~58%** of the total variance. While additional components would capture more variance, they would also make interpretation and visualization less clear.

Next, the retained components are explored using 2D biplots.


In [None]:
pca = PCA(n_components=4)  # refit with the 4 retained components
lol_pca = pca.fit_transform(lol_notarg)
pc_columns = [f"pca_comp_{i}" for i in range(4)]  # column names pca_comp_0 .. pca_comp_3
lol_pca = pd.DataFrame(lol_pca, columns=pc_columns, index=lol_notarg.index)

lol_pca["blueWins"] = target_var  # add the target back

print(lol_pca)

In [None]:
def biplot_2d(pca, data, labels, component_num, target_values):
    """Plot the loadings of two consecutive principal components.

    `data` and `target_values` are only needed if the optional sample
    scatter below is enabled.
    """
    components = pca.components_
    x_load = components[component_num]
    y_load = components[component_num + 1]

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.set_xlabel(f'PC{component_num}')
    ax.set_ylabel(f'PC{component_num + 1}')

    # One dashed line from the origin per original variable, labeled at its tip
    for label, x, y in zip(labels, x_load, y_load):
        ax.plot([0, x], [0, y], color='r', linewidth=1, linestyle='--')
        ax.text(x, y, label, color='g')

    # Optional: overlay the samples (scores and loadings are on different scales)
    # ax.scatter(data.iloc[:, component_num], data.iloc[:, component_num + 1],
    #            c=target_values, cmap='viridis', s=1.5, alpha=0.5)

    plt.show()


# One plot per consecutive pair of components: (PC0, PC1), (PC1, PC2), (PC2, PC3)
for component_num in range(3):
    # target_values is passed as a list because a pandas Series did not work here
    biplot_2d(pca, lol_pca, lol_notarg.columns, component_num, lol_pca["blueWins"].values.tolist())


## Biplots (interpreting the components)

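A biplot normally overlays the samples in PCA space with the loading vectors of the original variables. Here only the loadings are drawn (the sample scatter is commented out in the code), so the plots show which original variables contribute most to each principal component.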

A reasonable interpretation is:
- **PC0** (highest variance) primarily reflects **overall resource acquisition and advantage**.
- **PC1** appears linked to **early combat performance** (kills/deaths-related dynamics).
- **PC2** emphasizes **short-term objective control**, such as securing early dragons or heralds, to the extent that these features are present in the data.

These interpretations are used as qualitative guidance rather than definitive causal claims.
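
One way to support this kind of reading is to list, for each component, the variables with the largest absolute loadings. The sketch below uses a small hypothetical loading matrix laid out like `pca.components_` (one row per component); the feature names, the numbers, and the helper `top_loadings` are invented for illustration.

```python
import numpy as np

feature_names = ["goldDiff", "xpDiff", "kills", "wardsPlaced"]  # hypothetical names
# Hypothetical loadings: one row per component, one column per feature.
components = np.array([
    [0.70, 0.68, 0.20, 0.05],
    [-0.15, -0.10, 0.95, 0.25],
])

def top_loadings(components, names, k=2):
    """Return, for each component, the k features with the largest |loading|."""
    result = []
    for row in components:
        order = np.argsort(-np.abs(row))[:k]  # indices sorted by descending magnitude
        result.append([(names[i], float(row[i])) for i in order])
    return result

top = top_loadings(components, feature_names)
for i, pairs in enumerate(top):
    print(f"PC{i}: {pairs}")
```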


In [None]:
# Loadings: the weight of each original feature in each principal component
print(pd.DataFrame(pca.components_, columns=lol_notarg.columns, index=['PC0', 'PC1', 'PC2', 'PC3']))

## Clustering

Clustering provides an additional way to summarize match states by grouping games with similar early-game characteristics.

K-Means is applied in PCA space. The number of clusters is selected using the elbow method by examining the **within-cluster sum of squares (WCSS / inertia)**.
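
For reference, the inertia reported by K-Means is the sum of squared distances from each point to the centroid of its cluster. The sketch below computes it by hand for four made-up points, with a hypothetical helper named `wcss`, to show why the value falls as the number of clusters grows.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Two tight pairs of points that are far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_cluster = wcss(points, np.array([0, 0, 0, 0]))
two_clusters = wcss(points, np.array([0, 0, 1, 1]))

print(one_cluster, two_clusters)  # 104.0 4.0
```

Because inertia always decreases as k increases, the elbow method looks for the k after which the decrease becomes small rather than for the minimum.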


In [None]:
# Elbow plot helper for K-Means
def optim_kmeans(data, max_k):
    """Fit K-Means for k = 1..max_k and plot the inertia (WCSS) for each k."""
    ks = []
    inertia = []

    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducibility
        kmeans.fit(data)

        ks.append(k)
        inertia.append(kmeans.inertia_)

    plt.figure(figsize=(10, 5))
    plt.plot(ks, inertia, 'o-')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia/WCSS')
    plt.grid(True)
    plt.show()


# Cluster on the principal components only, so the blueWins label does not influence the result
optim_kmeans(lol_pca.drop(columns="blueWins"), 10)

In [None]:
# 4 clusters is an acceptable choice from the elbow plot
pc_only = lol_pca.drop(columns="blueWins")  # cluster on the components, not on the label
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed keeps cluster ids stable
kmeans.fit(pc_only)
lol_pcak = lol_pca.copy()
lol_pcak["kmeans"] = kmeans.labels_  # PCA DataFrame with the cluster label appended

plt.scatter(lol_pcak.iloc[:, 0], lol_pcak.iloc[:, 1], c=kmeans.labels_, cmap="viridis", alpha=0.2)
labels = np.unique(kmeans.labels_)
colors = plt.cm.viridis(np.linspace(0, 1, len(labels)))
plt.xlabel('PC0')
plt.ylabel('PC1')
plt.legend(handles=[plt.scatter([], [], color=c, label=f'Cluster {i}') for i, c in zip(labels, colors)],
           title="Clusters",
           loc="best")
plt.show()



The clusters distribute coherently in PCA space.

A high-level reading is:
- **Clusters 0–1**: match states dominated by **combat dynamics**
- **Clusters 2–3**: match states dominated by **resource accumulation**

Next, clustering quality is evaluated using the **silhouette coefficient** and **SSE**.
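
As a reminder of what the silhouette coefficient measures, the sketch below implements it directly for a tiny made-up dataset. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster; the score is `(b - a) / max(a, b)`. The helper `silhouette_mean` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean of (b - a) / max(a, b) over all points (clusters need >= 2 members)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        a = dist[i, same].sum() / (same.sum() - 1)  # own cluster, self excluded
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_mean(points, np.array([0, 0, 1, 1]))  # natural grouping
bad = silhouette_mean(points, np.array([0, 1, 0, 1]))   # mixed grouping

print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.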


In [None]:
# Internal cluster validity measure: silhouette coefficient
# mean over samples of (b - a) / max(a, b), where a is the mean intra-cluster distance
# and b is the mean distance to the nearest other cluster
pc_space = lol_pcak.drop(columns=["blueWins", "kmeans"])  # score on the components only
silhouette_avg = silhouette_score(pc_space, kmeans.labels_)
print(f"Silhouette average score on the PCA DataFrame: {silhouette_avg}")
# Previously recorded value (computed with the blueWins and kmeans columns included): 0.25717978374136885, a weak separation
silhouette_avg_orig = silhouette_score(lol_notarg, kmeans.labels_)
print(f"Silhouette average score on the scaled train DataFrame: {silhouette_avg_orig}")
# Previously recorded value: 0.10824271489828395, unsurprisingly lower still
# SSE (sum of squared errors), i.e. the K-Means inertia
sse_pca = kmeans.inertia_
print(f"SSE on the PCA DataFrame: {sse_pca}")
# Previously recorded value: 73025.85170924194. SSE depends on the data scale and the number of samples,
# so it is more useful for comparing values of k than as an absolute score.


The silhouette score is noticeably lower in the full scaled feature space (~0.11) than in PCA space (~0.26), an absolute drop of about 0.15. Both values are low, which indicates that the clusters overlap substantially.


# Classification: Random Forest

A Random Forest classifier is trained to predict `blueWins` from early-game features.

Random Forests are a strong baseline for tabular data and handle non-linear relationships well. Hyperparameters are tuned using **GridSearchCV**.
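
As a minimal sketch of how a grid search works, the example below tunes a small Random Forest on synthetic data generated with `make_classification`. The grid, the dataset, and the seeds are illustrative assumptions and are much smaller than the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem (not the match dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),  # fixed seed for repeatable results
    param_grid={"n_estimators": [10, 30], "max_depth": [3, None]},
    cv=3,                # each combination is scored with 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean CV accuracy of the best combination
```

Every combination in the grid is fitted once per fold, so the cost grows with the product of the grid sizes. `best_score_` is a cross-validated score on the training data, not a test-set score.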


In [None]:
# Classification: grid search over Random Forest hyperparameters with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),  # fixed seed for reproducibility
    param_grid={
        'n_estimators': [10, 100, 200],
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    cv=5
)
grid_search.fit(lol_train.drop('blueWins', axis=1), lol_train['blueWins'])  # features, target
best_params = grid_search.best_params_
best_score = grid_search.best_score_  # mean cross-validated accuracy of the best combination

print("Best Hyperparameters:", best_params)
print("Best Score:", best_score)

# Output recorded from one run:
# Best Hyperparameters: {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 10}
# Best Score: 0.7298626174981924


In [None]:
# Refit with the selected hyperparameters and evaluate on the held-out test set
rf = RandomForestClassifier(max_depth=5, min_samples_leaf=4, min_samples_split=10, n_estimators=10, random_state=42)
train_feat = lol_train.drop(['blueWins'], axis=1)
train_label = lol_train.blueWins
rf.fit(train_feat, train_label)
test_pred = rf.predict(lol_test.drop(['blueWins'], axis=1))

print(accuracy_score(lol_test['blueWins'], test_pred))
# 0.7206477732793523

# Repeated stratified 10-fold cross-validation on the training set (10 repeats, 100 fits in total)
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
cv_scores = cross_val_score(rf, train_feat, train_label, cv=rskf, scoring='accuracy')
print(f"average accuracy: {cv_scores.mean()}")
# average accuracy: 0.7262890968103529

The final model reaches a test accuracy of **~72%** (0.721), slightly below the best cross-validation score observed during grid search (0.730).

Repeated Stratified K-Fold validation on the training set gives a mean accuracy of ~0.726, within ~0.006 of the test accuracy, which supports the stability of this estimate.
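
When judging stability, it can help to report the spread of the fold scores alongside their mean. The sketch below does this on synthetic data; the dataset, the model settings, and the smaller fold and repeat counts are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification problem (not the match dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

# 5 folds repeated 3 times gives 15 accuracy values.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"{scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```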

Given that predictions are made using only the first 10 minutes of matches that can last up to ~1 hour, this is a meaningful result.


In [None]:
# Imports for the confusion matrix and the ROC curve
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, auc



## Confusion Matrix

A confusion matrix summarizes true/false positives and negatives, helping identify the types of errors the model makes and whether misclassifications are balanced across classes.
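
As a small worked example with made-up labels, the sketch below counts the four cells by hand and derives accuracy, precision, and recall from them. The matrix is laid out like the one returned by scikit-learn's `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
import numpy as np

# Made-up ground truth and predictions for ten matches (1 = blue wins).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted win, actual win
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted loss, actual win
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted loss, actual loss
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted win, actual loss

matrix = np.array([[tn, fp], [fn, tp]])
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(matrix)
print(accuracy, precision, recall)  # 0.7 0.6 0.75
```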


In [None]:
con_mtx = confusion_matrix(lol_test['blueWins'], test_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=con_mtx, display_labels=rf.classes_)
disp.plot(cmap='viridis')
plt.show()

In [None]:
#ROC curve
prob_t=rf.predict_proba(lol_test.drop(['blueWins'], axis=1))
fpr, tpr, thresholds = roc_curve(lol_test['blueWins'], prob_t[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='red', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

## ROC Curve

The ROC curve evaluates the classifier across decision thresholds by comparing true positive rate and false positive rate.

An **AUC of ~0.81** indicates good discriminative ability when predicting match outcomes using only early-game information.
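
One way to read an AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes the AUC this way for four made-up scores; the helper `auc_by_ranking` is hypothetical and compares every positive with every negative, so it is only practical for small inputs.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc_value = auc_by_ranking(y_true, scores)

print(auc_value)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.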
