2.2 Regression - Predicting hero win rate in Overwatch 2

            Dataset presentation:

Number of samples: 1440

Number of features: 126

Target variable: "Win Rate, %"

Unit: Percentage (%)

            Dataset description

This dataset contains performance statistics for Overwatch 2 heroes across four competitive seasons (Season 01 to Season 04) and Quick Play mode.
Each row represents a hero performance snapshot at a given skill tier and season. The dataset includes detailed numerical metrics such as damage dealt, eliminations, deaths, healing, objective time, and accuracy-related statistics, as well as categorical information like hero name, role, and skill tier.

Link for the dataset: "https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics/data"

            Problem statement

The goal of this study is to train regression models that predict the win rate of an Overwatch 2 hero, in percent, from in-game performance statistics and contextual variables such as role and skill tier.



            Data Analysis

The target variable, win rate, has a relatively narrow distribution, mostly between 40% and 60%, which is expected for a balanced competitive game.

Eliminations, damage dealt, and objective time are positively correlated with win rate, while deaths show a negative correlation.
However, most correlations remain moderate, indicating that win rate depends on complex interactions between features rather than single dominant variables.

Role-based analysis shows different statistical profiles: supports dominate healing metrics, damage heroes show higher variance in damage output, and tanks exhibit more stable survivability-related metrics.

No extreme win rates (e.g. 0% or 100%) are observed, which is consistent with a clean dataset.
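The correlation analysis described above can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis actually run on the dataset: the helper `correlations_with_target` and the feature column names are hypothetical, and only the target name "Win Rate, %" is taken from the notebook.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with the target, highest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Synthetic stand-in for the real data (column names are assumptions).
rng = np.random.default_rng(0)
n = 500
elims = rng.normal(15, 3, n)
deaths = rng.normal(7, 1.5, n)
win_rate = 50 + 0.8 * (elims - 15) - 1.2 * (deaths - 7) + rng.normal(0, 2, n)

demo = pd.DataFrame({
    "Eliminations / 10min": elims,
    "Deaths / 10min": deaths,
    "Win Rate, %": win_rate,
})

demo_corr = correlations_with_target(demo, "Win Rate, %")
print(demo_corr)
```

On the real data, the same helper could be called on the concatenated DataFrame to list which metrics move with or against the win rate.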

            Regression models

Linear Regression

A baseline linear regression model is used to assess linear relationships. It serves as the reference point against which the other models are compared.

Ridge and Lasso Regression

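Regularized linear models add a penalty on coefficient size: Ridge shrinks coefficients toward zero, while Lasso can set some of them exactly to zero and thus performs implicit feature selection. The code below uses fixed regularization strengths (alpha = 1.0 for Ridge, alpha = 0.01 for Lasso); on the test split, neither model improves on the baseline.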

Random Forest Regressor

A non-linear ensemble model is trained to capture complex interactions between features. The code below uses 200 trees and leaves the maximum depth at its default value. Contrary to the initial expectation, this model obtains the weakest test scores of the four.
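Hyperparameters such as the Ridge regularization strength could be chosen by cross-validation rather than fixed by hand. The sketch below shows one way to do this with scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_regression`, and the grid of alpha values is an arbitrary assumption, so the selected value says nothing about the Overwatch 2 dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real features and target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, noise=10.0, random_state=42
)

# Scaling lives inside the pipeline so it is refit on every training fold.
ridge_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Candidate regularization strengths (illustrative values only).
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    ridge_pipe,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign
print(best_alpha, cv_rmse)
```

The same pattern applies to the Random Forest by swapping the estimator and using keys such as `model__n_estimators` and `model__max_depth` in the grid.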

In [None]:
# Install dependencies (separate cell ideally)
# %pip install pandas scikit-learn

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# -------------------------
# Load datasets
# -------------------------
paths = [
    './dataset/data/ow2_quickplay_heroes_stats__2023-05-06.csv',
    './dataset/data/ow2_season_01_FINAL_heroes_stats__2023-05-06.csv',
    './dataset/data/ow2_season_02_FINAL_heroes_stats__2023-05-06.csv',
    './dataset/data/ow2_season_03_FINAL_heroes_stats__2023-05-06.csv',
    './dataset/data/ow2_season_04_FINAL_heroes_stats__2023-06-27.csv'
]

dfs = [pd.read_csv(p) for p in paths]
data = pd.concat(dfs, ignore_index=True)

# -------------------------
# Define target
# -------------------------
target = 'Win Rate, %'

data[target] = pd.to_numeric(data[target], errors='coerce')
data = data.dropna(subset=[target]).reset_index(drop=True)

y = data[target]
X = data.drop(columns=[target])

# -------------------------
# Identify variable types
# -------------------------
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

X[categorical_cols] = X[categorical_cols].fillna("UNKNOWN").astype(str)

# -------------------------
# Preprocessing pipelines
# -------------------------
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='UNKNOWN')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ]
)

# -------------------------
# Train / test split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------
# Models
# -------------------------
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'RandomForest': RandomForestRegressor(n_estimators=200, random_state=42)
}

results = {}

# -------------------------
# Training loop
# -------------------------
for name, model in models.items():
    pipe = Pipeline(steps=[
        ('preprocess', preprocessor),
        ('model', model)
    ])

    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)

    r2 = r2_score(y_test, preds)

    results[name] = {'RMSE': rmse, 'R2': r2}

# Summarize the metrics as a table (one row per model)
results_df = pd.DataFrame(results).T
results_df


Model,RMSE,R2
Linear,2.652707,0.67
Ridge,2.688208,0.661109
Lasso,2.745769,0.64644
RandomForest,2.836484,0.622692


            Evaluation metrics

To evaluate the regression models, two metrics were used:

RMSE (Root Mean Squared Error), which measures the typical size of the prediction error in percentage points and penalizes large errors more heavily than small ones.

R² score, which indicates the proportion of variance in the target variable explained by the model.

Lower RMSE values indicate better predictive accuracy, while higher R² values indicate better explanatory power.
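To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up win rates. The functions `rmse` and `r2` are illustrative helpers following the standard definitions; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Square root of the mean squared difference between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred) -> float:
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up win rates in percent, for illustration only.
y_true_demo = [48.0, 52.0, 50.0, 55.0]
y_pred_demo = [49.0, 51.0, 50.0, 53.0]

demo_rmse = rmse(y_true_demo, y_pred_demo)  # sqrt(6 / 4), about 1.22 points
demo_r2 = r2(y_true_demo, y_pred_demo)      # 1 - 6 / 26.75, about 0.78
print(demo_rmse, demo_r2)
```

Because RMSE is expressed in the unit of the target, a value of about 1.22 here means predictions are off by roughly 1.2 percentage points in the root-mean-square sense.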

            
            Model comparison and interpretation of results

The table compares the performance of four regression models: Linear Regression, Ridge, Lasso, and Random Forest.

The Linear Regression model achieves the lowest RMSE (2.65) and the highest R² score (0.67) among all tested models.
This indicates that it provides the most accurate predictions and explains approximately 67% of the variance in hero win rates.

Ridge and Lasso regression show slightly worse performance, suggesting that regularization does not significantly improve the model in this context.
The Random Forest model performs the worst, which may be due to the limited data size, noise in the features, or hyperparameters that were left largely at their defaults.

            Final model selection

Linear Regression was selected as the final model.
It provides the best balance between predictive performance, simplicity, and interpretability.

Additionally, linear regression is well suited for understanding the influence of individual features on the win rate, which is an important aspect of this analysis.
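One way to read feature influence from a linear model is to fit it on standardized features and rank the coefficients by absolute value. The sketch below illustrates this on synthetic data: the feature names and the coefficients used to generate the target are assumptions made for the example, not values estimated from the Overwatch 2 dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features with known effects on a synthetic win rate.
rng = np.random.default_rng(1)
n = 400
features = pd.DataFrame({
    "eliminations": rng.normal(15, 3, n),
    "deaths": rng.normal(7, 1.5, n),
    "objective_time": rng.normal(60, 10, n),
})
win_rate = (
    50
    + 0.9 * (features["eliminations"] - 15)
    - 1.0 * (features["deaths"] - 7)
    + 0.05 * (features["objective_time"] - 60)
    + rng.normal(0, 1.5, n)
)

# Standardizing puts all coefficients on the same scale:
# each one is the change in win rate per standard deviation of the feature.
X_scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(X_scaled, win_rate)

coefs = pd.Series(model.coef_, index=features.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs)
```

With the pipeline used in this notebook, the fitted coefficients could be read from the `model` step and matched to feature names from the `preprocess` step; coefficients of strongly correlated features should be interpreted with caution.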

            Limitations of the model

Despite its good performance, the model has several limitations.
It does not account for player skill differences, team compositions, hero synergies, or balance changes between patches.

Moreover, win rate is influenced by many external factors that are not present in the dataset, which explains why the R² score is not closer to 1.

            Conclusion

This study shows that hero performance statistics can be used to predict win rates with reasonable accuracy.
Among the tested models, linear regression achieved the best results, with an RMSE of approximately 2.65 and an R² score of 0.67.

These results show that statistical features explain a significant portion of win rate variability, while leaving room for further improvements using additional contextual data.