<a href="https://colab.research.google.com/github/kmora2b/speedrun_ml_project/blob/main/speedrun_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Speedrun World Record Times

This notebook aims to predict world record times for speedrunning based on features such as the game, category, and platform. It uses multiple regression models, including Linear Regression, Random Forest, and Gradient Boosting, with a focus on improving model performance through hyperparameter tuning, regularization, and cross-validation.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import normaltest, skew


## Load and Explore the Dataset

The dataset contains speedrunning records for various games, including features such as game name, category, platform, and the world record time.
The goal is to predict the world record time for a given game and category based on these features.

### Data Source
The dataset is publicly available on Kaggle and was created by Matheus Turatti.

**Citation:** Turatti, M. (n.d.). Game Speedrun Records [Data set]. Kaggle. https://www.kaggle.com/datasets/matheusturatti/game-speedrun-records

In [None]:

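# NOTE: this URL is kept from the original notebook. It does not appear to point
# directly at a CSV file, so pd.read_csv may fail; if it does, download the
# dataset from Kaggle and pass the path of the local CSV file instead.
url = "https://www.kaggleusercontent.com/datasets/matheusturatti/game-speedrun-records"
data = pd.read_csv(url)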

print("Dataset Overview:")
data.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()
print(data.head())


## Data Cleaning

Missing values are dropped to ensure a clean dataset for training the models. The categorical features are one-hot encoded for compatibility with machine learning algorithms.

In [None]:

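# Drop every row that has at least one missing value.
cleaned_data = data.dropna()
categorical_features = ['game', 'category', 'platform']
# One-hot encode the categorical columns; drop_first=True omits one level per column as the baseline.
data_encoded = pd.get_dummies(cleaned_data, columns=categorical_features, drop_first=True)
# Separate the predictors from the target (the world record time).
X = data_encoded.drop('time', axis=1)
y = data_encoded['time']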
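To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame. The game and platform values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "game": ["A", "B", "A"],
    "platform": ["PC", "PC", "N64"],
    "time": [100.0, 250.0, 120.0],
})

# drop_first=True drops the first level of each encoded column,
# which then serves as the implicit baseline category.
toy_encoded = pd.get_dummies(toy, columns=["game", "platform"], drop_first=True)
print(toy_encoded)
```

Each categorical column with k distinct levels contributes k - 1 indicator columns, so a column such as `game` with many distinct titles can widen the feature matrix considerably.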


## Split Data for Training and Testing

The dataset is split into training and testing sets (80%-20%) so that each model can be evaluated on data it was not trained on.

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Train Multiple Models

Linear Regression, Ridge Regression, Lasso Regression, Random Forest, and Gradient Boosting models are trained and their performance is compared using RMSE and R-squared metrics.

In [None]:

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)
    results[name] = {'RMSE': rmse, 'R^2': r2}
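For reference, the two metrics can be computed by hand. The sketch below uses three made-up true and predicted times (illustrative values, not dataset results) to show what `mean_squared_error` and `r2_score` measure.

```python
import numpy as np

# Hypothetical true and predicted times in seconds (illustrative values).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])

# RMSE: square root of the mean squared residual, in the same units as the target.
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: one minus the ratio of the residual sum of squares to the total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse_by_hand:.2f}, R^2 = {r2_by_hand:.2f}")  # RMSE = 8.16, R^2 = 0.99
```

RMSE is expressed in the units of the target, while R-squared is unitless, which is why the two are reported together.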


## Hyperparameter Tuning and Cross-Validation

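Random Forest is tuned with a grid search over `n_estimators`, `max_depth`, and `min_samples_split` (36 combinations, each scored with 5-fold cross-validation on the training set). Gradient Boosting, left at its default settings, is evaluated with 5-fold cross-validation on the full dataset.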

In [None]:

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
rf_grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error')
rf_grid_search.fit(X_train, y_train)

best_rf_model = rf_grid_search.best_estimator_
rf_predictions = best_rf_model.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
rf_r2 = r2_score(y_test, rf_predictions)

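gbr_model = GradientBoostingRegressor(random_state=42)
# Scores are negated MSE values, one per fold, computed on the full dataset (X, y).
cross_val_scores = cross_val_score(gbr_model, X, y, cv=5, scoring='neg_mean_squared_error')
# Square root of the mean fold MSE; this is not identical to the mean of the per-fold RMSEs.
mean_cv_rmse = np.sqrt(-cross_val_scores.mean())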
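The cell above takes the square root of the mean fold MSE. A per-fold view can also be useful, because it shows how much the error varies between folds. The following is a minimal sketch on synthetic data from `make_regression`, which stands in for the real features as an assumption; the reduced `n_estimators=50` is only there to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the notebook itself uses X and y).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

scores = cross_val_score(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=5, scoring="neg_mean_squared_error",
)

fold_rmse = np.sqrt(-scores)           # one RMSE per fold
pooled_rmse = np.sqrt(-scores.mean())  # square root of the mean MSE, as in the cell above

print("Per-fold RMSE:", np.round(fold_rmse, 2))
print(f"Mean of fold RMSEs: {fold_rmse.mean():.2f} (std {fold_rmse.std():.2f})")
print(f"Root of mean MSE:   {pooled_rmse:.2f}")
```

The mean of the fold RMSEs is never larger than the root of the mean MSE, so the two numbers should not be mixed when comparing models.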


## Results and Performance Comparison

The RMSE and R-squared metrics for each model are compared. The performance of the tuned Random Forest model and Gradient Boosting is highlighted.

In [None]:

print("Model Results:")
for name, metrics in results.items():
    print(f"{name}: RMSE = {metrics['RMSE']:.2f}, R^2 = {metrics['R^2']:.2f}")

print("Best Random Forest Model After Tuning:")
print(f"RMSE: {rf_rmse:.2f}, R^2: {rf_r2:.2f}")

print("Gradient Boosting Cross-Validation Mean RMSE:")
print(f"Mean CV RMSE: {mean_cv_rmse:.2f}")


## Visualize Model Performance

A bar plot compares the RMSE values across different models.

In [None]:

plt.figure(figsize=(12, 6))
model_names = list(results.keys()) + ["Tuned Random Forest", "Gradient Boosting CV"]
rmse_values = [metrics['RMSE'] for metrics in results.values()] + [rf_rmse, mean_cv_rmse]

sns.barplot(x=rmse_values, y=model_names, orient='h')
plt.title("Model RMSE Comparison")
plt.xlabel("RMSE")
plt.ylabel("Model")
plt.show()


## Feature Importance from Best Random Forest Model

The top 10 most important features are visualized for the best Random Forest model.

In [None]:

feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title("Top 10 Feature Importances - Best Random Forest")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
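Because the categorical columns are one-hot encoded, each bar in the plot refers to a single level (for example one specific game) rather than to the original feature. One possible way to read the importances per original feature is to sum them by column prefix. The sketch below assumes the `<feature>_<level>` naming that `pd.get_dummies` produces; the importance values and the helper `source_of` are hypothetical.

```python
import pandas as pd

# Hypothetical importances for one-hot columns named "<feature>_<level>".
toy_importance = pd.DataFrame({
    "Feature": ["game_B", "game_C", "category_Any%", "platform_PC"],
    "Importance": [0.40, 0.25, 0.20, 0.15],
})

source_features = ["game", "category", "platform"]

def source_of(column):
    """Map an encoded column name back to the feature it was derived from."""
    for feature in source_features:
        if column.startswith(feature + "_"):
            return feature
    return column  # not one-hot encoded; keep the name as-is

# Sum the importances of all indicator columns that share a source feature.
grouped = (
    toy_importance
    .assign(Source=toy_importance["Feature"].map(source_of))
    .groupby("Source")["Importance"].sum()
    .sort_values(ascending=False)
)
print(grouped)
```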


## Discussion and Suggestions for Improvement

1. **Learning and Takeaways:**
   - Random Forest achieved the best performance.
   - Gradient Boosting also performed well with cross-validation.
2. **Challenges:**
   - Linear Regression struggled, possibly because of multicollinearity among the one-hot encoded features.
   - Lasso Regression, run with its default `alpha=1.0`, may have penalized features too heavily.
3. **Suggestions for Improvement:**
   - Experiment with stacking ensemble techniques.
   - Perform feature selection or dimensionality reduction (e.g., PCA).
   - Include additional features such as player statistics or game metadata.
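
As an illustration of the stacking suggestion, here is a minimal sketch using scikit-learn's `StackingRegressor` on synthetic data. The data, the choice of base models, the `Ridge` final estimator, and the reduced `n_estimators=50` are all assumptions made for the example, not results or settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the speedrun dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Base models produce out-of-fold predictions (cv=5) that the final
# estimator learns to combine.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_r2 = stack.score(X_te, y_te)  # R^2 on the held-out synthetic split
print(f"Stacked R^2 on synthetic data: {stack_r2:.2f}")
```

Whether stacking helps on the speedrun data would need to be checked with the same train/test split and metrics used above.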