# 03. Predictive Modeling

**Purpose:** This notebook focuses on building and evaluating machine learning models to predict game popularity based on the aggregated features created in the previous notebook (`02_feature_engineering.ipynb`).

**Why This Matters:** This is the core prediction step where we leverage the collected and processed data to train models that can estimate a game's potential success (e.g., peak player count) before or shortly after launch.

**What to Expect:** After running this notebook, you will:
1. Load the aggregated feature dataset.
2. Prepare the data for modeling (feature selection, splitting, scaling).
3. Train several common regression models (e.g., Linear Regression, Random Forest, Gradient Boosting).
4. Evaluate the models using standard regression metrics (R², MAE, RMSE).
5. Compare model performance to identify the most promising approach.
6. Optionally save the best-performing model, together with the scaler and feature list, for future use.

## 1. Setup and Configuration

**Purpose:** Import necessary libraries for data manipulation, modeling, and evaluation.

**Why:** Provides the tools needed for the machine learning workflow.

In [None]:
# Imports and Setup
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib # For saving models

# Add src directory to path (optional, if utility functions are needed)
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
# from src.utils import configure_plotting # Optional

# Configure plotting (optional)
# configure_plotting()
plt.style.use('seaborn-v0_8-whitegrid')

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

## 2. Load Aggregated Data

**Purpose:** Load the feature set created by `02_feature_engineering.ipynb`.

**Why:** This dataset contains the features and target variables needed for training the models.

In [None]:
# Load the aggregated features dataset
data_path = os.path.join("..", "data", "aggregated_game_features.csv")
try:
    df_agg = pd.read_csv(data_path)
    print(f"Successfully loaded aggregated data from: {data_path}")
    print(f"Shape: {df_agg.shape}")
    display(df_agg.head())
except FileNotFoundError:
    print(f"Error: Aggregated data file not found at {data_path}")
    print("Please ensure '02_feature_engineering.ipynb' has been run successfully.")
    df_agg = pd.DataFrame() # Assign empty df to prevent errors later
except Exception as e:
    print(f"An error occurred loading the data: {e}")
    df_agg = pd.DataFrame()

## 3. Data Preparation for Modeling

**Purpose:** Select the features and target, drop or impute missing values, split the data into training and test sets, and scale the numerical features.

**Why:** Machine learning models require clean, numerical input and separate training/testing sets for reliable evaluation.
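
The cell below imputes and scales by hand. As an alternative, here is a minimal sketch that puts the same steps into a scikit-learn `Pipeline`, so the imputation median and the scaling statistics are learned from the training split only. The data is synthetic and the `demo_*` names are placeholders for illustration, not part of this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the aggregated features (values are made up).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "metacritic_score": rng.uniform(50, 95, size=40),
    "google_trends_avg_pre": rng.uniform(0, 100, size=40),
})
demo_df.loc[::7, "metacritic_score"] = np.nan  # inject some missing values
demo_y = pd.Series(rng.uniform(1_000, 50_000, size=40), name="peak_players")

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_df, demo_y, test_size=0.2, random_state=42
)

# Each step is fit on the training data only when .fit() is called.
demo_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# predict() reuses the training-set median and scaling on the test rows.
demo_preds = demo_pipeline.predict(X_te)
print(demo_preds.shape)
```

A single pipeline object also simplifies persistence, since the imputer, scaler, and model can be saved as one artifact.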

In [None]:
# Proceed only if data was loaded successfully
if not df_agg.empty:
    print("Preparing data for modeling...")

    # --- Define Features (X) and Target (y) ---
    # Example: Predict peak players in the first 7 days
    # Find the actual column name (it includes the number of days)
    peak_days_col = next((col for col in df_agg.columns if col.startswith('steam_peak_players_')), None)
    if peak_days_col is None:
        raise ValueError("Target column 'steam_peak_players_*d' not found. Check aggregation notebook.")
    
    target = peak_days_col
    print(f"Target variable (y): {target}")

    # Select potential features (pre-release metrics + static info)
    # Exclude post-launch outcomes (except the target), identifiers, and dates
    potential_features = [
        'metacritic_score',
        'google_trends_avg_pre',
        'reddit_posts_avg_pre',
        'twitter_count_avg_pre',
        'reddit_subs_pre',
        'reddit_active_pre',
        # Add YouTube pre-release features if aggregated
        # 'youtube_total_views_pre', # Example if added in aggregator
        # 'youtube_avg_likes_pre',   # Example if added in aggregator
        # Potentially add genre, price etc. if properly encoded
    ]
    
    # Filter to only include features present in the DataFrame
    features = [f for f in potential_features if f in df_agg.columns]
    print(f"Selected features (X): {features}")

    # --- Handle Missing Values ---
    # Drop rows where the target variable is missing
    df_model = df_agg.dropna(subset=[target]).copy()
    print(f"Shape after dropping rows with missing target: {df_model.shape}")

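    # Impute missing values in features (using median for simplicity)
    # Note: the median is computed before the train/test split, so test rows
    # influence the imputed values. Fit the imputer on the training split only
    # if a stricter evaluation is needed.
    for col in features:
        if df_model[col].isnull().any():
            median_val = df_model[col].median()
            # Assign the result back; chained inplace fillna is deprecated in recent pandas
            df_model[col] = df_model[col].fillna(median_val)
            print(f"Imputed missing values in '{col}' with median ({median_val:.2f})")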
            
    # Verify no missing values remain in features or target
    print(f"Missing values in target: {df_model[target].isnull().sum()}")
    print(f"Missing values in features: {df_model[features].isnull().sum().sum()}")

    # --- Split Data ---
    X = df_model[features]
    y = df_model[target]
    
    if len(df_model) < 10:
        print("Warning: Very small dataset, results may not be reliable. Consider collecting more data.")
        # Handle small dataset case if necessary, e.g., skip splitting or use cross-validation
        X_train, X_test, y_train, y_test = X, X, y, y # Use all data for train/test - not ideal!
    else:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
        print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")

    # --- Preprocessing (Scaling) ---
    # Scale numerical features
    # Note: If categorical features were added, OneHotEncoder would be included here
    scaler = StandardScaler()
    
    # Fit scaler on training data only, then transform both train and test
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert scaled arrays back to DataFrames for easier inspection (optional)
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=features, index=X_train.index)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=features, index=X_test.index)
    print("\nNumerical features scaled using StandardScaler.")
    # display(X_train_scaled.head())
else:
    print("Skipping data preparation as aggregated data is empty.")
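
The small-dataset branch above mentions cross-validation as an alternative to reusing the same rows for training and testing. The following is a minimal sketch of that idea with `KFold` and `cross_val_score`, run on synthetic data. The array names, the fold count, and the choice of MAE as the score are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic small dataset: 30 rows, 3 features, roughly linear target.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(30, 3))
y_small = X_small @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

# Scaling lives inside the pipeline so it is refit on each training fold.
cv_pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MAE is reported as a negative number.
neg_mae = cross_val_score(
    cv_pipeline, X_small, y_small, cv=cv, scoring="neg_mean_absolute_error"
)
cv_mae = -neg_mae

print(f"MAE per fold: {np.round(cv_mae, 3)}")
print(f"Mean MAE: {cv_mae.mean():.3f}")
```

Every row is used for testing exactly once, which gives a less optimistic estimate than scoring a model on the data it was trained on.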

## 4. Model Training and Evaluation

**Purpose:** Train different regression models and evaluate their performance on the test set.

**Why:** To compare how well different algorithms can predict the target variable using the prepared features.

**Expected Output:** Performance metrics (R², MAE, RMSE) for each model.
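
To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy on four made-up values. The numbers and the `demo_*` names are illustrative only. The scikit-learn functions used in the cell below should give the same results on the same inputs.

```python
import numpy as np

# Made-up actual and predicted peak player counts.
demo_true = np.array([100.0, 200.0, 300.0, 400.0])
demo_hat = np.array([110.0, 190.0, 330.0, 370.0])

residuals = demo_true - demo_hat  # [-10, 10, -30, 30]

# MAE: average absolute error, in the same units as the target.
demo_mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error; penalizes large misses more.
demo_rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)
demo_r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {demo_mae:.2f}")   # MAE:  20.00
print(f"RMSE: {demo_rmse:.2f}")  # RMSE: 22.36
print(f"R2:   {demo_r2:.3f}")    # R2:   0.960
```

RMSE exceeds MAE here because the two 30-player misses dominate the squared errors.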

In [None]:
# Proceed only if data preparation was successful
model_results = {}
if 'X_train_scaled' in locals():
    print("\n--- Training and Evaluating Models ---")
    
    # Define models to try
    models = {
        'Linear Regression': LinearRegression(),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
    }
    
    # Loop through models, train, predict, and evaluate
    for name, model in models.items():
        print(f"\nTraining {name}...")
        start_time = datetime.now()

        # Train the model
        model.fit(X_train_scaled, y_train)

        # Stop the clock right after fitting so the duration covers training only
        duration = datetime.now() - start_time

        # Make predictions on the test set
        y_pred = model.predict(X_test_scaled)

        # Evaluate the model
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Store results
        model_results[name] = {
            'MAE': mae,
            'RMSE': rmse,
            'R2': r2,
            'Training Time': duration
        }
        
        print(f"  MAE: {mae:.2f}")
        print(f"  RMSE: {rmse:.2f}")
        print(f"  R2 Score: {r2:.3f}")
        print(f"  Training Time: {duration}")
        
        # Optional: Feature importance for tree-based models
        if hasattr(model, 'feature_importances_'):
            importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
            print("\n  Feature Importances:")
            print(importances.head(10)) # Display top 10
            # plt.figure(figsize=(10, 6))
            # importances.plot(kind='bar')
            # plt.title(f'{name} Feature Importances')
            # plt.ylabel('Importance')
            # plt.show()
            
else:
    print("Skipping model training as data preparation failed or was skipped.")

## 5. Model Comparison

**Purpose:** Compare the performance metrics of the different models.

**Why:** To identify the best-performing model based on the chosen evaluation criteria (e.g., lowest RMSE, highest R²).

**Expected Output:** A summary table or visualization comparing model metrics.

In [None]:
# Compare model results if available
if model_results:
    results_df = pd.DataFrame.from_dict(model_results, orient='index')
    results_df = results_df.sort_values('RMSE', ascending=True) # Sort by RMSE (lower is better)
    
    print("\n--- Model Performance Comparison ---")
    display(results_df)
    
    # Identify the best model based on RMSE
    best_model_name = results_df.index[0]
    print(f"\nBest model based on RMSE: {best_model_name}")
    
    # Optional: Plot comparison
    # results_df[['MAE', 'RMSE']].plot(kind='bar', figsize=(10, 6))
    # plt.title('Model Error Comparison (Lower is Better)')
    # plt.ylabel('Error')
    # plt.xticks(rotation=0)
    # plt.show()
    # 
    # results_df['R2'].plot(kind='bar', figsize=(10, 6))
    # plt.title('Model R2 Score Comparison (Higher is Better)')
    # plt.ylabel('R2 Score')
    # plt.ylim(min(0, results_df['R2'].min() - 0.1), 1.0)
    # plt.xticks(rotation=0)
    # plt.show()
else:
    print("Skipping model comparison as no models were trained.")

## 6. Save Best Model (Optional)

**Purpose:** Persist the trained model for future use (e.g., deployment or making predictions on new data).

**Why:** Avoids retraining the model every time predictions are needed.

**Expected Output:** Confirmation that the model, scaler, and feature list have been saved.
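
For context on how the saved artifacts might be used later, here is a minimal round-trip sketch: it trains a toy model, saves the model, scaler, and feature list with `joblib`, loads them back, and predicts for one new game. The data, the file names inside the temporary directory, and the `demo_*` and `loaded_*` names are assumptions for illustration, not the paths this notebook writes.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature training set standing in for the real data.
demo_features = ["metacritic_score", "google_trends_avg_pre"]
rng = np.random.default_rng(1)
train_df = pd.DataFrame(rng.uniform(0, 100, size=(20, 2)), columns=demo_features)
train_y = rng.uniform(1_000, 50_000, size=20)

demo_scaler = StandardScaler().fit(train_df)
train_scaled = pd.DataFrame(demo_scaler.transform(train_df), columns=demo_features)
demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(train_scaled, train_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the three artifacts.
    joblib.dump(demo_model, os.path.join(tmp_dir, "model.joblib"))
    joblib.dump(demo_scaler, os.path.join(tmp_dir, "scaler.joblib"))
    joblib.dump(demo_features, os.path.join(tmp_dir, "features.joblib"))

    # Later, for example in another session, load them back.
    loaded_model = joblib.load(os.path.join(tmp_dir, "model.joblib"))
    loaded_scaler = joblib.load(os.path.join(tmp_dir, "scaler.joblib"))
    loaded_features = joblib.load(os.path.join(tmp_dir, "features.joblib"))

# A new game whose columns arrive in a different order.
new_game = pd.DataFrame({"google_trends_avg_pre": [55.0], "metacritic_score": [82.0]})

# The saved feature list restores the column order used during training.
X_new = new_game[loaded_features]
X_new_scaled = pd.DataFrame(loaded_scaler.transform(X_new), columns=loaded_features)
new_pred = loaded_model.predict(X_new_scaled)
print(f"Predicted peak players: {new_pred[0]:.0f}")
```

Saving the feature list alongside the model is what makes the reordering step possible; without it, a column mismatch could go unnoticed or raise an error at prediction time.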

In [None]:
# Save the best performing model and the scaler
if model_results and 'best_model_name' in locals():
    best_model = models[best_model_name] # Get the trained model instance
    
    # Create directory if it doesn't exist
    model_dir = os.path.join("..", "models")
    os.makedirs(model_dir, exist_ok=True)
    
    # Define file paths
    model_path = os.path.join(model_dir, f"{best_model_name.lower().replace(' ', '_')}_model.joblib")
    scaler_path = os.path.join(model_dir, "scaler.joblib")
    features_path = os.path.join(model_dir, "features.joblib")
    
    try:
        # Save the model
        joblib.dump(best_model, model_path)
        print(f"\nBest model ({best_model_name}) saved to: {model_path}")
        
        # Save the scaler
        joblib.dump(scaler, scaler_path)
        print(f"Scaler saved to: {scaler_path}")
        
        # Save the list of features used by the model
        joblib.dump(features, features_path)
        print(f"Feature list saved to: {features_path}")
        
    except Exception as e:
        print(f"\nError saving model/scaler/features: {e}")
else:
    print("\nSkipping model saving: No best model identified or training failed.")

## 7. Conclusion and Next Steps

**Purpose:** Summarize the modeling results and suggest future directions.

**Why:** Provides closure to the modeling process and identifies areas for improvement.

**Next Actions:**
1.  **Hyperparameter Tuning:** Fine-tune the best model's parameters for potentially better performance.
2.  **Advanced Feature Engineering:** Create more sophisticated features (e.g., interaction terms, time-based features if more data is available).
3.  **Try Other Models:** Experiment with different algorithms (e.g., XGBoost, LightGBM, SVM).
4.  **Deployment:** If performance is satisfactory, consider deploying the model to make predictions on new, unseen games.
5.  **Collect More Data:** A larger dataset, especially with more games and longer historical data, will likely improve model robustness and accuracy.
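
As a starting point for the hyperparameter tuning step, here is a minimal sketch using `GridSearchCV` on a random forest. The data is synthetic, and the grid values, fold count, and scoring choice are assumptions to adapt rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: 60 rows, 4 features, 2 of them informative.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(60, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)

# A deliberately small grid so the search runs quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated RMSE: {-search.best_score_:.3f}")
```

By default the search refits the best configuration on all the data passed to `fit`, so `search.best_estimator_` can be evaluated on a held-out test set afterwards.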

In [None]:
# Final summary message
print("\nPredictive Modeling Notebook Complete.")
if model_results:
    print(f"Model comparison complete. Best model based on RMSE: {best_model_name}")
    if 'model_path' in locals() and os.path.exists(model_path):
        print(f"Best model artifacts saved in: {model_dir}")
else:
    print("Model training and evaluation were skipped or failed. Check previous cell outputs.")

---
*End of Notebook*