# 03. Predictive Modeling for Game Popularity

**Project Context:** This notebook represents the culmination of the data pipeline, following data collection (`01_data_collection.ipynb`) and feature engineering (`02_feature_engineering.ipynb`). It leverages the structured, aggregated dataset produced in the preceding stage to build and evaluate models aimed at predicting game popularity.

**Purpose of this Notebook:** The primary objective is to apply machine learning techniques to forecast a key indicator of game success, such as peak player counts within a defined post-launch window (e.g., 7 days). This involves:
*   Loading the pre-processed and feature-engineered dataset.
*   Performing final data preparation steps specific to modeling (e.g., feature selection, robust missing value imputation, scaling, train-test splitting).
*   Training a suite of regression models to predict the chosen popularity metric.
*   Rigorously evaluating these models using standard performance metrics.
*   Comparing model performance to identify the most effective algorithm(s) for this prediction task.
*   Optionally, saving the best-performing model and associated preprocessing steps for potential future use or deployment.

**Methodological Significance:** This stage translates the curated data and engineered features into actionable predictive insights. By systematically training and evaluating different models, we aim to identify patterns and relationships in the data that correlate with game popularity. The choice of features (primarily pre-release indicators) and the target variable (a post-launch success metric) is crucial for developing a model that can offer predictive value before or shortly after a game's release.

**Expected Outcomes:** Upon successful execution of this notebook, the following will be achieved:
1.  **Data Loading and Final Preparation:** The `aggregated_game_features.csv` file will be loaded, and the data will be prepared for modeling, including defining features (X) and the target variable (y), handling any residual missing values, and splitting the data into training and testing sets.
2.  **Feature Scaling:** Numerical features will be scaled to ensure they are on a comparable range, which is beneficial for many machine learning algorithms.
3.  **Model Training:** Several regression algorithms (e.g., Linear Regression, Random Forest Regressor, Gradient Boosting Regressor) will be trained on the prepared training data.
4.  **Model Evaluation:** Each trained model will be evaluated on the unseen test data using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²).
5.  **Performance Comparison:** The performance metrics of all trained models will be compiled and compared to identify the most suitable model for the prediction task.
6.  **Feature Importance Analysis (for applicable models):** Insights into which features are most influential in predicting game popularity will be extracted from models like Random Forest or Gradient Boosting.
7.  **Model Persistence (Optional):** The best-performing model, along with its associated data scaler, may be saved to disk for future predictions or deployment scenarios.

## 1. Setup and Configuration: Initializing the Modeling Environment

**Purpose:** This initial section is dedicated to establishing the necessary Python environment for the predictive modeling tasks. It involves importing all required libraries, configuring system paths for custom module access (if needed), and setting up any global configurations for plotting or data display.

**Key Actions Undertaken:**
*   **Standard Library Imports:** Essential Python libraries for data manipulation, numerical operations, machine learning, and visualization are imported. This typically includes:
    *   `sys` and `os`: For system-level operations, such as path manipulation (though less critical here if `src` utilities are not directly used).
    *   `pandas`: For DataFrame creation, manipulation, and analysis, especially for handling the input feature set.
    *   `numpy`: For numerical computations, often used implicitly by `pandas` and `scikit-learn`.
    *   `matplotlib.pyplot` and `seaborn`: For creating the statistical graphics used to visualize data and model results.
    *   `datetime`: For handling and recording timestamps, particularly the notebook's execution start time.
*   **Scikit-learn Imports:** A comprehensive suite of modules from `scikit-learn` is imported to cover the entire modeling workflow:
    *   `model_selection.train_test_split`: For dividing the dataset into training and testing subsets.
    *   `preprocessing.StandardScaler`: For standardizing numerical features.
    *   `preprocessing.OneHotEncoder` (and `compose.ColumnTransformer`): For handling categorical features if they were to be included and needed encoding (though the current feature set might be primarily numerical).
    *   `pipeline.Pipeline`: For streamlining sequences of data transformations and modeling steps.
    *   Various model algorithms: `linear_model.LinearRegression`, `ensemble.RandomForestRegressor`, `ensemble.GradientBoostingRegressor` are common choices for regression tasks.
    *   `metrics`: Functions like `mean_absolute_error`, `mean_squared_error`, and `r2_score` for evaluating model performance.
*   **Model Persistence Library:** `joblib` is imported for saving (serializing) trained machine learning models and preprocessing objects (like scalers) to disk, allowing them to be reloaded and used later without retraining.
*   **Path Configuration (Optional):** The Python system path (`sys.path`) is augmented with the project root (the parent directory of the notebook) so that utility functions from `src.utils` (e.g., `configure_plotting`) can be imported. This ensures custom modules are discoverable.
*   **Visualization Styling:** Plotting styles (e.g., `seaborn-v0_8-whitegrid`) are set to ensure consistent and aesthetically pleasing visualizations throughout the notebook.
*   **Pandas Display Configuration:** `pandas` display options are configured (e.g., `display.max_columns`, `display.width`) to improve the readability of DataFrames when printed or displayed.
*   **Execution Timestamp:** The start time of the notebook's execution is recorded and printed. This serves as a useful reference for tracking the duration of the modeling operations.

**Significance:** A correctly configured environment is paramount for the seamless execution of the modeling pipeline. This setup ensures that all necessary tools and algorithms are available, data can be effectively processed and visualized, and models can be trained, evaluated, and saved, thereby preventing runtime errors and facilitating a robust and reproducible machine learning workflow.

In [None]:
# Imports and Setup
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib # For saving models

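# Add the project root to the path so that `src` is importable (optional)
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# src.utils is optional: skip the custom configuration if it cannot be imported
try:
    from src.utils import configure_plotting
    configure_plotting()
except ImportError:
    print("src.utils not found; skipping custom plot configuration.")
plt.style.use('seaborn-v0_8-whitegrid')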

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

## 2. Load Aggregated Data: Accessing the Engineered Feature Set

**Purpose:** This section is dedicated to loading the structured, aggregated dataset that was meticulously prepared and saved by the preceding `02_feature_engineering.ipynb` notebook. This dataset is the cornerstone for all subsequent modeling activities.

**Process Details:**
*   **File Path Construction:** The path to the aggregated data file is constructed using `os.path.join()`. This ensures platform-independent path creation. The file is expected to be located in the `../data/` directory relative to the current notebook and is typically named `aggregated_game_features.csv`.
*   **Data Loading with Pandas:** The `pd.read_csv()` function from the `pandas` library is used to load the data from the CSV file into a DataFrame, conventionally named `df_agg`.
*   **Error Handling:** A `try-except` block is implemented to manage potential issues during file loading:
    *   `FileNotFoundError`: If the specified CSV file does not exist (e.g., if `02_feature_engineering.ipynb` was not run or did not save its output correctly), a specific error message is printed, guiding the user to check the previous notebook's execution. An empty DataFrame is assigned to `df_agg` to prevent subsequent cells from failing due to a missing DataFrame.
    *   General `Exception`: Catches any other errors that might occur during the file loading process (e.g., corrupted file, permission issues), prints an error message, and also assigns an empty DataFrame to `df_agg`.
*   **Initial Data Inspection (Post-Load):** Upon successful loading, several key pieces of information are displayed to confirm the data integrity:
    *   A success message indicating the source file path.
    *   The shape of the loaded DataFrame (`df_agg.shape`), showing the number of games (rows) and features (columns).
    *   The first few rows of the DataFrame (`display(df_agg.head())`) to provide a quick preview of the data structure and content.

**Significance:** Loading the aggregated data is a critical first step in the modeling notebook. The `df_agg` DataFrame contains the rich set of engineered features (e.g., pre-release hype metrics, post-launch performance indicators) and the target variable(s) that will be used to train and evaluate predictive models. The quality and structure of this loaded data directly influence the potential success of the modeling efforts. The error handling ensures that the notebook can proceed gracefully or provide informative feedback if the necessary input data is unavailable.

In [None]:
# Load the aggregated features dataset
data_path = os.path.join("..", "data", "aggregated_game_features.csv")
try:
    df_agg = pd.read_csv(data_path)
    print(f"Successfully loaded aggregated data from: {data_path}")
    print(f"Shape: {df_agg.shape}")
    display(df_agg.head())
except FileNotFoundError:
    print(f"Error: Aggregated data file not found at {data_path}")
    print("Please ensure '02_feature_engineering.ipynb' has been run successfully.")
    df_agg = pd.DataFrame() # Assign empty df to prevent errors later
except Exception as e:
    print(f"An error occurred loading the data: {e}")
    df_agg = pd.DataFrame()

## 3. Data Preparation for Modeling: Crafting the Input for Machine Learning

**Purpose:** This section is dedicated to transforming the loaded `df_agg` (aggregated features) into a format suitable for training and evaluating machine learning models. It involves a sequence of critical steps: defining the predictive task (features and target), handling data quality issues (missing values), partitioning the data, and applying necessary transformations (like scaling).

**Key Steps and Rationale:**

1.  **Conditional Execution:**
    *   **Action:** The entire data preparation block is executed only if `df_agg` is not empty (i.e., data was successfully loaded in the previous step).
    *   **Why:** Prevents errors if the input data is unavailable.

2.  **Define Features (X) and Target (y):**
    *   **Action:**
        *   **Target Variable (y):** A specific column from `df_agg` is chosen as the outcome to predict. In this project, a common target is a post-launch popularity metric, such as `steam_peak_players_7d` (peak Steam player count within the first 7 days after release). The code dynamically finds this column name as it might include the number of days (e.g., `steam_peak_players_Xd`).
        *   **Feature Set (X):** A list of `potential_features` is defined. These are columns from `df_agg` believed to have predictive power for the target. Typically, these include pre-release metrics (e.g., `metacritic_score`, `google_trends_avg_pre`, `reddit_posts_avg_pre`, `twitter_count_avg_pre`, `reddit_subs_pre`, `reddit_active_pre`) and static game information. Post-launch outcomes (other than the target itself), identifiers, and raw date columns are usually excluded from `X`.
        *   The actual `features` list is then filtered to include only those `potential_features` that are present in `df_agg.columns`.
    *   **Why:** Clearly separates the input variables (features) that the model will learn from, and the output variable (target) it will try to predict. This is fundamental to supervised machine learning.

3.  **Handle Missing Values:**
    *   **Action (Target Variable):** Rows where the `target` variable is missing (`NaN`) are dropped from the DataFrame (`df_model = df_agg.dropna(subset=[target]).copy()`).
    *   **Why (Target Variable):** Models cannot be trained or evaluated effectively if the true outcome is unknown. Dropping these rows ensures data integrity for the modeling task.
    *   **Action (Features):** For the selected `features` in `X`, missing values are imputed. A simple strategy (used here) is to fill `NaN`s with the median value of that feature column. The median is often preferred over the mean because it is less sensitive to outliers. Note that the median is computed over all rows of `df_model` before the train-test split, so the imputed values carry a small amount of information from the test rows; computing the median on the training rows only would avoid this.
    *   **Why (Features):** Most machine learning algorithms cannot handle missing values directly. Imputation provides a complete dataset for the model. More sophisticated imputation techniques could be used if necessary.
    *   **Verification:** After imputation, the code checks and prints the count of remaining missing values in both the target and features to ensure the cleaning was successful.

4.  **Split Data into Training and Testing Sets:**
    *   **Action:** The `df_model` (now cleaned of critical missing values) is split into `X_train`, `X_test`, `y_train`, and `y_test` using `sklearn.model_selection.train_test_split`. A `test_size` (e.g., 0.2 for 20%) determines the proportion of data held out for testing, and `random_state` ensures reproducibility of the split.
    *   **Why:** This is a cornerstone of model evaluation. The model is trained on `X_train` and `y_train`. Its performance is then assessed on `X_test` and `y_test`, which it has not seen during training. This provides an unbiased estimate of how well the model generalizes to new, unseen data.
    *   **Small Dataset Handling:** A check is included for very small datasets (fewer than 10 samples). In that case a warning is printed and all rows are used for both training and testing. Metrics computed this way are in-sample and will overstate performance, so this fallback only keeps the notebook runnable and signals that more data is needed.

5.  **Preprocessing (Feature Scaling):**
    *   **Action:** Numerical features in `X_train` and `X_test` are scaled using `sklearn.preprocessing.StandardScaler`. The scaler is `fit` *only* on the `X_train` data to learn the mean and standard deviation. Then, both `X_train` and `X_test` are `transform`ed using this fitted scaler.
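    *   **Why:** Algorithms that rely on gradient-based optimization (e.g., `SGDRegressor`, neural networks), on regularization (e.g., Ridge, Lasso), or on distances (e.g., SVMs, KNN) perform better or converge faster when numerical features are on a similar scale. Of the models trained in this notebook, ordinary least-squares `LinearRegression` produces the same predictions with or without scaling (only its coefficients change), and the tree-based ensembles are insensitive to monotonic rescaling of features. Scaling is applied anyway so that the same prepared inputs can be reused with scale-sensitive models. `StandardScaler` standardizes features by removing the mean and scaling to unit variance. Fitting only on training data prevents data leakage from the test set into the training process.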
    *   **Output:** The scaled features (`X_train_scaled`, `X_test_scaled`) are typically converted back to pandas DataFrames for easier inspection, though models can work directly with NumPy arrays.
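The target lookup in step 2 takes the first column whose name starts with `steam_peak_players_`. If the aggregated file ever contains more than one such column, a stricter match may be safer. The following is a minimal sketch under the assumption that target columns follow the pattern `steam_peak_players_<N>d`; the helper name `find_target_column` and the demo column `steam_peak_players_rank` are hypothetical and not part of the project.

```python
import re

import pandas as pd

# Assumed naming pattern: "steam_peak_players_" + day count + "d"
TARGET_PATTERN = re.compile(r"^steam_peak_players_(\d+)d$")


def find_target_column(columns, days=None):
    """Return the matching target column; raise if the choice is ambiguous."""
    matches = {}
    for col in columns:
        m = TARGET_PATTERN.match(col)
        if m:
            matches[int(m.group(1))] = col
    if days is not None:
        if days not in matches:
            raise ValueError(f"No target column for a {days}-day window.")
        return matches[days]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one target column, found: {sorted(matches.values())}"
        )
    return next(iter(matches.values()))


# Hypothetical columns: only the second one matches the pattern
demo = pd.DataFrame(
    columns=["metacritic_score", "steam_peak_players_7d", "steam_peak_players_rank"]
)
print(find_target_column(demo.columns))  # steam_peak_players_7d
```

Passing `days=7` selects a specific window when several are present; without it, the helper refuses to guess.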

**Significance:** This meticulous data preparation phase ensures that the data fed into the machine learning models is clean, correctly formatted, and appropriately partitioned. Each step addresses potential issues that could otherwise hinder model training or lead to unreliable performance evaluations. The outcome of this section is a set of training and testing data (`X_train_scaled`, `y_train`, `X_test_scaled`, `y_test`) ready for the modeling algorithms.
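One way to keep both imputation and scaling free of test-set information is to bundle them in a `Pipeline` that is fitted on the training rows only. The sketch below runs on synthetic data rather than `df_model`, and uses `SimpleImputer`, which this notebook does not import; treat it as an illustration of the idea, not as the notebook's actual preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_model[features]; values are illustrative only
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "metacritic_score": rng.uniform(50, 95, size=50),
    "google_trends_avg_pre": rng.uniform(0, 100, size=50),
})
X_demo.iloc[::7, 0] = np.nan  # inject some missing values
y_demo = pd.Series(rng.uniform(1_000, 50_000, size=50))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Medians, means, and standard deviations are learned from the training rows only
X_tr_scaled = preprocess.fit_transform(X_tr)
# The test rows reuse those training statistics
X_te_scaled = preprocess.transform(X_te)

print("Training medians:", preprocess.named_steps["impute"].statistics_)
print("Remaining NaNs:", int(np.isnan(X_tr_scaled).sum() + np.isnan(X_te_scaled).sum()))
```

A fitted pipeline of this kind is also a single object, which simplifies saving and reusing the preprocessing later.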

In [None]:
# Proceed only if data was loaded successfully
if not df_agg.empty:
    print("Preparing data for modeling...")

    # --- Define Features (X) and Target (y) ---
    # Example: Predict peak players in the first 7 days
    # Find the actual column name (it includes the number of days)
    peak_days_col = next((col for col in df_agg.columns if col.startswith('steam_peak_players_')), None)
    if peak_days_col is None:
        raise ValueError("Target column 'steam_peak_players_*d' not found. Check aggregation notebook.")
    
    target = peak_days_col
    print(f"Target variable (y): {target}")

    # Select potential features (pre-release metrics + static info)
    # Exclude post-launch outcomes (except the target), identifiers, and dates
    potential_features = [
        'metacritic_score',
        'google_trends_avg_pre',
        'reddit_posts_avg_pre',
        'twitter_count_avg_pre',
        'reddit_subs_pre',
        'reddit_active_pre',
        'youtube_total_views_pre', # Example if added in aggregator
        'youtube_avg_likes_pre',   # Example if added in aggregator
    ]
    
    # Filter to only include features present in the DataFrame
    features = [f for f in potential_features if f in df_agg.columns]
    print(f"Selected features (X): {features}")

    # --- Handle Missing Values ---
    # Drop rows where the target variable is missing
    df_model = df_agg.dropna(subset=[target]).copy()
    print(f"Shape after dropping rows with missing target: {df_model.shape}")

    # Impute missing values in features (using median for simplicity)
    for col in features:
        if df_model[col].isnull().any():
            median_val = df_model[col].median()
            # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas
            df_model[col] = df_model[col].fillna(median_val)
            print(f"Imputed missing values in '{col}' with median ({median_val:.2f})")
            
    # Verify no missing values remain in features or target
    print(f"Missing values in target: {df_model[target].isnull().sum()}")
    print(f"Missing values in features: {df_model[features].isnull().sum().sum()}")

    # --- Split Data ---
    X = df_model[features]
    y = df_model[target]
    
    if len(df_model) < 10:
        print("Warning: Very small dataset, results may not be reliable. Consider collecting more data.")
        # Fallback: reuse all rows for both training and testing. Metrics computed this way
        # are in-sample and overstate performance; cross-validation would be a better option.
        X_train, X_test, y_train, y_test = X, X, y, y
    else:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
        print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")

    # --- Preprocessing (Scaling) ---
    # Scale numerical features
    scaler = StandardScaler()
    
    # Fit scaler on training data only, then transform both train and test
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert scaled arrays back to DataFrames for easier inspection 
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=features, index=X_train.index)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=features, index=X_test.index)
    print("\nNumerical features scaled using StandardScaler.")
    # display(X_train_scaled.head())
else:
    print("Skipping data preparation as aggregated data is empty.")

## 4. Model Training and Evaluation: Building and Assessing Predictive Algorithms

**Purpose:** This section is the core of the predictive modeling process. It involves training various machine learning regression models on the prepared dataset and then rigorously evaluating their performance on unseen test data. The goal is to identify which model(s) are most effective at predicting the target variable (e.g., game popularity).

**Key Steps and Rationale:**

1.  **Conditional Execution:**
    *   **Action:** The entire modeling block is executed only if the `X_train_scaled` variable exists (implying that data preparation in Section 3 was successful). A dictionary `model_results` is initialized to store performance metrics.
    *   **Why:** Ensures that model training only proceeds if valid, prepared training data is available.

2.  **Define Models to Evaluate:**
    *   **Action:** A dictionary named `models` is created. Each key is a descriptive name for a model (e.g., 'Linear Regression', 'Random Forest'), and the corresponding value is an instance of the scikit-learn model class (e.g., `LinearRegression()`, `RandomForestRegressor(random_state=42)`).
    *   **Common Choices:**
        *   `LinearRegression`: A basic linear model.
        *   `RandomForestRegressor`: An ensemble method based on decision trees, often robust and good at capturing non-linearities.
        *   `GradientBoostingRegressor`: Another powerful ensemble method that builds trees sequentially, often achieving high accuracy.
    *   **Parameters:** Models are instantiated with some default parameters (e.g., `n_estimators=100` for tree-based ensembles, `random_state=42` for reproducibility). `n_jobs=-1` is often used for Random Forest to utilize all available CPU cores for faster training.
    *   **Why:** Allows for systematic training and comparison of different algorithmic approaches to the regression problem.

3.  **Iterative Model Training, Prediction, and Evaluation:**
    *   **Action:** The code iterates through each `name`, `model` pair in the `models` dictionary.
        *   **Training:** The `model.fit(X_train_scaled, y_train)` method is called to train the current model using the scaled training features and the training target variable.
        *   **Prediction:** Once trained, `model.predict(X_test_scaled)` is used to generate predictions on the (unseen) scaled test features.
        *   **Evaluation:** The predictions (`y_pred`) are compared against the actual test target values (`y_test`) using several standard regression metrics:
            *   **Mean Absolute Error (MAE):** `mean_absolute_error(y_test, y_pred)` - Average absolute difference between predicted and actual values. Interpretable in the same units as the target.
            *   **Mean Squared Error (MSE):** `mean_squared_error(y_test, y_pred)` - Average of the squared differences. Penalizes larger errors more heavily.
            *   **Root Mean Squared Error (RMSE):** `np.sqrt(mse)` - Square root of MSE, bringing the metric back to the original units of the target, making it more interpretable than MSE.
            *   **R-squared (R²):** `r2_score(y_test, y_pred)` - Coefficient of determination. Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Ranges from -∞ to 1 (higher is better, 1 is perfect); a negative value means the model performs worse than always predicting the mean of `y_test`.
        *   **Timing:** The duration of training for each model is recorded.
        *   **Storing Results:** The calculated metrics (MAE, RMSE, R², Training Time) for each model are stored in the `model_results` dictionary.
        *   **Printing Results:** Key metrics and training time are printed for immediate feedback on each model's performance.
    *   **Why:** This loop automates the process of training and evaluating multiple models, providing a consistent framework for comparison.

4.  **Feature Importance Analysis (for applicable models):**
    *   **Action:** For models that support it (e.g., tree-based ensembles like Random Forest and Gradient Boosting), the `feature_importances_` attribute is accessed. This provides a score for each feature, indicating its relative importance in making predictions.
    *   These importances are typically displayed as a sorted list or a bar plot.
    *   **Why:** Helps understand which features are most influential for the model's predictions, offering insights into the underlying data relationships and potentially guiding further feature engineering or selection.

**Significance:** This section translates the prepared data into predictive power. By training and evaluating a diverse set of models, we can empirically determine which algorithms and feature combinations yield the best predictive accuracy for game popularity. The evaluation metrics provide quantitative measures to compare models and select the most promising candidates for further refinement or deployment. Feature importance analysis adds a layer of interpretability to complex models.
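To make the metric definitions concrete, the following sketch computes MAE, MSE, RMSE, and R² by hand on four made-up values and cross-checks them against scikit-learn. The numbers are illustrative only and do not come from the project data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative peak player counts (not project data)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # (10 + 10 + 20 + 20) / 4
mse = np.mean(errors ** 2)                       # (100 + 100 + 400 + 400) / 4
rmse = np.sqrt(mse)
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.1f} RMSE={rmse:.3f} R2={r2:.2f}")  # MAE=15.0 RMSE=15.811 R2=0.98

# The manual values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

# A constant prediction far from the mean gives a negative R²
r2_bad = r2_score(y_true, np.full(4, 1000.0))
print(f"R2 for a constant prediction of 1000: {r2_bad:.1f}")  # -45.0
```

RMSE exceeds MAE here because squaring gives the two larger errors more weight.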

In [None]:
# Proceed only if data preparation was successful
model_results = {}
if 'X_train_scaled' in locals():
    print("\n--- Training and Evaluating Models ---")
    
    # Define models to try
    models = {
        'Linear Regression': LinearRegression(),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
    }
    
    # Loop through models, train, predict, and evaluate
    for name, model in models.items():
        print(f"\nTraining {name}...")
        start_time = datetime.now()

        # Train the model
        model.fit(X_train_scaled, y_train)

        # Stop the clock right after fitting so that only training time is measured
        duration = datetime.now() - start_time

        # Make predictions on the test set
        y_pred = model.predict(X_test_scaled)

        # Evaluate the model
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Store results
        model_results[name] = {
            'MAE': mae,
            'RMSE': rmse,
            'R2': r2,
            'Training Time': duration
        }
        
        print(f"  MAE: {mae:.2f}")
        print(f"  RMSE: {rmse:.2f}")
        print(f"  R2 Score: {r2:.3f}")
        print(f"  Training Time: {duration}")
        
        # Optional: Feature importance for tree-based models
        if hasattr(model, 'feature_importances_'):
            importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
            print("\n  Feature Importances:")
            print(importances.head(10)) # Display top 10
            plt.figure(figsize=(10, 6))
            importances.plot(kind='bar')
            plt.title(f'{name} Feature Importances')
            plt.ylabel('Importance')
            plt.show()
            
else:
    print("Skipping model training as data preparation failed or was skipped.")

## 5. Model Comparison: Identifying the Top Performing Algorithm

**Purpose:** This section systematically compares the performance of all trained regression models. The primary goal is to objectively identify the best-performing model based on the evaluation metrics calculated in the previous step (MAE, RMSE, R²).

**Key Steps and Rationale:**

1.  **Conditional Execution:**
    *   **Action:** The comparison logic is executed only if the `model_results` dictionary (populated during model training and evaluation) is not empty.
    *   **Why:** Ensures that comparison only occurs if models were actually trained and their results are available.

2.  **Convert Results to DataFrame:**
    *   **Action:** The `model_results` dictionary is converted into a `pandas` DataFrame (`results_df`). Each row in this DataFrame represents a model, and columns correspond to the stored performance metrics (MAE, RMSE, R², Training Time).
    *   **Why:** DataFrames provide a structured and convenient way to display, sort, and analyze tabular data, making model comparison straightforward.

3.  **Sort by Primary Metric:**
    *   **Action:** The `results_df` is sorted based on a primary evaluation metric. For regression, RMSE (Root Mean Squared Error) is often chosen, where lower values indicate better performance. The sorting is done in ascending order for RMSE.
    *   **Why:** Sorting makes it easy to quickly identify the top-ranking models according to the most critical performance criterion.

4.  **Display Results Table:**
    *   **Action:** The sorted `results_df` is displayed.
    *   **Why:** Provides a clear, tabular summary of how each model performed across all key metrics, facilitating direct comparison.

5.  **Identify and Announce Best Model:**
    *   **Action:** The name of the model corresponding to the first row of the sorted `results_df` (i.e., the one with the best score on the primary metric) is extracted and printed as the 'best model'.
    *   **Why:** Programmatically identifies and highlights the top-performing model based on the chosen criterion.

6.  **Visual Comparison (Optional but Recommended):**
    *   **Action:** Bar plots are generated to visually compare models based on their error metrics (MAE, RMSE – lower is better) and their R² score (higher is better).
    *   **Why:** Visualizations can often make performance differences more apparent and easier to interpret than numerical tables alone. They provide an intuitive way to grasp the relative strengths and weaknesses of the models.

**Significance:** Model comparison is a critical step in the machine learning workflow. It provides a data-driven basis for selecting the most suitable model for the specific prediction task. The identified 'best model' is typically the one considered for further refinement (e.g., hyperparameter tuning), interpretation, and potential deployment. This section ensures that the choice of model is based on empirical evidence from its performance on unseen test data.
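With a small dataset, a ranking based on a single 80/20 split can change with the random seed. K-fold cross-validation is one way to check how stable the ranking is. The sketch below uses synthetic data from `make_regression` and a reduced model list; the sample size, fold count, and the `candidates` and `cv_rmse` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); sizes are illustrative
X_demo, y_demo = make_regression(
    n_samples=60, n_features=6, noise=10.0, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so each fold fits its own scaler
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores  # scikit-learn reports errors as negated scores
    print(f"{name}: RMSE {cv_rmse[name].mean():.2f} ± {cv_rmse[name].std():.2f}")
```

If the fold-to-fold spread is larger than the gap between two models, the difference between them is probably not meaningful.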

In [None]:
# Compare model results if available
if model_results:
    results_df = pd.DataFrame.from_dict(model_results, orient='index')
    results_df = results_df.sort_values('RMSE', ascending=True) # Sort by RMSE (lower is better)
    
    print("\n--- Model Performance Comparison ---")
    display(results_df)
    
    # Identify the best model based on RMSE
    best_model_name = results_df.index[0]
    print(f"\nBest model based on RMSE: {best_model_name}")
    
    results_df[['MAE', 'RMSE']].plot(kind='bar', figsize=(10, 6))
    plt.title('Model Error Comparison (Lower is Better)')
    plt.ylabel('Error')
    plt.xticks(rotation=0)
    plt.show()
    
    results_df['R2'].plot(kind='bar', figsize=(10, 6))
    plt.title('Model R2 Score Comparison (Higher is Better)')
    plt.ylabel('R2 Score')
    plt.ylim(min(0, results_df['R2'].min() - 0.1), 1.0)
    plt.xticks(rotation=0)
    plt.show()
else:
    print("Skipping model comparison as no models were trained.")

## 6. Save Best Model and Associated Artifacts (Optional): Persisting Predictive Assets

**Purpose:** This section outlines the process of saving (serializing) the best-performing trained machine learning model, along with its associated data preprocessing objects (like the feature scaler) and the list of features it was trained on. Persisting these assets allows for their reuse without needing to retrain the model from scratch, which is essential for deployment, making predictions on new data, or further analysis.

**Key Steps and Rationale:**

1.  **Conditional Execution:**
    *   **Action:** The saving process is executed only if `model_results` is not empty (meaning models were trained and evaluated) AND `best_model_name` has been defined (meaning a best model was identified in the comparison step).
    *   **Why:** Ensures that saving only occurs if there's a valid, trained, and selected best model available.

2.  **Retrieve Best Model Instance:**
    *   **Action:** The actual trained model object corresponding to `best_model_name` is retrieved from the `models` dictionary (which stores all trained model instances).
    *   **Why:** We need the specific model object that achieved the best performance to save it.

3.  **Define Save Directory and File Paths:**
    *   **Action:**
        *   A dedicated directory for storing models (e.g., `../models/`) is defined using `os.path.join()` for platform compatibility.
        *   `os.makedirs(model_dir, exist_ok=True)` creates this directory if it doesn't already exist, preventing errors.
        *   Specific file paths are constructed for:
            *   The model itself (e.g., `random_forest_model.joblib`). The filename is derived from the model's name, lowercased and with spaces replaced by underscores.
            *   The data scaler (e.g., `scaler.joblib`) used for preprocessing the features.
            *   The list of feature names (e.g., `features.joblib`) that the model expects as input.
    *   **Why:** Organizes saved artifacts and ensures they can be easily located and loaded later. Saving the scaler and feature list is crucial for ensuring that new data is preprocessed in exactly the same way as the training data.

4.  **Serialize and Save Artifacts using `joblib`:**
    *   **Action:** The `joblib.dump()` function is used to serialize and save the Python objects (the model, the scaler, and the feature list) to disk at their respective defined paths.
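    *   **Why `joblib`?:** `joblib` efficiently serializes objects that carry large NumPy arrays, which is the case for many fitted scikit-learn estimators, and is commonly used instead of Python's built-in `pickle` for that reason. Like `pickle`, it should only be used to load files from trusted sources.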

5.  **Confirmation and Error Handling:**
    *   **Action:** A success message is printed after each artifact is saved, confirming its location. A `try-except` block catches and reports any error raised during saving (e.g., disk full, permission issues).
    *   **Why:** Provides feedback on the outcome of the saving operation and helps diagnose issues if they arise.

**Significance:** Saving the best model and its associated components is a critical step towards operationalizing the machine learning solution. It enables:
*   **Reproducibility:** The exact model and preprocessing steps can be reloaded.
*   **Efficiency:** Avoids the time-consuming process of retraining for every new prediction task.
*   **Deployment:** The saved artifacts can be loaded into a separate application or service to make predictions on new, incoming game data.
*   **Consistency:** Ensures that new data is transformed using the same scaler and feature set that the model was trained on, which is vital for accurate predictions.

In [None]:
# Save the best-performing model, the scaler, and the feature list
if model_results and 'best_model_name' in locals():
    best_model = models[best_model_name] # Get the trained model instance
    
    # Create directory if it doesn't exist
    model_dir = os.path.join("..", "models")
    os.makedirs(model_dir, exist_ok=True)
    
    # Define file paths
    model_path = os.path.join(model_dir, f"{best_model_name.lower().replace(' ', '_')}_model.joblib")
    scaler_path = os.path.join(model_dir, "scaler.joblib")
    features_path = os.path.join(model_dir, "features.joblib")
    
    try:
        # Save the model
        joblib.dump(best_model, model_path)
        print(f"\nBest model ({best_model_name}) saved to: {model_path}")
        
        # Save the scaler
        joblib.dump(scaler, scaler_path)
        print(f"Scaler saved to: {scaler_path}")
        
        # Save the list of features used by the model
        joblib.dump(features, features_path)
        print(f"Feature list saved to: {features_path}")
        
    except Exception as e:
        print(f"\nError saving model/scaler/features: {e}")
else:
    print("\nSkipping model saving: No best model identified or training failed.")
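
Once the three artifacts exist, they can be reloaded and applied to new data. The sketch below is self-contained and makes several assumptions: the feature names, the toy values, the `LinearRegression` model, and the temporary directory are all hypothetical stand-ins so that the example runs on its own. The point it illustrates is the order of operations: select the saved feature columns, apply the saved scaler, then predict.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names and toy values, for illustration only.
features = ["pre_release_reddit_posts", "pre_release_twitch_viewers"]
train_df = pd.DataFrame({
    "pre_release_reddit_posts": [10.0, 40.0, 25.0, 60.0],
    "pre_release_twitch_viewers": [100.0, 900.0, 400.0, 1500.0],
})
y_train = np.array([1_000.0, 9_000.0, 4_500.0, 15_000.0])

# Fit the scaler and a stand-in model, mirroring the training stage.
scaler = StandardScaler().fit(train_df[features])
model = LinearRegression().fit(scaler.transform(train_df[features]), y_train)

# Save and reload inside a temporary directory (stands in for ../models/).
with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "linear_regression_model.joblib")
    scaler_path = os.path.join(model_dir, "scaler.joblib")
    features_path = os.path.join(model_dir, "features.joblib")

    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    joblib.dump(features, features_path)

    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_features = joblib.load(features_path)

# New data may arrive with extra columns or a different column order.
new_games = pd.DataFrame({
    "pre_release_twitch_viewers": [650.0],
    "game_name": ["Example Game"],
    "pre_release_reddit_posts": [30.0],
})

# Selecting by the saved feature list restores the training column order.
X_new = loaded_scaler.transform(new_games[loaded_features])
predictions = loaded_model.predict(X_new)
print(predictions)
```

Indexing the new DataFrame with the saved feature list is what guards against silently misaligned columns. A missing feature raises a `KeyError` instead of producing a wrong prediction.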

## 7. Project Conclusion and Future Enhancements

**Project Summary:**
This project successfully established an end-to-end pipeline for predicting game popularity. Key achievements include:
1.  **Automated Data Collection (`01_data_collection.ipynb`):** A robust system was developed to gather diverse metrics (Steam player counts, Twitch viewership, social media engagement from Reddit, Twitter, Google Trends, YouTube) for a curated list of games.
2.  **Comprehensive Feature Engineering (`02_feature_engineering.ipynb`):** The collected time-series data was aggregated into a structured dataset, creating meaningful features that distinguish between pre-release hype and post-launch performance indicators.
3.  **Predictive Modeling (`03_predictive_modeling.ipynb`):** Various regression models were trained and evaluated on the engineered features to predict a defined popularity metric (e.g., peak Steam players within 7 days post-launch). This involved data preparation, model training, performance comparison, and persistence of the best-performing model and associated artifacts.

The current implementation provides a functional baseline for predicting game popularity, demonstrating the viability of using pre-release and early post-launch data for forecasting. The selected model offers a starting point for understanding the key drivers of game success based on the available data.

**Potential Future Enhancements:**
While the project delivers a working solution, several avenues exist for potential improvement and further development:
1.  **Hyperparameter Optimization:** Systematically fine-tuning the hyperparameters of the best-performing model (e.g., using `GridSearchCV` or `RandomizedSearchCV`) could yield better predictive accuracy.
2.  **Advanced Feature Engineering:**
    *   **Interaction Terms:** Explore creating features that capture interactions between existing variables (e.g., Metacritic score * pre-release Reddit activity).
    *   **Temporal Dynamics:** Develop more sophisticated time-based features, such as growth rates of social media metrics or trends in player engagement leading up to the prediction window.
    *   **Sentiment Analysis:** Incorporate sentiment scores derived from text data (e.g., Reddit comments, Twitter posts) as features.
3.  **Exploration of Diverse Algorithms:** Experiment with a broader range of machine learning models, including more complex ensemble methods (e.g., XGBoost, LightGBM, CatBoost) or even neural network architectures, which might capture more intricate patterns in the data.
4.  **Expansion of Data Sources & Features:**
    *   **More Granular Data:** Collect data at a higher frequency if APIs permit.
    *   **Additional Platforms:** Integrate data from other relevant platforms (e.g., Discord, gaming news sites).
    *   **Game Characteristics:** Include more detailed game characteristics (e.g., genre sub-categories, developer/publisher reputation, marketing budget indicators if obtainable).
5.  **Increased Data Volume:** Continuously run the data collection pipeline to build a larger historical dataset. Covering more games and longer timeframes is likely to improve model robustness and generalization, and to make subtler trends detectable.
6.  **Deployment and Monitoring:** For a production system, deploy the model as an API and implement a monitoring system to track its performance on new, unseen games over time, triggering retraining or model updates as necessary.
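
As an illustration of the hyperparameter optimization idea in item 1, the following is a minimal sketch using `RandomizedSearchCV`. The synthetic data, the choice of `RandomForestRegressor`, and the small search space are all assumptions made so the example runs quickly. They are not tuned recommendations for this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the game feature matrix (not project data).
X, y = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative search space; real ranges depend on the dataset.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled configurations
    cv=3,                                   # 3-fold CV on the training set only
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
    random_state=0,
)
search.fit(X_train, y_train)

best_cv_rmse = -search.best_score_
print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated RMSE: {best_cv_rmse:.2f}")

# The test set is touched once, after the search has finished.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(f"Test R2 of the tuned model: {test_r2:.3f}")
```

Running the search with cross-validation on the training split keeps the test set out of the tuning loop, so the final test score remains a fair check.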

These potential enhancements can build upon the solid foundation established by this project, aiming for even greater predictive power and deeper insights into the dynamics of game popularity.
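
To make the feature ideas in item 2 concrete, here is a minimal sketch of an interaction term and a growth-rate feature in pandas. The column names and values are hypothetical and do not correspond to columns in `aggregated_game_features.csv`.

```python
import pandas as pd

# Hypothetical column names and toy values, for illustration only.
games = pd.DataFrame({
    "metacritic_score": [80.0, 65.0],
    "reddit_posts_week_minus_2": [100.0, 50.0],
    "reddit_posts_week_minus_1": [150.0, 40.0],
})

# Interaction term: review score multiplied by recent pre-release Reddit activity.
games["metacritic_x_reddit"] = (
    games["metacritic_score"] * games["reddit_posts_week_minus_1"]
)

# Growth rate: relative change in Reddit posts between the last two
# pre-release weeks. A zero in the earlier week would need special handling.
games["reddit_growth_rate"] = (
    games["reddit_posts_week_minus_1"] - games["reddit_posts_week_minus_2"]
) / games["reddit_posts_week_minus_2"]

print(games[["metacritic_x_reddit", "reddit_growth_rate"]])
```

Tree-based models can learn some interactions on their own, so explicit interaction terms are likely to matter most for linear models.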

In [None]:
# Final summary message
print("\nPredictive Modeling Notebook Complete.")
if model_results:
    print(f"Model comparison complete. Best model based on RMSE: {best_model_name}")
    if 'model_path' in locals() and os.path.exists(model_path):
        print(f"Best model artifacts saved in: {model_dir}")
else:
    print("Model training and evaluation were skipped or failed. Check previous cell outputs.")

---
*End of Notebook*