# 02. Feature Engineering and Data Aggregation for Predictive Modeling

**Project Context:** This notebook is the second stage in a multi-part project aimed at predicting game popularity. It builds upon the raw, time-stamped data systematically gathered by the `01_data_collection.ipynb` notebook.

**Purpose of this Notebook:** The primary objective here is to transform the granular, longitudinal data into a consolidated, feature-rich dataset suitable for machine learning applications. This involves leveraging a custom `DataAggregator` class to:
*   Consolidate data from multiple collection instances.
*   Calculate aggregated metrics that represent game performance and community engagement over specific time windows, particularly distinguishing between pre-release hype and post-launch success indicators.
*   Structure the data such that each game is represented by a single row, with columns corresponding to engineered features and potential target variables.

**Methodological Significance:** Raw time-series data, while rich in detail, is often not directly amenable to predictive models aiming to forecast a singular outcome (e.g., peak player count within a certain period post-launch). Feature engineering is crucial for extracting salient signals by aggregating data over relevant intervals. For instance, metrics like average social media sentiment in the weeks leading up to a game's release can serve as 'pre-release features,' while metrics like peak player counts or average viewership in the initial weeks post-launch can act as 'post-launch outcomes' or target variables.

**Expected Outcomes:** Upon successful execution of this notebook, the following will be achieved:
1.  **Consolidated Raw Data:** All individual data files generated by `01_data_collection.ipynb` will be loaded and merged into a unified DataFrame.
2.  **Feature Aggregation:** The `DataAggregator` will process this merged data, calculating various features and outcome metrics for each game based on its release date and predefined aggregation windows.
3.  **Structured Feature Set:** A new DataFrame will be produced where each row corresponds to a unique game. This DataFrame will contain columns representing:
    *   Static game information (e.g., App ID, name, release date, Metacritic score).
    *   Aggregated pre-release metrics (e.g., average Google Trends score, Reddit activity, YouTube views/likes before launch).
    *   Aggregated post-launch metrics (e.g., peak Steam player count, peak Twitch viewers, average engagement metrics after launch).
4.  **Preliminary Analysis and Cleaning:** Initial data cleaning steps (e.g., handling missing values) and exploratory analysis (e.g., descriptive statistics, correlation analysis) will be performed on the aggregated feature set.
5.  **Persistent Storage:** The final, aggregated feature set will be saved to a CSV file, ready for use in the subsequent modeling phase (`03_modeling.ipynb`).

## 1. Environment Initialization and Library Imports

**Purpose:** This section sets up the Python environment for the feature engineering and data aggregation tasks. It imports the required libraries and configures `sys.path` so that the project's custom modules are discoverable.

**Key Actions Undertaken:**
*   **Library Imports:** The standard-library modules and third-party packages needed for data manipulation, numerical operations, and visualization are imported:
    *   `sys` and `os`: For system-level operations, primarily path manipulation.
    *   `pandas`: For DataFrame creation, manipulation, and analysis.
    *   `numpy`: For numerical computations, especially array operations.
    *   `matplotlib.pyplot` and `seaborn`: For creating statistical graphics such as the correlation heatmap in Section 5.
    *   `datetime`: For handling and recording timestamps, particularly the notebook's execution start time.
*   **Path Configuration for Custom Modules:** The Python system path (`sys.path`) is extended at runtime with the project root (the parent directory, `..`, when the notebook is run from `notebooks/`). Adding the root rather than the `src` directory itself is what allows the package-qualified import `from src.aggregator import DataAggregator` to resolve.
*   **Custom Module Import:** The `DataAggregator` class is imported from the `src.aggregator` module. This class encapsulates the core logic for loading, merging, and aggregating the time-series data collected in the previous notebook.
*   **Visualization Styling:** The `seaborn-v0_8-whitegrid` style is applied so that plots look consistent throughout the notebook. This style name is available in Matplotlib 3.6 and later; the `configure_plotting` helper from `src.utils` is an optional, commented-out alternative.
*   **Pandas Display Configuration:** `pandas` display options are configured (e.g., `display.max_columns`, `display.max_rows`, `display.width`) to improve the readability of DataFrames when printed or displayed, ensuring comprehensive views of the data.
*   **Execution Timestamp:** The start time of the notebook's execution is recorded and printed. This serves as a useful reference for tracking the duration of the operations performed.

**Significance:** A correctly configured environment is paramount for the seamless execution of the notebook. This setup ensures that all dependencies are met, custom tools like the `DataAggregator` are accessible, and that data can be effectively manipulated and visualized, thereby preventing runtime errors and facilitating a smooth analytical workflow.
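
The path configuration described above depends on the notebook being launched from `notebooks/`. A minimal sketch of a slightly more defensive variant follows, assuming only that the project root is the nearest ancestor directory containing a `src` folder. The helper name `add_project_root` is hypothetical and is not part of this project.

```python
import sys
from pathlib import Path


def add_project_root(start: Path, marker: str = "src") -> Path:
    """Find the nearest ancestor of `start` that contains `marker` and put it on sys.path.

    Hypothetical helper for illustration; the notebook itself simply appends '..'.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            root = str(candidate)
            if root not in sys.path:  # keep the operation idempotent
                sys.path.append(root)
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

In a notebook this would be called as `add_project_root(Path.cwd())` before the `from src.aggregator import DataAggregator` line.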

In [None]:
# Imports and Setup
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Add the project root (parent of 'notebooks') to sys.path so the 'src' package is importable
# Assumes notebook is run from the 'notebooks' directory
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom modules
from src.aggregator import DataAggregator
# from src.utils import configure_plotting # Optional

# Configure plotting (optional)
# configure_plotting()
plt.style.use('seaborn-v0_8-whitegrid')

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

## 2. Data Aggregator Instantiation

**Purpose:** This step focuses on creating an instance of the `DataAggregator` class. This object is pivotal as it encapsulates the specialized logic required to load, merge, and process the historical time-series data collected by `01_data_collection.ipynb`.

**Functionality of `DataAggregator`:**
*   **Data Loading and Merging:** The primary role of the `DataAggregator` is to scan a specified directory (e.g., `../data/`) for raw data files (typically timestamped CSVs). It then loads these individual files and merges them into a single, comprehensive `pandas` DataFrame. This unified DataFrame contains all historical records, sorted chronologically.
*   **Feature Computation:** Based on game release dates (which it attempts to parse and standardize), the aggregator calculates various features. These features are typically aggregated metrics over defined time windows, such as:
    *   **Pre-release metrics:** Average player counts, social media engagement (Reddit, Twitter), search interest (Google Trends), and YouTube statistics in the period leading up to a game's launch.
    *   **Post-launch outcomes:** Peak player counts, average viewership, and other engagement metrics within specified periods after the game's release (e.g., 7-day peak, 30-day average).
*   **Configuration:** The `DataAggregator` is initialized with a `data_dir` parameter, which points to the location of the raw data files. It may also have internal configurations for file patterns (e.g., `steam_data_*.csv*`) and default aggregation windows.

**Expected Outcome Upon Execution:**
*   Successful instantiation of the `DataAggregator` object, ready to perform its data processing tasks.
*   The code cell binds the instance to the `aggregator` variable. The assignment itself displays nothing, so any output shown comes from the constructor.

**Significance:** Instantiating the `DataAggregator` is a prerequisite for transforming the raw, event-level data into a structured format suitable for machine learning. It bridges the gap between raw data collection and feature engineering, providing the tools to derive meaningful insights and predictive variables from the historical data.

In [None]:
# Initialize the aggregator
# Point it to the directory where the collector saved the raw data files
aggregator = DataAggregator(data_dir="../data")

## 3. Consolidation of Historical Raw Data

**Purpose:** This crucial step involves invoking the `load_merged_data()` method of the `DataAggregator` instance. The primary objective is to consolidate all individual raw data files, previously generated and saved by the `01_data_collection.ipynb` notebook, into a single, unified `pandas` DataFrame.

**Process Details:**
*   **File Discovery:** The `DataAggregator` scans the specified `data_dir` (e.g., `../data/`) for files matching a predefined pattern. The default, `steam_data_*.csv*`, matches plain CSV snapshots as well as compressed ones such as `steam_data_*.csv.gz`, so all relevant historical snapshots are identified.
*   **Data Loading and Concatenation:** Each identified file is loaded into a `pandas` DataFrame. These individual DataFrames are then concatenated vertically to form one large DataFrame (`merged_df`).
*   **Data Type Conversion and Sorting:** During or after loading, essential columns like `timestamp` and `release_date` are typically converted to appropriate datetime objects to facilitate time-based operations. The `merged_df` is usually sorted by `timestamp` to ensure chronological order, which is vital for time-series analysis and correct feature aggregation.
*   **Error Handling:** The process includes try-except blocks to gracefully handle potential issues, such as no data files being found or errors during file loading. If no data is loaded, an empty DataFrame is typically returned, and a message is printed.
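
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual `DataAggregator.load_merged_data()` implementation: the function name `load_snapshots` is hypothetical, and the column names `timestamp` and `release_date` are taken from the description above.

```python
import glob
import os

import pandas as pd


def load_snapshots(data_dir: str, pattern: str = "steam_data_*.csv*") -> pd.DataFrame:
    """Load every snapshot file matching `pattern` into one chronologically sorted DataFrame."""
    paths = sorted(glob.glob(os.path.join(data_dir, pattern)))
    if not paths:
        return pd.DataFrame()  # nothing to merge

    # read_csv infers gzip compression from a '.gz' suffix by default
    frames = [pd.read_csv(path) for path in paths]
    merged = pd.concat(frames, ignore_index=True)

    # Unparseable dates become NaT instead of raising
    for col in ("timestamp", "release_date"):
        if col in merged.columns:
            merged[col] = pd.to_datetime(merged[col], errors="coerce")

    if "timestamp" in merged.columns:
        merged = merged.sort_values("timestamp").reset_index(drop=True)
    return merged
```

Returning an empty DataFrame when no files match mirrors the behavior described in the error-handling bullet, which lets the calling cell test `merged_df.empty` instead of catching an exception.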

**Expected Output Upon Successful Execution:**
*   **`merged_df` DataFrame:** A `pandas` DataFrame named `merged_df` is created. This DataFrame contains all records from all previous data collection runs.
    *   Each row represents a data snapshot for a specific game at a particular timestamp.
    *   Columns include game identifiers (`app_id`, `name`), the `timestamp` of collection, all collected metrics (player counts, viewership, social media stats, etc.), and static game information like `release_date` and `metacritic_score` (if available).
*   **Console Output:**
    *   A sample of the `merged_df` (e.g., the first few rows via `display(merged_df.head())`).
    *   The shape of `merged_df` (number of rows and columns).
    *   The overall date range covered by the `timestamp` column in `merged_df`.
*   **Error/Warning Messages:** If no data files are found or an error occurs, an appropriate message is printed to the console.

**Significance:** The `merged_df` represents the complete raw historical dataset. It serves as the foundational input for the subsequent feature aggregation step, where time-series data will be transformed into a one-row-per-game feature set.

In [None]:
# Load and merge all data files matching the default pattern 'steam_data_*.csv*'
try:
    merged_df = aggregator.load_merged_data()
    if not merged_df.empty:
        print("\n--- Merged Raw Data Sample ---")
        display(merged_df.head())
        print(f"\nShape of merged data: {merged_df.shape}")
        print(f"Date range: {merged_df['timestamp'].min()} to {merged_df['timestamp'].max()}")
    else:
        print("No raw data files found or loaded. Cannot proceed with aggregation.")
except Exception as e:
    print(f"An error occurred loading merged data: {e}")
    merged_df = pd.DataFrame() # Ensure df is empty on error

## 4. Feature Aggregation from Time-Series Data

**Purpose:** This is the core data transformation step where the longitudinal (time-series) `merged_df` is processed by the `DataAggregator`'s `aggregate_features` method. The goal is to convert the multi-entry per-game data into a single-row-per-game DataFrame, where columns represent engineered features and potential outcome variables.

**Process Details:**
*   **Input Data:** The primary input is the `merged_df` DataFrame, which contains all historical raw data.
*   **Aggregation Logic:** The `aggregate_features` method iterates through each unique game identified in `merged_df`. For each game, it uses the game's `release_date` as a crucial reference point to define specific time windows for aggregation.
*   **Configurable Time Windows:**
    *   `PRE_RELEASE_DAYS` (e.g., 30 days): Defines the period *before* the game's release from which pre-launch hype indicators (e.g., average Google Trends score, Reddit activity, YouTube views/likes) are calculated. These serve as predictive features.
    *   `POST_LAUNCH_PEAK_DAYS` (e.g., 7 days): Defines the initial period *after* launch used to determine peak performance metrics (e.g., peak Steam player count, peak Twitch viewers). These can be target variables or features.
    *   `POST_LAUNCH_AVG_DAYS` (e.g., 30 days): Defines a broader period *after* launch for calculating average performance metrics (e.g., average player count, average viewership). These can also serve as target variables or features.
*   **Feature Calculation:** Within these windows, various aggregation functions (mean, max, sum, etc.) are applied to the relevant metric columns (e.g., `player_count`, `twitch_viewer_count`, `reddit_subscribers`).
*   **Output DataFrame (`aggregated_features_df`):**
    *   Each row represents a unique game.
    *   Columns include:
        *   Static game information: `app_id`, `game_name`, `release_date`, `metacritic_score`.
        *   Pre-release aggregated features: e.g., `google_trends_avg_pre_30d`, `reddit_posts_avg_pre_30d`.
        *   Post-launch aggregated outcomes/features: e.g., `steam_peak_players_7d`, `twitch_avg_viewers_30d`.
*   **Error Handling:** The process includes checks for the availability of `merged_df`. If `merged_df` is empty, aggregation is skipped. Errors during the aggregation process itself (e.g., issues with date parsing, insufficient data for a game within defined windows) are caught, and messages are printed.
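
To make the windowing logic concrete, here is a minimal sketch of how one game's snapshots could be collapsed into a single row. It is not the `aggregate_features` implementation. The function names are hypothetical, the raw column `google_trends_score` is an assumed name, and the half-open window boundaries are one reasonable choice among several.

```python
import pandas as pd


def aggregate_one_game(game_df: pd.DataFrame, pre_days: int = 30,
                       peak_days: int = 7, avg_days: int = 30) -> dict:
    """Collapse one game's snapshots into a single row of window aggregates."""
    release = game_df["release_date"].iloc[0]
    ts = game_df["timestamp"]

    # Half-open windows: [release - pre_days, release) and [release, release + n days)
    pre = game_df[(ts >= release - pd.Timedelta(days=pre_days)) & (ts < release)]
    post_peak = game_df[(ts >= release) & (ts < release + pd.Timedelta(days=peak_days))]
    post_avg = game_df[(ts >= release) & (ts < release + pd.Timedelta(days=avg_days))]

    # An empty window (or a missing release date) yields NaN for that metric
    return {
        "app_id": game_df["app_id"].iloc[0],
        "release_date": release,
        f"google_trends_avg_pre_{pre_days}d": pre["google_trends_score"].mean(),
        f"steam_peak_players_{peak_days}d": post_peak["player_count"].max(),
        f"twitch_avg_viewers_{avg_days}d": post_avg["twitch_viewer_count"].mean(),
    }


def aggregate_all(merged_df: pd.DataFrame, **windows) -> pd.DataFrame:
    """Apply aggregate_one_game to every app_id and stack the rows."""
    rows = [aggregate_one_game(g, **windows) for _, g in merged_df.groupby("app_id")]
    return pd.DataFrame(rows)
```

Keeping the pre-release window strictly before the release date means that no post-launch observation can leak into a feature intended to be known before launch.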

**Expected Output Upon Successful Execution:**
*   **`aggregated_features_df` DataFrame:** A new `pandas` DataFrame where each game has a single row containing its static details and the calculated pre-release and post-launch aggregated metrics.
*   **Console Output:**
    *   A message indicating the start of feature aggregation.
    *   A sample of the `aggregated_features_df` (e.g., `display(aggregated_features_df.head())`).
    *   The shape of `aggregated_features_df`.
    *   A list of all column names in `aggregated_features_df`.
*   **Warning/Error Messages:** If aggregation results in an empty DataFrame (e.g., due to issues with release dates or insufficient data range for any game), a specific warning is printed. Other exceptions during aggregation are also reported.

**Significance:** This step is pivotal as it transforms raw, granular time-series data into a structured, feature-rich dataset that is directly usable for training machine learning models. The engineered features capture temporal dynamics (pre-release hype vs. post-launch performance) crucial for predictive accuracy.

In [None]:
# Aggregate features if merged data is available
aggregated_features_df = pd.DataFrame() # Initialize empty
if 'merged_df' in locals() and not merged_df.empty:
    print("\nStarting feature aggregation...")
    try:
        # Define aggregation windows (can be adjusted)
        PRE_RELEASE_DAYS = 30
        POST_LAUNCH_PEAK_DAYS = 7
        POST_LAUNCH_AVG_DAYS = 30

        aggregated_features_df = aggregator.aggregate_features(
            merged_data=merged_df,
            pre_release_days=PRE_RELEASE_DAYS,
            post_launch_days_peak=POST_LAUNCH_PEAK_DAYS,
            post_launch_days_avg=POST_LAUNCH_AVG_DAYS
        )

        if not aggregated_features_df.empty:
            print("\n--- Aggregated Features Sample ---")
            display(aggregated_features_df.head())
            print(f"\nShape of aggregated data: {aggregated_features_df.shape}")
            print("\nColumns:", aggregated_features_df.columns.tolist())
        else:
            print("Aggregation resulted in an empty DataFrame. Check data quality (e.g., release dates, sufficient time range).")
    except Exception as e:
        print(f"An error occurred during feature aggregation: {e}")
else:
    print("Skipping aggregation because merged data is empty.")

## 5. Preliminary Analysis and Cleaning of Aggregated Features

**Purpose:** After feature aggregation, this section focuses on conducting an initial examination and basic cleaning of the `aggregated_features_df`. The goal is to understand the characteristics of the engineered feature set, identify potential data quality issues (like missing values), and perform rudimentary data preparation steps before more advanced modeling.

**Process Details:**
*   **Conditional Execution:** Analysis is performed only if `aggregated_features_df` was successfully created and is not empty.
*   **1. Basic Information and Data Types (`.info()`):**
    *   Provides a concise summary of the DataFrame, including the data type of each column, the number of non-null values, and memory usage.
    *   Helps verify that columns have been assigned appropriate types (e.g., numerical features are numeric, dates are datetime objects).
*   **2. Missing Value Analysis (`.isnull().sum()`):**
    *   Calculates the percentage of missing values for each column.
    *   Highlights columns with significant amounts of missing data, which might require imputation, removal, or indicate issues in data collection/aggregation for those features.
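    *   The code then fills NaNs with 0 in every numerical column except `app_id`. This is a deliberately simple placeholder: it also zero-fills the target column (e.g., `steam_peak_players_7d`) and any other numerical column where 0 is not a neutral value, so a more sophisticated approach might be needed before modeling.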
*   **3. Descriptive Statistics (`.describe()`):**
    *   Generates summary statistics (count, mean, std, min, max, quartiles) for all numerical columns in `aggregated_features_df`.
    *   Offers insights into the distribution, central tendency, and spread of each feature, helping to identify outliers or unusual data ranges.
*   **4. Correlation Analysis (Partial, Focused on a Target Variable):**
    *   Aims to understand the linear relationships between potential predictive features and a chosen target variable (e.g., `steam_peak_players_7d`).
    *   Selects only numerical columns for correlation calculation.
    *   Calculates the Pearson correlation coefficients of all numerical features with the specified target variable and sorts them.
    *   A heatmap of the correlation matrix for all numerical features is generated to visualize inter-feature correlations and feature-target correlations more broadly.
    *   This step helps in initial feature selection by identifying features that show a strong correlation (positive or negative) with the outcome of interest.
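
As a hedged alternative to the blanket zero-fill, the sketch below reports missing percentages and then fills only the feature columns, leaving listed columns (such as the target) untouched and recording which values were filled. The helper names `missing_report` and `fill_features` are hypothetical, and the indicator-column approach is a suggestion rather than something this project prescribes.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first, complete columns omitted."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


def fill_features(df: pd.DataFrame, exclude: list[str]) -> pd.DataFrame:
    """Fill numeric NaNs with 0, skipping `exclude`, and flag which rows were filled."""
    out = df.copy()  # leave the caller's DataFrame unchanged
    numeric = out.select_dtypes(include=np.number).columns
    for col in numeric.difference(exclude):
        if out[col].isnull().any():
            out[f"{col}_was_missing"] = out[col].isnull().astype(int)
            out[col] = out[col].fillna(0)
    return out
```

A call such as `fill_features(aggregated_features_df, exclude=["app_id", target_var])` would keep rows with an unknown target identifiable, so they can be dropped rather than treated as games with zero players.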

**Expected Output Upon Successful Execution:**
*   **Console Output:**
    *   Basic info printout from `aggregated_features_df.info()`.
    *   A list or Series showing columns with missing values and their respective percentages.
    *   A DataFrame displaying descriptive statistics for numerical features.
    *   A Series showing the correlation of numerical features with the defined `target_var`.
*   **Visualizations:** A heatmap of the correlation matrix will be displayed.
*   **Warning/Error Messages:** If `aggregated_features_df` is empty, a message indicating that the analysis is being skipped will be printed. If the target variable for the correlation analysis is not found among the columns, a message saying so is printed instead of the correlation output.

**Significance:** This initial analysis is crucial for several reasons:
*   **Data Quality Assessment:** It provides a first look at the quality and completeness of the engineered features.
*   **Informing Preprocessing:** Identifies the need for missing value imputation, outlier handling, or feature scaling.
*   **Preliminary Feature Relevance:** Correlation analysis offers early clues about which features might be most predictive for the chosen target, guiding subsequent feature selection and modeling efforts.
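
One way to turn the correlation output into a quick relevance ranking is to sort by absolute value, so that strong negative relationships are not pushed to the bottom of the list. The sketch below assumes a numeric target column; `rank_by_target_correlation` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include=np.number)
    if target not in numeric.columns:
        raise KeyError(f"'{target}' is missing or not numeric")
    corr = numeric.corrwith(numeric[target]).drop(target)
    # Order by magnitude but keep the sign in the returned values
    return corr.reindex(corr.abs().sort_values(ascending=False).index)
```

When reading such a ranking, note that other post-launch columns will often correlate strongly with a post-launch target simply because they are measured over the same period. They are outcomes rather than features available before release.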

In [None]:
# Analyze the aggregated features if available
if 'aggregated_features_df' in locals() and not aggregated_features_df.empty:
    print("\n--- Initial Analysis of Aggregated Features ---")

    # 1. Basic Info and Data Types
    print("\nBasic Info:")
    aggregated_features_df.info()

    # 2. Missing Value Analysis
    print("\nMissing Values (%):")
    missing_percent = (aggregated_features_df.isnull().sum() / len(aggregated_features_df)) * 100
    print(missing_percent[missing_percent > 0].sort_values(ascending=False))

    # Fill NaNs with 0 for numerical columns, excluding identifiers like app_id.
    # Caution: this also zero-fills target columns such as steam_peak_players_<N>d.
    numerical_cols = aggregated_features_df.select_dtypes(include=np.number).columns.tolist()
    cols_to_fill = [col for col in numerical_cols if col not in ['app_id']]
    print(f"\nFilling NaNs with 0 for columns: {cols_to_fill}")
    aggregated_features_df[cols_to_fill] = aggregated_features_df[cols_to_fill].fillna(0)
    print("\nMissing Values after filling with 0:")
    remaining_missing = aggregated_features_df.isnull().sum()
    print(remaining_missing[remaining_missing > 0].sort_values(ascending=False))
    # Note: A more sophisticated strategy (e.g., imputation) might be needed later.

    # 3. Descriptive Statistics
    print("\nDescriptive Statistics:")
    display(aggregated_features_df.describe())

    # 4. Correlation Analysis (Focus on potential features vs. target)
    print("\nCorrelation Matrix (Partial):")
    # Define potential target variable(s) using variable from previous cell
    # Ensure POST_LAUNCH_PEAK_DAYS is defined (it should be from the aggregation cell)
    if 'POST_LAUNCH_PEAK_DAYS' in locals():
        target_var = f'steam_peak_players_{POST_LAUNCH_PEAK_DAYS}d'
    else:
        # Fallback if POST_LAUNCH_PEAK_DAYS is not defined, though it should be.
        # This might indicate an issue with notebook execution order.
        print("Warning: POST_LAUNCH_PEAK_DAYS not found, using default 7 for target_var.")
        target_var = 'steam_peak_players_7d' 

    if target_var in aggregated_features_df.columns:
        # Select numerical columns for correlation
        # Ensure all columns intended for correlation are numeric after imputation
        corr_df = aggregated_features_df.select_dtypes(include=np.number)
        
        # Check if target_var is actually numeric and present in corr_df
        if target_var in corr_df.columns and pd.api.types.is_numeric_dtype(corr_df[target_var]):
            # Calculate correlation with the target variable
            correlations = corr_df.corr()[target_var].sort_values(ascending=False)
            print(f"Correlations with '{target_var}':")
            print(correlations)

            plt.figure(figsize=(15, 12)) # Adjusted size for better readability
            sns.heatmap(corr_df.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
            plt.title('Correlation Matrix of Numerical Features (Post-Imputation)')
            plt.show()
        else:
            print(f"Target variable '{target_var}' is not numeric or not found in numerical columns after imputation. Skipping correlation plot.")
            # Optionally, print all columns of corr_df to debug
            # print("Columns in corr_df for heatmap:", corr_df.columns.tolist())
            # Display a heatmap of all available numeric columns if target is problematic
            if not corr_df.empty and len(corr_df.columns) > 1:
                print("Displaying correlation matrix for all available numerical features:")
                plt.figure(figsize=(15, 12))
                sns.heatmap(corr_df.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
                plt.title('Correlation Matrix of All Numerical Features (Post-Imputation)')
                plt.show()
            else:
                print("Not enough numerical data to plot a correlation matrix.")
    else:
        print(f"Target variable '{target_var}' not found in aggregated_features_df columns.")

else:
    print("Skipping analysis: Aggregated features DataFrame is empty.")

## 6. Persisting the Aggregated Feature Set

**Purpose:** This section is dedicated to saving the final `aggregated_features_df` DataFrame to a persistent storage format, typically a CSV file. This step ensures that the engineered features are readily available for the subsequent modeling phase.

**Process Details:**
*   **Conditional Execution:** The save operation is performed only if the `aggregated_features_df` exists and is not empty. This prevents errors from attempting to save a non-existent or empty DataFrame, which could occur if previous aggregation or cleaning steps failed.
*   **File Path Definition:** A target file path is constructed using `os.path.join()`. The standard location for this output is within the project's `data` directory (e.g., `../data/`), and the file is typically named `aggregated_game_features.csv`.
*   **Saving to CSV:** The `pandas` DataFrame's `to_csv()` method is used to serialize the data. 
    *   `index=False` is specified to prevent `pandas` from writing the DataFrame index as a column in the CSV file, which is generally preferred for cleaner data loading in subsequent steps.
*   **Error Handling:** A try-except block catches and reports any exception raised during the file writing process (e.g., an `OSError` caused by a full disk or insufficient permissions).
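
One detail worth checking when persisting to CSV is that datetime columns are stored as text and must be parsed again on load. The round trip below is a small sketch using a temporary directory and made-up illustrative values. Only the file name `aggregated_game_features.csv` and the column names come from this notebook.

```python
import os
import tempfile

import pandas as pd

# Illustrative values only, not real project data
features = pd.DataFrame({
    "app_id": [1, 2],
    "release_date": pd.to_datetime(["2024-02-01", "2024-03-15"]),
    "steam_peak_players_7d": [300.0, 120.0],
})

with tempfile.TemporaryDirectory() as tmp:
    save_path = os.path.join(tmp, "aggregated_game_features.csv")
    features.to_csv(save_path, index=False)  # index=False: no extra index column
    # Without parse_dates, release_date would come back as plain strings
    restored = pd.read_csv(save_path, parse_dates=["release_date"])

print(restored.dtypes)
```

The consuming notebook would need the same `parse_dates` argument, or an explicit `pd.to_datetime` call, if it relies on `release_date` being a datetime column.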

**Expected Output Upon Successful Execution:**
*   **Console Output:**
    *   A confirmation message indicating that the `aggregated_features_df` has been successfully saved, along with the full path to the output CSV file.
*   **File System:** A new CSV file (e.g., `aggregated_game_features.csv`) will be created or overwritten in the specified `data` directory.
*   **Warning/Error Messages:** If `aggregated_features_df` is empty, a message stating that the save operation is being skipped will be printed. If an error occurs during the save process, an error message detailing the issue will be displayed.

**Significance:** Saving the aggregated features is a critical checkpoint. This persisted dataset forms the direct input for the `03_modeling.ipynb` notebook, where machine learning models will be trained and evaluated. It decouples the feature engineering process from the modeling process, allowing for modularity and easier iteration on either part.

In [None]:
# Save the aggregated features DataFrame
if 'aggregated_features_df' in locals() and not aggregated_features_df.empty:
    save_path = os.path.join("..", "data", "aggregated_game_features.csv")
    try:
        aggregated_features_df.to_csv(save_path, index=False)
        print(f"\nAggregated features saved successfully to: {save_path}")
    except Exception as e:
        print(f"\nError saving aggregated features: {e}")
else:
    print("\nSkipping save: Aggregated features DataFrame is empty.")

## 7. Conclusion and Next Steps: Transition to Predictive Modeling

**Purpose:** This section concludes the feature engineering phase and outlines the critical next steps in the project, primarily focusing on the transition to the predictive modeling stage.

**Summary of Achievements in this Notebook:**
*   Successfully loaded and merged raw, time-stamped data collected by `01_data_collection.ipynb`.
*   Utilized the `DataAggregator` to engineer a comprehensive set of features, distinguishing between pre-release indicators and post-launch performance metrics for each game.
*   Conducted preliminary analysis, including missing value assessment (with a simple imputation strategy) and correlation analysis, to understand the characteristics of the aggregated feature set.
*   Persisted the final `aggregated_features_df` to `aggregated_game_features.csv`, making it ready for input into the modeling phase.

**Path Forward: `03_modeling.ipynb`**

The primary next step is to proceed to the `03_modeling.ipynb` notebook. This notebook will leverage the `aggregated_game_features.csv` file generated here to develop and evaluate predictive models. Key activities in the modeling phase will include:

1.  **Advanced Data Preprocessing & Feature Refinement:**
    *   **Sophisticated Imputation:** Implement more advanced techniques for handling any remaining missing values (e.g., KNN imputation, model-based imputation) if the simple fill-with-zero approach proves insufficient.
    *   **Feature Scaling/Normalization:** Apply appropriate scaling techniques (e.g., StandardScaler, MinMaxScaler) to numerical features to ensure they are on a comparable scale, which is often beneficial for many machine learning algorithms.
    *   **Outlier Detection and Handling:** Investigate and address potential outliers in the feature set that might disproportionately influence model training.
    *   **Categorical Feature Encoding:** Convert any categorical features (if present and relevant) into a numerical format suitable for modeling (e.g., one-hot encoding, label encoding).
    *   **Feature Selection/Dimensionality Reduction:** Employ techniques (e.g., RFE, PCA, feature importance from tree-based models) to select the most relevant features or reduce dimensionality, potentially improving model performance and interpretability.

2.  **Model Selection and Training:**
    *   **Algorithm Exploration:** Experiment with a variety of machine learning algorithms suitable for regression tasks (predicting numerical outcomes like peak player counts). This could include linear models (e.g., Linear Regression, Ridge, Lasso), tree-based models (e.g., Decision Trees, Random Forests, Gradient Boosting Machines like XGBoost, LightGBM), and potentially neural networks.
    *   **Train-Test Split:** Divide the dataset into training and testing sets to evaluate model generalization on unseen data.
    *   **Cross-Validation:** Utilize cross-validation techniques during training to obtain robust performance estimates and mitigate overfitting.

3.  **Hyperparameter Tuning:**
    *   Optimize the hyperparameters of the chosen models using techniques like GridSearchCV or RandomizedSearchCV to maximize their predictive performance.

4.  **Model Evaluation:**
    *   Assess model performance using appropriate regression metrics (e.g., Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared).
    *   Analyze prediction errors and identify areas where the model performs well or poorly.

5.  **Interpretation and Iteration:**
    *   Interpret the results, understand feature importances, and draw conclusions about the factors driving game popularity.
    *   Iterate on the feature engineering and modeling process based on insights gained, potentially revisiting this notebook (`02_feature_engineering.ipynb`) to create new features or refine existing ones.
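
For reference, the regression metrics named in the evaluation step can be computed directly from their definitions. The sketch below uses only NumPy and is meant to clarify the formulas, not to replace a library implementation. The function name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares

    return {
        "mae": float(np.mean(np.abs(err))),
        "mse": mse,
        "rmse": mse ** 0.5,
        "r2": 1.0 - ss_res / ss_tot,  # undefined when y_true is constant
    }
```

Because targets such as peak player counts can span several orders of magnitude, MAE and RMSE tend to be dominated by the largest games. Evaluating on a log-transformed target is one common way to address this, offered here as a suggestion rather than a project decision.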

**Significance:** The successful completion of this feature engineering notebook provides a clean, structured, and feature-rich dataset. This dataset is the cornerstone upon which the predictive models will be built, directly impacting the potential accuracy and insights derived from the final project.

In [None]:
# Final summary message
print("\nFeature Engineering & Aggregation Notebook Complete.")
if 'save_path' in locals() and os.path.exists(save_path):
    print(f"Aggregated features ready for modeling at: {save_path}")
else:
    print("Aggregated features were not saved (likely due to empty data or an error). Check previous cell outputs.")

---
*End of Notebook*