# 02. Feature Engineering & Aggregation

**Purpose:** This notebook takes the raw, time-series data collected by `01_data_collection.ipynb` and transforms it into a structured dataset suitable for machine learning. It uses the `DataAggregator` class to calculate pre-release features and post-launch outcomes for each game.

**Why This Matters:** Raw time-series data isn't directly usable for predicting a single outcome (like peak players). We need to aggregate signals over relevant time windows (e.g., average hype before release, peak engagement after launch) to create meaningful features and target variables.

**What to Expect:** After running this notebook, you will:
1. Load all previously saved raw data files.
2. Aggregate the data using `DataAggregator` to create one row per game.
3. Have a DataFrame containing potential features (pre-release metrics) and target variables (post-launch metrics).
4. Perform initial cleaning and analysis on the aggregated features.
5. Save the aggregated feature set for use in modeling (`03_modeling.ipynb`).

## 1. Setup and Configuration

**Purpose:** Import libraries and configure the environment.

**Why:** Ensures the `DataAggregator` and analysis libraries are available.

In [1]:
# Imports and Setup
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Add src directory to path to import modules
# Assumes notebook is run from the 'notebooks' directory
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom modules
from src.aggregator import DataAggregator
# from src.utils import configure_plotting # Optional

# Configure plotting (optional)
# configure_plotting()
plt.style.use('seaborn-v0_8-whitegrid')

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

Notebook Execution Started: 2025-05-06 21:06:00.805602


## 2. Initialize Data Aggregator

**Purpose:** Create an instance of the `DataAggregator`.

**Why:** The aggregator contains the logic to load and process the historical data files.

In [2]:
# Initialize the aggregator
# Point it to the directory where the collector saved the raw data files
aggregator = DataAggregator(data_dir="../data")

## 3. Load Merged Historical Data

**Purpose:** Load all raw data files saved by the collector into a single DataFrame.

**Why:** Provides the complete time-series dataset needed for aggregation.

**Expected Output:** A large DataFrame containing records from all collection runs, sorted by timestamp.

In [3]:
# Load and merge all data files matching the default pattern 'steam_data_*.csv*'
try:
    merged_df = aggregator.load_merged_data()
    if not merged_df.empty:
        print("\n--- Merged Raw Data Sample ---")
        display(merged_df.head())
        print(f"\nShape of merged data: {merged_df.shape}")
        print(f"Date range: {merged_df['timestamp'].min()} to {merged_df['timestamp'].max()}")
    else:
        print("No raw data files found or loaded. Cannot proceed with aggregation.")
except Exception as e:
    print(f"An error occurred loading merged data: {e}")
    merged_df = pd.DataFrame() # Ensure df is empty on error

Loading data files from '..\data' matching 'steam_data_*.csv*'...
Loaded and merged 20 records from 1 files.

--- Merged Raw Data Sample ---


  df = pd.read_csv(f,


Unnamed: 0,app_id,name,category,timestamp,player_count,twitch_viewer_count,google_trends_avg,reddit_subscribers,reddit_active_users,reddit_recent_posts,twitter_recent_count,youtube_total_views,youtube_avg_views,youtube_avg_likes,release_date,metacritic_score,genres,price,is_free
0,1551360,Forza Horizon 5,experimental,2025-05-06 20:51:13.583959,11277,490,,414894.0,52.0,0.0,,15747577,3149515.4,85406.0,2021-11-08,,"Action,Adventure,Racing,Simulation,Sports",$29.99,False
1,1599340,Lost Ark,successful,2025-05-06 20:51:13.583959,19940,1295,,280334.0,147.0,0.0,,2383545,476709.0,4853.6,2022-02-11,,"Action,Adventure,Massively Multiplayer,RPG,Fre...",,True
2,271590,Grand Theft Auto V Legacy,successful,2025-05-06 20:51:13.583959,37633,130838,,,,,,275449,55089.8,632.8,2015-04-13,96.0,"Action,Adventure",,False
3,1086940,Baldur's Gate 3,successful,2025-05-06 20:51:13.583959,83148,2482,,3139841.0,750.0,0.0,,13361100,2672220.0,82628.8,2023-08-03,96.0,"Adventure,RPG,Strategy",$59.99,False
4,730,Counter-Strike 2,successful,2025-05-06 20:51:13.583959,581191,46588,,2805817.0,599.0,0.0,,2113614,422722.8,3881.6,2012-08-21,,"Action,Free To Play",,True



Shape of merged data: (20, 19)
Date range: 2025-05-06 20:51:13.583959 to 2025-05-06 20:51:13.583959


## 4. Aggregate Features

**Purpose:** Run the core aggregation logic.

**Why:** Transforms the time-series data into one row per game, calculating pre-release features and post-launch outcomes based on release dates.

**Expected Output:** A new DataFrame where each row represents a game, with columns for aggregated features and outcomes.

In [4]:
# Aggregate features if merged data is available
aggregated_features_df = pd.DataFrame() # Initialize empty
if 'merged_df' in locals() and not merged_df.empty:
    print("\nStarting feature aggregation...")
    try:
        # Define aggregation windows (can be adjusted)
        PRE_RELEASE_DAYS = 30
        POST_LAUNCH_PEAK_DAYS = 7
        POST_LAUNCH_AVG_DAYS = 30

        aggregated_features_df = aggregator.aggregate_features(
            merged_data=merged_df,
            pre_release_days=PRE_RELEASE_DAYS,
            post_launch_days_peak=POST_LAUNCH_PEAK_DAYS,
            post_launch_days_avg=POST_LAUNCH_AVG_DAYS
        )

        if not aggregated_features_df.empty:
            print("\n--- Aggregated Features Sample ---")
            display(aggregated_features_df.head())
            print(f"\nShape of aggregated data: {aggregated_features_df.shape}")
            print("\nColumns:", aggregated_features_df.columns.tolist())
        else:
            print("Aggregation resulted in an empty DataFrame. Check data quality (e.g., release dates, sufficient time range).")
    except Exception as e:
        print(f"An error occurred during feature aggregation: {e}")
else:
    print("Skipping aggregation because merged data is empty.")


Starting feature aggregation...
Preparing data for aggregation...
Aggregating features for 20 unique games...
  Processing Team Fortress 2 (App ID: 440, Release: 2007-10-10)
  Processing Dota 2 (App ID: 570, Release: 2013-07-09)
  Processing Counter-Strike 2 (App ID: 730, Release: 2012-08-21)
  Processing Kenshi (App ID: 233860, Release: 2018-12-06)
  Processing Rocket League® (App ID: 252950, Release: 2015-07-06)
  Processing Grand Theft Auto V Legacy (App ID: 271590, Release: 2015-04-13)
  Skipping game 292030 (The Witcher 3: Wild Hunt): Missing release date.
  Processing Tom Clancy's Rainbow Six® Siege (App ID: 359550, Release: 2015-12-01)
  Skipping game 578080 (PUBG: BATTLEGROUNDS): Missing release date.
  Processing Valheim (App ID: 892970, Release: 2021-02-02)
  Processing Baldur's Gate 3 (App ID: 1086940, Release: 2023-08-03)
  Processing Cyberpunk 2077 (App ID: 1091500, Release: 2020-12-09)
  Processing Apex Legends™ (App ID: 1172470, Release: 2020-11-04)
  Processing Sea of 

Unnamed: 0,app_id,game_name,release_date,metacritic_score,google_trends_avg_pre,reddit_posts_avg_pre,twitter_count_avg_pre,reddit_subs_pre,reddit_active_pre,steam_peak_players_7d,twitch_peak_viewers_7d,steam_avg_players_30d,twitch_avg_viewers_30d
0,440,Team Fortress 2,2007-10-10,92.0,,,,,,,,,
1,570,Dota 2,2013-07-09,90.0,,,,,,,,,
2,730,Counter-Strike 2,2012-08-21,,,,,,,,,,
3,233860,Kenshi,2018-12-06,75.0,,,,,,,,,
4,252950,Rocket League®,2015-07-06,86.0,,,,,,,,,



Shape of aggregated data: (18, 13)

Columns: ['app_id', 'game_name', 'release_date', 'metacritic_score', 'google_trends_avg_pre', 'reddit_posts_avg_pre', 'twitter_count_avg_pre', 'reddit_subs_pre', 'reddit_active_pre', 'steam_peak_players_7d', 'twitch_peak_viewers_7d', 'steam_avg_players_30d', 'twitch_avg_viewers_30d']


## 5. Initial Feature Analysis and Cleaning

**Purpose:** Examine the aggregated features, handle missing values, and perform basic analysis.

**Why:** Prepare the data for modeling and gain initial insights.

**Expected Output:**
- Summary statistics.
- Information about missing values.
- Potentially some visualizations (e.g., correlations, distributions).

In [5]:
# Analyze the aggregated features if available
if 'aggregated_features_df' in locals() and not aggregated_features_df.empty:
    print("\n--- Initial Analysis of Aggregated Features ---")

    # 1. Basic Info and Data Types
    print("\nBasic Info:")
    aggregated_features_df.info()

    # 2. Missing Value Analysis
    print("\nMissing Values (%):")
    missing_percent = (aggregated_features_df.isnull().sum() / len(aggregated_features_df)) * 100
    print(missing_percent[missing_percent > 0].sort_values(ascending=False))

    # --- Simple Missing Value Handling Strategy (Example) ---
    # Option 1: Fill numerical features with 0 or mean/median
    # Option 2: Drop columns/rows with too many missing values
    # For now, let's fill numerical NaNs with 0 for simplicity, but this might need refinement
    numerical_cols = aggregated_features_df.select_dtypes(include=np.number).columns.tolist()
    # Exclude identifiers like app_id
    cols_to_fill = [col for col in numerical_cols if col not in ['app_id']]
    # print(f"\nFilling NaNs with 0 for columns: {cols_to_fill}")
    # aggregated_features_df[cols_to_fill] = aggregated_features_df[cols_to_fill].fillna(0)
    # print("\nMissing Values after filling with 0:")
    # print(aggregated_features_df.isnull().sum().sort_values(ascending=False))
    # Note: A more sophisticated strategy (e.g., imputation) might be needed later.

    # 3. Descriptive Statistics
    print("\nDescriptive Statistics:")
    display(aggregated_features_df.describe())

    # 4. Correlation Analysis (Focus on potential features vs. target)
    print("\nCorrelation Matrix (Partial):")
    # Define potential target variable(s)
    target_var = f'steam_peak_players_{POST_LAUNCH_PEAK_DAYS}d' # Example target
    if target_var in aggregated_features_df.columns:
        # Select numerical columns for correlation
        corr_df = aggregated_features_df.select_dtypes(include=np.number)
        # Calculate correlation with the target variable
        correlations = corr_df.corr()[target_var].sort_values(ascending=False)
        print(f"Correlations with '{target_var}':")
        print(correlations)

        # Optional: Plot heatmap
        # plt.figure(figsize=(12, 10))
        # sns.heatmap(corr_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
        # plt.title('Correlation Matrix of Numerical Features')
        # plt.show()
    else:
        print(f"Target variable '{target_var}' not found in columns.")

else:
    print("Skipping analysis: Aggregated features DataFrame is empty.")


--- Initial Analysis of Aggregated Features ---

Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   app_id                  18 non-null     int64         
 1   game_name               18 non-null     object        
 2   release_date            18 non-null     datetime64[ns]
 3   metacritic_score        9 non-null      float64       
 4   google_trends_avg_pre   0 non-null      object        
 5   reddit_posts_avg_pre    0 non-null      object        
 6   twitter_count_avg_pre   0 non-null      object        
 7   reddit_subs_pre         0 non-null      object        
 8   reddit_active_pre       0 non-null      object        
 9   steam_peak_players_7d   0 non-null      object        
 10  twitch_peak_viewers_7d  0 non-null      object        
 11  steam_avg_players_30d   0 non-null      object    

Unnamed: 0,app_id,release_date,metacritic_score
count,18.0,18,9.0
mean,890915.6,2018-11-17 08:00:00,89.222222
min,440.0,2007-10-10 00:00:00,75.0
25%,257610.0,2015-08-12 00:00:00,86.0
50%,1089220.0,2020-11-21 12:00:00,90.0
75%,1235020.0,2022-01-18 06:00:00,94.0
max,1962660.0,2023-08-03 00:00:00,96.0
std,665389.5,,6.59124



Correlation Matrix (Partial):


KeyError: 'steam_peak_players_7d'

## 6. Save Aggregated Features

**Purpose:** Save the processed and potentially cleaned feature set.

**Why:** This dataset will be the direct input for the modeling notebook (`03_modeling.ipynb`).

**Expected Output:** Confirmation of the file save location.

In [None]:
# Save the aggregated features DataFrame
if 'aggregated_features_df' in locals() and not aggregated_features_df.empty:
    save_path = os.path.join("..", "data", "aggregated_game_features.csv")
    try:
        aggregated_features_df.to_csv(save_path, index=False)
        print(f"\nAggregated features saved successfully to: {save_path}")
    except Exception as e:
        print(f"\nError saving aggregated features: {e}")
else:
    print("\nSkipping save: Aggregated features DataFrame is empty.")

## 7. Next Steps

**Purpose:** Outline the path forward.

**Why:** Guides the project towards the final modeling stage.

**Next Actions:**
1.  **Refine Feature Engineering:** Based on the initial analysis, perform more advanced feature engineering (e.g., creating interaction terms, scaling features, better imputation).
2.  **Modeling:** Proceed to `03_modeling.ipynb` to train and evaluate machine learning models using the saved `aggregated_game_features.csv` dataset.

In [None]:
# Final summary message
print("\nFeature Engineering & Aggregation Notebook Complete.")
if 'save_path' in locals() and os.path.exists(save_path):
    print(f"Aggregated features ready for modeling at: {save_path}")
else:
    print("Aggregated features were not saved (likely due to empty data or an error). Check previous cell outputs.")

---
*End of Notebook*