# 01. Data Collection

**Purpose:** This notebook demonstrates how to collect real-time game data using our custom `DataCollector` class. It fetches data from Steam (player counts, details), Twitch (viewership), and external sources like Google Trends, Reddit, Twitter, and YouTube.

**Why This Matters:** This automated collection process provides fresh, multi-source data for building our time series dataset and, ultimately, the game popularity prediction model.

**What to Expect:** After running this notebook, you will:
1. Successfully collect current data from multiple APIs.
2. Save the combined data in a structured format (compressed CSV).
3. Understand the enhanced data collection workflow.
4. Have the first data point for building a historical dataset.

## 1. Setup and Configuration

**Purpose:** Import necessary libraries and configure the environment.

**Why:** Ensures the `DataCollector` and its dependencies can be found and used correctly.

In [1]:
# Imports and Setup
import sys
import os
import pandas as pd
from datetime import datetime

# Add the project root (the parent of 'notebooks') to sys.path so the `src` package can be imported
# Assumes the notebook is run from the 'notebooks' directory
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom modules
from src.data_collector import DataCollector
# from src.utils import configure_plotting # Optional: if plotting is needed here

# Configure plotting (optional)
# configure_plotting()

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

Notebook Execution Started: 2025-05-06 22:00:33.338323
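
The path setup above depends on the notebook being launched from the `notebooks` directory. As a hedged alternative, the sketch below walks upward from a starting directory until it finds one that contains a `src` folder. The helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of this project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Put the project root on sys.path once so `import src...` resolves."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# Typical use inside the notebook (assumes a `src` directory exists above cwd):
# add_to_sys_path(find_project_root(Path.cwd()))
```

This keeps the import working whether the kernel starts in the project root or in `notebooks`, at the cost of assuming that the first ancestor containing `src` is the right one.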


## 2. Initialize Data Collector

**Purpose:** Create an instance of the `DataCollector`.

**Why:** The collector manages API connections, game lists, and data storage. It automatically loads API keys from the `.env` file in the project root.

**Expected Output:** Confirmation of initialization and the number of games being tracked.

In [2]:
# Initialize the collector
# It will use the default game lists defined within the class
# and look for API keys in the .env file in the project root
# Ensure the data_dir path is correct relative to the notebook location
collector = DataCollector(data_dir="../data")

# Optionally, view the game IDs being tracked
all_game_ids = collector.get_all_game_ids()
print(f"Tracking {len(all_game_ids)} unique game IDs across categories.")
# print(collector.game_categories) # Uncomment to see the full category list

TwitchAPIConnector initialized
[ExternalDataCollector] Attempting to load .env file...
[ExternalDataCollector] .env file found and loaded: True
Reddit client initialized successfully (read-only mode).
Error initializing Twitter client: BaseClient.__init__() got an unexpected keyword argument 'timeout'
[ExternalDataCollector._init_pytrends] Detected pytrends library version: unknown (attribute missing)
         Consider checking your pytrends installation (target version 4.7.0+ for full support).
Pytrends client initialized successfully (fallback, with requests_args for SSL verification).
ExternalDataCollector initialized
Tracking 20 unique game IDs across categories.
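
Because the collector reads credentials from `.env`, a quick pre-flight check can show which values are unset before any API call is made. This is a minimal sketch. The variable names in `REQUIRED_KEYS` are assumptions chosen for illustration, so substitute the names your `.env` file actually defines.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical variable names; the project's real .env keys may differ.
REQUIRED_KEYS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "YOUTUBE_API_KEY",
)


def missing_credentials(
    required: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_CLIENT_SECRET": "  "}
print(missing_credentials(REQUIRED_KEYS, example_env))
# → ['REDDIT_CLIENT_SECRET', 'REDDIT_USER_AGENT', 'YOUTUBE_API_KEY']
```

A check like this only detects absent values. It would not catch the Twitter client error in the output above, which is raised by an unexpected `timeout` keyword argument rather than by a missing key.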




## 3. Collect Current Data

**Purpose:** Execute the main data collection process.

**Why:** This fetches the latest data from all configured sources (Steam, Twitch, Google Trends, Reddit, Twitter, YouTube).

**IMPORTANT:** Before running this step:
1.  **Check API Keys/Credentials:** Ensure your API keys/secrets/tokens in the `.env` file (located in the project root `c:\Users\lucav\Github\Game-Popularity-Prediction-Modelv2`) are correct, especially for **Reddit** (Client ID, Secret, User Agent) and **YouTube** (API Key). Incorrect credentials often lead to `401` or `403` errors.
2.  **Check API Quotas/Rate Limits:** APIs like YouTube, Google Trends, and Twitter have usage limits (quotas) and rate limits. If you run this frequently or with many games, you might hit these limits, resulting in `403` (Quota Exceeded) or `429` (Too Many Requests) errors. Check your API provider dashboards (e.g., Google Cloud Console for YouTube) if you encounter persistent errors.
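
One common way to cope with `429` responses is to retry with exponential backoff. The sketch below is illustrative and is not how `DataCollector` handles rate limits. `RateLimitError` is a hypothetical exception standing in for whatever the underlying client raises on a `429`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 (Too Many Requests) failure."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. Backoff helps with short-lived rate limits, but it will not recover from an exhausted daily quota.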

**Expected Output:** 
- Status messages indicating which data sources are being queried.
- Potential warnings or errors if API keys are invalid, quotas are exceeded, or rate limits are hit.
- A DataFrame containing the combined data (if collection is at least partially successful).
- A summary of the collected data shape and columns.

In [3]:
# Collect data for all tracked games
# include_details=True is needed to get game names for Twitch/External lookups
# include_twitch=True fetches Twitch viewership
# include_external=True fetches Google Trends, Reddit, Twitter, YouTube data
print("Starting data collection...")
print("This may take a few minutes depending on the number of games and API responsiveness.")
try:
    current_data_df = collector.collect_current_data(
        include_details=True,
        include_twitch=True,
        include_external=True
    )
    print("\n--- Collected Data Sample ---")
    # Display relevant columns, especially the newly added ones
    display_cols = [
        'app_id', 'name', 'category', 'timestamp', 'player_count', 'twitch_viewer_count',
        'google_trends_avg', 'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 'twitter_recent_count',
        'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes', 'release_date'
    ]
    display_cols_present = [col for col in display_cols if col in current_data_df.columns]
    display(current_data_df[display_cols_present].head())
    print(f"\nShape: {current_data_df.shape}")
    print("\nColumns:", current_data_df.columns.tolist())
except Exception as e:
    print(f"\nAn error occurred during data collection: {e}")
    # Optionally re-raise while debugging; a bare `raise` keeps the original traceback
    current_data_df = pd.DataFrame() # Ensure df exists but is empty on error

Starting data collection...
This may take a few minutes depending on the number of games and API responsiveness.
[2025-05-06 22:00:41.749294] Collecting data for 20 games...
  Fetching Twitch data for: Forza Horizon 5 (Querying as: 'Forza Horizon 5')
Successfully obtained Twitch Access Token.
    > Twitch ID: 1757732267, Viewers: 530
  Fetching Twitch data for: Kenshi (Querying as: 'Kenshi')
    > Twitch ID: 34025, Viewers: 39
  Fetching Twitch data for: Sea of Thieves: 2025 Edition (Querying as: 'Sea of Thieves')
    > Twitch ID: 490377, Viewers: 1474
  Fetching Twitch data for: NARAKA: BLADEPOINT (Querying as: 'NARAKA: BLADEPOINT')
    > Twitch ID: 515474, Viewers: 402
  Fetching Twitch data for: Rocket League® (Querying as: 'Rocket League')
    > Twitch ID: 30921, Viewers: 6465
  Fetching Twitch data for: PUBG: BATTLEGROUNDS (Querying as: 'PUBG: BATTLEGROUNDS')
    > Twitch ID: 493057, Viewers: 6225
  Fetching Twitch data for: Call of Duty®: Modern Warfare® II (Querying as: 'Call of



Error fetching Google Trends data for 'Valheim': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/valheim
    > Reddit Data (Subscribers: 532344, Active: 86, Posts_week: 250)
    > Querying Twitter for: "Valheim" OR #Valheim
Error: Twitter client not initialized.
    > Querying YouTube for: "Valheim" official trailer OR gameplay
    > YouTube Search Query: "Valheim" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=1695785, Avg Views=339157, Avg Likes=7867
  Processing external signals for: NARAKA: BLADEPOINT (Timeout: 30s)
    > Querying Google Trends for: 'NARAKA: BLADEPOINT'




Error fetching Google Trends data for 'NARAKA: BLADEPOINT': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/NarakaBladePoint
    > Reddit Data (Subscribers: 32179, Active: 6, Posts_week: 40)
    > Querying Twitter for: "NARAKA BLADEPOINT" OR #NARAKABLADEPOINT
Error: Twitter client not initialized.
    > Querying YouTube for: "NARAKA: BLADEPOINT" official trailer OR gameplay
    > YouTube Search Query: "NARAKA: BLADEPOINT" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=1556839, Avg Views=311368, Avg Likes=5218
  Processing external signals for: Sea of Thieves: 2025 Edition (Timeout: 30s)
    > Querying Google Trends for: 'Sea of Thieves: 2025 Edition'




Error fetching Google Trends data for 'Sea of Thieves: 2025 Edition': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/seaofthieves2025edition
Info: Subreddit r/seaofthieves2025edition not found.
    > Querying Twitter for: "Sea of Thieves: 2025 Edition" OR #SeaofThieves2025Edition
Error: Twitter client not initialized.
    > Querying YouTube for: "Sea of Thieves: 2025 Edition" official trailer OR gameplay
    > YouTube Search Query: "Sea of Thieves: 2025 Edition" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=343551, Avg Views=68710, Avg Likes=2705
  Processing external signals for: Team Fortress 2 (Timeout: 30s)
    > Querying Google Trends for: 'Team Fortress 2'




Error fetching Google Trends data for 'Team Fortress 2': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/tf2
    > Reddit Data (Subscribers: 897127, Active: 218, Posts_week: 250)
    > Querying Twitter for: "Team Fortress 2" OR #TF2
Error: Twitter client not initialized.
    > Querying YouTube for: "Team Fortress 2" official trailer OR gameplay
    > YouTube Search Query: "Team Fortress 2" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=125922820, Avg Views=25184564, Avg Likes=309560
  Processing external signals for: PUBG: BATTLEGROUNDS (Timeout: 30s)
    > Querying Google Trends for: 'PUBG'




Error fetching Google Trends data for 'PUBG': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/PUBATTLEGROUNDS
    > Reddit Data (Subscribers: 2653057, Active: 59, Posts_week: 108)
    > Querying Twitter for: "PUBG" OR #PUBG
Error: Twitter client not initialized.
    > Querying YouTube for: PUBG gameplay
    > YouTube Search Query: PUBG gameplay
    > YouTube Stats (Top 5 videos): Total Views=889936, Avg Views=177987, Avg Likes=2766
  Processing external signals for: Dota 2 (Timeout: 30s)
    > Querying Google Trends for: 'Dota 2'




Error fetching Google Trends data for 'Dota 2': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/DotA2
    > Reddit Data (Subscribers: 1709804, Active: 478, Posts_week: 250)
    > Querying Twitter for: "Dota 2" OR #dota2
Error: Twitter client not initialized.
    > Querying YouTube for: "Dota 2" official trailer OR gameplay
    > YouTube Search Query: "Dota 2" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=13992682, Avg Views=2798536, Avg Likes=23234
  Processing external signals for: Apex Legends™ (Timeout: 30s)
    > Querying Google Trends for: 'Apex Legends'




Error fetching Google Trends data for 'Apex Legends': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/apexlegends
    > Reddit Data (Subscribers: 3024436, Active: 351, Posts_week: 250)
    > Querying Twitter for: "Apex Legends" OR #ApexLegends
Error: Twitter client not initialized.
    > Querying YouTube for: "Apex Legends" gameplay
    > YouTube Search Query: "Apex Legends" gameplay
    > YouTube Stats (Top 5 videos): Total Views=233290, Avg Views=46658, Avg Likes=1424
  Processing external signals for: Cyberpunk 2077 (Timeout: 30s)
    > Querying Google Trends for: 'Cyberpunk 2077'




Error fetching Google Trends data for 'Cyberpunk 2077': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/cyberpunkgame
    > Reddit Data (Subscribers: 2208736, Active: 319, Posts_week: 250)
    > Querying Twitter for: "Cyberpunk 2077" OR #Cyberpunk2077
Error: Twitter client not initialized.
    > Querying YouTube for: "Cyberpunk 2077" official trailer OR gameplay
    > YouTube Search Query: "Cyberpunk 2077" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=78258740, Avg Views=15651748, Avg Likes=293753
  Processing external signals for: Forza Horizon 5 (Timeout: 30s)
    > Querying Google Trends for: 'Forza Horizon 5'




Error fetching Google Trends data for 'Forza Horizon 5': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/ForzaHorizon
    > Reddit Data (Subscribers: 414910, Active: 65, Posts_week: 250)
    > Querying Twitter for: "Forza Horizon 5" OR #ForzaHorizon5
Error: Twitter client not initialized.
    > Querying YouTube for: "Forza Horizon 5" official trailer OR gameplay
    > YouTube Search Query: "Forza Horizon 5" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=15747930, Avg Views=3149586, Avg Likes=85407
  Processing external signals for: Call of Duty®: Modern Warfare® II (Timeout: 30s)
    > Querying Google Trends for: 'Call of Duty®: Modern Warfare® II'




Error fetching Google Trends data for 'Call of Duty®: Modern Warfare® II': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/callofdutymodernwarfareii
Info: Subreddit r/callofdutymodernwarfareii not found.
    > Querying Twitter for: "Call of Duty®: Modern Warfare® II" OR #CallofDutyModernWarfareII
Error: Twitter client not initialized.
    > Querying YouTube for: "Call of Duty®: Modern Warfare® II" official trailer OR gameplay
    > YouTube Search Query: "Call of Duty®: Modern Warfare® II" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=3666091, Avg Views=733218, Avg Likes=18898
  Processing external signals for: Rocket League® (Timeout: 30s)
    > Querying Google Trends for: 'Rocket League®'




Error fetching Google Trends data for 'Rocket League®': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/rocketleague
    > Reddit Data (Subscribers: 1833512, Active: 205, Posts_week: 250)
    > Querying Twitter for: "Rocket League®" OR #RocketLeague
Error: Twitter client not initialized.
    > Querying YouTube for: "Rocket League®" official trailer OR gameplay
    > YouTube Search Query: "Rocket League®" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=4780631, Avg Views=956126, Avg Likes=20155
  Processing external signals for: Tom Clancy's Rainbow Six® Siege (Timeout: 30s)
    > Querying Google Trends for: 'Rainbow Six Siege'




Error fetching Google Trends data for 'Rainbow Six Siege': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/Rainbow6
    > Reddit Data (Subscribers: 2102446, Active: 123, Posts_week: 250)
    > Querying Twitter for: "Rainbow Six Siege" OR #RainbowSixSiege OR #R6S
Error: Twitter client not initialized.
    > Querying YouTube for: "Rainbow Six Siege" gameplay OR R6S
    > YouTube Search Query: "Rainbow Six Siege" gameplay OR R6S
    > YouTube Stats (Top 5 videos): Total Views=977773, Avg Views=195555, Avg Likes=4376
  Processing external signals for: Counter-Strike 2 (Timeout: 30s)
    > Querying Google Trends for: 'Counter-Strike 2'




Error fetching Google Trends data for 'Counter-Strike 2': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/GlobalOffensive
    > Reddit Data (Subscribers: 2805809, Active: 540, Posts_week: 250)
    > Querying Twitter for: "Counter-Strike 2" OR #CS2
Error: Twitter client not initialized.
    > Querying YouTube for: "Counter-Strike 2" gameplay OR CS2
    > YouTube Search Query: "Counter-Strike 2" gameplay OR CS2
    > YouTube Stats (Top 5 videos): Total Views=2113742, Avg Views=422748, Avg Likes=3882
  Processing external signals for: Call of Duty® (Timeout: 30s)
    > Querying Google Trends for: 'Call of Duty'




Error fetching Google Trends data for 'Call of Duty': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/ModernWarfareIII
    > Reddit Data (Subscribers: 347724, Active: 29, Posts_week: 57)
    > Querying Twitter for: #MW3 OR #Warzone OR "Call of Duty"
Error: Twitter client not initialized.
    > Querying YouTube for: Modern Warfare 3 gameplay OR Warzone gameplay
    > YouTube Search Query: Modern Warfare 3 gameplay OR Warzone gameplay
    > YouTube Stats (Top 5 videos): Total Views=12379615, Avg Views=2475923, Avg Likes=29736
  Processing external signals for: Kenshi (Timeout: 30s)
    > Querying Google Trends for: 'Kenshi'




Error fetching Google Trends data for 'Kenshi': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/Kenshi
    > Reddit Data (Subscribers: 166219, Active: 118, Posts_week: 250)
    > Querying Twitter for: "Kenshi game" OR #Kenshi
Error: Twitter client not initialized.
    > Querying YouTube for: "Kenshi" official trailer OR gameplay
    > YouTube Search Query: "Kenshi" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=844159, Avg Views=168832, Avg Likes=7749
  Processing external signals for: Grand Theft Auto V Legacy (Timeout: 30s)
    > Querying Google Trends for: 'Grand Theft Auto V Legacy'




Error fetching Google Trends data for 'Grand Theft Auto V Legacy': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/grandtheftautovlegacy
Info: Subreddit r/grandtheftautovlegacy is redirected. Consider updating the name.
    > Querying Twitter for: "Grand Theft Auto V Legacy" OR #GrandTheftAutoVLegacy
Error: Twitter client not initialized.
    > Querying YouTube for: "Grand Theft Auto V Legacy" official trailer OR gameplay
    > YouTube Search Query: "Grand Theft Auto V Legacy" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=210602, Avg Views=42120, Avg Likes=427
  Processing external signals for: The Witcher 3: Wild Hunt (Timeout: 30s)
    > Querying Google Trends for: 'Witcher 3'




Error fetching Google Trends data for 'Witcher 3': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/witcher
    > Reddit Data (Subscribers: 1283267, Active: 63, Posts_week: 76)
    > Querying Twitter for: "Witcher 3" OR #Witcher3
Error: Twitter client not initialized.
    > Querying YouTube for: "The Witcher 3: Wild Hunt" official trailer OR gameplay
    > YouTube Search Query: "The Witcher 3: Wild Hunt" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=9818324, Avg Views=1963665, Avg Likes=22864
  Processing external signals for: Lost Ark (Timeout: 30s)
    > Querying Google Trends for: 'Lost Ark'




Error fetching Google Trends data for 'Lost Ark': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/lostarkgame
    > Reddit Data (Subscribers: 280332, Active: 303, Posts_week: 161)
    > Querying Twitter for: "Lost Ark" OR #LostArkGame
Error: Twitter client not initialized.
    > Querying YouTube for: "Lost Ark" official trailer OR gameplay
    > YouTube Search Query: "Lost Ark" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=2383545, Avg Views=476709, Avg Likes=4854
  Processing external signals for: Baldur's Gate 3 (Timeout: 30s)
    > Querying Google Trends for: 'Baldur's Gate 3'




Error fetching Google Trends data for 'Baldur's Gate 3': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/BaldursGate3
    > Reddit Data (Subscribers: 3139889, Active: 696, Posts_week: 250)
    > Querying Twitter for: "Baldurs Gate 3" OR #BaldursGate3 OR #BG3
Error: Twitter client not initialized.
    > Querying YouTube for: "Baldurs Gate 3" gameplay OR BG3
    > YouTube Search Query: "Baldurs Gate 3" gameplay OR BG3
    > YouTube Stats (Top 5 videos): Total Views=13362300, Avg Views=2672460, Avg Likes=82632
  Processing external signals for: ELDEN RING (Timeout: 30s)
    > Querying Google Trends for: 'ELDEN RING'




Error fetching Google Trends data for 'ELDEN RING': The request failed: Google returned a response with code 429
    > Google Trends: Failed or no data.
    > Querying Reddit subreddit: r/Eldenring
    > Reddit Data (Subscribers: 4323617, Active: 755, Posts_week: 250)
    > Querying Twitter for: "ELDEN RING" OR #ELDENRING
Error: Twitter client not initialized.
    > Querying YouTube for: "ELDEN RING" official trailer OR gameplay
    > YouTube Search Query: "ELDEN RING" official trailer OR gameplay
    > YouTube Stats (Top 5 videos): Total Views=32520233, Avg Views=6504047, Avg Likes=150805
[2025-05-06 22:03:13.363554] Finished collecting external signals.
[2025-05-06 22:03:13.363595] Merging external signals into main dataset...
[2025-05-06 22:03:13.363752] Finished merging external signals.
[2025-05-06 22:03:13.404429] Data collection complete. DataFrame shape: (20, 19)

--- Collected Data Sample ---


|  | app_id | name | category | timestamp | player_count | twitch_viewer_count | google_trends_avg | reddit_subscribers | reddit_active_users | reddit_recent_posts | twitter_recent_count | youtube_total_views | youtube_avg_views | youtube_avg_likes | release_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1551360 | Forza Horizon 5 | experimental | 2025-05-06T22:01:00.908708 | 10732 | 530 |  | 414910.0 | 65.0 | 250.0 |  | 15747930 | 3149586.0 | 85406.6 | Nov 8, 2021 |
| 1 | 233860 | Kenshi | declining | 2025-05-06T22:01:00.908708 | 2962 | 39 |  | 166219.0 | 118.0 | 250.0 |  | 844159 | 168831.8 | 7749.0 | Dec 6, 2018 |
| 2 | 1172620 | Sea of Thieves: 2025 Edition | experimental | 2025-05-06T22:01:00.908708 | 6001 | 1474 |  |  |  |  |  | 343551 | 68710.2 | 2705.2 | Jun 3, 2020 |
| 3 | 1203220 | NARAKA: BLADEPOINT | experimental | 2025-05-06T22:01:00.908708 | 20303 | 402 |  | 32179.0 | 6.0 | 40.0 |  | 1556839 | 311367.8 | 5218.4 | Aug 11, 2021 |
| 4 | 252950 | Rocket League® | declining | 2025-05-06T22:01:00.908708 | 19362 | 6465 |  | 1833512.0 | 205.0 | 250.0 |  | 4780631 | 956126.2 | 20154.6 | Jul 6, 2015 |



Shape: (20, 19)

Columns: ['app_id', 'name', 'category', 'timestamp', 'player_count', 'twitch_viewer_count', 'google_trends_avg', 'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 'twitter_recent_count', 'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes', 'release_date', 'metacritic_score', 'genres', 'price', 'is_free']
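
The log above shows several lookups that used the raw Steam title: Google Trends was queried for `'Rocket League®'`, and the subreddit derived from `Sea of Thieves: 2025 Edition` was not found. A hedged sketch of one way to clean titles before querying follows. `QUERY_OVERRIDES` and `clean_query_name` are hypothetical names, and the override entries are illustrative rather than taken from the project.

```python
import re
from typing import Mapping

# Illustrative overrides for titles whose store name differs from the common name.
QUERY_OVERRIDES = {
    "Sea of Thieves: 2025 Edition": "Sea of Thieves",
    "Grand Theft Auto V Legacy": "Grand Theft Auto V",
}

_TRADEMARK_SYMBOLS = re.compile(r"[®™©]")
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_query_name(name: str, overrides: Mapping[str, str] = QUERY_OVERRIDES) -> str:
    """Return a search-friendly title: explicit override first, else strip symbols."""
    if name in overrides:
        return overrides[name]
    cleaned = _TRADEMARK_SYMBOLS.sub("", name)
    return _WHITESPACE_RUN.sub(" ", cleaned).strip()


print(clean_query_name("Rocket League®"))                # → Rocket League
print(clean_query_name("Sea of Thieves: 2025 Edition"))  # → Sea of Thieves
```

An explicit override table is easier to audit than pattern-based rules for suffixes such as "Edition" or "Legacy", which can remove words that belong to a game's real title.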


## 4. Save Data to File

**Purpose:** Save the collected DataFrame for future use.

**Why:** Persistent storage allows us to build a historical dataset over time, which is essential for the `DataAggregator` and model training.

**Expected Output:** Confirmation of the file save location.

In [4]:
# Save the collected data to a compressed CSV file
try:
    # Check if the DataFrame exists and is not empty
    if 'current_data_df' in locals() and not current_data_df.empty:
        saved_filepath = collector.save_data(data=current_data_df, compress=True)
        print(f"\nData successfully saved to: {saved_filepath}")
        
        # Optional: Verify loading back
        # loaded_data = collector.load_data(saved_filepath)
        # print(f"Verification - Loaded {len(loaded_data)} rows from saved file")
    else:
        print("\nSkipping save: No data collected or collection failed.")
except Exception as e:
    print(f"\nAn error occurred while saving data: {e}")

[2025-05-06 22:04:13.075969] Data saved to ..\data\steam_data_2025-05-06-22-04.csv.gz

Data successfully saved to: ..\data\steam_data_2025-05-06-22-04.csv.gz
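
The saved file name follows a `steam_data_<timestamp>.csv.gz` pattern. How `collector.save_data` produces it is not shown in this notebook, so the following is only a standard-library sketch of writing a timestamped, gzip-compressed CSV. `save_snapshot` is a hypothetical helper.

```python
import csv
import gzip
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional


def save_snapshot(
    rows: List[Dict[str, object]], data_dir: str, now: Optional[datetime] = None
) -> Path:
    """Write `rows` to <data_dir>/steam_data_<YYYY-MM-DD-HH-MM>.csv.gz and return the path."""
    if not rows:
        raise ValueError("No rows to save")
    now = now or datetime.now()
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"steam_data_{now:%Y-%m-%d-%H-%M}.csv.gz"
    # Text mode ("wt") lets the csv module write straight into the gzip stream.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Because the name has minute resolution, two runs within the same minute would overwrite each other. With pandas, `DataFrame.to_csv` can write the same format directly, since it infers gzip compression from a `.gz` suffix.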


## 5. Initial Data Review (Optional)

**Purpose:** Perform a quick check on the collected data.

**Why:** Helps spot any immediate issues like missing values in key columns or unexpected data ranges.

**Expected Output:**
- Basic statistics for key numerical columns.

In [5]:
# Display basic statistics for numerical columns if data was collected
if 'current_data_df' in locals() and not current_data_df.empty:
    print("\nBasic Statistics for Key Metrics:")
    stats_cols = [
        'player_count', 'twitch_viewer_count', 'google_trends_avg', 
        'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 
        'twitter_recent_count', 'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes'
    ]
    stats_cols_present = [col for col in stats_cols if col in current_data_df.columns]
    display(current_data_df[stats_cols_present].describe())
else:
    print("\nSkipping statistics: No data available.")


Basic Statistics for Key Metrics:


|  | player_count | twitch_viewer_count | reddit_subscribers | reddit_active_users | reddit_recent_posts | youtube_total_views | youtube_avg_views | youtube_avg_likes |
|---|---|---|---|---|---|---|---|---|
| count | 20.0 | 20.0 | 17.0 | 17.0 | 17.0 | 20.0 | 20.0 | 20.0 |
| mean | 76967.35 | 11483.2 | 1632671.0 | 259.647059 | 202.470588 | 16084930.0 | 3216986.0 | 53915.55 |
| std | 140316.458527 | 27977.02605 | 1275120.0 | 235.021792 | 79.571601 | 31447930.0 | 6289586.0 | 92843.774347 |
| min | 1.0 | 39.0 | 32179.0 | 6.0 | 40.0 | 210602.0 | 42120.4 | 426.8 |
| 25% | 15131.5 | 411.0 | 414910.0 | 65.0 | 161.0 | 955813.8 | 191162.8 | 4252.45 |
| 50% | 23271.5 | 1908.0 | 1709804.0 | 205.0 | 250.0 | 3024818.0 | 604963.6 | 13382.4 |
| 75% | 78880.75 | 6628.25 | 2653057.0 | 351.0 | 250.0 | 13519900.0 | 2703979.0 | 42960.0 |
| max | 617562.0 | 124267.0 | 4323617.0 | 755.0 | 250.0 | 125922800.0 | 25184560.0 | 309560.0 |
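
The statistics above omit `google_trends_avg` and `twitter_recent_count`, and the Reddit columns have a count of 17 out of 20, so a per-column coverage figure is a useful companion check. The sketch below computes it on plain dictionaries so that it runs without the notebook's dependencies. `is_missing` and `column_coverage` are hypothetical helpers.

```python
import math
from typing import Dict, Iterable, List, Mapping


def is_missing(value: object) -> bool:
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def column_coverage(
    rows: List[Mapping[str, object]], columns: Iterable[str]
) -> Dict[str, float]:
    """Return the fraction of rows with a usable value for each column."""
    columns = list(columns)
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(not is_missing(row.get(col)) for row in rows) / len(rows)
        for col in columns
    }
```

On the DataFrame itself, `current_data_df[stats_cols_present].notna().mean()` gives the same fractions. A column at or near 0.0 points to a failing source rather than to games that genuinely lack data.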


## 6. Next Steps

**Purpose:** Outline the path forward.

**Why:** Guides the project towards the goal of building the prediction model.

**Next Actions:**
1.  **Repeat Collection:** Run this notebook periodically (e.g., daily, weekly) to build up historical data.
2.  **Aggregation:** Once sufficient historical data exists, use the `DataAggregator` (likely in `02_feature_engineering.ipynb`) to process the saved files into a feature set.
3.  **Feature Engineering & Modeling:** Analyze the aggregated data and build predictive models (`02_feature_engineering.ipynb`, `03_modeling.ipynb`).
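
Once several runs have been saved, the snapshots need to be read back in chronological order. The project's `DataAggregator` is the intended tool for this and its interface is not shown here, so the following is only a standard-library sketch. `load_snapshots` is a hypothetical helper, and it assumes files named like `steam_data_2025-05-06-22-04.csv.gz`.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, List


def load_snapshots(data_dir: str, pattern: str = "steam_data_*.csv.gz") -> List[Dict[str, str]]:
    """Read every matching snapshot and return all rows, oldest file first."""
    rows: List[Dict[str, str]] = []
    # The zero-padded timestamp in the file name makes lexical order chronological.
    for path in sorted(Path(data_dir).glob(pattern)):
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # keep provenance for debugging
                rows.append(row)
    return rows
```

All values come back as strings, so numeric columns such as `player_count` would still need converting before any analysis.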

In [6]:
# Final summary message
print("\nData Collection Notebook Complete.")
if 'saved_filepath' in locals():
    print(f"Latest data saved to: {saved_filepath}")
elif 'current_data_df' in locals() and current_data_df.empty:
    print("Data collection run finished, but resulted in empty data or an error occurred.")
else:
    print("Data collection run finished, but data was not saved (likely due to an error). Check previous cell outputs.")

print("\nRemember to run this notebook periodically to build your historical dataset.")


Data Collection Notebook Complete.
Latest data saved to: ..\data\steam_data_2025-05-06-22-04.csv.gz

Remember to run this notebook periodically to build your historical dataset.


---
*End of Notebook*