# 01. Data Collection

**Purpose:** This notebook demonstrates how to collect real-time game data using our custom `DataCollector` class. It fetches data from Steam (player counts, details), Twitch (viewership), and external sources like Google Trends, Reddit, Twitter, and YouTube.

**Why This Matters:** This automated collection process provides fresh data from multiple sources, which we need to build our time series dataset and, ultimately, the game popularity prediction model.

**What to Expect:** After running this notebook, you will:
1. Successfully collect current data from multiple APIs.
2. Save the combined data in a structured format (compressed CSV).
3. Understand the enhanced data collection workflow.
4. Have the first data point for building a historical dataset.

## 1. Setup and Configuration

**Purpose:** Import necessary libraries and configure the environment.

**Why:** Ensures the `DataCollector` and its dependencies can be found and used correctly.

In [None]:
# Imports and Setup
import sys
import os
import pandas as pd
from datetime import datetime

# Add the project root (the parent of 'notebooks') to sys.path so the `src` package can be imported
# Assumes the notebook is run from the 'notebooks' directory
module_path = os.path.abspath('..')
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom modules
from src.data_collector import DataCollector
# from src.utils import configure_plotting # Optional: if plotting is needed here

# Configure plotting (optional)
# configure_plotting()

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

## 2. Initialize Data Collector

**Purpose:** Create an instance of the `DataCollector`.

**Why:** The collector manages API connections, game lists, and data storage. It automatically loads API keys from the `.env` file in the project root.

**Expected Output:** Confirmation of initialization and the number of games being tracked.
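
The collector's own key-loading code is not shown in this notebook. As a rough illustration of what reading a `.env` file involves, here is a minimal standard-library sketch. The helper `parse_env_text` and the key names in the sample are hypothetical; a real project would more likely rely on a package such as `python-dotenv`.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict (hypothetical helper, not part of DataCollector)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Placeholder key names, for illustration only
sample = """
# API credentials
EXAMPLE_API_KEY=abc123
EXAMPLE_USER_AGENT="my-app/0.1"
"""
config = parse_env_text(sample)

# Report required keys that are absent or empty before any API call is made
required = ("EXAMPLE_API_KEY", "EXAMPLE_SECRET")
missing = [key for key in required if not config.get(key)]
print("Missing keys:", missing)
```

Checking for missing keys up front tends to give a clearer message than a `401` returned partway through a collection run.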

In [None]:
# Initialize the collector
# It will use the default game lists defined within the class
# and look for API keys in the .env file in the project root
# Ensure the data_dir path is correct relative to the notebook location
collector = DataCollector(data_dir="../data")

# Optionally, view the game IDs being tracked
all_game_ids = collector.get_all_game_ids()
print(f"Tracking {len(all_game_ids)} unique game IDs across categories.")
# print(collector.game_categories) # Uncomment to see the full category list

## 3. Collect Current Data

**Purpose:** Execute the main data collection process.

**Why:** This fetches the latest data from all configured sources (Steam, Twitch, Google Trends, Reddit, Twitter, YouTube).

**IMPORTANT:** Before running this step:
1.  **Check API Keys/Credentials:** Ensure your API keys/secrets/tokens in the `.env` file (located in the project root `c:\Users\lucav\Github\Game-Popularity-Prediction-Modelv2`) are correct, especially for **Reddit** (Client ID, Secret, User Agent) and **YouTube** (API Key). Incorrect credentials often lead to `401` or `403` errors.
2.  **Check API Quotas/Rate Limits:** APIs like YouTube, Google Trends, and Twitter have usage limits (quotas) and rate limits. If you run this frequently or with many games, you might hit these limits, resulting in `403` (Quota Exceeded) or `429` (Too Many Requests) errors. Check your API provider dashboards (e.g., Google Cloud Console for YouTube) if you encounter persistent errors.
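
One common way to cope with `429` responses is to retry with exponential backoff. The sketch below is an assumption about how that could look, not a description of what `DataCollector` does internally. `RateLimitError`, `call_with_backoff`, and `flaky_request` are hypothetical names, and the demo uses a fake request so that it runs without network access.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response (hypothetical; not a DataCollector class)."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff plus jitter on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo: a fake request that is rate-limited twice, then succeeds
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return {"status": 200}


# Record the waits instead of sleeping so the demo finishes instantly
waits = []
result = call_with_backoff(flaky_request, base_delay=0.01, sleep=waits.append)
print(result, "after", calls["count"], "calls")
```

Retrying helps with short-lived `429` responses. An exhausted daily quota will not recover after a few seconds of waiting, so persistent `403` quota errors still need to be checked in the provider dashboard.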

**Expected Output:** 
- Status messages indicating which data sources are being queried.
- Potential warnings or errors if API keys are invalid, quotas are exceeded, or rate limits are hit.
- A DataFrame containing the combined data (if collection is at least partially successful).
- A summary of the collected data shape and columns.

In [None]:
# Collect data for all tracked games
# include_details=True is needed to get game names for Twitch/External lookups
# include_twitch=True fetches Twitch viewership
# include_external=True fetches Google Trends, Reddit, Twitter, YouTube data
print("Starting data collection...")
print("This may take a few minutes depending on the number of games and API responsiveness.")
try:
    current_data_df = collector.collect_current_data(
        include_details=True,
        include_twitch=True,
        include_external=True
    )
    print("\n--- Collected Data Sample ---")
    # Display relevant columns, especially the newly added ones
    display_cols = [
        'app_id', 'name', 'category', 'timestamp', 'player_count', 'twitch_viewer_count',
        'google_trends_avg', 'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 'twitter_recent_count',
        'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes', 'release_date'
    ]
    display_cols_present = [col for col in display_cols if col in current_data_df.columns]
    display(current_data_df[display_cols_present].head())
    print(f"\nShape: {current_data_df.shape}")
    print("\nColumns:", current_data_df.columns.tolist())
except Exception as e:
    print(f"\nAn error occurred during data collection: {e}")
    # Optionally re-raise while debugging: a bare `raise` keeps the original traceback
    current_data_df = pd.DataFrame() # Ensure df exists but is empty on error

## 4. Save Data to File

**Purpose:** Save the collected DataFrame for future use.

**Why:** Persistent storage allows us to build a historical dataset over time, which is essential for the `DataAggregator` and model training.

**Expected Output:** Confirmation of the file save location.
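
The internals of `save_data` are not shown in this notebook. As a hedged sketch of what a compressed CSV round trip looks like with pandas alone, the example below writes a toy DataFrame to a gzip-compressed file and reads it back. The file naming scheme and the values are made up for illustration.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd

# Toy snapshot with made-up values, using two column names from this notebook
snapshot = pd.DataFrame({
    "app_id": [1, 2],
    "player_count": [100, 250],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical naming scheme: one timestamped file per collection run
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(tmp_dir, f"snapshot_{stamp}.csv.gz")

    # pandas can infer gzip from the ".gz" suffix; it is stated explicitly here
    snapshot.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path, compression="gzip")

# True when the data survived the round trip unchanged
print(restored.equals(snapshot))
```

Writing one timestamped file per run keeps each collection immutable, which makes it straightforward to rebuild the historical dataset later.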

In [None]:
# Save the collected data to a compressed CSV file
try:
    # Check if the DataFrame exists and is not empty
    if 'current_data_df' in locals() and not current_data_df.empty:
        saved_filepath = collector.save_data(data=current_data_df, compress=True)
        print(f"\nData successfully saved to: {saved_filepath}")
        
        # Optional: Verify loading back
        # loaded_data = collector.load_data(saved_filepath)
        # print(f"Verification - Loaded {len(loaded_data)} rows from saved file")
    else:
        print("\nSkipping save: No data collected or collection failed.")
except Exception as e:
    print(f"\nAn error occurred while saving data: {e}")

## 5. Initial Data Review (Optional)

**Purpose:** Perform a quick check on the collected data.

**Why:** Helps spot any immediate issues like missing values in key columns or unexpected data ranges.

**Expected Output:**
- Basic statistics for key numerical columns.
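
Since `describe()` silently skips missing values, a separate missing-value check can be useful. The following is a small sketch on toy data with made-up values; only the column names come from this notebook.

```python
import pandas as pd

# Toy data with made-up values; None marks a source that returned nothing
df = pd.DataFrame({
    "app_id": [1, 2, 3],
    "player_count": [100, 250, None],
    "twitch_viewer_count": [None, None, 40],
})

key_cols = ["player_count", "twitch_viewer_count", "google_trends_avg"]
present = [col for col in key_cols if col in df.columns]

# Fraction of missing values per column, largest first
missing_share = df[present].isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns that were expected but are absent altogether,
# for example because one source failed entirely
absent = [col for col in key_cols if col not in df.columns]
print("Absent columns:", absent)
```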

In [None]:
# Display basic statistics for numerical columns if data was collected
if 'current_data_df' in locals() and not current_data_df.empty:
    print("\nBasic Statistics for Key Metrics:")
    stats_cols = [
        'player_count', 'twitch_viewer_count', 'google_trends_avg', 
        'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 
        'twitter_recent_count', 'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes'
    ]
    stats_cols_present = [col for col in stats_cols if col in current_data_df.columns]
    display(current_data_df[stats_cols_present].describe())
else:
    print("\nSkipping statistics: No data available.")

## 6. Next Steps

**Purpose:** Outline the path forward.

**Why:** Guides the project towards the goal of building the prediction model.

**Next Actions:**
1.  **Repeat Collection:** Run this notebook periodically (e.g., daily, weekly) to build up historical data.
2.  **Aggregation:** Once sufficient historical data exists, use the `DataAggregator` (likely in `02_feature_engineering.ipynb`) to process the saved files into a feature set.
3.  **Feature Engineering & Modeling:** Analyze the aggregated data and build predictive models (`02_feature_engineering.ipynb`, `03_modeling.ipynb`).
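
To make the aggregation step more concrete, here is a hedged sketch of how several saved snapshots could be stacked into one time series per game using plain pandas. It is not the `DataAggregator` implementation, which is not shown here. The two toy DataFrames stand in for two saved collection runs and contain made-up values.

```python
import pandas as pd

# Two toy snapshots with made-up values, standing in for two saved collection runs
run_1 = pd.DataFrame({
    "app_id": [1, 2],
    "timestamp": ["2024-01-01 12:00:00", "2024-01-01 12:00:00"],
    "player_count": [100, 250],
})
run_2 = pd.DataFrame({
    "app_id": [1, 2],
    "timestamp": ["2024-01-02 12:00:00", "2024-01-02 12:00:00"],
    "player_count": [120, 240],
})

history = pd.concat([run_1, run_2], ignore_index=True)
history["timestamp"] = pd.to_datetime(history["timestamp"])

# Keep one row per (game, timestamp), ordered so each game forms a time series
history = (
    history.drop_duplicates(subset=["app_id", "timestamp"])
    .sort_values(["app_id", "timestamp"])
    .reset_index(drop=True)
)

# Example feature: change in player count since the previous snapshot of the same game
history["player_count_delta"] = history.groupby("app_id")["player_count"].diff()
print(history)
```

The first snapshot of each game has no previous value, so its delta is `NaN`. This is one reason several collection runs are needed before feature engineering becomes meaningful.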

In [None]:
# Final summary message
print("\nData Collection Notebook Complete.")
if 'saved_filepath' in locals():
    print(f"Latest data saved to: {saved_filepath}")
elif 'current_data_df' in locals() and current_data_df.empty:
    print("Data collection run finished, but no data was collected or an error occurred.")
else:
    print("Data collection run finished, but data was not saved (likely due to an error). Check previous cell outputs.")

print("\nRemember to run this notebook periodically to build your historical dataset.")

---
*End of Notebook*