# 01. Data Collection: Acquiring Real-Time Game Metrics

**Project Context:** This notebook forms a critical component of a final project aimed at developing a game popularity prediction model. The model's accuracy relies heavily on a comprehensive and timely dataset.

**Purpose of this Notebook:** This document outlines and executes the initial data acquisition phase. It utilizes a custom-developed `DataCollector` class to programmatically fetch real-time data pertaining to video games from diverse online sources. These sources include:
*   **Steam:** Player counts and detailed game information.
*   **Twitch:** Live viewership statistics.
*   **External Platforms:** Social sentiment and engagement metrics from Google Trends, Reddit, Twitter, and YouTube.

**Significance:** The automated collection process detailed herein is fundamental for constructing a robust time-series dataset. This dataset will capture dynamic changes in game popularity indicators, providing the empirical basis for subsequent feature engineering and predictive modeling.

**Expected Outcomes:** Upon successful execution of this notebook, the following will be achieved:
1.  **Comprehensive Data Retrieval:** Current data points will be systematically collected from all specified APIs and web sources.
2.  **Structured Data Storage:** The aggregated data will be saved in a compressed CSV format, ensuring efficient storage and accessibility for later stages of the project.
3.  **Workflow Demonstration:** The notebook will illustrate the functionality and efficacy of the `DataCollector` class within the broader data pipeline.
4.  **Foundation for Historical Analysis:** This initial data collection run will produce the first entry in what will become a longitudinal dataset, essential for time-series analysis and model training.

## 1. Environment Setup and Library Configuration

**Purpose:** This section prepares the Python environment for executing the data collection script. It involves importing essential libraries and configuring system paths to ensure that custom modules, particularly the `DataCollector` class, are accessible.

**Key Actions:**
*   **Import Core Libraries:** The standard-library modules `sys` (system-specific parameters and functions), `os` (operating system interfaces), and `datetime` (timestamping) are imported, along with the third-party `pandas` library (data manipulation and analysis).
*   **Path Configuration:** The system path (`sys.path`) is updated to include the project root, i.e. the parent of the `notebooks` directory, so that the `src` package can be imported. This assumes the notebook is executed from the `notebooks` directory.
*   **Custom Module Import:** The `DataCollector` class, which encapsulates the logic for API interactions and data aggregation, is imported from the `src.data_collector` module.
*   **Pandas Display Options:** `pandas` display settings are configured to enhance the readability of DataFrames, ensuring that all columns are visible and the output width is sufficient for comprehensive data inspection.
*   **Execution Timestamp:** The start time of the notebook execution is recorded and printed, providing a reference point for monitoring the duration of the data collection process.

**Significance:** Correct setup ensures that the required dependencies are loaded and that `DataCollector` can be imported and instantiated, which avoids import errors later in the notebook.
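The path step depends on the working directory being `notebooks`. As a hedged alternative, the sketch below searches upward for the first directory that contains a `src` folder. `find_project_root` and `add_to_sys_path` are hypothetical helpers written for illustration and are not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(root: Path) -> None:
    """Append the project root to sys.path once, so `import src...` resolves."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.append(root_str)
```

In a notebook this could be called as `add_to_sys_path(find_project_root(Path.cwd()))`, which would behave the same whether the kernel starts in `notebooks` or in the project root.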

In [None]:
# Imports and Setup
import sys
import os
import pandas as pd
from datetime import datetime

# Add the project root (the parent of 'notebooks') to sys.path so that
# the 'src' package can be imported.
# Assumes the notebook is run from the 'notebooks' directory
module_path = os.path.abspath('..')
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom modules
from src.data_collector import DataCollector
from src.utils import configure_plotting # Optional: if plotting is needed here

# Configure plotting (optional)
configure_plotting()

# Display pandas DataFrames nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display current time for reference
print(f"Notebook Execution Started: {datetime.now()}")

## 2. Data Collector Instantiation

**Purpose:** This step involves creating an instance of the `DataCollector` class. This object will serve as the primary interface for all subsequent data collection operations.

**Functionality of `DataCollector`:**
*   **API Key Management:** The `DataCollector` is designed to automatically load necessary API keys and credentials from a `.env` file located in the project's root directory. This promotes security and ease of configuration.
*   **Game List Management:** It initializes with predefined lists of game titles and their corresponding App IDs, categorized for targeted data retrieval (e.g., 'successful', 'declining', 'experimental').
*   **Data Storage Configuration:** The collector is configured with a `data_dir` parameter, specifying the directory (relative to the notebook, e.g., `../data`) where collected data will be saved.
*   **Centralized Operations:** It encapsulates methods for fetching data from various sources (Steam, Twitch, external platforms) and for saving the aggregated data.
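The notebook does not show how `DataCollector` reads the `.env` file; a library such as python-dotenv is a common choice. The dependency-free sketch below only illustrates the idea of loading `KEY=VALUE` pairs and checking for missing credentials before any API call. The helper names and the key names in the usage note are assumptions, not the project's actual configuration.

```python
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

A call such as `missing_keys(parse_env_file(Path("../.env")), ["YOUTUBE_API_KEY", "REDDIT_CLIENT_ID"])` (hypothetical key names) would list the credentials still to be filled in, which is cheaper to diagnose than a `401` response partway through a collection run.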

**Expected Output Upon Execution:**
*   Successful instantiation of the `DataCollector` object without errors.
*   A printed confirmation indicating the total number of unique game IDs being tracked across all defined categories. This verifies that the game lists have been loaded correctly.
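The count of unique IDs matters because one game can appear in more than one category. The sketch below shows one way such a count could be derived, assuming `game_categories` maps each category name to a `{game name: App ID}` dictionary. That structure and the function name are assumptions; the real `get_all_game_ids` may work differently.

```python
def unique_game_ids(game_categories: dict[str, dict[str, int]]) -> list[int]:
    """Flatten {category: {name: app_id}} into a sorted list of distinct IDs."""
    ids: set[int] = set()
    for games in game_categories.values():
        ids.update(games.values())
    return sorted(ids)


# Illustrative input only; the category contents are not the project's lists.
example_categories = {
    "successful": {"Counter-Strike 2": 730, "Dota 2": 570},
    "experimental": {"Dota 2": 570},
}
tracked = unique_game_ids(example_categories)
print(f"Tracking {len(tracked)} unique game IDs across categories.")
```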

In [None]:
# Initialize the collector
# It will use the default game lists defined within the class
# and look for API keys in the .env file in the project root
# Ensure the data_dir path is correct relative to the notebook location
collector = DataCollector(data_dir="../data")

# Optionally, view the game IDs being tracked
all_game_ids = collector.get_all_game_ids()
print(f"Tracking {len(all_game_ids)} unique game IDs across categories.")
print(collector.game_categories) 

## 3. Execution of Real-Time Data Collection

**Purpose:** This section executes the core data collection logic by invoking the `collect_current_data` method of the instantiated `DataCollector` object.

**Process Overview:** The method systematically queries multiple APIs and web sources to gather a comprehensive set of metrics for each tracked game. The scope of data collection can be controlled via parameters:
*   `include_details=True`: Fetches detailed game information from Steam, which is often a prerequisite for subsequent lookups on platforms like Twitch and other external sources that rely on game names rather than App IDs.
*   `include_twitch=True`: Retrieves current viewership numbers for each game on Twitch.
*   `include_external=True`: Gathers data from Google Trends (search interest), Reddit (community engagement), Twitter (social buzz), and YouTube (video statistics).

**Guidelines for Successful Operation & API Usage Considerations:**
1.  **API Key/Credential Configuration:** For seamless data retrieval, ensure that all API keys, secrets, and tokens are correctly configured in the `.env` file (located in the project root: `c:\Users\lucav\Github\Game-Popularity-Prediction-Modelv2`). Accuracy is particularly important for **Reddit** (Client ID, Secret, User Agent) and **YouTube** (API Key) credentials to prevent access issues (`401` or `403` status codes).
2.  **Adherence to API Quotas and Rate Limits:** Be mindful of the usage limits imposed by APIs such as YouTube, Google Trends, and Twitter. These services often have quotas (e.g., total requests per day) and rate limits (requests per time interval). To ensure continuous operation and avoid service interruptions (which can result in `403` - Quota Exceeded or `429` - Too Many Requests statuses), monitor usage, especially when collecting data for many games or running the notebook frequently. Consulting the API provider dashboards (e.g., Google Cloud Console for YouTube API) can help manage these constraints.
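One common way to handle `429` responses is to retry with exponential backoff. The sketch below is illustrative and is not taken from `DataCollector`: `RateLimited` and `fetch_with_backoff` are hypothetical names, and `fetch` stands for any zero-argument callable that raises `RateLimited` when the API reports too many requests.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by a fetch function when the API answers with status 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, waiting 1x, 2x, 4x ... base_delay between rate-limited tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
    raise AssertionError("unreachable")
```

Backoff helps with short-term rate limits only. An exhausted daily quota will not recover within a few seconds, so that case is better reported and skipped than retried.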

**Expected Output Upon Execution:**
*   **Progress Indicators:** Status messages will be displayed, indicating which data sources are currently being queried (e.g., "Fetching Steam player counts...", "Fetching Twitch data...").
*   **Error/Warning Notifications:** Any issues encountered during the process, such as API errors (due to invalid keys, quota limits, or network problems) or missing data for specific games, will be reported.
*   **Resultant DataFrame:** If the collection is at least partially successful, a `pandas` DataFrame (`current_data_df`) will be generated. This DataFrame will contain the aggregated metrics, with columns corresponding to each data point (e.g., `app_id`, `name`, `player_count`, `twitch_viewer_count`, `google_trends_avg`).
*   **Data Summary:** A summary of the collected DataFrame, including its shape (number of rows and columns) and a list of all column names, will be printed to confirm the structure of the retrieved data.

In [None]:
# Collect data for all tracked games
# include_details=True is needed to get game names for Twitch/External lookups
# include_twitch=True fetches Twitch viewership
# include_external=True fetches Google Trends, Reddit, Twitter, YouTube data
print("Starting data collection...")
print("This may take a few minutes depending on the number of games and API responsiveness.")
try:
    current_data_df = collector.collect_current_data(
        include_details=True,
        include_twitch=True,
        include_external=True
    )
    print("\n--- Collected Data Sample ---")
    # Display relevant columns, especially the newly added ones
    display_cols = [
        'app_id', 'name', 'category', 'timestamp', 'player_count', 'twitch_viewer_count',
        'google_trends_avg', 'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 'twitter_recent_count',
        'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes', 'release_date'
    ]
    display_cols_present = [col for col in display_cols if col in current_data_df.columns]
    display(current_data_df[display_cols_present].head())
    print(f"\nShape: {current_data_df.shape}")
    print("\nColumns:", current_data_df.columns.tolist())
except Exception as e:
    print(f"\nAn error occurred during data collection: {e}")
    # Optionally re-raise while debugging with a bare `raise` (preserves the traceback)
    current_data_df = pd.DataFrame() # Ensure df exists but is empty on error

## 4. Data Persistence: Saving Collected Metrics

**Purpose:** This section is responsible for saving the `current_data_df` DataFrame, which contains the newly collected game metrics, to a persistent file format.

**Methodology:**
*   **Conditional Save:** The save is performed only if `current_data_df` exists and is not empty, so a failed or empty collection run does not produce an empty file.
*   **`save_data` Method:** The `DataCollector`'s `save_data` method is utilized. This method handles the specifics of file naming (typically incorporating a timestamp to ensure unique filenames for each collection run) and serialization.
*   **Compression:** The `compress=True` argument is passed to the `save_data` method, indicating that the output CSV file should be compressed (e.g., using gzip). This is beneficial for reducing storage space, especially as the historical dataset grows.
*   **File Path:** The data is saved within the `data_dir` (e.g., `../data`) specified during the `DataCollector` instantiation.
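The internals of `save_data` are not shown in this notebook. As a rough sketch of the same idea (a timestamped file name plus gzip compression), the example below uses only the standard library so it runs without the collected DataFrame. `save_snapshot` and the `game_data_<timestamp>.csv.gz` naming pattern are assumptions for illustration.

```python
import csv
import gzip
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_snapshot(rows: list[dict], data_dir, now: Optional[datetime] = None) -> Path:
    """Write rows (a list of dicts) to a gzip-compressed, timestamped CSV file."""
    if not rows:
        raise ValueError("refusing to save an empty snapshot")
    now = now or datetime.now()
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # A timestamp in the name keeps each collection run in its own file
    path = data_dir / f"game_data_{now:%Y%m%d_%H%M%S}.csv.gz"
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With a DataFrame, `df.to_csv(path, index=False)` achieves the same result, because pandas infers gzip compression from a `.gz` suffix by default.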

**Significance:** Persisting the collected data is crucial for building a longitudinal dataset. Each execution of this notebook contributes a new snapshot of game metrics, which will be aggregated and processed in later stages (e.g., by a `DataAggregator` in `02_feature_engineering.ipynb`) to create a comprehensive time-series dataset suitable for model training and analysis.

**Expected Output Upon Execution:**
*   If data is successfully saved, a confirmation message will be printed, indicating the full file path of the saved (and compressed) CSV file.
*   If no data was collected or an error occurred during collection, a message indicating that the save operation is being skipped will be displayed.
*   If the save operation itself fails (e.g., disk full, permissions error), the corresponding error message will be printed.

In [None]:
# Save the collected data to a compressed CSV file
try:
    # Check if the DataFrame exists and is not empty
    if 'current_data_df' in locals() and not current_data_df.empty:
        saved_filepath = collector.save_data(data=current_data_df, compress=True)
        print(f"\nData successfully saved to: {saved_filepath}")
        
    else:
        print("\nSkipping save: No data collected or collection failed.")
except Exception as e:
    print(f"\nAn error occurred while saving data: {e}")

## 5. Preliminary Data Inspection (Optional)

**Purpose:** This optional step involves conducting a cursory examination of the `current_data_df` DataFrame. The primary goal is to quickly identify any conspicuous issues or anomalies in the freshly collected data.

**Methodology:**
*   **Conditional Execution:** The review is performed only if data collection was successful and `current_data_df` is not empty.
*   **Descriptive Statistics:** The `describe()` method from `pandas` is applied to a predefined list of key numerical columns. This generates summary statistics, including count, mean, standard deviation, minimum, maximum, and quartile values for these metrics.
*   **Selected Metrics for Review:** The statistics are typically generated for columns such as `player_count`, `twitch_viewer_count`, `google_trends_avg`, `reddit_subscribers`, `reddit_active_users`, `twitter_recent_count`, and various YouTube engagement metrics.

**Significance:** A quick review of basic statistics can help in early detection of potential problems, such as:
*   **Missing Data:** Unusually low counts in certain columns might indicate failures in specific API calls or data parsing.
*   **Data Range Issues:** Unexpected minimum or maximum values could point to outliers or errors in data retrieval (e.g., negative player counts, excessively high viewership).
*   **Consistency Checks:** Comparing metrics across different games or against known benchmarks can provide a sanity check.

This initial check is not a substitute for thorough exploratory data analysis (EDA) but serves as a first-pass quality control measure.
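The range and missing-data checks described above can also be automated. The sketch below works on plain dictionaries so it runs without the collected data; on a DataFrame the same test can be written with `Series.between`. The function name and the upper bound for `player_count` are assumptions, while the 0 to 100 range reflects how Google Trends scales search interest.

```python
def find_suspect_values(rows: list[dict], bounds: dict[str, tuple]) -> list[tuple]:
    """Return (row_index, column, value) for values missing or outside bounds."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in bounds.items():
            value = row.get(column)
            if value is None or not (low <= value <= high):
                problems.append((index, column, value))
    return problems


# Hypothetical bounds; adjust them to what is plausible for the tracked games.
example_bounds = {
    "player_count": (0, 10_000_000),
    "google_trends_avg": (0, 100),
}
```

An empty result means every checked value is present and within its bounds. It says nothing about values that are plausible but wrong, which is why this remains a first-pass check.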

**Expected Output Upon Execution:**
*   If data is available, a `pandas` DataFrame displaying descriptive statistics for the selected key numerical columns.
*   If no data was collected or the DataFrame is empty, a message indicating that the statistics generation is being skipped.

In [None]:
# Display basic statistics for numerical columns if data was collected
if 'current_data_df' in locals() and not current_data_df.empty:
    print("\nBasic Statistics for Key Metrics:")
    stats_cols = [
        'player_count', 'twitch_viewer_count', 'google_trends_avg', 
        'reddit_subscribers', 'reddit_active_users', 'reddit_recent_posts', 
        'twitter_recent_count', 'youtube_total_views', 'youtube_avg_views', 'youtube_avg_likes'
    ]
    stats_cols_present = [col for col in stats_cols if col in current_data_df.columns]
    display(current_data_df[stats_cols_present].describe())
else:
    print("\nSkipping statistics: No data available.")

## 6. Concluding Remarks and Future Work

**Purpose:** This section outlines the subsequent phases of the project, building upon the data collection framework established in this notebook.

**Rationale:** The data collected herein is the foundational element for developing the game popularity prediction model. The following actions are critical for progressing towards this objective:

**Immediate and Long-Term Next Actions:**
1.  **Iterative Data Collection:** To construct a robust time-series dataset, this notebook (`01_data_collection.ipynb`) must be executed periodically (e.g., daily or weekly, depending on the desired granularity and API constraints). Each execution will append a new snapshot of game metrics to the historical data store.
2.  **Data Aggregation and Preprocessing:** Once a sufficient volume of historical data has been accumulated, the `DataAggregator` utility (anticipated to be a key component in `02_feature_engineering.ipynb`) will be employed. Its role will be to:
    *   Load and consolidate the individual, timestamped data files saved by this notebook.
    *   Perform necessary cleaning, transformation, and alignment of the time-series data.
    *   Potentially interpolate missing values or handle inconsistencies.
3.  **Feature Engineering:** Following aggregation, `02_feature_engineering.ipynb` will focus on deriving meaningful features from the processed time-series data. This may include creating lagged variables, rolling averages, trend indicators, and other domain-specific features relevant to game popularity.
4.  **Predictive Modeling:** The engineered features will then serve as input for the modeling phase, primarily detailed in `03_modeling.ipynb`. This will involve selecting appropriate machine learning algorithms (e.g., time-series models, regression models), training them on the historical feature set, and evaluating their performance in predicting future game popularity metrics.
5.  **Model Evaluation and Refinement:** The performance of the developed models will be rigorously assessed, and iterative refinements to features, model selection, and hyperparameter tuning will be conducted to optimize predictive accuracy.
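The `DataAggregator` mentioned above does not exist in this notebook, so the following is only a sketch of what consolidation and one simple derived feature might look like. It assumes snapshots are gzip-compressed CSV files with `app_id` and `timestamp` columns, and that timestamps are ISO 8601 strings (which sort chronologically as text). `load_snapshots` and `rolling_mean` are hypothetical names.

```python
import csv
import gzip
from pathlib import Path


def load_snapshots(data_dir, pattern: str = "*.csv.gz") -> list[dict]:
    """Read every compressed snapshot in data_dir into one list of row dicts."""
    rows: list[dict] = []
    for path in sorted(Path(data_dir).glob(pattern)):
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    # Group each game's rows together, oldest first
    rows.sort(key=lambda row: (row["app_id"], row["timestamp"]))
    return rows


def rolling_mean(values: list[float], window: int) -> list[float]:
    """Trailing mean over up to `window` values (partial windows at the start)."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result
```

Unlike the pandas default for `rolling(window).mean()`, this version averages over partial windows at the start of a series instead of returning missing values, which is a design choice worth revisiting once enough history exists.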

In [None]:
# Final summary message
print("\nData Collection Notebook Complete.")
if 'saved_filepath' in locals():
    print(f"Latest data saved to: {saved_filepath}")
elif 'current_data_df' in locals() and current_data_df.empty:
    print("Data collection run finished, but resulted in empty data or an error occurred.")
else:
    print("Data collection run finished, but data was not saved (likely due to an error). Check previous cell outputs.")

print("\nRemember to run this notebook periodically to build your historical dataset.")

---
*End of Notebook*