# 01. Data Collection from Steam API

**Purpose:** This notebook demonstrates how to collect real-time game data from the Steam API using our custom data collection system. This is the foundation for building our time series dataset of game popularity metrics.

**Why This Matters:** Manually downloaded data goes stale quickly and does not scale to many games. Direct API integration lets us:
- Get real-time player count data
- Automate periodic data collection
- Build reliable time series datasets
- Expand to multiple data sources later
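
To make the first point concrete, here is a minimal sketch of a single player-count request against the public Steam Web API. It is independent of our `SteamAPIConnector`, whose internals are not shown in this notebook; the helper names are hypothetical, and the response shape (`response.player_count`, `response.result`) is an assumption that should be checked against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Steam Web API endpoint that reports the current player count for one app.
PLAYER_COUNT_URL = (
    "https://api.steampowered.com/ISteamUserStats/GetNumberOfCurrentPlayers/v1/"
)


def build_player_count_url(app_id):
    """Build the request URL for one Steam app ID (hypothetical helper)."""
    return f"{PLAYER_COUNT_URL}?{urlencode({'appid': app_id})}"


def parse_player_count(payload):
    """Extract the player count from a decoded JSON payload.

    Assumes a payload like {"response": {"player_count": N, "result": 1}}.
    Returns None when the payload does not report success.
    """
    response = payload.get("response", {})
    if response.get("result") != 1:
        return None
    return response.get("player_count")


def fetch_player_count(app_id, timeout=10.0):
    """Perform the HTTP request and return the parsed count (needs network)."""
    with urlopen(build_player_count_url(app_id), timeout=timeout) as resp:
        return parse_player_count(json.load(resp))
```

Keeping URL building and parsing separate from the network call makes both easy to test offline.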

**What to Expect:** After running this notebook, you will have:
1. Collected current player data from the Steam API
2. Saved the data in a structured format for time series analysis
3. Walked through the data collection workflow
4. Laid the foundation for continuous data collection

## 1. Setup and Configuration

**Purpose:** Import necessary libraries and configure the data collection environment.

**Why:** Proper setup ensures our data collection process runs smoothly and handles potential errors gracefully.

In [None]:
# Add src directory to path for importing our modules
import sys
sys.path.append('../src')

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import json
import os

# Import our custom modules
from steam_api_connector import SteamAPIConnector
from data_collector import DataCollector
from utils import configure_plotting, format_large_numbers

# Configure plotting
configure_plotting()

# Display current time for reference
print(f"Data Collection Started: {datetime.now()}")

## 2. Initialize Data Collector

**Purpose:** Create and configure our data collector instance.

**Why:** The data collector manages our game database, API connections, and data storage. Understanding its configuration helps us track what we're collecting and why.

**Expected Output:** Confirmation of initialization and display of tracked game categories.
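
The cell below relies on `collector.game_categories` and `collector.get_all_game_ids()`. Their implementation lives in `src/` and is not shown here, so the following is only a sketch of one plausible shape: a dictionary mapping category names to lists of app IDs, with the ID list deduplicated. The category names and ID assignments are illustrative assumptions.

```python
# Hypothetical category map: category name -> list of Steam app IDs.
game_categories = {
    "shooter": [730, 440],
    "moba": [570],
    "free": [730, 570],  # a game may appear in more than one category
}


def get_all_game_ids(categories):
    """Return unique app IDs across all categories, in first-seen order."""
    return list(
        dict.fromkeys(app_id for ids in categories.values() for app_id in ids)
    )


all_ids = get_all_game_ids(game_categories)
per_category_total = sum(len(ids) for ids in game_categories.values())
```

If categories overlap as in this sketch, the per-category counts sum to more than the total number of games (5 versus 3 here), which is worth keeping in mind when reading the summary printed by the cell.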

In [None]:
# Initialize data collector
collector = DataCollector(data_dir="../data")

# Display game categories and counts
print("Game Categories in Tracking System:")
print("-" * 40)

for category, game_ids in collector.game_categories.items():
    print(f"{category.capitalize():<12} : {len(game_ids)} games")
    
total_games = len(collector.get_all_game_ids())
print("-" * 40)
print(f"Total Games : {total_games}")

# Display some example game IDs from each category
print("\nExample Games by Category:")
for category in collector.game_categories:
    example_ids = collector.game_categories[category][:3]
    print(f"\n{category.capitalize()}:")
    for app_id in example_ids:
        # Look up the game name (one API request per game); fall back to the ID on failure
        try:
            details = collector.steam_api.get_app_details(app_id)
            name = details.get('data', {}).get('name', f'App ID {app_id}')
        except Exception:
            name = f'App ID {app_id}'
        print(f"  {app_id}: {name}")

## 3. Collect Current Data

**Purpose:** Execute our first data collection from the Steam API.

**Why:** This demonstrates the real-time data collection capability and shows what information we can gather for each game.

**Expected Output:** 
- A DataFrame containing current player counts
- Basic game information (names, categories, release dates)
- Timestamp information for tracking data collection time
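
As a rough illustration of why collection takes a while, the sketch below spaces out requests and stamps every row with the same collection time. It is not the actual `collect_current_data` implementation; the function name, the `fetch` callback, the `collected_at` column name, and the 1.5 second interval are all assumptions.

```python
import time
from datetime import datetime, timezone


def collect_snapshot(app_ids, fetch, min_interval=1.5, sleep=time.sleep):
    """Call fetch(app_id) for each ID, pausing between requests.

    Every row shares one UTC timestamp so the snapshot can later be
    treated as a single observation in the time series.
    """
    collected_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for i, app_id in enumerate(app_ids):
        if i > 0:
            sleep(min_interval)  # simple fixed delay to respect rate limits
        try:
            count = fetch(app_id)
        except Exception:
            count = None  # keep the row so failed lookups stay visible
        rows.append(
            {"app_id": app_id, "player_count": count, "collected_at": collected_at}
        )
    return rows
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`. With a fixed delay, the total run time grows linearly with the number of tracked games.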

In [None]:
# Collect current data for all tracked games
print("Starting data collection...")
print("This may take a few minutes due to API rate limiting.")

current_data = collector.collect_current_data(include_details=True)

# Display collection summary
print(f"\nData collection completed at: {datetime.now()}")
print(f"Total games collected: {len(current_data)}")
print(f"Total players across all games: {current_data['player_count'].sum():,}")

# Show sample of collected data
print("\nSample of collected data:")
display(current_data.head())

# Show column information
print("\nDataFrame Info:")
current_data.info()

## 4. Save Data to File

**Purpose:** Save our collected data for future analysis and build our historical dataset.

**Why:** Persistent storage is crucial for time series analysis. Each collection adds to our growing dataset of game popularity metrics.

**Expected Output:** Confirmation of file save location and file size.
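
For readers who want to see what a compressed, timestamped save might involve, here is a standard-library sketch. It is not the `save_data` / `load_data` implementation; the function names and the `steam_snapshot_<timestamp>.csv.gz` naming scheme are assumptions made for illustration.

```python
import csv
import gzip
from datetime import datetime
from pathlib import Path


def save_snapshot(rows, data_dir, now=None):
    """Write a list of dict rows to a gzip-compressed, timestamped CSV."""
    if not rows:
        raise ValueError("nothing to save")
    now = now or datetime.now()
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"steam_snapshot_{now:%Y%m%d_%H%M%S}.csv.gz"
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_snapshot(path):
    """Read the rows back; note that CSV values come back as strings."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

One file per collection run, named by timestamp, keeps snapshots from overwriting each other and makes it straightforward to concatenate them into a time series later.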

In [None]:
# Save data to CSV
save_path = collector.save_data(current_data, compress=True)

# Display file information (os was imported in the setup cell)
file_size_mb = os.path.getsize(save_path) / (1024 * 1024)
print("\nFile saved successfully:")
print(f"Location: {save_path}")
print(f"Size: {file_size_mb:.2f} MB")

# Verify we can load the data back
loaded_data = collector.load_data(save_path)
print(f"\nVerification - Loaded {len(loaded_data)} rows from saved file")

## 5. Initial Data Analysis

**Purpose:** Perform basic analysis to understand our data and ensure quality.

**Why:** Initial analysis helps validate data quality and provides insights into current game popularity patterns.

**Expected Output:**
- Summary statistics by game category
- Top 10 games by player count

In [None]:
# Basic statistics by category
print("Player Count Statistics by Category:")
print("-" * 50)

category_stats = current_data.groupby('category')['player_count'].agg(['count', 'mean', 'sum'])
category_stats.columns = ['Games', 'Avg Players', 'Total Players']
category_stats['Avg Players'] = category_stats['Avg Players'].round(0).astype(int)
display(category_stats)

# Top 10 games by player count
print("\nTop 10 Games by Current Player Count:")
print("-" * 50)

# .copy() so the formatting below modifies an independent frame, not a slice
top_10 = current_data.nlargest(10, 'player_count')[['name', 'player_count', 'category']].copy()
top_10['player_count'] = top_10['player_count'].apply(lambda x: f"{x:,}")
display(top_10)

## 6. Data Visualizations

**Purpose:** Create visual representations of our collected data.

**Why:** Visualizations help identify patterns, outliers, and distribution characteristics that inform our analysis.

**Expected Output:**
- Player count distribution by category
- Top games bar chart
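
The plots label their axes with `format_large_numbers` from our `utils` module, whose source is not shown in this notebook. The stand-in below sketches what such a helper might do; the K/M/B suffixes, the one-decimal rounding, and the optional `pos` argument are assumptions.

```python
def format_large_numbers(value, pos=None):
    """Hypothetical stand-in: turn 1500 into '1.5K', 2500000 into '2.5M'.

    The unused `pos` argument matches the (value, position) call signature
    that matplotlib tick formatters use.
    """
    value = float(value)
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if abs(value) >= threshold:
            return f"{value / threshold:.1f}{suffix}"
    return f"{value:.0f}"
```

Because it accepts `(value, pos)`, a function of this shape can be wrapped in `matplotlib.ticker.FuncFormatter` and attached to an axis, so labels stay correct even if the tick positions change.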

In [None]:
# Plot 1: Player count distribution by category (box plot)
plt.figure(figsize=(12, 8))
sns.boxplot(x='category', y='player_count', data=current_data)
plt.title('Player Count Distribution by Game Category')
plt.xlabel('Game Category')
plt.ylabel('Player Count')
plt.xticks(rotation=0)

# Format y-axis with human-readable numbers.
# A FuncFormatter avoids the fixed-label warning that set_yticklabels() can raise.
from matplotlib.ticker import FuncFormatter
plt.gca().yaxis.set_major_formatter(
    FuncFormatter(lambda value, _pos: format_large_numbers(value))
)

plt.tight_layout()
plt.show()

# Plot 2: Top 10 games bar chart
plt.figure(figsize=(14, 8))
top_10_for_plot = current_data.nlargest(10, 'player_count')
sns.barplot(x='player_count', y='name', data=top_10_for_plot)
plt.title('Top 10 Games by Current Player Count')
plt.xlabel('Current Player Count')
plt.ylabel('Game')

# Format x-axis with human-readable numbers
plt.gca().xaxis.set_major_formatter(
    FuncFormatter(lambda value, _pos: format_large_numbers(value))
)

plt.tight_layout()
plt.show()

## 7. Additional Metadata Analysis

**Purpose:** Explore additional game metadata to understand our dataset better.

**Why:** Metadata like release dates, genres, and pricing helps in feature engineering for prediction models.

**Expected Output:**
- Genre distribution
- Free vs paid games analysis

In [None]:
# Genre analysis
print("Genre Analysis:")
print("-" * 50)

# Extract individual genres
all_genres = []
for genres_str in current_data['genres'].fillna(''):
    all_genres.extend([g.strip() for g in genres_str.split(',')])

# Count genre occurrences
from collections import Counter
genre_counts = Counter([g for g in all_genres if g])
top_genres = genre_counts.most_common(10)

# Display top genres
for genre, count in top_genres:
    print(f"{genre:<20} : {count} games")

# Free vs Paid analysis
print("\nFree vs Paid Games:")
print("-" * 50)

free_paid_stats = current_data.groupby('is_free').agg({
    'app_id': 'count',
    'player_count': ['mean', 'sum']
})

free_paid_stats.columns = ['Count', 'Avg Players', 'Total Players']
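# Map labels by value so this also works when only one group is present
free_paid_stats.index = free_paid_stats.index.map({False: 'Paid', True: 'Free'})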
display(free_paid_stats)

## 8. API Health Check

**Purpose:** Monitor API usage and connection health.

**Why:** Understanding API performance helps us optimize data collection and diagnose potential issues.

**Expected Output:**
- API request statistics
- Error rates
- Performance metrics
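
The statistics dictionary used below comes from `get_api_statistics()`, which is part of our connector and not shown here. The sketch that follows is one way such counters could be kept; the `ApiStats` class and `health_status` function are hypothetical, while the dictionary keys and the 5% threshold mirror the ones used in the cell.

```python
from dataclasses import dataclass, field


@dataclass
class ApiStats:
    """Hypothetical request/error counters for an API client."""

    total_requests: int = 0
    total_errors: int = 0
    cache: dict = field(default_factory=dict)

    def record(self, ok):
        """Count one request; ok=False marks it as an error."""
        self.total_requests += 1
        if not ok:
            self.total_errors += 1

    def get_api_statistics(self):
        # Guard against division by zero before any request has been made.
        if self.total_requests:
            error_rate = 100.0 * self.total_errors / self.total_requests
        else:
            error_rate = 0.0
        return {
            "total_requests": self.total_requests,
            "total_errors": self.total_errors,
            "error_rate_percent": error_rate,
            "cache_size": len(self.cache),
        }


def health_status(stats, warn_at_percent=5.0):
    """Same rule as the cell below: strictly under the threshold is healthy."""
    return "HEALTHY" if stats["error_rate_percent"] < warn_at_percent else "WARNING"
```

Note that with a strict `<` comparison, an error rate of exactly 5% is reported as a warning.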

In [None]:
# Get API statistics
api_stats = collector.steam_api.get_api_statistics()

print("Steam API Health Check:")
print("-" * 50)
print(f"Total API Requests: {api_stats['total_requests']}")
print(f"Total Errors: {api_stats['total_errors']}")
print(f"Error Rate: {api_stats['error_rate_percent']:.2f}%")
print(f"Cache Size: {api_stats['cache_size']} entries")

# Create a simple health dashboard
health_status = "HEALTHY" if api_stats['error_rate_percent'] < 5 else "WARNING"
print(f"\nAPI Health Status: {health_status}")

## 9. Setting Up Continuous Collection

**Purpose:** Demonstrate how to set up continuous data collection.

**Why:** For time series analysis, we need regular data collection to track changes over time.

**What to Know:** In production, this would run as a scheduled task (cron job, Airflow, etc.). Here we show how it would be configured.

**Note:** Do not run the continuous collection in this notebook - it's for demonstration only!
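
Before scheduling anything, it helps to know how many snapshots a given configuration produces. The sketch below only computes the planned run times and performs no collection; `planned_collection_times` is a hypothetical helper, not part of `DataCollector`.

```python
from datetime import datetime, timedelta


def planned_collection_times(start, interval_hours, duration_days):
    """List the run times from `start` up to (not including) the end time."""
    step = timedelta(hours=interval_hours)
    end = start + timedelta(days=duration_days)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


schedule = planned_collection_times(
    datetime(2024, 1, 1), interval_hours=6, duration_days=30
)
```

With a 6 hour interval over 30 days, this yields 30 × 24 / 6 = 120 planned runs, so the example configuration in the cell below would produce about 120 snapshot files if every run succeeds.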

In [None]:
# Example configuration for continuous collection
# DO NOT RUN THIS CELL - IT'S FOR DEMONSTRATION ONLY

'''
# Run this in a separate script or scheduled task
def run_scheduled_collection():
    collector = DataCollector()
    
    # Collect data every 6 hours for 30 days
    collector.run_collection_loop(
        interval_hours=6,
        duration_days=30
    )

# For production, use:
# - Airflow DAG for orchestration
# - Docker container for isolation
# - Cloud storage for data backup
'''

print("Continuous collection configuration example displayed above.")
print("To implement continuous collection:")
print("1. Create a separate Python script using the configuration above")
print("2. Set up a cron job or use a task orchestrator")
print("3. Implement error handling and notifications")
print("4. Set up data backup mechanisms")

## 10. Next Steps

**Purpose:** Outline what to do after data collection is set up.

**Why:** Understanding the roadmap helps guide the development of our prediction model.

**Next Actions:**
1. Set up continuous data collection
2. Implement feature engineering (Notebook 02)
3. Build predictive models (Notebook 03)
4. Validate and deploy the system

In [None]:
print("Data Collection Summary:")
print("=" * 50)
print(f"Collection Time: {datetime.now()}")
print(f"Games Collected: {len(current_data)}")
print(f"Total Players: {current_data['player_count'].sum():,}")
print(f"Data Saved To: {save_path}")

print("\n\nNext Steps:")
print("-" * 50)
print("1. Review the collected data for quality")
print("2. Set up continuous data collection")
print("3. Move to notebook 02 for feature engineering")
print("4. Start building the prediction model")

# Save a collection summary for reference
summary = {
    'collection_time': datetime.now().isoformat(),
    'games_collected': len(current_data),
    'total_players': int(current_data['player_count'].sum()),
    'top_game': current_data.nlargest(1, 'player_count')[['name', 'player_count']].to_dict('records')[0],
    'category_counts': current_data['category'].value_counts().to_dict()
}

# Save summary as JSON
summary_path = f"../data/collection_summary_{datetime.now().strftime('%Y-%m-%d-%H-%M')}.json"
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=4)

print(f"\nCollection summary saved to: {summary_path}")