# Steam Game Recommender System - Data Exploration

This notebook explores the Steam Video Game and Bundle Data from Professor Julian McAuley's research repository.

In [None]:
# Import necessary libraries
import os
import sys
import json
import gzip
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root directory to the Python path (sys is already imported above)
sys.path.append('..')

# Import project modules
# Check if they exist, otherwise define loading functions here for standalone usage
try:
    from src.data.loader import load_json_gz, load_json, convert_to_dataframes
except ImportError:
    # Define functions locally if module import fails
    def load_json_gz(filepath):
        """Load a gzipped JSON Lines file (one object per line), silently skipping malformed lines."""
        data = []
        with gzip.open(filepath, 'rt') as f:
            for line in f:
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError:
                    continue
        return data
    
    def load_json(filepath):
        """Load a JSON Lines file (one object per line), silently skipping malformed lines."""
        data = []
        with open(filepath, 'r') as f:
            for line in f:
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError:
                    continue
        return data
    
    def convert_to_dataframes(data):
        """Convert JSON data to pandas DataFrames."""
        dfs = {}
        if 'reviews' in data and data['reviews']:
            dfs['reviews'] = pd.DataFrame(data['reviews'])
        if 'metadata' in data and data['metadata']:
            dfs['metadata'] = pd.DataFrame(data['metadata'])
        if 'bundles' in data and data['bundles']:
            bundle_rows = []
            for bundle in data['bundles']:
                for item in bundle.get('items', []):
                    bundle_row = {
                        'bundle_id': bundle.get('bundle_id'),
                        'bundle_name': bundle.get('bundle_name'),
                        'bundle_price': bundle.get('bundle_price'),
                        'item_id': item.get('item_id'),
                        'item_name': item.get('item_name'),
                        'genre': item.get('genre')
                    }
                    bundle_rows.append(bundle_row)
            dfs['bundles'] = pd.DataFrame(bundle_rows)
        return dfs

# Set up plotting
# sns.set() runs last, so its whitegrid settings take precedence over 'ggplot' where they overlap
plt.style.use('ggplot')
sns.set(style="whitegrid")
%matplotlib inline

## 1. Loading the Data

First, let's define the paths to our data files and load them.

In [None]:
# Define data paths
DATA_DIR = '../data'
REVIEWS_PATH = os.path.join(DATA_DIR, 'reviews_v2.json.gz')
METADATA_PATH = os.path.join(DATA_DIR, 'items_v2.json.gz')
BUNDLES_PATH = os.path.join(DATA_DIR, 'bundles.json')

# Check if files exist
files_exist = all(os.path.exists(path) for path in [REVIEWS_PATH, METADATA_PATH, BUNDLES_PATH])

if not files_exist:
    print("Some data files are missing. Please run the download_data.py script first.")
    print("You can run: python ../scripts/download_data.py")
else:
    print("All data files found.")

In [None]:
# Function to load a sample of data for exploration
def load_sample_data(reviews_path, metadata_path, bundles_path, sample_size=10000):
    """Load a sample of data for exploration."""
    data = {}
    
    # Load a sample of reviews
    print("Loading reviews sample...")
    if os.path.exists(reviews_path):
        with gzip.open(reviews_path, 'rt') as f:
            data['reviews'] = []
            for i, line in enumerate(f):
                if i >= sample_size:
                    break
                try:
                    data['reviews'].append(json.loads(line))
                except json.JSONDecodeError:
                    continue
    
    # Load a sample of metadata
    print("Loading metadata sample...")
    if os.path.exists(metadata_path):
        with gzip.open(metadata_path, 'rt') as f:
            data['metadata'] = []
            for i, line in enumerate(f):
                if i >= sample_size // 10:  # Load fewer metadata items
                    break
                try:
                    data['metadata'].append(json.loads(line))
                except json.JSONDecodeError:
                    continue
    
    # Load bundles
    print("Loading bundles...")
    if os.path.exists(bundles_path):
        data['bundles'] = load_json(bundles_path)
    
    return data

# Load a sample of data (dfs stays empty if files are missing, so later cells still run)
dfs = {}
if files_exist:
    sample_data = load_sample_data(REVIEWS_PATH, METADATA_PATH, BUNDLES_PATH)
    dfs = convert_to_dataframes(sample_data)
    
    # Print summary of loaded data
    print("\nLoaded data summary:")
    for key, df in dfs.items():
        print(f"{key}: {len(df)} records")
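
The loaders above skip any line that `json.loads` cannot parse, so a file in a slightly different format would load as an empty list without any warning. The sketch below is one way to make that visible. It assumes, as something to verify against your own copy of the files, that some lines may be Python dict literals (single-quoted keys) rather than strict JSON. `parse_record` and `parse_lines` are hypothetical helper names, not part of the project's `src.data.loader` module.

```python
import ast
import json


def parse_record(line):
    """Parse one line as strict JSON, falling back to a Python dict literal."""
    line = line.strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        pass
    try:
        # Assumption: non-JSON lines may be Python literals such as {'a': 1}
        record = ast.literal_eval(line)
    except (ValueError, SyntaxError):
        return None
    return record if isinstance(record, dict) else None


def parse_lines(lines):
    """Return (records, n_skipped) so that dropped lines are counted, not hidden."""
    records, skipped = [], 0
    for line in lines:
        record = parse_record(line)
        if record is None:
            skipped += 1
        else:
            records.append(record)
    return records, skipped


sample_lines = [
    '{"user_id": "u1", "item_id": "10"}',   # strict JSON
    "{'user_id': 'u2', 'item_id': '20'}",   # Python dict literal
    "not a record",                          # unparseable, counted as skipped
]
records, skipped = parse_lines(sample_lines)
print(len(records), "parsed,", skipped, "skipped")
```

If the skipped count is close to the number of lines read, the file format is probably not what the loader expects.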

## 2. Examining the Data Structure

Let's look at the structure of each dataset.

In [None]:
# Examine reviews data
if 'reviews' in dfs:
    print("Reviews data columns:")
    print(dfs['reviews'].columns.tolist())
    print("\nReviews data sample:")
    display(dfs['reviews'].head())
else:
    print("No reviews data loaded.")

In [None]:
# Examine metadata
if 'metadata' in dfs:
    print("Metadata columns:")
    print(dfs['metadata'].columns.tolist())
    print("\nMetadata sample:")
    display(dfs['metadata'].head())
else:
    print("No metadata loaded.")

In [None]:
# Examine bundles data
if 'bundles' in dfs:
    print("Bundles columns:")
    print(dfs['bundles'].columns.tolist())
    print("\nBundles sample:")
    display(dfs['bundles'].head())
else:
    print("No bundles data loaded.")

## 3. Data Analysis

Let's analyze the key aspects of the data that will be relevant for building a recommendation system.

### 3.1 Reviews and Playtime Analysis

In [None]:
if 'reviews' in dfs:
    reviews_df = dfs['reviews']
    
    # Basic statistics
    print(f"Number of unique users: {reviews_df['user_id'].nunique()}")
    print(f"Number of unique games: {reviews_df['item_id'].nunique()}")
    
    # Playtime analysis
    if 'playtime_forever' in reviews_df.columns:
        print("\nPlaytime statistics:")
        print(reviews_df['playtime_forever'].describe())
        
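        # Visualization of playtime distribution
        # Log-spaced bins over positive playtime; zeros cannot be shown on a log axis
        positive_playtime = reviews_df.loc[reviews_df['playtime_forever'] > 0, 'playtime_forever']
        plt.figure(figsize=(10, 6))
        if not positive_playtime.empty:
            log_bins = np.logspace(0, np.log10(positive_playtime.max()), 50)
            plt.hist(positive_playtime, bins=log_bins)
        plt.title('Distribution of Playtime')
        plt.xlabel('Playtime (minutes)')
        plt.ylabel('Number of user-game records')
        plt.xscale('log')  # Use log scale for better visualization
        plt.show()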
    
    # Games per user
    games_per_user = reviews_df.groupby('user_id')['item_id'].count()
    print("\nGames per user statistics:")
    print(games_per_user.describe())
    
    plt.figure(figsize=(10, 6))
    games_per_user.hist(bins=30)
    plt.title('Number of Games per User')
    plt.xlabel('Number of games')
    plt.ylabel('Number of users')
    plt.show()
    
    # Users per game
    users_per_game = reviews_df.groupby('item_id')['user_id'].count()
    print("\nUsers per game statistics:")
    print(users_per_game.describe())
    
    plt.figure(figsize=(10, 6))
    users_per_game.hist(bins=30)
    plt.title('Number of Users per Game')
    plt.xlabel('Number of users')
    plt.ylabel('Number of games')
    plt.show()
    
    # Top games by number of players
    top_games = users_per_game.sort_values(ascending=False).head(20)
    
    plt.figure(figsize=(12, 8))
    top_games.plot(kind='bar')
    plt.title('Top 20 Games by Number of Players')
    plt.xlabel('Game ID')
    plt.ylabel('Number of players')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

### 3.2 Genre Analysis

In [None]:
if 'metadata' in dfs:
    metadata_df = dfs['metadata']
    
    # Analyze genres if available
    if 'genre' in metadata_df.columns:
        # Extract all genres
        all_genres = []
        for genres in metadata_df['genre'].dropna():
            if isinstance(genres, str):
                all_genres.extend([g.strip() for g in genres.split(',')])
        
        # Count genre occurrences
        genre_counts = pd.Series(all_genres).value_counts()
        
        print("Top 20 genres:")
        print(genre_counts.head(20))
        
        # Visualize top genres
        plt.figure(figsize=(12, 8))
        genre_counts.head(15).plot(kind='bar')
        plt.title('Top 15 Game Genres')
        plt.xlabel('Genre')
        plt.ylabel('Number of games')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
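
The genre cell above only counts string values, so rows whose genre is stored as a list would be ignored. The sketch below handles both cases on a toy frame. Whether the real metadata stores genres as lists or strings is an assumption to check; `split_genres` and `toy_metadata` are hypothetical names used for illustration.

```python
import pandas as pd


def split_genres(value):
    """Normalise one genre cell to a list of genre names."""
    if isinstance(value, (list, tuple)):
        return [str(g).strip() for g in value if str(g).strip()]
    if isinstance(value, str):
        return [g.strip() for g in value.split(',') if g.strip()]
    return []  # None, NaN, or any other unexpected type


# Toy data mixing comma-separated strings, a list, and a missing value
toy_metadata = pd.DataFrame({
    'item_id': ['10', '20', '30', '40'],
    'genre': ['Action, Indie', ['Action', 'RPG'], None, 'Indie'],
})

genre_counts_toy = (
    toy_metadata['genre']
    .apply(split_genres)   # every cell becomes a list
    .explode()             # one row per (game, genre); empty lists become NaN
    .dropna()
    .value_counts()
)
print(genre_counts_toy)
```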

### 3.3 Bundle Analysis

In [None]:
if 'bundles' in dfs:
    bundles_df = dfs['bundles']
    
    # Analyze bundle sizes
    bundle_sizes = bundles_df.groupby('bundle_id')['item_id'].count()
    
    print("Bundle size statistics:")
    print(bundle_sizes.describe())
    
    plt.figure(figsize=(10, 6))
    bundle_sizes.hist(bins=20)
    plt.title('Distribution of Bundle Sizes')
    plt.xlabel('Number of games in bundle')
    plt.ylabel('Number of bundles')
    plt.show()
    
    # Top bundles by size
    top_bundles = bundle_sizes.sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(12, 6))
    top_bundles.plot(kind='bar')
    plt.title('Top 10 Largest Bundles')
    plt.xlabel('Bundle ID')
    plt.ylabel('Number of games')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

## 4. Creating a User-Game Interaction Matrix

Let's create a small sample of the user-game interaction matrix to see how sparse it is.

In [None]:
if 'reviews' in dfs:
    reviews_df = dfs['reviews']
    
    # Take a small subset for visualization
    top_users = reviews_df['user_id'].value_counts().head(20).index
    top_games = reviews_df['item_id'].value_counts().head(20).index
    
    small_df = reviews_df[
        reviews_df['user_id'].isin(top_users) & 
        reviews_df['item_id'].isin(top_games)
    ]
    
    # Create a pivot table
    if 'playtime_forever' in small_df.columns:
        # Use playtime as values
        pivot = small_df.pivot_table(
            index='user_id',
            columns='item_id',
            values='playtime_forever',
            fill_value=0
        )
    else:
        # Use binary indicators if playtime is not available
        small_df = small_df.assign(interaction=1)  # assign() avoids SettingWithCopyWarning on a filtered slice
        pivot = small_df.pivot_table(
            index='user_id',
            columns='item_id',
            values='interaction',
            fill_value=0
        )
    
    # Visualize the matrix
    plt.figure(figsize=(12, 10))
    sns.heatmap(pivot > 0, cmap='Blues', cbar=False)
    plt.title('User-Game Interaction Matrix (Binary)')
    plt.xlabel('Games')
    plt.ylabel('Users')
    plt.show()
    
    # Calculate sparsity
    sparsity = (pivot == 0).sum().sum() / (pivot.shape[0] * pivot.shape[1])
    print(f"Matrix sparsity: {sparsity:.4f} ({sparsity*100:.2f}% of entries are zeros)")
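
The pivot above covers only the 20 most active users and the 20 most frequent games, which is the densest corner of the matrix. For the whole sample, sparsity can be estimated without building a dense matrix by counting distinct (user, game) pairs. This is a minimal sketch on toy data; `interaction_sparsity` is a hypothetical helper, and it treats any recorded pair as an interaction regardless of playtime, which differs slightly from the `pivot == 0` check above.

```python
import pandas as pd


def interaction_sparsity(df, user_col='user_id', item_col='item_id'):
    """Fraction of empty cells in the user-item matrix implied by df."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    if n_users == 0 or n_items == 0:
        return float('nan')
    # Duplicate (user, item) rows count once
    n_pairs = len(df[[user_col, item_col]].drop_duplicates())
    return 1.0 - n_pairs / (n_users * n_items)


toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'item_id': ['a', 'b', 'a', 'c', 'c'],
})
# 3 users x 3 games = 9 cells, 4 distinct pairs observed
print(f"Sparsity: {interaction_sparsity(toy):.4f}")
```

On the real data this could be called as `interaction_sparsity(dfs['reviews'])`, assuming the `user_id` and `item_id` columns used earlier in this notebook are present.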

## 5. Key Findings and Implications for the Recommendation System

Based on our exploratory data analysis, here are some key observations and their implications for building a recommendation system:

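1. **Data Sparsity**: The user-game interaction matrix is extremely sparse. The heatmap in Section 4 covers only the 20 most active users and the 20 most frequent games, which is the densest part of the matrix, so the full matrix is likely to be sparser still. We'll need techniques that handle sparse data well, such as matrix factorization.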

2. **Playtime Distribution**: Playtime is heavily skewed, with many short plays and few extremely long plays. A log transform or per-user normalization will probably be needed before playtime can serve as an implicit rating.

3. **User and Game Distributions**: There's a large variation in the number of games per user and users per game. Users and games with very few interactions are hard to model, and entirely new users or games pose a cold-start problem.

4. **Genre Information**: Genre data can be valuable for content-based filtering or hybrid approaches that combine collaborative and content-based methods.

5. **Bundles**: Bundle information provides additional context that could be used for package recommendations.
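
Two of the observations above (skewed playtime, users and games with very few interactions) can be addressed with a small preprocessing step. The sketch below shows one possible approach on toy data: drop users and games below a minimum interaction count, then apply `log1p` to playtime. The thresholds, the `playtime_log` column name, and the `filter_min_interactions` helper are illustrative assumptions, not settings taken from this project.

```python
import numpy as np
import pandas as pd


def filter_min_interactions(df, min_user=2, min_item=2,
                            user_col='user_id', item_col='item_id'):
    """Repeatedly drop rare users and items until every count meets its threshold."""
    while True:
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= min_user) & (item_counts >= min_item)
        if keep.all():  # also True for an empty frame, so the loop terminates
            return df
        df = df[keep]


toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u3', 'u4'],
    'item_id': ['a', 'b', 'a', 'b', 'a', 'c'],
    'playtime_forever': [0, 60, 600, 6000, 5, 30],
})

filtered = filter_min_interactions(toy)
# log1p compresses the long tail and maps zero playtime to zero
filtered = filtered.assign(playtime_log=np.log1p(filtered['playtime_forever']))
print(filtered)
```

The filter loops because removing a rare game can push a user below the threshold, and vice versa.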

In the next notebook, we'll preprocess this data to prepare it for model training.

## 6. Next Steps

1. **Data Preprocessing**:
   - Handle missing values
   - Normalize playtime data
   - Filter out users with few interactions and games with few players
   - Split data into training and testing sets

2. **Model Development**:
   - Implement baseline collaborative filtering models
   - Implement advanced SVD-based models
   - Consider hybrid approaches using genre information

3. **Evaluation**:
   - Measure model performance using metrics like precision@k and hit rate
   - Compare different approaches
   - Tune hyperparameters for optimal performance
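
As a reference point for the evaluation step, here is a minimal sketch of precision@k and hit rate on toy data. Definitions vary between papers and libraries: this version divides precision by `k` even when fewer than `k` items are recommended, and counts a hit when at least one held-out item appears in the top `k`. The function names and the dictionary layout are assumptions made for illustration.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def hit_rate_at_k(recommendations, held_out, k):
    """Share of users with at least one held-out item in their top-k list."""
    users = [u for u in held_out if held_out[u]]  # ignore users with nothing held out
    if not users:
        return 0.0
    hits = sum(
        1 for u in users
        if any(item in held_out[u] for item in recommendations.get(u, [])[:k])
    )
    return hits / len(users)


# Toy example: ranked recommendations and held-out test items per user
recommendations = {'u1': ['a', 'b', 'c'], 'u2': ['d', 'e', 'f']}
held_out = {'u1': {'b', 'z'}, 'u2': {'x'}}

print(precision_at_k(recommendations['u1'], held_out['u1'], k=3))
print(hit_rate_at_k(recommendations, held_out, k=3))
```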