# IMDB Data Analysis - Example Usage

This notebook demonstrates how to use the new data loading system for IMDB analysis.

## Setup

In [None]:
import sys
sys.path.append('../src')  # make the project's src/ modules importable from this notebook

from data_loader import IMDBDataLoader, create_ml_dataset, load_ratings, load_basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## Data Overview

In [None]:
# Initialize data loader
loader = IMDBDataLoader()

# Get overview of available datasets
datasets = loader.get_available_datasets()
print("Available datasets:")
for dataset in datasets:
    print(f"  - {dataset}")

print("\nDataset Information:")
print("-" * 50)
info = loader.get_dataset_info()
for dataset, details in info.items():
    print(f"{dataset}:")
    print(f"  Chunks: {details['chunks']}")
    print(f"  Total rows: {details['total_rows']:,}")
    print(f"  Sample available: {details['sample_available']}")
    print()

## Quick Data Exploration with Samples

In [None]:
# Load sample datasets for quick exploration
ratings_sample = loader.load_sample('title.ratings')
basics_sample = loader.load_sample('title.basics')

print("Ratings sample:")
print(ratings_sample.head())
print(f"Shape: {ratings_sample.shape}")

print("\nBasics sample:")
print(basics_sample.head())
print(f"Shape: {basics_sample.shape}")

## Loading Larger Datasets

In [None]:
# Load specific number of chunks for analysis
ratings_medium = loader.load_chunks('title.ratings', max_chunks=3)
print(f"Loaded ratings dataset: {ratings_medium.shape}")
print("Rating distribution:")
print(ratings_medium['averageRating'].describe())
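
The internals of `loader.load_chunks` are not shown in this notebook. As a rough illustration of the idea only, the sketch below reads a limited number of chunks with plain pandas. The helper name `read_first_chunks`, the chunk size, and the in-memory sample rows are all hypothetical and are not part of `data_loader`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a tab-separated ratings file (made-up rows).
tsv = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t5.8\t270\n"
    "tt0000003\t6.5\t1900\n"
    "tt0000004\t5.5\t180\n"
    "tt0000005\t6.2\t2700\n"
)

def read_first_chunks(source, chunk_rows, max_chunks):
    """Read at most `max_chunks` chunks of `chunk_rows` rows and concatenate them."""
    frames = []
    # With `chunksize` set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, sep="\t", chunksize=chunk_rows)
    for i, chunk in enumerate(reader):
        if i >= max_chunks:
            break  # stop early so the remaining rows are never parsed
        frames.append(chunk)
    return pd.concat(frames, ignore_index=True)

# Two chunks of two rows each: only the first four rows are loaded.
ratings_part = read_first_chunks(tsv, chunk_rows=2, max_chunks=2)
print(ratings_part.shape)
```

Capping the number of chunks keeps memory use bounded, which is presumably why `max_chunks` exists on the loader.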

## Creating ML-Ready Dataset

In [None]:
# Create a dataset ready for machine learning
ml_data = create_ml_dataset(sample_size=25000, random_state=42)

print(f"ML dataset shape: {ml_data.shape}")
print(f"ML dataset columns: {list(ml_data.columns)}")
print("\nFirst few rows:")
print(ml_data.head())

## Quick Visualization

In [None]:
# Create some quick visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Rating distribution
axes[0,0].hist(ml_data['averageRating'], bins=50, edgecolor='black', alpha=0.7)
axes[0,0].set_title('Rating Distribution')
axes[0,0].set_xlabel('Average Rating')
axes[0,0].set_ylabel('Frequency')

# Votes vs Rating scatter
# Subsample for a cleaner plot; cap at the dataset size and fix the seed for reproducibility
sample_for_scatter = ml_data.sample(min(5000, len(ml_data)), random_state=42)
axes[0,1].scatter(sample_for_scatter['averageRating'], sample_for_scatter['numVotes'], alpha=0.6)
axes[0,1].set_title('Votes vs Rating')
axes[0,1].set_xlabel('Average Rating')
axes[0,1].set_ylabel('Number of Votes')
axes[0,1].set_yscale('log')

# Title types
title_counts = ml_data['titleType'].value_counts().head(10)
axes[1,0].bar(range(len(title_counts)), title_counts.values)
axes[1,0].set_title('Top Title Types')
axes[1,0].set_xlabel('Title Type')
axes[1,0].set_ylabel('Count')
axes[1,0].set_xticks(range(len(title_counts)))
axes[1,0].set_xticklabels(title_counts.index, rotation=45)

# Runtime distribution (where available)
runtime_data = ml_data['runtimeMinutes'].dropna()
runtime_data = runtime_data[runtime_data < 300]  # Filter extreme outliers
axes[1,1].hist(runtime_data, bins=50, edgecolor='black', alpha=0.7)
axes[1,1].set_title('Runtime Distribution')
axes[1,1].set_xlabel('Runtime (minutes)')
axes[1,1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
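
The raw IMDb TSV files mark missing values with `\N`. Whether `create_ml_dataset` already converts `runtimeMinutes` to numbers is not shown here, so the sketch below is only an assumption about what may be needed: if the column arrives as strings, coercing it with `pd.to_numeric` before filtering avoids comparison errors. The sample values are made up.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with the `\N` placeholder.
raw = pd.Series(["90", "\\N", "125", "45", "\\N"], name="runtimeMinutes")

# errors="coerce" turns anything unparseable (here `\N`) into NaN.
runtime = pd.to_numeric(raw, errors="coerce")
print(runtime.isna().sum())

# After coercion, dropna() and numeric filters behave as expected.
clean = runtime.dropna()
clean = clean[clean < 300]
print(clean.tolist())
```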

## Data Summary

In [None]:
print("=" * 60)
print("IMDB DATASET SUMMARY")
print("=" * 60)

print(f"\nTotal records in ML dataset: {len(ml_data):,}")
print(f"Average rating: {ml_data['averageRating'].mean():.2f}")
print(f"Median rating: {ml_data['averageRating'].median():.2f}")
print(f"Rating range: {ml_data['averageRating'].min():.1f} - {ml_data['averageRating'].max():.1f}")

print(f"\nTotal votes: {ml_data['numVotes'].sum():,}")
print(f"Average votes per title: {ml_data['numVotes'].mean():.0f}")
print(f"Median votes per title: {ml_data['numVotes'].median():.0f}")

print("\nTop 5 Title Types:")
top_types = ml_data['titleType'].value_counts().head(5)
for title_type, count in top_types.items():
    print(f"  {title_type}: {count:,} ({count/len(ml_data)*100:.1f}%)")

# Year range
years = ml_data['startYear'].dropna()
if len(years) > 0:
    print(f"\nYear range: {int(years.min())} - {int(years.max())}")
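    # Floor each year to its decade first, then take the mode of the decades
    # (the mode of the raw years only gives the decade of the single most common year)
    decades = (years // 10 * 10).astype(int)
    print(f"Most common decade: {decades.mode().iloc[0]}s")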

## Next Steps

This notebook demonstrates the basic usage of the new data loading system. You can now:

1. **Use the other analysis notebooks** with the updated data loading system
2. **Load larger datasets** by adjusting the `max_chunks` parameter
3. **Create custom datasets** by merging multiple tables
4. **Build machine learning models** using the `create_ml_dataset()` function

Key functions to remember:
- `loader.load_sample(dataset_name)` - Quick samples for exploration
- `loader.load_chunks(dataset_name, max_chunks=N)` - Load N chunks of data
- `create_ml_dataset(sample_size=N)` - Ready-to-use ML dataset
- `loader.create_merged_dataset([datasets], join_column)` - Merge datasets
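
The exact signature and behavior of `loader.create_merged_dataset` are only hinted at in the list above. As a minimal sketch of the underlying idea, the example below joins two small, made-up tables with plain pandas on `tconst`, the title identifier shared by `title.ratings` and `title.basics`. It is an illustration under those assumptions, not the loader's implementation.

```python
import pandas as pd

# Made-up stand-ins for the ratings and basics tables.
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.7, 5.8, 6.5],
    "numVotes": [2000, 270, 1900],
})
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000004"],
    "titleType": ["short", "short", "movie"],
    "startYear": [1894, 1892, 1900],
})

# Inner join keeps only titles present in both tables;
# validate="one_to_one" raises if `tconst` is duplicated on either side.
merged = ratings.merge(basics, on="tconst", how="inner", validate="one_to_one")
print(merged)
```

An inner join drops titles without ratings, which suits rating analysis. Use `how="left"` instead if unrated titles should be kept.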