# Rotten Tomatoes Movie Data Pipeline Demo

This notebook demonstrates the complete data acquisition and cleaning pipeline for the Rotten Tomatoes movie analysis project. We'll use the automated task system to download datasets, clean the data, and scrape additional movie information from Rotten Tomatoes.

## Project Overview

This project analyzes movie reviews and ratings data from Rotten Tomatoes, combining:
- **Kaggle Dataset**: Movie critic reviews and ratings
- **Web Scraping**: Additional movie details (titles, descriptions, release years) from the Rotten Tomatoes website

## Data Pipeline Steps

1. **Environment Setup**: Install dependencies and verify setup
2. **Data Download**: Download raw datasets from Kaggle
3. **Data Cleaning**: Clean and validate the downloaded data
4. **Web Scraping**: Scrape additional movie details from Rotten Tomatoes
5. **Handling Failed Scrapes**: Retry movies whose scrape failed
6. **Data Exploration**: Validate and explore the final cleaned datasets

## 1. Environment Setup

First, let's verify that our environment is properly set up and all dependencies are installed.

In [1]:
# Check Python version and environment
import sys
print(f"Python version: {sys.version}")
# sys.path[0] is the first module search path entry, not the working directory
print(f"First sys.path entry: {sys.path[0]}")

# Verify required packages are available
try:
    import polars as pl
    import pandas as pd
    import numpy as np
    print("✅ Core data packages available")
except ImportError as e:
    print(f"❌ Missing packages: {e}")

Python version: 3.12.8 (main, Jan 14 2025, 22:49:36) [MSC v.1942 64 bit (AMD64)]
First sys.path entry: C:\Users\anton\AppData\Roaming\uv\python\cpython-3.12.8-windows-x86_64-none\python312.zip
✅ Core data packages available


In [2]:
# Sync dependencies using uv
!uv sync

[2mResolved [1m114 packages[0m [2min 2.11s[0m[0m
[2mAudited [1m110 packages[0m [2min 273ms[0m[0m


## 2. Data Download

Now let's download the raw datasets from Kaggle: the Rotten Tomatoes movie data and the critic reviews.

In [3]:
# Download datasets from Kaggle
# This will download and save the Rotten Tomatoes datasets locally
!task download-data

task: [sync] uv sync
Resolved 114 packages in 1ms
Audited 110 packages in 10ms
task: [download-data] uv run python src/02807_project/data_loader.py


In [None]:
# Verify downloaded data
from pathlib import Path

data_dir = Path("data/raw")
if data_dir.exists():
    files = list(data_dir.glob("*.csv"))
    print(f"📁 Found {len(files)} CSV files in {data_dir}:")
    for file in files:
        size = file.stat().st_size / 1024 / 1024  # Size in MB
        print(f"   {file.name}: {size:.2f} MB")
else:
    print("❌ Data directory not found")

## 3. Data Cleaning

Next, let's clean the downloaded data by removing null values, invalid ratings, and standardizing the format.

In [None]:
# Clean the downloaded datasets
# This removes nulls, invalid ratings, and maps review scores to numeric values
!task clean-data
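
The cleaning script itself is not shown in this notebook. As a rough illustration of the kind of work described above (dropping nulls, rejecting invalid ratings, mapping review scores to numeric values), here is a minimal pure-Python sketch. The column names `movie_id` and `review_score`, the fraction-style score format (`"3/5"`), and both function names are assumptions made for this example, not the project's actual schema or code.

```python
from typing import Optional


def score_to_numeric(raw: Optional[str]) -> Optional[float]:
    """Map a fraction-style score such as "3/5" to a value in [0, 1].

    Returns None for missing, malformed, or out-of-range scores.
    """
    if raw is None:
        return None
    parts = raw.strip().split("/")
    if len(parts) != 2:
        return None
    try:
        value, scale = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # Reject non-positive scales and scores outside 0..scale
    if scale <= 0 or not 0 <= value <= scale:
        return None
    return value / scale


def clean_reviews(rows):
    """Keep only rows with a valid score and attach its numeric value."""
    cleaned = []
    for row in rows:
        numeric = score_to_numeric(row.get("review_score"))
        if numeric is not None:
            cleaned.append({**row, "score_numeric": numeric})
    return cleaned


# Hypothetical sample rows (column names are assumptions)
sample = [
    {"movie_id": "m1", "review_score": "3/5"},
    {"movie_id": "m2", "review_score": None},
    {"movie_id": "m3", "review_score": "6/5"},
    {"movie_id": "m4", "review_score": "7.5/10"},
]
result = clean_reviews(sample)
print([row["movie_id"] for row in result])
```

Only the rows with a parseable, in-range score survive; the real task may also handle other score formats such as letter grades.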

In [None]:
# Quick exploration of cleaned data
# Relies on `pl` and `Path` imported in earlier cells
cleaned_files = list(Path("data/clean").glob("*.csv"))
if cleaned_files:
    print("📁 Cleaned datasets available:")
    for file in cleaned_files:
        df = pl.read_csv(file)
        print(f"\n📊 {file.name}:")
        print(f"   Rows: {len(df):,}")
        print(f"   Columns: {len(df.columns)}")
        print(f"   Column names: {', '.join(df.columns)}")
else:
    print("❌ No cleaned data files found")

## 4. Web Scraping - Rotten Tomatoes Data

Now let's scrape additional movie details (titles, descriptions, release years) from Rotten Tomatoes using the movie IDs from our cleaned dataset.

In [None]:
# Scrape movie details from Rotten Tomatoes
# This will scrape titles, descriptions, and release years for movies in our dataset
# Note: This may take several minutes depending on the number of movies
!task scrape-rt-movies
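
The scraper's internals are not shown here. To give a sense of what extracting a title, description, and release year from a page can look like, the sketch below parses an HTML string with the standard library's `html.parser`. It assumes the details sit in the `<title>` tag and a `<meta name="description">` tag, and that the year appears in parentheses in the title; the real Rotten Tomatoes markup and the project's scraper may differ. `MovieMetaParser` and `parse_movie_page` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class MovieMetaParser(HTMLParser):
    """Collect the <title> text and the meta description from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attr_map.get("name") == "description":
            self.description = attr_map.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_movie_page(html: str) -> dict:
    """Return title, description, and year (or None) for one page."""
    parser = MovieMetaParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    # Assumption: the release year appears as "(YYYY)" in the title
    match = re.search(r"\((\d{4})\)", title)
    year = int(match.group(1)) if match else None
    return {"title": title, "description": parser.description, "year": year}


# Placeholder page, not real Rotten Tomatoes markup
sample_html = """
<html><head>
<title>Example Movie (2001)</title>
<meta name="description" content="A placeholder synopsis.">
</head><body></body></html>
"""
details = parse_movie_page(sample_html)
print(details)
```

Keeping the parsing separate from the network request makes it possible to test the extraction on saved pages without hitting the site.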

## 5. Handling Failed Scrapes

If some movies failed to scrape (due to network issues, rate limiting, etc.), we can retry them specifically.

In [None]:
# Check for failed scrapes and retry them
# This will only re-scrape movies that failed in the previous attempt
# Note: Run this only if you suspect some scrapes failed
!task retry-failed-scrapes
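
How `retry-failed-scrapes` works internally is not shown in this notebook. One common way to re-scrape only the failed movies is to loop over the failed IDs and retry each with exponential backoff, which also eases rate limiting. The sketch below illustrates that idea under stated assumptions: `retry_failed`, its parameters, and the `flaky_fetch` stand-in are hypothetical and not taken from the project.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def retry_failed(
    failed_ids: Iterable[str],
    fetch: Callable[[str], dict],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, dict], List[str]]:
    """Re-fetch only the IDs that failed earlier, backing off between attempts."""
    recovered: Dict[str, dict] = {}
    still_failed: List[str] = []
    for movie_id in failed_ids:
        for attempt in range(max_attempts):
            try:
                recovered[movie_id] = fetch(movie_id)
                break
            except Exception:
                if attempt < max_attempts - 1:
                    # Wait 1x, 2x, 4x, ... the base delay before the next try
                    sleep(base_delay * 2 ** attempt)
        else:
            # No attempt succeeded for this ID
            still_failed.append(movie_id)
    return recovered, still_failed


# Simulated fetcher: "m1" succeeds on its second call, "m2" always fails
calls: Dict[str, int] = {}


def flaky_fetch(movie_id: str) -> dict:
    calls[movie_id] = calls.get(movie_id, 0) + 1
    if movie_id == "m2" or calls[movie_id] < 2:
        raise ConnectionError("simulated network failure")
    return {"id": movie_id, "title": f"Title for {movie_id}"}


delays: List[float] = []
# Record delays instead of sleeping so the example runs instantly
recovered, still_failed = retry_failed(["m1", "m2"], flaky_fetch, sleep=delays.append)
print(recovered, still_failed, delays)
```

Passing `sleep` as a parameter keeps the function testable; in a real run the default `time.sleep` would apply the delays.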

## 6. Final Data Exploration

Let's explore our complete dataset including the scraped Rotten Tomatoes data.

In [None]:
# Explore all datasets (raw and cleaned)
# This provides comprehensive statistics and summaries
!task explore-data

## 7. Summary and Next Steps

### What We've Accomplished

- ✅ **Environment Setup**: Verified Python environment and installed dependencies
- ✅ **Data Download**: Downloaded Rotten Tomatoes datasets from Kaggle
- ✅ **Data Cleaning**: Cleaned data by removing nulls and standardizing formats
- ✅ **Web Scraping**: Scraped additional movie details from Rotten Tomatoes
- ✅ **Error Handling**: Implemented retry logic for failed scrapes
- ✅ **Data Exploration**: Validated and explored the final datasets

### Available Data Files

After running this pipeline, you'll have:

- `data/raw/rotten_tomatoes_movies.csv` - Raw movie data from Kaggle
- `data/raw/rotten_tomatoes_critic_reviews.csv` - Raw critic reviews from Kaggle
- `data/clean/rotten_tomatoes_movies_clean.csv` - Cleaned movie data
- `data/clean/rotten_tomatoes_critic_reviews_clean.csv` - Cleaned critic reviews
- `data/raw/rotten_tomatoes_movie_details.csv` - Scraped movie details (titles, descriptions, years)
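
To confirm that a run produced every file listed above, a small check like the following can help. It is a sketch that assumes the notebook runs from the project root; `missing_outputs` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# File paths taken from the list above
EXPECTED_FILES = [
    "data/raw/rotten_tomatoes_movies.csv",
    "data/raw/rotten_tomatoes_critic_reviews.csv",
    "data/clean/rotten_tomatoes_movies_clean.csv",
    "data/clean/rotten_tomatoes_critic_reviews_clean.csv",
    "data/raw/rotten_tomatoes_movie_details.csv",
]


def missing_outputs(project_root: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected files that do not exist under project_root."""
    return [rel for rel in expected if not (project_root / rel).is_file()]


# Assumption: the current directory is the project root
missing = missing_outputs(Path("."))
if missing:
    print("Missing pipeline outputs:")
    for rel in missing:
        print(f"  {rel}")
else:
    print("All expected pipeline outputs are present")
```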

### Quick Pipeline Command

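For future runs, you can run the setup, download, and cleaning steps in one command:
```bash
task setup-data
```

This runs sync → download-data → clean-data in sequence. The scraping, retry, and exploration tasks are not part of this command and are run separately.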

### Next Steps

With your data ready, you can now:
- Perform sentiment analysis on critic reviews
- Analyze the relationship between critic scores and audience ratings
- Build recommendation models based on movie features
- Visualize trends in movie ratings over time

Happy analyzing! 🎬📊