# Movie Summary Generation with Gemini API

## Overview
This notebook generates AI-powered movie summaries for the MovieLens dataset using Google's Gemini 2.5 Flash via the new **google-genai** SDK with **parallel batch processing**.

‚ö†Ô∏è **IMPORTANT:** If you're updating from an older version of this notebook, please **restart the runtime** before running:
- Click: **Runtime ‚Üí Restart runtime**
- Then run all cells from the beginning

**Features:**
- Test mode (25 movies) and Full mode (~9,700 movies)
- **Parallel processing** with 50 concurrent workers
- **Batch processing** for optimal throughput
- Progress saving every 100 movies
- Automatic retry logic for API failures
- Clean CSV output ready for database import
- Comprehensive error logging with inline error details

**Performance:**
- ~50x faster with parallel processing
- Full dataset: **15-30 minutes** (vs 5-8 hours sequential)
- ~300-600 movies per minute throughput
- No grounding needed (movies are in training data)

**Requirements:**
- Colab Enterprise environment
- Vertex AI API enabled in your GCP project
- Proper IAM permissions (Vertex AI User role)

**SDK:** Uses the new unified `google-genai` SDK (released late 2024)

**Output Format:** `summaries.csv` with columns: `movieId,summary`

## Setup and Configuration

In [1]:
# Install required packages
# Using google-genai - the new unified SDK for Gemini API
!pip install -q google-genai pandas tqdm

In [2]:
import pandas as pd
import time
import re
import csv
from pathlib import Path
from tqdm.notebook import tqdm
from datetime import datetime
from typing import Optional, Tuple
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

In [3]:
import threading
_thread_local = threading.local()

def get_client():
    if getattr(_thread_local, "client", None) is None:
        from google import genai
        import google.auth
        credentials, project_id = google.auth.default()
        _thread_local.client = genai.Client(
            vertexai=True,
            project=project_id,
            location='us-central1',
        )
    return _thread_local.client


In [None]:
# Configuration
class Config:
    """Central configuration for the movie summary generator"""

    # Mode: 'test' for 25 movies, 'full' for all movies
    MODE = 'full'  # Change to 'full' for production run
    DEBUG = True   # Set to True to see detailed error messages

    # Data URLs and paths
    INPUT_URL = 'https://raw.githubusercontent.com/haggman/cymbalflix/main/starter/data/ml-latest-small/movies.csv'
    OUTPUT_FILE = 'summaries.csv'
    PROGRESS_FILE = 'summary_progress.csv'
    ERROR_LOG = 'summary_errors.csv'

    # API Configuration
    MODEL_NAME = 'gemini-2.5-flash'  # Latest fast model

    # Processing settings
    TEST_MODE_LIMIT = 25
    SAVE_INTERVAL = 100  # Save progress every N movies
    MAX_RETRIES = 3
    RETRY_DELAY = 2  # seconds

    # Batch processing for speed
    BATCH_SIZE = 100  # Process this many movies concurrently
    MAX_WORKERS = 50  # Number of parallel threads

    # Summary requirements
    MIN_SUMMARY_LENGTH = 50  # characters
    MAX_SUMMARY_LENGTH = 3000  # characters

# Display current configuration
print(f"Configuration:")
print(f"  Mode: {Config.MODE.upper()}")
print(f"  Model: {Config.MODEL_NAME}")
print(f"  Movies to process: {Config.TEST_MODE_LIMIT if Config.MODE == 'test' else 'ALL (~9,700)'}")
print(f"  Max workers: {Config.MAX_WORKERS}")
print(f"  Save interval: {Config.SAVE_INTERVAL} movies")

Configuration:
  Mode: FULL
  Model: gemini-2.5-flash
  Movies to process: ALL (~9,700)
  Max workers: 50
  Save interval: 100 movies


## Google GenAI SDK Setup

**Note:** Using the new unified `google-genai` SDK. In Colab Enterprise, authentication is automatic via your project's default credentials.

**Performance:** Grounding is disabled for speed since these movies (pre-2019) are already in Gemini's training data.

In [None]:
# Initialize Google GenAI client for Vertex AI
from google import genai
from google.genai import types

# Get project from environment
try:
    import google.auth

    credentials, project_id = google.auth.default()

    print(f"‚úÖ Authentication successful")
    print(f"   Project: {project_id}")
    print(f"   Location: us-central1")
    print(f"   Using Vertex AI: True")

except Exception as e:
    print(f"‚ùå Authentication error: {e}")
    print("Please ensure you're running in Colab Enterprise with proper permissions")
    raise

print(f"\n‚úÖ Ready to use model: {Config.MODEL_NAME}")

‚úÖ Authentication successful
   Project: qwiklabs-gcp-01-5c40e6907a04
   Location: us-central1
   Using Vertex AI: True

‚úÖ Ready to use model: gemini-2.5-flash


## Summary Generation Functions

In [None]:
def create_summary_prompt(title: str, year: Optional[int]) -> str:
    # Convert pandas NA to None for boolean check
    if pd.isna(year):
        year = None

    year_str = f" ({year})" if year else ""

    """Create a prompt for Gemini to generate a movie summary.

    Args:
        title: Clean movie title without year
        year: Release year (if known)

    Returns:
        Formatted prompt string
    """

    prompt = f"""Write a 1-2 paragraph summary of the movie "{title}"{year_str}.

Include:
- Brief plot overview (spoiler-free)
- Notable cast and director
- Critical reception and cultural impact

Write 150-250 words in a neutral, encyclopedic tone."""

    return prompt


def generate_summary(title: str, year: Optional[int], retries: int = Config.MAX_RETRIES) -> Tuple[Optional[str], Optional[str]]:
    """Generate a movie summary using Gemini API with retry logic.

    Args:
        title: Clean movie title
        year: Release year
        retries: Number of retry attempts

    Returns:
        Tuple of (summary, error_message). If successful, error_message is None.
    """
    prompt = create_summary_prompt(title, year)

    # Configure generation parameters (NO grounding for speed)
    generate_config = types.GenerateContentConfig(
        temperature=1.3,
        tools=[types.Tool(google_search=types.GoogleSearch())],
    )

    for attempt in range(retries):
        try:
            response = get_client().models.generate_content(
                model=Config.MODEL_NAME,
                contents=prompt,
                config=generate_config,
            )

            if Config.DEBUG and Config.MODE == 'test' and attempt == 0:
                print(f"\n  DEBUG - Movie: {title} ({year})")
                print(f"  Response type: {type(response)}")
                print(f"  Has text attr: {hasattr(response, 'text')}")
                if hasattr(response, 'candidates'):
                    print(f"  Candidates: {len(response.candidates) if response.candidates else 0}")

            # Extract text from response
            summary = None

            if hasattr(response, 'text') and response.text:
                summary = response.text
                if Config.DEBUG and Config.MODE == 'test' and attempt == 0:
                    print(f"  Got text via response.text: {len(summary)} chars")
            elif hasattr(response, 'candidates') and response.candidates:
                candidate = response.candidates[0]
                if hasattr(candidate, 'content') and hasattr(candidate.content, 'parts'):
                    parts = candidate.content.parts
                    if parts and hasattr(parts[0], 'text'):
                        summary = parts[0].text
                        if Config.DEBUG and Config.MODE == 'test' and attempt == 0:
                            print(f"  Got text via candidates: {len(summary)} chars")

            # Check if we got valid text
            if not summary:
                if Config.DEBUG and Config.MODE == 'test':
                    print(f"  No summary extracted on attempt {attempt + 1}")
                if attempt < retries - 1:
                    time.sleep(Config.RETRY_DELAY)
                    continue
                return None, "No summary text in API response"

            summary = summary.strip()

            # Validate summary quality
            if not summary or len(summary) < Config.MIN_SUMMARY_LENGTH:
                if attempt < retries - 1:
                    time.sleep(Config.RETRY_DELAY)
                    continue
                return None, f"Summary too short ({len(summary)} chars)"

            if len(summary) > Config.MAX_SUMMARY_LENGTH:
                summary = summary[:Config.MAX_SUMMARY_LENGTH] + "..."

            return summary, None

        except AttributeError as e:
            error_msg = f"AttributeError: {str(e)}"
            if Config.DEBUG and Config.MODE == 'test':
                print(f"  {error_msg} on attempt {attempt + 1}")
            if attempt < retries - 1:
                time.sleep(Config.RETRY_DELAY * (attempt + 1))
                continue
            return None, error_msg
        except Exception as e:
            error_msg = f"{type(e).__name__}: {str(e)}"
            if Config.DEBUG and Config.MODE == 'test':
                print(f"  {error_msg} on attempt {attempt + 1}")
            if attempt < retries - 1:
                time.sleep(Config.RETRY_DELAY * (attempt + 1))
                continue
            return None, error_msg

    return None, "Max retries exceeded"


def clean_summary_for_csv(summary: str) -> str:
    """Clean summary text for CSV output following MovieLens format.

    Args:
        summary: Raw summary text

    Returns:
        Cleaned summary suitable for CSV (double-quote escaped, UTF-8)
    """
    if not summary:
        return ""

    # Remove excessive whitespace
    summary = re.sub(r'\s+', ' ', summary)

    # CSV standard: escape double quotes by doubling them
    summary = summary.replace('"', '""')

    return summary.strip()

print("‚úÖ Summary generation functions defined")

‚úÖ Summary generation functions defined


## Test API Connection

Quick test to verify the API is working correctly:

In [None]:
# Test with a simple movie to verify API response format
try:
    test_config = types.GenerateContentConfig(
        temperature=0.7,
        max_output_tokens=200,
    )

    test_response = get_client().models.generate_content(
        model=Config.MODEL_NAME,
        contents='Write one sentence about the movie Toy Story.',
        config=test_config,
    )

    print("‚úÖ API Test Successful!")
    print(f"Response type: {type(test_response)}")
    print(f"Has text attribute: {hasattr(test_response, 'text')}")

    if hasattr(test_response, 'text'):
        print(f"Text is None: {test_response.text is None}")
        if test_response.text:
            print(f"Sample text: {test_response.text[:100]}")
    else:
        print(f"Response object: {test_response}")
        print(f"Response dir: {[attr for attr in dir(test_response) if not attr.startswith('_')]}")

except Exception as e:
    print(f"‚ùå API Test Failed: {e}")
    import traceback
    traceback.print_exc()

# Test generating a movie summary
print("\n" + "="*60)
print("Testing summary generation...")
print("="*60)

test_summary, test_error = generate_summary("Toy Story", 1995)
if test_summary:
    print(f"\n‚úÖ Summary generation working!")
    print(f"Length: {len(test_summary)} characters")
    print(f"Preview: {test_summary[:200]}...")
else:
    print(f"\n‚ùå Summary generation failed: {test_error}")

‚úÖ API Test Successful!
Response type: <class 'google.genai.types.GenerateContentResponse'>
Has text attribute: True
Text is None: False
Sample text: When a new, high-tech action figure named Buzz Lightyear threatens his status, a pull-string cowboy 

Testing summary generation...

‚úÖ Summary generation working!
Length: 1589 characters
Preview: "Toy Story" (1995), directed by John Lasseter, is a pioneering animated film that introduced audiences to a world where toys come to life when humans are absent. The plot centers on Woody (voiced by T...


## Load and Prepare Movie Data

In [None]:
def extract_year_from_title(title: str) -> Tuple[str, Optional[int]]:
    """Extract year from movie title.

    Args:
        title: Movie title potentially containing year in format "Title (YYYY)"

    Returns:
        Tuple of (clean_title, year)
    """
    # Match year in parentheses at end of title
    match = re.search(r'\((\d{4})\)\s*$', title)

    if match:
        year = int(match.group(1))
        clean_title = title[:match.start()].strip()
        return clean_title, year

    return title.strip(), None


# Load movie data
print(f"Loading movie data from: {Config.INPUT_URL}")
df = pd.read_csv(Config.INPUT_URL)

# Apply mode limit
if Config.MODE == 'test':
    df = df.head(Config.TEST_MODE_LIMIT)
    print(f"\n‚ö†Ô∏è  TEST MODE: Processing first {Config.TEST_MODE_LIMIT} movies only")
else:
    print(f"\n‚úÖ FULL MODE: Processing all {len(df)} movies")

# Extract years from titles
df[['clean_title', 'year']] = df['title'].apply(
    lambda x: pd.Series(extract_year_from_title(x))
)

# Convert year to nullable integer type (keeps integers, allows None)
df['year'] = df['year'].astype('Int64')  # Capital I - nullable integer dtype


# Validate years (should be reasonable movie years)
valid_years = df['year'].notna() & (df['year'] >= 1888) & (df['year'] <= 2030)
invalid_year_count = (~valid_years & df['year'].notna()).sum()
if invalid_year_count > 0:
    print(f"\n‚ö†Ô∏è  Warning: Found {invalid_year_count} movies with invalid years, setting to None")
    df.loc[~valid_years, 'year'] = None

print(f"\nDataset info:")
print(f"  Total movies: {len(df)}")
print(f"  Movies with years: {df['year'].notna().sum()}")
if df['year'].notna().sum() > 0:
    print(f"  Year range: {df['year'].min():.0f} - {df['year'].max():.0f}")

# Display sample
print(f"\nSample movies:")
print(df[['movieId', 'title', 'clean_title', 'year', 'genres']].head())

Loading movie data from: https://raw.githubusercontent.com/haggman/cymbalflix/main/starter/data/ml-latest-small/movies.csv

‚úÖ FULL MODE: Processing all 9742 movies

Dataset info:
  Total movies: 9742
  Movies with years: 9729
  Year range: 1902 - 2018

Sample movies:
   movieId                               title                  clean_title  \
0        1                    Toy Story (1995)                    Toy Story   
1        2                      Jumanji (1995)                      Jumanji   
2        3             Grumpier Old Men (1995)             Grumpier Old Men   
3        4            Waiting to Exhale (1995)            Waiting to Exhale   
4        5  Father of the Bride Part II (1995)  Father of the Bride Part II   

   year                                       genres  
0  1995  Adventure|Animation|Children|Comedy|Fantasy  
1  1995                   Adventure|Children|Fantasy  
2  1995                               Comedy|Romance  
3  1995                         Com

## Progress Tracking Functions

In [None]:
def save_progress(df_processed: pd.DataFrame, filename: str = Config.PROGRESS_FILE):
    """Save processed summaries to CSV.

    Args:
        df_processed: DataFrame with movieId and summary columns
        filename: Output filename
    """
    # Select only the required columns: movieId and summary
    output_cols = ['movieId', 'summary']
    df_output = df_processed[output_cols].copy()

    # Save to CSV with UTF-8 encoding and proper quoting for MovieLens format
    df_output.to_csv(filename, index=False, encoding='utf-8', quoting=csv.QUOTE_MINIMAL)
    print(f"  üíæ Progress saved: {filename} ({len(df_output)} movies)")


def log_error(movie_id: int, title: str, error_msg: str, filename: str = Config.ERROR_LOG):
    """Log errors to CSV file.

    Args:
        movie_id: Movie ID
        title: Movie title
        error_msg: Error message
        filename: Error log filename
    """
    error_data = {
        'timestamp': datetime.now().isoformat(),
        'movieId': movie_id,
        'title': title,
        'error': error_msg
    }

    file_exists = Path(filename).exists()

    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['timestamp', 'movieId', 'title', 'error'])
        if not file_exists:
            writer.writeheader()
        writer.writerow(error_data)

print("‚úÖ Progress tracking functions defined")

‚úÖ Progress tracking functions defined


## Parallel Batch Processing

**Speed optimization:** Processes movies in batches with multiple concurrent workers for ~50x faster throughput.

In [None]:
def process_single_movie(row: pd.Series) -> dict:
    """Process a single movie (called by worker threads).

    Args:
        row: DataFrame row with movie data

    Returns:
        Dictionary with movie result
    """
    movie_id = row['movieId']
    title = row['title']
    clean_title = row['clean_title']
    year = row['year']

    # Print individual movie progress only in test mode
    if Config.MODE == 'test':
        print(f"  Processing: {title}")

    # Generate summary
    summary, error = generate_summary(clean_title, year)

    if summary:
        summary_clean = clean_summary_for_csv(summary)
        return {
            'movieId': movie_id,
            'summary': summary_clean,
            'error': None
        }
    else:
        # Log error to separate file
        log_error(movie_id, title, error)

        # Include error details inline in the summary field
        error_summary = f"[Summary generation failed: {error}]"
        return {
            'movieId': movie_id,
            'summary': error_summary,
            'error': error
        }


def process_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Process all movies with parallel batch processing for speed.

    Automatically resumes from progress file if it exists, skipping already-processed movies.

    Args:
        df: Input DataFrame with movies

    Returns:
        DataFrame with summaries added
    """
    results = []
    errors = []

    # Check for existing progress and load it
    progress_path = Path(Config.PROGRESS_FILE)
    processed_ids = set()

    if progress_path.exists():
        print(f"\nüìÇ Found existing progress file: {Config.PROGRESS_FILE}")
        df_progress = pd.read_csv(Config.PROGRESS_FILE)

        # Only skip movies that have successful summaries OR non-error summaries
        # Re-process any that have error messages
        successful_progress = df_progress[~df_progress['summary'].str.contains(r'\[Summary generation failed', na=False)]
        processed_ids = set(successful_progress['movieId'])

        # Keep successful summaries in results
        for _, row in successful_progress.iterrows():
            results.append({
                'movieId': row['movieId'],
                'summary': row['summary'],
                'error': None
            })

        # Check for failed summaries to retry
        failed_ids = set(df_progress[df_progress['summary'].str.contains(r'\[Summary generation failed', na=False)]['movieId'])

        print(f"  ‚úÖ Loaded {len(successful_progress)} successful summaries")
        print(f"  üîÑ Will retry {len(failed_ids)} failed summaries")
        print(f"  üìù Remaining to process: {len(df) - len(processed_ids)}")

    # Filter to only unprocessed movies
    df_to_process = df[~df['movieId'].isin(processed_ids)].copy()

    if len(df_to_process) == 0:
        print(f"\n‚úÖ All movies already processed! Loading final results...")
        df_final = pd.read_csv(Config.PROGRESS_FILE)
        return df_final

    print(f"\n{'='*60}")
    print(f"Starting PARALLEL summary generation")
    print(f"  Total movies in dataset: {len(df)}")
    print(f"  Already processed: {len(processed_ids)}")
    print(f"  Movies to process: {len(df_to_process)}")
    print(f"  Batch size: {Config.BATCH_SIZE}")
    print(f"  Workers: {Config.MAX_WORKERS}")
    print(f"{'='*60}\n")

    start_time = time.time()

    # Process in batches with parallel workers
    total_batches = (len(df_to_process) + Config.BATCH_SIZE - 1) // Config.BATCH_SIZE

    with tqdm(total=len(df_to_process), desc="Generating summaries", disable=(Config.MODE == 'test')) as pbar:
        for batch_num in range(total_batches):
            start_idx = batch_num * Config.BATCH_SIZE
            end_idx = min(start_idx + Config.BATCH_SIZE, len(df_to_process))
            batch_df = df_to_process.iloc[start_idx:end_idx]

            # Process batch in parallel
            with ThreadPoolExecutor(max_workers=Config.MAX_WORKERS) as executor:
                # Submit all movies in batch
                futures = {
                    executor.submit(process_single_movie, row): idx
                    for idx, row in batch_df.iterrows()
                }

                # Collect results as they complete
                for future in as_completed(futures):
                    result = future.result()
                    results.append(result)

                    if result['error']:
                        errors.append(result['movieId'])

                    if Config.MODE != 'test':
                        pbar.update(1)

            # Save progress after each batch
            if len(results) % Config.SAVE_INTERVAL == 0 or end_idx == len(df_to_process):
                df_progress = pd.DataFrame(results)
                # Remove error column before saving
                df_progress = df_progress.drop(columns=['error'])
                save_progress(df_progress, Config.PROGRESS_FILE)

    # Final save
    df_results = pd.DataFrame(results)
    df_results = df_results.drop(columns=['error'])
    save_progress(df_results, Config.OUTPUT_FILE)

    # Summary statistics
    elapsed = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"Processing Complete!")
    print(f"{'='*60}")
    print(f"  Total movies in dataset: {len(df)}")
    print(f"  Newly processed: {len(df_to_process)}")
    print(f"  Total with summaries: {len(df_results)}")
    print(f"  Successful: {len(df_results) - len(errors)}")
    print(f"  Errors: {len(errors)}")
    if len(df_to_process) > 0:
        print(f"  Time elapsed: {elapsed/60:.1f} minutes ({elapsed/3600:.1f} hours)")
        print(f"  Average time per movie: {elapsed/len(df_to_process):.2f} seconds")
        print(f"  Effective throughput: {len(df_to_process)/elapsed*60:.1f} movies/minute")
    print(f"\n  Output file: {Config.OUTPUT_FILE}")

    if errors:
        print(f"  Error log: {Config.ERROR_LOG}")
        print(f"  Error rate: {len(errors)/len(df_results)*100:.1f}%")

    return df_results

print("‚úÖ Parallel processing functions defined")

‚úÖ Parallel processing functions defined


## Run Summary Generation

In [None]:
# Run the processing
df_with_summaries = process_movies(df)


üìÇ Found existing progress file: summary_progress.csv
  ‚úÖ Loaded 9742 successful summaries
  üîÑ Will retry 0 failed summaries
  üìù Remaining to process: 0

‚úÖ All movies already processed! Loading final results...


In [None]:
import pandas as pd

# Load the summary_progress.csv file
try:
    df_progress = pd.read_csv('/content/summary_progress.csv')

    # Filter for failed summaries
    # Correcting the regex to properly escape the '[' character to avoid SyntaxWarning
    failed_summaries = df_progress[df_progress['summary'].str.contains(r'\[Summary generation failed', na=False)]

    if not failed_summaries.empty:
        print(f"Found {len(failed_summaries)} failed summary messages:")
        display(failed_summaries)
    else:
        print("No failed summary messages found in 'summary_progress.csv'.")

except FileNotFoundError:
    print("Error: 'summary_progress.csv' not found. Please ensure the file exists.")
except Exception as e:
    print(f"An error occurred: {e}")

No failed summary messages found in 'summary_progress.csv'.


  failed_summaries = df_progress[df_progress['summary'].str.contains('\[Summary generation failed', na=False)]


## Verify Results

In [None]:
# Display sample results
print("\nüìä Sample Results:\n")
print("="*80)

for idx in range(min(3, len(df_with_summaries))):
    row = df_with_summaries.iloc[idx]
    print(f"\nMovie #{idx+1}:")
    print(f"  ID: {row['movieId']}")
    print(f"  Summary: {row['summary'][:200]}...")
    print(f"  Summary length: {len(row['summary'])} characters")
    print("-"*80)


üìä Sample Results:


Movie #1:
  ID: 24
  Summary: The 1995 science fiction drama ""Powder,"" directed and written by Victor Salva, tells the story of Jeremy ""Powder"" Reed, an albino teenager with extraordinary intellect and unique telepathic and ps...
  Summary length: 1587 characters
--------------------------------------------------------------------------------

Movie #2:
  ID: 6
  Summary: Michael Mann's 1995 crime drama, ""Heat,"" meticulously chronicles the high-stakes cat-and-mouse game between a seasoned crew of professional thieves and a determined unit of LAPD robbery-homicide det...
  Summary length: 1370 characters
--------------------------------------------------------------------------------

Movie #3:
  ID: 1
  Summary: ""Toy Story,"" released in 1995, is an American animated adventure comedy film produced by Pixar Animation Studios and distributed by Walt Disney Pictures. It tells the story of a group of toys who co...
  Summary length: 1713 characters
----------

In [None]:
# Quality checks
print("\nüîç Quality Checks:\n")

# Check for missing summaries
missing = df_with_summaries['summary'].isna().sum()
failed = df_with_summaries['summary'].str.contains(r'\[Summary generation failed', na=False).sum()
print(f"  Missing summaries: {missing}")
print(f"  Failed summaries: {failed}")

# Summary length distribution
df_with_summaries['summary_length'] = df_with_summaries['summary'].str.len()
print(f"\n  Summary length statistics:")
print(f"    Min: {df_with_summaries['summary_length'].min()} characters")
print(f"    Max: {df_with_summaries['summary_length'].max()} characters")
print(f"    Mean: {df_with_summaries['summary_length'].mean():.0f} characters")
print(f"    Median: {df_with_summaries['summary_length'].median():.0f} characters")

# Check for summaries within target range (excluding failures)
successful_summaries = df_with_summaries[~df_with_summaries['summary'].str.contains(r'\[Summary generation failed', na=False)]
in_range = successful_summaries[
    (successful_summaries['summary_length'] >= 900) &
    (successful_summaries['summary_length'] <= 1800)
]
if len(successful_summaries) > 0:
    print(f"\n  Summaries in target range (150-250 chars): {len(in_range)} ({len(in_range)/len(successful_summaries)*100:.1f}%)")

# Check for completeness - every movie should have a summary
print(f"\n  Completeness check:")
original_movie_ids = set(df['movieId'])
summary_movie_ids = set(df_with_summaries['movieId'])

missing_summaries = original_movie_ids - summary_movie_ids
extra_summaries = summary_movie_ids - original_movie_ids

print(f"    Original movies: {len(original_movie_ids)}")
print(f"    Movies with summaries: {len(summary_movie_ids)}")
print(f"    Missing summaries: {len(missing_summaries)}")
print(f"    Extra summaries: {len(extra_summaries)}")

if len(missing_summaries) == 0 and len(extra_summaries) == 0:
    print(f"    ‚úÖ Perfect match! All movies have summaries.")
elif len(missing_summaries) > 0:
    print(f"    ‚ö†Ô∏è  Missing summaries for movie IDs: {sorted(list(missing_summaries))[:10]}{'...' if len(missing_summaries) > 10 else ''}")
if len(extra_summaries) > 0:
    print(f"    ‚ö†Ô∏è  Extra summaries for movie IDs: {sorted(list(extra_summaries))[:10]}{'...' if len(extra_summaries) > 10 else ''}")


üîç Quality Checks:

  Missing summaries: 0
  Failed summaries: 0

  Summary length statistics:
    Min: 673 characters
    Max: 3014 characters
    Mean: 1484 characters
    Median: 1477 characters

  Summaries in target range (150-250 chars): 9332 (95.8%)

  Completeness check:
    Original movies: 9742
    Movies with summaries: 9742
    Missing summaries: 0
    Extra summaries: 0
    ‚úÖ Perfect match! All movies have summaries.


## Download Results

Your results are saved to `summaries.csv`. You can download it from the files panel.

In [None]:
# Display final file info
from pathlib import Path

output_path = Path(Config.OUTPUT_FILE)
if output_path.exists():
    file_size = output_path.stat().st_size / 1024  # KB
    print(f"\n‚úÖ Output file ready for download:")
    print(f"  Filename: {Config.OUTPUT_FILE}")
    print(f"  Size: {file_size:.1f} KB")
    print(f"  Records: {len(df_with_summaries)}")
    print(f"  Format: movieId,summary")
    print(f"\n  Download from the files panel on the left ‚Üí")
else:
    print(f"‚ùå Output file not found: {Config.OUTPUT_FILE}")


‚úÖ Output file ready for download:
  Filename: summaries.csv
  Size: 14309.9 KB
  Records: 9742
  Format: movieId,summary

  Download from the files panel on the left ‚Üí


## Resume Processing

**Good news!** The notebook now automatically resumes from where it left off.

- If processing is interrupted, simply re-run the "Run Summary Generation" cell
- Already-processed successful summaries will be loaded automatically
- Only failed summaries and unprocessed movies will be regenerated
- Progress is saved every 100 movies to minimize data loss

The resume logic:
- ‚úÖ Skips movies with successful summaries
- üîÑ Retries movies that previously failed
- üìù Processes any remaining movies

To start completely fresh, delete these files:
- `summary_progress.csv`
- `summaries.csv`
- `summary_errors.csv`

## Troubleshooting

### Common Issues

**403 Authentication Error:**
- Ensure Vertex AI API is enabled in your GCP project
- Verify you have the 'Vertex AI User' IAM role
- Check that you're running in Colab Enterprise (not regular Colab)

**Rate Limiting:**
- Reduce `Config.MAX_WORKERS` (try 25 or 10)
- Increase `Config.RETRY_DELAY` to slow down retries

**API Quota Exceeded:**
- Check your Vertex AI quotas in GCP Console
- Reduce `Config.MAX_WORKERS` to lower concurrent requests
- Run in test mode first to verify everything works

**To Enable Vertex AI:**
```bash
gcloud services enable aiplatform.googleapis.com
```

## Generate Movie Embeddings for Vector Search

Now that we have summaries, let's generate embeddings (numerical vector representations) for each movie. These embeddings will power semantic similarity search - finding movies that are conceptually similar even if they don't share keywords.

**Features:**
- Uses Gemini Embedding model (gemini-embedding-001)
- Output dimension: 1536
- Parallel processing with 50 workers
- Combines title, genres, and summary for rich context
- Output: embeddings.csv (movieId, embedding)

In [4]:
# Embedding generation configuration
class EmbeddingConfig:
    """Configuration for embedding generation"""

    # Model settings
    EMBEDDING_MODEL = 'gemini-embedding-001'
    OUTPUT_DIMENSION = 1536

    # Input/Output files
    SUMMARIES_FILE = 'summaries.csv'
    OUTPUT_FILE = 'embeddings.csv'
    PROGRESS_FILE = 'embedding_progress.csv'
    ERROR_LOG = 'embedding_errors.csv'

    # Processing settings
    BATCH_SIZE = 100
    MAX_WORKERS = 50
    SAVE_INTERVAL = 100
    MAX_RETRIES = 3
    RETRY_DELAY = 2

print("Embedding generation configuration loaded")
print(f"  Model: {EmbeddingConfig.EMBEDDING_MODEL}")
print(f"  Dimensions: {EmbeddingConfig.OUTPUT_DIMENSION}")
print(f"  Max workers: {EmbeddingConfig.MAX_WORKERS}")

Embedding generation configuration loaded
  Model: gemini-embedding-001
  Dimensions: 1536
  Max workers: 50


In [11]:
from google.genai import types

print("‚úÖ Google GenAI types imported")

‚úÖ Google GenAI types imported


In [5]:
import json
from typing import Optional, List

def create_embedding_text(title: str, genres: List[str], summary: str) -> str:
    """Create rich text from movie data for embedding.

    Args:
        title: Movie title
        genres: List of genres
        summary: Movie summary

    Returns:
        Combined text optimized for embedding
    """
    genre_text = ', '.join(genres) if genres else ''

    parts = [
        f"Title: {title}",
        f"Genres: {genre_text}" if genre_text else "",
        f"Summary: {summary}" if summary else ""
    ]

    return '. '.join(filter(None, parts))


def generate_embedding(movie_id: int, title: str, genres: List[str], summary: str,
                       retries: int = EmbeddingConfig.MAX_RETRIES) -> tuple[Optional[List[float]], Optional[str]]:
    """Generate embedding for a single movie.

    Args:
        movie_id: Movie ID
        title: Movie title
        genres: List of genres
        summary: Movie summary
        retries: Number of retry attempts

    Returns:
        Tuple of (embedding_vector, error_message)
    """
    text = create_embedding_text(title, genres, summary)

    for attempt in range(retries):
        try:
            response = get_client().models.embed_content(
                model=EmbeddingConfig.EMBEDDING_MODEL,
                contents=text,
                config=types.EmbedContentConfig(
                    output_dimensionality=EmbeddingConfig.OUTPUT_DIMENSION
                )
            )

            # Extract embedding values
            if hasattr(response, 'embeddings') and response.embeddings:
                embedding = response.embeddings[0]
                if hasattr(embedding, 'values') and embedding.values:
                    return list(embedding.values), None

            if attempt < retries - 1:
                time.sleep(EmbeddingConfig.RETRY_DELAY)
                continue

            return None, "No embedding values in API response"

        except Exception as e:
            error_msg = f"{type(e).__name__}: {str(e)}"
            if attempt < retries - 1:
                time.sleep(EmbeddingConfig.RETRY_DELAY * (attempt + 1))
                continue
            return None, error_msg

    return None, "Max retries exceeded"


def save_embeddings_progress(df_processed: pd.DataFrame, filename: str = EmbeddingConfig.PROGRESS_FILE):
    """Save processed embeddings to CSV.

    Args:
        df_processed: DataFrame with movieId and embedding columns
        filename: Output filename
    """
    # Convert embedding arrays to JSON strings for CSV storage
    df_output = df_processed.copy()
    df_output['embedding'] = df_output['embedding'].apply(json.dumps)

    # Save with only required columns
    df_output[['movieId', 'embedding']].to_csv(filename, index=False, encoding='utf-8')
    print(f"  üíæ Progress saved: {filename} ({len(df_output)} movies)")


def log_embedding_error(movie_id: int, title: str, error_msg: str,
                        filename: str = EmbeddingConfig.ERROR_LOG):
    """Log embedding generation errors.

    Args:
        movie_id: Movie ID
        title: Movie title
        error_msg: Error message
        filename: Error log filename
    """
    error_data = {
        'timestamp': datetime.now().isoformat(),
        'movieId': movie_id,
        'title': title,
        'error': error_msg
    }

    file_exists = Path(filename).exists()

    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['timestamp', 'movieId', 'title', 'error'])
        if not file_exists:
            writer.writeheader()
        writer.writerow(error_data)

print("‚úÖ Embedding generation functions defined")

‚úÖ Embedding generation functions defined


In [8]:
# Load the movies data (with summaries)
print("Loading movies data with summaries...")

# Load original movies data from URL to get title and genres
MOVIES_URL = 'https://raw.githubusercontent.com/haggman/cymbalflix/main/starter/data/ml-latest-small/movies.csv'
df_movies = pd.read_csv(MOVIES_URL)
print(f"‚úì Loaded {len(df_movies)} movies from dataset")

# Parse genres into lists
df_movies['genres'] = df_movies['genres'].apply(
    lambda x: x.split('|') if x and x != '(no genres listed)' else []
)

# Read the summaries.csv that's been uploaded to this notebook environment
df_summaries = pd.read_csv('summaries.csv')
print(f"‚úì Loaded {len(df_summaries)} movie summaries from uploaded file")

# Merge: summaries + (title and genres from movies dataset)
df_movies_with_summaries = df_summaries.merge(
    df_movies[['movieId', 'title', 'genres']],
    on='movieId',
    how='inner'
)
print(f"‚úì Merged data: {len(df_movies_with_summaries)} movies with complete information")

# Display sample
print("\nSample movie data for embedding:")
sample = df_movies_with_summaries[['movieId', 'title', 'genres', 'summary']].head(3)
for _, row in sample.iterrows():
    print(f"\n  Movie: {row['title']}")
    print(f"  Genres: {row['genres']}")
    print(f"  Summary preview: {row['summary'][:100]}...")

Loading movies data with summaries...
‚úì Loaded 9742 movies from dataset
‚úì Loaded 9742 movie summaries from uploaded file
‚úì Merged data: 9742 movies with complete information

Sample movie data for embedding:

  Movie: Powder (1995)
  Genres: ['Drama', 'Sci-Fi']
  Summary preview: The 1995 science fiction drama "Powder," directed and written by Victor Salva, tells the story of Je...

  Movie: Heat (1995)
  Genres: ['Action', 'Crime', 'Thriller']
  Summary preview: Michael Mann's 1995 crime drama, "Heat," meticulously chronicles the high-stakes cat-and-mouse game ...

  Movie: Toy Story (1995)
  Genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
  Summary preview: "Toy Story," released in 1995, is an American animated adventure comedy film produced by Pixar Anima...


In [13]:
# Test embedding generation with a single movie
print("Testing embedding generation with a single movie...\n")

# Get one movie with a summary
test_movie = df_movies_with_summaries.iloc[0]
print(f"Test Movie: {test_movie['title']}")
print(f"Genres: {test_movie['genres']}")
print(f"Summary preview: {test_movie['summary'][:150]}...\n")

# Create embedding text
test_text = create_embedding_text(
    test_movie['title'],
    test_movie['genres'],
    test_movie['summary']
)
print(f"Embedding text (first 200 chars):\n{test_text[:200]}...\n")

# Try to generate embedding with explicit dimensions
try:
    print(f"Calling Gemini API (requesting {EmbeddingConfig.OUTPUT_DIMENSION} dimensions)...")
    response = get_client().models.embed_content(
        model=EmbeddingConfig.EMBEDDING_MODEL,
        contents=test_text,
        config=types.EmbedContentConfig(
            output_dimensionality=EmbeddingConfig.OUTPUT_DIMENSION
        )
    )

    print("‚úÖ API call successful!")
    print(f"Response type: {type(response)}")
    print(f"Has embeddings attr: {hasattr(response, 'embeddings')}")

    if hasattr(response, 'embeddings') and response.embeddings:
        embedding = response.embeddings[0]
        print(f"Embedding type: {type(embedding)}")
        print(f"Has values attr: {hasattr(embedding, 'values')}")

        if hasattr(embedding, 'values') and embedding.values:
            values = list(embedding.values)
            print(f"\n‚úÖ Successfully generated embedding!")
            print(f"   Dimensions: {len(values)}")
            print(f"   Expected: {EmbeddingConfig.OUTPUT_DIMENSION}")
            print(f"   First 5 values: {values[:5]}")
            print(f"   Last 5 values: {values[-5:]}")

            if len(values) == EmbeddingConfig.OUTPUT_DIMENSION:
                print(f"\n‚úÖ Dimension check PASSED!")
            else:
                print(f"\n‚ö†Ô∏è  Dimension mismatch! Got {len(values)}, expected {EmbeddingConfig.OUTPUT_DIMENSION}")
        else:
            print("‚ùå No values in embedding")
    else:
        print("‚ùå No embeddings in response")
        print(f"Response: {response}")

except Exception as e:
    print(f"‚ùå Error: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

Testing embedding generation with a single movie...

Test Movie: Powder (1995)
Genres: ['Drama', 'Sci-Fi']
Summary preview: The 1995 science fiction drama "Powder," directed and written by Victor Salva, tells the story of Jeremy "Powder" Reed, an albino teenager with extrao...

Embedding text (first 200 chars):
Title: Powder (1995). Genres: Drama, Sci-Fi. Summary: The 1995 science fiction drama "Powder," directed and written by Victor Salva, tells the story of Jeremy "Powder" Reed, an albino teenager with ex...

Calling Gemini API (requesting 1536 dimensions)...
‚úÖ API call successful!
Response type: <class 'google.genai.types.EmbedContentResponse'>
Has embeddings attr: True
Embedding type: <class 'google.genai.types.ContentEmbedding'>
Has values attr: True

‚úÖ Successfully generated embedding!
   Dimensions: 1536
   Expected: 1536
   First 5 values: [0.006422614213079214, -0.03108868934214115, 0.007843773812055588, -0.05639827623963356, -0.001457727630622685]
   Last 5 values: [-0.0

In [14]:
def process_single_movie_embedding(row: pd.Series) -> dict:
    """Process a single movie to generate embedding.

    Args:
        row: DataFrame row with movie data

    Returns:
        Dictionary with movie result
    """
    movie_id = row['movieId']
    title = row['title']
    genres = row['genres']
    summary = row['summary']

    # Generate embedding
    embedding, error = generate_embedding(movie_id, title, genres, summary)

    if embedding:
        return {
            'movieId': movie_id,
            'embedding': embedding,
            'error': None
        }
    else:
        # Log error
        log_embedding_error(movie_id, title, error)
        return {
            'movieId': movie_id,
            'embedding': None,
            'error': error
        }


def process_movie_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    """Process all movies with parallel embedding generation.

    Args:
        df: Input DataFrame with movies

    Returns:
        DataFrame with embeddings added
    """
    results = []
    errors = []

    # Check for existing progress
    progress_path = Path(EmbeddingConfig.PROGRESS_FILE)
    processed_ids = set()

    if progress_path.exists():
        print(f"\nüìÇ Found existing progress file: {EmbeddingConfig.PROGRESS_FILE}")
        df_progress = pd.read_csv(EmbeddingConfig.PROGRESS_FILE)

        # Parse JSON embeddings and filter out failed ones
        df_progress['embedding'] = df_progress['embedding'].apply(
            lambda x: json.loads(x) if isinstance(x, str) and x.startswith('[') else None
        )
        successful_progress = df_progress[df_progress['embedding'].notna()]
        processed_ids = set(successful_progress['movieId'])

        # Keep successful embeddings in results
        for _, row in successful_progress.iterrows():
            results.append({
                'movieId': row['movieId'],
                'embedding': row['embedding'],
                'error': None
            })

        print(f"  ‚úÖ Loaded {len(successful_progress)} successful embeddings")
        print(f"  üìù Remaining to process: {len(df) - len(processed_ids)}")

    # Filter to only unprocessed movies
    df_to_process = df[~df['movieId'].isin(processed_ids)].copy()

    if len(df_to_process) == 0:
        print(f"\n‚úÖ All movies already processed!")
        # Load and return final results
        df_final = pd.read_csv(EmbeddingConfig.PROGRESS_FILE)
        df_final['embedding'] = df_final['embedding'].apply(json.loads)
        return df_final

    print(f"\n{'='*60}")
    print(f"Starting PARALLEL embedding generation")
    print(f"  Total movies: {len(df)}")
    print(f"  Already processed: {len(processed_ids)}")
    print(f"  Movies to process: {len(df_to_process)}")
    print(f"  Batch size: {EmbeddingConfig.BATCH_SIZE}")
    print(f"  Workers: {EmbeddingConfig.MAX_WORKERS}")
    print(f"  Model: {EmbeddingConfig.EMBEDDING_MODEL}")
    print(f"  Dimensions: {EmbeddingConfig.OUTPUT_DIMENSION}")
    print(f"{'='*60}\n")

    start_time = time.time()

    # Process in batches with parallel workers
    total_batches = (len(df_to_process) + EmbeddingConfig.BATCH_SIZE - 1) // EmbeddingConfig.BATCH_SIZE

    with tqdm(total=len(df_to_process), desc="Generating embeddings") as pbar:
        for batch_num in range(total_batches):
            start_idx = batch_num * EmbeddingConfig.BATCH_SIZE
            end_idx = min(start_idx + EmbeddingConfig.BATCH_SIZE, len(df_to_process))
            batch_df = df_to_process.iloc[start_idx:end_idx]

            # Process batch in parallel
            with ThreadPoolExecutor(max_workers=EmbeddingConfig.MAX_WORKERS) as executor:
                futures = {
                    executor.submit(process_single_movie_embedding, row): idx
                    for idx, row in batch_df.iterrows()
                }

                for future in as_completed(futures):
                    result = future.result()
                    results.append(result)

                    if result['error']:
                        errors.append(result['movieId'])

                    pbar.update(1)

            # Save progress after each batch
            if len(results) % EmbeddingConfig.SAVE_INTERVAL == 0 or end_idx == len(df_to_process):
                df_progress = pd.DataFrame(results)
                df_progress = df_progress[df_progress['error'].isna()]  # Only save successful
                save_embeddings_progress(df_progress, EmbeddingConfig.PROGRESS_FILE)

    # Final save
    df_results = pd.DataFrame(results)
    df_results = df_results[df_results['error'].isna()]
    save_embeddings_progress(df_results, EmbeddingConfig.OUTPUT_FILE)

    # Summary statistics
    elapsed = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"Embedding Generation Complete!")
    print(f"{'='*60}")
    print(f"  Total movies: {len(df)}")
    print(f"  Newly processed: {len(df_to_process)}")
    print(f"  Total with embeddings: {len(df_results)}")
    print(f"  Successful: {len(df_results) - len(errors)}")
    print(f"  Errors: {len(errors)}")
    if len(df_to_process) > 0:
        print(f"  Time elapsed: {elapsed/60:.1f} minutes")
        print(f"  Average time per movie: {elapsed/len(df_to_process):.2f} seconds")
        print(f"  Throughput: {len(df_to_process)/elapsed*60:.1f} movies/minute")
    print(f"\n  Output file: {EmbeddingConfig.OUTPUT_FILE}")

    if errors:
        print(f"  Error log: {EmbeddingConfig.ERROR_LOG}")
        print(f"  Error rate: {len(errors)/len(df_results)*100:.1f}%")

    return df_results

print("‚úÖ Parallel embedding processing functions defined")

‚úÖ Parallel embedding processing functions defined


In [16]:
# Generate embeddings for all movies
df_with_embeddings = process_movie_embeddings(df_movies_with_summaries)


üìÇ Found existing progress file: embedding_progress.csv
  ‚úÖ Loaded 500 successful embeddings
  üìù Remaining to process: 9242

Starting PARALLEL embedding generation
  Total movies: 9742
  Already processed: 500
  Movies to process: 9242
  Batch size: 100
  Workers: 50
  Model: gemini-embedding-001
  Dimensions: 1536



Generating embeddings:   0%|          | 0/9242 [00:00<?, ?it/s]

  üíæ Progress saved: embedding_progress.csv (600 movies)
  üíæ Progress saved: embedding_progress.csv (700 movies)
  üíæ Progress saved: embedding_progress.csv (800 movies)
  üíæ Progress saved: embedding_progress.csv (900 movies)
  üíæ Progress saved: embedding_progress.csv (1000 movies)
  üíæ Progress saved: embedding_progress.csv (1100 movies)
  üíæ Progress saved: embedding_progress.csv (1200 movies)
  üíæ Progress saved: embedding_progress.csv (1300 movies)
  üíæ Progress saved: embedding_progress.csv (1400 movies)
  üíæ Progress saved: embedding_progress.csv (1500 movies)
  üíæ Progress saved: embedding_progress.csv (1600 movies)
  üíæ Progress saved: embedding_progress.csv (1700 movies)
  üíæ Progress saved: embedding_progress.csv (1800 movies)
  üíæ Progress saved: embedding_progress.csv (1900 movies)
  üíæ Progress saved: embedding_progress.csv (2000 movies)
  üíæ Progress saved: embedding_progress.csv (2100 movies)
  üíæ Progress saved: embedding_progress.cs

In [18]:
print("\nüìä Embedding Generation Results:\n")
print("="*60)

# Check dimensions
if len(df_with_embeddings) > 0:
    sample_embedding = df_with_embeddings.iloc[0]['embedding']
    print(f"Embedding dimension: {len(sample_embedding)}")
    print(f"Expected dimension: {EmbeddingConfig.OUTPUT_DIMENSION}")
    print(f"‚úì Dimension check: {'PASS' if len(sample_embedding) == EmbeddingConfig.OUTPUT_DIMENSION else 'FAIL'}")

    print(f"\nTotal movies with embeddings: {len(df_with_embeddings)}")
    print(f"\nSample embedding (first 10 values):")
    print(f"  Movie: {df_with_embeddings.iloc[0]['movieId']}")
    print(f"  Vector: {sample_embedding[:10]}...")

    # Check file size
    output_path = Path(EmbeddingConfig.OUTPUT_FILE)
    if output_path.exists():
        file_size_mb = output_path.stat().st_size / (1024 * 1024)
        print(f"\nOutput file size: {file_size_mb:.1f} MB")

    print("\n‚úÖ Embeddings ready for import!")
else:
    print("‚ùå No embeddings generated")





üìä Embedding Generation Results:

Embedding dimension: 1536
Expected dimension: 1536
‚úì Dimension check: PASS

Total movies with embeddings: 9742

Sample embedding (first 10 values):
  Movie: 38
  Vector: [0.00706884590908885, -0.005984412506222725, -0.036130573600530624, -0.059679530560970306, -0.027341127395629883, 0.010750919580459595, -0.008083591237664223, 0.0008646799251437187, 0.0018894285894930363, -0.006449270527809858]...

Output file size: 321.2 MB

‚úÖ Embeddings ready for import!
