# Generate Scene Descriptions for Test Set

This notebook generates Gemini scene descriptions for `test_sequences.pkl` (54,530 sequences).

**Workflow:**
1. **Part A**: Setup and load test sequences from GCS
2. **Part B**: Create JSONL batch input files
3. **Part C**: Upload to GCS
4. **Part D**: Submit Gemini batch jobs (use curl in Cloud Shell)
5. **Part E**: Check output files
6. **Part F**: Download and parse results
7. **Part G**: Merge and upload final file

**Estimated Cost:** ~$5-6  
**Estimated Time:** ~4-6 hours

---
# PART A: Setup and Configuration
---

In [1]:
# A1: Install dependencies (run once)
!pip install google-cloud-storage --quiet

In [2]:
# A2: Imports
import json
import pickle
import os
from pathlib import Path
from google.cloud import storage
from typing import List, Dict
import math

print("Imports successful!")

Imports successful!


In [3]:
# A3: Configuration
PROJECT_ID = "fluent-justice-478703-f8"
BUCKET = "harshasekar-comics-data"
REGION = "us-central1"

# Paths
TEST_SEQUENCES_PATH = "training_sequences/test_sequences.pkl"
BATCH_INPUT_PREFIX = "batch_inputs/test_descriptions/"
BATCH_OUTPUT_PREFIX = "test_descriptions/outputs/"
PANEL_IMAGES_PREFIX = "raw_panel_images/"

# Batch settings
SEQUENCES_PER_SHARD = 35000  # ~35K per shard
MODEL = "gemini-2.5-flash-lite"

# Local working directory
WORK_DIR = Path("./test_batch_work")
WORK_DIR.mkdir(exist_ok=True)

print(f"Project: {PROJECT_ID}")
print(f"Bucket: {BUCKET}")
print(f"Model: {MODEL}")
print(f"Work dir: {WORK_DIR}")

Project: fluent-justice-478703-f8
Bucket: harshasekar-comics-data
Model: gemini-2.5-flash-lite
Work dir: test_batch_work


In [4]:
# A4: Load test sequences from GCS
print("Loading test sequences from GCS...")

client = storage.Client(project=PROJECT_ID)
bucket_obj = client.bucket(BUCKET)

# Download test_sequences.pkl
blob = bucket_obj.blob(TEST_SEQUENCES_PATH)
local_pkl = WORK_DIR / "test_sequences.pkl"
blob.download_to_filename(str(local_pkl))

with open(local_pkl, "rb") as f:
    test_sequences = pickle.load(f)

print(f"Loaded {len(test_sequences):,} test sequences")

# Show sample
sample = test_sequences[0]
print(f"\nSample keys: {list(sample.keys())}")
print(f"Context panels: {len(sample.get('context', []))}")
print(f"Target text preview: {sample.get('target_text', '')[:100]}...")

Loading test sequences from GCS...
Loaded 54,530 test sequences

Sample keys: ['comic_no', 'story_idx', 'context', 'target', 'target_text', 'context_texts']
Context panels: 5
Target text preview: JUST GOT MY ORDERS, SWEETHEART! I'M ROCKET- TING UP TONIGHT! SEALED ORDERS-TO GET THOSE PIRATES! ONE...
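
Before building requests, it can help to confirm that every sequence carries the fields the later cells read. The following is a minimal sketch, not part of the original pipeline: `find_incomplete_sequences` and `REQUIRED_KEYS` are hypothetical names, and the key list is an assumption taken from the sample output above.

```python
from typing import Dict, List

# Keys read by later cells. Assumption: every sequence is a dict shaped
# like the sample printed above.
REQUIRED_KEYS = ("comic_no", "context", "target", "target_text", "context_texts")


def find_incomplete_sequences(sequences: List[Dict], required=REQUIRED_KEYS) -> Dict[int, List[str]]:
    """Map sequence index -> list of required keys that are absent."""
    problems = {}
    for idx, seq in enumerate(sequences):
        missing = [key for key in required if key not in seq]
        if missing:
            problems[idx] = missing
    return problems
```

Calling `find_incomplete_sequences(test_sequences)` should return an empty dict if all sequences match the sample's structure; any entries point at sequences worth inspecting before batching.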


---
# PART B: Create JSONL Batch Input Files
---

In [5]:
# B1: Define the prompt template (same as training)

def create_prompt(context_texts: List[str], target_text: str) -> str:
    """Create the prompt for Gemini - matches LLaVA fine-tuning format."""
    prompt = """You are looking at 6 consecutive panels from a comic book.

Here is the text from each panel:
"""
    # Add context panels (1-5)
    for i, text in enumerate(context_texts[-5:], 1):  # Last 5 context panels
        if text and text.strip():
            prompt += f"Panel {i}: {text.strip()[:400]}\n"
        else:
            prompt += f"Panel {i}: [No text]\n"
    
    # Add target panel (6)
    if target_text and target_text.strip():
        prompt += f"Panel 6: {target_text.strip()[:400]}\n"
    else:
        prompt += "Panel 6: [No text]\n"
    
    prompt += """
Based on what you see in these panels, describe what happens in Panel 6 (the last panel).

Include the scene, any dialogue, and sound effects.

Write your response as a single flowing paragraph. Do not use bullet points, numbered lists, bold text, asterisks, or any markdown formatting. Weave the dialogue naturally into your description."""
    
    return prompt

# Test prompt
sample_prompt = create_prompt(
    test_sequences[0].get("context_texts", []),
    test_sequences[0].get("target_text", "")
)
print("Sample prompt:")
print("=" * 60)
print(sample_prompt)
print("=" * 60)

Sample prompt:
You are looking at 6 consecutive panels from a comic book.

Here is the text from each panel:
Panel 1: [No text]
Panel 2: IN 1962 WITH A ROCKET LAND ING ON THE MOON. IN 1977, MAN SET FOOT ON MARS. A CENTURY LATER ON ALPHA CENTAURI, THE NEAREST STAF BY 3750, MANY STAR CLUSTERS HAD BEEN EXPLORED, THEIR PLANETARY SYSTEMS JOINED WITH EARTH FEDERATION. TO POLICE THIS VAST AREA OF BILLIONS OF MILES OF EMPTY SPACE-TO GUARD THE TREAS- URE-LADEN CARGO SPACERS, THE STAR PATROL WAS BORN. DAVE KENTON WAS A STAR PATROL MAN, HIS H
Panel 3: SACK SINCE 1950, HAVE BEEN RAIDINS THE TREASURE-HEAVY SPACERS... HEAVE OVER, BOYS! WE'RE ALMOST ABOVE HER!
Panel 4: THE SCREAMS AND MOANS OF THEIR VICTIMS SOUNDED FOR A TIME ABOVE THE WHIRR OF THE PIRATES' BEAM-GUNS- AND THEN SILENCE FELL, AND THE LOOTING BEGAN... SURRENDER NOW- AND YOU LIVE! FIGHT- AND DIE!
Panel 5: ON THE TINY PLANET OF FLAYAL-HUNDREDS OF LIGHT YEARS FROM THE EARTH-YOUNG STAR PATROLMAN DAVE KENTON RECEIVES WORD OF THE SPACE DISAST

In [6]:
# B2: Create batch request for a single sequence

def to_gcs_uri(img_path: str) -> str:
    """Convert a local panel image path to its GCS URI."""
    if "/raw_panel_images/" in img_path:
        # Keep everything after the raw_panel_images directory
        gcs_path = img_path.split("/raw_panel_images/")[-1]
    else:
        # Fall back to the bare file name
        gcs_path = img_path.split("/")[-1]
    return f"gs://{BUCKET}/{PANEL_IMAGES_PREFIX}{gcs_path}"


def create_batch_request(seq: Dict, seq_idx: int) -> Dict:
    """Create a single batch request with 6 images + prompt."""
    
    # Panels in order: last 5 context panels, then the target panel (panel 6)
    panels = list(seq.get("context", [])[-5:]) + [seq.get("target", {})]
    
    # Build one image part per panel
    image_parts = [
        {
            "fileData": {
                "mimeType": "image/jpeg",
                "fileUri": to_gcs_uri(panel.get("image_path", ""))
            }
        }
        for panel in panels
    ]
    
    # Create prompt
    context_texts = seq.get("context_texts", [])
    target_text = seq.get("target_text", "")
    prompt = create_prompt(context_texts, target_text)
    
    # Build the request
    request = {
        "request": {
            "contents": [
                {
                    "role": "user",
                    "parts": image_parts + [{"text": prompt}]
                }
            ],
            "generationConfig": {
                "temperature": 0.3,
                "maxOutputTokens": 512,
                "topP": 0.9
            }
        }
    }
    
    # Add custom ID for tracking
    comic_no = seq.get("comic_no", 0)
    request["customId"] = f"test_{seq_idx}_comic{comic_no}"
    
    return request

# Test with first sequence
test_request = create_batch_request(test_sequences[0], 0)
print("Sample request structure:")
print(f"  customId: {test_request['customId']}")
print(f"  Number of image parts: {len(test_request['request']['contents'][0]['parts']) - 1}")
print(f"  First image URI: {test_request['request']['contents'][0]['parts'][0]['fileData']['fileUri'][:80]}...")

Sample request structure:
  customId: test_0_comic1
  Number of image parts: 6
  First image URI: gs://harshasekar-comics-data/raw_panel_images/47_0.jpg...


In [None]:
# B3: Create sharded JSONL files

num_sequences = len(test_sequences)
num_shards = math.ceil(num_sequences / SEQUENCES_PER_SHARD)

print(f"Total sequences: {num_sequences:,}")
print(f"Sequences per shard: {SEQUENCES_PER_SHARD:,}")
print(f"Number of shards: {num_shards}")
print()

shard_files = []

for shard_idx in range(num_shards):
    start_idx = shard_idx * SEQUENCES_PER_SHARD
    end_idx = min(start_idx + SEQUENCES_PER_SHARD, num_sequences)
    
    shard_file = WORK_DIR / f"test_shard_{shard_idx:04d}.jsonl"
    shard_files.append(shard_file)
    
    with open(shard_file, "w") as f:
        for seq_idx in range(start_idx, end_idx):
            seq = test_sequences[seq_idx]
            request = create_batch_request(seq, seq_idx)
            f.write(json.dumps(request) + "\n")
    
    file_size_mb = shard_file.stat().st_size / (1024 * 1024)
    print(f"Shard {shard_idx}: {end_idx - start_idx:,} sequences, {file_size_mb:.1f} MB → {shard_file.name}")

print(f"\nCreated {len(shard_files)} shard files")
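
Since a malformed shard is only discovered after a batch job has been submitted, a quick local check before uploading may save a round trip. The sketch below re-reads a shard, counts its requests, and flags lines whose number of image parts differs from the expected 6. `validate_shard` is a hypothetical helper, and it assumes the request layout produced by `create_batch_request` above.

```python
import json
from pathlib import Path


def validate_shard(path: Path, expected_images: int = 6) -> dict:
    """Count requests in a JSONL shard and flag lines with an unexpected image count."""
    total = 0
    bad_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            # Assumed layout: {"request": {"contents": [{"parts": [...]}]}}
            parts = json.loads(line)["request"]["contents"][0]["parts"]
            n_images = sum(1 for part in parts if "fileData" in part)
            if n_images != expected_images:
                bad_lines.append(line_no)
    return {"requests": total, "bad_lines": bad_lines}
```

For example, `validate_shard(shard_files[0])` should report as many requests as the shard's sequence count. Flagged lines most likely correspond to sequences with fewer than 5 context panels.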

---
# PART C: Upload JSONL Files to GCS
---

In [None]:
# C1: Upload shard files to GCS

print("Uploading shard files to GCS...")
print()

gcs_shard_uris = []

for shard_file in shard_files:
    gcs_path = f"{BATCH_INPUT_PREFIX}{shard_file.name}"
    blob = bucket_obj.blob(gcs_path)
    blob.upload_from_filename(str(shard_file))
    
    gcs_uri = f"gs://{BUCKET}/{gcs_path}"
    gcs_shard_uris.append(gcs_uri)
    print(f"Uploaded: {gcs_uri}")

print(f"\n✅ Uploaded {len(gcs_shard_uris)} shard files to GCS")

---
# PART D: Submit Batch Jobs
---

**Run these curl commands in Google Cloud Shell (not in this notebook).**

The cell below generates the commands for you to copy.

In [None]:
# D1: Generate curl commands for batch submission

print("="*80)
print("CURL COMMANDS FOR CLOUD SHELL")
print("="*80)
print()
print("Copy and paste these commands into Google Cloud Shell:")
print()

for i, gcs_uri in enumerate(gcs_shard_uris):
    output_uri = f"gs://{BUCKET}/{BATCH_OUTPUT_PREFIX}job_{i:04d}/"
    
    curl_cmd = f'''# Batch {i}
curl -X POST \\
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \\
  -H "Content-Type: application/json" \\
  https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs \\
  -d '{{
    "displayName": "test-desc-batch-{i}",
    "model": "publishers/google/models/{MODEL}",
    "inputConfig": {{
      "instancesFormat": "jsonl",
      "gcsSource": {{"uris": ["{gcs_uri}"]}}
    }},
    "outputConfig": {{
      "predictionsFormat": "jsonl",
      "gcsDestination": {{"outputUriPrefix": "{output_uri}"}}
    }}
  }}'
'''
    print(curl_cmd)
    print()

print("="*80)
print("MONITORING COMMAND")
print("="*80)
print()
print(f'''curl -X GET \\
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \\
  "https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs" | python3 -m json.tool''')

---
# PART E: Check Output Files (After Jobs Complete)
---

**Wait for all batch jobs to complete before running this section.**

Check job status with the monitoring command above. Look for `"state": "JOB_STATE_SUCCEEDED"`.
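
Reading the raw listing by eye gets tedious with several jobs. As a rough sketch, the helpers below summarize the JSON returned by the monitoring command. The names `summarize_job_states` and `all_succeeded` are hypothetical. The code assumes the listing has been parsed into a dict (for example with `json.load` on a saved copy), that jobs sit under the `batchPredictionJobs` key, and that the response fits on a single page.

```python
from typing import Dict


def summarize_job_states(listing: Dict, name_prefix: str = "test-desc-batch-") -> Dict[str, str]:
    """Map displayName -> state for jobs whose display name starts with name_prefix."""
    states = {}
    for job in listing.get("batchPredictionJobs", []):
        name = job.get("displayName", "")
        if name.startswith(name_prefix):
            states[name] = job.get("state", "UNKNOWN")
    return states


def all_succeeded(states: Dict[str, str], expected_jobs: int) -> bool:
    """True only if every expected job is present and has succeeded."""
    return len(states) == expected_jobs and all(
        state == "JOB_STATE_SUCCEEDED" for state in states.values()
    )
```

Note that jobs sharing a display name (for instance after a resubmission) collapse into one entry here, so the summary is only a convenience check and not a replacement for inspecting failed jobs individually.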

In [None]:
# E1: List output files in GCS

print("Checking output files in GCS...")
print()

output_blobs = list(bucket_obj.list_blobs(prefix=BATCH_OUTPUT_PREFIX))
jsonl_files = [b for b in output_blobs if b.name.endswith('.jsonl')]

print(f"Found {len(jsonl_files)} JSONL output files:")
print()

total_size = 0
for blob in jsonl_files:
    size_mb = blob.size / (1024 * 1024)
    total_size += blob.size
    print(f"  {blob.name} ({size_mb:.1f} MB)")

print(f"\nTotal output size: {total_size / (1024*1024):.1f} MB")

---
# PART F: Download and Parse Results
---

In [None]:
# F1: Download and parse all results

print("Downloading and parsing results...")
print()

results = {}  # customId -> scene_description
errors = []

for blob in jsonl_files:
    print(f"Processing: {blob.name}")
    
    # Download content
    content = blob.download_as_text()
    
    for line in content.strip().split("\n"):
        if not line.strip():
            continue
        
        try:
            data = json.loads(line)
            custom_id = data.get("customId", "")
            
            # Extract the response text
            response = data.get("response", {})
            candidates = response.get("candidates", [])
            
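            if candidates:
                content_parts = candidates[0].get("content", {}).get("parts", [])
                if content_parts:
                    scene_desc = content_parts[0].get("text", "")
                    results[custom_id] = scene_desc.strip()
                else:
                    # Candidate came back without any text parts; record it instead of dropping it silently
                    errors.append({"customId": custom_id, "error": "Empty candidate content"})
            else:
                errors.append({"customId": custom_id, "error": "No candidates"})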
                
        except Exception as e:
            errors.append({"line": line[:100], "error": str(e)})

print()
print(f"✅ Parsed {len(results):,} scene descriptions")
print(f"❌ Errors: {len(errors)}")

if errors[:3]:
    print("\nSample errors:")
    for e in errors[:3]:
        print(f"  {e}")

In [None]:
# F2: Show sample results

print("Sample scene descriptions:")
print("=" * 60)

sample_ids = list(results.keys())[:3]
for custom_id in sample_ids:
    desc = results[custom_id]
    print(f"\n{custom_id}:")
    print(f"  {desc[:300]}...")
    print()

---
# PART G: Merge and Upload Final File
---

In [None]:
# G1: Merge scene descriptions into test sequences

print("Merging scene descriptions into test sequences...")

merged_count = 0
missing_count = 0

for seq_idx, seq in enumerate(test_sequences):
    comic_no = seq.get("comic_no", 0)
    custom_id = f"test_{seq_idx}_comic{comic_no}"
    
    if custom_id in results:
        seq["scene_description"] = results[custom_id]
        merged_count += 1
    else:
        seq["scene_description"] = ""  # Empty for missing
        missing_count += 1

print(f"\n✅ Merged: {merged_count:,}")
print(f"⚠️  Missing: {missing_count:,}")
print(f"📊 Coverage: {merged_count/len(test_sequences)*100:.1f}%")
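
If coverage falls short of 100%, the missing sequences can be collected for a follow-up batch. The sketch below is one possible way to do that, not part of the original workflow: `find_missing_indices` is a hypothetical helper that rebuilds each `customId` with the same scheme as the merge cell and returns the indices without a non-empty description.

```python
from typing import Dict, List


def find_missing_indices(sequences: List[Dict], results: Dict[str, str]) -> List[int]:
    """Return indices of sequences that have no non-empty description in results."""
    missing = []
    for seq_idx, seq in enumerate(sequences):
        # Same ID scheme as the batch requests and the merge cell
        custom_id = f"test_{seq_idx}_comic{seq.get('comic_no', 0)}"
        if not results.get(custom_id):
            missing.append(seq_idx)
    return missing
```

The returned indices could then be passed back through `create_batch_request` to write a smaller retry shard, keeping the original `seq_idx` values so the IDs still line up at merge time.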

In [None]:
# G2: Save locally and upload to GCS

# Save locally
local_output = WORK_DIR / "test_sequences_with_descriptions.pkl"
with open(local_output, "wb") as f:
    pickle.dump(test_sequences, f)

file_size_mb = local_output.stat().st_size / (1024 * 1024)
print(f"Saved locally: {local_output} ({file_size_mb:.1f} MB)")

# Upload to GCS
gcs_output_path = "training_sequences/test_sequences_with_descriptions.pkl"
blob = bucket_obj.blob(gcs_output_path)
blob.upload_from_filename(str(local_output))

print(f"\n✅ Uploaded to: gs://{BUCKET}/{gcs_output_path}")

In [None]:
# G3: Show comparison - OCR vs Scene Description

print("Comparison: OCR Text vs Scene Description")
print("=" * 70)

for i in range(3):
    seq = test_sequences[i]
    print(f"\n--- Example {i+1} ---")
    print(f"\n📝 OCR (target_text):")
    print(f"   {seq.get('target_text', '[Empty]')[:200]}")
    print(f"\n🎬 Scene Description:")
    print(f"   {seq.get('scene_description', '[Empty]')[:300]}")
    print()

---
# Summary
---

## Files Created

```
gs://harshasekar-comics-data/training_sequences/
├── test_sequences.pkl                      ← Original
└── test_sequences_with_descriptions.pkl    ← NEW (with Gemini descriptions)
```

## Next Steps

1. **Copy to Delta:**
   ```bash
   gsutil cp gs://harshasekar-comics-data/training_sequences/test_sequences_with_descriptions.pkl \
     /scratch/bftl/hsekar/comics_project/data/processed/
   ```

2. **Update fine-tuning notebook** with the 5 changes identified earlier

3. **Re-run fine-tuning** with proper scene description targets