<a href="https://colab.research.google.com/github/mshumer/sora-extend/blob/main/Sora_Extend.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sora 2 — AI‑Planned, Scene‑Exact Prompts with Continuity (Chained >12s)

Originally built by [Matt Shumer](https://x.com/mattshumer_).

Pipeline:
1) Use an LLM (“GPT‑5 Thinking”) to plan N scene prompts from a base idea. The LLM is prompted to do this intelligently to enable continuity.
2) Render each segment with Sora 2; for continuity, pass the prior segment’s **final frame** as the input image (`image_url`) to the image-to-video endpoint.
3) Concatenate segments into a single MP4.
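
The steps above can be sketched as a loop that threads each segment's final frame into the next render. This is a minimal illustration under stated assumptions: `run_chain`, `fake_render`, and `fake_last_frame` are hypothetical stand-ins (not the notebook's real helpers), so the sketch runs without any API calls.

```python
from typing import Callable, Optional

def run_chain(
    prompts: list[str],
    render: Callable[[str, Optional[str]], str],
    last_frame: Callable[[str], str],
) -> list[str]:
    """Render prompts in order, passing each clip's final frame to the next render."""
    reference: Optional[str] = None  # the first segment has no reference frame
    clips: list[str] = []
    for index, prompt in enumerate(prompts):
        clip = render(prompt, reference)
        clips.append(clip)
        if index < len(prompts) - 1:  # the last clip needs no frame extracted
            reference = last_frame(clip)
    return clips

# Hypothetical stand-ins that only record what they were called with.
calls: list[tuple[str, Optional[str]]] = []

def fake_render(prompt: str, reference: Optional[str]) -> str:
    calls.append((prompt, reference))
    return f"{prompt}.mp4"

def fake_last_frame(clip: str) -> str:
    return f"{clip}.last.jpg"

clips = run_chain(["shot1", "shot2", "shot3"], fake_render, fake_last_frame)
print(clips)
print(calls)
```

Only the first call receives `None` as its reference; every later call receives the frame extracted from the clip before it.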

In [2]:
# @title 1) Install & imports

import sys, subprocess, importlib.util, shutil, os, textwrap, tempfile

# Detect environment for informational purposes
def is_colab():
    try:
        import google.colab
        return True
    except ImportError:
        return False

IN_COLAB = is_colab()
print(f"Environment: {'Colab' if IN_COLAB else 'Local'}")

def pip_install(*pkgs):
    """Install packages using uv pip"""
    subprocess.check_call(["uv", "pip", "install", "-q", "-U", *pkgs])

def ensure(spec_name, *pip_pkgs):
    """Check if a package is available, install if needed"""
    if importlib.util.find_spec(spec_name) is None:
        print(f"Installing {spec_name}...")
        pip_install(*pip_pkgs)
    return importlib.util.find_spec(spec_name) is not None

# Check/install core dependencies
print("Checking dependencies...")
MOVIEPY_AVAILABLE = ensure("moviepy", "moviepy>=2.0.0", "imageio", "imageio-ffmpeg")

# Only import MoviePy if it's actually available
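if MOVIEPY_AVAILABLE:
    try:
        # MoviePy 2.x removed the `moviepy.editor` module, so import from the
        # top-level package (this matches the "moviepy>=2.0.0" requirement above)
        from moviepy import VideoFileClip, concatenate_videoclips
        print("✓ MoviePy imported successfully")
    except ImportError:
        MOVIEPY_AVAILABLE = False
        print("⚠ MoviePy installation failed, falling back to ffmpeg")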

# Fallback: ensure ffmpeg is available if MoviePy isn't
if not MOVIEPY_AVAILABLE:
    try:
        import imageio_ffmpeg
        FFMPEG_BIN = imageio_ffmpeg.get_ffmpeg_exe()
    except Exception:
        FFMPEG_BIN = shutil.which("ffmpeg")

    if not FFMPEG_BIN:
        # Final attempt to get ffmpeg via pip
        print("Installing imageio-ffmpeg...")
        pip_install("imageio-ffmpeg")
        try:
            import imageio_ffmpeg
            FFMPEG_BIN = imageio_ffmpeg.get_ffmpeg_exe()
        except Exception:
            FFMPEG_BIN = None

    if not FFMPEG_BIN:
        raise RuntimeError(
            "FFmpeg not found and MoviePy unavailable. "
            "Install ffmpeg on your system or allow pip installs."
        )
    print(f"✓ FFmpeg available at: {FFMPEG_BIN}")

print(f"✓ Video processing backend: {'MoviePy' if MOVIEPY_AVAILABLE else 'FFmpeg'}")

# Install remaining dependencies
print("Installing additional packages...")
if IN_COLAB:
    get_ipython().system('uv pip -q install --upgrade openai requests opencv-python-headless imageio[ffmpeg] fal-client python-dotenv')
else:
    # Use pip_install for consistency in local environments too
    # (python-dotenv is needed by the config cell's `from dotenv import load_dotenv`)
    pip_install("openai", "requests", "opencv-python-headless", "imageio[ffmpeg]", "fal-client", "python-dotenv")

# Standard imports
import os, re, io, json, time, math, mimetypes
from pathlib import Path
import requests
import cv2
from IPython.display import Video as IPyVideo, display
from openai import OpenAI

print("✓ All imports successful")

Environment: Local
Checking dependencies...
⚠ MoviePy installation failed, falling back to ffmpeg
✓ FFmpeg available at: /Users/jquintanilla/Developer/sora-extend/.venv/lib/python3.11/site-packages/imageio_ffmpeg/binaries/ffmpeg-macos-aarch64-v7.1
✓ Video processing backend: FFmpeg
Installing additional packages...
✓ All imports successful


# 2) Config

Fill in your `.env` file with:
- `OPENROUTER_API_KEY` - Your OpenRouter API key
- `FAL_KEY` - Your Fal AI API key
- `PLANNER_MODEL` - Optional. The OpenRouter model ID for the planner (the code below defaults to `openai/gpt-5`). If you lack access to "GPT-5 Thinking", fall back to a strong reasoning model you do have.

Configure:
- `RESOLUTION` - Video resolution: `"720p"` or `"1080p"` (default: `"720p"`)
- `ASPECT_RATIO` - Video aspect ratio: `"16:9"` or `"9:16"` (default: `"16:9"`)
- `SECONDS_PER_SEGMENT` - Duration per segment: `4`, `8`, or `12` seconds (default: `8`)
- `NUM_GENERATIONS` - Total number of segments to generate (default: `2`)

**Continuity Logic:**
- First segment uses `text-to-video` endpoint (no prior frame)
- Subsequent segments use `image-to-video` endpoint (with last frame from previous segment)

Total video length = `SECONDS_PER_SEGMENT * NUM_GENERATIONS`
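
As a quick check of the settings above, the sketch below validates the values against the listed options and computes the total length. `total_length_seconds` and the `ALLOWED_*` constants are hypothetical names introduced for illustration; the allowed values are taken from the list above, not from a separate API reference.

```python
ALLOWED_SECONDS = {4, 8, 12}              # durations listed above
ALLOWED_RESOLUTIONS = {"720p", "1080p"}   # resolutions listed above
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}  # aspect ratios listed above

def total_length_seconds(
    seconds_per_segment: int,
    num_generations: int,
    resolution: str = "720p",
    aspect_ratio: str = "16:9",
) -> int:
    """Validate the config values and return the total video length in seconds."""
    if seconds_per_segment not in ALLOWED_SECONDS:
        raise ValueError(f"SECONDS_PER_SEGMENT must be one of {sorted(ALLOWED_SECONDS)}")
    if num_generations < 1:
        raise ValueError("NUM_GENERATIONS must be at least 1")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"RESOLUTION must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError(f"ASPECT_RATIO must be one of {sorted(ALLOWED_ASPECT_RATIOS)}")
    return seconds_per_segment * num_generations

print(total_length_seconds(8, 2))  # 16
```

With the defaults (`8` seconds, `2` generations) the combined video runs 16 seconds.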

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()

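openrouter_key = os.getenv("OPENROUTER_API_KEY")  # name matches the .env instructions above
fal_key = os.getenv("FAL_KEY")

# Fail early with a clear message instead of a TypeError or an auth error later
missing = [name for name, value in {"OPENROUTER_API_KEY": openrouter_key, "FAL_KEY": fal_key}.items() if not value]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")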

# Set up OpenRouter client (uses OpenAI SDK with custom base_url)
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=openrouter_key,
)

# OpenRouter model - GPT-5 with reasoning enabled
PLANNER_MODEL = os.environ.get("PLANNER_MODEL", "openai/gpt-5")

# Set Fal AI key as environment variable for fal_client
os.environ["FAL_KEY"] = fal_key

# Sora 2 endpoints (dynamically selected based on whether input image exists)
SORA_TEXT_TO_VIDEO = "fal-ai/sora-2/text-to-video/pro"
SORA_IMAGE_TO_VIDEO = "fal-ai/sora-2/image-to-video/pro"

# Video settings for Fal AI Sora API
RESOLUTION = "720p"      # Options: "720p", "1080p"
ASPECT_RATIO = "16:9"    # Options: "16:9", "9:16"

BASE_PROMPT          = "Gameplay footage of a game releasing in 2027, a car driving through a futuristic city"
SECONDS_PER_SEGMENT  = 8  # Options: 4, 8, 12 (per Fal API duration enum)
NUM_GENERATIONS      = 2

# Output directory
OUT_DIR = Path("sora_output")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Progress display
PRINT_PROGRESS_BAR = True

# 3) The planner system prompt

We’ll ask the planner model to output a clean JSON object with one prompt per generation.
The prompts contain context and the actual shot details, maximizing continuity.

This isn't super optimized and was a first pass done by GPT. If people like this notebook, let me know on X, and I'll improve this!
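
To make the expected JSON shape concrete, here is a minimal sketch of a strict validator for planner output. `validate_plan` is a hypothetical helper, and the sample data is made up for illustration. It is stricter than the notebook's own planner code, which overwrites `seconds` rather than rejecting a mismatch.

```python
import json

def validate_plan(raw_json: str, expected_segments: int, expected_seconds: int) -> list[dict]:
    """Parse planner output and check it against the shape described above."""
    data = json.loads(raw_json)
    segments = data.get("segments") if isinstance(data, dict) else None
    if not isinstance(segments, list) or len(segments) != expected_segments:
        raise ValueError(f"expected {expected_segments} segments")
    for index, segment in enumerate(segments, start=1):
        if not isinstance(segment, dict):
            raise ValueError(f"segment {index} is not an object")
        prompt = segment.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"segment {index} has no prompt text")
        if segment.get("seconds") != expected_seconds:
            raise ValueError(f"segment {index} has the wrong duration")
    return segments

# Made-up sample in the same shape the planner is asked to return.
sample = json.dumps({
    "segments": [
        {"title": "Generation 1", "seconds": 8,
         "prompt": "Wide shot of a car entering the city."},
        {"title": "Generation 2", "seconds": 8,
         "prompt": "Context: the previous shot ended behind the car.\n\nPrompt: Continue from the final frame."},
    ]
})
plan = validate_plan(sample, expected_segments=2, expected_seconds=8)
print([segment["title"] for segment in plan])
```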

In [None]:
PLANNER_SYSTEM_INSTRUCTIONS = r"""
You are a senior prompt director for Sora 2. Your job is to transform:
- a Base prompt (broad idea),
- a fixed generation length per segment (seconds),
- and a total number of generations (N),

into **N crystal-clear shot prompts** with **maximum continuity** across segments.

Rules:
1) Return **valid JSON** only. Structure:
   {
     "segments": [
       {
         "title": "Generation 1",
         "seconds": 6,
         "prompt": "<prompt block to send into Sora>"
       },
       ...
     ]
   }
   - `seconds` MUST equal the given generation length for ALL segments.
   - `prompt` should include a **Context** section for model guidance AND a **Prompt** line for the shot itself,
     exactly like in the example below.
2) Continuity:
   - Segment 1 starts fresh from the BASE PROMPT.
   - Segment k (k>1) must **begin exactly at the final frame** of segment k-1.
   - Maintain consistent visual style, tone, lighting, and subject identity unless explicitly told to change.
3) Safety & platform constraints:
   - Do not depict real people (including public figures) or copyrighted characters.
   - Avoid copyrighted music and avoid exact trademark/logos if policy disallows them; use brand-safe wording.
   - Keep content suitable for general audiences.
4) Output only JSON (no Markdown, no backticks).
5) Keep the **Context** lines inside the prompt text (they're for the AI, not visible).
6) Make the writing specific and cinematic; describe camera, lighting, motion, and subject focus succinctly.

Below is an **EXAMPLE (verbatim)** of exactly how to structure prompts with context and continuity:

Example:
Base prompt: "Intro video for the iPhone 19"
Generation length: 6 seconds each
Total generations: 3

Clearly defined prompts with maximum continuity and context:

### Generation 1:

<prompt>
First shot introducing the new iPhone 19. Initially, the screen is completely dark. The phone, positioned vertically and facing directly forward, emerges slowly and dramatically out of darkness, gradually illuminated from the center of the screen outward, showcasing a vibrant, colorful, dynamic wallpaper on its edge-to-edge glass display. The style is futuristic, sleek, and premium, appropriate for an official Apple product reveal.
</prompt>

---

### Generation 2:

<prompt>
Context (not visible in video, only for AI guidance):

* You are creating the second part of an official intro video for Apple's new iPhone 19.
* The previous 6-second scene ended with the phone facing directly forward, clearly displaying its vibrant front screen and colorful wallpaper.

Prompt: Second shot begins exactly from the final frame of the previous scene, showing the front of the iPhone 19 with its vibrant, colorful display clearly visible. Now, smoothly rotate the phone horizontally, turning it from the front to reveal the back side. Focus specifically on the advanced triple-lens camera module, clearly highlighting its premium materials, reflective metallic surfaces, and detailed lenses. Maintain consistent dramatic lighting, sleek visual style, and luxurious feel matching the official Apple product introduction theme.
</prompt>

---

### Generation 3:

<prompt>
Context (not visible in video, only for AI guidance):

* You are creating the third and final part of an official intro video for Apple's new iPhone 19.
* The previous 6-second scene ended clearly showing the back of the iPhone 19, focusing specifically on its advanced triple-lens camera module.

Prompt: Final shot begins exactly from the final frame of the previous scene, clearly displaying the back side of the iPhone 19, with special emphasis on the triple-lens camera module. Now, have a user's hand gently pick up the phone, naturally rotating it from the back to the front and bringing it upward toward their face. Clearly show the phone smoothly and quickly unlocking via Face ID recognition, transitioning immediately to a vibrant home screen filled with updated app icons. Finish the scene by subtly fading the home screen into the iconic Apple logo. Keep the visual style consistent, premium, and elegant, suitable for an official Apple product launch.
</prompt>

---

Notice how we broke up the initial prompt into multiple prompts that provide context and continuity so this all works seamlessly.
""".strip()


# 4) Planner: ask the LLM to generate prompts (JSON)
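
Models sometimes wrap their JSON in extra prose. One way to pull out the first complete JSON object, sketched here as an alternative to a regular-expression match, is the standard library's `json.JSONDecoder.raw_decode`, which parses one value starting at a given index and ignores whatever follows. `first_json_object` is a hypothetical helper name, not part of the notebook.

```python
import json

def first_json_object(text: str) -> dict:
    """Return the first parseable JSON object embedded in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and returns it together with the index where it ended.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not begin valid JSON; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")

reply = 'Here you go: {"segments": [{"prompt": "a"}]} Hope {that} helps'
print(first_json_object(reply))
```

A greedy pattern such as `\{[\s\S]*\}` would capture everything through `{that}` in this reply and fail to parse, whereas `raw_decode` stops at the end of the first valid object.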

In [None]:
def plan_prompts_with_ai(base_prompt: str, seconds_per_segment: int, num_generations: int):
    """
    Calls OpenRouter API (via OpenAI SDK) to produce a JSON object:
    {
      "segments": [
        {"title": "...", "seconds": <int>, "prompt": "<full prompt block>"},
        ...
      ]
    }
    Uses GPT-5 with reasoning enabled at medium effort level.
    """
    # Compose a single plain-text input with the variables:
    user_input = f"""
BASE PROMPT: {base_prompt}

GENERATION LENGTH (seconds): {seconds_per_segment}
TOTAL GENERATIONS: {num_generations}

Return exactly {num_generations} segments.
""".strip()

    # Use OpenRouter via chat completions with reasoning enabled
    response = client.chat.completions.create(
        model=PLANNER_MODEL,
        messages=[
            {"role": "system", "content": PLANNER_SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": user_input}
        ],
        max_tokens=4000,  # tokens for the final answer
        temperature=0.7,
        extra_body={
            "reasoning": {
                "effort": "medium",  # "low" | "medium" | "high"
            }
        },
    )

    # `content` can be None (for example, if the token budget ran out), so default to ""
    text = response.choices[0].message.content or ""

    # Extract the JSON object: the greedy match spans from the first "{" to the last "}".
    m = re.search(r'\{[\s\S]*\}', text)
    if not m:
        raise ValueError("Planner did not return JSON. Inspect response and adjust instructions.")
    data = json.loads(m.group(0))

    # Basic validation and enforcement
    segments = data.get("segments", [])
    
    # Fail fast if we don't have enough segments
    if len(segments) < num_generations:
        raise ValueError(
            f"Planner returned {len(segments)} segments but {num_generations} were requested. "
            f"LLM output needs fixing. Response: {text[:500]}"
        )
    
    # Truncate if we got more than requested
    if len(segments) > num_generations:
        segments = segments[:num_generations]

    # Force durations to the requested number (some models might deviate)
    for seg in segments:
        seg["seconds"] = int(seconds_per_segment)

    return segments

segments_plan = plan_prompts_with_ai(BASE_PROMPT, SECONDS_PER_SEGMENT, NUM_GENERATIONS)

print("AI‑planned segments:\n")
for i, seg in enumerate(segments_plan, start=1):
    print(f"[{i:02d}] {seg['seconds']}s — {seg.get('title','(untitled)')}")
    print(seg["prompt"])
    print("-" * 80)

# 5) Sora helpers (create → poll → download → extract final frame)
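
The endpoint choice used by the helpers below can be isolated into a small pure function, which makes the rule easy to test without calling the API. This is a sketch: `build_request` is a hypothetical name, the endpoint IDs and argument keys are copied from this notebook's config and helper cell, and the image URL is a placeholder.

```python
from typing import Optional

TEXT_TO_VIDEO = "fal-ai/sora-2/text-to-video/pro"
IMAGE_TO_VIDEO = "fal-ai/sora-2/image-to-video/pro"

def build_request(
    prompt: str,
    resolution: str,
    aspect_ratio: str,
    duration: int,
    image_url: Optional[str] = None,
) -> tuple[str, dict]:
    """Pick the endpoint and build the arguments for one segment."""
    arguments = {
        "prompt": prompt,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
    }
    if image_url is None:
        # First segment: nothing to continue from.
        return TEXT_TO_VIDEO, arguments
    # Later segments: start from the previous segment's final frame.
    arguments["image_url"] = image_url
    return IMAGE_TO_VIDEO, arguments

first_endpoint, first_args = build_request("A car enters the city", "720p", "16:9", 8)
next_endpoint, next_args = build_request(
    "The car turns left", "720p", "16:9", 8,
    image_url="https://example.com/segment_01_last.jpg",  # placeholder URL
)
print(first_endpoint)
print(next_endpoint)
```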

In [None]:
import fal_client
from pathlib import Path

def create_and_poll_video(prompt: str, resolution: str, aspect_ratio: str, duration: int, input_image: Path | None = None):
    """
    Create a video using Fal AI's Sora 2 Pro endpoint.
    Automatically selects the appropriate endpoint:
    - text-to-video if no input_image provided (first segment)
    - image-to-video if input_image provided (subsequent segments for continuity)
    
    Returns the result dict from Fal AI.
    
    Args:
        prompt: The text prompt for video generation
        resolution: Video resolution - "720p" or "1080p"
        aspect_ratio: Video aspect ratio - "16:9" or "9:16"
        duration: Duration in seconds - 4, 8, or 12
        input_image: Optional input image for image-to-video continuity
    """
    
    def on_queue_update(update):
        if isinstance(update, fal_client.InProgress):
            for log in update.logs:
                if PRINT_PROGRESS_BAR:
                    print(f"  {log['message']}")
    
    # Select endpoint based on whether we have an input image
    if input_image is not None:
        endpoint = SORA_IMAGE_TO_VIDEO
        print(f"  Using image-to-video endpoint (continuity mode)")
        print(f"  Input image: {input_image.name}")
    else:
        endpoint = SORA_TEXT_TO_VIDEO
        print(f"  Using text-to-video endpoint (first segment)")
    
    # Build arguments for Fal API (matching official API documentation)
    arguments = {
        "prompt": prompt,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration
    }
    
    # If we have an input image, upload it and add to arguments
    if input_image is not None:
        image_url = fal_client.upload_file(str(input_image))
        arguments["image_url"] = image_url
    
    print(f"  Submitting to Fal AI ({endpoint})...")
    print(f"  Resolution: {resolution}, Aspect ratio: {aspect_ratio}, Duration: {duration}s")
    
    # Subscribe and wait for completion
    result = fal_client.subscribe(
        endpoint,
        arguments=arguments,
        with_logs=True,
        on_queue_update=on_queue_update,
    )
    
    return result


def download_fal_video(result: dict, out_path: Path) -> Path:
    """
    Download the video from Fal AI result.
    Fal AI returns a dict with 'video' key containing the URL.
    """
    video_url = result.get("video", {}).get("url")
    if not video_url:
        raise RuntimeError(f"No video URL in Fal AI result: {result}")
    
    print(f"  Downloading video from {video_url}...")
    
    with requests.get(video_url, stream=True, timeout=600) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    
    return out_path


def extract_last_frame(video_path: Path, out_image_path: Path) -> Path:
    """
    Extract the last frame from a video file using OpenCV.
    """
    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        raise RuntimeError(f"Failed to open {video_path}")

    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 0
    success, frame = False, None

    if total > 0:
        cap.set(cv2.CAP_PROP_POS_FRAMES, total - 1)
        success, frame = cap.read()
    if not success or frame is None:
        cap.release()
        cap = cv2.VideoCapture(str(video_path))
        while True:
            ret, f = cap.read()
            if not ret: break
            frame = f
            success = True
    cap.release()

    if not success or frame is None:
        raise RuntimeError(f"Could not read last frame from {video_path}")

    out_image_path.parent.mkdir(parents=True, exist_ok=True)
    ok = cv2.imwrite(str(out_image_path), frame)
    if not ok:
        raise RuntimeError(f"Failed to write {out_image_path}")
    return out_image_path

# 6) Chain generator (use planner output; continuity via final frame)
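
When MoviePy is unavailable, the cell below joins segments with ffmpeg's concat demuxer, which reads a list file of `file '<path>'` lines. A path containing a single quote would break that quoting. The sketch below shows one way to build the list text safely, assuming ffmpeg's documented quoting rule (close the quote, emit `\'`, reopen the quote). `concat_list_text` is a hypothetical helper, not part of the notebook.

```python
from typing import Iterable

def concat_list_text(paths: Iterable[str]) -> str:
    """Build the contents of an ffmpeg concat list file from the given paths."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal ' is written as '\'' in ffmpeg's syntax.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

print(concat_list_text(["/tmp/segment_01.mp4", "/tmp/matt's_clip.mp4"]))
```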

In [None]:
def chain_generate_sora(segments, resolution: str, aspect_ratio: str, seconds_per_segment: int):
    """
    segments: list of {"title": str, "seconds": int, "prompt": str}
    resolution: Video resolution - "720p" or "1080p"
    aspect_ratio: Video aspect ratio - "16:9" or "9:16"
    seconds_per_segment: Duration per segment in seconds - 4, 8, or 12
    Returns list of video segment Paths.
    
    For continuity:
    - First segment uses text-to-video (no input image)
    - Subsequent segments use image-to-video with the last frame from previous segment
    """
    input_ref = None
    segment_paths = []

    for i, seg in enumerate(segments, start=1):
        secs   = int(seg["seconds"])
        prompt = seg["prompt"]

        print(f"\n=== Generating Segment {i}/{len(segments)} — {secs}s ===")
        
        # Generate video with Fal AI Sora 2 Pro
        # First segment: input_ref is None (text-to-video)
        # Subsequent segments: input_ref contains last frame (image-to-video)
        result = create_and_poll_video(
            prompt=prompt, 
            resolution=resolution, 
            aspect_ratio=aspect_ratio,
            duration=seconds_per_segment,
            input_image=input_ref
        )
        
        # Download the video
        seg_path = OUT_DIR / f"segment_{i:02d}.mp4"
        download_fal_video(result, seg_path)
        print(f"  Saved {seg_path}")
        segment_paths.append(seg_path)

        # Extract final frame for the next segment (if not the last segment)
        if i < len(segments):
            frame_path = OUT_DIR / f"segment_{i:02d}_last.jpg"
            extract_last_frame(seg_path, frame_path)
            print(f"  Extracted last frame -> {frame_path}")
            input_ref = frame_path

    return segment_paths


def concatenate_segments(segment_paths, out_path: Path) -> Path:
    """
    Concatenate video segments using MoviePy or ffmpeg fallback.
    """
    if not segment_paths:
        raise ValueError("No segments to concatenate")
    
    if MOVIEPY_AVAILABLE:
        # Use MoviePy
        clips = [VideoFileClip(str(p)) for p in segment_paths]
        target_fps = clips[0].fps or 24
        result = concatenate_videoclips(clips, method="compose")
        result.write_videofile(
            str(out_path),
            codec="libx264",
            audio_codec="aac",
            fps=target_fps,
            preset="medium",
            threads=0
        )
        for c in clips:
            c.close()
    else:
        # Use ffmpeg directly
        print("Using ffmpeg for concatenation...")
        
        # Create a temporary file list for ffmpeg concat
        list_file = OUT_DIR / "concat_list.txt"
        with open(list_file, "w") as f:
            for seg_path in segment_paths:
                # Write absolute paths: the concat demuxer resolves relative paths
                # against the list file's directory, not the working directory
                f.write(f"file '{seg_path.absolute()}'\n")

        # Run ffmpeg concat (-y overwrites an existing output instead of prompting)
        cmd = [
            FFMPEG_BIN,
            "-y",
            "-f", "concat",
            "-safe", "0",
            "-i", str(list_file),
            "-c", "copy",
            str(out_path)
        ]

        try:
            # `subprocess` was imported in the first cell
            result = subprocess.run(cmd, capture_output=True, text=True)
        finally:
            # Clean up the list file even if ffmpeg could not be launched
            list_file.unlink(missing_ok=True)

        if result.returncode != 0:
            raise RuntimeError(f"ffmpeg concat failed: {result.stderr}")
    
    return out_path

# 7) Run the whole pipeline

In [None]:
# 1) (Already ran) Plan prompts with AI -> segments_plan
# 2) Generate with Fal AI Sora 2 Pro in a chain
segment_paths = chain_generate_sora(
    segments_plan, 
    resolution=RESOLUTION, 
    aspect_ratio=ASPECT_RATIO,
    seconds_per_segment=SECONDS_PER_SEGMENT
)

# 3) Concatenate
final_path = OUT_DIR / "combined.mp4"
concatenate_segments(segment_paths, final_path)
print("\nWrote combined video:", final_path)

# 4) Inline preview
display(IPyVideo(str(final_path), embed=True, width=768))