# Cosmos Predict2 Full Pipeline on A100

This notebook runs both T5 encoding and Cosmos Predict2 inference on a single A100 GPU.

**Requirements:**
- Google Colab with A100 runtime
- 40GB GPU memory

**Note:** Make sure to select `Runtime > Change runtime type > A100 GPU` before running.
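
Before installing anything, it can help to confirm that the runtime really exposes the expected GPU. The sketch below queries `nvidia-smi`, assuming it is on the PATH as it normally is on Colab GPU runtimes. `parse_gpu_line` and `query_gpus` are hypothetical helper names introduced here for illustration, not part of any library.

```python
import shutil
import subprocess

def parse_gpu_line(line):
    """Parse one 'name, memory_mib' CSV line from nvidia-smi into (name, MiB)."""
    name, memory_mib = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory_mib)

def query_gpus():
    """Return a list of (name, total_memory_mib); empty if no GPU tooling is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            # Skip lines that do not match the expected 'name, number' layout
            continue
    return gpus

gpus = query_gpus()
if gpus:
    for name, mib in gpus:
        print(f"{name}: {mib / 1024:.1f} GiB")
else:
    print("No GPU visible; check Runtime > Change runtime type.")
```

If the list comes back empty, the later cells that call `.to("cuda")` will fail, so it is worth fixing the runtime type first.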

## 1. Installation Setup

Choose installation method: GitHub source (latest features) or PyPI (stable release).

In [None]:
# Set installation method
USE_GITHUB = True  # Set to True for latest features from GitHub, False for stable PyPI release

if USE_GITHUB:
    print("📦 Installing Cosmos Predict2 from GitHub source...")
else:
    print("📦 Installing Cosmos Predict2 from PyPI...")

### Install from GitHub Source

In [None]:
%%capture
if USE_GITHUB:
    # Clone the repository
    !git clone https://github.com/nvidia-cosmos/cosmos-predict2.git /content/cosmos-predict2
    
    # Change to the repo directory
    import os
    os.chdir('/content/cosmos-predict2')
    
    # Install PyTorch built for CUDA 12.6, matching the cu126 extras installed below
    !pip install -q --upgrade pip
    !pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
    
    # Install cosmos-predict2 from source with CUDA support
    !pip install -q -e ".[cu126]" --extra-index-url https://nvidia-cosmos.github.io/cosmos-dependencies/cu126_torch260/simple
    
    # Add to Python path
    import sys
    sys.path.insert(0, '/content/cosmos-predict2')
    
    print("✅ Installed from GitHub source")

### Install from PyPI

In [None]:
%%capture
if not USE_GITHUB:
    # Install PyTorch built for CUDA 12.6, matching the cu126 extras installed below
    !pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
    
    # Install Cosmos Predict2 from PyPI
    !pip install -q "cosmos-predict2[cu126]" --extra-index-url https://nvidia-cosmos.github.io/cosmos-dependencies/cu126_torch260/simple
    
    print("✅ Installed from PyPI")

### Install Additional Dependencies

In [None]:
%%capture
# Install other required dependencies
!pip install -q transformers accelerate bitsandbytes
!pip install -q decord einops "imageio[ffmpeg]"
!pip install -q opencv-python-headless pillow

print("✅ Additional dependencies installed")

## 2. Verify Installation

In [None]:
# Verify installations and setup paths
import os
import sys
import torch

# Add cosmos-predict2 to path if using GitHub installation
if os.path.exists('/content/cosmos-predict2'):
    sys.path.insert(0, '/content/cosmos-predict2')
    COSMOS_PATH = '/content/cosmos-predict2'
    print(f"✅ Using Cosmos Predict2 from GitHub: {COSMOS_PATH}")
else:
    COSMOS_PATH = None
    print("✅ Using Cosmos Predict2 from pip installation")

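# Check GPU; define fallbacks so later cells do not hit a NameError on CPU-only runtimes
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"🖥️ GPU: {gpu_name} ({gpu_memory:.1f} GB)")
else:
    gpu_name = "none"
    gpu_memory = 0.0
    print("❌ No GPU detected! Select Runtime > Change runtime type > A100 GPU.")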

# Test import
try:
    from cosmos_predict2.inference import Video2WorldPipeline
    print("✅ Cosmos Predict2 imports working correctly")
except ImportError as e:
    print(f"❌ Import error: {e}")

## 3. Mount Google Drive (Optional but Recommended)

Mount your Google Drive to auto-save outputs and prevent data loss.

In [None]:
# Mount Google Drive for automatic saving
import os
from datetime import datetime  # imported unconditionally: later cells use it even without Drive

mount_drive = True  # Set to False to skip Drive and keep outputs on the local runtime only

if mount_drive:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted at /content/drive")
    
    # Create output directory in Drive
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    drive_output_dir = f"/content/drive/MyDrive/cosmos_outputs_{timestamp}"
    os.makedirs(drive_output_dir, exist_ok=True)
    print(f"📁 Output directory created: {drive_output_dir}")
    print("💾 All outputs will be automatically saved to Google Drive")
else:
    print("⚠️ WARNING: Google Drive not mounted - outputs may be lost if runtime disconnects!")
    print("   Set mount_drive=True to enable automatic saving")
    drive_output_dir = None

## 4. Download Model Checkpoints

In [None]:
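# Select model size based on GPU
# Thresholds sit below the nominal sizes because total_memory reports slightly less
# (a 40 GB A100 shows about 39.6 GiB, a 16 GB T4 about 14.7 GiB)
if gpu_memory >= 38:  # A100 (40 GB)
    MODEL_SIZE = "14B"  # Can use largest model
elif gpu_memory >= 14:  # T4 (16 GB) or similar
    MODEL_SIZE = "5B"  # Medium model
else:
    MODEL_SIZE = "2B"  # Smallest model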

print(f"🤖 Selected Cosmos Predict2-{MODEL_SIZE} based on {gpu_memory:.1f}GB GPU")
print("Downloading checkpoint (this may take a few minutes)...")

from huggingface_hub import snapshot_download

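# Download checkpoint (recent huggingface_hub versions resume interrupted downloads
# automatically, so the deprecated resume_download flag is not needed)
checkpoint_base_dir = "/content/cosmos_checkpoints"
checkpoint_dir = snapshot_download(
    repo_id=f"nvidia/Cosmos-Predict2-{MODEL_SIZE}-Video2World",
    cache_dir=checkpoint_base_dir,
)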

print(f"✅ Checkpoint downloaded to: {checkpoint_dir}")

## 5. Initialize T5 Text Encoder

In [None]:
from transformers import T5EncoderModel, T5Tokenizer
import torch

class OptimizedT5Encoder:
    def __init__(self, model_name="google-t5/t5-11b"):
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
        
    def load(self, use_fp16=True, use_8bit=False):
        """Load T5 model with memory optimizations."""
        print(f"Loading T5 encoder: {self.model_name}")
        
        # Load tokenizer
        self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
        
        # Load model with optimizations
        if use_8bit:
            from transformers import BitsAndBytesConfig
            quantization_config = BitsAndBytesConfig(load_in_8bit=True)
            self.model = T5EncoderModel.from_pretrained(
                self.model_name,
                quantization_config=quantization_config,
                device_map="auto"
            )
            print("✅ Loaded in 8-bit mode")
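        else:
            # Load weights directly in the target dtype to avoid a full FP32 copy in CPU RAM
            dtype = torch.float16 if use_fp16 else torch.float32
            self.model = T5EncoderModel.from_pretrained(
                self.model_name,
                torch_dtype=dtype,
            )
            if use_fp16:
                print("✅ Using FP16 precision")
            self.model = self.model.to("cuda")
        
        self.model.eval()
        print("✅ T5 encoder loaded")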
        
    def encode(self, text, max_length=77):
        """Encode text to embeddings."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True
        ).to("cuda")
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            
        return {
            "encoder_hidden_states": outputs.last_hidden_state,
            "attention_mask": inputs.attention_mask
        }
    
    def unload(self):
        """Free memory by unloading the model."""
        # Dropping the references is enough; an explicit del before reassignment is redundant
        self.model = None
        self.tokenizer = None
        torch.cuda.empty_cache()
        print("✅ T5 encoder unloaded")

In [None]:
# Choose T5 model based on available memory
# (thresholds sit below nominal sizes: total_memory reports about 39.6 GiB on a 40 GB A100)
if gpu_memory >= 38:  # A100 (40 GB)
    t5_model = "google-t5/t5-11b"  # Best quality
    print(f"Using T5-11B (best quality) on {gpu_name}")
elif gpu_memory >= 14:  # T4 (16 GB) or similar
    t5_model = "google/flan-t5-xl"  # Efficient
    print(f"Using Flan-T5-XL (efficient) on {gpu_name}")
else:
    t5_model = "google/flan-t5-base"  # Minimal
    print(f"Using Flan-T5-Base (minimal) on {gpu_name}")

# Initialize and load T5 encoder
t5_encoder = OptimizedT5Encoder(model_name=t5_model)
t5_encoder.load(use_fp16=True, use_8bit=False)

print(f"💾 GPU memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

## 6. Encode Text Prompts

In [None]:
# Define prompts for robot manipulation tasks
prompts = [
    "A robotic arm picks up white paper and places it into a red square target area on the table.",
    "High-definition video of SO-100 robot manipulating paper with precise movements.",
    "Robot gripper grasps paper and moves it to designated red square zone.",
    "Automated paper handling: robot transfers white sheet to red target area.",
    "The robot arm carefully picks up a sheet of paper from the table.",
]

# Encode all prompts
print("Encoding prompts...")
encoded_prompts = {}

for i, prompt in enumerate(prompts, 1):
    encoded = t5_encoder.encode(prompt)
    encoded_prompts[prompt] = encoded["encoder_hidden_states"]
    print(f"  [{i}/{len(prompts)}] ✅ Encoded: '{prompt[:50]}...'")

print(f"\n💾 Current GPU memory: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

## 7. Load Cosmos Predict2 Pipeline

In [None]:
from cosmos_predict2.inference import (
    Video2WorldPipeline,
    get_cosmos_predict2_video2world_pipeline,
)

print(f"Loading Cosmos Predict2-{MODEL_SIZE} pipeline...")

# Create pipeline configuration
config = get_cosmos_predict2_video2world_pipeline(model_size=MODEL_SIZE)

# Update config to use our downloaded checkpoint
config['dit_checkpoint_path'] = os.path.join(
    checkpoint_dir,
    "model-720p-16fps.pt"  # or "model-720p-10fps.pt" for 10fps
)

# Initialize pipeline
try:
    cosmos_pipe = Video2WorldPipeline.from_config(config)
    cosmos_pipe = cosmos_pipe.to("cuda")
    cosmos_pipe.eval()
    
    print(f"✅ Cosmos Predict2-{MODEL_SIZE} pipeline loaded successfully")
    print(f"💾 Current GPU memory: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    
except Exception as e:
    print(f"❌ Error loading pipeline: {e}")
    raise

## 8. Create or Load Input Video/Image

In [None]:
import numpy as np
import cv2
from IPython.display import HTML, display, Image
from PIL import Image as PILImage
import base64

def create_test_image(output_path="test_input.jpg", width=1280, height=720):
    """Create a simple test image simulating a robot workspace."""
    # Create base image
    img = np.zeros((height, width, 3), dtype=np.uint8)
    
    # Add gradient background (table surface)
    for y in range(height):
        img[y, :] = [100 + int(50 * y / height), 80, 60]
    
    # Add white paper rectangle
    paper_x, paper_y = width // 3, height // 2
    paper_w, paper_h = 200, 150
    cv2.rectangle(img, (paper_x, paper_y), (paper_x + paper_w, paper_y + paper_h), 
                  (255, 255, 255), -1)
    cv2.rectangle(img, (paper_x, paper_y), (paper_x + paper_w, paper_y + paper_h), 
                  (200, 200, 200), 2)
    
    # Add red target square (img is RGB until the final conversion, so red is the first channel)
    target_x, target_y = 2 * width // 3, height // 2
    target_size = 150
    cv2.rectangle(img, (target_x, target_y), (target_x + target_size, target_y + target_size),
                  (200, 50, 50), -1)
    cv2.rectangle(img, (target_x, target_y), (target_x + target_size, target_y + target_size),
                  (150, 30, 30), 3)
    
    # Add text labels
    cv2.putText(img, "Paper", (paper_x + 70, paper_y - 10), 
               cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
    cv2.putText(img, "Target", (target_x + 40, target_y - 10), 
               cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
    
    # Save image
    cv2.imwrite(output_path, cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
    print(f"✅ Created test image: {output_path}")
    return output_path

def display_image(image_path):
    """Display image in notebook."""
    img = PILImage.open(image_path)
    display(img)

# Create or upload input
use_test_input = True  # Set to False to upload your own image/video

if use_test_input:
    input_path = create_test_image()
    print("\nTest input image:")
    display_image(input_path)
else:
    from google.colab import files
    print("Please upload an image or video file:")
    uploaded = files.upload()
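    input_path = next(iter(uploaded))  # name of the first uploaded file
    print(f"✅ Uploaded: {input_path}")
    # Compare the extension case-insensitively so files like IMG_0001.JPG are recognized
    if input_path.lower().endswith(('.jpg', '.jpeg', '.png')):
        display_image(input_path)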

## 9. Generate Video with Cosmos Predict2

In [None]:
import decord
from einops import rearrange
import time

def generate_video_cosmos(input_path, prompt_embedding, num_frames=16, fps=16):
    """Generate video using Cosmos Predict2."""
    
    # Load input frame
    if input_path.lower().endswith(('.jpg', '.jpeg', '.png')):
        # Input is an image; force 3-channel RGB so PNGs with alpha or grayscale also work
        img = PILImage.open(input_path).convert("RGB")
        frames = np.array(img)[np.newaxis, ...]  # Add time dimension
    else:
        # Input is a video - use first frame
        vr = decord.VideoReader(input_path)
        frames = vr[:1].asnumpy()
    
    # Prepare input tensor
    frames_tensor = torch.from_numpy(frames).float() / 255.0
    frames_tensor = rearrange(frames_tensor, "t h w c -> 1 c t h w")
    frames_tensor = frames_tensor.to("cuda")
    
    print(f"📊 Input shape: {frames_tensor.shape}")
    print(f"🎬 Generating {num_frames} frames at {fps} FPS...")
    
    start_time = time.time()
    
    with torch.no_grad():
        # torch.cuda.amp.autocast() is deprecated; torch.autocast is the current equivalent
        with torch.autocast(device_type="cuda"):
            output = cosmos_pipe(
                frames_tensor,
                prompt_embedding,
                num_frames=num_frames,
                fps=fps,
                seed=42
            )
    
    generation_time = time.time() - start_time
    print(f"✅ Generation complete in {generation_time:.2f} seconds")
    print(f"⚡ Speed: {num_frames/generation_time:.2f} frames/second")
    
    return output

In [None]:
# Configure generation parameters based on GPU
# (thresholds sit below nominal sizes: total_memory reports about 39.6 GiB on a 40 GB A100)
if gpu_memory >= 38:  # A100 (40 GB)
    generation_params = {
        "num_frames": 121,  # ~7.6 seconds at 16fps
        "fps": 16
    }
    print("🚀 Using A100 optimized settings")
elif gpu_memory >= 14:  # T4 (16 GB) or similar
    generation_params = {
        "num_frames": 61,  # ~3.8 seconds at 16fps
        "fps": 16
    }
    print("Using T4 optimized settings")
else:
    generation_params = {
        "num_frames": 16,  # 1 second at 16fps
        "fps": 16
    }
    print("Using conservative settings")

print(f"Generation parameters: {generation_params}")

# Select prompt and generate
selected_prompt = prompts[0]  # Use first prompt
print(f"\n📝 Selected prompt: '{selected_prompt[:80]}...'")

# Get the pre-encoded embedding
prompt_embedding = encoded_prompts[selected_prompt]

# Generate video
print("\n🎬 Starting video generation...")
output_video = generate_video_cosmos(
    input_path,
    prompt_embedding,
    num_frames=generation_params['num_frames'],
    fps=generation_params['fps']
)

## 10. Save and Display Results

In [None]:
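import os
import shutil
from datetime import datetime  # used for metadata and filenames even when Drive is not mounted

import imageio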

def save_video(tensor, output_path="output_video.mp4", fps=16, auto_backup=True):
    """Save tensor as video file with automatic Google Drive backup."""
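    # Convert tensor to numpy (cast to float32 first: NumPy has no bfloat16 dtype)
    if isinstance(tensor, torch.Tensor):
        video = tensor.detach().float().cpu().numpy()
    else:
        video = np.asarray(tensor)
    
    # Rearrange dimensions if needed
    if video.ndim == 5:  # B C T H W
        video = video[0]  # Remove batch
    if video.shape[0] == 3:  # C T H W
        video = np.transpose(video, (1, 2, 3, 0))  # T H W C
    
    # Convert to uint8; assumes float output is in [0, 1] or already in [0, 255]
    if video.dtype != np.uint8:
        if video.max() <= 1.0:
            video = video * 255
        video = np.clip(video, 0, 255).astype(np.uint8)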
    
    # Save video locally
    writer = imageio.get_writer(output_path, fps=fps)
    for frame in video:
        writer.append_data(frame)
    writer.close()
    
    print(f"✅ Saved video locally: {output_path}")
    
    # Auto-backup to Google Drive
    if auto_backup and drive_output_dir:
        drive_path = os.path.join(drive_output_dir, os.path.basename(output_path))
        shutil.copy2(output_path, drive_path)
        print(f"☁️ Backed up to Drive: {drive_path}")
        
        # Save metadata
        metadata_path = drive_path.replace('.mp4', '_metadata.txt')
        with open(metadata_path, 'w') as f:
            f.write(f"Prompt: {selected_prompt}\n")
            f.write(f"Frames: {generation_params['num_frames']}\n")
            f.write(f"FPS: {generation_params['fps']}\n")
            f.write(f"Model: Cosmos-Predict2-{MODEL_SIZE}\n")
            f.write(f"Timestamp: {datetime.now().isoformat()}\n")
        print(f"📝 Metadata saved")
    
    return output_path

def display_video(video_path):
    """Display video in notebook."""
    # Read inside a context manager so the file handle is closed promptly
    with open(video_path, 'rb') as f:
        video = f.read()
    encoded = base64.b64encode(video).decode('ascii')
    display(HTML(f'''
    <video width="640" height="360" controls autoplay loop>
        <source src="data:video/mp4;base64,{encoded}" type="video/mp4">
    </video>
    '''))

In [None]:
# Save the generated video (use the same fps that was passed to generation)
output_filename = f"cosmos_output_{datetime.now().strftime('%H%M%S')}.mp4"
output_path = save_video(output_video, output_filename, fps=generation_params['fps'], auto_backup=True)

# Display the result
print("\n🎥 Generated video:")
display_video(output_path)

# Optional download
from google.colab import files
download = input("\nDownload video to your computer? (y/n): ")
if download.lower() == 'y':
    files.download(output_path)
    print("✅ Download started")

## 11. Batch Processing (Optional)

Process multiple prompts efficiently with automatic Drive backup.

In [None]:
# Batch process all prompts
batch_process = True  # Set to False to skip batch processing

if batch_process:
    results = {}
    
    print(f"🎬 Batch processing {len(prompts)} prompts...")
    if drive_output_dir:
        print(f"📁 All outputs will be saved to: {drive_output_dir}")
    
    for i, prompt in enumerate(prompts):
        print(f"\n[{i+1}/{len(prompts)}] Processing...")
        print(f"  Prompt: {prompt[:60]}...")
        
        try:
            # Generate video
            output = generate_video_cosmos(
                input_path,
                encoded_prompts[prompt],
                num_frames=generation_params['num_frames'],
                fps=generation_params['fps']
            )
            
            # Save with descriptive filename
            output_file = f"batch_{i:02d}_{datetime.now().strftime('%H%M%S')}.mp4"
            save_video(output, output_file, fps=generation_params['fps'], auto_backup=True)
            results[prompt] = output_file
            
            # Clear cache between generations
            torch.cuda.empty_cache()
            
        except Exception as e:
            print(f"  ❌ Failed: {e}")
            continue
    
    print(f"\n✅ Batch processing complete!")
    print(f"Successfully generated {len(results)}/{len(prompts)} videos")
    
    # Display summary
    print("\n📊 Results summary:")
    for prompt, file in results.items():
        print(f"  - {prompt[:40]}... -> {file}")

## 12. Memory Management and Session Info

In [None]:
# Display session status
print("📊 Session Status:")
print("="*50)
print(f"GPU: {gpu_name}")
print(f"Total GPU memory: {gpu_memory:.1f} GB")
print(f"GPU allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"GPU reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
# mem_get_info() reports free memory as seen by the driver, including reserved-but-unused blocks
print(f"GPU free: {torch.cuda.mem_get_info()[0]/1024**3:.2f} GB")

if drive_output_dir:
    print(f"\n✅ Outputs saved to Google Drive:")
    print(f"   {drive_output_dir}")
    print("\n💾 Your outputs are safe even if the session disconnects!")
else:
    print("\n⚠️ No Google Drive backup - outputs will be lost if session disconnects!")

In [None]:
# Optional: Clean up memory
cleanup = False  # Set to True to free all memory

if cleanup:
    print("🧹 Cleaning up memory...")
    
    # Unload models
    if 't5_encoder' in locals():
        t5_encoder.unload()
        del t5_encoder
    
    if 'cosmos_pipe' in locals():
        del cosmos_pipe
    
    # Clear cache
    import gc
    gc.collect()
    torch.cuda.empty_cache()
    
    print(f"✅ Cleanup complete")
    print(f"GPU allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
else:
    print("💡 Set cleanup=True to free GPU memory")

## Tips and Troubleshooting

### Memory Optimization:
- **A100 (40GB)**: Can run T5-11B + Cosmos-14B with 121 frames
- **T4 (16GB)**: Use Flan-T5-XL + Cosmos-5B with 61 frames
- **Low memory**: Use 8-bit quantization or smaller models
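
To see why precision matters for the pairings above, a rough estimate of weight memory is parameter count times bytes per parameter. The sketch below is illustrative arithmetic only. `weight_memory_gib` is a hypothetical helper, the parameter counts are nominal round numbers rather than measured values, and activations, caches, and framework overhead come on top of the result.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params, precision="fp16"):
    """Estimate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

# Nominal sizes chosen for illustration (assumption: not measured from checkpoints)
for label, params in [("2B", 2e9), ("14B", 14e9)]:
    for precision in ("fp32", "fp16", "int8"):
        gib = weight_memory_gib(params, precision)
        print(f"{label} @ {precision}: ~{gib:.1f} GiB of weights")
```

Note that `T5EncoderModel` loads only the encoder stack, so its footprint is smaller than the full model's nominal size suggests.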

### Performance Tips:
- Enable TF32 on A100 to speed up FP32 matrix multiplications and convolutions; the gain depends on how much of the workload runs in FP32
- Use FP16 (half precision) for memory efficiency
- Batch encode prompts before generation
- Clear cache between generations in batch processing
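
The TF32 tip corresponds to two PyTorch backend flags, which the cells in this notebook do not set. A minimal sketch follows. `enable_tf32` is a hypothetical wrapper name, and it returns `False` rather than failing when PyTorch or a CUDA device is unavailable.

```python
def enable_tf32():
    """Turn on TF32 for CUDA matmuls and cuDNN convolutions; return True if applied."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # These flags only take effect on Ampere-or-newer GPUs such as the A100
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True

print("TF32 enabled:", enable_tf32())
```

Calling this once, before loading the pipeline, should be enough. Work that already runs in FP16 under autocast is largely unaffected.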

### Common Issues:
1. **OOM Error**: Reduce `num_frames` or use smaller models
2. **Slow generation**: Check GPU type, use appropriate settings
3. **Import errors**: Restart runtime after installing packages
4. **Drive not mounting**: Check browser permissions for Google Drive
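
For the OOM case, one option is to retry with fewer frames instead of failing outright. The sketch below shows that idea with a stand-in generator. `generate_with_backoff` and `fake_generate` are hypothetical names, and whether the real pipeline accepts arbitrary frame counts is an assumption to verify.

```python
def generate_with_backoff(generate_fn, num_frames, min_frames=16, oom_error=MemoryError):
    """Call generate_fn(num_frames), halving num_frames after each out-of-memory failure."""
    frames = num_frames
    while True:
        try:
            return frames, generate_fn(frames)
        except oom_error:
            if frames <= min_frames:
                raise  # nothing smaller left to try
            frames = max(min_frames, frames // 2)
            print(f"Out of memory; retrying with {frames} frames")

# Stand-in generator that only "fits" up to 40 frames, for demonstration
def fake_generate(n):
    if n > 40:
        raise MemoryError(f"{n} frames do not fit")
    return f"video with {n} frames"

used, video = generate_with_backoff(fake_generate, num_frames=121)
print(used, video)
```

In the notebook, `generate_fn` could wrap `generate_video_cosmos`, with `oom_error=torch.cuda.OutOfMemoryError` and a `torch.cuda.empty_cache()` call before each retry.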

### Recovery from Disconnection:
If your session disconnects but you had Drive mounted, your outputs are safe!
Simply remount Drive and navigate to your output directory to access generated videos.
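
After remounting, the most recent output folder can be located programmatically. This sketch relies on the `cosmos_outputs_<timestamp>` naming used in the Drive cell. `latest_output_dir` and `list_videos` are hypothetical helpers, and the default root assumes Drive is mounted at `/content/drive`.

```python
import glob
import os

def latest_output_dir(root="/content/drive/MyDrive", prefix="cosmos_outputs_"):
    """Return the newest cosmos_outputs_* directory under root, or None if there is none."""
    candidates = [
        path for path in glob.glob(os.path.join(root, prefix + "*"))
        if os.path.isdir(path)
    ]
    # Names end in YYYYMMDD_HHMMSS, so lexicographic order is chronological
    return max(candidates) if candidates else None

def list_videos(directory):
    """Return the .mp4 files in a directory, sorted by name."""
    return sorted(glob.glob(os.path.join(directory, "*.mp4")))

latest = latest_output_dir()
if latest is None:
    print("No cosmos_outputs_* directory found; is Drive mounted?")
else:
    print(f"Latest output directory: {latest}")
    for video_path in list_videos(latest):
        print("  ", os.path.basename(video_path))
```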