<h1 style="text-align:left; font-size:28px; color:#006064; font-weight:700;">Stable Diffusion: Text → Video (Zeroscope)</h1>

<h2 style="text-align:center;">0 - One-line Goal</h2>

Create a short video from a text prompt: a text-to-video diffusion pipeline generates a sequence of frames, which are then exported to a video file.


<h2 style="text-align:center;">1 - Theory (concise)</h2>

- Image diffusion models can be applied frame by frame, but independently sampled frames flicker; text-to-video models such as Zeroscope add temporal layers so that all frames of a clip are denoised together.  
- This notebook uses a diffusion pipeline that accepts `num_frames`, produces the frame sequence, and then stitches the frames into a video.  
- This is a simple demo of the generation → export workflow; it adds no motion control or consistency constraints beyond what the model itself provides.


<h2 style="text-align:center;">2 - Environment Notes</h2>

- Recommended: a GPU runtime with ample VRAM (16 GB or more is best); FP16 weights reduce memory use.  
- Use the provided `requirements.txt`, or run the pip installs below in Colab.  
- If you use a private or gated model, set your Hugging Face auth token via `huggingface-cli login` or an environment variable.
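
The environment-variable route in the last bullet could be sketched as follows. This is a minimal illustration rather than part of the notebook: `find_hf_token` is a hypothetical helper name, `HUGGINGFACE_TOKEN` is the variable name the login cell in this notebook uses, and `HF_TOKEN` is the variable that `huggingface_hub` reads on its own.

```python
import os


def find_hf_token(env=None):
    """Return the first non-empty token found in the environment, or None.

    Hypothetical helper: checks the variable name used in this notebook first,
    then the one huggingface_hub reads by itself.
    """
    env = os.environ if env is None else env
    for name in ("HUGGINGFACE_TOKEN", "HF_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


token = find_hf_token()
print("Token found." if token else "No token set; public models will still load.")
```

Passing a plain dictionary as `env` makes the lookup easy to check without touching the real environment.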


In [None]:
# Run this cell in Colab or a first-time environment.
# Installs diffusers (latest from GitHub), transformers, accelerate,
# imageio (with its ffmpeg backend, needed to write MP4), numpy, and torch.
# If you already have suitable versions installed, skip this cell.

# ➤ WARNING: This installs packages and may restart the runtime in Colab.
!pip install -q git+https://github.com/huggingface/diffusers.git
!pip install -q transformers accelerate "imageio[ffmpeg]" numpy
# Install a torch build that matches your GPU environment. If you already have torch, omit this line.
# For Colab with CUDA 11.x the following often works; adjust per your environment:
!pip install -q torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for diffusers (pyproject.toml) ... done


In [None]:
# Standard imports and device setup
import os
import torch
import numpy as np  # ➤ numeric ops
import imageio      # ➤ video writing
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video  # optional helper

# Device selection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device ->", device)  # ➤ prints which device is used


Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.


Device -> cuda


In [None]:
# 1 - Secure Hugging Face login (recommended)
import getpass
import os
from huggingface_hub import login

# If the token is not already set, prompt for it without echoing the input
if not os.environ.get("HUGGINGFACE_TOKEN"):
    os.environ["HUGGINGFACE_TOKEN"] = getpass.getpass(
        "🔐 Enter your HuggingFace token (hf_...): "
    )

# Log in to the Hugging Face Hub
login(token=os.environ["HUGGINGFACE_TOKEN"])
print("✅ Hugging Face login successful.")


🔐 Enter your HuggingFace token (hf_...): ··········
✅ Hugging Face login successful.


In [None]:
# 6 - Pipeline load
# -------------------------
# Load the pipeline
# -------------------------
# NOTES:
# - The model used in your original script: "cerspense/zeroscope_v2_576w"
# - If the repo is private or gated, make sure you are authenticated on Hugging Face.
# - Use torch_dtype=torch.float16 on CUDA for memory savings; fall back to float32 on CPU.

model_id = "cerspense/zeroscope_v2_576w"  # ➤ model used in original code
torch_dtype = torch.float16 if device.type == "cuda" else torch.float32

try:
    # `token` replaces the deprecated `use_auth_token`, which this pipeline ignores.
    # It only matters for private or gated repos.
    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        token=os.environ.get("HUGGINGFACE_TOKEN"),
    )
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    if device.type == "cuda":
        # CPU offload moves each sub-model to the GPU only while it runs,
        # so the pipeline is not also moved with pipe.to("cuda").
        pipe.enable_model_cpu_offload()
        # Optional memory helpers (only enabled where supported)
        try:
            pipe.enable_vae_slicing()
            # Chunking can reduce memory at cost of compute; keep as configured
            pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
        except Exception as e:
            print("Optional memory helpers not fully enabled:", e)
    else:
        pipe = pipe.to(device)
    print(f"✅ Pipeline loaded: {model_id} on {device}")
except Exception as e:
    raise RuntimeError(f"Failed to load pipeline {model_id}. Error: {e}") from e


Keyword arguments {'use_auth_token': 'hf_[REDACTED]'} are not expected by TextToVideoSDPipeline and will be ignored.


Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

An error occurred while trying to fetch /root/.cache/huggingface/hub/models--cerspense--zeroscope_v2_576w/snapshots/6963642a64dbefa93663d1ecebb4ceda2d9ecb28/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--cerspense--zeroscope_v2_576w/snapshots/6963642a64dbefa93663d1ecebb4ceda2d9ecb28/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--cerspense--zeroscope_v2_576w/snapshots/6963642a64dbefa93663d1ecebb4ceda2d9ecb28/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--cerspense--zeroscope_v2_576w/snapshots/6963642a64dbefa93663d1ecebb4ceda2d9ecb28/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
The TextToVideoSDPipeline has been deprecated and will not receive bug fixes or feature updates after 

✅ Pipeline loaded: cerspense/zeroscope_v2_576w on cuda


In [None]:
# 7 - Prompt, generation settings, and frame generation
# -------------------------
# Generation settings
# -------------------------
prompt = "A cat playing with a ball"   # ➤ change this prompt as you wish
num_inference_steps = 40               # ➤ denoising steps for the whole clip (all frames are denoised together)
height = 320                           # ➤ frame height (pixels)
width = 576                            # ➤ frame width (pixels)
num_frames = 36                        # ➤ how many frames to generate
fps = 10                               # ➤ frames per second for final video

# Generate frames through the pipeline's `num_frames` argument, which
# text-to-video pipelines accept to produce one coherent sequence.
# A pipeline without it would need independent per-frame sampling instead.
print("Generating frames - this may take several minutes...")
video_out = pipe(prompt,
                 num_inference_steps=num_inference_steps,
                 height=height,
                 width=width,
                 num_frames=num_frames).frames  # ➤ frames array with a leading batch dimension

# Check the shape of video_out before converting it
print("Raw frames shape:", getattr(video_out, "shape", "unknown"))


Generating frames - this may take several minutes...


  0%|          | 0/40 [00:00<?, ?it/s]

Raw frames shape: (1, 36, 320, 576, 3)
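
The settings above fix both the clip length and the size of the raw output. The arithmetic can be sketched as below, assuming the `(1, 36, 320, 576, 3)` shape printed above and one byte per channel once the frames are converted to `uint8`; `clip_stats` is a hypothetical helper name, not part of the notebook.

```python
def clip_stats(num_frames, fps, height, width, channels=3):
    """Return (duration in seconds, raw size in bytes) for a uint8 clip."""
    duration_s = num_frames / fps
    raw_bytes = num_frames * height * width * channels  # 1 byte per channel value
    return duration_s, raw_bytes


duration_s, raw_bytes = clip_stats(num_frames=36, fps=10, height=320, width=576)
print(f"{duration_s:.1f} s, {raw_bytes / 1e6:.1f} MB of raw uint8 frames")
# prints: 3.6 s, 19.9 MB of raw uint8 frames
```

Raising `num_frames` lengthens the clip at a fixed `fps`, while raising `fps` alone only plays the same frames faster.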


In [None]:
# 8 - Convert & save frames to disk
# The pipeline returns frames with a leading batch dimension; convert them to uint8 and save them to disk.
# Original code assumed frames[0] exists. We'll handle both cases robustly.

import pathlib
out_dir = pathlib.Path("output_frames")
out_dir.mkdir(exist_ok=True)

# Normalise the array layout to (frames, H, W, C)
frames_np = np.asarray(video_out)  # ensure numpy array
if frames_np.ndim == 5:
    # (batch, frames, H, W, C) -> keep the first (only) clip
    frames_np = frames_np[0]
elif frames_np.ndim != 4:
    # fallback: drop singleton dimensions
    frames_np = np.squeeze(frames_np)
if frames_np.ndim == 4 and frames_np.shape[0] != num_frames:
    # possibly (batch, H, W, C) with one frame per batch item; kept as-is
    print("Warning: expected", num_frames, "frames, got", frames_np.shape[0])

# Ensure range 0..255 and uint8
# Float outputs (float16/32/64) are usually in 0..1: scale by 255 if max <= 1.0
if np.issubdtype(frames_np.dtype, np.floating):
    if frames_np.max() <= 1.0:
        frames_np = frames_np * 255
    frames_uint8 = np.clip(frames_np, 0, 255).astype(np.uint8)
else:
    frames_uint8 = frames_np.astype(np.uint8)

# Save frames as PNGs (optional); good for debugging
for idx, frame in enumerate(frames_uint8):
    imageio.imwrite(str(out_dir / f"frame_{idx:03d}.png"), frame)
print(f"Saved {len(frames_uint8)} frames to {out_dir}")


Saved 36 frames to output_frames
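
The conversion step can be exercised without a GPU by feeding it a small synthetic array. The sketch below is an illustration under stated assumptions, not the notebook's own code: `to_uint8_frames` is a hypothetical helper, float inputs with a maximum of at most 1.0 are assumed to be in the 0..1 range, and values are rounded rather than truncated.

```python
import numpy as np


def to_uint8_frames(frames):
    """Convert (B, T, H, W, C) or (T, H, W, C) frames to a uint8 (T, H, W, C) array."""
    arr = np.asarray(frames)
    if arr.ndim == 5:
        arr = arr[0]  # keep the first clip of the batch
    if arr.ndim != 4:
        raise ValueError(f"expected 4 or 5 dimensions, got shape {arr.shape}")
    if np.issubdtype(arr.dtype, np.floating):
        if arr.max() <= 1.0:
            arr = arr * 255.0  # assumed 0..1 range
        arr = np.clip(np.rint(arr), 0, 255)
    return arr.astype(np.uint8)


# Synthetic float16 clip: 1 batch item, 2 frames of 4x4 RGB, values from 0 to 1
demo = np.linspace(0.0, 1.0, num=2 * 4 * 4 * 3, dtype=np.float16).reshape(1, 2, 4, 4, 3)
out = to_uint8_frames(demo)
print(out.shape, out.dtype, out.min(), out.max())
```

Checking the dtype with `np.issubdtype(..., np.floating)` also covers `float16`, which a pipeline loaded in half precision may return.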


In [None]:
# 9 - Stitch frames into video
# -------------------------
# Export frames to a video file (MP4)
# -------------------------
output_video_path = "output_video.mp4"
with imageio.get_writer(output_video_path, fps=fps, codec='libx264') as writer:
    for frame in frames_uint8:
        # Frames may be (H, W, C) or (C, H, W); the writer expects (H, W, C)
        if frame.ndim == 3 and frame.shape[0] == 3:
            # It's CHW: convert to HWC
            frame_to_write = np.transpose(frame, (1, 2, 0))
        else:
            frame_to_write = frame
        writer.append_data(frame_to_write)

print("Saved video:", output_video_path)


Saved video: output_video.mp4
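
When changing `height` or `width`, it may help to check the frame size before encoding. The sketch below assumes imageio's ffmpeg writer keeps its default `macro_block_size` of 16, in which case it rescales frames whose sides are not multiples of 16; `is_block_aligned` and `round_up_to_block` are hypothetical helper names.

```python
def is_block_aligned(height, width, block=16):
    """True if both sides are multiples of the encoder block size (assumed 16)."""
    return height % block == 0 and width % block == 0


def round_up_to_block(value, block=16):
    """Smallest multiple of `block` that is >= value."""
    return ((value + block - 1) // block) * block


print(is_block_aligned(320, 576))                        # the size used in this notebook
print(round_up_to_block(300), round_up_to_block(500))    # nearest aligned sizes above 300 and 500
```

The 320 x 576 frames used here are already aligned, so the writer should not need to resize them.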


In [None]:
# 10 - Quick playback in notebook
# Display the generated video inline (works in Jupyter/Colab)
from IPython.display import HTML
from base64 import b64encode

# Read the file with a context manager so the handle is closed promptly
with open(output_video_path, "rb") as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML(f'<video width=640 controls><source src="{data_url}" type="video/mp4"></video>')


<h2 style="text-align:center;">Summary & Limitations</h2>

- ✅ This notebook demonstrates how to produce multiple frames from a text prompt and export them to a video file.  
- ⚠️ **Limitations**:
  - Frame-to-frame coherence depends on the model and pipeline. Temporally consistent motion typically requires specialized video or temporal models.
  - Memory & compute: generating many frames at high resolution needs a capable GPU (16 GB or more recommended).
  - If the pipeline does not natively support `num_frames`, repeated independent sampling will produce flicker and incoherence.
- 🔁 **Run checklist**:
  1. Ensure a GPU runtime is enabled.  
  2. Install the required packages (Cell 4, the pip install cell).  
  3. Adjust prompt/num_frames/height/width per your hardware.
