# 🎬 Text-to-Video Fine-Tuning + Upload + Space Demo

This notebook installs libraries, loads data, loads a pretrained text-to-video pipeline, shows a **lightweight training scaffold**, then uploads your model **privately** to Hugging Face and provides a Space demo script.

In [None]:
!pip -q install diffusers transformers accelerate safetensors
!pip -q install imageio[ffmpeg] huggingface_hub datasets gradio

In [None]:
import os, json, math, random
import torch, torchvision
from huggingface_hub import login, HfApi, Repository
from datasets import load_dataset
from diffusers import DiffusionPipeline
import imageio
print("Torch CUDA available:", torch.cuda.is_available())

## 🔑 Login to Hugging Face

In [None]:
# Paste your Write token from https://huggingface.co/settings/tokens
# getpass keeps the token out of the saved cell output.
from getpass import getpass
login(getpass("Enter your Hugging Face token (Write): ").strip())

## 📦 Load Dataset
Upload a small zip of short videos (2–5 s each), or use a public dataset to test.
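
Before uploading, it can help to confirm that the archive actually contains `.mp4` clips, since the frame-extraction cell only looks for that extension. The following is a minimal sketch using only the standard library; the helper name `list_video_entries` and the sample file names are illustrative assumptions, not part of this notebook.

```python
import io
import zipfile

def list_video_entries(zip_source, extensions=(".mp4",)):
    """Return the archive members that look like video clips.

    The match is case-sensitive, mirroring the `**/*.mp4` glob
    used by the frame-extraction cell on Linux.
    """
    with zipfile.ZipFile(zip_source) as zf:
        return sorted(
            info.filename
            for info in zf.infolist()
            if not info.is_dir()
            and info.filename.endswith(extensions)
            and not info.filename.startswith("__MACOSX/")  # skip macOS metadata
        )

# Build a tiny in-memory archive to demonstrate (placeholder bytes, not real video).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cats/clip_001.mp4", b"placeholder")
    zf.writestr("dogs/clip_002.mp4", b"placeholder")
    zf.writestr("notes.txt", b"not a video")
buffer.seek(0)

clips = list_video_entries(buffer)
print(len(clips), "clip(s) found:", clips)
```

With a real upload you would pass the path to the zip file instead of the in-memory buffer; an empty result suggests the clips use a different extension or container.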

In [None]:
USE_UPLOAD = True  # Set False to use a public dataset example

if USE_UPLOAD:
    from google.colab import files
    uploaded = files.upload()  # e.g., dataset.zip
    zip_name = list(uploaded.keys())[0]
    !mkdir -p dataset
    !unzip -o "$zip_name" -d dataset/
    data_root = "dataset"
else:
    # Placeholder dataset ID: replace it with a video dataset repo that exists on the Hub
    ds = load_dataset("damo-vilab/kinetics-mini")
    data_root = None  # Using the HF dataset object instead of a local folder
print("Data ready:", data_root or "HF dataset")

In [None]:
import glob
from pathlib import Path
from PIL import Image
import numpy as np

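def extract_frames_from_videos(src_dir, dst_dir, fps=8, max_frames=16):
    """Save the first `max_frames` frames of every .mp4 under `src_dir` as PNGs.

    Note: `fps` is accepted but not applied yet; frames are taken
    at the source frame rate.
    """
    os.makedirs(dst_dir, exist_ok=True)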
    video_files = glob.glob(os.path.join(src_dir, "**/*.mp4"), recursive=True)
    if not video_files:
        print("No .mp4 files found in your dataset folder. Place short .mp4 clips in subfolders.")
        return []

    extracted = []
    for vid in video_files:
        # Use imageio to read frames
        reader = imageio.get_reader(vid)
        frames = []
        try:
            for frame in reader:
                if len(frames) >= max_frames:
                    break
                frames.append(frame)
        except Exception as e:
            print("Error reading", vid, e)
        finally:
            reader.close()

        if not frames:
            continue

        # Save frames as PNGs
        rel = os.path.relpath(vid, src_dir)
        stem = Path(rel).with_suffix("").as_posix().replace("/", "_")
        out_dir = os.path.join(dst_dir, stem)
        os.makedirs(out_dir, exist_ok=True)
        for idx, fr in enumerate(frames):
            Image.fromarray(fr).save(os.path.join(out_dir, f"{idx:03d}.png"))
        extracted.append(out_dir)
    return extracted

if data_root:
    extracted_dirs = extract_frames_from_videos(data_root, "frame_dataset")
    print("Extracted clip dirs:", extracted_dirs[:3], "... total:", len(extracted_dirs))
else:
    extracted_dirs = []  # Using HF dataset path (not extracted here)

## 🧠 Load Pretrained Text-to-Video Pipeline

In [None]:
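model_id = "damo-vilab/text-to-video-ms-1.7b"  # you can change
device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision is meant for GPU; fall back to float32 weights on CPU.
if device == "cuda":
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, variant="fp16")
else:
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)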

## 🛠️ Lightweight Training Scaffold (LoRA-style placeholder)
This is a **scaffold** to show where your training logic would go. True fine-tuning of large text-to-video models is compute-heavy and implementation-specific.

We simulate a quick step so you can validate saving & pushing a model. Replace this with your real training when ready.
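
To give a feel for what "LoRA-style" means, here is a minimal sketch of a low-rank adapter around a single frozen linear layer, trained for one step on random data. It is an illustration under assumptions: the class name `LoRALinear`, the rank, the scaling, and the toy tensors are all hypothetical, and wiring such adapters into the pipeline's UNet attention layers is not shown.

```python
import math
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weights stay fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.down.weight, a=math.sqrt(5))
        nn.init.zeros_(self.up.weight)  # the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(16, 16), rank=4)

# Only the adapter matrices are handed to the optimizer.
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(8, 16)       # toy inputs
target = torch.randn(8, 16)  # toy regression targets
base_before = layer.base.weight.clone()

loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print("trainable tensors:", len(trainable))
print("base weights unchanged:", torch.equal(layer.base.weight, base_before))
```

The design point is that the optimizer only ever sees the two small adapter matrices, so memory and compute for the update scale with the rank rather than with the full layer size.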

In [None]:
from torch.optim import AdamW

# Placeholder step: no UNet gradients are computed, so no weights actually change.
params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = AdamW(params, lr=1e-6)

pipe.unet.train()
fake_loss = torch.tensor(0.0, requires_grad=True, device=device)
(fake_loss + 0.0).backward()  # only exercises the backward/step call sequence
optimizer.step()  # skips parameters whose .grad is None
optimizer.zero_grad()
pipe.unet.eval()

print("Scaffold training step completed (placeholder). Replace with a real loop for true fine-tuning.")

## ▶️ Test Generation

In [None]:
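prompt = "A cute white cat dancing in space, cinematic, 4k"
out = pipe(prompt, num_frames=8)
# Recent diffusers releases return a batch of videos (often float in [0, 1]);
# older ones return a flat list of uint8 frames. Normalize both to uint8 frames.
video = np.asarray(out.frames)
if video.ndim == 5:  # (batch, frames, height, width, channels)
    video = video[0]
if video.dtype != np.uint8:
    video = (video.clip(0, 1) * 255).round().astype(np.uint8)
frames = list(video)
os.makedirs("samples", exist_ok=True)
for i, fr in enumerate(frames):
    imageio.imwrite(f"samples/frame_{i:02d}.png", fr)
imageio.mimsave("samples/sample.mp4", frames, fps=8)
"Generated samples at samples/sample.mp4"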

## 💾 Save Fine-Tuned (or scaffold) Model

In [None]:
SAVE_DIR = "my_text2video_model"
pipe.save_pretrained(SAVE_DIR)
print("Saved to", SAVE_DIR)

## ☁️ Push to Hugging Face (Private)

In [None]:
api = HfApi()
repo_name = "my-text2video-model"  # change if you like
# create_repo takes `repo_id`; the older `name=` argument is no longer accepted.
repo_url = api.create_repo(repo_id=repo_name, private=True, exist_ok=True)
# Minimal README for the model card, written next to the saved weights
with open(os.path.join(SAVE_DIR, "README.md"), "w") as f:
    f.write("# My private text-to-video model\n\nThis model was prepared via Colab. Replace this text with usage instructions and sample outputs.")
# Upload over HTTP; no local git clone or git-lfs setup is needed.
api.upload_folder(folder_path=SAVE_DIR, repo_id=repo_url.repo_id, repo_type="model", commit_message="Upload model")
print("Pushed to:", repo_url)

## 🧪 Space Demo (Gradio)
Copy `space_app/app.py` from this repo to your new Space.
Set your model id in the UI or hardcode it in `app.py`. To monetize, enable **Paid Space** in Space Settings.