A local-AI powered YouTube story video generation pipeline. Give it a topic, get a fully narrated, image-synced MP4 — script, voiceover, visuals, and YouTube metadata included.
Chronicle Forge runs a 6-step pipeline fully automatically:
| Step | What happens |
|---|---|
| 1 — Research | Fetches Wikipedia context for the topic |
| 2 — Script | Pass 1: writes the narration story (with delivery annotations stripped to punctuation). Pass 2: splits it into scenes and writes image prompts. Pass 3: expands scenes into visual sub-moments (action → reaction → emotion). |
| 3 — Metadata | Generates YouTube title, description, and tags |
| 4 — TTS | Narrates the script with edge-tts or Kokoro, capturing word-level timestamps |
| 5 — Images | Generates one SDXL image per scene (16:9, Ken Burns ready) |
| 6 — Video | Composites timestamp-synced Ken Burns slideshow at 1920×1080 |
Everything is checkpointed. Press Ctrl+C at any time, then run again with --resume to continue exactly where you left off.
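
The Ken Burns effect in step 6 amounts to interpolating a crop window across each scene's duration. The sketch below is illustrative only: the function name, the centered zoom-in, and the default zoom factors are assumptions, not the logic in video_composer.py.

```python
def ken_burns_crop(t, duration, src_w, src_h, zoom_start=1.0, zoom_end=1.25):
    """Return the (x, y, width, height) crop window at time t.

    Hypothetical helper: zooms linearly toward the image centre.
    The real compositor may also pan or ease; this does neither.
    """
    if duration <= 0:
        progress = 1.0
    else:
        # Clamp so times outside the scene hold the first or last frame.
        progress = min(max(t / duration, 0.0), 1.0)
    zoom = zoom_start + (zoom_end - zoom_start) * progress
    crop_w = src_w / zoom
    crop_h = src_h / zoom
    # Keep the window centred on the source image.
    x = (src_w - crop_w) / 2
    y = (src_h - crop_h) / 2
    return x, y, crop_w, crop_h
```

Each frame would then be produced by cropping this window from the source image and resizing it to the output resolution.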
| Path | OS |
|---|---|
| AMD GPU (DirectML) | Windows 10/11 only — torch-directml is Windows-exclusive |
| NVIDIA GPU (CUDA) | Windows or Linux |
| CPU-only (no images) | Any OS |
| Component | Minimum | Notes |
|---|---|---|
| CPU | 4-core, 2.5 GHz | Used for TTS, video composition, VAE decode |
| RAM | 16 GB | SDXL loads ~4 GB into RAM; pipeline buffers images |
| GPU VRAM | 8 GB | Enough for SDXL at 1216×832 with split-device (UNet on GPU, VAE on CPU) |
| Disk | 20 GB free | SDXL model ~6 GB + generated images + output video |
| Component | Recommended |
|---|---|
| CPU | 8-core, 3.5 GHz+ |
| RAM | 32 GB |
| GPU VRAM | 12 GB+ |
| Disk | 50 GB free (multiple runs + model cache) |
Any DirectML-capable AMD GPU works. Tested range:
| Class | Example | Notes |
|---|---|---|
| Entry (8 GB VRAM) | RX 6600, RX 6650 XT | Works; ~3–4 min/image at 20 steps |
| Mid (12 GB VRAM) | RX 6700 XT, RX 7700 XT | ~2 min/image |
| High (16 GB VRAM) | RX 6800 XT, RX 7900 GRE | ~90 s/image |
Integrated AMD graphics (Radeon 680M, etc.) will work but are very slow — expect 10–20 min/image.
| Class | Example | Notes |
|---|---|---|
| Entry (8 GB VRAM) | RTX 3070, RTX 4060 | Works; ~45–90 s/image at 20 steps |
| Mid (12 GB VRAM) | RTX 3080, RTX 4070 | ~30–50 s/image |
| High (24 GB VRAM) | RTX 3090, RTX 4090 | ~15–25 s/image |
These must be installed separately before running Chronicle Forge.
Downloads: ollama.com
Ollama runs the language model that writes the narration, scene prompts, and YouTube metadata.
# After installing Ollama, pull the model:
ollama pull nemotron-3-super:cloud
# Start the Ollama server (skip if it is already running):
ollama serve

- Ollama must be running (ollama serve) before starting the pipeline
- Default model: nemotron-3-super:cloud, a cloud-inference thinking model
- Any Ollama-compatible model works; set OLLAMA_MODEL in .env to change it
- The cloud model has a thinking phase (10–30 s of silence before tokens appear); this is normal
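
To confirm that Ollama is reachable and the model has been pulled before a long run, a pre-flight check along these lines may help. It is a sketch, not part of the pipeline: the function names are hypothetical, and it assumes Ollama's /api/tags endpoint, which lists locally available models.

```python
import json
import urllib.request


def model_is_listed(tags_payload, model):
    """Return True if `model` appears in an Ollama /api/tags response body."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def check_ollama(base_url="http://localhost:11434",
                 model="nemotron-3-super:cloud", timeout=5):
    """Return (reachable, model_present) without raising on connection errors."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError; malformed JSON is ValueError.
        return False, False
    return True, model_is_listed(payload, model)
```

Calling check_ollama() returns (False, False) when the server is down, which is a cheaper failure than a timeout halfway through the script step.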
Chronicle Forge needs an SDXL base model in diffusers format (a folder, not a single .safetensors file).
Option A — InvokeAI (recommended): InvokeAI downloads and converts SDXL models automatically and stores them in diffusers format. Point MODEL_PATH in .env at the model folder inside InvokeAI's model directory.
Option B — manual conversion: Download an SDXL .safetensors from CivitAI or Hugging Face and convert:
python -c "
from diffusers import StableDiffusionXLPipeline
pipe = StableDiffusionXLPipeline.from_single_file('your-model.safetensors')
pipe.save_pretrained('output-folder/')
"

Then set MODEL_PATH=output-folder/ in .env.
Recommended base models: stable-diffusion-xl-base-1.0, dreamshaperXL, realvisxlV50.
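
A quick way to tell a diffusers-format folder from a single-file checkpoint is to look for model_index.json at its root, next to a unet/ subfolder. The helper below is a hypothetical sanity check, not something the pipeline ships; it only inspects the layout and does not load the model.

```python
from pathlib import Path


def looks_like_diffusers_model(path):
    """Heuristic check that `path` is a diffusers-format model folder."""
    root = Path(path)
    if not root.is_dir():
        # A lone .safetensors file (or a missing path) is not usable here.
        return False
    # diffusers pipelines keep model_index.json at the root and the UNet
    # weights in a unet/ subfolder.
    return (root / "model_index.json").is_file() and (root / "unet").is_dir()
```

Running this against MODEL_PATH before launching catches the common mistake of pointing at a .safetensors file.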
MoviePy (used for the final video step) requires FFmpeg.
- Windows: download from ffmpeg.org and add to PATH, or install via winget install ffmpeg
- Linux: sudo apt install ffmpeg
Verify: ffmpeg -version
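
The same check can be made from Python, for example at the start of a run. This is a minimal sketch with hypothetical helper names; it only confirms that an executable is on PATH and reports its first version line.

```python
import shutil
import subprocess


def find_tool(name="ffmpeg"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def version_line(executable):
    """Return the first line printed by `<executable> -version`, or ''."""
    result = subprocess.run([executable, "-version"],
                            capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

If find_tool() returns None, the video step would fail later, so it may be worth stopping early with a clear message.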
python --version # must say Python 3.11.x

torch-directml ships pre-built wheels for Python 3.11 on Windows only. Python 3.12+ is not supported for the AMD path.

# Core dependencies
pip install -r requirements.txt
# Optional: local Kokoro TTS
pip install kokoro soundfile
# Optional: web UI
pip install flask

1. Clone and install
git clone <repo>
cd chronicle-forge
pip install -r requirements.txt

2. Create .env (copy from .env.example)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=nemotron-3-super:cloud
# Path to your SDXL model in diffusers format
MODEL_PATH=Z:\Programs\Invoke\models\your-model-id
# SDXL inference settings
INVOKE_STEPS=20
INVOKE_CFG=7.0
# Default edge-tts voice (can override with --voice)
TTS_VOICE=en-US-AndrewMultilingualNeural

3. Start Ollama
ollama serve
ollama pull nemotron-3-super:cloud

torch-directml lets AMD GPUs run SDXL on Windows without ROCm. The pipeline uses a specific split-device architecture to work around DirectML's limitations:
| Component | Device | Why |
|---|---|---|
| UNet | DirectML (GPU) | Heavy diffusion compute — this is where the GPU matters |
| VAE encoder/decoder | CPU (fp32) | DirectML's fp16 VAE produces corrupted output; CPU avoids it |
| Text encoders (CLIP / OpenCLIP) | CPU | Small models; keeping on CPU avoids device-mismatch crashes |
| Latents / scheduler | CPU | DirectML can't create random tensors; all book-keeping on CPU |
The UNet receives tensors from CPU, moves them to DML internally, and returns outputs back to CPU — a transparent bridge so the rest of the diffusers pipeline doesn't need to know about the device split.
Four patches are applied at load time (_patch_dml_unet in image_gen.py):
- CPU→DML→CPU bridge on unet.forward: all inputs are moved to DML on entry, and all outputs are returned to CPU on exit
- time_proj on CPU: DirectML cannot compute int64 timestep embeddings, so this layer runs on CPU and its result is then moved to DML
- time_embedding to DML: receives the CPU float from time_proj, which is explicitly moved to DML before proceeding
- add_embedding to DML (SDXL-specific): the extra conditioning embeddings SDXL appends are moved to DML before the add_embedding layer
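
The first patch, the CPU→DML→CPU bridge, can be pictured as a wrapper around the UNet's forward function. The sketch below is not the code in _patch_dml_unet: the helper names are hypothetical, it relies only on objects exposing a .to(device) method (as torch tensors do), and it ignores the time_proj and embedding special cases from the other three patches.

```python
import functools


def _move(obj, device):
    """Recursively move anything with a .to() method onto `device`."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(_move(item, device) for item in obj)
    if isinstance(obj, dict):
        return {key: _move(value, device) for key, value in obj.items()}
    return obj  # ints, floats, None, strings pass through untouched


def bridge_forward(forward, compute_device, host_device="cpu"):
    """Wrap `forward` so callers only ever see host-device tensors."""
    @functools.wraps(forward)
    def wrapped(*args, **kwargs):
        args = _move(args, compute_device)      # host -> accelerator on entry
        kwargs = _move(kwargs, compute_device)
        output = forward(*args, **kwargs)
        return _move(output, host_device)       # accelerator -> host on exit
    return wrapped
```

In use this would look like pipe.unet.forward = bridge_forward(pipe.unet.forward, dml). Note that diffusers output objects would need extra handling, because the dict branch here rebuilds them as plain dicts.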
Version requirements are strict:
torch-directml # latest (AMD, Windows only)
diffusers>=0.27.0,<0.31.0 # newer versions change internal call signatures that break the patches
transformers>=4.40.0
If you upgrade diffusers past 0.30.x and images stop generating, pin back to diffusers==0.30.3.
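
Because the patches depend on the diffusers version, a start-up guard can fail fast with a clear message. This is a sketch under the assumption that the supported range is exactly the one pinned above (>=0.27.0,<0.31.0); the function names are hypothetical.

```python
import re
from importlib import metadata

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(text):
    """Turn '0.30.3' into (0, 30, 3); suffixes such as '.dev0' are ignored."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(group or 0) for group in match.groups())


def diffusers_in_supported_range(version_text):
    """True when the version satisfies >=0.27.0,<0.31.0."""
    return (0, 27, 0) <= parse_version(version_text) < (0, 31, 0)


def installed_diffusers_ok():
    """Check the installed diffusers package; False if it is missing."""
    try:
        return diffusers_in_supported_range(metadata.version("diffusers"))
    except metadata.PackageNotFoundError:
        return False
```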
Install:
pip install torch-directml "diffusers>=0.27.0,<0.31.0" transformers accelerate safetensors

Replace the DirectML stack with standard PyTorch CUDA. No bridging patches are needed: CUDA handles full-pipeline GPU tensors without device splits.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install "diffusers>=0.27.0" transformers accelerate safetensors

Then in image_gen.py, change the device:
# Line ~199: replace
dml = torch_directml.device()
pipe.unet.to(dml)
# with:
pipe.to("cuda")

Also remove the _patch_dml_unet call and _decode_on_cpu (use output_type="pil" instead of "latent").
A browser-based interface with live logging, image preview, and step progress.
Install Flask:
pip install flask
# or
pip install -r requirements-web.txt

Launch:
python web_ui.py # opens http://localhost:5000 automatically
python web_ui.py --port 8080
python web_ui.py --no-browser # skip auto-open

Features:
- Settings form — topic, duration slider, TTS engine (edge / Kokoro), voice, image style, resume checkbox
- Live log stream — colour-coded output (step headers, progress, warnings, errors) via Server-Sent Events
- Image preview — polls every 2.5 seconds and shows the latest generated scene image
- Step progress bar — 6 steps, animates while active, turns green when done
- Elapsed timer — shown in the header while the pipeline runs
- Stop button — terminates the pipeline process
- Download link — appears when the final MP4 is ready
- Reconnect-safe: refreshing the browser replays the full log buffer
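
The live log stream and its reconnect behaviour can be approximated by a buffer that keeps every line and re-emits it in Server-Sent Events framing. The class and function names below are hypothetical, and the sketch leaves out Flask wiring and thread safety; it only shows the framing and replay idea.

```python
def format_sse(data, event_id=None):
    """Frame one message as a Server-Sent Event."""
    fields = []
    if event_id is not None:
        fields.append(f"id: {event_id}")
    # Multi-line payloads need one "data:" field per line.
    for part in data.split("\n"):
        fields.append(f"data: {part}")
    return "\n".join(fields) + "\n\n"  # a blank line terminates the event


class LogBuffer:
    """Keeps every log line so a reconnecting browser can replay them."""

    def __init__(self):
        self._lines = []

    def append(self, line):
        self._lines.append(line.rstrip("\n"))

    def events_since(self, last_id=0):
        """Yield SSE frames for all lines after `last_id` (0 replays all)."""
        for index in range(last_id, len(self._lines)):
            yield format_sse(self._lines[index], event_id=index + 1)
```

A fresh page load would call events_since(0) to replay the whole buffer, while a reconnecting client could pass the last event id it received.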
# Fully interactive (asks topic, duration, engine, voice)
python main.py
# Supply everything on the command line
python main.py --topic "the invention of dynamite" --duration 300
# 10-minute video with British narrator
python main.py --topic "the Chernobyl disaster" --duration 600 --voice ryan

# Microsoft Azure Neural (default, requires internet, highest quality)
python main.py --tts edge --voice christopher
# Kokoro — local, fully offline, CPU-native (~200 MB model)
python main.py --tts kokoro --voice george

# High-quality 2D illustration with cel shading (default)
python main.py --style refined
# Simple stick figures, Zenn-channel flat aesthetic
python main.py --style stickman

# Resume from the furthest completed step
python main.py --resume
# Resume with verbose image logging
python main.py --resume --logging

| Flag | Values | Default | Description |
|---|---|---|---|
| --topic | any string | (prompted) | Story topic |
| --duration | seconds | (prompted) | Target video length |
| --tts | edge, kokoro | (prompted) | TTS engine |
| --voice | see tables below | (prompted) | Narrator voice alias |
| --style | refined, stickman | refined | Image style |
| --resume | flag | off | Resume from checkpoint |
| --logging | flag | off | Verbose image generation output |
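
For reference, the flags above map naturally onto argparse. The parser below is a sketch of how they could be declared, not a copy of main.py; omitted values come back as None so the caller can fall back to interactive prompts.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Generate a narrated story video.")
    parser.add_argument("--topic", help="Story topic (prompted if omitted)")
    parser.add_argument("--duration", type=int,
                        help="Target video length in seconds (prompted if omitted)")
    parser.add_argument("--tts", choices=["edge", "kokoro"],
                        help="TTS engine (prompted if omitted)")
    parser.add_argument("--voice", help="Narrator voice alias (prompted if omitted)")
    parser.add_argument("--style", choices=["refined", "stickman"],
                        default="refined", help="Image style")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from checkpoint")
    parser.add_argument("--logging", action="store_true",
                        help="Verbose image generation output")
    return parser
```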
| Alias | Description |
|---|---|
| andrew | US male, natural (default) |
| guy | US male, confident narrator |
| christopher | US male, deep & authoritative |
| eric | US male, clear and engaging |
| aria | US female, warm and expressive |
| jenny | US female, natural and clear |
| ryan | British male, classic storyteller |
| sonia | British female, elegant |
| william | Australian male, casual authority |
| Alias | Description |
|---|---|
| heart | US female, warm (default) |
| bella | US female, bright |
| sarah | US female, natural |
| nova | US female, expressive |
| adam | US male |
| michael | US male, clear |
| george | British male, classic |
| lewis | British male, strong |
| emma | British female, warm |
All files land in output/:
output/
├── video_data.json ← full pipeline state (topic, narration, scenes, metadata)
├── narration.mp3 ← TTS audio (edge-tts)
├── narration.wav ← TTS audio (Kokoro)
├── final_video.mp4 ← finished video (1920×1080)
└── images/
├── scene_001.png
├── scene_002.png
└── ...
The video_data.json checkpoint means --resume can pick up from any failed or interrupted step without regenerating prior work.
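
The resume behaviour can be sketched as a small state file plus an ordered list of steps. Everything here is an assumption made for illustration: the step names, the completed_steps key, and the helper functions are hypothetical and may not match the real video_data.json schema.

```python
import json
from pathlib import Path

# Hypothetical step identifiers, in pipeline order.
STEPS = ["research", "script", "metadata", "tts", "images", "video"]


def load_state(path):
    """Return the saved state, or an empty dict when no checkpoint exists."""
    file = Path(path)
    return json.loads(file.read_text(encoding="utf-8")) if file.is_file() else {}


def save_state(path, state):
    """Write the checkpoint via a temp file so Ctrl+C cannot truncate it."""
    file = Path(path)
    file.parent.mkdir(parents=True, exist_ok=True)
    tmp = file.with_name(file.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(file)


def next_step(state):
    """Return the first step not yet marked complete, or None when done."""
    done = set(state.get("completed_steps", []))
    for step in STEPS:
        if step not in done:
            return step
    return None
```

Saving after every completed step is what would let an interrupted run restart at the first unfinished step rather than at the beginning.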
These topic types reliably produce strong narration and visually rich scenes.
the invention of dynamite
the accidental discovery of penicillin
how Post-it Notes were invented by accident
the creation of Velcro
the invention of the printing press and what it destroyed
how aspirin was discovered
how Kodak invented digital photography and then killed itself
the rise and fall of Blockbuster Video
how Theranos fooled Silicon Valley
the collapse of Enron
how Nokia lost the smartphone war it started
the Chernobyl disaster
the last days of Pompeii
how the Titanic actually sank
Ernest Shackleton's impossible survival in Antarctica
the 1918 Spanish flu pandemic
the Texas City Disaster of 1947
the Stanford Prison Experiment
Operation Paperclip — how NASA hired Nazi scientists
MK-Ultra and the CIA's mind control program
the real story of Nikola Tesla vs Thomas Edison
how the atomic bomb changed everything
how the Egyptian pyramids were actually built
the fall of the Roman Empire
the lost city of Pompeii
the real history of the Trojan War
how the Mongol Empire conquered the world
the Apollo 13 disaster and rescue
the race to the Moon (US vs Soviet Union)
the Voyager probes — humanity's furthest journey
the Challenger disaster and what NASA knew
the Cuban Missile Crisis — 13 days that nearly ended the world
the Berlin Wall — how it was built and how it fell
the double agent who fooled the KGB
Project Azorian — the CIA stole a Soviet submarine
Best for focused single-event stories:
- "the accidental discovery of penicillin"
- "how Blockbuster rejected Netflix"
- "the Stanford Prison Experiment"
- "the Texas City Disaster of 1947"

Room for full story arcs with turning points:
- "the invention of dynamite"
- "the Chernobyl disaster"
- "the Apollo 13 disaster and rescue"
- "how Kodak destroyed itself"

Full documentary feel with multiple acts:
- "the rise and fall of Theranos"
- "Ernest Shackleton's survival in Antarctica"
- "the Cuban Missile Crisis"
- "the real history of Nikola Tesla"
The script LLM (nemotron-3-super:cloud) writes emotionally driven narration, not a documentary list of facts. It follows a four-beat structure:
- Cold open — drop into a vivid scene with no context ("He held a vial so unstable one wrong move would kill everyone in the building.")
- Rising stakes — specific details, cause-and-effect, tension
- Retention hook — a question or revelation around the halfway point that makes stopping feel impossible
- Payoff — a closing image or consequence that lingers
Delivery is shaped through prose construction only:
- Short sentences hit hard
- Em-dashes (—) create mid-sentence pivots
- Ellipses (...) trail into suspense
- ALL CAPS on a single word for peak emphasis (rare)
- Paragraph breaks are breath marks
main.py ← pipeline orchestration, CLI, checkpointing
research_gen.py ← Wikipedia research fetch
script_gen.py ← Pass 1/2/3: narration, scene cuts, visual expansion
meta_gen.py ← YouTube title / description / tags
tts_gen.py ← edge-tts and Kokoro audio generation
image_gen.py ← SDXL image generation (DirectML / CUDA)
video_composer.py ← Ken Burns video compositor (moviepy + OpenCV)
torch-directml install fails or import crashes
This package only has Windows wheels for Python 3.11. Confirm your environment:
python --version # must be 3.11.x
pip install torch-directml
python -c "import torch_directml; print(torch_directml.device())"

If you're on 3.12+, create a 3.11 venv and reinstall everything.
unbox expects Dml at::Tensor as inputs crash during image generation
The CPU→DML bridge patch in _patch_dml_unet should prevent this. If it still fires, it usually means a diffusers version past 0.31.0 is installed — it changes internal call signatures that break the patches. Pin back:
pip install "diffusers==0.30.3"RuntimeError during VAE decode / black or corrupted images
DirectML fp16 VAE decode produces garbage output. The pipeline already decodes on CPU in fp32 via _decode_on_cpu. If you see black images, check that pipe.vae is not on the DML device — it should be on CPU.
Images are slow (>3 min per image)
Normal for AMD+DirectML. Expect 90–180 seconds per image at 20 steps on a mid-range AMD card. Reduce steps:
INVOKE_STEPS=15

Ollama times out on long prompts
The cloud thinking model is slow. The pipeline uses 180s per batch call and 600s for the narration. If it times out, partial output is saved. Run --resume to continue.
Images look photorealistic instead of illustrated
This is expected with --style refined — it produces quality 2D illustration. For stick figures use --style stickman. If refined images still look too photorealistic, increase INVOKE_CFG to 8.0 or 9.0.
Kokoro not found
pip install kokoro soundfile

edge-tts fails / no audio
Requires an internet connection. The Microsoft Azure TTS endpoint is free but rate-limited. Try again in a few seconds, or switch to --tts kokoro for offline use.
Video audio out of sync
If you switch TTS engines between runs, delete the old audio file and re-run TTS:
del output\narration.mp3
del output\narration.wav
python main.py --resume

The pipeline uses TTS word-boundary timestamps for image sync; mismatched audio from a different engine or run will desync.
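
To see why stale audio breaks sync, it helps to look at how word timestamps could be turned into per-scene display windows. The function below is a simplified sketch with assumed inputs (a list of word start times in seconds and a word count per scene); the real data layout in the pipeline may differ.

```python
def scene_windows(word_starts, scene_word_counts, audio_duration):
    """Map each scene to a (start, end) window in seconds.

    A scene begins when its first word is spoken and ends when the next
    scene's first word begins (or when the audio ends).
    """
    if any(count <= 0 for count in scene_word_counts):
        raise ValueError("every scene needs at least one word")
    if sum(scene_word_counts) != len(word_starts):
        raise ValueError("word counts do not match the timestamp list")
    windows = []
    index = 0
    for count in scene_word_counts:
        start = word_starts[index]
        index += count
        end = word_starts[index] if index < len(word_starts) else audio_duration
        windows.append((start, end))
    return windows
```

With timestamps from one engine and audio from another, the word count check above would typically fail, or the windows would drift from what is actually spoken.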