Chronicle Forge

A local-AI powered YouTube story video generation pipeline. Give it a topic, get a fully narrated, image-synced MP4 — script, voiceover, visuals, and YouTube metadata included.

What It Does

Chronicle Forge runs a 6-step pipeline fully automatically:

Step	What happens
1 — Research	Fetches Wikipedia context for the topic
2 — Script	Pass 1: writes the narration story (with delivery annotations stripped to punctuation); Pass 2: splits it into scenes and writes image prompts; Pass 3: expands scenes into visual sub-moments (action → reaction → emotion)
3 — Metadata	Generates YouTube title, description, and tags
4 — TTS	Narrates the script with edge-tts or Kokoro, capturing word-level timestamps
5 — Images	Generates one SDXL image per scene (16:9, Ken Burns ready)
6 — Video	Composites timestamp-synced Ken Burns slideshow at 1920×1080

Every step is checkpointed. Press Ctrl+C at any time, then rerun with --resume to continue from where you left off.

System Requirements

Operating System

Path	OS
AMD GPU (DirectML)	Windows 10/11 only — `torch-directml` is Windows-exclusive
NVIDIA GPU (CUDA)	Windows or Linux
CPU-only (no images)	Any OS

Hardware

Minimum

Component	Minimum	Notes
CPU	4-core, 2.5 GHz	Used for TTS, video composition, VAE decode
RAM	16 GB	SDXL loads ~4 GB into RAM; pipeline buffers images
GPU VRAM	8 GB	Enough for SDXL at 1216×832 with split-device (UNet on GPU, VAE on CPU)
Disk	20 GB free	SDXL model ~6 GB + generated images + output video

AMD GPUs (DirectML, Windows)

Any DirectML-capable AMD GPU works. Tested range:

Class	Example	Notes
Entry (8 GB VRAM)	RX 6600, RX 6650 XT	Works; ~3–4 min/image at 20 steps
Mid (12 GB VRAM)	RX 6700 XT, RX 7700 XT	~2 min/image
High (16 GB VRAM)	RX 6800 XT, RX 7900 GRE	~90 s/image

Integrated AMD graphics (Radeon 680M, etc.) will work but are very slow — expect 10–20 min/image.

NVIDIA GPUs (CUDA)

Class	Example	Notes
Entry (8 GB VRAM)	RTX 3070, RTX 4060	Works; ~45–90 s/image at 20 steps
Mid (12 GB VRAM)	RTX 3080, RTX 4070	~30–50 s/image
High (24 GB VRAM)	RTX 3090, RTX 4090	~15–25 s/image

External Software

These must be installed separately before running Chronicle Forge.

Ollama — required (LLM backend)

Downloads: ollama.com

Ollama runs the language model that writes the narration, scene prompts, and YouTube metadata.

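# After installing Ollama, start the server
# (skip this if Ollama already runs as a background service):
ollama serve

# In a second terminal, pull the model:
ollama pull nemotron-3-super:cloud

# Verify the server responds and the model is listed:
ollama list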

Ollama must be running (ollama serve) before starting the pipeline
Default model: nemotron-3-super:cloud — a cloud-inference thinking model
Any Ollama-compatible model works; set OLLAMA_MODEL in .env to change it
The cloud model has a thinking phase (10–30 s of silence before tokens appear) — this is normal
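
Before starting a run, it can help to confirm that the Ollama server is reachable. The sketch below is an illustration and not part of Chronicle Forge: it queries Ollama's /api/tags endpoint, which lists the models the server knows about, using only the standard library. The function name ollama_models is hypothetical, and the default URL is Ollama's usual local address.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the model names reported by Ollama, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

A result of None means the server is not answering, so start it with ollama serve before launching the pipeline.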

SDXL Model — required (image generation)

Chronicle Forge needs an SDXL base model in diffusers format (a folder, not a single .safetensors file).

Option A — InvokeAI (recommended): InvokeAI downloads and converts SDXL models automatically and stores them in diffusers format. Point MODEL_PATH in .env at the model folder inside InvokeAI's model directory.

Option B — manual conversion: Download an SDXL .safetensors from CivitAI or Hugging Face and convert:

python -c "
from diffusers import StableDiffusionXLPipeline
pipe = StableDiffusionXLPipeline.from_single_file('your-model.safetensors')
pipe.save_pretrained('output-folder/')
"

Then set MODEL_PATH=output-folder/ in .env.

Recommended base models: stable-diffusion-xl-base-1.0, dreamshaperXL, realvisxlV50.
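
A quick way to tell whether MODEL_PATH points at a diffusers-format folder rather than a single checkpoint file is to look for the standard SDXL layout. This is a heuristic sketch, not a check the pipeline is known to perform; the function name is hypothetical, and the list of entries reflects the usual diffusers SDXL export (model_index.json plus component subfolders).

```python
from pathlib import Path


def looks_like_diffusers_sdxl(model_path):
    """Heuristic: True if model_path is a folder with the usual SDXL diffusers layout."""
    root = Path(model_path)
    if not root.is_dir():
        return False  # e.g. a single .safetensors file
    required = ["model_index.json", "unet", "vae", "text_encoder", "text_encoder_2"]
    return all((root / name).exists() for name in required)
```

If this returns False for a .safetensors download, convert it first as described in Option B.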

FFmpeg — required (video composition)

MoviePy (used for the final video step) requires FFmpeg.

Windows: Download from ffmpeg.org and add to PATH, or install via winget install ffmpeg
Linux: sudo apt install ffmpeg

Verify: ffmpeg -version

Python 3.11 — required

python --version   # must say Python 3.11.x

torch-directml ships pre-built wheels for Python 3.11 on Windows only. Python 3.12+ is not supported for the AMD path.

Requirements

Python packages

pip install -r requirements.txt

Optional — Kokoro TTS (local, offline)

pip install kokoro soundfile

Optional — Web UI

pip install flask

Setup

1. Clone and install

git clone <repo>
cd chronicle-forge
pip install -r requirements.txt

2. Create .env (copy from .env.example)

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=nemotron-3-super:cloud

# Path to your SDXL model in diffusers format
MODEL_PATH=Z:\Programs\Invoke\models\your-model-id

# SDXL inference settings
INVOKE_STEPS=20
INVOKE_CFG=7.0

# Default edge-tts voice (can override with --voice)
TTS_VOICE=en-US-AndrewMultilingualNeural
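
To illustrate how these settings might be read, here is a minimal KEY=VALUE parser using only the standard library. It is an assumption-laden sketch: the project may well use a library such as python-dotenv instead, and parse_env is a hypothetical name. Values are kept verbatim, which matters for Windows paths containing backslashes.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # no unescaping: backslashes survive
    return settings


sample = r"""
OLLAMA_BASE_URL=http://localhost:11434
# Path to your SDXL model in diffusers format
MODEL_PATH=Z:\Programs\Invoke\models\your-model-id
INVOKE_STEPS=20
INVOKE_CFG=7.0
"""

cfg = parse_env(sample)
steps = int(cfg.get("INVOKE_STEPS", "20"))        # numeric settings need explicit casts
cfg_scale = float(cfg.get("INVOKE_CFG", "7.0"))
```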

3. Start Ollama

ollama serve                          # leave this running; use a second terminal for the pull
ollama pull nemotron-3-super:cloud

GPU Setup

AMD GPU — DirectML

torch-directml lets AMD GPUs run SDXL on Windows without ROCm. The pipeline uses a specific split-device architecture to work around DirectML's limitations:

Component	Device	Why
UNet	DirectML (GPU)	Heavy diffusion compute — this is where the GPU matters
VAE encoder/decoder	CPU (fp32)	DirectML's fp16 VAE produces corrupted output; CPU avoids it
Text encoders (CLIP / OpenCLIP)	CPU	Small models; keeping on CPU avoids device-mismatch crashes
Latents / scheduler	CPU	DirectML can't create random tensors; all book-keeping on CPU

The UNet receives tensors from CPU, moves them to DML internally, and returns outputs back to CPU — a transparent bridge so the rest of the diffusers pipeline doesn't need to know about the device split.

Four patches are applied at load time (_patch_dml_unet in image_gen.py):

CPU→DML→CPU bridge on unet.forward — all inputs moved to DML on entry, all outputs returned to CPU on exit
time_proj on CPU — DirectML cannot compute int64 timestep embeddings; this layer runs on CPU then the result is moved to DML
time_embedding to DML — receives the CPU float from time_proj, explicitly moved to DML before proceeding
add_embedding to DML (SDXL-specific) — the extra conditioning embeddings SDXL appends are moved to DML before the add_embedding layer
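
The first patch above, the CPU→DML→CPU bridge, follows a general wrapper pattern that can be sketched without any GPU library. The code below is an illustration only: the real _patch_dml_unet is not reproduced in this README, bridge_forward and FakeTensor are hypothetical names, and the sketch relies on duck typing (any object with a .to(device) method is treated as a tensor). A real UNet returns an output object rather than a bare tensor, which an actual patch would also have to unpack.

```python
import functools


def bridge_forward(forward, compute_device, home_device="cpu"):
    """Wrap forward so inputs move to compute_device and outputs return home."""

    def move(obj, device):
        if hasattr(obj, "to"):
            return obj.to(device)
        if isinstance(obj, (list, tuple)):
            return type(obj)(move(item, device) for item in obj)
        if isinstance(obj, dict):
            return {key: move(value, device) for key, value in obj.items()}
        return obj  # ints, floats, None, ... pass through unchanged

    @functools.wraps(forward)
    def wrapped(*args, **kwargs):
        args = move(args, compute_device)
        kwargs = move(kwargs, compute_device)
        return move(forward(*args, **kwargs), home_device)

    return wrapped


class FakeTensor:
    """Stand-in for a tensor: only tracks a value and a device label."""

    def __init__(self, value, device="cpu"):
        self.value, self.device = value, device

    def to(self, device):
        return FakeTensor(self.value, device)


seen = []  # devices observed inside the wrapped function


def fake_unet(sample, timestep=None):
    seen.append(sample.device)
    return FakeTensor(sample.value * 2, sample.device)


bridged = bridge_forward(fake_unet, compute_device="dml")
out = bridged(FakeTensor(3), timestep=10)
```

The caller hands in a CPU object and gets a CPU object back, while the wrapped function only ever sees the compute device.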

Version requirements are strict:

torch-directml          # latest (AMD, Windows only)
diffusers>=0.27.0,<0.31.0   # newer versions change internal call signatures that break the patches
transformers>=4.40.0

If you upgrade diffusers past 0.30.x and images stop generating, pin back to diffusers==0.30.3.
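
A small guard can catch an out-of-range diffusers install before image generation starts. This is a sketch under the version bounds listed above, not a check that ships with the pipeline; the function names are hypothetical, and the version parsing is deliberately simple (it reads only the first three numeric components).

```python
import re
from importlib import metadata

SUPPORTED_MIN = (0, 27, 0)  # inclusive
SUPPORTED_MAX = (0, 31, 0)  # exclusive


def version_tuple(version):
    """Turn '0.30.3' or '0.31.0.dev0' into a 3-tuple of ints."""
    parts = re.findall(r"\d+", version)[:3]
    parts += ["0"] * (3 - len(parts))
    return tuple(int(part) for part in parts)


def diffusers_supported(version):
    return SUPPORTED_MIN <= version_tuple(version) < SUPPORTED_MAX


def installed_diffusers_ok():
    """True or False for the installed diffusers, None if it is not installed."""
    try:
        return diffusers_supported(metadata.version("diffusers"))
    except metadata.PackageNotFoundError:
        return None
```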

Install:

pip install torch-directml "diffusers>=0.27.0,<0.31.0" transformers accelerate safetensors

NVIDIA GPU — CUDA

Replace the DirectML stack with standard PyTorch CUDA. No bridging patches are needed — CUDA handles full-pipeline GPU tensors without device splits.

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install "diffusers>=0.27.0" transformers accelerate safetensors

Then in image_gen.py, change the device:

# Line ~199: replace
dml = torch_directml.device()
pipe.unet.to(dml)
# with:
pipe.to("cuda")

Also remove the _patch_dml_unet call and the _decode_on_cpu step, and use output_type="pil" instead of "latent" so the pipeline returns decoded images directly.

Web UI

A browser-based interface with live logging, image preview, and step progress.

Install Flask:

pip install flask
# or
pip install -r requirements-web.txt

Launch:

python web_ui.py           # opens http://localhost:5000 automatically
python web_ui.py --port 8080
python web_ui.py --no-browser   # skip auto-open

Features:

Settings form — topic, duration slider, TTS engine (edge / Kokoro), voice, image style, resume checkbox
Live log stream — colour-coded output (step headers, progress, warnings, errors) via Server-Sent Events
Image preview — polls every 2.5 seconds and shows the latest generated scene image
Step progress bar — 6 steps, animates while active, turns green when done
Elapsed timer — shown in the header while the pipeline runs
Stop button — terminates the pipeline process
Download link — appears when the final MP4 is ready
Reconnect-safe: refreshing the browser replays the full log buffer

Basic usage (command line)

# Fully interactive (asks topic, duration, engine, voice)
python main.py

# Supply everything on the command line
python main.py --topic "the invention of dynamite" --duration 300

# 10-minute video with British narrator
python main.py --topic "the Chernobyl disaster" --duration 600 --voice ryan

TTS engine

# Microsoft Azure Neural (default, requires internet, highest quality)
python main.py --tts edge --voice christopher

# Kokoro — local, fully offline, CPU-native (~200 MB model)
python main.py --tts kokoro --voice george

Image style

# High-quality 2D illustration with cel shading (default)
python main.py --style refined

# Simple stick figures, Zenn-channel flat aesthetic
python main.py --style stickman

Resume / checkpoint

# Resume from the furthest completed step
python main.py --resume

# Resume with verbose image logging
python main.py --resume --logging

All flags

Flag	Values	Default	Description
`--topic`	any string	(prompted)	Story topic
`--duration`	seconds	(prompted)	Target video length
`--tts`	`edge`, `kokoro`	(prompted)	TTS engine
`--voice`	see tables below	(prompted)	Narrator voice alias
`--style`	`refined`, `stickman`	`refined`	Image style
`--resume`	flag	off	Resume from checkpoint
`--logging`	flag	off	Verbose image generation output
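
For reference, the flag table above maps naturally onto an argparse definition. The following is a hedged sketch of what such a parser could look like; the actual parser in main.py is not shown in this README, so build_parser and the help strings are assumptions. Flags without a default are left as None here, standing in for "(prompted)".

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--topic", help="Story topic (prompted if omitted)")
    parser.add_argument("--duration", type=int, help="Target video length in seconds")
    parser.add_argument("--tts", choices=["edge", "kokoro"], help="TTS engine")
    parser.add_argument("--voice", help="Narrator voice alias")
    parser.add_argument("--style", choices=["refined", "stickman"], default="refined")
    parser.add_argument("--resume", action="store_true", help="Resume from checkpoint")
    parser.add_argument("--logging", action="store_true", help="Verbose image output")
    return parser


args = build_parser().parse_args(
    ["--topic", "the invention of dynamite", "--duration", "300"]
)
```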

Voices

edge-tts (online)

Alias	Description
`andrew`	US male, natural — default
`guy`	US male, confident narrator
`christopher`	US male, deep & authoritative
`eric`	US male, clear and engaging
`aria`	US female, warm and expressive
`jenny`	US female, natural and clear
`ryan`	British male, classic storyteller
`sonia`	British female, elegant
`william`	Australian male, casual authority

Kokoro (offline)

Alias	Description
`heart`	US female, warm — default
`bella`	US female, bright
`sarah`	US female, natural
`nova`	US female, expressive
`adam`	US male
`michael`	US male, clear
`george`	British male, classic
`lewis`	British male, strong
`emma`	British female, warm

Output

All files land in output/:

output/
├── video_data.json     ← full pipeline state (topic, narration, scenes, metadata)
├── narration.mp3       ← TTS audio (edge-tts)
├── narration.wav       ← TTS audio (Kokoro)
├── final_video.mp4     ← finished video (1920×1080)
└── images/
    ├── scene_001.png
    ├── scene_002.png
    └── ...

The video_data.json checkpoint means --resume can pick up from any failed or interrupted step without regenerating prior work.
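
The resume behaviour can be pictured as a loop that skips any step already recorded in the checkpoint file. The sketch below is illustrative: the "completed" key and the step names are assumptions, since the real schema of video_data.json is not documented here. The state is written to a temporary file and then renamed, so an interrupted write does not corrupt the checkpoint.

```python
import json
from pathlib import Path

STEPS = ["research", "script", "metadata", "tts", "images", "video"]


def load_state(path):
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text(encoding="utf-8"))
    return {"completed": []}


def save_state(path, state):
    p = Path(path)
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(p)  # rename over the old checkpoint in one step


def run_pipeline(path, handlers, resume=False):
    """Run each step in order, skipping steps the checkpoint marks as done."""
    state = load_state(path) if resume else {"completed": []}
    for step in STEPS:
        if step in state["completed"]:
            continue
        handlers[step](state)          # may raise or be interrupted
        state["completed"].append(step)
        save_state(path, state)        # checkpoint after every finished step
    return state
```

Here handlers is a dict mapping each step name to a callable that receives the shared state; an exception in one step leaves all earlier steps recorded on disk.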

Sample Topics

These topic types reliably produce strong narration and visually rich scenes.

Historical Inventions (dramatic origin stories)

the invention of dynamite
the accidental discovery of penicillin
how Post-it Notes were invented by accident
the creation of Velcro
the invention of the printing press and what it destroyed
how aspirin was discovered

Rise & Fall (corporate tragedy)

how Kodak invented digital photography and then killed itself
the rise and fall of Blockbuster Video
how Theranos fooled Silicon Valley
the collapse of Enron
how Nokia lost the smartphone war it started

Disasters & Survival

the Chernobyl disaster
the last days of Pompeii
how the Titanic actually sank
Ernest Shackleton's impossible survival in Antarctica
the 1918 Spanish flu pandemic
the Texas City Disaster of 1947

Scientific Secrets

the Stanford Prison Experiment
Operation Paperclip: how the US recruited Nazi scientists
MK-Ultra and the CIA's mind control program
the real story of Nikola Tesla vs Thomas Edison
how the atomic bomb changed everything

Ancient World

how the Egyptian pyramids were actually built
the fall of the Roman Empire
the lost city of Pompeii
the real history of the Trojan War
how the Mongol Empire conquered the world

Space & Exploration

the Apollo 13 disaster and rescue
the race to the Moon (US vs Soviet Union)
the Voyager probes — humanity's furthest journey
the Challenger disaster and what NASA knew

Cold War & Espionage

the Cuban Missile Crisis — 13 days that nearly ended the world
the Berlin Wall — how it was built and how it fell
the double agent who fooled the KGB
Project Azorian — the CIA stole a Soviet submarine

Sample Prompts by Duration
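
The word counts quoted for each duration in this section all work out to the same narration pace: 750 words in 5 minutes, 1200 in 8, and 1500 in 10 are each 150 words per minute. The helper below shows that arithmetic; the 150 figure is inferred from those numbers, not a documented setting, and target_words is a hypothetical name.

```python
WORDS_PER_MINUTE = 150  # inferred from the duration headings, not a documented setting


def target_words(duration_seconds, wpm=WORDS_PER_MINUTE):
    """Approximate narration length in words for a target duration."""
    return round(duration_seconds / 60 * wpm)


budgets = {seconds: target_words(seconds) for seconds in (300, 480, 600)}
```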

5 minutes (300s — ~750 words)

Best for focused single-event stories:

"the accidental discovery of penicillin"
"how Blockbuster rejected Netflix"
"the Stanford Prison Experiment"
"the Texas City Disaster of 1947"

8 minutes (480s — ~1200 words)

Room for full story arcs with turning points:

"the invention of dynamite"
"the Chernobyl disaster"
"the Apollo 13 disaster and rescue"
"how Kodak destroyed itself"

10 minutes (600s — ~1500 words)

Full documentary feel with multiple acts:

"the rise and fall of Theranos"
"Ernest Shackleton's survival in Antarctica"
"the Cuban Missile Crisis"
"the real history of Nikola Tesla"

How the Narration Works

The script LLM (nemotron-3-super:cloud) writes emotionally driven narration rather than a documentary list of facts. It follows a four-beat structure:

Cold open — drop into a vivid scene with no context ("He held a vial so unstable one wrong move would kill everyone in the building.")
Rising stakes — specific details, cause-and-effect, tension
Retention hook — a question or revelation around the halfway point that makes stopping feel impossible
Payoff — a closing image or consequence that lingers

Delivery is shaped through prose construction only:

Short sentences hit hard
Em-dashes (—) create mid-sentence pivots
Ellipses (...) trail into suspense
ALL CAPS on a single word for peak emphasis (rare)
Paragraph breaks are breath marks

Architecture

main.py              ← pipeline orchestration, CLI, checkpointing
research_gen.py      ← Wikipedia research fetch
script_gen.py        ← Pass 1/2/3: narration, scene cuts, visual expansion
meta_gen.py          ← YouTube title / description / tags
tts_gen.py           ← edge-tts and Kokoro audio generation
image_gen.py         ← SDXL image generation (DirectML / CUDA)
video_composer.py    ← Ken Burns video compositor (moviepy + OpenCV)
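
As a rough picture of what a Ken Burns compositor computes per frame, the sketch below returns the crop window for a slow centred zoom-in on a still image. It is not taken from video_composer.py: the function name, the centred zoom, and the 1.15 end zoom are all assumptions, and a real compositor may also pan. The crop would then be resized to the 1920×1080 output frame.

```python
def ken_burns_crop(img_w, img_h, progress, zoom_end=1.15, aspect=16 / 9):
    """Return (x, y, w, h) of a centred crop window for a slow zoom-in.

    progress runs from 0.0 (start of the scene) to 1.0 (end of the scene).
    """
    progress = min(max(progress, 0.0), 1.0)
    # Largest window of the target aspect ratio that fits inside the image.
    base_w = min(img_w, img_h * aspect)
    base_h = base_w / aspect
    # Zoom grows linearly over the scene; a larger zoom means a smaller window.
    zoom = 1.0 + (zoom_end - 1.0) * progress
    w, h = base_w / zoom, base_h / zoom
    return (round((img_w - w) / 2), round((img_h - h) / 2), round(w), round(h))


start = ken_burns_crop(1216, 832, 0.0)
end = ken_burns_crop(1216, 832, 1.0, zoom_end=2.0)
```

For a 1216×832 source image, the starting window is the full width with the top and bottom trimmed to 16:9.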

Troubleshooting

torch-directml install fails or import crashes

This package only has Windows wheels for Python 3.11. Confirm your environment:

python --version          # must be 3.11.x
pip install torch-directml
python -c "import torch_directml; print(torch_directml.device())"

If you're on 3.12+, create a 3.11 venv and reinstall everything.

"unbox expects Dml at::Tensor as inputs" crash during image generation

The CPU→DML bridge patch in _patch_dml_unet should prevent this. If the error still occurs, diffusers 0.31.0 or later is probably installed; those releases change internal call signatures that break the patches. Pin back:

pip install "diffusers==0.30.3"

RuntimeError during VAE decode / black or corrupted images

DirectML fp16 VAE decode produces garbage output. The pipeline already decodes on CPU in fp32 via _decode_on_cpu. If you see black images, check that pipe.vae is on the CPU and not on the DML device.

Images are slow (>3 min per image)

This is normal for AMD + DirectML. Expect 90–180 seconds per image at 20 steps on a mid-range AMD card, and 3–4 minutes on an 8 GB entry-level card. To speed things up, reduce the step count in .env:

INVOKE_STEPS=15

Ollama times out on long prompts

The cloud thinking model is slow. The pipeline allows 180 s per batch call and 600 s for the narration pass. If a call times out, partial output is saved; run with --resume to continue.

Images look photorealistic instead of illustrated

--style refined (the default) targets high-quality 2D illustration, so some realism is expected. For stick figures, use --style stickman. If refined images still look too photorealistic, increase INVOKE_CFG to 8.0 or 9.0.

Kokoro not found

pip install kokoro soundfile

edge-tts fails / no audio

edge-tts requires an internet connection. The Microsoft online TTS service it calls is free but rate-limited. Try again in a few seconds, or switch to --tts kokoro for offline use.

Video audio out of sync

If you switch TTS engines between runs, delete the old audio files and re-run TTS:

del output\narration.mp3
del output\narration.wav
python main.py --resume

The pipeline uses TTS word-boundary timestamps for image sync — mismatched audio from a different engine/run will desync.
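
To see why stale audio breaks the sync, consider how scene display times can be derived from word timestamps. The sketch below is an assumption about the general approach, not the code in video_composer.py: scene_windows is a hypothetical helper that takes one (start, end) pair per narrated word plus the number of words in each scene, and returns when each scene should be on screen.

```python
def scene_windows(word_times, scene_word_counts):
    """Map each scene to a (start, end) time in seconds.

    word_times: list of (start, end) tuples, one per narrated word.
    scene_word_counts: number of narrated words in each scene, in order.
    """
    if any(count < 1 for count in scene_word_counts):
        raise ValueError("every scene needs at least one word")
    if sum(scene_word_counts) != len(word_times):
        raise ValueError("scene word counts do not match the number of timed words")
    windows, index = [], 0
    for count in scene_word_counts:
        first = word_times[index][0]
        index += count
        # A scene stays on screen until the next scene's first word begins.
        if index < len(word_times):
            last = word_times[index][0]
        else:
            last = word_times[-1][1]
        windows.append((first, last))
    return windows


timings = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.6), (2.0, 2.5), (2.6, 3.0)]
windows = scene_windows(timings, [3, 2])
```

Timestamps from one TTS run describe only that run's audio, so pairing them with a file produced by another engine shifts every scene boundary.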

Recommended hardware

Component	Recommended
CPU	8-core, 3.5 GHz+
RAM	32 GB
GPU VRAM	12 GB+
Disk	50 GB free (multiple runs + model cache)
