Transcribe multi-speaker audio and video recordings with speaker diarization. Built for tabletop RPG sessions but works for any multi-speaker recording (meetings, interviews, podcasts).
The pipeline uses WhisperX for transcription with word-level alignment, and pyannote for speaker diarization.
Session writeups produced from transcripts generated by this pipeline:
- Shadowmaze — Session 47 (February 9, 2026)
- Shadowmaze — Session 48 (February 10, 2026)
Extracts the native audio stream from a video file without transcoding. Uses ffprobe to detect the audio codec and maps it to the correct container format.
extract-audio session.mkv
# -> session.opus (or .m4a, .mp3, etc. depending on source codec)

Supported codec mappings: AAC, MP3, Opus, Vorbis, FLAC, PCM, ALAC, AC3, EAC3, WMA. Unknown codecs fall back to using the codec name as the extension.
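A minimal sketch of that approach (not the actual bin/extract-audio script; the codec-to-extension table below is illustrative and covers only a few of the mappings listed above):

```bash
#!/usr/bin/env bash
# Sketch: probe the audio codec, pick a matching container extension,
# and copy the stream without re-encoding.
set -euo pipefail

input="$1"
codec=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name -of csv=p=0 "$input")

case "$codec" in
    aac|alac) ext=m4a ;;
    mp3)      ext=mp3 ;;
    opus)     ext=opus ;;
    vorbis)   ext=ogg ;;
    flac)     ext=flac ;;
    *)        ext=$codec ;;   # unknown codec: fall back to the codec name
esac

out="${input%.*}.$ext"
ffmpeg -i "$input" -vn -c:a copy "$out"
echo "$out"
```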
Transcribes an audio or video file with speaker diarization. Automatically calls extract-audio if given a video file.
transcribe session.mkv # video input (extracts audio first)
transcribe session.opus # audio input (transcribes directly)
# -> session.txt

Both scripts are idempotent: if the output file already exists, they print its path and exit.
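Internally, transcribe is a thin wrapper around the whisperx CLI. A simplified sketch, assuming the venv path from the installation section below and the CPU defaults described later (the real bin/transcribe may differ in details such as the model name and the list of video extensions):

```bash
#!/usr/bin/env bash
# Sketch: skip work if the transcript exists, extract audio from video
# containers, then run WhisperX with alignment and diarization.
set -euo pipefail

input="$1"
output="${input%.*}.txt"

# Idempotent: if the transcript already exists, print its path and exit.
if [[ -e "$output" ]]; then
    echo "$output"
    exit 0
fi

# Video containers get their audio pulled out first (extension list abbreviated).
case "$input" in
    *.mkv|*.mp4|*.webm) input=$(extract-audio "$input") ;;
esac

~/.local/share/whisperx/bin/whisperx "$input" \
    --model large-v3 \
    --diarize --hf_token "$HF_TOKEN" \
    --device cpu --compute_type int8 \
    --output_format txt
```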
Parses a saved Roll20 chat log HTML page into timestamped, structured text. Extracts player names, character names, ability/weapon names, and roll results.
# Parse entire campaign log
parse-roll20-log "Chat Log for My Campaign.html" > full-log.txt
# Filter to a single session by date
parse-roll20-log "Chat Log for My Campaign.html" --session 2026-02-10 > session.txtTo save the HTML: open your Roll20 game, click the chat archive button (speech bubble icon in the chat tab), then save the page (Ctrl+S) as a complete HTML file.
No external Python dependencies required — uses only the standard library. Output format:
[February 10, 2026 9:06PM] nikki: Irulan: longsword (+6): 13 20 10
[February 10, 2026 9:14PM] Matthew: Bancroft Barleychaser: Light (+3): 7 7
[February 10, 2026 9:55PM] Matthew: Bancroft Barleychaser: Strength (3): 22 8
sudo apt install ffmpeg

ffmpeg and ffprobe are used for audio codec detection and stream extraction.
WhisperX runs in an isolated venv. Create it and install:
python3 -m venv ~/.local/share/whisperx
~/.local/share/whisperx/bin/pip install whisperx

The scripts expect whisperx at ~/.local/share/whisperx/bin/whisperx. Adjust the path in bin/transcribe if you install it elsewhere.
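A quick sanity check that the CLI entry point ended up on the expected path (assuming the venv location above):

```bash
~/.local/share/whisperx/bin/whisperx --help
```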
Speaker diarization requires accepting the pyannote model licenses and providing an access token:
- Create an account at huggingface.co
- Accept the license for pyannote/speaker-diarization-3.1
- Accept the license for pyannote/segmentation-3.0
- Create an access token at huggingface.co/settings/tokens
- Export it in your shell:
export HF_TOKEN="hf_your_token_here"

PyTorch 2.6 and later default torch.load to weights_only=True, which breaks pyannote's model loading. Until upstream fixes this, patch lightning_fabric:
VENV=~/.local/share/whisperx
CLOUD_IO="$VENV/lib/python3.*/site-packages/lightning_fabric/utilities/cloud_io.py"
# Add weights_only=False default for local file loads
sed -i '/^ fs = get_filesystem/i\ if weights_only is None:\n weights_only = False' $CLOUD_IO

This sets weights_only=False for local checkpoint files only. The security implications are minimal since pyannote models are downloaded from Hugging Face's authenticated model hub.
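Two quick checks before the first long run can save hours: confirm the token is visible to huggingface_hub (the huggingface-cli tool comes in as a WhisperX dependency; the venv path is assumed as above) and confirm the sed patch actually landed:

```bash
# Should print your Hugging Face username rather than an authentication error.
~/.local/share/whisperx/bin/huggingface-cli whoami

# Should print the inserted weights_only guard (reuses CLOUD_IO from above).
grep -n "weights_only = False" $CLOUD_IO
```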
The fastest path. Install PyTorch with CUDA support and change --device cpu --compute_type int8 to --device cuda --compute_type float16 in bin/transcribe.
~/.local/share/whisperx/bin/pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

PyTorch works with ROCm, but WhisperX's transcription engine (faster-whisper / ctranslate2) is CUDA-only. The transcription step runs on CPU; the diarization and alignment steps can use the GPU.
~/.local/share/whisperx/bin/pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4

Leave --device cpu in the transcribe script. The pyannote diarization pipeline will detect and use the ROCm GPU automatically.
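With either GPU build, it is worth confirming that the venv's PyTorch actually sees the device before starting a multi-hour run; torch.cuda.is_available() also reports True on ROCm builds:

```bash
~/.local/share/whisperx/bin/python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```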
Works out of the box. A modern CPU (Zen 5, recent Intel) handles a 2-hour recording in roughly 25-50 minutes with --compute_type int8.
git clone https://github.com/matthewjhunter/ai-session-notes.git
cp ai-session-notes/bin/* ~/bin/
chmod +x ~/bin/extract-audio ~/bin/transcribe

Ensure ~/bin is in your PATH.
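If it is not already, adding it in your shell profile (bash assumed here) is enough:

```bash
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
```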
Tested on a 114-minute D&D session recording (AMD Ryzen AI MAX+ 395, Zen 5, 16 cores, ROCm 6.2 for diarization):
| Metric | Value |
|---|---|
| Wall clock (real) | ~71 minutes |
| CPU time (user) | ~166 minutes |
| System time (sys) | ~76 minutes |
The high CPU time reflects multi-threaded transcription across 16 cores. The high system time comes from ROCm GPU interaction during diarization and alignment.
Rough phase breakdown:
| Phase | Time |
|---|---|
| Audio extraction | < 5 seconds |
| Transcription (CPU, int8, large-v3) | ~23 minutes |
| Alignment + Diarization (ROCm GPU) | ~45 minutes |
NVIDIA GPU users can expect significantly faster times, since both transcription (faster-whisper / ctranslate2) and diarization can run natively on CUDA.
MIT