Conference Talk Analyzer

Turn a YouTube conference talk into a structured, concept-driven Markdown report with verbatim slide text, statistics, notable quotes, and a parsed Q&A section.

The pipeline downloads a YouTube segment, detects slide changes, transcribes the audio with Whisper, classifies each slide with Claude Haiku (vision), consolidates camera-cut duplicates and animation noise, then uses Claude Sonnet (vision) to synthesize a per-segment concept that integrates the slide image with the aligned transcript. It was originally built for SaaStr livecast talks, but it works on any single-speaker conference recording with on-screen slides.

One ~60-minute talk → ~30-40 concept segments, ~$0.35 in API costs, ~15 minutes wall-clock on first run (subsequent runs mostly take seconds, because every step is cached).

What you get

For one ~60-minute talk you'll end up with:

A Markdown report of ~30-40 segments, each one a discrete concept the speaker communicated. Each segment includes the slide image, verbatim slide text, a 2-4 sentence thesis, key stats, notable quotes, and the raw transcript collapsed inside a <details> block.
concepts.json — the canonical structured output. Every segment is a typed object you can pipe into other tools (search index, blog generator, embeddings, dashboards).
All slide images cropped and rendered to PNG.
The full Whisper transcript (with timestamps).
Classification cache so re-runs are cheap and fast.

Features

Vision-grounded synthesis. Each segment's concept is generated from the slide image and the aligned transcript, not just the transcript text — so charts, diagrams, and on-slide bullets all show up in the output.
Verbatim slide capture. Substantive on-slide text (titles, bullets, principles, headers) is captured exactly as written, preserving hierarchy with Markdown. Demos and photo-style slides return an empty string instead of being hallucinated.
Camera-cut deduplication. Conference livecasts alternate between slide, picture-in-picture, and speaker-only shots — every cut fires the phash slide detector. A Claude Haiku same-slide check merges consecutive detections that show the same underlying content, even when the pixel composition is wildly different.
Software demo grouping. A run of UI screenshots from a live demo becomes one segment with multiple anchor frames (a "strip") rather than 6-8 spurious entries each describing a single frame.
Speaker / company grounding. Every synthesis call is anchored to {speaker_name} from {company_name}, with explicit instructions against industry-pattern-match attribution hallucinations (e.g. confusing the speaker's restaurant-tech company with a more famous one).
Q&A extraction from transcript cues. When the speaker says "any questions?" / "minutes for questions" / "walk up to the mics" mid-segment, the script force-splits at that timestamp even when the slide doesn't change, then parses each distinct question + answer into a structured array.
Speaker-only frames preserve speech. Cuts to the speaker (no slide visible) attach their transcript to the prior content segment, so no speech is lost.
Transitions and applause-frames dropped. Title cards, blanks, brand graphics, and end-of-talk applause sections are filtered out instead of being synthesized as if they were content slides.
Apple Silicon transcription speedup. Auto-detects and uses mlx-whisper if installed (~3-4× faster than CPU openai-whisper on the same audio). Falls back to CPU on Intel/AMD.
End-to-end caching. Every artifact (downloaded video, extracted audio, Whisper transcript, slide detection, slide classifications, synthesized concepts) is cached on disk. Re-runs only redo what changed.
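
The consolidation rules described above (drop transitions, fold speaker-only speech into the prior segment, group consecutive demo frames) can be sketched roughly as follows. This is a minimal illustration, not the code in video_theme_analyzer.py: the Detection and Segment classes and the consolidate function are hypothetical names, and the Haiku same-slide merge is left out because it needs an API call.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    """One classified slide detection (hypothetical shape, for illustration)."""
    index: int
    label: str  # content_slide | software_demo | speaker_only | transition | qa_session
    transcript: str = ""


@dataclass
class Segment:
    """One consolidated segment with one or more anchor slide indices."""
    label: str
    anchors: list[int] = field(default_factory=list)
    transcript: str = ""


def consolidate(detections: list[Detection]) -> list[Segment]:
    segments: list[Segment] = []
    for det in detections:
        if det.label == "transition":
            # Title cards, blanks, and applause frames are dropped.
            continue
        if det.label == "speaker_only":
            # No slide visible: keep the speech by attaching it to the prior segment.
            # (In this sketch, speaker-only speech before any segment is discarded.)
            if segments:
                prev = segments[-1]
                prev.transcript = f"{prev.transcript} {det.transcript}".strip()
            continue
        if det.label == "software_demo" and segments and segments[-1].label == "software_demo":
            # Consecutive demo frames become one segment with several anchors.
            prev = segments[-1]
            prev.anchors.append(det.index)
            prev.transcript = f"{prev.transcript} {det.transcript}".strip()
            continue
        segments.append(Segment(det.label, [det.index], det.transcript))
    return segments
```

Because speaker-only frames are folded into the previous segment before the demo check runs, a demo run interrupted by a cut to the speaker still ends up as a single strip in this sketch.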

Prerequisites

Python 3.10+

ffmpeg for audio extraction:

# macOS
brew install ffmpeg
# Debian/Ubuntu
sudo apt-get install ffmpeg

An Anthropic API key (sk-ant-…) with access to Claude Sonnet and Haiku. Get one at https://console.anthropic.com.
A YouTube URL of a single-speaker conference talk with on-screen slides, and the timestamps where the talk starts/ends within the video.
Optional: Apple Silicon Mac to use mlx-whisper for GPU-accelerated transcription (highly recommended if you have one).

Installation

git clone <your-fork-of-this-repo>
cd conference-talk-analyzer

# 1. Python deps
pip install -r requirements.txt

# 2. Optional: GPU-accelerated transcription on Apple Silicon
pip install mlx-whisper

# 3. API key — copy the template, then fill in your real key
cp .env.example .env
# edit .env so the line reads: ANTHROPIC_API_KEY=sk-ant-your-real-key

# 4. Verify ffmpeg is on PATH
ffmpeg -version

The first run will also download the Whisper model weights (medium ≈ 1.5GB, cached in ~/.cache/whisper/ or ~/.cache/huggingface/hub/ depending on which backend you use). One-time cost.

Quick start

python3 video_theme_analyzer.py \
  'https://www.youtube.com/watch?v=Rx78-T4Jeek' \
  --start 01:08:28 --end 02:05:00 \
  --name adam_owner \
  --speaker-name "Adam Guild" \
  --company-name "Owner"

Quote the URL (zsh treats ? as a glob character). On a fresh run with no caches this takes roughly 15 minutes or more (CPU transcription adds several minutes) and ~$0.35 in API costs. Approximate time per step:

~1-2 min: yt-dlp download + ffmpeg audio extract
~30s: slide-change detection
~3-5 min: Whisper transcription (mlx-whisper on M-series) or ~10-12 min (CPU)
~1 min: slide PNG render
~2-3 min: Haiku classification (~100 calls)
~1-2 min: Haiku same-slide checks during consolidation
~5-10 min: Sonnet content synthesis + 1 Sonnet Q&A parse

Output lands in ./output/adam_owner/.

Usage

Required flags

Flag	Required?	What it is
Positional `url`	yes	The YouTube URL of the source video. Quote it to escape `?`.
`--start HH:MM:SS`	yes	Where the talk begins within the source video.
`--end HH:MM:SS`	no	Where the talk ends. Omit to run to end-of-video.
`--name <id>`	recommended	Folder + output-file prefix (e.g. `adam_owner`). Defaults to `session`.
`--speaker-name "First Last"`	yes (unless `--skip-synthesis`)	Grounds synthesis to the right person.
`--company-name "Company"`	yes (unless `--skip-synthesis`)	Grounds synthesis to the right company. Prevents brand misattribution.

All flags

positional:
  url                       YouTube URL (quote it to escape `?`)

required:
  --start HH:MM:SS          Segment start time
  --speaker-name NAME       Speaker's full name (e.g. "Adam Guild")
  --company-name NAME       Speaker's company (e.g. "Owner")

optional:
  --end HH:MM:SS            Segment end (default: end of video)
  --name ID                 Session id, used for the output dir + filenames
  --output DIR              Where to write outputs (default: ./output)

  --whisper-model SIZE      tiny | base | small | medium (default) | large
                            Tradeoff: tiny is fast but error-prone; medium is
                            the sweet spot; large is most accurate but slow.
  --language CODE           Force Whisper language (ISO code, default "en").
                            Pass an empty string to enable auto-detect.

slide detection / consolidation:
  --slide-crop L,T,R,B      Crop region of the slide within the frame as
                            fractions of frame width/height, 0.0 to 1.0
                            (default 0.32,0.21,0.96,0.90, tuned for SaaStr
                            livecast PiP-on-left layout).
  --threshold N             Perceptual-hash distance threshold for slide-
                            change detection (default 6). Lower = more
                            sensitive.
  --sample-fps N            Frames sampled per second of video for slide
                            detection (default 2.0). Higher = catches briefer
                            slides.
  --min-slide-duration SEC  Collapse adjacent detections within SEC seconds
                            (default 3.0). Main knob for filtering animation
                            noise without losing real slide changes. Changing
                            this re-filters the cache without re-detecting.

cache control:
  --reclassify              Re-run per-slide classification (Haiku) even if
                            classifications.json exists.
  --force-synthesis         Re-run synthesis (Sonnet) even if concepts.json
                            exists. (Alias: --resynthesize.)
  --skip-synthesis          Skip synthesis entirely; render markdown from
                            placeholders. Cheap dry-run with no Sonnet calls
                            (Haiku classification still runs and needs a key
                            unless classifications.json is already cached).

To force any earlier step (download / audio / transcript / slides), just delete the corresponding cache file in the session's output dir.

More examples

Dry-run the pipeline without paying for synthesis (PNGs render; Haiku classification still runs unless classifications.json is already cached):

python3 video_theme_analyzer.py 'https://www.youtube.com/...' \
  --start 00:05:00 --end 01:15:00 --name some_talk \
  --skip-synthesis

Re-synthesize after tweaking the speaker prompt (skips download / transcription / classification — only the Sonnet synthesis re-runs):

python3 video_theme_analyzer.py 'https://www.youtube.com/...' \
  --start 00:05:00 --end 01:15:00 --name some_talk \
  --speaker-name "Jane Doe" --company-name "Acme" \
  --force-synthesis

Detect more aggressively for a video with subtle slide changes:

python3 video_theme_analyzer.py 'https://www.youtube.com/...' \
  --start 00:00:00 --name some_talk \
  --threshold 4 --sample-fps 4.0 \
  --speaker-name "Jane Doe" --company-name "Acme"

Different slide crop for a video where the slide is full-screen, not PiP:

python3 video_theme_analyzer.py 'https://www.youtube.com/...' \
  --start 00:00:00 --name some_talk \
  --slide-crop "0.0,0.0,1.0,1.0" \
  --speaker-name "Jane Doe" --company-name "Acme"

How it works

┌──────────────┐    ┌────────────┐    ┌─────────────────┐    ┌────────────┐
│ yt-dlp       │──▶│ ffmpeg     │──▶│ Whisper          │   │ phash      │
│ download     │   │ audio mp3  │   │ transcribe       │   │ slide-     │
│              │   │            │   │ (segments)       │   │ change     │
└──────────────┘   └────────────┘   └─────────────────┘    │ detect     │
                                                            └─────┬──────┘
                                                                  ▼
                                                            ┌────────────┐
                                                            │ min-       │
                                                            │ duration   │
                                                            │ filter     │
                                                            └─────┬──────┘
                                                                  ▼
                                                            ┌────────────┐
                                                            │ overlap-   │
                                                            │ based      │
                                                            │ alignment  │
                                                            │ + 1.5s     │
                                                            │ offset     │
                                                            └─────┬──────┘
                                                                  ▼
                                                            ┌────────────┐
                                                            │ render     │
                                                            │ slide PNGs │
                                                            └─────┬──────┘
                                                                  ▼
                                              ┌──────────────────────────────┐
                                              │ Haiku classify each slide:   │
                                              │ content_slide / software_demo│
                                              │ / speaker_only / transition  │
                                              │ / qa_session                 │
                                              └──────────────┬───────────────┘
                                                             ▼
                                              ┌──────────────────────────────┐
                                              │ Consolidate:                 │
                                              │ - drop transitions           │
                                              │ - attach speaker_only's      │
                                              │   transcript to prior        │
                                              │   content segment            │
                                              │ - merge same-slide content   │
                                              │   slides (Haiku same-slide   │
                                              │   check)                     │
                                              │ - group consecutive software │
                                              │   _demo frames               │
                                              │ - temporal-guard early qa    │
                                              │   false positives            │
                                              └──────────────┬───────────────┘
                                                             ▼
                                              ┌──────────────────────────────┐
                                              │ Cue-based Q&A split:         │
                                              │ scan transcript for "any     │
                                              │ questions?" etc. — force a   │
                                              │ split mid-segment if found.  │
                                              └──────────────┬───────────────┘
                                                             ▼
                                              ┌──────────────────────────────┐
                                              │ Sonnet synthesize:           │
                                              │ - content_slide / sw_demo:   │
                                              │   ContentAnalysis            │
                                              │ - qa_session:                │
                                              │   list[QAQuestion]           │
                                              │ Both calls grounded in       │
                                              │ {speaker, company}.          │
                                              └──────────────┬───────────────┘
                                                             ▼
                                              ┌──────────────────────────────┐
                                              │ Render markdown from         │
                                              │ concepts.json                │
                                              └──────────────────────────────┘
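
The "overlap-based alignment + 1.5s offset" stage in the diagram could look something like the sketch below. It is an assumption-heavy illustration: the function name align_segments, the tuple shapes, and the direction of the offset (slide boundaries pushed 1.5 s later, on the guess that speech about a slide trails slightly past the visual change) are not confirmed by the source.

```python
def align_segments(slide_starts, transcript, end_time, offset=1.5):
    """Assign each transcript segment to the slide interval it overlaps most.

    slide_starts: sorted, non-empty list of slide-change times in seconds.
    transcript:   list of (start, end, text) tuples, Whisper-style segments.
    end_time:     end of the talk in seconds.
    offset:       seconds added to every boundary after the first (assumed direction).
    """
    # The first boundary is left alone so opening speech is not orphaned.
    bounds = [slide_starts[0]] + [t + offset for t in slide_starts[1:]] + [end_time]
    buckets = [[] for _ in slide_starts]
    for seg_start, seg_end, text in transcript:
        # Overlap (in seconds) between this transcript segment and each slide interval.
        overlaps = [
            max(0.0, min(seg_end, bounds[i + 1]) - max(seg_start, bounds[i]))
            for i in range(len(slide_starts))
        ]
        best = max(range(len(overlaps)), key=overlaps.__getitem__)
        if overlaps[best] > 0:
            buckets[best].append(text)
    return [" ".join(b) for b in buckets]
```

With slides starting at 0 s and 10 s, a sentence spoken from 9 s to 12 s goes to the first slide under this offset, whereas a plain overlap rule with no offset would hand it to the second.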

Why each step exists

Slide-change detection with perceptual hashing is fast and free. It's intentionally over-sensitive — better to catch every camera cut and animation build than miss a real slide change.
Min-duration filter collapses raw detections within ~3s of each other, killing animation-build noise while preserving real slide changes.
Classification is the expensive-looking step that pays for itself. Without it, every speaker-only camera cut becomes a "slide" the synthesis call wastes API spend on.
Same-slide check handles the SaaStr livecast issue where the same slide is shown from three camera framings; without it, one slide becomes 3-6 separate segments with redundant theses.
Software-demo grouping turns the 6-frame "AI crawl demo" run into one segment with 6 anchor images — Sonnet sees the progression, not 6 individual frames each described separately.
Cue-based Q&A split fixes the case where the speaker invites questions but the slide doesn't change — no visual cue fires, so without the transcript-cue scan, 15 minutes of Q&A would get glued onto the last content slide.
Speaker grounding in synthesis prompts prevents the model from pattern-matching to industry-famous brands (e.g. "Toast" when the speaker is actually from "Owner") or confusing the speaker's company with vendors they reference ("Anthropic's finance team" when they meant "the speaker's finance team using Claude").
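
The first two steps above (perceptual-hash detection, then the min-duration filter) can be illustrated with a small stdlib-only sketch. It assumes each sampled frame has already been reduced to a 64-bit hash integer (the real pipeline uses imagehash for that), and it makes two guesses the source does not confirm: a change fires when the Hamming distance is strictly greater than the threshold, and the filter keeps the first detection of each burst. All function names are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two perceptual hashes stored as ints."""
    return bin(a ^ b).count("1")


def detect_changes(frames, threshold=6):
    """Return timestamps where a new slide appears to start.

    frames: list of (timestamp_sec, hash_int), in time order.
    """
    changes = []
    reference = None
    for ts, h in frames:
        if reference is None or hamming(h, reference) > threshold:
            changes.append(ts)
            reference = h  # later frames are compared against the new slide
    return changes


def min_duration_filter(changes, min_duration=3.0):
    """Collapse detections that fall within min_duration of the last kept one."""
    kept = []
    for ts in changes:
        if kept and ts - kept[-1] < min_duration:
            continue  # animation build or rapid camera cut: treat as noise
        kept.append(ts)
    return kept
```

Keeping the raw output of detect_changes on disk and applying min_duration_filter afterwards is what makes it cheap to sweep --min-slide-duration values without re-scanning the video.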

Output

Files produced

For a session named adam_owner, you'll get this tree:

output/
  adam_owner/
    video.mp4                 # raw download from yt-dlp
    audio.mp3                 # extracted audio for Whisper
    transcript.json           # Whisper segments with timestamps
    slides.json               # raw + filtered slide-change boundaries
    classifications.json      # per-slide Haiku classifications
    concepts.json             # canonical synthesized output
    adam_owner.md             # rendered markdown report
    adam_owner_slides/
      slide_001.png           # cropped slide images, one per raw detection
      slide_002.png
      ...

Sample report

A typical content-slide segment looks like this:

## Segment 1 — Owner Before: Strong Metrics, Hidden Risk
*1:08:28 to 1:09:26 · content_slide*

![Slide](adam_owner_slides/slide_002.png)

**On the slide:**
> # Owner before
>
> Building Shopify for restaurants.
> Great metrics - exceeding triple triple double double double.
> Great funding - $89M raised.
> But...

**Thesis.** Three years ago, Owner appeared to be thriving — building a
Shopify for restaurants with strong growth metrics and $89M raised. Despite
hitting the coveted triple-triple-double-double-double SaaS growth trajectory,
an existential threat was quietly emerging beneath the surface starting in
early 2023.

**Visual.** Dark background, 'Owner before' title with 'before' highlighted
in a purple box. Four bullets on the left, an accelerating bar chart on the
right reinforcing the growth.

**Key stats:**
- $89M — Capital raised (source: slide)
- triple triple double double double — Growth trajectory exceeded (source: slide)

**Notable quotes:**
> "We were exceeding the triple, triple, double, double, double trajectory
> that the SaaS lore says eventually leads to building a unicorn." — [1:09:03]

<details>
<summary>Raw transcript</summary>

Let me take you back to three years ago, because the past three years has
been a huge rollercoaster at Owner. ...

</details>

A Q&A segment renders like this:

## Segment 35 — Q&A
*1:45:47 to 1:52:01 · qa_session*

### Q1: How many attempts did you have before you came up with a winning idea?
*Asked by: PMP, founder of Emotions AI*

The company went through many failures, including a website builder focused
on dine-in reservations that was destroyed by the pandemic, before pivoting to
online ordering and eventually finding product-market fit.

**Quotes:**
> "The first three years of the company's life in 2018 to 2021, we were
> building something that got completely destroyed by the pandemic." — [1:46:30]

### Q2: How do you think about focus when AI lets you build so many things?
...

concepts.json schema

The file is wrapped with session metadata so consumers can verify the speaker and company context. The // comments below are annotations only, since JSON does not allow comments:

{
  "speaker_name": "Adam Guild",
  "company_name": "Owner",
  "concepts": [
    {
      "segment_type": "content_slide",       // content_slide | software_demo | qa_session
      "segment_index": 1,
      "concept_title": "Owner Before: Strong Metrics, Hidden Risk",
      "thesis": "...",                       // 2-4 sentences max
      "slide_text_verbatim": "...",          // markdown-formatted on-slide text (empty if no text)
      "stats": [
        {"value": "$89M", "label": "Capital raised", "source": "slide"}
      ],
      "key_quotes": [
        {"text": "...", "timestamp": "1:09:03"}
      ],
      "visual_description": "...",
      "anchor_slides": ["slide_002.png"],    // one for content_slide, multiple for software_demo
      "transcript_window": {                 // absolute video time, in seconds
        "start": 4108.0,
        "end": 4166.0
      },
      "raw_transcript": "...",
      "questions": null,                     // null for content/demo, array for qa_session
      "source_slide_indices": [2]            // raw detection indices that map into this segment
    },
    // qa_session entries have questions: [...] instead of null
    {
      "segment_type": "qa_session",
      "segment_index": 35,
      "concept_title": "Q&A",
      // ... most fields empty for Q&A ...
      "questions": [
        {
          "question_index": 1,
          "question_text": "...",
          "asker_description": "PMP, founder of Emotions AI",  // null if not stated
          "answer_summary": "...",                              // 2-3 sentences
          "answer_quotes": [{"text": "...", "timestamp": "1:46:30"}],
          "transcript_window": {"start": 6360.0, "end": 6580.0}
        }
      ]
    }
  ]
}
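
As one example of piping concepts.json into other tools, the sketch below flattens the per-segment stats and the Q&A questions into simple rows. The field names follow the schema above; the helper names (load_concepts, flatten_stats, flatten_questions) are hypothetical, and the code assumes Q&A segments may carry empty or missing stats.

```python
import json
from pathlib import Path


def load_concepts(path):
    """Read concepts.json and return (session metadata, list of segments)."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    meta = {
        "speaker_name": doc.get("speaker_name"),
        "company_name": doc.get("company_name"),
    }
    return meta, doc.get("concepts", [])


def flatten_stats(segments):
    """One row per stat, tagged with the segment it came from."""
    rows = []
    for seg in segments:
        for stat in seg.get("stats") or []:
            rows.append({
                "segment_index": seg["segment_index"],
                "concept_title": seg["concept_title"],
                "value": stat["value"],
                "label": stat["label"],
                "source": stat["source"],
            })
    return rows


def flatten_questions(segments):
    """(segment_index, question_index, question_text) for every parsed question."""
    rows = []
    for seg in segments:
        # questions is null for content/demo segments and a list for qa_session.
        for q in seg.get("questions") or []:
            rows.append((seg["segment_index"], q["question_index"], q["question_text"]))
    return rows
```

For the quick-start session this would be called as load_concepts("output/adam_owner/concepts.json"), and the resulting rows can be written to CSV or fed to a search index.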

Tuning

The defaults are tuned for the SaaStr Annual livecast layout (PiP on left, slide on right). For other videos:

Knob	Default	When to tune
`--slide-crop L,T,R,B`	`0.32,0.21,0.96,0.90`	Other layouts. Use `0.0,0.0,1.0,1.0` for full-screen slides. Open any rendered PNG to verify the crop is right.
`--threshold`	`6`	If you're missing real slide changes, lower it (4 or 5). If animation noise survives the duration filter, raise it (8-10).
`--sample-fps`	`2.0`	Raise to catch briefly-shown slides. Lower (1.0) for a faster but coarser sweep that may miss briefly-shown slides.
`--min-slide-duration`	`3.0`	The big consolidation knob. Talks with rapid build animations: try `1.5`. Talks where the camera lingers: try `5.0`. Changing this refilters the cache — no re-detection.
`--whisper-model`	`medium`	`medium` is the sweet spot. `large` is more accurate but slow. `base` is fine for low-effort drafts.
`--language`	`en`	Forces transcription language to skip auto-detect (which can hallucinate Turkish on silent intros). Pass empty string for auto-detect.

If the cue-based Q&A split misses your speaker's wording, edit the QA_TRIGGER_PHRASES constant in video_theme_analyzer.py. It's a list of case-insensitive substrings; add the phrase the speaker actually used.
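
A case-insensitive substring scan of this kind might look like the sketch below. The phrase list here is only the subset quoted in this README, not the actual contents of QA_TRIGGER_PHRASES, and find_qa_cue is a hypothetical name rather than a function from the script.

```python
# Illustrative subset; the real constant in video_theme_analyzer.py may differ.
QA_TRIGGER_PHRASES = [
    "any questions",
    "minutes for questions",
    "walk up to the mics",
]


def find_qa_cue(transcript, phrases=QA_TRIGGER_PHRASES):
    """Return the start time of the first segment containing a cue, or None.

    transcript: list of (start_sec, end_sec, text) tuples in time order.
    """
    lowered = [p.lower() for p in phrases]
    for start, _end, text in transcript:
        haystack = text.lower()
        if any(p in haystack for p in lowered):
            return start  # candidate timestamp for a forced Q&A split
    return None
```

One limitation of matching per transcript segment is that a cue split across two Whisper segments would be missed; matching on a sliding window of joined text would be one way around that.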

Costs

Approximate per-talk cost for a 60-minute talk on the defaults:

Step	Model	Calls	Cost
Classification	Claude Haiku 4.5	~100 (one per detected slide)	~$0.02
Same-slide checks	Claude Haiku 4.5	~20 (adjacent content pairs)	~$0.005
Content synthesis	Claude Sonnet 4.6	~30 (one per consolidated segment, with vision)	~$0.30
Q&A synthesis	Claude Sonnet 4.6	1 (text-only, parses the whole Q&A transcript)	~$0.02
Total			~$0.35

Sonnet's system prompt is cached across calls so subsequent calls within the same session pay ~10% of the full input price for cached tokens. The script prints an exact cost breakdown at the end of each run.
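
The arithmetic behind the table and the caching discount can be written out as below. The per-step figures come from the table; price_per_mtok is a placeholder parameter rather than a quoted price, and the helper ignores any surcharge for writing the cache, so treat it as a rough estimate only.

```python
# Approximate per-step costs in dollars, copied from the table above.
STEP_COSTS = {
    "classification": 0.02,
    "same_slide_checks": 0.005,
    "content_synthesis": 0.30,
    "qa_synthesis": 0.02,
}

total = sum(STEP_COSTS.values())  # about 0.345, which the table rounds to ~$0.35


def input_cost(fresh_tokens, cached_tokens, price_per_mtok, cached_ratio=0.10):
    """Dollar cost of input tokens when cached tokens are billed at ~10% of the base price.

    price_per_mtok is the base input price per million tokens (caller-supplied).
    """
    billable = fresh_tokens + cached_tokens * cached_ratio
    return billable * price_per_mtok / 1_000_000
```

For example, a call with 1,000 fresh tokens and 2,000 cached tokens is billed like 1,200 fresh tokens under this model.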

The first time you run the script it'll also download the Whisper model (~1.5GB for medium) — a one-time cost, cached locally.

Caching and re-runs

Every step writes its output to disk. To force a step, delete the cache file:

Delete	Forces
`video.mp4`	full re-download
`audio.mp3`	re-extract audio
`slides.json`	re-detect slide changes (also clears stale PNGs and classifications)
`transcript.json`	re-transcribe
`classifications.json`	re-classify (alternative: `--reclassify`)
`concepts.json`	re-synthesize (alternative: `--force-synthesis`)

Speaker/company change triggers auto-resynth. If you re-run with different values for --speaker-name or --company-name, the script detects the metadata mismatch on concepts.json and re-synthesizes automatically — no need to pass --force-synthesis.
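
The metadata-mismatch check can be pictured as a small guard like the one below. This is a sketch under assumptions: needs_synthesis is a hypothetical helper, and it relies only on the speaker_name and company_name keys shown in the concepts.json schema.

```python
import json
from pathlib import Path


def needs_synthesis(concepts_path, speaker_name, company_name, force=False):
    """Decide whether the Sonnet synthesis step should run again."""
    path = Path(concepts_path)
    if force or not path.exists():
        return True  # --force-synthesis, or nothing cached yet
    cached = json.loads(path.read_text(encoding="utf-8"))
    cached_meta = (cached.get("speaker_name"), cached.get("company_name"))
    # A different speaker or company means the cached concepts were grounded wrongly.
    return cached_meta != (speaker_name, company_name)
```

Note that a comparison like this is exact, so even a spelling change such as "Owner" versus "Owner.com" would count as a mismatch and trigger a paid re-synthesis.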

--min-slide-duration doesn't force re-detection. It's applied as a post-processing filter on top of the cached raw detections, so you can sweep through values quickly. Changing it does invalidate slide PNGs and classifications (since slide indices may now point at different timestamps).

Troubleshooting

"zsh: no matches found: https://www.youtube.com/watch?v=..." Quote the URL; ? is a glob character in zsh.

python3 video_theme_analyzer.py 'https://www.youtube.com/watch?v=...' ...

"ANTHROPIC_API_KEY environment variable is not set." Either copy .env.example to .env and put your key there, or run export ANTHROPIC_API_KEY=sk-ant-... in your shell.

Synthesis requires --speaker-name and --company-name flags. Pass both. Example:

--speaker-name "Adam Guild" --company-name "Owner"

Whisper transcribed half the video as "Thank you. Thank you. Thank you..." That's Whisper hallucinating during silent applause / dead air at the end of the talk. Either tighten --end HH:MM:SS to cut the dead air, or accept the artifact — the pipeline correctly ignores it (those slides are classified as qa_session or transition and either dropped or absorbed into the Q&A segment, which gets discarded if no real questions are extracted).

Whisper mid-talk crash, or transcription detects the wrong language. The language auto-detection runs on the first 30 seconds of audio. If the talk opens with music or applause, auto-detect can drift. --language en (the default) forces English and prevents this.

Too many segments / animation noise survives. Raise --min-slide-duration (try 5.0). This re-filters the cached detections without re-detecting — fast.

Too few segments / real slides being missed. Lower --threshold (try 4). This requires re-detection (delete slides.json).

The crop is cutting off part of the slide / showing the speaker. Open any slide_NNN.png to see what's being cropped. Adjust --slide-crop L,T,R,B (fractions of frame width/height, from 0.0 to 1.0). The default 0.32,0.21,0.96,0.90 means left=32%, top=21%, right=96%, bottom=90%, tuned for the SaaStr PiP layout. Use 0.0,0.0,1.0,1.0 for a full-screen slide.
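
To check a crop before re-running, it can help to translate the four L,T,R,B values into pixel coordinates for your video's resolution. The helper below is a hypothetical sketch of that conversion (crop_box is not a function from the script), assuming the values are fractions of frame width and height.

```python
def crop_box(width, height, spec="0.32,0.21,0.96,0.90"):
    """Convert an 'L,T,R,B' fraction string into a (left, top, right, bottom) pixel box."""
    left, top, right, bottom = (float(part) for part in spec.split(","))
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError(f"invalid crop spec: {spec!r}")
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )
```

On a 1920x1080 frame the default spec works out to the box (614, 227, 1843, 972), which you can compare against the slide region in a screenshot of the video.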

The script ran two synthesis passes and burned $0.60 — what gives? Most likely the metadata-mismatch auto-resynth fired because you changed --speaker-name / --company-name between runs. To avoid this on intentional re-runs, keep the values consistent.

Q&A wasn't split correctly. The cue list in QA_TRIGGER_PHRASES covers common phrasings ("any questions?", "minutes for questions", "walk up to the mics", etc.). If your speaker said something different ("let's open it up", "shoot me your questions"), add the substring to that list and re-run with --force-synthesis.

Limitations

Single-speaker only. Q&A asker_description works when the asker identifies themselves verbally, but speaker diarization (telling who's talking) is not built in. Multi-speaker panels won't render correctly.
English-language defaults. Whisper handles many languages; the cue list and synthesis prompts assume English. Translating them is a small change but currently a manual one.
Layout assumptions. The default crop is tuned for the SaaStr Annual livecast. Other producers (Web Summit, AWS re:Invent, etc.) use different framings — you'll need to adjust --slide-crop.
No OCR fallback. Slide text capture relies on Claude Sonnet's vision — works well in practice, but very small text or unusual layouts may slip through. Tesseract / RapidOCR integration would be a natural next step.
YouTube only. Download is via yt-dlp. Adapting to other video sources (Vimeo, direct mp4 URLs) means changing the download_video function — a few lines.

License

MIT — see LICENSE.

Acknowledgements

Built on top of yt-dlp, OpenAI Whisper / mlx-whisper, OpenCV, imagehash, and the Anthropic Python SDK.
