Hours of Chinese podcast video → a trimmed final cut + 金句 (quotable lines) for 小红书
End-to-end pipeline for turning long-form Chinese video podcasts into polished, social-ready cuts. Local-first · works offline · no per-minute fees.
Drop a 4-hour video in. Get out:
- 🎙️ A speaker-labeled transcript (`SPEAKER_00`, `SPEAKER_01`, …)
- ✂️ A trimmed final video (the boring bits cut, the good bits kept)
- 🪙 A bullet list of 金句 ready to drop into a 小红书 / 公众号 draft
It's a one-command local web app. The UI is a Mario-themed pixel-art editor with a dual-lane timeline for speakers, a 🍄 mushroom playhead, and bouncing ¥ coins above your highlighted clips. Editing is keyboard-driven (`X` to cut, `H` to mark a 金句).
```
video.mp4
   │
   ▼  [1] transcribe.py  (WhisperX + pyannote)
transcript.json   ← speaker-labeled segments with timestamps
   │
   ▼  [2] editor (browser) — AI suggests, you refine
selections.json   ← each segment tagged keep / cut / highlight
   │
   ├──▶ [3a] cut.py     → final.mp4 / final.mov  (trimmed video)
   └──▶ [3b] extract.py → social.md  (金句 + full kept transcript, for 小红书 drafting)
```
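The stages chain through plain JSON files, so stage [2] can also be scripted instead of done in the editor. A minimal sketch, assuming the `transcript.json` / `selections.json` schemas documented later in this README (the tag-everything-as-keep rule is just a placeholder for real selection logic):

```python
import json

# Read stage [1]'s output and produce stage [2]'s output by hand.
with open("transcript.json") as f:
    transcript = json.load(f)

selections = {
    "video_path": transcript["video_path"],
    "speaker_names": {},      # optional display names, e.g. {"SPEAKER_00": "主持人"}
    "segments": [
        {**seg, "tags": []}   # empty tags == keep; "cut"/"highlight" are opt-in
        for seg in transcript["segments"]
    ],
}

with open("selections.json", "w") as f:
    json.dump(selections, f, ensure_ascii=False, indent=2)
```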
- WhisperX + pyannote — automatic transcription with speaker diarization
- AI suggestions — heuristic scoring marks fillers as `cut` and quotables as `highlight`, with a per-speaker weight you control (a sketch of the idea follows this list)
- Preview mode — play the video as if cuts were already applied (skips `cut`-tagged segments live)
- Lossless concat — `ffmpeg` re-encodes once at clean cut boundaries, no glitches
- MP4 or MOV export — pick your container in the UI
- Mario timeline — chunky pixel art shows everyone's segments by speaker, gold coins flag highlights, a mushroom is the playhead. Just for fun.
- HF mirror support — `HF_ENDPOINT=https://hf-mirror.com` is the default for China users
- ModelScope fallback — when HuggingFace is unreachable, you can pre-download the Whisper model from ModelScope (Aliyun CDN, ~40 MB/s from China) and PodCut will pick up the local cache
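PodCut's actual scorer lives in the server; the sketch below only illustrates the shape of such a heuristic. The filler list, thresholds, and scoring formula are invented for illustration:

```python
# Plausible shape of a cut/highlight heuristic. Not PodCut's real scorer:
# filler list, thresholds, and the density formula are all invented.
FILLERS = ("嗯", "呃", "啊", "就是说", "然后那个")

def suggest(segment: dict, speaker_weight: float = 1.0) -> str | None:
    text = segment["text"].strip()
    duration = segment["end"] - segment["start"]
    # Very short segments that open with a filler word -> suggest a cut.
    if duration < 2.0 and any(text.startswith(f) for f in FILLERS):
        return "cut"
    # Dense, sustained statements score higher; the weight favors
    # whichever speaker the user boosts in the ⚙ panel.
    score = (len(text) / max(duration, 0.1)) * speaker_weight
    if score > 6.0 and duration > 8.0:
        return "highlight"
    return None  # untagged == keep
```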
`whisperx` and `pyannote.audio` both depend on `torchcodec`, which is built against ffmpeg 4–7.
Installing the latest ffmpeg (8.x) silently breaks audio loading. PodCut handles this for you:
| Platform | What you get | Why it works |
|---|---|---|
| macOS via `setup.sh` | `brew install ffmpeg@7` (keg-only, coexists with anything) | Pinned to 7.1.x; `start.sh` puts it first on `PATH` for our scripts only |
| Docker (`Dockerfile`) | `python:3.11-slim-bookworm` + `apt install ffmpeg` (= 5.1.x) | Bookworm is locked; the build aborts if a future image bump lands ffmpeg ≥ 8 |
| Manual install | You're on your own — make sure `ffmpeg --version` reports 4, 5, 6, or 7 | torchcodec ABI compat |
Python deps are pinned in `requirements.txt` (torch 2.8.0 / torchcodec 0.7.0 / whisperx 3.8.5 / pyannote.audio 4.0.4). Re-running `pip install -r requirements.txt` in a fresh `python==3.11` venv reproduces a known-good environment.
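For a manual install, a quick check like this confirms the ffmpeg on `PATH` is in the supported range (illustrative only; `setup.sh` and the Dockerfile already enforce this on their platforms):

```python
import re
import subprocess

# `ffmpeg -version` prints something like "ffmpeg version 7.1.1 ..." on
# the first line; some builds prefix the number (e.g. "n7.1"), hence \D*.
out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout
match = re.search(r"ffmpeg version \D*(\d+)", out)
major = int(match.group(1)) if match else None
if major is None or major >= 8:
    raise SystemExit(f"Unsupported ffmpeg major version: {major} (need 4-7)")
print(f"ffmpeg {major}.x looks compatible with torchcodec")
```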
```bash
git clone https://github.com/jinyang0530/podcut.git
cd podcut
bash scripts/setup.sh
```

`setup.sh` installs Homebrew (if missing), python@3.11, ffmpeg@7, creates a venv, installs the pinned `requirements.txt`, and prompts for a free HuggingFace token. ~10 min, ~4 GB disk.
Required HuggingFace gated-repo terms — visit each link and click "Agree and access":
- https://huggingface.co/pyannote/speaker-diarization-community-1
- https://huggingface.co/pyannote/segmentation-3.0
Then create a read token at https://huggingface.co/settings/tokens.
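To confirm the token works before kicking off a long transcription, `huggingface_hub`'s `whoami()` does it; the token string below is a placeholder:

```python
from huggingface_hub import whoami

# Paste the read token you just created; "hf_xxx" is a placeholder.
info = whoami(token="hf_xxx")
print(f"Authenticated as: {info['name']}")
```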
```bash
git clone https://github.com/jinyang0530/podcut.git
cd podcut
cp .env.example .env   # then edit .env and paste your HF_TOKEN
docker compose up
```

Browser opens to http://localhost:8787. Drop videos into `./videos/` (mounted as `/data` inside the container).
⚠️ The native picker (osascript) won't work in Docker. Use the "📂 选择视频文件" (choose video file) button in the editor — the server will pick from files in `/data` instead.
```bash
bash scripts/start.sh /path/to/video.mp4
# or, with no argument, pick the video in the browser:
bash scripts/start.sh
```

This starts a local HTTP server on 127.0.0.1:8787, opens the editor in your default browser, and (if you passed a video) auto-loads it.
In the editor:
- ▶ 开始转录 (start transcription) — click if there's no transcript yet. Pick speaker count, language, and model size. Runs in the background; progress and a live log are shown in a modal.
- ✨ AI suggestions — auto-runs after the transcript loads. The 🤖 AI 建议 toggle in the top bar turns it off (manual mode) or back on. Click ⚙ for fine-tuning (per-speaker weight, target compression ratio, strip-fillers).
- Refine — keyboard shortcuts: `X` cut · `H` 金句 · `Z` clear tags · `Space` play/pause · `J`/`L` seek ±5 s · `↑`/`↓` prev / next segment · `P` toggle 原片 (original) / ✂ 成片 (final cut) preview
- 💾 导出最终视频 (export final video) — runs `cut.py` on the server and downloads the final MP4 or MOV when done.
- Run `python scripts/extract.py <selections.json>` to dump the 金句 list (`social.md`) — then ask Claude (or anyone) to rewrite it into a 小红书 post. A sketch of what the extraction does follows this list.
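As a rough idea of what that extraction step does (the real `social.md` layout may differ), filtering `selections.json` by tags looks like this:

```python
import json

# highlight-tagged segments become the 金句 bullet list; anything not
# tagged "cut" becomes the kept transcript. Layout is illustrative.
with open("selections.json") as f:
    sel = json.load(f)

names = sel.get("speaker_names", {})
lines = ["## 金句", ""]
for seg in sel["segments"]:
    if "highlight" in seg.get("tags", []):
        who = names.get(seg["speaker"], seg["speaker"])
        lines.append(f"- [{seg['start']:.0f}s] {who}: {seg['text']}")

lines += ["", "## Kept transcript", ""]
lines += [s["text"] for s in sel["segments"] if "cut" not in s.get("tags", [])]

with open("social.md", "w") as f:
    f.write("\n".join(lines))
```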
Everything is local. No third-party APIs are called during editing or cutting.
```
start.sh <video>
 └─▶ serve.py (Python stdlib HTTP server on 127.0.0.1:8787)
      ├─ serves editor/index.html (single-file Tailwind + Alpine.js app)
      ├─ serves the video file (Range-aware, for scrubbing)
      ├─ /api/jobs             → spawns transcribe.py / cut.py as subprocesses
      ├─ /api/suggest          → scores segments and returns cut/highlight suggestions
      ├─ /api/pick-video       → opens the macOS native file picker via osascript
      └─ /api/download/<token> → streams the final MP4
```
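`serve.py`'s real handlers are richer (job spawning, Range-aware streaming), but the underlying pattern is plain stdlib `http.server` with manual path dispatch. A stripped-down sketch with stub handlers:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            # Serve the single-file editor.
            body = open("editor/index.html", "rb").read()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(body)
        elif self.path == "/api/suggest":
            # Stub: the real endpoint scores segments server-side.
            payload = json.dumps({"suggestions": []}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)
        else:
            self.send_error(404)

HTTPServer(("127.0.0.1", 8787), Handler).serve_forever()
```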
Network use happens only on first run (downloading whisper + pyannote model weights, ~3 GB total).
`transcript.json` (output of `transcribe.py`):

```json
{
  "video_path": "/abs/path.mp4",
  "duration": 14400.0,
  "language": "zh",
  "num_speakers": 4,
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.52, "speaker": "SPEAKER_00", "text": "..." }
  ]
}
```

`selections.json` (output of editor → input to `cut.py` / `extract.py`):

```json
{
  "video_path": "/abs/path.mp4",
  "speaker_names": { "SPEAKER_00": "主持人", "SPEAKER_01": "嘉宾" },
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.52, "speaker": "SPEAKER_00",
      "text": "...", "tags": ["highlight"] }
  ]
}
```

Tags: `cut` (drop) and `highlight` (金句). Anything else is kept by default.
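`cut.py`'s exact ffmpeg invocation isn't shown in this README; one standard way to get a single re-encode across clean boundaries is ffmpeg's `select`/`aselect` filter pair over the kept segments. A sketch, with the output name as a placeholder:

```python
import json
import subprocess

with open("selections.json") as f:
    sel = json.load(f)

# Keep everything not tagged "cut", then build one select expression.
kept = [s for s in sel["segments"] if "cut" not in s.get("tags", [])]
expr = "+".join(f"between(t,{s['start']},{s['end']})" for s in kept)

# select/aselect drop the cut ranges; setpts/asetpts re-time the rest
# so the output plays back gap-free after a single re-encode.
subprocess.run([
    "ffmpeg", "-i", sel["video_path"],
    "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
    "-af", f"aselect='{expr}',asetpts=N/SR/TB",
    "final.mp4",  # placeholder output name
], check=True)
```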
HuggingFace download stalls / `Read timed out` from cas-bridge.xethub.hf.co — Common in China. Set `HF_ENDPOINT=https://hf-mirror.com` (already the default). For the big Whisper `model.bin` file the mirror still redirects to xet, so as a last resort run:

```bash
python -c "from modelscope import snapshot_download; \
snapshot_download('pengzhendong/faster-whisper-medium', \
cache_dir='$HOME/.cache/modelscope/')"
```

(`$HOME` is expanded by the shell; a literal `~` inside the Python string would not be.) Then symlink/copy the files into `~/.cache/huggingface/hub/models--Systran--faster-whisper-medium/snapshots/manual/`; PodCut auto-detects that path. A copy sketch follows below.
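If you'd rather script that copy step, a minimal sketch (the ModelScope source layout varies by modelscope version, so adjust `SRC` to wherever the download actually landed):

```python
import shutil
from pathlib import Path

# Adjust SRC: the ModelScope cache layout differs across versions.
SRC = Path("~/.cache/modelscope/pengzhendong/faster-whisper-medium").expanduser()
DST = Path("~/.cache/huggingface/hub/"
           "models--Systran--faster-whisper-medium/snapshots/manual").expanduser()

DST.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(SRC, DST, dirs_exist_ok=True)
print(f"Copied {SRC} -> {DST}")
```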
Diarization fails but Whisper succeeded — `transcribe.py` saves a `<video>.whisper-cache.json` after Whisper completes. Run `python scripts/diarize_only.py <video> --num-speakers N` to retry just the speaker step without redoing the 20+ minutes of Whisper work.
Editor shows "独立模式" (standalone mode) — you opened the HTML directly via `file://` instead of through the server. Use `bash scripts/start.sh` for the full experience.
Port 8787 in use — `serve.py` falls back to the next free port automatically (8787–8799).
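The fallback is the usual bind-and-retry loop; a sketch of the pattern (`serve.py`'s actual logic may differ):

```python
import socket

def pick_port(start: int = 8787, end: int = 8799) -> int:
    # Try each port in order; the first one that binds is free.
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue
    raise RuntimeError(f"No free port in {start}-{end}")

print(pick_port())
```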
```
podcut/
├── scripts/
│   ├── setup.sh          # one-time install (brew + python@3.11 + venv + deps)
│   ├── start.sh          # one-command launcher
│   ├── serve.py          # local HTTP server (stdlib only)
│   ├── transcribe.py     # video → transcript.json
│   ├── diarize_only.py   # rerun speaker diarization from a Whisper cache
│   ├── cut.py            # selections.json + video → final.mp4 / .mov
│   └── extract.py        # selections.json → social.md
├── editor/
│   ├── index.html        # the single-file editor (Tailwind + Alpine.js)
│   ├── logo-mark.svg     # the icon (Mario ? block + scissors)
│   └── logo.html         # logo explorations
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
```
MIT — see LICENSE.
- whisperx — alignment + diarization wrapper around Whisper
- pyannote.audio — speaker diarization
- faster-whisper — CTranslate2-based Whisper inference
- ModelScope — China-friendly model mirror
- 🍄 Mario assets are tributes only; not affiliated with Nintendo.