A self-hosted web application for transcribing audio files with automatic speaker identification. Built with WhisperX and Gradio, optimized for CPU-only environments.
- Accurate Transcription: Uses WhisperX (based on faster-whisper) for high-quality speech-to-text
- Speaker Diarization: Automatically identifies and labels different speakers
- Long Audio Support: Handles recordings of 3 hours or more
- Web Interface: Easy-to-use Gradio UI for uploading and downloading
- CPU Optimized: Runs efficiently on CPU with int8 quantization
- Plain Text Output: Clean, readable transcripts with timestamps
System requirements:
- OS: Linux (Ubuntu 20.04+ recommended)
- RAM: 16GB minimum, 32GB+ recommended for large files
- CPU: Multi-core processor (more cores = faster processing)
- Storage: At least 10GB free (for models and temporary files)
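The hardware requirements above can be checked before installing anything. The following is a minimal sketch, not part of the project: the function names `check_requirements` and `probe_system` are hypothetical, the thresholds are copied from the list, and the RAM probe assumes a Linux host where `os.sysconf` exposes `SC_PHYS_PAGES`.

```python
import os
import shutil

MIN_RAM_GB = 16        # minimum RAM from the requirements list
MIN_FREE_DISK_GB = 10  # free storage for models and temporary files


def check_requirements(ram_gb, free_disk_gb, cores):
    """Return a list of warnings; an empty list means every check passed."""
    warnings = []
    if ram_gb < MIN_RAM_GB:
        warnings.append(f"RAM: {ram_gb:.1f} GB found, {MIN_RAM_GB} GB minimum")
    if free_disk_gb < MIN_FREE_DISK_GB:
        warnings.append(
            f"Storage: {free_disk_gb:.1f} GB free, {MIN_FREE_DISK_GB} GB needed"
        )
    if cores < 2:
        warnings.append("CPU: a multi-core processor is recommended")
    return warnings


def probe_system(path="."):
    """Read total RAM, free disk space at `path`, and core count (Linux)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    # Decimal gigabytes, so a machine sold as "16 GB" is not flagged
    # just because the kernel reserves a small slice of memory.
    return ram_bytes / 1e9, free_bytes / 1e9, os.cpu_count() or 1
```

Calling `check_requirements(*probe_system())` on the target machine returns the warnings, if any. Treat the result as a guideline: the source only gives rough minimums.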
FFmpeg (for audio processing):
sudo apt update
sudo apt install ffmpeg
Speaker diarization requires access to pyannote models:
- Create a HuggingFace account
- Go to pyannote/speaker-diarization-3.1
- Accept the license agreement
- Go to pyannote/segmentation-3.0
- Accept the license agreement
- Create an access token at HuggingFace Settings > Tokens
- Save your token - you'll need it when using the app
You have two options for providing your HuggingFace token:
Option A: Using a .env file (Recommended)
Create a .env file in the project directory:
cp .env.example .env
nano .env # or use your preferred editor
Add your token:
HF_TOKEN=hf_your_token_here
The application will automatically load this token on startup. The UI will show a green checkmark indicating the token is loaded.
Option B: Paste in the UI
If you don't want to store the token in a file, you can paste it directly into the web interface each time you use it.
Note: If both are provided, the UI input takes precedence over the .env file.
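The precedence rule can be illustrated with a short sketch. This is not the code in `app.py`, which is not shown here: `resolve_hf_token` is a hypothetical helper, and it assumes the `.env` file has already been loaded into the process environment under the name `HF_TOKEN`.

```python
import os


def resolve_hf_token(ui_token, env=None):
    """Pick the token to use: non-empty UI input wins, then HF_TOKEN, else None."""
    env = os.environ if env is None else env
    ui_token = (ui_token or "").strip()
    if ui_token:
        return ui_token  # UI input takes precedence over the .env value
    env_token = (env.get("HF_TOKEN") or "").strip()
    return env_token or None
```

Passing `env` explicitly keeps the function easy to test. In normal use it falls back to `os.environ`.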
The setup script will install uv (if needed) and all dependencies:
# Clone or download the project
mkdir whisper-transcriber
cd whisper-transcriber
# (copy all project files here)
# Run automated setup
chmod +x setup.sh
./setup.sh
Alternatively, set up manually:
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc # or restart your shell
# Create project directory
mkdir whisper-transcriber
cd whisper-transcriber
# (copy all project files here)
# Install all dependencies (creates venv automatically)
uv sync --python 3.11
Verify the installation:
uv run python -c "import whisperx; import gradio; print('Installation successful!')"
Start the application:
uv run python app.py
The server will start at http://0.0.0.0:7860. Access it via:
- Local: http://localhost:7860
- Network: http://<your-server-ip>:7860
Using the web interface:
- Upload Audio: Click the upload area or drag in a .wav file
- Select Model: Choose based on your accuracy/speed needs:
  - tiny: Fastest, least accurate
  - base: Fast, basic accuracy
  - small: Good balance
  - medium: Recommended for most uses
  - large-v3: Best accuracy, slowest on CPU
- HuggingFace Token: If you configured .env, you'll see a green checkmark. Otherwise, paste your token here.
- Set Speaker Limits (optional): Set these if you know the number of speakers
- Adjust CPU Threads: Match your available cores (16-32 typical)
- Click Transcribe: Wait for processing to complete
- Download: Get the transcript as a .txt file
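Matching the thread count to the available cores can be sanity-checked with a few lines of Python. This is an illustrative sketch only: `pick_cpu_threads` is a hypothetical helper, not a function from the app, and it simply clamps a requested value to the core count reported by the operating system.

```python
import os


def pick_cpu_threads(requested=None, available=None):
    """Clamp a requested thread count to the range 1..available cores."""
    available = available or os.cpu_count() or 1
    if requested is None:
        return available  # default: use every core
    return max(1, min(requested, available))
```

For example, asking for 32 threads on a 16-core machine yields 16, which avoids oversubscribing the CPU.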
The transcript includes:
- File metadata (filename, model used, timestamp)
- Speaker-labeled segments with timestamps
Example:
Transcription of: meeting_recording.wav
Model: medium
Speaker diarization: Yes
Generated: 2024-01-15 14:30:00
============================================================
[00:00:05] SPEAKER_00: Welcome everyone to today's meeting. Let's start with the quarterly review.
[00:00:15] SPEAKER_01: Thanks for having us. I've prepared some slides on the sales figures.
[00:01:02] SPEAKER_00: Great, please go ahead and share your screen.
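A transcript in this layout could be produced along the following lines. This is a sketch under stated assumptions, not the app's actual code: the function names are hypothetical, segments are assumed to be dictionaries with `start` (seconds), `speaker`, and `text` keys, and the `No` value, the `UNKNOWN` fallback label, and the 60-character separator width are guesses based on the example.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as [HH:MM:SS], dropping fractions."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def format_transcript(filename, model, segments, generated, diarized=True):
    """Build the plain-text transcript: metadata header, separator, segments."""
    header = [
        f"Transcription of: {filename}",
        f"Model: {model}",
        f"Speaker diarization: {'Yes' if diarized else 'No'}",
        f"Generated: {generated}",
        "=" * 60,  # separator width is an assumption
    ]
    body = [
        # "UNKNOWN" is a hypothetical label for segments with no speaker
        f"{format_timestamp(seg['start'])} "
        f"{seg.get('speaker', 'UNKNOWN')}: {seg['text'].strip()}"
        for seg in segments
    ]
    return "\n".join(header + body)
```

Keeping timestamp formatting in its own function makes it easy to switch to a different layout later, such as one that includes end times.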
Processing times on CPU (approximate, varies by hardware):
| Audio Length | Model Size | Est. Time (16 cores) |
|---|---|---|
| 30 min | small | 10-15 min |
| 30 min | medium | 15-25 min |
| 1 hour | small | 20-30 min |
| 1 hour | medium | 30-50 min |
| 3 hours | small | 60-90 min |
| 3 hours | medium | 90-150 min |
| 3 hours | large-v3 | 3-6 hours |
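The rows in the table scale linearly with audio length: small runs at roughly 1/3 to 1/2 of the audio duration, medium at 1/2 to 5/6, and large-v3 at 1 to 2 times. The sketch below turns that observation into a rough estimator. The name `estimate_minutes` is hypothetical, the factors are derived only from the table (16 cores), and applying them to other audio lengths or hardware is an extrapolation, not a measurement.

```python
from fractions import Fraction

# (low, high) processing time as a fraction of audio length,
# derived from the 16-core rows of the table
FACTORS = {
    "small": (Fraction(1, 3), Fraction(1, 2)),
    "medium": (Fraction(1, 2), Fraction(5, 6)),
    "large-v3": (Fraction(1), Fraction(2)),
}


def estimate_minutes(audio_minutes, model):
    """Return a (low, high) processing-time estimate in minutes."""
    low, high = FACTORS[model]
    return round(audio_minutes * low), round(audio_minutes * high)
```

Models missing from the table (tiny, base) raise a `KeyError` rather than returning a made-up figure.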
Tips for faster processing:
- Use more CPU threads (up to your core count)
- Use the small model for drafts, medium or large-v3 for final transcripts
- Ensure adequate RAM (processing loads full audio into memory)
To run the transcription server as a systemd service:
sudo nano /etc/systemd/system/whisper-transcriber.service
Add:
[Unit]
Description=WhisperX Transcription Service
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/path/to/whisper-transcriber
ExecStart=/path/to/whisper-transcriber/.venv/bin/python app.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable whisper-transcriber
sudo systemctl start whisper-transcriber
For multi-hour transcriptions, use screen or tmux to prevent interruption:
# Start a screen session
screen -S transcriber
# Run the app
uv run python app.py
# Detach with Ctrl+A, D
# Reattach later with: screen -r transcriber
If imports fail or modules are missing:
- Ensure you're using uv run python
- Reinstall dependencies: uv sync
If speaker diarization fails:
- Verify HuggingFace token is correct
- Ensure you've accepted licenses for both pyannote models
- Check token has read permissions
If you run out of memory:
- Use a smaller model (small instead of medium)
- Reduce batch size in the code
- Ensure no other memory-intensive processes are running
If processing is slow:
- Increase thread count in the UI
- Use a smaller model
- Check CPU usage to confirm all cores are being utilized
If FFmpeg is not found:
- Install FFmpeg: sudo apt install ffmpeg
- Verify: ffmpeg -version
If uv is not found:
- Run: curl -LsSf https://astral.sh/uv/install.sh | sh
- Then: source ~/.bashrc or restart your shell
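The last two checks (FFmpeg and uv on the PATH) can be automated with the standard library. The following is a small sketch, not part of the project: `find_missing_tools` is a hypothetical helper built on `shutil.which`, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil


def find_missing_tools(tools=("ffmpeg", "uv"), which=shutil.which):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means both tools are installed. Otherwise, follow the install steps above for each name returned.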
whisper-transcriber/
├── app.py # Main application
├── pyproject.toml # Project config and dependencies
├── uv.lock # Locked dependency versions (generated)
├── setup.sh # Automated setup script
├── .env.example # Template for environment variables
├── .env # Your local config (create from .env.example)
└── .venv/ # Virtual environment (created during setup)
This project uses uv for dependency management:
- 10-100x faster package installation than pip
- Automatic Python version management (downloads Python 3.11 if needed)
- Lockfile support for reproducible builds (uv.lock)
- Single command setup with uv sync
This project is licensed under the GNU General Public License v3.0.
Dependencies used:
- WhisperX - BSD-4-Clause
- Gradio - Apache-2.0
- pyannote-audio - MIT (models require license acceptance)
Acknowledgments:
- OpenAI for the original Whisper model
- Max Bain for WhisperX
- pyannote team for speaker diarization
- Gradio team for the web framework
- Astral for uv