A document-to-audiobook pipeline with a web UI. Upload documents, clean text for TTS, generate audio with voice cloning, and listen with a synced transcript player that highlights words as they're spoken.
- Extract and clean text from .docx, .pdf, .srt, .txt, and .md files. Expands abbreviations and strips footnotes, URLs, and other content that trips up voice models.
- Generate audio using F5-TTS with zero-shot voice cloning from a short reference recording. Includes per-segment QA verification via Whisper transcription.
- Listen with synced transcript -- words highlight in real time as the narrator reads, like an immersive reader. Click any word to seek to that position.
- Python 3.10+
- NVIDIA GPU with CUDA (tested on RTX 4080 Super, 16GB VRAM)
- ~2GB disk space for models (downloaded automatically on first run)
```bash
git clone https://github.com/jamditis/audioslop.git
cd audioslop
pip install flask f5-tts openai-whisper python-docx pdfplumber

# Create directories for user data
mkdir ref uploads jobs content output audio

# Copy your voice reference clip (5-15 seconds of speech, .wav format)
cp /path/to/your/voice.wav ref/

# Configure
cp .env.example .env
# Edit .env with your password and secret

python app.py
```

Open http://localhost:5000 and log in with the password from your .env file.
- Upload a document on the home page
- Review and edit the cleaned text
- Click "Generate audio" to start synthesis
- Listen in the player with synced word highlighting
The pipeline scripts work standalone without the web UI:
```bash
# Clean a document for TTS
python audioslop.py content/mybook/ -o output/mybook/

# Generate audio from cleaned text
python synthesize.py output/mybook/ --ref-audio ref/voice.wav -o audio/mybook/

# Verify audio quality
python qa.py audio/mybook/ --source output/mybook/
```

```
Browser (HTML/JS/Tailwind)
          |
Flask API (app.py)
          |
Background worker (worker.py)
          |
Pipeline: audioslop.py -> synthesize.py -> qa.py
          |
F5-TTS (voice synthesis) + Whisper (verification + word timestamps)
```
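The worker's job can be sketched as running the three pipeline stages in order and stopping at the first failure. This is a minimal illustration, not the real worker.py: the `job` dict and its keys are hypothetical, and the actual worker also tracks status and activity in SQLite.

```python
import subprocess
import sys


def pipeline_commands(job):
    """Build the three pipeline stage commands for one job.

    `job` is a hypothetical dict holding the directories the web app
    prepared for this upload.
    """
    return [
        [sys.executable, "audioslop.py", job["content_dir"], "-o", job["output_dir"]],
        [sys.executable, "synthesize.py", job["output_dir"],
         "--ref-audio", job["ref_audio"], "-o", job["audio_dir"]],
        [sys.executable, "qa.py", job["audio_dir"], "--source", job["output_dir"]],
    ]


def run_job(job):
    for cmd in pipeline_commands(job):
        # check=True stops the job at the first failing stage.
        subprocess.run(cmd, check=True)
```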
The web UI wraps three standalone Python scripts:
- `audioslop.py` -- Multi-format text extraction, TTS-specific cleaning (abbreviation expansion, dash normalization, footnote removal), and size-based chunking
- `synthesize.py` -- F5-TTS synthesis with per-paragraph generation, structural pauses between segments, and a QA verification loop
- `qa.py` -- Whisper-based transcription verification with word-level timestamps, accuracy scoring, and flow analysis
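The cleaning pass can be illustrated with a toy version of what audioslop.py does to one paragraph. The abbreviation table and regexes here are hypothetical stand-ins; the real script covers far more cases:

```python
import re

# Hypothetical abbreviation table; the real script covers many more.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "e.g.": "for example",
    "etc.": "et cetera",
}


def clean_for_tts(text):
    # Expand abbreviations that trip up voice models.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Normalize em/en dashes to a spoken-friendly pause.
    text = re.sub(r"\s*[\u2013\u2014]\s*", ", ", text)
    # Strip bare URLs and footnote markers like [12].
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\[\d+\]", "", text)
    # Collapse the whitespace the removals left behind.
    return re.sub(r"\s{2,}", " ", text).strip()
```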
The player uses Whisper's word-level timestamps to sync transcript highlighting to audio playback via binary search on a flat timeline array, updated every animation frame.
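The lookup itself is simple: given a sorted flat list of word start times built from Whisper's timestamps, a binary search finds the word active at any playback position. A Python sketch of the idea (player.js does the equivalent in JavaScript on each animation frame; the timeline here is a toy example):

```python
from bisect import bisect_right


def word_at(timeline, position):
    """Return the index of the word being spoken at `position` seconds.

    `timeline` is a sorted flat list of word start times, as produced
    from Whisper's word-level timestamps.
    """
    i = bisect_right(timeline, position) - 1
    return max(i, 0)  # clamp positions before the first word
```

Clicking a word inverts the mapping: the word's index gives its start time, which is assigned to the audio element's playback position.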
F5-TTS clones any voice from a short reference recording. For best results:
- Record 5-15 seconds of natural speech in a quiet room
- Save as .wav format
- The model mirrors whatever speaking style it hears in the reference
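A quick sanity check on a reference clip can be done with the standard library alone. This is a hypothetical helper, not part of the pipeline; it only verifies the clip's duration falls in the 5-15 second range recommended above:

```python
import wave


def check_reference(path, min_s=5.0, max_s=15.0):
    """Return (duration_seconds, ok) for a .wav reference clip."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration, min_s <= duration <= max_s
```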
| Variable | Default | Purpose |
|---|---|---|
| `AUDIOSLOP_PASSWORD` | (required) | Web UI login password |
| `AUDIOSLOP_SECRET` | `dev-secret-change-me` | Flask session secret (change in production) |
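Reading these settings would look roughly like this (a sketch under the assumption that app.py reads them from the environment; the function name is hypothetical, the variable names match the table above):

```python
import os


def load_config(env=os.environ):
    """Read the two settings above; a sketch, not the real app.py."""
    try:
        password = env["AUDIOSLOP_PASSWORD"]  # required, no default
    except KeyError:
        raise SystemExit("Set AUDIOSLOP_PASSWORD before starting the web UI")
    # Optional, with the insecure development default from the table.
    secret = env.get("AUDIOSLOP_SECRET", "dev-secret-change-me")
    return {"password": password, "secret": secret}
```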
```
audioslop/
├── app.py            # Flask web app and API routes
├── worker.py         # Background job processing thread
├── audioslop.py      # Text extraction and TTS cleaning
├── synthesize.py     # F5-TTS synthesis with QA loop
├── qa.py             # Transcription verification and timing
├── db.py             # SQLite database layer
├── activity.py       # Per-job event logging
├── static/
│   ├── app.css       # Styles
│   └── player.js     # Audio player and transcript sync engine
├── templates/        # Jinja2 templates (upload, review, player)
├── tests/            # pytest test suite
└── docs/             # Design specs and implementation plans
```
MIT
