Listen to technical Markdown out loud, with interactive pauses on code blocks.
md-tts reads a Markdown file aloud and stops on every code block, table and flashcard so you can actually look at the screen and study. It recognises `<details><summary>Q</summary>A</details>` blocks as flashcards (question → wait → answer) and detects the dominant language of the document (Spanish or English) to pick a single TTS voice for the whole session.
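The dominant-language detection can be pictured as a stopword-counting pass over the whole document. This is a minimal sketch under assumed names (`detect_dominant_language` and the stopword sets are illustrative, not the actual md-tts internals):

```python
import re

# Illustrative stopword sets; the real detector may use a different word list.
ES_STOPWORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una", "es", "para"}
EN_STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "that", "for", "it", "on"}

def detect_dominant_language(text: str) -> str:
    """Return 'es' or 'en' depending on which stopword set dominates the text."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    es_hits = sum(w in ES_STOPWORDS for w in words)
    en_hits = sum(w in EN_STOPWORDS for w in words)
    return "es" if es_hits > en_hits else "en"
```

Counting over the whole document (rather than per paragraph) is what gives a single, stable session voice.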
A --no-pause "podcast mode" is included for when you just want continuous playback in the background (commute, gym): instead of waiting on code blocks, it announces them and moves on.
Existing TTS tools for Markdown either:
- treat code blocks as silence and skip them, leaving the listener confused about what just happened;
- read code character-by-character as if it were prose ("open-paren-self-comma-x"), which is unusable; or
- support SSML pauses but not interactive pauses where playback waits for the listener.
After testing 8+ tools (Speechify, NaturalReader, Study MD Desk, VoxTrack and several SSML-based pipelines), I found none that offered the combination *parse Markdown structure → speak prose → stop on code → wait for me*. md-tts is a small Python CLI that does exactly that.
It is intentionally minimal. It targets developers who want to revise their own technical notes while away from the keyboard.
- 🛑 Interactive pauses on code blocks and tables.
- 🎴 Flashcard mode for `<details><summary>Q</summary>A</details>` (speak Q, wait, speak A).
- 🌍 ES/EN dominant-language detection: the parser picks a single session voice based on the document's dominant language. Per-paragraph voice switching was tried and proved unstable on SAPI5; it lives in the roadmap.
- 🎧 Podcast mode (`--no-pause`) that announces skipped blocks in the chosen language instead of waiting.
- 🔊 Cross-platform TTS via `pyttsx4` (SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux). No cloud account, no API key.
- 🌐 Optional Edge neural voices (`--backend edge`): natural-sounding Microsoft voices, picking a voice per paragraph based on the detected language. Requires internet.
- 🧪 Unit tested on Python 3.11 / 3.12 / 3.13 (see CI).
```bash
pip install md-tts
```

This gives you the offline pyttsx4 backend (SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux) plus the Markdown parser and interactive CLI: no API keys, no internet.

On Linux you also need `espeak` installed at the system level: `sudo apt-get install espeak libespeak1`.
| Install | Adds | When to use |
|---|---|---|
| `pip install md-tts` | base only | Local TTS playback, no neural voices. |
| `pip install "md-tts[edge]"` | edge-tts + pygame | Microsoft Edge neural voices (`--backend edge`) and real pause/resume during playback. |
| `pip install "md-tts[export]"` | edge-tts only | MP3 export (`--export out.mp3`) on headless or mobile environments where pygame/SDL2 is hard to install. |
pygame requires SDL2 native libraries and is painful to build on Termux. If you only want to generate MP3 files and listen to them with a regular Android audio player, use the `[export]` extra:

```bash
pkg install python
pip install "md-tts[export]"
md-tts notes.md --backend edge --export notes.mp3
```

That avoids pygame entirely. For local playback on Termux with pyttsx4 (no Edge), also run `pkg install espeak`. Interactive controls (SPACE pause, +/- rate) work only with the `[edge]` extra, which is not recommended on Termux.
```bash
git clone https://github.com/jmponcebe/md-tts.git
cd md-tts
uv sync --extra dev   # or: pip install -e ".[dev]"
```

```bash
# Default: interactive — ENTER skips each code block / table / flashcard.
md-tts notes.md

# Podcast mode: never wait, just announce skipped blocks.
md-tts notes.md --no-pause

# Force a language (no auto-detect):
md-tts notes.md --lang es

# Force a specific voice by id (use --list-voices to discover them):
md-tts notes.md --voice "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_ES-ES_HELENA_11.0"

# Tune speed:
md-tts notes.md --rate 220

# Inspect voices available on this system (path is optional with this flag):
md-tts --list-voices

# Switch to Microsoft Edge neural voices (requires internet, sounds much better).
# Edge mode auto-picks an es-ES voice for Spanish paragraphs and en-US for
# English ones, so a bilingual document is read with the right voice per
# paragraph automatically.
md-tts notes.md --backend edge

# Inspect the Edge voice catalogue:
md-tts --backend edge --list-voices

# Export to MP3 for offline listening (commute, gym, etc.):
md-tts notes.md --backend edge --export notes.mp3
# Code blocks become short "skipping code block" announcements; <details>
# cards keep their question + 3 s silence + answer pattern.
```

You can also run the module directly:
```bash
python -m md_tts notes.md
```

| Markdown construct | Behaviour |
|---|---|
| Headings | Spoken with a `Chapter:` / `Section:` prefix (`Capítulo:` in Spanish). |
| Paragraphs | Spoken as prose. |
| Inline code `` ` `` | Quoted in the spoken output (e.g. 'git status') so it is audibly distinct from prose. |
| Fenced code blocks | Pause + print to terminal. |
| Tables | Pause + print rows. |
| Inline images | Announced inline as `[imagen: alt]`. |
| Lists | Spoken as `Punto 1: ...`, `Punto 2: ...` (Spanish prefix currently used in both languages). |
| Block quotes | Prefixed with `Cita:`. |
| HR (`---`) | Spoken as `Separador.` |
| `<details><summary>Q</summary>A</details>` | Flashcard: speak Q, wait for ENTER, speak A. |
Math blocks (`$$ ... $$`) and standalone image blocks are not yet detected as pause points; they fall through as text. Adding them is on the roadmap.
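The inline-code quoting behaviour from the table above can be sketched as a single substitution over the prose text. This is a minimal illustration (the real md-tts parser works on markdown-it tokens, and the function name is an assumption):

```python
import re

def quote_inline_code(text: str) -> str:
    """Replace `code` spans with quoted speech, e.g. `git status` -> 'git status'.

    A sketch of the behaviour described above, not the actual implementation.
    """
    return re.sub(r"`([^`]+)`", r"'\1'", text)
```

The single quotes make the TTS engine pause slightly around the span, which is what makes the code audibly distinct from prose.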
```text
.md file
  │
  ▼
parser.parse_markdown(text) → Iterator[Block]
  │   kind ∈ {text, code, table, card}
  ▼
cli.run()  ← argparse + interactive loop
  │
  ▼
reader.build_reader(backend) → LocalReader (pyttsx4) or EdgeReader (edge-tts)
```
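The `Block` stream in the diagram can be pictured as a small dataclass plus a generator. The following is a sketch under assumed field names (the real parser builds on markdown-it-py tokens rather than a hand-rolled splitter):

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Block:
    kind: str   # one of {"text", "code", "table", "card"}; names are assumptions
    text: str   # prose to speak, or raw content to print during a pause

def parse_markdown(text: str) -> Iterator[Block]:
    """Naive fence-aware splitter, for illustration only."""
    in_code, buf = False, []
    for line in text.splitlines():
        if line.strip().startswith("```"):
            # A fence closes the current run: prose before it, code inside it.
            yield Block("code" if in_code else "text", "\n".join(buf))
            in_code, buf = not in_code, []
        else:
            buf.append(line)
    if buf:
        yield Block("text", "\n".join(buf))
```

The CLI loop then just iterates: speak `text` blocks, pause and print `code`/`table`/`card` blocks.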
Three modules plus per-backend implementations, ~700 lines total. The parser builds on top of markdown-it-py and pre-processes <details> HTML blocks with a regex/placeholder trick before parsing, because markdown-it treats raw HTML as opaque tokens.
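The `<details>` pre-processing could look roughly like this. It is a hedged sketch of the regex/placeholder idea only (function name and placeholder format are assumptions, not the actual md-tts code):

```python
import re

DETAILS_RE = re.compile(
    r"<details>\s*<summary>(?P<q>.*?)</summary>(?P<a>.*?)</details>",
    re.DOTALL,
)

def extract_cards(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Pull <details> flashcards out of the Markdown before parsing.

    markdown-it would treat the raw HTML as an opaque token, so each card is
    replaced with a plain-text placeholder that survives parsing and is later
    resolved back to its (question, answer) pair.
    """
    cards: list[tuple[str, str]] = []

    def _sub(m: re.Match) -> str:
        cards.append((m.group("q").strip(), m.group("a").strip()))
        return f"\n\nCARD-PLACEHOLDER-{len(cards) - 1}\n\n"

    return DETAILS_RE.sub(_sub, text), cards
```

Surrounding the placeholder with blank lines keeps it a standalone paragraph, so markdown-it emits it as its own token.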
The local backend uses pyttsx4 (a maintained fork of pyttsx3) because pyttsx3 2.99 exhibits a SAPI5 bug on Windows where only the first runAndWait() call produces audio. The edge backend uses edge-tts to call Microsoft Edge's neural voices over HTTPS (no Microsoft account, no API key) and plays the resulting MP3 with pygame.mixer.music, which exposes real pause/unpause cross-platform (SDL_mixer under the hood) — that's what enables the SPACE control during a paragraph. Per-paragraph voice switching works on edge because each utterance is an independent HTTP request, with no shared engine state to corrupt.
- Interactive controls during playback (SPACE / s / n / b / +/- / q) — v0.3
- Optional cloud-quality TTS backend (Microsoft Edge neural voices) — v0.2
- Rewind / skip-back during interactive mode — v0.3
- MP3 export of an entire document for offline mobile listening — v0.4
- PyPI release (`pip install md-tts`)
- Math blocks (`$$ ... $$`) detected as pause points instead of being read as prose
- Standalone image blocks announced as `[image: alt-text]` instead of being silently flattened
- Bookmarks: persist a per-document position so `--resume` picks up where you left off
- `--chapter` flag to start playback from a specific heading
- Real-time rate change (requires a streaming pitch-preserving resampler, which is non-trivial)
- More backends: Piper (local neural, fast, free), Azure TTS (premium voices via API key)
```bash
uv sync --extra dev   # install dev extras (pytest, pytest-cov, ruff)
uv run pytest         # 48 tests
uv run ruff check .
uv run ruff format .
```

Conventional commits, feature branches off main, squash-merge by default. See .github/copilot-instructions.md for the full contributor guide.
MIT — see LICENSE.
Jose María Ponce Bernabé. Built as a side-project while studying for AI / Data Engineering interviews — needed a way to revise PharmaGraphRAG and DengueMLOps notes during commutes.