# Audiobook Generator (XTTS v2 model) — Notebook

This notebook generates an audiobook-style audio file from a **TXT** file using a **reference voice sample** (**MP3**) for speaker conditioning via **Coqui TTS (XTTS v2)**.

**Folder assumption:** the reference voice MP3 and the input TXT are in the **same folder** as this notebook (or you can point to a different folder).

**Ethics/safety:** Only generate a voice you own or have **explicit permission** to use.


## 0. Install the dependencies
- `coqui-tts` for XTTS v2
- `ffmpeg` is required for MP3 I/O (conversion + final MP3 export)

If you don't have FFmpeg installed:
- Windows: install via Chocolatey (`choco install ffmpeg`), or download a prebuilt archive (for example `ffmpeg-7.1.1-full_build-shared.7z`) and add its `bin` folder to PATH
- macOS: `brew install ffmpeg`
- Linux (Debian/Ubuntu): `sudo apt-get install ffmpeg`

Then run: `pip install -r requirements.txt`

Alternatively, set up a virtual environment by hand, as I did on Windows:

```bash
python -m venv audiogen
audiogen\Scripts\activate
pip install ipykernel coqui-tts
pip install "transformers==5.0.0"
uv pip install torch torchaudio torchcodec --torch-backend=auto
git clone https://github.com/idiap/coqui-ai-TTS
cd coqui-ai-TTS
uv pip install -e .[notebooks]
```
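Before moving on, it can help to confirm that FFmpeg is actually reachable from Python. The snippet below is a minimal sketch using only the standard library; the helper name `find_tool` is made up for this example and is not part of Coqui TTS.

```python
import shutil


def find_tool(name: str = "ffmpeg"):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


path = find_tool("ffmpeg")
if path is None:
    print("ffmpeg not found on PATH; MP3 conversion and export will fail.")
else:
    print(f"ffmpeg found at {path}")
```

If the check reports that FFmpeg is missing right after installing it, restarting the terminal (or the notebook kernel) so the updated PATH is picked up is usually worth trying first.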

## 1. Imports and helpers

In [None]:
import os, re, torch
from TTS.api import TTS
os.environ["COQUI_TOS_AGREED"] = "1"

# # Use the GPU if an NVIDIA GPU is present. AMD GPUs are unsupported or have very limited support.
# device = "cuda" if torch.cuda.is_available() else "cpu"
# print(TTS().list_models())

['tts_models/multilingual/multi-dataset/xtts_v2', 'tts_models/multilingual/multi-dataset/xtts_v1.1', 'tts_models/multilingual/multi-dataset/your_tts', 'tts_models/multilingual/multi-dataset/bark', 'tts_models/bg/cv/vits', 'tts_models/cs/cv/vits', 'tts_models/da/cv/vits', 'tts_models/et/cv/vits', 'tts_models/ga/cv/vits', 'tts_models/en/ek1/tacotron2', 'tts_models/en/ljspeech/tacotron2-DDC', 'tts_models/en/ljspeech/tacotron2-DDC_ph', 'tts_models/en/ljspeech/glow-tts', 'tts_models/en/ljspeech/speedy-speech', 'tts_models/en/ljspeech/tacotron2-DCA', 'tts_models/en/ljspeech/vits', 'tts_models/en/ljspeech/vits--neon', 'tts_models/en/ljspeech/fast_pitch', 'tts_models/en/ljspeech/overflow', 'tts_models/en/ljspeech/neural_hmm', 'tts_models/en/vctk/vits', 'tts_models/en/vctk/fast_pitch', 'tts_models/en/sam/tacotron2-DCA', 'tts_models/en/blizzard2013/capacitron-t2-c50', 'tts_models/en/blizzard2013/capacitron-t2-c150_v2', 'tts_models/en/multi-dataset/tortoise-v2', 'tts_models/en/jenny/jenny', 'tts_

In [None]:
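# Set the cache location for TTS data (models and cached voices) unless it is already configured

WORKDIR = os.getcwd()
if ("XDG_DATA_HOME" not in os.environ) and ("TTS_HOME" not in os.environ):
    # os.path.join keeps the path valid on Windows, macOS, and Linux
    os.environ["XDG_DATA_HOME"] = os.path.join(WORKDIR, "temp")
    os.environ["TTS_HOME"] = os.path.join(WORKDIR, "temp")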

In [83]:
def normalize_text(text: str) -> str:
    # Remove hyphenation at line breaks: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse whitespace
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

file_path = 'thought out thought.txt'
contents = []
with open(file_path, 'r') as file:
    for line in file:
        if line.strip():
            contents.append(normalize_text(line))

print(contents)

['THOUGHT OUT THOUGHT', 'Thoughts are not you. They are not yours to believe, not yours to follow, not even for you to be drawn near. Thought is a natural process—it appears, and it passes. So thought is not you, not for you, not happening to you, nor even against you. Only ignorance can blinds you from noticing this fact.', 'For many of us, the opposite seems true. In fact, thoughts can be so real that we cannot dispute or oppose them. Take the case of regret over a deed wrongly done— or not done when it should have been. Or worry over what may happen in the future. Such thoughts can feel so real that, in extreme cases, they drive a person to take their own life, or even act in revenge.', 'In ordinary daily life, even a single passing thought—such as judging another—can make us believe in it entirely. If such small thoughts have this much sway, how much more powerful the severe and extreme ones?', 'Certain thoughts may be trivial or unfounded, yet they can still cause stress—like forg
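
Because the loop above normalizes one line at a time, the hyphenation rule in `normalize_text` never sees a line break. One possible variant, sketched below, normalizes the whole text first and then splits it into paragraphs. It assumes paragraphs are separated by blank lines, which may not match every input file; `split_paragraphs` is a hypothetical helper name, and `normalize_text` is repeated only so the block runs on its own.

```python
import re


def normalize_text(text: str) -> str:
    # Same rules as the helper defined earlier in this notebook.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_paragraphs(raw: str) -> list[str]:
    """Normalize the whole text, then split it into paragraphs on blank lines."""
    cleaned = normalize_text(raw)
    paragraphs = re.split(r"\n\s*\n", cleaned)
    # Join soft-wrapped lines inside a paragraph into a single line.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


sample = (
    "THOUGHT OUT THOUGHT\n\n\n\n"
    "Thoughts are not you. Thought is a natural pro-\ncess.\n"
)
print(split_paragraphs(sample))
```

With a real file, `split_paragraphs(open(file_path).read())` would play the role of `contents` in the cells that follow.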

## 2. Provide reference audio and generate a short sample with that voice

In [None]:
# initialize TTS model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# reference audio
reference_files = ["reference_voice.mp3"]
tts.tts_to_file(
    text=contents[0],
    speaker_wav=reference_files,
    language="en",
    file_path="output.wav"
)

'output.wav'
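
If loading the MP3 reference directly causes trouble on a given setup, one option is to convert it to WAV with FFmpeg first and pass the WAV as `speaker_wav`. The sketch below only builds and prints the command; the helper names are hypothetical, and the mono, 22050 Hz target is an assumption chosen for illustration, not a documented XTTS requirement.

```python
import subprocess


def mp3_to_wav_command(src: str, dst: str, sample_rate: int = 22050) -> list[str]:
    """Build an ffmpeg command that converts src to a mono WAV at sample_rate Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample (22050 Hz is an assumed default)
        dst,
    ]


def convert_reference(src: str, dst: str) -> str:
    """Run the conversion; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(mp3_to_wav_command(src, dst), check=True)
    return dst


# Example (not run here, since it needs ffmpeg and the input file):
# reference_files = [convert_reference("reference_voice.mp3", "reference_voice.wav")]
print(" ".join(mp3_to_wav_command("reference_voice.mp3", "reference_voice.wav")))
```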

## 3. Cache reference voices for easy reuse

In [None]:
# Passing both speaker_wav and speaker caches the cloned voice under that speaker ID
tts.tts_to_file(
    text=contents[6],
    speaker_wav=reference_files,
    speaker="HorTuckLoon",
    language="en",
    file_path="output.wav"
)

'output.wav'

In [77]:
# Reuse the cached voice by its speaker ID; speaker_wav is no longer needed
tts.tts_to_file(
    text=contents[0],
    speaker="HorTuckLoon",
    language="en",
)

'output.wav'
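
With the voice cached, a full audiobook could be assembled by synthesizing one WAV per paragraph and joining the parts. The block below is a rough sketch of the joining step only, using the standard-library `wave` module. It assumes every part is an uncompressed PCM WAV with the same channel count, sample width, and sample rate; `concat_wavs` and the `part_XXXX.wav` file names are hypothetical.

```python
import wave


def concat_wavs(parts, out_path):
    """Concatenate PCM WAV files that share channels, sample width, and rate."""
    parts = list(parts)
    if not parts:
        raise ValueError("no input files given")
    params = None
    with wave.open(str(out_path), "wb") as out:
        for part in parts:
            with wave.open(str(part), "rb") as src:
                if params is None:
                    # Copy the audio format from the first part.
                    params = src.getparams()
                    out.setparams(params)
                elif src.getparams()[:3] != params[:3]:
                    raise ValueError(f"{part} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
    return out_path


# Intended use with the objects from this notebook (not run here):
# part_files = []
# for i, paragraph in enumerate(contents):
#     part = f"part_{i:04d}.wav"
#     tts.tts_to_file(text=paragraph, speaker="HorTuckLoon",
#                     language="en", file_path=part)
#     part_files.append(part)
# concat_wavs(part_files, "audiobook.wav")
# Final MP3 export from a shell: ffmpeg -i audiobook.wav audiobook.mp3
```

Writing each paragraph to its own file keeps a failed run cheap to resume, since only the missing parts need to be synthesized again.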