This repository contains the essential code for cloning any voice using just text and a 10-second audio sample of the target voice. XTTS-2-UI is simple to set up and use; see the Example Results 🔊 below. It works in 16 languages and has built-in voice recording/uploading.
The model used is `tts_models/multilingual/multi-dataset/xtts_v2`. For more details, refer to Hugging Face - XTTS-v2 and its specific version XTTS-v2 Version 2.0.2.
To set up this project, follow these steps in a terminal:
1. Clone the repository to your local machine:

   ```bash
   git clone https://github.com/pbanuru/xtts2-ui.git
   cd xtts2-ui
   ```

2. Create a virtual environment. Run the following command to create a Python virtual environment (tested on Python 3.10.10):

   ```bash
   python -m venv venv
   ```

   Then activate it:

   - Windows (cmd prompt):

     ```bash
     venv\Scripts\activate
     ```

   - Windows (git bash):

     ```bash
     source venv/Scripts/activate
     ```

   - Linux/Mac:

     ```bash
     source venv/bin/activate
     ```

3. Install PyTorch:

   - Before installing PyTorch, check your CUDA version by running:

     ```bash
     nvcc --version
     ```

   - If you have an Nvidia CUDA-enabled GPU, choose the appropriate PyTorch installation command.

     For CUDA 12.1:

     ```bash
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
     ```

     For CUDA 11.8:

     ```bash
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
     ```

   - If you don't have a CUDA-enabled GPU, follow the instructions on the PyTorch website to install the appropriate version of PyTorch for your system.

   (A quick sketch for verifying the install follows these steps.)

4. Install the other required packages:

   - Install direct dependencies:

     ```bash
     pip install -r requirements.txt
     ```

   - Upgrade the TTS package to the latest version:

     ```bash
     pip install --upgrade TTS
     ```

After completing these steps, your setup should be complete and you can start using the project.
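To confirm that PyTorch installed correctly and, if applicable, that it can see your GPU, you can run a quick sanity check. This is generic PyTorch usage, not part of this repository:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-enabled GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```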
Models will be downloaded automatically upon first use.

Download paths:

- macOS: `/Users/USR/Library/Application Support/tts/tts_models--multilingual--multi-dataset--xtts_v2`
- Windows: `C:\Users\YOUR-USER-ACCOUNT\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2`
- Linux: `/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2`
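Loading the model once is enough to trigger the download. As a minimal sketch of doing this programmatically, assuming the standard `TTS.api` interface of the Coqui `TTS` package (the app does this for you; the file names below are placeholders):

```python
from TTS.api import TTS

# First instantiation downloads the model to the path above and caches it.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice: speak the text in the style of the reference WAV.
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="targets/example.wav",  # placeholder reference sample
    language="en",
    file_path="output.wav",
)
```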
To run the application:

```bash
python app.py
```
On initial use, you will need to agree to the terms:

```
[XTTS] Loading XTTS...
 > tts_models/multilingual/multi-dataset/xtts_v2 has been updated, clearing model cache...
 > You must agree to the terms of service to use this model.
 | > Please see the terms of service at https://coqui.ai/cpml.txt
 | > "I have read, understood and agreed to the Terms and Conditions." - [y/n]
 | | >
```
If your model is re-downloading each run, please consult Issue 4723 on GitHub.
The dataset consists of a single folder named `targets`, pre-populated with several voices for testing purposes.

To add more voices (if you don't want to go through the GUI), create a 24 kHz WAV file of approximately 10 seconds and place it under the `targets` folder.
You can use yt-dlp to download a voice from YouTube for cloning:

```bash
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?"
```
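Audio downloaded this way usually won't be 24 kHz mono, so resample it before placing it in `targets`. A minimal sketch using torchaudio (installed alongside PyTorch above); the file names are placeholders:

```python
import torchaudio
import torchaudio.functional as F

# Load the downloaded clip (any sample rate / channel count).
waveform, sr = torchaudio.load("downloaded.wav")

# Mix down to mono and resample to the 24 kHz the model expects.
mono = waveform.mean(dim=0, keepdim=True)
resampled = F.resample(mono, orig_freq=sr, new_freq=24000)

torchaudio.save("targets/my_voice.wav", resampled, 24000)
```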
Example Results 🔊

| Language | Audio Sample Link |
|----------|-------------------|
| English  |                   |
| Russian  |                   |
| Arabic   |                   |
Supported languages: Arabic, Chinese, Czech, Dutch, English, French, German, Hungarian, Italian, Japanese (see setup below), Korean, Polish, Portuguese, Russian, Spanish, Turkish.
If you would like to select Japanese as the target language, you must install a dictionary.
```bash
# Lite version
pip install fugashi[unidic-lite]
```

or, for more serious processing:

```bash
# Full version
pip install 'fugashi[unidic]'
python -m unidic download
```
More details here.
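To confirm the Japanese tokenizer works after installation, you can run a quick check; this is standard fugashi usage, not specific to this repo:

```python
import fugashi

# Tokenize a short Japanese sentence and print each token's surface form.
tagger = fugashi.Tagger()
for word in tagger("こんにちは、世界。"):
    print(word.surface)
```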
- Heavily based on https://github.com/kanttouchthis/text_generation_webui_xtts/