The VoiceCraft API is a user-friendly, easy-to-install, Windows-compatible FastAPI application that extends the VoiceCraft text-to-speech (TTS) model with a convenient interface for generating speech audio from text. It comes with one-click installers for Windows and Linux.
This guide provides an overview of the API, explains how to install and run it, and gives an example of how to use it. It was made for Pandrator.
The API endpoint `/generate` accepts POST requests with several parameters for customizing the TTS generation process:
- `time`: The cut-off time in seconds, i.e. how much of the sample is used for voice cloning; recommended between 3 and 9 (required).
- `target_text`: The text you wish to generate speech for (required).
- `audio`: The input audio file in WAV format (should be 16,000 Hz and mono) whose voice will be cloned (required).
- `transcript`: The full transcript of the input audio file, with the same name as the WAV file (required).
- `save_to_file`: Whether to save the generated audio to a file (default `True`).
- `output_path`: The directory where the output audio file should be saved (default `.`).
- `model_name`: The name of the model you wish to use, either `VoiceCraft_830M_TTSEnhanced` (larger) or `VoiceCraft_gigaHalfLibri330M_TTSEnhanced_max16s` (smaller). The default is the 330M model, which is the one the installer downloads.
- Additional parameters for fine-tuning the generation (`top_k`, `top_p`, `temperature`, `stop_repetition`, `kvcache`, `sample_batch_size`, `device`).
The response will either be a JSON object containing a message and the output file path (if `save_to_file` is `True`) or a streaming response with the generated audio (if `save_to_file` is `False`).
After starting the API server, you can explore and test the API using the Swagger UI by navigating to `http://127.0.0.1:8245/docs` in your browser. This interface allows you to easily send requests to the API and view responses.
Below is an example Python script demonstrating how to send a request to the API:
```python
import requests

url = 'http://127.0.0.1:8245/generate'

files = {
    'audio': open('path/to/your/audio.wav', 'rb'),
    'transcript': open('path/to/your/transcript.txt', 'rb')
}
data = {
    'time': 5.0,
    'target_text': 'The text you want to generate',
    'save_to_file': True,
    'output_path': './generated_audios',
    'model_name': 'VoiceCraft_gigaHalfLibri330M_TTSEnhanced_max16s',
    # Add other form fields as needed
}

response = requests.post(url, files=files, data=data)
print(response.json())
```
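If you set `save_to_file` to `False`, the API streams the generated audio back instead of returning JSON. A minimal sketch of handling that case, assuming the stream contains the audio bytes as-is (check the response's `Content-Type` header to confirm the format):

```python
import requests

url = 'http://127.0.0.1:8245/generate'
files = {
    'audio': open('path/to/your/audio.wav', 'rb'),
    'transcript': open('path/to/your/transcript.txt', 'rb')
}
data = {
    'time': 5.0,
    'target_text': 'The text you want to generate',
    'save_to_file': False,
}

# stream=True avoids buffering the whole response in memory
with requests.post(url, files=files, data=data, stream=True) as response:
    response.raise_for_status()
    with open('generated.wav', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```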
You can use `install_and_run.sh` to install the API as well as start it later on Linux systems (it supports various package managers).
- Download the script from Releases
- Make the script executable: `chmod +x install_and_run.sh`
- Run the script: `sudo ./install_and_run.sh`
- Download the .exe or .py file from Releases
- Open the .exe with administrator privileges if you want it to install git and ffmpeg automatically, or
- Run the .py file from the Windows Terminal
- Make sure that `git`, `ffmpeg`, `miniconda`, and `espeak-ng` (if on Linux) are installed.
- Clone the VoiceCraft API repository: `git clone https://github.com/lukaszliniewicz/VoiceCraft_API.git`
- Change into the repository directory: `cd VoiceCraft_API`
- Create a conda environment named `voicecraft_api`: `conda create -n voicecraft_api python=3.9.16`
- Activate the environment: `conda activate voicecraft_api`
- Install audiocraft: `pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft`
- Install PyTorch and related packages: `conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia`
- Install the API requirements: `pip install -r requirements.txt`
- Install Montreal Forced Aligner: `conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068`
- Install the Montreal Forced Aligner models: `mfa model download dictionary english_us_arpa` and `mfa model download acoustic english_us_arpa`
- Install `ffmpeg` as per your OS instructions.
- If running on Windows, after installing `audiocraft`, replace the specified files with those from the `audiocraft_windows` directory in this repository to make it compatible with Windows (a convenience snippet for applying these replacements follows the installation steps below):
  - Replace `src/audiocraft/audiocraft/utils/cluster.py` with `audiocraft_windows/cluster.py`
  - Replace `src/audiocraft/audiocraft/environment.py` with `audiocraft_windows/environment.py`
  - Replace `src/audiocraft/audiocraft/utils/checkpoint.py` with `audiocraft_windows/checkpoint.py`
- Download the model and the encoder (one of the `.pth` files and the `.th` file) from HuggingFace into the `pretrained_models` folder in the repository.
- Run the API (remember to always activate the Conda environment first!):
`python api.py` (Windows) or `python3 api.py` (Linux)
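One way to apply the Windows file replacements mentioned above is a small Python sketch run from the repository root (the paths are copied from the list above; verify them against your checkout):

```python
import shutil

# Overwrite the stock audiocraft files with the Windows-compatible versions
# from the audiocraft_windows directory of this repository.
for src, dst in [
    ("audiocraft_windows/cluster.py", "src/audiocraft/audiocraft/utils/cluster.py"),
    ("audiocraft_windows/environment.py", "src/audiocraft/audiocraft/environment.py"),
    ("audiocraft_windows/checkpoint.py", "src/audiocraft/audiocraft/utils/checkpoint.py"),
]:
    shutil.copyfile(src, dst)
```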
- The API automatically performs audio-text alignment if it has not already been performed for the given WAV/TXT pair, and prepends the correct portion of the transcript to the prompt. It creates a folder for each "voice" containing the WAV/TXT pair and the alignment CSV.
- You can simply send the text you want to generate, and the rest is handled automatically.
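For illustration only, a hypothetical layout for one such voice folder (the exact folder and file names are chosen by the API, so treat these as assumptions):

```
voices/
└── my_speaker/
    ├── my_speaker.wav   # 16,000 Hz mono reference sample
    ├── my_speaker.txt   # full transcript of the WAV
    └── my_speaker.csv   # alignment produced on first use
```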
VoiceCraft is a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
There are three ways:
- with Google Colab. see quickstart colab
- with docker. see quickstart docker
- without docker. see environment setup
When you are inside the docker image or have installed all dependencies, check out `inference_tts.ipynb`.
If you want to do model development such as training/finetuning, I recommend following environment setup and training.
⭐ 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 here!
⭐ 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight, the model outperforms giga830M on TTS. Weights are here. Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data)
- Codebase upload
- Environment setup
- Inference demo for speech editing and TTS
- Training guidance
- RealEdit dataset and training manifest
- Model weights (giga330M.pth, giga830M.pth, and gigaHalfLibri330M_TTSEnhanced_max16s.pth)
- Better guidance on training/finetuning
- Write colab notebooks for better hands-on experience
- HuggingFace Spaces demo
- Command line
- Improve efficiency
⭐ To try out speech editing or TTS Inference with VoiceCraft, the simplest way is using Google Colab. Instructions to run are on the Colab itself.
- To try Speech Editing
- To try TTS Inference
⭐ To try out TTS inference with VoiceCraft, you can also use docker. Thanks to @ubergarm and @jayc88 for making this happen.
Tested on Linux and Windows and should work with any host with docker installed.
```bash
# 1. clone the repo into a directory on a drive with plenty of free space
git clone git@github.com:jasonppy/VoiceCraft.git
cd VoiceCraft

# 2. assumes you have docker installed with the nvidia container-toolkit (windows has this built into the driver)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.13.5/install-guide.html
# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...

# 3. first build the docker image
docker build --tag "voicecraft" .

# 4. try to start an existing container, otherwise create a new one passing in all GPUs
./start-jupyter.sh   # linux
start-jupyter.bat    # windows

# 5. now open a webpage on the host box to the URL shown at the bottom of:
docker logs jupyter

# 6. optionally look inside from another terminal
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update

# 7. confirm video card(s) are visible inside the container
nvidia-smi

# 8. now, in the browser, open inference_tts.ipynb and work through one cell at a time
echo GOOD LUCK
```
```bash
conda create -n voicecraft python=3.9.16
conda activate voicecraft

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise checkout https://pytorch.org/get-started/previous-versions/#v201
apt-get install ffmpeg # if you don't already have ffmpeg installed
apt-get install espeak-ng # backend for the phonemizer installed below
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
# install MFA for getting forced-alignment, this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# install MFA english dictionary and model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# pip install huggingface_hub
# conda install pocl # above gives a warning about installing pocl, not sure if it's really needed
# to run ipynb
conda install -n voicecraft ipykernel --no-deps --force-reinstall
```
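As a quick sanity check of the environment before opening the notebooks, a minimal Python snippet (the expected versions simply mirror the pins above):

```python
import torch
import torchaudio

# confirm the pinned versions installed and CUDA is visible
print(torch.__version__)          # expected: 2.0.1
print(torchaudio.__version__)     # expected: 2.0.2
print(torch.cuda.is_available())  # should be True on a CUDA 11.7-capable system
```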
If you encounter version issues when running things, check out `environment.yml` for exact version matching.
Check out `inference_speech_editing.ipynb` and `inference_tts.ipynb`.
To train a VoiceCraft model, you need to prepare the following parts:
- utterances and their transcripts
- encode the utterances into codes using e.g. Encodec
- convert transcripts into phoneme sequences, and build a phoneme set (we named it vocab.txt)
- manifest (i.e. metadata)
Steps 1, 2, and 3 are handled in `./data/phonemize_encodec_encode_hf.py`, where
- Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
- phoneme sequences and encodec codes are also extracted using the script.
An example run:
```bash
conda activate voicecraft
export CUDA_VISIBLE_DEVICES=0
cd ./data
python phonemize_encodec_encode_hf.py \
  --dataset_size xs \
  --download_to path/to/store_huggingface_downloads \
  --save_dir path/to/store_extracted_codes_and_phonemes \
  --encodec_model_path path/to/encodec_model \
  --mega_batch_size 120 \
  --batch_size 32 \
  --max_len 30000
```
where `encodec_model_path` is available here. This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our paper. If you encounter OOM during extraction, try decreasing `batch_size` and/or `max_len`.
The extracted codes, phonemes, and `vocab.txt` will be stored at `path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}`.
As for the manifest, please download `train.txt` and `validation.txt` from here, and put them under `path/to/store_extracted_codes_and_phonemes/manifest/`. Please also download `vocab.txt` from here if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).
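Derived from the paths above, the data directory for the `xs` split should end up looking roughly like this:

```
path/to/store_extracted_codes_and_phonemes/
├── xs/
│   ├── encodec_16khz_4codebooks/
│   ├── phonemes/
│   └── vocab.txt
└── manifest/
    ├── train.txt
    └── validation.txt
```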
Now, you are good to start training!
```bash
conda activate voicecraft
cd ./z_scripts
bash e830M.sh
```
The same procedure applies to preparing your own custom dataset.
You also need to do steps 1-4 as in Training, and I recommend using AdamW for optimization if you finetune a pretrained model, for better stability. Check out the script `./z_scripts/e830M_ft.sh`.
If your dataset introduces new phonemes (which is very likely) that don't exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust `--text_vocab_size` and `--text_pad_token` so that the former is bigger than or equal to your vocab size, and the latter has the same value as `--text_vocab_size` (i.e. `--text_pad_token` is always the last token). Also, since the text embeddings are now of a different size, make sure you modify the weights-loading part so that it won't crash (you could skip loading `text_embedding`, or only load the existing part and randomly initialize the new rows), as sketched below.
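A minimal, self-contained sketch of that partial loading, using a toy `nn.Embedding` in place of the real model (the sizes and key names are illustrative assumptions, not the checkpoint's actual layout):

```python
import torch.nn as nn

# Toy stand-in: the "checkpoint" was trained with 100 phoneme tokens,
# the finetuning vocab has 120 (new phonemes appended at the end).
old_emb = nn.Embedding(100, 16)   # pretrained embedding
new_emb = nn.Embedding(120, 16)   # enlarged embedding for the new vocab

ckpt = {"weight": old_emb.weight.data}      # pretend checkpoint entry
state = new_emb.state_dict()
n = min(state["weight"].shape[0], ckpt["weight"].shape[0])
state["weight"][:n] = ckpt["weight"][:n]    # reuse the existing rows
new_emb.load_state_dict(state)              # the 20 new rows keep their random init
```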
The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under the Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that is under different licenses: `./models/codebooks_patterns.py` is under the MIT license; `./models/modules`, `./steps/optim.py`, and `data/tokenizer.py` are under the Apache License, Version 2.0; the phonemizer we used is under the GNU 3.0 License.
We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing encodec.
```bibtex
@article{peng2024voicecraft,
  author  = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
  title   = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal = {arXiv},
  year    = {2024},
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.