This repo currently supports Text-to-Audio (including Music), Text-to-Speech generation, and Super-Resolution Inpainting.
- 2023-08-27: Add two new checkpoints!
- Add the text-to-speech checkpoint
- Open-source the AudioLDM training code.
- Support the generation of longer audio (> 10s)
- Optimize the inference speed of the model.
- Integration with the Diffusers library (see 🧨 Diffusers)
- Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PRs welcome; same logic as AudioLDMv1)
- Prepare running environment
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
- Start the web application (powered by Gradio)
python3 app.py
- A link will be printed. Open it in your browser to try the demo.
Prepare running environment
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
If you plan to use text-to-speech generation, make sure espeak is installed. On Linux, you can install it with:
sudo apt-get install espeak
- Generate sound effects or music based on a text prompt
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
- Generate sound effects or music based on a list of text prompts
audioldm2 -tl batch.lst
- Generate speech based on (1) the transcription and (2) the description of the speaker
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"
audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"
Text-to-Speech uses the audioldm2-speech-gigaspeech checkpoint by default. To run TTS with the checkpoint pretrained on LJSpeech, set --model_name audioldm2-speech-ljspeech.
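The `-tl batch.lst` option above takes a file of text prompts. A minimal sketch for producing such a file is shown here; it assumes the file holds one prompt per line, which this README does not state explicitly, and `write_prompt_list` is a hypothetical helper, not part of AudioLDM 2.

```python
from pathlib import Path


def write_prompt_list(prompts, path="batch.lst"):
    """Write prompts to a text file, one per line (assumed format), skipping blanks."""
    cleaned = [p.strip() for p in prompts if p.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Illustrative prompts only; replace them with your own.
example_prompts = [
    "Musical constellations twinkling in the night sky, forming a cosmic melody.",
    "Techno music with a strong, upbeat tempo and high melodic riffs.",
]
written = write_prompt_list(example_prompts)
print(f"Wrote {len(written)} prompts to batch.lst")
```

The resulting file can then be passed to the CLI with `audioldm2 -tl batch.lst`.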
The model may sometimes perform poorly (output that sounds odd or low quality) when run on different hardware. In that case, try different random seeds to find one that works well on your hardware.
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
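Trying several seeds by hand gets tedious, so a small script can build one command per seed. The sketch below only uses the `--seed` and `-t` flags shown above; `seed_sweep_commands` is a hypothetical helper and the seed values are arbitrary. It prints the commands rather than running them.

```python
import shlex


def seed_sweep_commands(prompt, seeds):
    """Build one audioldm2 command (as an argv list) per random seed."""
    return [["audioldm2", "--seed", str(seed), "-t", prompt] for seed in seeds]


prompt = "Musical constellations twinkling in the night sky, forming a cosmic melody."
commands = seed_sweep_commands(prompt, [0, 42, 1234])

for argv in commands:
    # Print a shell-safe command line; to run it for real, use
    # subprocess.run(argv, check=True) instead.
    print(shlex.join(argv))
```

Listen to each result and keep the seed that sounds best on your machine.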
You can choose the model checkpoint by setting --model_name:
# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
We have seven checkpoints you can choose from:
- audioldm2-full (default): Generates both sound effects and music with the AudioLDM2 architecture.
- audioldm_48k: Generates high-fidelity sound effects and music.
- audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
- audioldm2-full-large-1150k: Larger version of audioldm2-full.
- audioldm2-music-665k: Music generation.
- audioldm2-speech-gigaspeech (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.
- audioldm2-speech-ljspeech: Text-to-Speech, trained on LJSpeech Dataset.
We currently support 3 devices:
- cpu
- cuda
- mps (note that the computation requires about 20 GB of RAM)
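If you want to decide on a device yourself before passing `--device`, a check along the following lines may help. This is a sketch, not the logic AudioLDM 2 uses internally: the priority order (cuda, then mps, then cpu) is an assumption, and `pick_device` and `detect_device` are hypothetical helpers.

```python
def pick_device(cuda_available, mps_available):
    """Return a device name; the preference order here is an assumption."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available backends, falling back to cpu without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is missing in older PyTorch releases.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

The printed name can be passed directly, for example `audioldm2 --device mps -t "..."`.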
usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
[--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
[-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
[--seed SEED] [--mode {generation, sr_inpainting}] [-f FILE_PATH]
optional arguments:
-h, --help show this help message and exit
--mode {generation,sr_inpainting}
generation: text-to-audio generation; sr_inpainting: super resolution inpainting
-t TEXT, --text TEXT Text prompt to the model for audio generation
-f FILE_PATH, --file_path FILE_PATH
(--mode sr_inpainting): Original audio file for inpainting; Or
(--mode generation): the guidance audio file for generating similar audio, DEFAULT None
--transcription TRANSCRIPTION
Transcription used for speech synthesis
-tl TEXT_LIST, --text_list TEXT_LIST
A file that contains text prompt to the model for audio generation
-s SAVE_PATH, --save_path SAVE_PATH
The path to save model output
--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
The checkpoint you gonna use
-d DEVICE, --device DEVICE
The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
-b BATCHSIZE, --batchsize BATCHSIZE
Generate how many samples at the same time
--ddim_steps DDIM_STEPS
The sampling step for DDIM
-dur DURATION, --duration DURATION
The duration of the samples
-gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
Guidance scale (Large => better quality and relevancy to text; Small => better diversity)
-n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with heavier computation
--seed SEED Changing this value (any integer number) will lead to a different generation result.
AudioLDM 2 is available in the Hugging Face 🧨 Diffusers library from v0.21.0 onwards. The official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
The Diffusers version of the code runs upwards of 3x faster than the native AudioLDM 2 implementation, and supports generating audio of arbitrary length.
To install 🧨 Diffusers and 🤗 Transformers, run:
pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate
You can then load pre-trained weights into the AudioLDM2 pipeline, and generate text-conditional audio outputs by providing a text prompt:
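from diffusers import AudioLDM2Pipeline
import torch
import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile.write resolves
# Load the pre-trained pipeline in half precision and move it to the GPU
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# Generate 10 seconds of audio from the text prompt and keep the first waveform
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
# Save the waveform as a 16 kHz WAV file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)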
Tips for obtaining high-quality generations can be found under the AudioLDM 2 docs, including the use of prompt engineering and negative prompting.
Tips for optimising inference speed can be found in the blog post AudioLDM 2, but faster ⚡️.
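The negative prompting mentioned above can be sketched with the same pipeline. The example below uses the `negative_prompt`, `num_waveforms_per_prompt`, and `generator` arguments of `AudioLDM2Pipeline`; according to the Diffusers documentation, when more than one waveform is requested the outputs are ranked against the text prompt, so the first one is the best match. The helper `generation_kwargs`, the negative prompt text, and the seed are illustrative assumptions.

```python
def generation_kwargs(prompt, negative_prompt="Low quality.", n_candidates=3):
    """Collect the pipeline call arguments in one place (hypothetical helper)."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,        # what the audio should NOT sound like
        "num_inference_steps": 200,
        "audio_length_in_s": 10.0,
        "num_waveforms_per_prompt": n_candidates,  # generate several candidates per prompt
    }


def main():
    # Heavy imports live here so the helper above works without a GPU setup.
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDM2Pipeline

    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for repeatability
    kwargs = generation_kwargs("Techno music with a strong, upbeat tempo and high melodic riffs.")
    audios = pipe(generator=generator, **kwargs).audios
    # Save the first (top-ranked) candidate as a 16 kHz WAV file.
    scipy.io.wavfile.write("techno.wav", rate=16000, data=audios[0])


print(generation_kwargs("Techno music with a strong, upbeat tempo and high melodic riffs."))
# On a machine with a CUDA GPU and diffusers installed, call main() to generate audio.
```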
If you find this tool useful, please consider citing:
@article{audioldm2-2024taslp,
author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining},
year={2024},
volume={32},
pages={2871-2883},
doi={10.1109/TASLP.2024.3399607}
}
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023},
pages={21450-21474}
}