<a href="https://colab.research.google.com/github/karen-pal/notebook/blob/main/AudioLDM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The styles of AudioLDM 🎺

AudioLDM was proposed in the paper [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://audioldm.github.io) by Haohe Liu et al. Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.

In this Colab, we showcase how to use AudioLDM with the Hugging Face [🧨 Diffusers library](https://github.com/huggingface/diffusers), covering concepts such as random seeds, prompt engineering and negative prompts, in order to generate audio outputs in a range of different styles and characteristics 🎨

## Set-up environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `GPU`. We can verify that we’ve been assigned a GPU and view its specifications through the `nvidia-smi` command:

In [None]:
!nvidia-smi

Thu Jun  1 14:17:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.

Next, we can install the required Python packages, namely 🧨 Diffusers and 🤗 Transformers for running the AudioLDM and CLAP models respectively, and scipy to save our generated audio samples:

In [None]:
!pip install --quiet --upgrade diffusers transformers

### Loading the pipeline

The AudioLDM model consists of four stages:
1. CLAP text encoder that maps a text input to a text embedding (CLAP is trained such that this text embedding lies in a space shared with the embedding of the corresponding audio sample)
2. Latent diffusion model (LDM) that performs the de-noising routine to recover the audio latent
3. VAE decoder to map from the LDM latents to a log-mel spectrogram representation
4. Vocoder model to generate the audio waveform from the generated spectrogram

These four stages are depicted diagrammatically below, taken from Figure 1 of the [AudioLDM paper](https://arxiv.org/abs/2301.12503):

<p align="center">
  <img src="https://github.com/sanchit-gandhi/notebooks/blob/main/audioldm.jpg?raw=true" width="600"/>
</p>

The [`AudioLDMPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm) is an end-to-end inference pipeline that wraps these four stages into a single class, enabling you to generate audio samples from text in just a few lines of code. 
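
As a rough sketch of how the four stages map onto the pipeline object, the snippet below lists each stage next to the attribute we'd expect to find it under. The attribute names (`text_encoder`, `unet`, `vae`, `vocoder`) are our reading of the Diffusers pipeline components and are worth double-checking against the docs. The `describe_stages` helper and the placeholder object are hypothetical, and only there so that the example runs without downloading any weights:

```python
from types import SimpleNamespace

# Stage description -> attribute name assumed to exist on the loaded pipeline.
STAGES = [
    ("CLAP text encoder", "text_encoder"),
    ("Latent diffusion model (UNet)", "unet"),
    ("VAE decoder", "vae"),
    ("Vocoder", "vocoder"),
]


def describe_stages(pipe):
    """Return (stage, class name) pairs for each stage found on `pipe`."""
    return [(stage, type(getattr(pipe, attr)).__name__) for stage, attr in STAGES]


class Placeholder:
    """Stand-in for a real model component (hypothetical, for illustration only)."""


# Placeholder pipeline so the sketch runs without the real weights.
demo_pipe = SimpleNamespace(
    text_encoder=Placeholder(),
    unet=Placeholder(),
    vae=Placeholder(),
    vocoder=Placeholder(),
)

for stage, class_name in describe_stages(demo_pipe):
    print(f"{stage}: {class_name}")
```

Once the real pipeline has been loaded as `pipe`, calling `describe_stages(pipe)` should report the concrete model class behind each stage.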

There are four available AudioLDM checkpoints that vary in model size and the training scheme (i.e. number of steps and audio conditioning), summarised in the table below:

| Checkpoint                                                            | Training Steps | Audio conditioning | CLAP audio dim | UNet dim | Params |
|-----------------------------------------------------------------------|----------------|--------------------|----------------|----------|--------|
| [audioldm-s-full](https://huggingface.co/cvssp/audioldm)              | 1.5M           | No                 | 768            | 128      | 421M   |
| [audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) | > 1.5M         | No                 | 768            | 128      | 421M   |
| [audioldm-m-full](https://huggingface.co/cvssp/audioldm-m-full)       | 1.5M           | Yes                | 1024           | 192      | 652M   |
| [audioldm-l-full](https://huggingface.co/cvssp/audioldm-l-full)       | 1.5M           | No                 | 768            | 256      | 975M   |

For the purposes of this tutorial, we'll initialise the pipeline with the pre-trained weights from the v2 version of the smallest checkpoint ([audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2)). We'll also load the weights in half precision (float16) to speed up inference:

In [None]:
from diffusers import AudioLDMPipeline
import torch

model_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```
.


The pipeline can be moved to the GPU in much the same way as a standard PyTorch `nn.Module`:

In [None]:
pipe.to("cuda");

Great! We'll define a [Generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a [seed](https://huggingface.co/docs/diffusers/using-diffusers/reproducibility) for reproducibility. This will allow us to tweak our prompts and observe the effect that they have on the generations by fixing the starting latents in the LDM:

In [None]:
generator = torch.Generator("cuda").manual_seed(0)
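
To see why a seeded generator fixes the starting point, and why the cells below re-create the generator before each call, here is a minimal sketch of the idea. It uses Python's built-in `random.Random` as a stand-in for `torch.Generator` so that it runs anywhere; the `starting_noise` helper is hypothetical and simply draws a few Gaussian values, playing the role of the initial latents:

```python
import random


def starting_noise(rng, n=4):
    """Draw n Gaussian values from rng (a stand-in for the initial latents)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(0)
first = starting_noise(rng)
second = starting_noise(rng)  # same generator object: its state has advanced
fresh = starting_noise(random.Random(0))  # newly seeded generator

print("reusing the generator repeats the noise:", first == second)
print("re-seeding repeats the noise:", first == fresh)
```

Drawing from a generator advances its internal state, so only a freshly seeded generator reproduces the same starting noise. This is the reason we call `manual_seed(0)` again whenever we want to compare two prompts from an identical starting point.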

## Style 1 - Music

Let's get generating! First off, let's try generating a sample of techno music 🎶 We'll define a simple prompt and pass it to the pipeline alongside the generator:

In [None]:
prompt = "Techno music"

audio = pipe(prompt, generator=generator).audios[0]

  0%|          | 0/10 [00:00<?, ?it/s]

Let's take a listen using IPython's built-in audio playback class:

In [None]:
from IPython.display import Audio

Audio(audio, rate=16000)

Alright! It sounds _vaguely_ like techno music, but I wouldn't go so far as adding it to my playlist... Let's try using a more descriptive prompt, one that includes more adjectives and descriptive terms:

In [None]:
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

This already sounds a lot better - the generated sample follows our prompt with a stronger beat and melody over the top. We can also include a [negative prompt](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm#diffusers.AudioLDMPipeline.__call__.negative_prompt) to guide the diffusion process **away** from certain features, thus allowing us to specify what we don't want to have in our resulting audio sample. In practice, "low quality" and "average quality" seem to be very effective negative prompts that are worth including:

In [None]:
prompt = "Techno music with a strong upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality, vocals"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

The negative prompt has a quite noticeable effect on the generated audio quality. Removing low / average quality features through our negative prompt results in greater sharpness and audio clarity. It's already much closer to a real techno track!
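
To build some intuition for why this works, here is a minimal sketch of classifier-free guidance on plain Python lists. It assumes that the pipeline combines its two noise predictions in the usual way, with the negative prompt taking the place of the empty unconditional prompt, and that the default guidance scale is 2.5. The `guided_noise` function and the toy numbers are purely illustrative and are not the library's code:

```python
def guided_noise(noise_negative, noise_prompt, guidance_scale=2.5):
    """Move each prediction away from the negative prompt, towards the prompt."""
    return [
        neg + guidance_scale * (pos - neg)
        for neg, pos in zip(noise_negative, noise_prompt)
    ]


# Toy per-element noise predictions (made-up numbers).
noise_negative = [0.0, 1.0]
noise_prompt = [1.0, 1.0]

print(guided_noise(noise_negative, noise_prompt))       # [2.5, 1.0]
print(guided_noise(noise_negative, noise_prompt, 1.0))  # [1.0, 1.0]
```

Where the two predictions agree (the second element), the result is unchanged. Where they differ (the first element), the result is pushed past the prompt prediction and further away from the negative one, and a larger guidance scale pushes harder.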

So far, we've been generating short audio samples, in the range of 5 seconds. We can control the length of the generated audio with the argument [`audio_length_in_s`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm#diffusers.AudioLDMPipeline.__call__.audio_length_in_s). Let's try generating a 15-second audio sample using the same prompt and negative prompt as before:

In [None]:
generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=15, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)
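
As a quick sanity check on the requested length, we can convert between seconds and samples. This sketch assumes the 16 kHz sampling rate used for playback throughout this notebook, and that the returned waveform is trimmed to the requested duration. Both helper names are ours:

```python
SAMPLE_RATE = 16000  # Hz, the rate passed to Audio(...) in this notebook


def expected_samples(audio_length_in_s, rate=SAMPLE_RATE):
    """Number of samples we'd expect for a clip of the given length."""
    return int(audio_length_in_s * rate)


def duration_in_s(num_samples, rate=SAMPLE_RATE):
    """Length in seconds of a waveform with num_samples samples."""
    return num_samples / rate


print(expected_samples(15))   # 240000
print(duration_in_s(240000))  # 15.0
```

With the generated waveform in scope, `duration_in_s(len(audio))` should come out close to the `audio_length_in_s` value that was requested.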

This sounds okay, but is not my favourite generation. We can generate a different sample by switching the seed to start from a different random latent:

In [None]:
generator = torch.Generator("cuda").manual_seed(1)
audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=15, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)
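
To keep a generation we like, the waveform can be written to disk as a WAV file. The set-up section mentions scipy for this; the sketch below swaps in Python's standard-library `wave` module instead, so that it runs without extra dependencies. The `save_wav` helper is hypothetical, and a one-second 440 Hz tone stands in for the pipeline output:

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # Hz, the rate used for playback in this notebook


def save_wav(path, samples, rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] to path as 16-bit PCM."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(bytes(pcm))


# Stand-in for `audio`: one second of a 440 Hz sine tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
path = os.path.join(tempfile.gettempdir(), "audioldm_demo.wav")
save_wav(path, tone)
print("saved", path)
```

With scipy available, `scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)` should achieve the same result directly on the NumPy array returned by the pipeline.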

Great! We have all the tools we need for effective audio generation. There are also a number of pipeline arguments that you can experiment with to modify the de-noising process, such as the number of de-noising steps (`num_inference_steps`) and the guidance scale. You can read more about these in the `AudioLDMPipeline` [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm#diffusers.AudioLDMPipeline.__call__).
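
One simple way to explore these two settings is to sweep over a small grid and listen to each result. The sketch below only builds the grid of keyword arguments; the `settings_grid` helper and the particular values are our own choices, while `num_inference_steps` and `guidance_scale` are the argument names we understand the pipeline call to accept:

```python
from itertools import product


def settings_grid(steps=(10, 50), scales=(1.0, 2.5, 5.0)):
    """Every combination of step count and guidance scale, as call kwargs."""
    return [
        {"num_inference_steps": n, "guidance_scale": g}
        for n, g in product(steps, scales)
    ]


grid = settings_grid()
for kwargs in grid:
    print(kwargs)
    # With the notebook's objects in scope, each setting could be rendered as:
    #   generator = torch.Generator("cuda").manual_seed(0)
    #   audio = pipe(prompt, negative_prompt=negative_prompt,
    #                generator=generator, **kwargs).audios[0]
```

Re-seeding the generator for every setting keeps the starting latents fixed, so any difference we hear comes from the settings rather than from the random noise.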

Let's try generating different genres of music. The prompts for the remaining examples are taken from the [AudioLDM examples page](https://audioldm.github.io). They tend to be very descriptive and well-engineered, and provide good templates for writing effective prompts.

### 1.1 Dance Music

In [None]:
prompt = "Dance music with strong, upbeat tempo, and repetitive rhythms, include sub-genres like house, techno, EDM, trance, and many more."
negative_prompt = "low quality, average quality, vocals"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

### 1.2 Scary Music

In [None]:
prompt = "Scary music with dissonant harmonies, irregular rhythms, and unconventional use of instruments."
negative_prompt = "low quality, average quality, vocals"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

### 1.3 Pop Music

In [None]:
prompt = "Pop music that upbeat, catchy, and easy to listen, high fidelity, with simple melodies, electronic instruments and polished production."
negative_prompt = "low quality, average quality, vocals"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=10, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

### 1.4 Calming Music

In [None]:
prompt = "This is a piece that would be suitable as calming study music or music for sleeping. It features a relaxing and soothing motif on the piano, being backed by a distant, high pitched and sustained violin."
negative_prompt = "low quality, average quality"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=10, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

## Style 2 - Sound Effects

AudioLDM is not restricted to music - it can generate sound effects and also speech-like audio, complete with textual control over the generated samples (e.g. acoustic environment control). In the remainder of the Colab, we explore different sound and speech effects.

In [None]:
prompt = "The crashing of waves against the shore, high fidelity, the sound of seagulls and other coastal birds."
negative_prompt = "Low quality, average quality"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

In [None]:
prompt = "Radio emissions from stars, planets, galaxies and other celestial bodies, high fidelity, as well as the sounds of solar winds and cosmic rays."
negative_prompt = "Low quality, average quality"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

## Style 3 - Speech

In [None]:
prompt = "A man is speaking under water."
negative_prompt = "Low quality, average quality"

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator).audios[0]

Audio(audio, rate=16000, autoplay=True)

### 3.1 - Speech with Acoustic Environment Control

In [None]:
prompts = ["A man is speaking in a huge room.", "A man is speaking in a small room.", "A man is speaking in a studio."]
negative_prompt = ["Low quality, average quality"] * len(prompts)

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(prompts, negative_prompt=negative_prompt, num_inference_steps=10, audio_length_in_s=5, generator=generator, return_dict=True).audios

In [None]:
# huge room
Audio(audio[0], rate=16000)

In [None]:
# small room
Audio(audio[1], rate=16000)

In [None]:
# studio
Audio(audio[2], rate=16000)

We hear that the audio style gets progressively less reverberant as we go from a huge room to a small room and finally a studio, as we'd expect for real speech. For a comprehensive list of audio controls AudioLDM is capable of, refer to the [AudioLDM examples page](https://audioldm.github.io).
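
When comparing several acoustic environments like this, it can help to build the batch from a template so that the prompts differ only in the part we care about. The sketch below is one way to do that; the `build_batch` helper is hypothetical, and it mirrors the cell above in passing one negative prompt per prompt:

```python
def build_batch(template, environments, negative="Low quality, average quality"):
    """Return (prompts, negative_prompts) lists of equal length."""
    prompts = [template.format(env=env) for env in environments]
    negative_prompts = [negative] * len(prompts)
    return prompts, negative_prompts


environments = ["a huge room", "a small room", "a studio"]
prompts, negative_prompts = build_batch("A man is speaking in {env}.", environments)

for env, text in zip(environments, prompts):
    print(f"{env}: {text}")
```

The resulting `prompts` and `negative_prompts` lists can be passed to the pipeline exactly as in the batched cell above, and zipping `environments` with the returned `.audios` keeps each waveform paired with its label.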