<a href="https://colab.research.google.com/github/matteospanio/audiocraft-tutorial/blob/main/ai_music_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audiocraft tutorial

This notebook shows how to use the [audiocraft library](https://github.com/facebookresearch/audiocraft) by Facebook Research.

First, we need to install the dependencies:

In [1]:
!pip install audiocraft==1.1.0 torch==2.6.0

Collecting audiocraft==1.1.0
  Downloading audiocraft-1.1.0.tar.gz (610 kB)
  Preparing metadata (setup.py) ... done
Collecting av (from audiocraft==1.1.0)
  Downloading av-14.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting flashy>=0.0.1 (from audiocraft==1.1.0)
  Downloading flashy-0.0.2.tar.gz (72 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata

## MusicGen

We start by generating music with the MusicGen model. To do so, we define a custom function that loads the model.

> The model comes in several variants, so we define a helper function that lets us load the one we prefer.

Make sure to enable the GPU through the Colab menu *Runtime > Change runtime type* and select the T4 GPU model.
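
To confirm that the runtime actually exposes a GPU before loading anything, a quick check along these lines can help. This is a minimal sketch: `pick_device` is a hypothetical helper name, not part of audiocraft, and it falls back to `"cpu"` when PyTorch is not installed yet.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed (e.g. the install cell has not run yet).
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed as the `device` argument of the loader function defined in the next cell. Generation on the CPU works but is much slower.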

In [1]:
from typing import Literal
from audiocraft.models import MusicGen
from torch import Tensor


def get_model(
    model: str = "csc-unipd/tasty-musicgen-small",
    device: Literal["cpu", "cuda"] = "cuda",
) -> MusicGen:
    musicgen = MusicGen.get_pretrained(model, device=device)
    return musicgen

Then we can load the model simply by calling the function. By default, the loaded variant is a [fine-tuned version of the model](https://huggingface.co/csc-unipd/tasty-musicgen-small). For the names of the other available checkpoints, see the MusicGen [official documentation page](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md).
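
As an illustration, short aliases can be mapped to full checkpoint names before calling the loader. The sketch below is hypothetical: `resolve_checkpoint` and the alias table are not part of audiocraft, and the `facebook/musicgen-*` names are the ones listed in the MusicGen documentation at the time of writing, so check that page for the current list.

```python
from typing import Optional

# Aliases are an assumption of this sketch; the values are checkpoint names
# taken from the MusicGen documentation page.
MUSICGEN_CHECKPOINTS = {
    "small": "facebook/musicgen-small",
    "medium": "facebook/musicgen-medium",
    "large": "facebook/musicgen-large",
    "melody": "facebook/musicgen-melody",
}

# The fine-tuned variant used as the default in this notebook.
DEFAULT_CHECKPOINT = "csc-unipd/tasty-musicgen-small"


def resolve_checkpoint(name: Optional[str] = None) -> str:
    """Turn an alias such as "melody" into a full checkpoint name."""
    if name is None:
        return DEFAULT_CHECKPOINT
    if "/" in name:
        # Already looks like a full Hugging Face repository id.
        return name
    try:
        return MUSICGEN_CHECKPOINTS[name]
    except KeyError:
        known = ", ".join(sorted(MUSICGEN_CHECKPOINTS))
        raise ValueError(f"Unknown alias {name!r}; expected one of: {known}")


print(resolve_checkpoint("melody"))
```

The resolved string can then be passed to `get_model`, for example `get_model(resolve_checkpoint("melody"))`. According to the documentation, the melody checkpoints are the ones intended for text plus melody conditioning.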

In [None]:
model = get_model()

Once the model is loaded, we can define further helper functions to take advantage of everything it can do. The model can be used to:

1. generate unconditional music
2. condition generation through text
3. condition generation with text and audio
4. continue a given audio

The following code defines a function for each of these tasks. We also expose the duration and temperature parameters. The model interface accepts more parameters than these; a comprehensive description is available on the official MusicGen documentation page.
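
To build some intuition for the temperature parameter, the sketch below applies the standard temperature-scaled softmax to a toy set of scores. It is an illustration in plain Python, not audiocraft's sampling code, and the logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, rescaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)

for label, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(label, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring token, which tends to give more predictable output, while a temperature above 1 flattens the distribution and gives more varied output.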

In [None]:
def make_random_audio(
    synthesiser: MusicGen,
    duration: float = 30.0,
    temperature: float = 1.0,
) -> Tensor:
    """
    Generate random audio using the synthesiser.
    """
    synthesiser.set_generation_params(
        duration=duration,
        extend_stride=0.5,
        temperature=temperature,
    )

    # Generate audio
    music = synthesiser.generate_unconditional(1, progress=True)
    return music


def make_audio_from_text(
    synthesiser: MusicGen,
    prompt: str,
    duration: float = 30.0,
    temperature: float = 1.0,
) -> Tensor:
    """
    Generate audio from a text prompt using the synthesiser.
    """
    synthesiser.set_generation_params(
        duration=duration,
        extend_stride=0.5,
        temperature=temperature,
    )

    # Generate audio
    music = synthesiser.generate([prompt], progress=True)
    return music


def make_audio_from_given_melody_and_text(
    synthesiser: MusicGen,
    prompt: str,
    audio: Tensor,
    audio_sample_rate: int,
    duration: float = 30.0,
    temperature: float = 1.0,
) -> Tensor:
    """
    Generate audio conditioned on a text prompt and a reference melody.
    """
    synthesiser.set_generation_params(
        duration=duration,
        extend_stride=0.5,
        temperature=temperature,
    )

    # Generate audio
    music = synthesiser.generate_with_chroma(
        melody_wavs=audio,
        melody_sample_rate=audio_sample_rate,
        descriptions=[prompt],
        progress=True,
    )
    return music


def continue_audio(
    synthesiser: MusicGen,
    prompt: str | None,
    audio: Tensor,
    audio_sample_rate: int,
    duration: float = 30.0,
    temperature: float = 1.0,
) -> Tensor:
    """
    Continue a given audio clip, optionally guided by a text prompt.
    """
    synthesiser.set_generation_params(
        duration=duration,
        extend_stride=0.5,
        temperature=temperature,
    )

    # Generate audio
    music = synthesiser.generate_continuation(
        prompt=audio,
        prompt_sample_rate=audio_sample_rate,
        descriptions=[prompt] if prompt else None,
        progress=True,
    )
    return music

To generate audio, we just need to call one of these functions:

In [16]:
# generate random audio
audio = make_random_audio(model)



The model returns a tensor of shape (batch, channels, data), written [B, C, D]. Since we used a mono variant, we get a single channel, and the batch size is 1 because we asked for a single generation:

In [17]:
print(audio.shape)

torch.Size([1, 1, 960000])
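
The last dimension follows directly from the generation settings: the number of samples is the duration in seconds multiplied by the sample rate. A small sketch of that arithmetic, assuming the 32000 Hz rate used in this notebook (both function names are hypothetical):

```python
def expected_num_samples(duration_s: float, sample_rate: int) -> int:
    """Number of audio samples needed for a clip of the given duration."""
    return int(round(duration_s * sample_rate))


def duration_from_samples(num_samples: int, sample_rate: int) -> float:
    """Duration in seconds of a clip with the given number of samples."""
    return num_samples / sample_rate


print(expected_num_samples(30.0, 32000))    # 960000
print(duration_from_samples(960000, 32000))  # 30.0
```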


With a small amount of reshaping, we can drop the batch dimension and move the tensor to the CPU so that we can listen to the result.

In [15]:
from IPython.display import Audio, display

display(Audio(data=audio.cpu().squeeze(0).numpy(), rate=32000))

In [18]:
# Let's define a function to listen to a generated audio clip
def create_player(audio: Tensor, sr: int = 32000) -> None:
    y = audio.cpu().squeeze(0).numpy()
    display(Audio(data=y, rate=sr))

Now let's try to give a text condition:

In [19]:
text_conditioned_audio = make_audio_from_text(model, prompt="A piano sweet melody")



In [20]:
create_player(text_conditioned_audio)
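
To keep a result after the session ends, the samples can be written to a WAV file. The sketch below uses only the Python standard library and a synthetic sine tone as a stand-in for generated audio; `write_mono_wav` is a hypothetical helper, not an audiocraft function.

```python
import math
import os
import struct
import tempfile
import wave


def write_mono_wav(path: str, samples, sample_rate: int = 32000) -> None:
    """Write float samples in [-1, 1] to a 16-bit mono PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz sine at half amplitude.
sample_rate = 32000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_mono_wav(out_path, tone, sample_rate)
print(f"Wrote {out_path}")
```

For a tensor produced by the model, the samples of the first mono clip could be obtained with `text_conditioned_audio[0, 0].cpu().tolist()`. The audiocraft package also ships its own `audio_write` helper in `audiocraft.data.audio`, which is likely the more convenient choice inside this notebook.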

## AudioGen

AudioGen is even simpler, as it only generates sounds from text.

In this case we can skip the wrapper functions and use the raw audiocraft API directly:

In [21]:
from audiocraft.models import AudioGen

audiogen = AudioGen.get_pretrained('facebook/audiogen-medium')


In [23]:
audiogen.set_generation_params(duration=5)
bark = audiogen.generate(["dog barking"])

In [24]:
create_player(bark, audiogen.sample_rate)