## Install the necessary libraries:

1. stable-audio-tools - The code package for using Stable Audio Open that includes functions for generating audio from text prompts.
2. torch - Stable Audio Open relies on PyTorch for its core functionalities.
3. torchaudio - Official PyTorch library specifically designed for working with audio data.
4. einops - It offers functionalities for manipulating the tensor shapes used in models.

In [None]:
!pip install stable-audio-tools torch torchaudio einops

Collecting stable-audio-tools
  Downloading stable_audio_tools-0.0.16-py3-none-any.whl (121 kB)
Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Collecting aeiou==0.0.20 (from stable-audio-tools)
  Downloading aeiou-0.0.20-py3-none-any.whl (42 kB)
Collecting alias-free-torch==0.0.6 (from stable-audio-tools)
  Downloading alias_free_torch-0.0.6-py3-none-any.whl (9.7 kB)
Collecting auraloss==0.4.0 (from stable-audio-tools)
  Downloading auraloss-0.4.0-py3-none-any.whl (16 kB)
Collecting descript-audio-codec==1.0.0 (from stable-audio-tools)
  Downloading descript_audio_codec-1.0.0-py3-none-any.whl (26
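
Once the install finishes, it can help to confirm which versions were actually installed, since the log above pins several dependencies. The following is a minimal sketch using only the standard library; the helper name installed_versions and the REQUIRED list are introduced here for illustration and are not part of the original notebook.

```python
from importlib import metadata

# Distribution names passed to pip in the install cell.
REQUIRED = ["stable-audio-tools", "torch", "torchaudio", "einops"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

A None entry means the package was not found in the current environment and the install cell should be run again.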

## Import the libraries:

1. einops’s rearrange function reshapes and reorders tensor dimensions.
2. stable_audio_tools’s get_pretrained_model function downloads the pretrained Stable Audio Open model together with its configuration.
3. stable_audio_tools.inference.generation’s generate_diffusion_cond function generates audio from a prompt and other conditioning values.

torch and torchaudio are imported as well; torchaudio is used later to save the result. The code also checks whether a CUDA-enabled GPU is available: device is set to “cuda” when one is found, so that generation runs faster on the GPU, and to “cpu” otherwise.


In [None]:
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

## Download the pre-trained model and extract the necessary configuration details from the model configuration dictionary:

sample_rate is the number of audio samples per second, whereas sample_size is the total number of samples in one generated clip. Dividing sample_size by sample_rate therefore gives the clip length in seconds. The model is then moved to the selected device.


In [None]:
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)

model_config.json:   0%|          | 0.00/4.17k [00:00<?, ?B/s]

No module named 'flash_attn'
flash_attn not installed, disabling Flash Attention


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/4.85G [00:00<?, ?B/s]
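
The relationship between sample_rate and sample_size can be checked with a little arithmetic. The sketch below assumes a sample rate of 44100 Hz and a sample size of 2097152 for this checkpoint; these numbers are assumptions for illustration, and the authoritative values are the ones read from model_config. The helper name clip_duration_seconds is hypothetical.

```python
def clip_duration_seconds(sample_size: int, sample_rate: int) -> float:
    """Length of a clip in seconds, given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return sample_size / sample_rate


# Assumed values for illustration; use model_config["sample_rate"] and
# model_config["sample_size"] in the notebook itself.
example_sample_rate = 44100
example_sample_size = 2097152

duration = clip_duration_seconds(example_sample_size, example_sample_rate)
print(f"{duration:.2f} seconds")
```

Under these assumed values the clip is roughly 47.55 seconds long, which is the most audio a single call can produce regardless of the requested duration.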

## Model Conditioning through a dictionary of conditioning parameters:

The conditioning dictionary asks for a 30-second audio sample (seconds_total) that sounds like a combination of various Indian classical music instruments (prompt). Because seconds_start is set to 0, the sample starts at the beginning of the clip.


In [None]:
conditioning = [{
    "prompt": "Indian Classical Music Instruments",
    "seconds_start": 0,
    "seconds_total": 30
}]
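
When trying several prompts or durations, a small helper keeps the conditioning list consistent. This is a sketch only: make_conditioning is a hypothetical helper, and the batch_size handling assumes the model expects one dictionary per batch item, which the original notebook does not confirm.

```python
def make_conditioning(prompt, seconds_total, seconds_start=0, batch_size=1):
    """Build a conditioning list with one dictionary per batch item."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if seconds_total <= 0:
        raise ValueError("seconds_total must be positive")
    if not 0 <= seconds_start < seconds_total:
        raise ValueError("seconds_start must lie in [0, seconds_total)")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Create a fresh dictionary per item so later edits do not leak across items.
    return [
        {
            "prompt": prompt,
            "seconds_start": seconds_start,
            "seconds_total": seconds_total,
        }
        for _ in range(batch_size)
    ]


example_conditioning = make_conditioning("Indian Classical Music Instruments", 30)
print(example_conditioning)
```

With the default arguments this reproduces the same single-item list used in the cell above.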

## Model Execution for generating audio sample:

1. steps is the number of iterations the diffusion sampler runs; more steps take longer.
2. cfg_scale is the classifier-free guidance scale; higher values make the output follow the prompt more closely.
3. sample_size is the length of the audio to generate, in samples.
4. sigma_min and sigma_max are the minimum and maximum noise levels used during sampling.
5. sampler_type selects the sampler used for the diffusion process, here “dpmpp-3m-sde”.
6. device specifies where generation runs, “cuda” or “cpu”.
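
To make the role of cfg_scale concrete, the sketch below shows the usual classifier-free guidance formula applied to plain Python lists. It illustrates the general technique under the assumption that the library combines its conditional and unconditional predictions in this standard way; apply_cfg is a hypothetical name, not a function from stable-audio-tools.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Combine unconditional and conditional predictions element by element.

    A scale of 0 returns the unconditional prediction, a scale of 1 returns the
    conditional one, and larger scales push further in the prompt's direction.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


unconditional = [0.0, 1.0]
conditional = [1.0, 1.0]
print(apply_cfg(unconditional, conditional, 7))
```

Where the two predictions agree the scale has no effect; where they differ, the difference is amplified by cfg_scale.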


In [None]:
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

3216528834


  0%|          | 0/100 [00:00<?, ?it/s]
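
As a rough picture of what sigma_min, sigma_max, and steps control, the sketch below builds a noise schedule that decreases from sigma_max to sigma_min in evenly spaced logarithmic steps. This is an illustration only: whether the “dpmpp-3m-sde” sampler uses exactly this spacing is an assumption, and log_spaced_sigmas is a hypothetical helper.

```python
import math


def log_spaced_sigmas(steps, sigma_min, sigma_max):
    """Noise levels from sigma_max down to sigma_min, evenly spaced in log space."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if not 0 < sigma_min < sigma_max:
        raise ValueError("require 0 < sigma_min < sigma_max")
    log_max = math.log(sigma_max)
    log_min = math.log(sigma_min)
    return [
        math.exp(log_max + (log_min - log_max) * i / (steps - 1))
        for i in range(steps)
    ]


sigmas = log_spaced_sigmas(steps=100, sigma_min=0.3, sigma_max=500)
print(f"first: {sigmas[0]:.3f}, last: {sigmas[-1]:.3f}, count: {len(sigmas)}")
```

Sampling starts at the highest noise level and removes noise step by step until it reaches the lowest one.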

## Tensor shape manipulation:

The rearrange function from the einops library reorders tensor dimensions according to a pattern string. In the pattern, b is the batch dimension, d is the channel dimension of the audio data, and n is the number of samples.

The pattern “b d n -> d (b n)” moves the channels to the front and merges the batch and sample dimensions, so every clip in the batch is concatenated into one long sequence per channel.


In [None]:
output = rearrange(output, "b d n -> d (b n)")
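
The effect of the pattern is easier to see on a tiny array. The sketch below reproduces the same reordering with NumPy instead of einops, purely for illustration; merge_batch_into_time is a hypothetical helper and the small integer array stands in for real audio.

```python
import numpy as np


def merge_batch_into_time(x):
    """NumPy equivalent of rearrange(x, "b d n -> d (b n)")."""
    b, d, n = x.shape
    # Put channels first, then flatten (batch, samples) into a single axis.
    return x.transpose(1, 0, 2).reshape(d, b * n)


# Two clips (b=2), two channels (d=2), three samples each (n=3).
batch = np.arange(2 * 2 * 3).reshape(2, 2, 3)
merged = merge_batch_into_time(batch)
print(merged.shape)
print(merged)
```

Each row of the result is one channel, holding the first clip's samples followed by the second clip's samples. With a batch of one, as in this notebook, the operation simply drops the batch dimension.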

## Process and save the generated audio:

1. output.to(torch.float32) converts the output tensor to 32-bit floating point format.
2. .div(torch.max(torch.abs(output))) normalises the audio by dividing by its peak absolute value, so the loudest sample becomes 1 or -1.
3. .clamp(-1, 1) clips the audio values to the range -1 to 1.
4. .mul(32767) scales the audio values to the range of 16-bit signed integers.
5. .to(torch.int16) converts the audio to 16-bit signed integer format.
6. .cpu() moves the audio tensor to the CPU.
7. torchaudio.save(“output.wav”, output, sample_rate) saves the generated audio to a WAV file named “output.wav”.


In [None]:
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
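
The same normalise, clip, and scale steps can be sketched in NumPy to show the arithmetic on a few values. This is an illustration rather than a replacement for the cell above; to_int16_pcm is a hypothetical helper, and the guard against an all-zero signal is an addition that the original one-liner does not have.

```python
import numpy as np


def to_int16_pcm(audio):
    """Peak-normalise float audio and convert it to 16-bit signed integers."""
    audio = np.asarray(audio, dtype=np.float32)
    peak = np.max(np.abs(audio))
    # Skip the division for silent audio to avoid dividing by zero.
    if peak > 0:
        audio = audio / peak
    scaled = np.clip(audio, -1.0, 1.0) * 32767
    # The integer cast truncates toward zero.
    return scaled.astype(np.int16)


example = to_int16_pcm([[0.0, 0.5, -0.25]])
print(example)
```

Here the peak is 0.5, so 0.5 maps to the maximum value 32767 and -0.25 maps to about half of the negative range.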

The generated output.wav can be played in any music player. It contains the audio sample generated from the prompt “Indian Classical Music Instruments”.
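
Besides listening to the file, its header can be inspected to confirm the channel count, sample rate, and duration. The sketch below uses Python's standard wave module and assumes the file was written as 16-bit PCM, as the int16 conversion above suggests; wav_summary is a hypothetical helper.

```python
import wave


def wav_summary(path):
    """Read basic properties from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


# Usage in the notebook, after the file has been saved:
# print(wav_summary("output.wav"))
```

The reported sample rate should match the sample_rate read from model_config earlier.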
