# Chapter 2: Audio Applications with Pipelines

**Following the [Hugging Face Audio Course](https://huggingface.co/learn/audio-course/), but using soundscapes and sound libraries instead of speech wherever possible.**

The HF course shows three tasks using `pipeline()`: audio classification, automatic speech recognition (ASR), and audio generation.
I adapt all three to my focus on **environmental sounds and soundscapes**:

| HF Course | My version |
|---|---|
| Audio Classification on MINDS-14 (speech intent) | Audio Classification on **ESC-50** (environmental sounds) |
| ASR with Whisper (speech to text) | **Audio Captioning** (sound to description) |
| TTS with Bark + Music with MusicGen | **Soundscape generation** with MusicGen |

The key idea of this chapter: **you can do a lot with pre-trained models and zero training**. The `pipeline()` function handles all preprocessing for you.

Let's go.

## Setup

In [1]:
!pip install -q transformers datasets librosa soundfile torch accelerate

In [2]:
import librosa
import numpy as np
import torch
from IPython.display import Audio, display

In [4]:
# Upload your files (Colab)
from google.colab import files
uploaded = files.upload()

THUNDER_FILE = "thunder.wav"
CHIMES_FILE = "chimes.wav"

SR = 16000
thunder, _ = librosa.load(THUNDER_FILE, sr=SR, mono=True)
chimes, _ = librosa.load(CHIMES_FILE, sr=SR, mono=True)

print(f"Thunder: {len(thunder)/SR:.1f}s")
print(f"Chimes:  {len(chimes)/SR:.1f}s")

Saving thunder.wav to thunder.wav
Saving chimes.wav to chimes.wav
Thunder: 27.3s
Chimes:  53.5s


---

## Part 1: Audio Classification

Audio classification = give the model an audio clip, get back a label (or a ranked list of labels with scores).

The HF course uses a model fine-tuned on MINDS-14 for intent classification ("pay_bill", "freeze", etc.).
We use a model trained on environmental sounds instead.
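
To make the "ranked list of labels with scores" concrete, here is a minimal pure-Python sketch of how raw model scores (logits) can be turned into such a list. The label names and logit values are made up for illustration, and `softmax` and `top_k` are hypothetical helpers written for this note, not library functions.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, logits, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and logits, for illustration only
labels = ["Rain", "Thunder", "Dog", "Wind chime"]
logits = [2.0, 1.0, -1.0, 0.5]

for label, prob in top_k(labels, logits, k=3):
    print(f"{label:<12s} {prob:.3f}")
```

A real classifier does the same thing over hundreds of classes, which is why the scores of closely related labels often split the probability mass between them.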

### 1a. Classify ESC-50 examples

The **Audio Spectrogram Transformer (AST)** by MIT was trained on AudioSet, which includes environmental sounds. Let's try it.

In [5]:
from transformers import pipeline

# AST model trained on AudioSet (527 sound classes)
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

In [6]:
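# Load ESC-50 dataset
# Import the datasets Audio feature under an alias so it does not shadow
# IPython.display.Audio, which is used for playback in later cells.
from datasets import load_dataset, Audio as DatasetAudio

dataset = load_dataset("ashraq/esc50", split="train")
dataset = dataset.cast_column("audio", DatasetAudio(sampling_rate=16000))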

In [7]:
# Try classification on a few ESC-50 examples from different categories
test_categories = ["rain", "thunderstorm", "church_bells", "sea_waves", "dog"]

for cat in test_categories:
    example = dataset.filter(lambda x: x["category"] == cat)[0]
    audio_array = np.array(example["audio"]["array"], dtype=np.float32)

    result = classifier(audio_array)
    top3 = result[:3]

    print(f"\nTrue label: {cat}")
    for r in top3:
        print(f"  {r['label']:<30s} {r['score']:.3f}")

True label: rain
  Rain                           0.396
  Rain on surface                0.352
  Raindrop                       0.191

True label: thunderstorm
  Thunder                        0.548
  Thunderstorm                   0.254
  Rain                           0.109

True label: church_bells
  Church bell                    0.721
  Bell                           0.234
  Change ringing (campanology)   0.024

True label: sea_waves
  Waves, surf                    0.415
  Ocean                          0.310
  Wind                           0.058

True label: dog
  Bark                           0.192
  Animal                         0.184
  Dog                            0.108


The model was trained on AudioSet which has 527 classes, so the label names may not match ESC-50 exactly. But the predictions should be semantically close.
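
One way to check "semantically close" more systematically is a hand-written mapping from each ESC-50 category to the AudioSet labels you would accept as correct. The mapping below is my own assumption, built only from the predictions printed above, and `is_hit` is a hypothetical helper, not part of any library.

```python
# Hypothetical mapping: ESC-50 category -> AudioSet labels counted as a match.
# Built by hand from the predictions printed above; extend it for other categories.
ACCEPTED = {
    "rain": {"Rain", "Rain on surface", "Raindrop"},
    "thunderstorm": {"Thunder", "Thunderstorm"},
    "church_bells": {"Church bell", "Bell"},
    "sea_waves": {"Waves, surf", "Ocean"},
    "dog": {"Dog", "Bark"},
}

def is_hit(category, predictions, k=3):
    """True if any of the top-k predicted labels is accepted for this category.

    `predictions` follows the pipeline's output format: a list of
    {"label": ..., "score": ...} dicts sorted by descending score.
    """
    accepted = ACCEPTED.get(category, set())
    return any(p["label"] in accepted for p in predictions[:k])

# Top-3 output for the "dog" example above
dog_predictions = [
    {"label": "Bark", "score": 0.192},
    {"label": "Animal", "score": 0.184},
    {"label": "Dog", "score": 0.108},
]
print(is_hit("dog", dog_predictions))   # the top-3 contains "Bark"
print(is_hit("rain", dog_predictions))  # no rain-related label here
```

Counting hits over many examples per category would give a rough top-k accuracy without any training, at the cost of maintaining the mapping by hand.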

### 1b. Classify our own sounds

Now let's classify our thunder and chimes files. The model has never seen these specific recordings.

In [9]:
# Make sure `Audio` refers to the notebook player (IPython), not the datasets feature
from IPython.display import Audio

for name, signal in [("Thunder/Rain", thunder), ("Chimes", chimes)]:
    # Most models expect short clips, let's use the first 10 seconds
    clip = signal[:SR * 10]

    result = classifier(clip)
    top5 = result[:5]

    print(f"\n{name}:")
    # Pass the sampling rate by keyword: the second positional argument is a filename
    display(Audio(clip, rate=SR))
    for r in top5:
        print(f"  {r['label']:<30s} {r['score']:.3f}")


Thunder/Rain:

  Thunder                        0.446
  Thunderstorm                   0.424
  Rain                           0.118
  Rain on surface                0.005
  Raindrop                       0.004

Chimes:

  Wind chime                     0.561
  Chime                          0.390
  Tubular bells                  0.016
  Cowbell                        0.010
  Chink, clink                   0.003
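
The loop above only looks at the first 10 seconds of each file. To cover a whole recording, one option is to cut it into fixed-length windows and classify each window separately. The sketch below shows only the slicing step. `split_into_windows` is a hypothetical helper, and dropping a final window that is too short is a design choice I am assuming, not something the model requires.

```python
def split_into_windows(signal, sr, window_s=10.0, min_s=1.0):
    """Cut a 1-D signal into consecutive windows of `window_s` seconds.

    The last window is kept only if it is at least `min_s` seconds long.
    Works on lists and NumPy arrays alike, since it only uses slicing and len().
    """
    window = int(window_s * sr)
    minimum = int(min_s * sr)
    windows = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        if len(chunk) >= minimum:
            windows.append(chunk)
    return windows

# Toy example: 25.5 "seconds" of silence at a tiny sampling rate of 100 Hz
sr = 100
signal = [0.0] * 2550
chunks = split_into_windows(signal, sr)
print([len(c) / sr for c in chunks])
```

Each chunk could then be passed to `classifier(...)` in place of `clip`, and the per-window scores compared or averaged.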


### 1c. How does the pipeline work under the hood?

`pipeline()` is convenient, but it hides what's happening. Here's the manual version, step by step:

In [10]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification

model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"

# Step 1: Load the feature extractor (handles preprocessing)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# Step 2: Load the model
model = ASTForAudioClassification.from_pretrained(model_id)

# Step 3: Preprocess the audio
clip = thunder[:SR * 10]
inputs = feature_extractor(clip, sampling_rate=SR, return_tensors="pt")
print(f"Input shape: {inputs['input_values'].shape}")

# Step 4: Run inference
with torch.no_grad():
    logits = model(**inputs).logits

# Step 5: Get predictions
probs = torch.softmax(logits, dim=-1)
top5_idx = torch.topk(probs[0], 5).indices

print("\nManual classification (thunder):")
for idx in top5_idx:
    label = model.config.id2label[idx.item()]
    score = probs[0][idx].item()
    print(f"  {label:<30s} {score:.3f}")

Input shape: torch.Size([1, 1024, 128])

Manual classification (thunder):
  Thunder                        0.446
  Thunderstorm                   0.424
  Rain                           0.118
  Rain on surface                0.005
  Raindrop                       0.004


That's exactly what `pipeline()` does for you in one line. Good to understand, but use the pipeline in practice.
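
The printed input shape `(1, 1024, 128)` can be read as batch × time frames × mel bins. The sketch below reproduces the frame arithmetic under assumptions I have not checked against this exact checkpoint: a 25 ms analysis window, a 10 ms hop, and padding or truncation to 1024 frames. The helper names are hypothetical.

```python
def count_frames(n_samples, sr, window_ms=25.0, hop_ms=10.0):
    """Number of full analysis windows that fit in the signal (no edge padding)."""
    window = int(sr * window_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop

def pad_or_truncate(n_frames, max_frames=1024):
    """Return (frames kept, frames of padding added) for a fixed-length input."""
    kept = min(n_frames, max_frames)
    return kept, max_frames - kept

sr = 16000
frames = count_frames(10 * sr, sr)  # a 10-second clip
kept, padding = pad_or_truncate(frames)
print(frames, kept, padding)
```

Under these assumptions a 10-second clip fills almost the whole fixed-length input, which is consistent with slicing to 10 seconds before classification; anything much longer would be cut off.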

---

## Part 2: Audio Captioning (instead of ASR)

The HF course uses **Whisper for ASR** (speech to text). Since our sounds are not speech, we use a Whisper model fine-tuned for **audio captioning**: it generates a free-text description of what it hears.

The model (`MU-NLPC/whisper-tiny-audio-captioning`) is a standard Whisper encoder-decoder fine-tuned on audio captioning data. The authors provide a custom model class, but it breaks with recent `transformers` versions. That is not a blocker: the checkpoint uses the standard Whisper architecture, so its weights load into `WhisperForConditionalGeneration`, and we handle the caption style prefix ourselves.

This is a useful lesson: **when a custom class breaks, understand what it does and replicate it with standard tools.**

In [23]:
from transformers import WhisperForConditionalGeneration, WhisperTokenizer, WhisperFeatureExtractor

# The authors provide a custom WhisperForAudioCaptioning class, but it breaks
# with recent transformers versions. Instead, we load the weights into the
# standard WhisperForConditionalGeneration and handle the style prefix ourselves.
# The checkpoint has the standard Whisper layout; the custom class only changed
# the generate() method.

checkpoint = "MU-NLPC/whisper-tiny-audio-captioning"

captioning_model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
cap_feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)

captioning_model.eval()
print(f"Audio captioning model ready ({sum(p.numel() for p in captioning_model.parameters())/1e6:.0f}M params)")

Audio captioning model ready (38M params)


In [27]:
def caption_audio(audio_array, sr, style="clotho > caption: "):
    """Generate a text description of an audio clip."""
    # Resample to model's expected rate if needed
    if sr != cap_feature_extractor.sampling_rate:
        audio_array = librosa.resample(
            audio_array, orig_sr=sr, target_sr=cap_feature_extractor.sampling_rate
        )

    # Extract features (log-mel spectrogram)
    features = cap_feature_extractor(
        audio_array,
        sampling_rate=cap_feature_extractor.sampling_rate,
        return_tensors="pt"
    ).input_features

    # The model supports 3 caption styles:
    #   "clotho > caption: "    -> natural descriptions (default)
    #   "audiocaps > caption: " -> shorter captions
    #   "audioset > keywords: " -> keyword tags
    #
    # We encode the style prefix as token IDs, then feed them as
    # decoder_input_ids so the model continues from there.
    prefix_tokens = tokenizer(
        "", text_target=style, return_tensors="pt", add_special_tokens=False
    ).labels  # shape: (1, N)

    # Prepend the start-of-transcript token (<|startoftranscript|>)
    sot = torch.tensor([[tokenizer.convert_tokens_to_ids("<|startoftranscript|>")]])
    decoder_input_ids = torch.cat([sot, prefix_tokens], dim=-1)

    # Generate — we pass decoder_input_ids directly and disable the
    # default forced_decoder_ids so they don't conflict with our prefix.
    with torch.no_grad():
        outputs = captioning_model.generate(
            input_features=features.to(captioning_model.device),
            decoder_input_ids=decoder_input_ids.to(captioning_model.device),
            max_length=100,
            forced_decoder_ids=None,   # disable defaults
            suppress_tokens=None,      # don't suppress style tokens
        )

    caption = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    # Remove the style prefix from the output
    if ": " in caption:
        caption = caption.split(": ", 1)[1]
    return caption
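
The last lines of `caption_audio` strip everything up to the first `": "`. That would also cut into a caption that happens to contain a colon when the prefix is missing. A slightly stricter variant, sketched below, removes only the three known style prefixes. `strip_style_prefix` is a hypothetical helper, the prefix strings are the ones listed in the function above, and the sketch assumes the decoded text reproduces the prefix verbatim.

```python
# Style prefixes used by the captioning checkpoint, as listed in caption_audio
STYLE_PREFIXES = (
    "clotho > caption: ",
    "audiocaps > caption: ",
    "audioset > keywords: ",
)

def strip_style_prefix(text):
    """Remove a leading style prefix if present; otherwise return the text trimmed."""
    stripped = text.lstrip()
    for prefix in STYLE_PREFIXES:
        if stripped.startswith(prefix):
            return stripped[len(prefix):].strip()
    return stripped.strip()

print(strip_style_prefix("clotho > caption: Thunder rumbling in the distance"))
print(strip_style_prefix("Two sounds: rain and wind"))  # no prefix, colon is kept
```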

### 2a. Caption our own sounds

In [28]:
# Caption the first 30 seconds of each file (model limit)
from IPython.display import Audio  # the notebook player, not the datasets feature

for name, signal in [("Thunder/Rain", thunder), ("Chimes", chimes)]:
    clip = signal[:SR * 30]
    print(f"\n{name}:")
    # Pass the sampling rate by keyword: the second positional argument is a filename
    display(Audio(clip, rate=SR))

    # Try all 3 styles
    for style_name, style_prefix in [
        ("Natural caption", "clotho > caption: "),
        ("Short caption",   "audiocaps > caption: "),
        ("Keywords",        "audioset > keywords: "),
    ]:
        caption = caption_audio(clip, SR, style=style_prefix)
        print(f"  {style_name:.<20s} {caption}")


Thunder/Rain:

  Natural caption..... A man is walking down the street with thunder in his hands.
  Short caption....... Rain falling and th falling
  Keywords............ natural, fire, natural, natural, natural sounds,, thunderstorm, thunder

Chimes:

  Natural caption..... A person is tapping a glass glass in a room.
  Short caption....... A series of musical tones playing in a musical instrument
  Keywords............ onomatopoeia, jingle, alarm,


The three styles produce different outputs from the same audio:
- **Natural caption** (Clotho style): full sentences, but prone to hallucination ("a man walking down the street with thunder in his hands")
- **Short caption** (AudioCaps style): more concise, sometimes truncated
- **Keywords** (AudioSet style): tag-like output, often the most accurate for identification

This is the same encoder-decoder architecture as Whisper for ASR. The only difference is what it was trained to output: descriptions of sounds instead of transcriptions of speech.

This is the **tiny** model (about 38M parameters, as counted above). The larger variants (small, large-v2) produce significantly better captions but need more memory, which matters on the Colab free tier. Even so, the keyword output of the tiny model picks up the dominant sounds (thunderstorm/thunder and jingle); it just struggles to describe them in full sentences.

### 2b. Caption some ESC-50 examples

Let's see how well the captioning model describes sounds from ESC-50, where we know the true label.

In [29]:
test_categories = ["rain", "thunderstorm", "sea_waves", "church_bells", "dog"]

for cat in test_categories:
    example = dataset.filter(lambda x: x["category"] == cat)[0]
    audio_array = np.array(example["audio"]["array"], dtype=np.float32)

    caption = caption_audio(audio_array, 16000)
    print(f"  True: {cat:<20s} Caption: {caption}")

  True: rain                 Caption: A large volume of water splashes as it flows
  True: thunderstorm         Caption: Thunder rumbling in the distance
  True: sea_waves            Caption: A large volume of water splashes as it flows
  True: church_bells         Caption: A large bell rings loudly
  True: dog                  Caption: A person makes a small burst and a horn blows


### Comparing ASR vs Audio Captioning

| | ASR (Speech-to-Text) | Audio Captioning |
|---|---|---|
| **Input** | Speech audio | Any audio |
| **Output** | Exact transcription of words spoken | Free-text description of sounds |
| **Architecture** | Encoder-decoder (Whisper) | Same encoder-decoder (fine-tuned Whisper) |
| **Training data** | Speech + transcription pairs | Audio + description pairs |
| **Use case** | Subtitles, dictation, voice assistants | Accessibility, search, metadata generation |

Same architecture, different training data, completely different task. That's the power of transfer learning.

---

## Part 3: Audio Generation with MusicGen

The HF course shows two generation tasks:
- **Text-to-Speech** with Bark (text to spoken voice)
- **Music generation** with MusicGen (text to music)

Since we're focused on soundscapes, we'll use **MusicGen** to generate environmental audio from text descriptions. Let's see if we can generate something similar to our thunder/rain recording just from a text prompt.

In [30]:
music_pipe = pipeline(
    "text-to-audio",
    model="facebook/musicgen-small",
    device="cuda" if torch.cuda.is_available() else "cpu",
)
print("MusicGen loaded.")

MusicgenForConditionalGeneration LOAD REPORT from: facebook/musicgen-small
Key                                           | Status     |  | 
----------------------------------------------+------------+--+-
decoder.model.decoder.embed_positions.weights | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


MusicGen loaded.


In [44]:
# Generate soundscapes from text descriptions
# max_new_tokens controls the length (~256 tokens = ~5 seconds)

prompts = [
    "Soft rain falling in a tropical forest with distant thunder and crickets",
    "Gentle wind chimes ringing with different tones in a peaceful garden",
    "Ocean waves crashing on a rocky shore with seagulls",
]

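# Make sure `Audio` refers to the notebook player (IPython), not the datasets feature
from IPython.display import Audio

for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    output = music_pipe(prompt, forward_params={"max_new_tokens": 256})
    # Use the sampling rate reported by the pipeline instead of hard-coding it
    display(Audio(output["audio"], rate=output["sampling_rate"]))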


Prompt: Soft rain falling in a tropical forest with distant thunder and crickets



Prompt: Gentle wind chimes ringing with different tones in a peaceful garden



Prompt: Ocean waves crashing on a rocky shore with seagulls


MusicGen is primarily a music model, so it may interpret environmental prompts as "music inspired by" rather than literal soundscapes. The results are still interesting and show how text-to-audio generation works.
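
The `~256 tokens = ~5 seconds` rule of thumb from the generation cell corresponds to roughly 50 tokens per second of audio. The sketch below turns that into two small helpers for picking `max_new_tokens`. The constant 50 is my assumption derived from that rule of thumb, and the function names are hypothetical.

```python
import math

TOKENS_PER_SECOND = 50  # assumption: about 50 generated tokens per second of audio

def seconds_for_tokens(max_new_tokens):
    """Approximate duration of the audio produced by a given token budget."""
    return max_new_tokens / TOKENS_PER_SECOND

def tokens_for_seconds(seconds):
    """Smallest token budget that covers at least `seconds` of audio."""
    return math.ceil(seconds * TOKENS_PER_SECOND)

print(seconds_for_tokens(256))
print(seconds_for_tokens(512))
print(tokens_for_seconds(10))
```

This makes it easier to request, say, a 10-second soundscape to sit next to a 10-second excerpt of a real recording.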

### Compare: our real recording vs generated audio

In [46]:
print("Real thunder/rain recording (first 10s):")
# The recording was loaded at SR (16 kHz); playing it at 32 kHz would double its speed
display(Audio(thunder[:SR * 10], rate=SR))

print("\nGenerated from prompt 'Soft rain falling in a tropical forest with distant thunder':")
output = music_pipe(
    "Soft rain falling in a tropical forest with distant thunder",
    forward_params={"max_new_tokens": 512}
)
display(Audio(output["audio"], rate=32000))

Real thunder/rain recording (first 10s):



Generated from prompt 'Soft rain falling in a tropical forest with distant thunder':


### The full loop: Captioning + Generation

Here's something fun: we can create a pipeline that listens to a sound, describes it, then generates a new sound from that description. Audio -> Text -> Audio.

In [47]:
# Take our thunder recording
clip = thunder[:SR * 30]
print("Original audio:")
display(Audio(clip, rate=SR))  # clip is sampled at SR (16 kHz), not MusicGen's 32 kHz

# Step 1: Caption it
caption = caption_audio(clip, SR, style="audiocaps > caption: ")
print(f"\nGenerated caption: {caption}")

# Step 2: Use the caption to generate new audio
print(f"\nRegenerating from caption...")
output = music_pipe(caption, forward_params={"max_new_tokens": 512})
print("Generated audio:")
display(Audio(output["audio"], rate=32000))

Original audio:



Generated caption: Rain falling and th falling

Regenerating from caption...
Generated audio:


This is a fun demonstration but also shows the limitations: information is lost at each step. The caption doesn't capture everything in the original audio, and the generation model interprets the caption in its own way. Still, it's remarkable that this works at all with pre-trained models and zero custom training.
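
The round trip can be wrapped in one small function that takes the captioner and the generator as arguments, which makes it easy to swap models or test the plumbing without loading any. This is only a sketch: `round_trip` and the stand-in functions are hypothetical, and in the notebook you would pass `caption_audio` and a wrapper around `music_pipe` instead.

```python
def round_trip(audio, sr, captioner, generator):
    """Audio -> text -> audio. Returns the caption and the regenerated audio.

    captioner(audio, sr) must return a string;
    generator(text) must return whatever audio object the caller expects.
    """
    caption = captioner(audio, sr).strip()
    if not caption:
        raise ValueError("Captioner returned an empty description")
    return caption, generator(caption)

# Stand-ins so the sketch runs without downloading any model
def fake_captioner(audio, sr):
    return f" {len(audio) / sr:.1f} seconds of rain "

def fake_generator(text):
    return {"prompt": text, "audio": [0.0] * 4}

caption, generated = round_trip([0.0] * 32000, 16000, fake_captioner, fake_generator)
print(caption)
print(generated["prompt"])
```

With the objects defined in this notebook, a call might look like `round_trip(clip, SR, caption_audio, lambda text: music_pipe(text, forward_params={"max_new_tokens": 512}))`. Returning the caption alongside the audio keeps the lossy middle step visible.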

---

## Recap

In this chapter we used three `pipeline()` tasks (or equivalent) with **zero training**:

| Task | Model | What it does |
|---|---|---|
| Audio Classification | `MIT/ast-finetuned-audioset-10-10-0.4593` | Audio -> ranked labels (e.g. "Rain", "Wind chime") |
| Audio Captioning | `MU-NLPC/whisper-tiny-audio-captioning` | Audio -> free text description |
| Audio Generation | `facebook/musicgen-small` | Text -> audio |

**Key takeaways:**
- `pipeline()` abstracts away all preprocessing. You give it raw audio, it gives you results.
- Not all models have a `pipeline()` wrapper (like the captioning model). When they don't, you load the model, feature extractor, and tokenizer separately.
- Pre-trained models get you surprisingly far, but they have blind spots. Fine-tuning on your specific data (Chapter 4) can help close them.
- You can chain tasks together (caption -> generate) for creative applications.


📖 **Based on**: [HF Audio Course — Unit 2](https://huggingface.co/learn/audio-course/en/chapter2/introduction)