# Audio Classification

# Libraries

Balancing the versions of torch, torchaudio, and transformers can be tricky! Here are the versions used for this notebook:

## Library and Versions

In [2]:
import torch, transformers, torchaudio
# Sometimes we need a specific combination of library versions...
print(torch.__version__) # 2.3.0+cu121
print(torchaudio.__version__) # 2.3.0+cu121
print(transformers.__version__) # 4.41.2


2.3.0+cu121
2.3.0+cu121
4.41.2


In [3]:
# We perform the classification in 2 separate steps to show the capabilities/flexibility
# for other use cases
from transformers import AutoFeatureExtractor, ASTForAudioClassification

In [4]:
# Audio Spectrogram Transformer (AST) model fine-tuned on AudioSet
# The Audio Spectrogram Transformer is equivalent to ViT, but applied on audio.
# Audio is first turned into an image (as a spectrogram),
# after which a Vision Transformer is applied.
# The model gets state-of-the-art results on several audio classification benchmarks.
# https://arxiv.org/abs/2104.01778
# https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")


In [17]:
import librosa
audio_path = 'example.mp3'
y, sr = librosa.load(audio_path, sr=None)
print(sr) # 48000
# The sampling rate the model was trained on matters:
# many audio models, including this one, expect 16 kHz input.
# Audio files at a different rate must be resampled first;
# HF pipelines do that automatically.

48000


In [20]:
len(y) # 630031

630031

In [22]:
# Downsampling from 48 kHz to 16 kHz divides the number of samples by 3
len(y)/3 # ~210010

210010.33333333334
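
The arithmetic above can be wrapped in two small helpers. This is only a sketch: the function names are made up for illustration, and real resamplers may round the output length differently (for example, up instead of to the nearest integer).

```python
def duration_seconds(num_samples: int, sr: int) -> float:
    """Length of a clip in seconds, given its sample count and sampling rate."""
    return num_samples / sr


def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling (rounded to the nearest integer)."""
    return round(num_samples * target_sr / orig_sr)


n, sr = 630031, 48000  # values printed by the cells above
print(f"{duration_seconds(n, sr):.2f} s")  # 13.13 s
print(resampled_length(n, sr, 16000))      # 210010
```

So the example clip is roughly 13 seconds long, regardless of the sampling rate used to store it.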

## Sampling Rate Issues

Recall that most audio models are trained on audio sampled at 16 kHz. If you pass a `sampling_rate` that differs from the one the feature extractor expects, it raises an error:

In [None]:
# ERROR! sr is 48000, but the feature extractor expects 16000
# result = feature_extractor(y,sampling_rate=sr)
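
One way to avoid the error is to resample the waveform to 16 kHz before calling the feature extractor. The sketch below uses `scipy.signal.resample_poly` on a synthetic tone; SciPy and the test signal are assumptions for illustration and are not part of the original notebook. With librosa, passing `sr=16000` to `librosa.load` resamples at load time instead.

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 48000, 16000

# Synthetic stand-in for the real audio: 1 second of a 440 Hz tone at 48 kHz
t = np.arange(orig_sr) / orig_sr
y_48k = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

# 48000 / 16000 = 3, so we keep one output sample per three input samples.
# resample_poly applies a low-pass (anti-aliasing) filter before decimating.
y_16k = resample_poly(y_48k, up=1, down=orig_sr // target_sr)

print(len(y_48k), len(y_16k))
```

The resampled array could then be passed along with its true rate, e.g. `feature_extractor(y_16k, sampling_rate=16000, return_tensors="pt")`.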

In [12]:
# Get feature vector
# Caution: the feature extractor does NOT resample. If we don't pass sampling_rate,
# the check is skipped and the input is assumed to already be at the model's rate
# (16 kHz), so the 48 kHz waveform y is interpreted as if it were 16 kHz audio
# (hence the warning below). For correct features, resample y to 16 kHz first
# and pass sampling_rate=16000.
# If we don't set return_tensors="pt", we get a dict with a numpy array
# If we want to play around with the feature vectors, we can use the numpy arrays,
# but if we want to perform the classification, we need the PyTorch tensors: "pt"
# Important arguments:
# - num_mel_bins (int, *optional*, defaults to 128): Number of Mel-frequency bins.
# - max_length (int, *optional*, defaults to 1024): Max length to which to pad/truncate the extracted features.
result = feature_extractor(y,return_tensors="pt")

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [15]:
result
# {'input_values': tensor([[[ 0.0770, ..., -1.2776]]])}

{'input_values': tensor([[[ 0.0770, -0.2676,  0.1092,  ..., -1.2776, -1.2776, -1.2776],
         [ 0.0846, -0.2771,  0.0997,  ..., -1.2776, -1.2776, -1.2776],
         [-0.2939, -0.3674,  0.0095,  ..., -1.2776, -1.2776, -1.2776],
         ...,
         [ 0.2184, -0.0845,  0.2923,  ..., -1.2776, -1.2776, -1.2776],
         [ 0.1963, -0.1293,  0.2475,  ..., -1.2776, -1.2776, -1.2776],
         [-0.0509, -0.4521, -0.0752,  ..., -1.2776, -1.2776, -1.2776]]])}

In [16]:
# The waveform is converted to spectrogram frames (not raw samples);
# with default args, the frames are padded/truncated to max_length=1024
result['input_values'].shape # torch.Size([1, 1024, 128]): (batch, max_length, num_mel_bins)

torch.Size([1, 1024, 128])
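
The 1024 in the shape counts spectrogram frames, not audio samples. The sketch below estimates how many frames a waveform yields before padding or truncation. It assumes a 25 ms window and a 10 ms hop (the usual Kaldi-style filterbank defaults) with no edge padding; the helper name is hypothetical.

```python
def num_frames(num_samples: int, sr: int = 16000,
               frame_length_ms: float = 25.0, frame_shift_ms: float = 10.0) -> int:
    """Estimate the number of spectrogram frames for a waveform (no edge padding)."""
    window = int(sr * frame_length_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sr * frame_shift_ms / 1000)    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


max_length = 1024
for n in (210010, 630031):  # resampled vs. raw sample counts from above
    frames = num_frames(n)
    print(n, frames, "truncated" if frames > max_length else "padded")
```

Under these assumptions, 1024 frames cover roughly 10 seconds of 16 kHz audio, so a longer clip is cut off and a shorter one is padded.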

In [23]:
# Load model
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")


In [37]:
# Check class labels: dictionary with ids & class names
# 0: 'Speech'
# 1: 'Male speech, man speaking',
# 2: 'Female speech, woman speaking',
# ...
model.config.id2label

{0: 'Speech',
 1: 'Male speech, man speaking',
 2: 'Female speech, woman speaking',
 3: 'Child speech, kid speaking',
 4: 'Conversation',
 5: 'Narration, monologue',
 6: 'Babbling',
 7: 'Speech synthesizer',
 8: 'Shout',
 9: 'Bellow',
 10: 'Whoop',
 11: 'Yell',
 12: 'Battle cry',
 13: 'Children shouting',
 14: 'Screaming',
 15: 'Whispering',
 16: 'Laughter',
 17: 'Baby laughter',
 18: 'Giggle',
 19: 'Snicker',
 20: 'Belly laugh',
 21: 'Chuckle, chortle',
 22: 'Crying, sobbing',
 23: 'Baby cry, infant cry',
 24: 'Whimper',
 25: 'Wail, moan',
 26: 'Sigh',
 27: 'Singing',
 28: 'Choir',
 29: 'Yodeling',
 30: 'Chant',
 31: 'Mantra',
 32: 'Male singing',
 33: 'Female singing',
 34: 'Child singing',
 35: 'Synthetic singing',
 36: 'Rapping',
 37: 'Humming',
 38: 'Groan',
 39: 'Grunt',
 40: 'Whistling',
 41: 'Breathing',
 42: 'Wheeze',
 43: 'Snoring',
 44: 'Gasp',
 45: 'Pant',
 46: 'Snort',
 47: 'Cough',
 48: 'Throat clearing',
 49: 'Sneeze',
 50: 'Sniff',
 51: 'Run',
 52: 'Shuffle',
 53: 'Walk

In [25]:
# Forward pass
prediction_logits = model(result['input_values']).logits

In [29]:
# One raw logit per class (527 classes)
prediction_logits.shape # torch.Size([1, 527])

torch.Size([1, 527])

In [30]:
# Select max value -> predicted class
predicted_class_ids = torch.argmax(prediction_logits, dim=-1).item()

In [31]:
predicted_label = model.config.id2label[predicted_class_ids]

In [32]:
predicted_label

'Music'
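
The argmax above returns only the single best class. To get ranked scores for several classes, one common option is to apply a softmax to the logits and sort. The sketch below does this in plain Python on made-up logits; the toy labels and values are assumptions for illustration. With real tensors, `torch.softmax` and `torch.topk` would play the same roles.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k highest-scoring classes as score/label dicts."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [{"score": p, "label": id2label[i]} for i, p in ranked[:k]]


# Toy example: 3 classes instead of 527, with invented logits
toy_id2label = {0: "Speech", 1: "Music", 2: "Violin, fiddle"}
toy_logits = [-1.0, 2.0, 1.0]

results = top_k(toy_logits, toy_id2label, k=2)
for r in results:
    print(r)
```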

## Pipeline for Audio Classification

In [38]:
# Instead of performing feature extraction + inference manually,
# we can use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")

In [39]:
# We can check that it's the same model as before
pipe.model

ASTForAudioClassification(
  (audio_spectrogram_transformer): ASTModel(
    (embeddings): ASTEmbeddings(
      (patch_embeddings): ASTPatchEmbeddings(
        (projection): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ASTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ASTLayer(
          (attention): ASTSdpaAttention(
            (attention): ASTSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ASTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ASTIntermediate(
       

In [40]:
# Run the pipeline on the file; it returns the top classes (5 by default) with their scores
pipe('example.mp3')


[{'score': 0.48486781120300293, 'label': 'Music'},
 {'score': 0.19131098687648773, 'label': 'Violin, fiddle'},
 {'score': 0.08519725501537323, 'label': 'Musical instrument'},
 {'score': 0.04692428186535835, 'label': 'Bowed string instrument'},
 {'score': 0.045360978692770004, 'label': 'Orchestra'}]

In [None]:
len(pipe.model.config.id2label)

527

## AudioSet

Audio classification models are often trained and evaluated on Google's [AudioSet](https://research.google.com/audioset/) dataset and its class list. AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

Often only a subset of 527 classes is used, as with the model above, which has 527 output labels.
The ontology is hierarchical: high-level groups are refined into more specific classes. An illustrative, non-exhaustive grouping:

- **Human sounds**
  - Speech
  - Laughter
  - Crying, sobbing
  - Shout
  - Whispering

- **Animal sounds**
  - Dog
  - Cat
  - Bird
  - Rooster
  - Insect

- **Natural sounds**
  - Wind
  - Rain
  - Thunder
  - Fire
  - Waterfall

- **Musical instruments**
  - Piano
  - Guitar
  - Violin
  - Drums
  - Flute

- **Environmental sounds**
  - Car
  - Motorcycle
  - Train
  - Aircraft
  - Siren

- **Machine sounds**
  - Engine
  - Tools
  - Machinery

- **Music genres and instruments**
  - Rock music
  - Pop music
  - Jazz
  - Classical music



In [45]:
import json
import requests
import pandas as pd

# URL to the ontology file
ontology_url = "https://raw.githubusercontent.com/audioset/ontology/master/ontology.json"

# Load the ontology (a flat list of class entries)
response = requests.get(ontology_url)
response.raise_for_status()  # fail early on HTTP errors
ontology = json.loads(response.text)

# Count the number of classes
num_classes = len(ontology)
print(f"Number of classes in AudioSet: {num_classes}") # 632

# URL to the SUBSET AudioSet class labels (527 classes)
subset_labels_url = "http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv"

# Load the class labels
subset_labels_df = pd.read_csv(subset_labels_url)
num_classes = subset_labels_df.shape[0]
print(f"Number of classes in the subset: {num_classes}") # 527

# Display the first few ontology entries and their sub-classes
# Note: the ontology is a flat list, so these entries are not all top-level classes
num = 5
by_id = {c['id']: c for c in ontology}  # id -> entry lookup
for category in ontology[:num]:  # Display the first num=5 entries
    print(f"Class: {category['name']}")
    for child_id in category.get('child_ids', [])[:num]:  # Display the first 5 sub-classes
        child = by_id.get(child_id)
        if child:
            print(f"  Sub-class: {child['name']}")
    print()



Number of classes in AudioSet: 632
Number of classes in the subset: 527
Class: Human sounds
  Sub-class: Human voice
  Sub-class: Whistling
  Sub-class: Respiratory sounds
  Sub-class: Human locomotion
  Sub-class: Digestive

Class: Human voice
  Sub-class: Speech
  Sub-class: Shout
  Sub-class: Screaming
  Sub-class: Whispering
  Sub-class: Laughter

Class: Speech
  Sub-class: Male speech, man speaking
  Sub-class: Female speech, woman speaking
  Sub-class: Child speech, kid speaking
  Sub-class: Conversation
  Sub-class: Narration, monologue

Class: Male speech, man speaking

Class: Female speech, woman speaking

