# Audio Classification

# Libraries

Balancing the versions of torch, torchaudio, and transformers can be tricky! Here are the versions used for this notebook:

## Library Versions

In [9]:
import torch, transformers, torchaudio
print("These are the versions used for this notebook, but watch the lecture for an important note on this")
print(torch.__version__)
print(torchaudio.__version__)
print(transformers.__version__)


These are the versions used for this notebook, but watch the lecture for an important note on this
2.2.1+cpu
2.2.1+cpu
4.26.1
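
As far as I know, torch and torchaudio releases are paired by major.minor version (2.2.x with 2.2.x), so a quick check can catch a mismatched install early. This is a minimal sketch: `release_tuple` and `same_minor_release` are hypothetical helper names, and the version strings are the ones printed above.

```python
def release_tuple(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.2.1+cpu'."""
    public = version.split("+")[0]        # drop a local build tag like '+cpu'
    major, minor = public.split(".")[:2]  # ignore the patch number
    return int(major), int(minor)


def same_minor_release(version_a: str, version_b: str) -> bool:
    """True when both version strings share the same major.minor release."""
    return release_tuple(version_a) == release_tuple(version_b)


# Version strings copied from the cell output above.
print(same_minor_release("2.2.1+cpu", "2.2.1+cpu"))
```

In the notebook itself you could call `same_minor_release(torch.__version__, torchaudio.__version__)` instead of passing literal strings.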


In [15]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification

In [16]:
feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

In [19]:
import librosa
audio_path = 'example.mp3'
y, sr = librosa.load(audio_path, sr=None)

## Sampling Rate Issues

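Recall that many pretrained audio models, including this one, expect audio sampled at 16 kHz. Because `librosa.load(..., sr=None)` keeps the file's native sampling rate, you will run into issues if you try to pass your own sampling rate to the feature extractor: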

In [22]:
# ERROR! Raises a ValueError when sr differs from the rate the extractor expects
# result = feature_extractor(y, sampling_rate=sr)

In [30]:
# Omitting sampling_rate avoids the error but triggers the warning shown below
result = feature_extractor(y, return_tensors="pt")

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
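
One way to avoid both the error and the warning is to load the audio at the rate the extractor expects and then pass that rate explicitly. The sketch below assumes librosa is installed and that the feature extractor exposes a `sampling_rate` attribute; `check_sampling_rate` and `load_for_extractor` are hypothetical helper names, not part of any library.

```python
def check_sampling_rate(sr: int, expected_sr: int) -> None:
    """Fail early, with a readable message, when the rates do not match."""
    if sr != expected_sr:
        raise ValueError(
            f"Audio is sampled at {sr} Hz but the model expects {expected_sr} Hz; "
            "resample the audio before extracting features."
        )


def load_for_extractor(audio_path, feature_extractor):
    """Load audio resampled to the extractor's rate and return model inputs."""
    import librosa  # imported lazily so check_sampling_rate works without it

    target_sr = feature_extractor.sampling_rate  # assumed attribute, see lead-in
    # Passing sr=target_sr asks librosa to resample while loading.
    y, sr = librosa.load(audio_path, sr=target_sr)
    check_sampling_rate(sr, target_sr)
    return feature_extractor(y, sampling_rate=sr, return_tensors="pt")
```

With the objects from this notebook, the call would be `result = load_for_extractor('example.mp3', feature_extractor)`. Note that resampled audio will generally produce different `input_values` than the unresampled call shown here.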


In [31]:
result

{'input_values': tensor([[[ 0.0770, -0.2676,  0.1092,  ..., -1.2776, -1.2776, -1.2776],
         [ 0.0846, -0.2771,  0.0997,  ..., -1.2776, -1.2776, -1.2776],
         [-0.2939, -0.3674,  0.0095,  ..., -1.2776, -1.2776, -1.2776],
         ...,
         [ 0.2184, -0.0845,  0.2923,  ..., -1.2776, -1.2776, -1.2776],
         [ 0.1963, -0.1293,  0.2475,  ..., -1.2776, -1.2776, -1.2776],
         [-0.0509, -0.4521, -0.0752,  ..., -1.2776, -1.2776, -1.2776]]])}

In [32]:
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

In [34]:
prediction_logits = model(result['input_values']).logits

In [36]:
# prediction_logits

In [38]:
predicted_class_ids = torch.argmax(prediction_logits, dim=-1).item()

In [39]:
predicted_label = model.config.id2label[predicted_class_ids]

In [40]:
predicted_label

'Music'
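
`argmax` only reports the single best class. To see several candidates with scores, the logits can be turned into probabilities with a softmax and then ranked. The sketch below uses plain Python so the arithmetic is visible; the toy logits and the three-entry label map are made up for illustration and are not the model's real values.

```python
import math


def top_k_labels(logits, id2label, k=5):
    """Softmax the logits, then return the k most probable labels with scores."""
    largest = max(logits)
    # Subtracting the max keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"score": probs[i], "label": id2label[i]} for i in ranked[:k]]


# Toy values for illustration only (not taken from the model).
toy_logits = [2.0, 0.5, 1.0]
toy_id2label = {0: "Music", 1: "Speech", 2: "Violin, fiddle"}

top = top_k_labels(toy_logits, toy_id2label, k=2)
print([entry["label"] for entry in top])
```

With the real model you would pass `prediction_logits[0].tolist()` and `model.config.id2label`; in torch, `torch.softmax(prediction_logits, dim=-1).topk(5)` does the same ranking in one line.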

In [42]:
# model.config.id2label

## Pipeline for Audio Classification

In [3]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")

In [12]:
pipe.model

ASTForAudioClassification(
  (audio_spectrogram_transformer): ASTModel(
    (embeddings): ASTEmbeddings(
      (patch_embeddings): ASTPatchEmbeddings(
        (projection): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ASTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ASTLayer(
          (attention): ASTAttention(
            (attention): ASTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ASTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ASTIntermediate(
            (de
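
The `projection` layer in the printout, `Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))`, cuts the spectrogram into overlapping 16x16 patches and maps each one to a 768-dimensional embedding. A worked example of the patch count follows; it assumes a spectrogram of 128 mel bins by 1024 frames, which I believe are the AST defaults but which this notebook does not print.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


# Assumed input shape (not shown in the notebook): 128 mel bins x 1024 frames.
num_mel_bins, num_frames = 128, 1024

# Kernel size and stride are taken from the Conv2d line in the printout.
freq_patches = conv_output_size(num_mel_bins, kernel=16, stride=10)
time_patches = conv_output_size(num_frames, kernel=16, stride=10)
num_patches = freq_patches * time_patches

print(freq_patches, time_patches, num_patches)  # 12 101 1212
```

Because the stride (10) is smaller than the kernel (16), neighbouring patches overlap by 6 bins or frames in each direction.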

In [5]:
pipe('example.mp3')

[{'score': 0.48486804962158203, 'label': 'Music'},
 {'score': 0.1913108080625534, 'label': 'Violin, fiddle'},
 {'score': 0.08519719541072845, 'label': 'Musical instrument'},
 {'score': 0.046924274414777756, 'label': 'Bowed string instrument'},
 {'score': 0.045361001044511795, 'label': 'Orchestra'}]
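
The pipeline returns a plain list of dictionaries, so ordinary Python is enough to post-process it. As a small sketch, the helper below keeps only the labels whose score clears a threshold; `labels_above` is a hypothetical name, the 0.1 cutoff is arbitrary, and the list is copied from the output above.

```python
# Pipeline output copied from the cell above.
results = [
    {"score": 0.48486804962158203, "label": "Music"},
    {"score": 0.1913108080625534, "label": "Violin, fiddle"},
    {"score": 0.08519719541072845, "label": "Musical instrument"},
    {"score": 0.046924274414777756, "label": "Bowed string instrument"},
    {"score": 0.045361001044511795, "label": "Orchestra"},
]


def labels_above(results, threshold):
    """Return the labels whose score is at least `threshold`, in given order."""
    return [r["label"] for r in results if r["score"] >= threshold]


confident = labels_above(results, threshold=0.1)
print(confident)  # ['Music', 'Violin, fiddle']
```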

In [8]:
len(pipe.model.config.id2label)

527