<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/AST/Inference_with_the_Audio_Spectogram_Transformer_to_classify_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

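First we install 🤗 Transformers, together with torchaudio, which later cells use to save and load audio.

In [None]:
!pip install -q transformers torchaudio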

## Load audio

Let's load some audio on which we'd like to test the model.

In [None]:
from huggingface_hub import hf_hub_download
import IPython

filepath = hf_hub_download(repo_id="nielsr/audio-spectogram-transformer-checkpoint",
                           filename="sample_audio.flac",
                           repo_type="dataset")

IPython.display.Audio(filepath)

## Prepare audio for the model (using feature extractor)

We can prepare the audio using ASTFeatureExtractor, which turns a raw waveform into a tensor of shape (batch_size, time_dimension, frequency_dimension). This tensor holds log-mel filterbank features, commonly called a (log-mel) spectrogram.

In [1]:
from transformers import ASTFeatureExtractor

feature_extractor = ASTFeatureExtractor()

In [2]:
import torch, torchaudio


In [33]:

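# Alternative input: to classify the downloaded sample instead of the synthetic
# signal generated in the next cell, uncomment the lines below and skip that cell.
# waveform, sampling_rate = torchaudio.load(filepath)
# waveform = waveform.squeeze().numpy()

# waveform.shape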

In [None]:

# Synthetic test signal: a 10-second linear chirp sampled at 16 kHz
sampling_rate = 16000
duration = 10  # in seconds

# Return a sine wave whose instantaneous frequency rises linearly from 100 Hz to 800 Hz
# over `duration` seconds. The phase of a linear chirp is
# 2*pi*(f0*t + (f1 - f0)*t**2 / (2*duration)), hence the factor 350 = (800 - 100) / 2.
def chirp(t):
    return torch.sin(2 * torch.pi * (100 + 350 * t / duration) * t)

t = torch.linspace(0, duration, int(sampling_rate * duration))
waveform = chirp(t)

# Ensure waveform is in the correct shape (channels, samples)
waveform = waveform.unsqueeze(0)  # Add a channel dimension

# Save the waveform as a WAV file
torchaudio.save("output.wav", waveform, sampling_rate)

waveform = waveform.squeeze().numpy()

IPython.display.Audio("output.wav")
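A quick way to check the frequency range a chirp actually covers is to differentiate its phase numerically, since the instantaneous frequency is the time derivative of the phase divided by 2π. The sketch below uses NumPy only; the helper names `linear_chirp_phase` and `instantaneous_frequency` are hypothetical and introduced purely for illustration.

```python
import numpy as np

def linear_chirp_phase(t, f0, f1, duration):
    """Phase (in radians) of a chirp whose frequency rises linearly from f0 to f1."""
    return 2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * duration))

def instantaneous_frequency(phase, t):
    """Estimate the instantaneous frequency in Hz as d(phase)/dt / (2*pi)."""
    # edge_order=2 keeps the estimate accurate at the first and last sample
    return np.gradient(phase, t, edge_order=2) / (2 * np.pi)

check_rate = 16000
check_duration = 10
t_check = np.linspace(0, check_duration, check_rate * check_duration)
phase = linear_chirp_phase(t_check, 100.0, 800.0, check_duration)
freq = instantaneous_frequency(phase, t_check)

# The endpoints should be close to f0 and f1
print(freq[0], freq[-1])
```

Note that a phase of the form 2π·f(t)·t does not have instantaneous frequency f(t) unless f is constant, which is why the quadratic term carries the factor 1/2.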

In [None]:
print(waveform.shape)
inputs = feature_extractor(waveform, sampling_rate=sampling_rate, padding="max_length", return_tensors="pt")
input_values = inputs.input_values
print(input_values.shape)
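The time dimension printed above can be anticipated with a little arithmetic. The sketch below assumes the feature extractor frames the audio with a 25 ms window and a 10 ms shift, drops incomplete frames at the end, and then pads or truncates to its default max_length of 1024 frames with 128 mel bins. These values are assumptions about the defaults, and `expected_frames` is a hypothetical helper, not part of the library.

```python
def expected_frames(num_samples, sampling_rate=16000,
                    frame_length_ms=25.0, frame_shift_ms=10.0):
    """Number of full frames that fit in a waveform (no padding at the edges)."""
    window = int(sampling_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sampling_rate * frame_shift_ms / 1000)    # samples between frame starts
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift

max_length = 1024     # assumed default number of time frames after padding
num_mel_bins = 128    # assumed default number of frequency bins

n_frames = expected_frames(16000 * 10)   # the 10-second clip at 16 kHz
n_padding = max(0, max_length - n_frames)

print(n_frames)                        # 998
print(n_padding)                       # 26
print((1, max_length, num_mel_bins))   # expected shape of input_values
```

Under these assumptions, a 10-second clip yields 998 frames and the remaining 26 frames are padding, so clips longer than roughly 10.24 seconds would be truncated.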

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# input_values has shape (1, time, frequency); drop the batch dimension
input_values_np = input_values.squeeze(0).numpy()

plt.figure(figsize=(10, 4))
# Transpose so that time runs along the x-axis and mel bins along the y-axis
plt.imshow(input_values_np.T, aspect='auto', origin='lower')
plt.colorbar()
plt.title('Log-mel spectrogram (normalized)')
plt.xlabel('Time (frames)')
plt.ylabel('Mel bin')
plt.show()

## Load model

Next we load one of the checkpoints released by the AST authors from the [hub](https://huggingface.co/models?other=audio-spectrogram-transformer).

This one was fine-tuned on AudioSet, an important benchmark for audio classification.

In [9]:
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

## Forward pass

Next let's forward the audio through the model! We perform an argmax on the model's logits to get the predicted class index. We use model.config.id2label to turn that back into text.

In [10]:
import torch

with torch.no_grad():
  outputs = model(input_values)

In [None]:
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
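The argmax above reports a single label, but AudioSet clips can carry several labels at once, so it is often more informative to look at the few highest-scoring classes. The sketch below is one way to do that, assuming each logit is treated as an independent per-class score and passed through a sigmoid. The helper `top_k_labels` and the toy logits and label map are hypothetical and only serve to keep the example self-contained.

```python
import numpy as np

def top_k_labels(logits, id2label, k=5):
    """Return the k highest-scoring (label, probability) pairs, best first."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid
    order = np.argsort(probs)[::-1][:k]     # indices sorted by descending score
    return [(id2label[int(i)], float(probs[i])) for i in order]

# Toy stand-ins for outputs.logits[0] and model.config.id2label
toy_logits = [2.0, -1.0, 0.5, -3.0]
toy_id2label = {0: "Speech", 1: "Music", 2: "Sine wave", 3: "Silence"}

top = top_k_labels(toy_logits, toy_id2label, k=2)
for label, prob in top:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same function could be called as top_k_labels(outputs.logits[0].numpy(), model.config.id2label).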