Audio classification, just like text classification, assigns a class label to the input data. The only difference is that the inputs are raw audio waveforms instead of text. Practical applications of audio classification include identifying speaker intent, classifying the spoken language, and even identifying animal species by their sounds.

This guide shows how to:
1. Finetune Wav2Vec2 on the MInDS-14 dataset to classify speaker intent.
2. Use the finetuned model for inference.

# Libraries

In [1]:
pip install transformers datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset, Audio

In [3]:
from transformers import AutoFeatureExtractor

# Load Data

In [4]:
# Load the MInDS-14 dataset
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

# Split the dataset's train split into a smaller train and test set.
# This gives you a chance to experiment and make sure everything works before spending more time on the full dataset.
minds = minds.train_test_split(test_size=0.2)

# Inspect the data
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
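The split sizes in the output above can be checked with a little arithmetic. This is a minimal sketch, assuming the test set size is rounded up and the remainder goes to the train set; that assumption is consistent with the 450/113 split shown above, but it is not taken from the library's documentation. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows), assuming the test size is rounded up."""
    test_rows = math.ceil(num_rows * test_size)
    train_rows = num_rows - test_rows
    return train_rows, test_rows

# 450 + 113 = 563 rows in the en-US train split, per the output above
train_rows, test_rows = split_sizes(563, test_size=0.2)
print(train_rows, test_rows)
```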

In [5]:
# The dataset contains a lot of useful information, like lang_id and english_transcription,
# but this guide focuses on audio and intent_class, so remove the other columns.
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

# Two fields in the dataset:
# audio: a 1-dimensional array of the speech signal that must be called to load and resample the audio file.
# intent_class: represents the class id of the speaker’s intent.
minds["train"][0]

{'audio': {'path': '/Users/nm/.cache/huggingface/datasets/downloads/extracted/4944423aa9d3073a2117f3bc2f3c9cde7140f2cd3d395a3e99ab62075ac3d5eb/en-US~APP_ERROR/602bad435f67b421554f6491.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 8000},
 'intent_class': 2}

In [6]:
# Map label name to label id
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
    
# Now the label id can be converted to a label name
id2label[str(2)]

'app_error'
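The same two mappings can also be written as dictionary comprehensions. The sketch below uses a short, hypothetical label list for illustration rather than the full list of MInDS-14 intents, and the `example_` names are placeholders that do not appear elsewhere in this guide.

```python
# Hypothetical label names for illustration; the real list comes from
# minds["train"].features["intent_class"].names
example_labels = ["abroad", "address", "app_error"]

# Ids are stored as strings, matching the loop above
example_label2id = {label: str(i) for i, label in enumerate(example_labels)}
example_id2label = {str(i): label for i, label in enumerate(example_labels)}

# Round trip: name -> id -> name
label_id = example_label2id["app_error"]
print(label_id, example_id2label[label_id])
```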

# Preprocessing

In [7]:
# Load a Wav2Vec2 feature extractor to process the audio signal
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

In [8]:
# The MInDS-14 dataset has a sampling rate of 8 kHz (8000 Hz; you can find this information in its dataset card).
# Resample the dataset to 16 kHz (16000 Hz) to use the pretrained Wav2Vec2 model.
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': '/Users/nm/.cache/huggingface/datasets/downloads/extracted/4944423aa9d3073a2117f3bc2f3c9cde7140f2cd3d395a3e99ab62075ac3d5eb/en-US~APP_ERROR/602bad435f67b421554f6491.wav',
  'array': array([-4.29120264e-05, -4.39114665e-05,  4.62070457e-05, ...,
         -4.74929839e-05, -3.16616060e-05,  3.27393427e-05]),
  'sampling_rate': 16000},
 'intent_class': 2}
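Resampling changes the number of samples in the array but not the duration of the clip: going from 8000 Hz to 16000 Hz roughly doubles the array length. A minimal sketch of that arithmetic follows; the helper `resampled_length` and the 3-second clip are hypothetical, and an actual resampler may differ by a sample or so at the edges.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate number of samples after resampling; the duration is unchanged."""
    duration_seconds = num_samples / orig_sr
    return round(duration_seconds * target_sr)

# A hypothetical 3-second clip recorded at 8000 Hz
n_8k = 3 * 8_000
n_16k = resampled_length(n_8k, orig_sr=8_000, target_sr=16_000)
print(n_8k, n_16k)
```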

In [None]:
# Create a preprocessing function that:
# 1. Accesses the audio column, which loads and, if necessary, resamples each audio file
# 2. Passes the sampling rate to the feature extractor so it can check that it matches the rate the model was pretrained with
# 3. Sets a maximum input length (16000 samples) and truncates longer inputs so they can be batched
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
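With `max_length=16000` and `truncation=True` at a 16000 Hz sampling rate, each clip is cut to at most one second of audio. The toy sketch below illustrates only that truncation step with plain Python lists; it is not the feature extractor's implementation, and `truncate_clips` and the example clips are hypothetical.

```python
def truncate_clips(clips, max_length):
    """Cut each clip to at most max_length samples; shorter clips are unchanged."""
    return [clip[:max_length] for clip in clips]

sampling_rate = 16_000
max_length = 16_000  # one second of audio at 16000 Hz

# Hypothetical clips: 0.5 s and 2.5 s of silence
clips = [[0.0] * 8_000, [0.0] * 40_000]
truncated = truncate_clips(clips, max_length)

# Durations in seconds after truncation
print([len(clip) / sampling_rate for clip in truncated])
```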

In [None]:
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")