Audio classification, like text classification, assigns a class label to the input data. The only difference is that the inputs are raw audio waveforms instead of text. Practical applications of audio classification include identifying speaker intent, classifying the spoken language, and even identifying animal species by their sounds.

This guide shows how to:
1. Finetune Wav2Vec2 on the MInDS-14 dataset to classify speaker intent.
2. Use the finetuned model for inference.

# Libraries

In [1]:
pip install transformers datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset, Audio

# Load Data

In [3]:
# Load the MInDS-14 dataset
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

# Split the dataset’s train split into smaller train and test sets.
# This gives you a chance to experiment and make sure everything works before spending more time on the full dataset.
minds = minds.train_test_split(test_size=0.2)

# Inspect the data
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
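The row counts above follow from the `test_size=0.2` fraction. The sketch below reproduces the arithmetic under the assumption that the test set size is rounded up and the train set takes the remaining rows; `split_sizes` is a hypothetical helper written for illustration, not part of the `datasets` library.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set size is rounded up and the train set
    receives whatever rows are left.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


# 450 + 113 = 563 rows in the split that was loaded above
n_train, n_test = split_sizes(563, 0.2)
print(n_train, n_test)
```

Because the split is random, the rows that land in each set change between runs unless a seed is fixed, for example through the `seed` argument of `train_test_split`.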

In [4]:
# The dataset contains a lot of useful information, like lang_id and english_transcription.
# This guide focuses on audio and intent_class, so remove the other columns.
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

# Two fields remain in the dataset:
# audio: a 1-dimensional array of the speech signal; accessing it loads and resamples the audio file.
# intent_class: the class id of the speaker’s intent.
minds["train"][0]

{'audio': {'path': '/Users/nm/.cache/huggingface/datasets/downloads/extracted/4944423aa9d3073a2117f3bc2f3c9cde7140f2cd3d395a3e99ab62075ac3d5eb/en-US~ADDRESS/602ba181bb1e6d0fbce91fe8.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 8000},
 'intent_class': 1}
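The `audio` entry is a dictionary holding the file path, the decoded samples, and the sampling rate. As a minimal sketch of how these fit together, the clip length in seconds is the number of samples divided by the sampling rate. The example below is synthetic (one second of silence with a made-up path), not a row from MInDS-14.

```python
# Synthetic stand-in for one dataset example: one second of silence at 8 kHz.
# The real 'array' is a NumPy array; a plain list keeps this sketch dependency-free.
example = {
    "audio": {
        "path": "example.wav",  # hypothetical path, for illustration only
        "array": [0.0] * 8000,
        "sampling_rate": 8000,
    },
    "intent_class": 1,
}

audio = example["audio"]

# Duration in seconds = number of samples / samples per second
duration_s = len(audio["array"]) / audio["sampling_rate"]
print(f"{duration_s:.2f} s")
```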

In [5]:
# Map label names to label ids and back (ids are stored as strings)
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# Now the label id can be converted to a label name
id2label[str(2)]

'app_error'
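One detail worth noting: the mappings store ids as strings, while `intent_class` in each example is an integer, so the id has to be converted before the lookup. The sketch below shows the round trip on a shortened, illustrative label list; in the notebook the real list comes from `minds["train"].features["intent_class"].names`.

```python
# Shortened, illustrative label list (an assumption for this sketch);
# the notebook reads the full list from the dataset features.
labels = ["abroad", "address", "app_error"]

# Same mappings as above, written as dict comprehensions
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Round trip: name -> id -> name returns the original name
for label in labels:
    assert id2label[label2id[label]] == label

# intent_class is an int in the dataset, so convert it to str before the lookup
intent_class = 1
print(id2label[str(intent_class)])

# An int key is not present, because the keys are strings
print(intent_class in id2label)
```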