In [None]:
!pip install -qU transformers datasets evaluate

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Audio Classification

**Audio classification** assigns a class label to the input data, just like text classification. The only difference is that the inputs are raw audio waveforms instead of text.

Applications of audio classification include:
* speaker intent identification
* language classification
* animal species identification by their sounds

We will use the **MINDS-14** dataset to fine-tune **Wav2Vec2** to classify speaker intent, and then use the fine-tuned model for inference.

## Load MINDS-14 dataset

In [2]:
from datasets import load_dataset, Audio

minds = load_dataset(
    'PolyAI/minds14',
    name='en-US',
    split='train'
)

The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y

Split the training dataset into a smaller train and test set with the `train_test_split` method:

In [3]:
minds = minds.train_test_split(test_size=0.2)

In [4]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
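The row counts above follow from simple arithmetic. As a rough sketch, assuming the test share is rounded up to a whole example and the remainder goes to the train set (an assumption that matches the counts shown, not a statement about the library's internals):

```python
import math

n_examples = 450 + 113   # total rows before the split (from the output above)
test_size = 0.2

# Assumption: the test share is rounded up and the rest goes to train.
n_test = math.ceil(n_examples * test_size)
n_train = n_examples - n_test

print(n_train, n_test)
```

Because the split is random, passing a fixed `seed` to `train_test_split` makes it reproducible across runs.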

We only need the `audio` and `intent_class` columns, so we remove the others:

In [5]:
minds = minds.remove_columns(['path', 'transcription', 'english_transcription', 'lang_id'])

In [6]:
# check an example
minds['train'][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/en-US~CARD_ISSUES/602ba965bb1e6d0fbce92124.wav',
  'array': array([ 0.        ,  0.        ,  0.00024414, ...,  0.01989746,
          0.0045166 , -0.00891113]),
  'sampling_rate': 8000},
 'intent_class': 6}

* `audio` is a dictionary holding the file `path`, a 1D `array` of the speech signal, and its `sampling_rate`; accessing this column loads and, if necessary, resamples the audio file
* `intent_class` represents the class id of the speaker's intent

To understand the label id better, we need to create a dictionary that maps the label name to an integer and vice versa:

In [7]:
labels = minds['train'].features['intent_class'].names
label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

In [8]:
# check the label class
id2label[str(6)]

'card_issues'
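The same two mappings can also be built with dictionary comprehensions. This is a minimal sketch on a short, hypothetical list of label names standing in for `minds['train'].features['intent_class'].names`; the ids are kept as strings, as in the cell above.

```python
# Hypothetical stand-in for minds['train'].features['intent_class'].names
labels = ['abroad', 'address', 'app_error']

# Label name -> string id, and string id -> label name
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Round trip: name -> id -> name returns the original name
for label in labels:
    assert id2label[label2id[label]] == label

print(label2id)
```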

## Preprocess

We will load a Wav2Vec2 feature extractor to process the audio signal:

In [9]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/wav2vec2-base')

The MINDS-14 dataset card states that the audio is sampled at 8 kHz, whereas Wav2Vec2 was pretrained on 16 kHz audio, so we need to resample the dataset to 16 kHz:

In [10]:
minds = minds.cast_column(
    'audio',
    Audio(sampling_rate=16_000)
)
# check the same example
minds['train'][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/en-US~CARD_ISSUES/602ba965bb1e6d0fbce92124.wav',
  'array': array([-2.66974985e-05, -4.93595217e-05,  2.90675962e-05, ...,
         -6.28457079e-03, -8.84941593e-03, -4.70361672e-03]),
  'sampling_rate': 16000},
 'intent_class': 6}
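Resampling from 8 kHz to 16 kHz doubles the number of samples for the same duration. The sketch below illustrates this with plain linear interpolation on a synthetic tone; `resample_linear` is a hypothetical helper, and the resampler that `Audio` uses internally is more sophisticated than this, so the values will not match the output above.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Hypothetical helper: resample a 1D signal by linear interpolation."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(signal)) / orig_sr    # original sample times (s)
    t_target = np.arange(n_target) / target_sr   # new sample times (s)
    return np.interp(t_target, t_orig, signal)

# One second of a 440 Hz tone sampled at 8 kHz
t = np.arange(8_000) / 8_000
tone_8k = np.sin(2 * np.pi * 440 * t)

tone_16k = resample_linear(tone_8k, orig_sr=8_000, target_sr=16_000)
print(len(tone_8k), len(tone_16k))
```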

We will create a preprocessing function that
1. calls the `audio` column to load and, if necessary, resample the audio file,
2. checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with,
3. sets a maximum input length of 16,000 samples (one second at 16 kHz) and truncates longer inputs to it, so that inputs can be batched.

In [12]:
def preprocess_function(examples):
    audio_arrays = [x['array'] for x in examples['audio']]

    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=16000,
        truncation=True
    )

    return inputs
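With `max_length=16000` and `truncation=True`, clips longer than 16,000 samples are cut down, while shorter clips keep their length. The following sketch mimics that effect with NumPy slicing; `truncate` is a hypothetical helper, and keeping the start of the clip is an assumption about how the truncation is applied.

```python
import numpy as np

MAX_LENGTH = 16_000      # same value as max_length in preprocess_function
SAMPLING_RATE = 16_000   # Hz, after resampling

def truncate(waveform, max_length=MAX_LENGTH):
    """Hypothetical helper: keep at most max_length samples from the start."""
    return waveform[:max_length]

short_clip = np.zeros(12_000)   # 0.75 s, left untouched
long_clip = np.zeros(40_000)    # 2.5 s, cut to 1 s

truncated = [truncate(clip) for clip in (short_clip, long_clip)]
lengths = [len(clip) for clip in truncated]
durations = [n / SAMPLING_RATE for n in lengths]   # seconds

print(lengths, durations)
```

One second is a short window for spoken requests, so raising `max_length` is worth trying if memory allows.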

We can apply the preprocessing function over the entire dataset with the `map` method; `batched=True` processes multiple examples at once:

In [13]:
encoded_minds = minds.map(
    preprocess_function,
    remove_columns='audio',
    batched=True
)

We need to rename `intent_class` to `label` because that is the column name the model expects:

In [14]:
encoded_minds = encoded_minds.rename_column('intent_class', 'label')

## Evaluate

For this task, we will load the accuracy metric:

In [15]:
import evaluate

accuracy = evaluate.load('accuracy')

Then we will create a function that passes our predictions and labels to the `compute` method to calculate the accuracy:

In [16]:
import numpy as np

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)

    return accuracy.compute(
        predictions=predictions,
        references=eval_pred.label_ids
    )
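To make the computation concrete, here is a small NumPy-only sketch of what happens to the model outputs: the class with the highest logit becomes the prediction, and accuracy is the fraction of predictions that match the labels. The logits and labels are made up for illustration.

```python
import numpy as np

# Hypothetical logits for 4 clips and 3 classes, with their true labels
logits = np.array([
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [0.9, 0.8, 0.7],
])
labels = np.array([1, 0, 2, 1])

# Highest logit per row gives the predicted class id
predictions = np.argmax(logits, axis=1)

# Accuracy = share of predictions equal to the reference labels
accuracy_value = float((predictions == labels).mean())
print(predictions, accuracy_value)
```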

## Train

Load Wav2Vec2 with `AutoModelForAudioClassification` along with the number of expected labels, and the label mappings:

In [17]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    'facebook/wav2vec2-base',
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label
)



Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we need to define our training hyperparameters.

In [18]:
training_args = TrainingArguments(
    output_dir='my_awesome_mind_model',
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    push_to_hub=False,
)
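Two of these settings interact: with gradient accumulation, the optimizer only updates after several batches. As a rough sketch, assuming a single device (an assumption, since the notebook does not state the hardware), the effective batch size and the number of batches per epoch work out as follows:

```python
import math

n_train = 450                       # training rows after the split
per_device_train_batch_size = 32
gradient_accumulation_steps = 4
n_devices = 1                       # assumption: a single GPU or CPU

# Examples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * n_devices
)

# Forward/backward passes needed to see the whole train set once
batches_per_epoch = math.ceil(n_train / (per_device_train_batch_size * n_devices))

print(effective_batch_size, batches_per_epoch)
```

With a train set this small, each epoch yields only a handful of optimizer updates, so `logging_steps=10` produces few log lines over the run.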

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds['train'],
    eval_dataset=encoded_minds['test'],
    processing_class=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

## Inference

For inference, remember to resample the audio file to match the model's sampling rate.

In [None]:
from datasets import load_dataset, Audio

dataset = load_dataset(
    'PolyAI/minds14',
    name='en-US',
    split='train'
)
dataset = dataset.cast_column('audio', Audio(sampling_rate=16_000))

sampling_rate = dataset.features['audio'].sampling_rate
audio_file = dataset[0]['audio']['path']

We can try out our fine-tuned model for inference using `pipeline()`.

In [None]:
from transformers import pipeline

classifier = pipeline(
    'audio-classification',
    model='stevhliu/my_awesome_mind_model'
)

In [None]:
classifier(audio_file)

We can also manually replicate the `pipeline`:

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    'stevhliu/my_awesome_mind_model'
)

In [None]:
inputs = feature_extractor(
    dataset[0]['audio']['array'],
    sampling_rate=sampling_rate,
    return_tensors='pt'
)

In [None]:
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained(
    'stevhliu/my_awesome_mind_model'
)

In [None]:
import torch

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label
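The last step, turning logits into a label, can be illustrated without a model. This sketch uses made-up logits and a hypothetical three-class `id2label` with integer keys (assumed here to mirror how the loaded model config stores them), and adds a softmax to show how a confidence score can be derived.

```python
import numpy as np

# Hypothetical logits for one clip and a hypothetical 3-class mapping
logits = np.array([[0.2, 2.5, -1.0]])
id2label = {0: 'abroad', 1: 'address', 2: 'app_error'}

# Softmax, shifted by the max for numerical stability
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Convert the winning index to a plain int before the dictionary lookup
predicted_id = int(np.argmax(probs, axis=-1)[0])
predicted_label = id2label[predicted_id]

print(predicted_label, round(float(probs[0, predicted_id]), 3))
```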