# Fine-tuning Wav2Vec for Audio Classification
Wav2Vec is a powerful pre-trained model designed for self-supervised learning of audio representations. We can fine-tune it for our audio classification task, this time working with the Hugging Face *Transformers* library.


## Imports

In [5]:
import pandas as pd
import numpy as np
from datasets import load_dataset, Audio
from transformers import TrainingArguments
from transformers import AutoFeatureExtractor
from transformers import AutoModelForAudioClassification
from transformers import Trainer
import evaluate




## Dataset loading

In [6]:
dataset = load_dataset("audiofolder", data_dir="./split_songs_v2/")

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 3918
    })
    test: Dataset({
        features: ['audio', 'label'],
        num_rows: 1179
    })
})

Let's take a closer look at one element.

In [8]:
dataset["train"][0]

{'audio': {'path': '/content/drive/.shortcut-targets-by-id/1RsUKtn8GSoGO50v9MBuv2WF86mc7JWQR/split_songs_v2/train/drumbass/preview_10_part1.mp3',
  'array': array([-6.84566709e-04, -7.71552703e-04,  2.18750007e-04, ...,
          4.01656196e-01,  4.46715504e-01,  3.16405728e-01]),
  'sampling_rate': 44100},
 'label': 0}

Let's print out the labels.

In [9]:
dataset["train"].features["label"]

ClassLabel(names=['drumbass', 'dubtechno', 'dupstep', 'hardcore_breaks', 'house', 'psytrance', 'techno', 'ukgarage'], id=None)
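The integer labels stored in the dataset map to the genre names listed above. As a minimal sketch, the two lookup tables below make that mapping explicit. The variable names `label_names`, `id2label` and `label2id` are illustrative, and the list is copied from the `ClassLabel` output rather than read from the dataset.

```python
# Genre names copied verbatim from the ClassLabel output above.
# In the notebook they could instead be read from
# dataset["train"].features["label"].names.
label_names = [
    "drumbass", "dubtechno", "dupstep", "hardcore_breaks",
    "house", "psytrance", "techno", "ukgarage",
]

# Integer id -> genre name, and the reverse lookup.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label[0])
print(label2id["house"])
```

Mappings like these are often passed to `from_pretrained` as `id2label` and `label2id` so that predictions come back with readable names. This notebook does not do that, so the model reports generic label ids.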

Transformer models, best known for their results in natural language processing, have also been adapted to audio. Instead of hand-crafting audio features, we can let a pre-trained transformer learn representations directly from the raw waveform.

We select the Wav2Vec2 base checkpoint and rely on the library's feature extractor to prepare the raw audio (normalization and attention masks) automatically, saving us from manual transformations.


In [11]:
model_id = "facebook/wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)
sampling_rate = feature_extractor.sampling_rate
sampling_rate


16000

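The clips are stored at 44.1 kHz, but the model expects 16 kHz input, so we cast the audio column of both splits to the feature extractor's sampling rate.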

In [12]:
dataset["train"] = dataset["train"].cast_column("audio", Audio(sampling_rate=sampling_rate))
dataset["test"] = dataset["test"].cast_column("audio", Audio(sampling_rate=sampling_rate))
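Casting the column leaves the files on disk untouched; each clip is decoded and resampled when it is accessed. The arithmetic behind the new clip length can be sketched as follows. `resampled_length` is a hypothetical helper written for this illustration, and real resamplers may round slightly differently at the edges.

```python
def resampled_length(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Approximate number of samples after resampling (hypothetical helper)."""
    return round(num_samples * target_rate / source_rate)


# A 3-second clip at the original 44.1 kHz rate has 3 * 44100 samples.
original_samples = 3 * 44100
new_samples = resampled_length(original_samples, source_rate=44100, target_rate=16000)

print(original_samples, new_samples)
print(new_samples / 16000)  # duration in seconds is unchanged
```

The duration stays the same while the number of samples shrinks, which is why every resampled clip in this dataset ends up with the same fixed length.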

Let's print out 5 examples.

In [13]:
import random
for _ in range(5):
    rand_idx = random.randint(0, len(dataset["train"])-1)
    example = dataset["train"][rand_idx]
    audio = example["audio"]

    print(f'Label: {dataset["train"].features["label"].int2str([example["label"]])}')
    print(f'Shape: {audio["array"].shape}, sampling rate: {audio["sampling_rate"]}')
    print()

Label: ['house']
Shape: (48000,), sampling rate: 16000

Label: ['drumbass']
Shape: (48000,), sampling rate: 16000

Label: ['psytrance']
Shape: (48000,), sampling rate: 16000

Label: ['dubtechno']
Shape: (48000,), sampling rate: 16000

Label: ['house']
Shape: (48000,), sampling rate: 16000



## Data Preprocessing

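Let's preprocess the clips. The feature extractor normalizes each waveform and truncates it to at most 3 seconds, which is 48,000 samples at 16 kHz.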

In [14]:
max_duration = 3.0
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
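To make the two main effects of this call concrete, here is a rough NumPy sketch of truncation followed by zero-mean, unit-variance normalization for a single clip. It is an approximation under stated assumptions, not the library's code: `truncate_and_normalize`, the `eps` value and the synthetic clip are all made up for illustration, and padding and attention masks are left out.

```python
import numpy as np


def truncate_and_normalize(waveform, max_length, eps=1e-7):
    """Hypothetical helper: cut a clip to max_length samples, then
    rescale it to zero mean and (approximately) unit variance."""
    clipped = np.asarray(waveform, dtype=np.float32)[:max_length]
    return (clipped - clipped.mean()) / np.sqrt(clipped.var() + eps)


# Synthetic 5-second "clip" at 16 kHz with a deliberate offset and scale.
rng = np.random.default_rng(0)
fake_clip = rng.normal(loc=0.2, scale=0.5, size=5 * 16000)

features = truncate_and_normalize(fake_clip, max_length=int(16000 * 3.0))
print(features.shape, float(features.mean()), float(features.std()))
```

Whatever the loudness or offset of the original recording, the model sees inputs on a comparable scale and never longer than `max_duration` seconds.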


In [15]:
encoded_audio = dataset.map(preprocess_function, remove_columns="audio", batched=True)

In [40]:
num_labels = dataset["train"].features["label"].num_classes
num_labels

8

We can log in interactively so that the trained model can be uploaded to the Hub.

In [17]:
from huggingface_hub import login

login()

In [18]:
model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels
)



Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training Arguments Definition

Let's define all the training arguments.

In [41]:
model_name = model_id.split("/")[-1] + "-music_genre_classifier"
batch_size = 12
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    model_name,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    report_to='tensorboard',
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    fp16=True
)
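These settings determine how many optimizer steps the run takes and how long the learning-rate warmup lasts. The following back-of-the-envelope sketch assumes a single GPU (`num_devices = 1`, which the notebook does not state) and uses the train split size shown earlier; the exact counts reported by the `Trainer` may differ slightly.

```python
import math

num_train_examples = 3918        # size of the train split shown earlier
batch_size = 12
gradient_accumulation_steps = 1
num_train_epochs = 10
warmup_ratio = 0.1
num_devices = 1                  # assumption: training on a single GPU

# Number of examples consumed per optimizer step.
effective_batch = batch_size * gradient_accumulation_steps * num_devices

# The last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# Steps during which the learning rate ramps up towards 3e-5.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the warmup lasts roughly one epoch, and with `logging_steps=5` the training loss is logged many times per epoch.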





## Evaluation Function Definition

Let's define the metrics to evaluate our model.

In [42]:
# Load each metric once instead of on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")
precision_metric = evaluate.load("precision")


def compute_metrics(p):
    # p.predictions holds the logits; pick the highest-scoring class per clip.
    preds = np.argmax(p.predictions, axis=1)
    refs = p.label_ids

    accuracy = accuracy_metric.compute(predictions=preds, references=refs)["accuracy"]

    # Macro averaging gives every genre the same weight.
    f1_score = f1_metric.compute(predictions=preds, references=refs, average="macro")["f1"]
    recall = recall_metric.compute(predictions=preds, references=refs, average="macro")["recall"]
    precision = precision_metric.compute(predictions=preds, references=refs, average="macro")["precision"]

    return {
        "accuracy": accuracy,
        "F1": f1_score,
        "Recall": recall,
        "Precision": precision,
    }
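Macro averaging computes a score per class and then takes the unweighted mean, so every genre counts equally regardless of how many clips it has. A small hand-rolled sketch on toy labels illustrates this for F1. The helper `macro_f1` and the toy data are invented for this example, and classes with no predictions and no references are simply scored as 0 here.

```python
def macro_f1(references, predictions, num_classes):
    """Hypothetical helper: unweighted mean of per-class F1 scores."""
    per_class = []
    for c in range(num_classes):
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); scored as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / num_classes, per_class


# Toy example with three classes (not data from the notebook).
refs = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]

score, per_class = macro_f1(refs, preds, num_classes=3)
accuracy = sum(r == p for r, p in zip(refs, preds)) / len(refs)
print(per_class, score, accuracy)
```

Because each class contributes equally, a genre the model handles poorly pulls the macro score down even if it is rare, which accuracy alone would hide.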


## Model Training
Now we can train our model.

In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_audio["train"],
    eval_dataset=encoded_audio["test"],
    tokenizer=feature_extractor,  # saved together with the model checkpoints
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Recall | Precision |
|-------|---------------|-----------------|----------|----|--------|-----------|
| 1 | 0.1706 | 2.080968 | 0.726039 | 0.719655 | 0.719744 | 0.745987 |
| 2 | 0.0923 | 1.848292 | 0.731128 | 0.732166 | 0.737397 | 0.762104 |
| 3 | 0.1388 | 1.740142 | 0.749788 | 0.749553 | 0.750061 | 0.767742 |
| 4 | 0.0573 | 1.592242 | 0.793893 | 0.792314 | 0.792737 | 0.802891 |
| 5 | 0.0009 | 1.65483 | 0.788804 | 0.788668 | 0.791123 | 0.795365 |
| 6 | 0.0007 | 1.70275 | 0.783715 | 0.781873 | 0.787052 | 0.791813 |
| 7 | 0.0006 | 1.529715 | 0.807464 | 0.806755 | 0.808768 | 0.81113 |
| 8 | 0.0004 | 1.535364 | 0.810857 | 0.80883 | 0.807088 | 0.820173 |
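With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint with the highest evaluation accuracy is the one restored when training finishes. The sketch below only looks at the eight epochs logged above, out of the ten configured, and shows that picking by accuracy and picking by validation loss need not agree. The `log` list is copied by hand from the results and the variable names are illustrative.

```python
# (epoch, validation loss, accuracy) copied from the training log above.
log = [
    (1, 2.080968, 0.726039),
    (2, 1.848292, 0.731128),
    (3, 1.740142, 0.749788),
    (4, 1.592242, 0.793893),
    (5, 1.654830, 0.788804),
    (6, 1.702750, 0.783715),
    (7, 1.529715, 0.807464),
    (8, 1.535364, 0.810857),
]

# Selection rule matching metric_for_best_model="accuracy" (higher is better).
best_by_accuracy = max(log, key=lambda row: row[2])

# Alternative rule: lowest validation loss.
best_by_loss = min(log, key=lambda row: row[1])

print("best by accuracy: epoch", best_by_accuracy[0])
print("best by validation loss: epoch", best_by_loss[0])
```

The two criteria point to neighbouring epochs with very similar scores, so the choice of metric matters little in this run, though it can matter more when the curves diverge.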


