# **Fine-tuning wav2vec 2.0 on the MSP-Podcast dataset**

In [None]:
model_checkpoint = "facebook/wav2vec2-base"
batch_size = 2

##### Installation

In [None]:
%%capture
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install librosa
!pip install --upgrade huggingface_hub

In [None]:
!pip show datasets

Name: datasets
Version: 3.6.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.11/dist-packages
Requires: dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, tqdm, xxhash
Required-by: evaluate


The dataset is stored in my private Hugging Face account, so we need to log in with credentials before using it. It cannot be shared, because I signed an agreement with the Interspeech 2025 organizers not to redistribute it.

In [None]:
from huggingface_hub import login

login()

Install Git LFS to upload your model checkpoints:

In [None]:
%%capture
!apt install git-lfs

## Fine-tuning a model on an audio classification task

### Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and the 🤗 Evaluate library to get the accuracy metric we need for evaluation. This is done with `load_dataset` from `datasets` and `load` from `evaluate`.

In [None]:
!pip install --upgrade datasets
!pip install --upgrade evaluate



In [None]:
from datasets import load_dataset


In [None]:
dataset = load_dataset("marbar16/podcastThesis", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/2.96k [00:00<?, ?B/s]

podcastThesis.py:   0%|          | 0.00/4.56k [00:00<?, ?B/s]

train2.tar.gz:   0%|          | 0.00/758M [00:00<?, ?B/s]

test2.tar.gz:   0%|          | 0.00/152M [00:00<?, ?B/s]

train_metadata_v2.csv:   0%|          | 0.00/1.37M [00:00<?, ?B/s]

test_metadata_v2.csv:   0%|          | 0.00/265k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains attributes and labels for the training and test set.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['filename', 'audio', 'path', 'category', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['filename', 'audio', 'path', 'category', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 1000
    })
})

In [None]:
dataset = dataset.rename_column("category", "label")
dataset

DatasetDict({
    train: Dataset({
        features: ['filename', 'audio', 'path', 'label', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['filename', 'audio', 'path', 'label', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 1000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["test"][10]

{'filename': 'MSP-PODCAST_2513_0131.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/fc1d185843ff54c04a961cae03e0904e9d2f7677d122fd932aef3b67587a97af/MSP-PODCAST_2513_0131.wav',
  'array': array([ 0.01077271,  0.00170898,  0.00320435, ..., -0.04727173,
         -0.04727173, -0.03009033]),
  'sampling_rate': 16000},
 'path': None,
 'label': 8,
 'arousal': 4.400000095367432,
 'valence': 2.799999952316284,
 'dominance': 4.400000095367432,
 'emotion_secondary': '["Angry", "Fear", "Disappointed", "Contempt", "Concerned", "Neutral"]',
 'transcript': "wasn't worth doing. and i think that this is an incredibly dangerous and irresponsible perspective for a field like computer science to have-",
 'speaker_id': '1643',
 'gender': 'Male'}

Let's explore the classes of our dataset:

In [None]:
dataset["train"].features["label"]

ClassLabel(names=['Neutral', 'Angry', 'Sad', 'Happy', 'Suprise', 'Fear', 'Disgust', 'Contempt', 'Other', 'No agreement'], id=None)

Let's create an `id2label` dictionary to decode them back to strings and see what they are. The inverse `label2id` will be useful too, when we load the model later.

In [None]:
labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

id2label["9"]

'No agreement'
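
Both dictionaries store the ids as strings, so an integer label taken from the dataset has to be converted with `str()` before the lookup. The following is a minimal sketch of that round trip, with the class names copied from the `ClassLabel` output above; `decode_label` is a hypothetical helper and not part of the notebook.

```python
# Class names copied from the ClassLabel output above (spelling as in the dataset).
labels = ["Neutral", "Angry", "Sad", "Happy", "Suprise", "Fear",
          "Disgust", "Contempt", "Other", "No agreement"]

# Same construction as in the notebook: ids are stored as strings.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}


def decode_label(label_id: int) -> str:
    """Hypothetical helper: map an integer class id to its name."""
    return id2label[str(label_id)]


print(decode_label(8))   # the test example shown earlier has label 8
print(8 in id2label)     # integer keys are not present, only string keys
```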

`Wav2Vec2` expects its input as a 1-dimensional array sampled at 16 kHz. This means that each audio file has to be loaded and, if necessary, resampled.

`datasets` does this automatically when the `audio` column is accessed.

In [None]:
dataset["test"][10]["audio"]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/fc1d185843ff54c04a961cae03e0904e9d2f7677d122fd932aef3b67587a97af/MSP-PODCAST_2513_0131.wav',
 'array': array([ 0.01077271,  0.00170898,  0.00320435, ..., -0.04727173,
        -0.04727173, -0.03009033]),
 'sampling_rate': 16000}

### Analysis of audio files

To get a sense of what the recordings sound like, the following code renders
some audio examples picked randomly from the dataset.
**Note**: Running it several times will give different clips, as we are choosing them randomly.

In [None]:
import random
from IPython.display import Audio, display

for _ in range(5):
    rand_idx = random.randint(0, len(dataset["train"])-1)
    example = dataset["train"][rand_idx]
    audio = example["audio"]

    print(f'Label: {id2label[str(example["label"])]}')
    print(f'Shape: {audio["array"].shape}, sampling rate: {audio["sampling_rate"]}')
    display(Audio(audio["array"], rate=audio["sampling_rate"]))
    print()

Label: Fear
Shape: (138432,), sampling rate: 16000



Label: Neutral
Shape: (100001,), sampling rate: 16000



Label: Happy
Shape: (140800,), sampling rate: 16000



Label: Happy
Shape: (104448,), sampling rate: 16000



Label: Fear
Shape: (102400,), sampling rate: 16000
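
The shapes above are sample counts, so dividing by the sampling rate gives the clip length in seconds. A small sketch of that arithmetic, using the counts printed in this particular run (a new run picks other clips); `duration_seconds` is a hypothetical helper name.

```python
def duration_seconds(num_samples: int, sampling_rate: int = 16_000) -> float:
    """Clip length in seconds for a mono waveform of num_samples samples."""
    return num_samples / sampling_rate


# Sample counts copied from the shapes printed above.
for n in (138432, 100001, 140800, 104448, 102400):
    print(f"{n} samples -> {duration_seconds(n):.2f} s")
```

All five clips in this run are between roughly 6 and 9 seconds long.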





### Preprocessing the data

Before we can feed those audio clips to our model, we need to preprocess them. This is done by a `FeatureExtractor` which will normalize the inputs and put them in a format the model expects, as well as generate the other inputs that the model requires.

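To do all of this, we instantiate our feature extractor with the `AutoFeatureExtractor.from_pretrained` method, which will ensure that we get a preprocessor that corresponds to the model architecture we want to use.

Before that, we hold out a validation set of roughly 800 examples from the training split, sampling each class in proportion to its share of the labels: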

In [None]:
from datasets import DatasetDict
import random
from collections import defaultdict

# Group indices by label
label2indices = defaultdict(list)
for i, example in enumerate(dataset['train']):
    label2indices[example['label']].append(i)

# Target total size of the validation set (split across classes below)
val_size = 800
total_samples = len(dataset['train'])
val_indices = []

# Calculate per-class sample counts proportional to label distribution
for label, indices in label2indices.items():
    n_samples = max(1, round(val_size * len(indices) / total_samples))  # at least 1 sample per class
    val_indices.extend(random.sample(indices, min(n_samples, len(indices))))

# Ensure unique indices and avoid duplicates
val_indices = list(set(val_indices))
train_indices = list(set(range(total_samples)) - set(val_indices))

# Select subsets
train_dataset = dataset['train'].select(train_indices)
val_dataset = dataset['train'].select(val_indices)

# Create new DatasetDict
new_dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    "test": dataset["test"]
})
new_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['filename', 'audio', 'path', 'label', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 4201
    })
    validation: Dataset({
        features: ['filename', 'audio', 'path', 'label', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 799
    })
    test: Dataset({
        features: ['filename', 'audio', 'path', 'label', 'arousal', 'valence', 'dominance', 'emotion_secondary', 'transcript', 'speaker_id', 'gender'],
        num_rows: 1000
    })
})
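
The validation split ends up with 799 rows rather than 800, presumably because each class share is rounded separately and the rounded shares need not sum to the target. The sketch below reproduces the same proportional allocation on toy labels. The function name, the toy class sizes, and the fixed seed are illustrative assumptions; the notebook cell above does not seed its random sampling.

```python
import random
from collections import defaultdict


def stratified_val_indices(labels, val_size, seed=0):
    """Pick about val_size indices, allocated to classes in proportion to their frequency."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for indices in by_label.values():
        # Proportional share, rounded, with at least one sample per class.
        n = max(1, round(val_size * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(n, len(indices))))
    return sorted(chosen)


# Toy labels (hypothetical): 70 of class 0, 25 of class 1, 5 of class 2.
toy_labels = [0] * 70 + [1] * 25 + [2] * 5
val = stratified_val_indices(toy_labels, val_size=20)
val_set = set(val)
train = [i for i in range(len(toy_labels)) if i not in val_set]
print(len(train), len(val))
```

Because sampling is done without replacement inside each class, the selected indices are already unique and the two subsets never overlap.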

In [None]:
new_dataset_dict["validation"][0]

{'filename': 'MSP-PODCAST_2996_0013.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/2b8060e8269de9213b72aed2ecbe4fa3f3a7db421eb88b55bec3d8334312ba0f/MSP-PODCAST_2996_0013.wav',
  'array': array([-0.0032959 , -0.00292969, -0.00289917, ..., -0.04141235,
         -0.04586792, -0.04632568]),
  'sampling_rate': 16000},
 'path': None,
 'label': 2,
 'arousal': 2.0,
 'valence': 2.799999952316284,
 'dominance': 3.0,
 'emotion_secondary': '["Surprise", "Sad", "Confused", "Disappointed", "Concerned", "Annoyed", "Frustrated"]',
 'transcript': 'four or five times over the years. i was a little surprised he put me on a list...',
 'speaker_id': '1669',
 'gender': 'Female'}

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
feature_extractor

preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]



Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

As we noticed earlier, the samples vary considerably in length. Since it is hard for the model to learn from clips longer than 10 s, we truncate them at this length.

In [None]:
max_duration = 10.0  # seconds

We can then write the function that will preprocess our samples. We just feed them to the `feature_extractor` with the argument `truncation=True`, as well as the maximum sample length. This will ensure that very long inputs can be safely batched.

In [None]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
    )
    return inputs
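
To make the two steps concrete, here is a plain-Python sketch of truncation to 10 s followed by zero-mean, unit-variance normalization. It only approximates what the feature extractor does with `truncation=True` and `do_normalize` enabled and is not its actual code; the function name, the `eps` value, and the synthetic ramp signal are assumptions made for illustration.

```python
SAMPLING_RATE = 16_000
MAX_DURATION = 10.0  # seconds, as in the notebook
MAX_LENGTH = int(SAMPLING_RATE * MAX_DURATION)  # 160000 samples


def truncate_and_normalize(waveform, max_length=MAX_LENGTH, eps=1e-7):
    """Cut the waveform to max_length samples, then scale to zero mean and unit variance."""
    clipped = list(waveform[:max_length])
    mean = sum(clipped) / len(clipped)
    var = sum((x - mean) ** 2 for x in clipped) / len(clipped)
    scale = (var + eps) ** 0.5  # eps guards against division by zero on silent clips
    return [(x - mean) / scale for x in clipped]


# Hypothetical 12-second clip: a repeating ramp standing in for real audio.
clip = [(i % 100) / 100.0 for i in range(12 * SAMPLING_RATE)]
out = truncate_and_normalize(clip)
print(len(clip), "->", len(out))
```

Clips shorter than the limit pass through with their original length, which is why the encoded examples still differ in size.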

The feature extractor returns a dictionary whose `input_values` entry is a list of NumPy arrays, one per example:

In [None]:
preprocess_function(new_dataset_dict['validation'][:5])

{'input_values': [array([-0.05354237, -0.04709264, -0.04655516, ..., -0.7248519 ,
       -0.8033237 , -0.8113858 ], dtype=float32), array([-0.14496697, -0.09746623, -0.06184068, ..., -0.00246475,
       -0.01730873, -0.06926267], dtype=float32), array([-2.3033160e-01, -2.3672777e-01, -1.1360151e-01, ...,
       -1.7658968e-02,  3.1285845e-03, -6.9500369e-05], dtype=float32), array([-0.00016025, -0.00071268, -0.00071268, ..., -0.0009889 ,
       -0.00071268, -0.00126512], dtype=float32), array([1.3141031e+00, 1.3575935e+00, 1.2583621e+00, ..., 2.0436413e-04,
       2.0436413e-04, 2.0436413e-04], dtype=float32)]}

In [None]:
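# Keep only 'label' and the new 'input_values'; drop everything else
columns_to_remove = [
    'filename', 'audio', 'path', 'arousal', 'valence', 'dominance',
    'emotion_secondary', 'transcript', 'speaker_id', 'gender',
]
encoded_dataset = new_dataset_dict.map(
    preprocess_function,
    remove_columns=columns_to_remove,
    batched=True,
)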
encoded_dataset

Map:   0%|          | 0/4201 [00:00<?, ? examples/s]

Map:   0%|          | 0/799 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_values'],
        num_rows: 4201
    })
    validation: Dataset({
        features: ['label', 'input_values'],
        num_rows: 799
    })
    test: Dataset({
        features: ['label', 'input_values'],
        num_rows: 1000
    })
})

### Training the model

Now that our data is ready, we can download the pretrained model and fine-tune it. For classification we use the `AutoModelForAudioClassification` class. Like with the feature extractor, the `from_pretrained` method will download and cache the model for us. As the label ids and the number of labels are dataset dependent, we pass `num_labels`, `label2id`, and `id2label` alongside the `model_checkpoint` here:

In [None]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)




pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_name = model_checkpoint.split("/")[-1]
batch_size=2
args = TrainingArguments(
    f"{model_name}-finetuned-ser",
    eval_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    report_to="none"
)

Here we set evaluation and checkpoint saving to run at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook together with gradient accumulation, and set the number of training epochs and the warmup ratio. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best checkpoint it saved (according to `metric_for_best_model`, here accuracy) at the end of training.
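
The interaction between batch size, gradient accumulation, and epochs can be worked out by hand. The sketch below does this arithmetic for the values used here, under two stated assumptions: training runs on a single device, and a trailing incomplete accumulation group does not count as an optimizer update. With those assumptions the total matches the `global_step=2625` reported by the training run further down. Rounding the warmup length up is likewise an assumption.

```python
import math

train_samples = 4201            # size of the training split after holding out validation
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 5
warmup_ratio = 0.1

# Samples contributing to each optimizer update (single device assumed).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Batches per epoch; the last batch may be smaller than the others.
batches_per_epoch = math.ceil(train_samples / per_device_batch_size)

# Assumption: a trailing, incomplete accumulation group is not counted as an update.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

# Assumption: the warmup length is the given ratio of total updates, rounded up.
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
```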

Next, we need to define a function that computes the metrics from the predictions, using the accuracy `metric` loaded in the next cell. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [None]:
from evaluate import load

metric = load("accuracy")


model.safetensors:   0%|          | 0.00/380M [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
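
For intuition, the same computation can be written without any library: pick the highest-scoring class per row and count the matches. This is a sketch on made-up logits, not the implementation behind `metric.compute`; the helper names and the toy values are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, references):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical logits for 4 clips and 3 classes, with their true labels.
toy_logits = [
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.3, 0.2, 1.5],    # highest score at index 2
    [-0.5, 0.9, 0.4],   # highest score at index 1
    [1.1, 1.0, 0.2],    # highest score at index 0
]
toy_labels = [0, 2, 0, 0]
print(accuracy(toy_logits, toy_labels))
```

Three of the four toy predictions match their labels, so the function returns 0.75.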

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    processing_class=feature_extractor,  # recent transformers releases deprecate `tokenizer=` in favor of this
    compute_metrics=compute_metrics
)



Now we can fine-tune our model by calling the `train` method:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.8426,1.806702,0.34418
2,1.693,1.79646,0.361702
3,1.9211,1.719588,0.380476
4,1.6324,1.673034,0.434293


TrainOutput(global_step=2625, training_loss=1.75256514558338, metrics={'train_runtime': 4528.056, 'train_samples_per_second': 4.639, 'train_steps_per_second': 0.58, 'total_flos': 1.3227995084877297e+18, 'train_loss': 1.75256514558338, 'epoch': 4.991908614945264})

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

{'eval_loss': 1.6730341911315918,
 'eval_accuracy': 0.43429286608260326,
 'eval_runtime': 65.7042,
 'eval_samples_per_second': 12.161,
 'eval_steps_per_second': 6.088,
 'epoch': 4.991908614945264}
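
To put the number in context, the reported accuracy can be converted back into a count of correctly classified validation clips and compared with uniform random guessing over the ten classes. This is only a rough reference point: if the classes are imbalanced, a majority-class baseline would be higher than uniform chance, and the class distribution is not shown here.

```python
num_classes = 10
validation_size = 799
eval_accuracy = 0.43429286608260326  # value reported by trainer.evaluate()

# Number of validation clips classified correctly.
correct = round(eval_accuracy * validation_size)

# Expected accuracy of a classifier that guesses uniformly at random.
uniform_chance = 1 / num_classes

print(correct, uniform_chance)
```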