## Fine-tuning Wav2Vec2 Speech Models (by Facebook) using Hugging Face

In `1.data_analysis_and_processing.ipynb`, we examined the distribution of the data, built an audio classification dataset, and pushed it to the Hugging Face Hub.

In this notebook, we write a script to fine-tune `Wav2Vec2` models on that dataset.

### NOTE:

- I fine-tuned the models on a Kaggle GPU and did not hit any errors there, so I suggest using a Kaggle GPU.

- This code has not been adapted for CPU, so please run it only on a GPU.
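
Since the notebook assumes a GPU, it can help to check for one before running anything expensive. A minimal sketch, assuming PyTorch is the backend; the helper name `gpu_available` is hypothetical and not part of the original notebook:

```python
import importlib.util


def gpu_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch

    return torch.cuda.is_available()


if not gpu_available():
    print("No CUDA GPU detected; this notebook is intended to run on a GPU.")
```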

In [23]:
%%capture
!pip install datasets
!pip install transformers
!pip install librosa
!pip install evaluate
!apt install git-lfs

In [18]:
# imports
from pathlib import Path

import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

from datasets import load_dataset

# import evaluate for loading metrics
import evaluate

#### Loading audio dataset

In [4]:
audio_dataset = load_dataset("MuhammadIqbalBazmi/intent-dataset")
audio_dataset

Using custom data configuration MuhammadIqbalBazmi--intent-dataset-3a947b6d9cc17a8c


Downloading and preparing dataset None/None (download: 17.09 MiB, generated: 17.19 MiB, post-processed: Unknown size, total: 34.28 MiB) to C:/Users/modassir/.cache/huggingface/datasets/MuhammadIqbalBazmi___parquet/MuhammadIqbalBazmi--intent-dataset-3a947b6d9cc17a8c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to C:/Users/modassir/.cache/huggingface/datasets/MuhammadIqbalBazmi___parquet/MuhammadIqbalBazmi--intent-dataset-3a947b6d9cc17a8c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 112
    })
    test: Dataset({
        features: ['audio', 'label'],
        num_rows: 48
    })
})

In [5]:
# sample of the audio dataset
audio_dataset["train"][0]

{'audio': {'path': None,
  'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00131226,
         -0.00140381, -0.00042725]),
  'sampling_rate': 16000},
 'label': 1}
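
Each sample carries the raw waveform and its sampling rate, so a clip's duration is the number of samples divided by the sampling rate. A small sketch using a synthetic array in place of a real clip; the helper `duration_seconds` is hypothetical:

```python
import numpy as np


def duration_seconds(audio: dict) -> float:
    """Length of a decoded audio sample in seconds."""
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for audio_dataset["train"][0]["audio"]: 3 s of silence at 16 kHz
fake_audio = {"path": None, "array": np.zeros(48_000), "sampling_rate": 16_000}
print(duration_seconds(fake_audio))  # 3.0
```

Applying the same helper to every row is one way to confirm the longest clip length before choosing a maximum audio length for preprocessing.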

In [7]:
# labels in the dataset
audio_dataset["train"].features["label"].names

['battery',
 'Running_operating_cost',
 'Locate_Dealer',
 'casual_talk_greeting',
 'Top_speed',
 'casual_talk_goodbye',
 'About_iQube',
 'bike_modes',
 'book_now']

#### Log in to huggingface_hub to push the fine-tuned model to the hub


In [6]:
from huggingface_hub import notebook_login
notebook_login()

#### Preprocessing data

##### Creating label2id and id2label

In [8]:
label2id, id2label = dict(), dict()
labels = audio_dataset["train"].features["label"].names
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

id2label["5"]

'casual_talk_goodbye'
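
Both mappings use string ids, so an integer class index has to be converted with `str()` before it is looked up. The sketch below rebuilds the mappings from the label list shown earlier and decodes a made-up prediction (`predicted_index` is hypothetical):

```python
# Label names copied from the dataset's ClassLabel feature shown earlier
labels = [
    "battery", "Running_operating_cost", "Locate_Dealer",
    "casual_talk_greeting", "Top_speed", "casual_talk_goodbye",
    "About_iQube", "bike_modes", "book_now",
]

label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Every label survives the round trip name -> id -> name
assert all(id2label[label2id[name]] == name for name in labels)

# Decoding a hypothetical predicted class index
predicted_index = 2
print(id2label[str(predicted_index)])  # Locate_Dealer
```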

In [9]:
# initialize model checkpoint name
model_checkpoint = "facebook/wav2vec2-base"

# You can use many other models like below
# model_checkpoint="facebook/wav2vec2-large-960h"
# model_checkpoint="facebook/wav2vec2-large-xlsr-53"
# model_checkpoint="facebook/wav2vec2-xlsr-53-espeak-cv-ft" # multilingual
# model_checkpoint="facebook/wav2vec2-xls-r-300m" # multilingual, 128 lang, 436K hours of speech
# model_checkpoint="facebook/wav2vec2-large-960h-lv60-self" # 
# model_checkpoint="facebook/wav2vec2-conformer-rel-pos-large-960h-ft"
### Unable to train these large models
# model_checkpoint="facebook/wav2vec2-xls-r-1b" # multilingual (1 billion params), 128 lang, 436K hours of speech
# model_checkpoint="facebook/wav2vec2-xls-r-2b" # multilingual (2 billion params), 128 lang, 436K hours of speech

##### Loading feature Extractor

In [10]:
# loading Feature Extractor
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
feature_extractor

Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

##### Creating preprocessing function

In [12]:
# Creating preprocessing function
max_audio_len = 8  # seconds; by observation, the longest audio in the dataset is 8 seconds
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        # max_length (in samples) is only applied when truncation (or padding="max_length") is enabled
        max_length=int(feature_extractor.sampling_rate * max_audio_len),
        # truncation=True, # Uncomment to truncate longer audio to max_length
        # padding=True, # Uncomment to pad shorter audio to the longest clip in the batch
    )
    return inputs
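
With `max_audio_len = 8` and a 16 kHz sampling rate, `max_length` works out to 128,000 samples. The NumPy sketch below only illustrates what truncating and right-padding to that length mean; it is not the feature extractor's implementation, and `fit_to_length` is a hypothetical helper:

```python
import numpy as np

SAMPLING_RATE = 16_000
MAX_AUDIO_LEN = 8  # seconds
MAX_LENGTH = SAMPLING_RATE * MAX_AUDIO_LEN  # 128000 samples


def fit_to_length(array: np.ndarray, max_length: int = MAX_LENGTH, padding_value: float = 0.0) -> np.ndarray:
    """Truncate longer clips and right-pad shorter ones so every clip has max_length samples."""
    array = array[:max_length]  # truncation: drop samples past max_length
    missing = max_length - len(array)
    return np.pad(array, (0, missing), constant_values=padding_value)  # right padding


short_clip = np.ones(SAMPLING_RATE * 2)   # 2 s clip
long_clip = np.ones(SAMPLING_RATE * 10)   # 10 s clip
print(fit_to_length(short_clip).shape, fit_to_length(long_clip).shape)  # (128000,) (128000,)
```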

##### Preprocess the audio dataset

We can use the `.map()` method to apply the feature extraction logic in `preprocess_function()` to every split of the dataset.

In [13]:
encoded_audio_dataset = audio_dataset.map(preprocess_function, remove_columns=["audio"], batched=True)
encoded_audio_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_values'],
        num_rows: 112
    })
    test: Dataset({
        features: ['label', 'input_values'],
        num_rows: 48
    })
})

We now have two columns, `input_values` (the feature extractor's default output name) and `label`. The `audio` column was removed because it is no longer needed.
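
With `batched=True`, `.map()` passes the function a dict that maps each column name to a list of values, which is why `preprocess_function` iterates over `examples["audio"]`. A toy sketch of that batch layout, with made-up values and no `datasets` dependency:

```python
# A toy batch in the column-oriented layout that a batched map function receives
examples = {
    "audio": [
        {"path": None, "array": [0.0, 0.1, -0.1], "sampling_rate": 16_000},
        {"path": None, "array": [0.2, 0.0], "sampling_rate": 16_000},
    ],
    "label": [1, 4],
}

# Same extraction step as in preprocess_function
audio_arrays = [x["array"] for x in examples["audio"]]
print(len(audio_arrays), [len(a) for a in audio_arrays])  # 2 [3, 2]
```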

### Fine-tuning the model

In [15]:
# metric used for evaluation and best-model selection; set it to "f1" to balance precision and recall
metric = "accuracy"

##### Loading model

In [14]:
from transformers import AutoModelForAudioClassification
from transformers import TrainingArguments
from transformers import Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForSequenceClassification: ['project_q.bias', 'quantizer.weight_proj.bias', 'project_hid.weight', 'project_hid.bias', 'project_q.weight', 'quantizer.weight_proj.weight', 'quantizer.codevectors']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector

##### Initializing training arguments

In [19]:
model_name = model_checkpoint.split("/")[-1]
batch_size = 2  # use a smaller batch size if you run out of memory
args = TrainingArguments(
    f"{model_name}-intent-classification-ori-{metric}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    save_total_limit = 2, # Will save just two models (best and current)
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=45, # no.of epochs
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model=f"eval_{metric}",
#     logging_dir="./logs", # Uncomment to save logs, and specify the directory for same 
    push_to_hub=False, # make True, if you want to push it to the hub
)
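
A back-of-the-envelope sketch of what these settings imply for the 112-row training split. It assumes a single GPU, and the `Trainer`'s exact step counts may differ slightly between library versions:

```python
import math

train_rows = 112
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 45
warmup_ratio = 0.1
num_devices = 1  # assumption: a single GPU

# Samples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(train_rows / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_updates = updates_per_epoch * num_train_epochs
warmup_updates = round(total_updates * warmup_ratio)  # approximate length of the warmup phase

print(effective_batch_size, updates_per_epoch, total_updates)  # 8 14 630
```

So although each forward pass sees only 2 clips, every weight update averages gradients over 8, and roughly the first tenth of the updates is spent ramping the learning rate up to `3e-5`.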

##### Initialize `compute_metrics()` function

In [20]:
def compute_metrics(eval_pred):
    """
    Compute the metric named by the global `metric` variable on the evaluation predictions.

    Accuracy takes no `average` argument; other metrics (e.g. f1, precision, recall)
    are micro-averaged across classes.
    """
    metric_type = metric  # metric is initialized earlier (global variable)
    metric_loaded = evaluate.load(metric_type)
    predictions = np.argmax(eval_pred.predictions, axis=-1)  # logits -> predicted class ids
    references = eval_pred.label_ids

    if metric_type != "accuracy":
        # micro-averaging pools the counts of all classes before computing the score
        result = metric_loaded.compute(predictions=predictions, references=references, average="micro")
    else:
        result = metric_loaded.compute(predictions=predictions, references=references)

    return result
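
To make the metric concrete, here is a plain-NumPy sketch on made-up logits. It is an illustration of the arithmetic rather than the `evaluate` implementation. It also shows that for single-label classification, micro-averaged F1 equals accuracy, because each misclassification counts as one false positive and one false negative:

```python
import numpy as np

# Toy logits for 4 samples and 3 classes (made-up numbers)
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
])
references = np.array([0, 1, 1, 0])

predictions = np.argmax(logits, axis=-1)  # [0, 1, 2, 0]
accuracy = float(np.mean(predictions == references))

# Micro-averaged F1: pool true positives, false positives and false negatives over all classes
classes = np.unique(np.concatenate([predictions, references]))
tp = sum(int(np.sum((predictions == c) & (references == c))) for c in classes)
fp = sum(int(np.sum((predictions == c) & (references != c))) for c in classes)
fn = sum(int(np.sum((predictions != c) & (references == c))) for c in classes)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(accuracy, micro_f1)  # 0.75 0.75
```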


##### Utility code to free the GPU Cache


In [22]:
%%capture
!pip install GPUtil

# https://www.kaggle.com/getting-started/140636
# clear CUDA cache
import torch
from GPUtil import showUtilization as gpu_usage
from numba import cuda

def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()

    torch.cuda.empty_cache()

    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)

    print("GPU Usage after emptying the cache")
    gpu_usage()

if torch.cuda.is_available():
    free_gpu_cache()

##### Initializing Trainer

In [24]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_audio_dataset["train"],
    eval_dataset=encoded_audio_dataset["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

##### Fine-tune the model (finally)

Train on a GPU only; otherwise you are likely to run into errors.
This code has not been optimized or adapted for CPU training.

__[NOTE]__:
You might get the error below:
```
RuntimeError: expected scalar type Long but found Int
```

In [None]:
trainer.train()

##### Evaluate the model on test dataset

In [None]:
trainer.evaluate()

##### Push the fine-tuned model to hub

In [None]:
trainer.push_to_hub()