<a href="https://colab.research.google.com/github/iammartian0/Audio_Tasks/blob/main/Audio_classification/whisper_base_finetuned_gtzan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audio Classification
---
This notebook is a guide to audio classification by fine-tuning OpenAI's [Whisper](https://github.com/openai/whisper) on the [Music Genre](https://huggingface.co/datasets/marsyas/gtzan) (GTZAN) dataset. It is part of my certification for the [Audio Course](https://huggingface.co/learn/audio-course/chapter0/introduction) by HuggingFace.




In [None]:
## Downloading necessary Modules
!pip install transformers datasets[audio] evaluate

In [None]:
!pip install accelerate -U

The dataset contains music audio samples labelled with their genres. The labels are: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.

In [None]:
## Using Huggingface datasets
from datasets import load_dataset

gtzan = load_dataset("marsyas/gtzan", "all")
gtzan

In [None]:
## Exploring a sample
gtzan['train'][0]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/blues/blues.00000.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/blues/blues.00000.wav',
  'array': array([ 0.00732422,  0.01660156,  0.00762939, ..., -0.05560303,
         -0.06106567, -0.06417847]),
  'sampling_rate': 22050},
 'genre': 0}

In [None]:
## Splitting into train and test data (90% / 10%)
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
gtzan
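
To make the split concrete, here is a minimal sketch of a shuffled 90/10 split over indices. It is not the `datasets` implementation: the helper name `split_indices` is hypothetical, the rounding rule is an assumption, and NumPy's shuffle with seed 42 will not reproduce the same rows as the library. The total of 999 examples is taken from the split sizes reported later in this notebook (899 + 100).

```python
import math

import numpy as np


def split_indices(n_samples, test_size, seed):
    """Shuffle the indices 0..n_samples-1 and cut off a test fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    # Assumed rounding rule: round the test share up to a whole example.
    n_test = math.ceil(test_size * n_samples)
    return order[n_test:], order[:n_test]


train_idx, test_idx = split_indices(n_samples=999, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))  # 899 100
```

The two index sets are disjoint, so no clip ends up in both the training and the test split.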

In [None]:
## Checking the labels
id2label_fn = gtzan["train"].features["genre"].int2str
id2label_fn(gtzan["train"][0]["genre"])

'pop'

# Model
---
The model I am using is OpenAI's Whisper, which was originally developed for automatic speech recognition and speech translation. The HuggingFace transformers library provides many transformer architectures, each of which can be downloaded with a few lines of code.



Every transformer architecture expects its input in a specific format, so the raw audio must be preprocessed with the model's own feature extractor.

In [None]:
## Downloading the feature extractor
from transformers import WhisperFeatureExtractor
model_id="openai/whisper-base"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)




### One more crucial step in audio preprocessing is matching the sampling rate of the inputs to the one the model expects.

In [None]:
## Resampling the input sampling rate
from datasets import Audio

gtzan = gtzan.cast_column("audio", Audio(sampling_rate=16000))
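
To show what resampling does to the number of samples, here is a rough sketch using plain linear interpolation. This is only an illustration: `cast_column` relies on a proper audio resampler under the hood, and linear interpolation is a swapped-in stand-in that ignores anti-aliasing. The helper `resample_linear` and the 440 Hz test tone are hypothetical.

```python
import numpy as np


def resample_linear(wave, orig_sr, target_sr):
    """Resample a 1-D waveform by linear interpolation (illustration only)."""
    duration = len(wave) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(wave)) / orig_sr        # timestamps of the input samples
    t_target = np.arange(n_target) / target_sr     # timestamps of the output samples
    return np.interp(t_target, t_orig, wave)


# One second of a 440 Hz tone at the dataset's original rate of 22050 Hz.
tone = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
resampled = resample_linear(tone, orig_sr=22050, target_sr=16000)
print(len(tone), len(resampled))  # 22050 16000
```

The clip keeps its duration; only the number of samples per second changes.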

### After preprocessing, we can see that the feature extractor has also rescaled the data.

In [None]:
## Checking the distribution of raw data
import numpy as np

sample = gtzan["train"][11]["audio"]

print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}")

Mean: -0.000765, Variance: 0.0665
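
For reference, this is a minimal sketch of zero-mean, unit-variance normalization of a waveform, which is the kind of normalization a `do_normalize` flag usually refers to in audio feature extractors. Whether and where the Whisper feature extractor applies it depends on the library version, so treat this as an assumption. The helper name and the synthetic waveform are hypothetical stand-ins for a real sample.

```python
import numpy as np


def zero_mean_unit_var(wave, eps=1e-7):
    """Shift a waveform to mean 0 and scale it to variance (almost exactly) 1."""
    wave = np.asarray(wave, dtype=np.float64)
    return (wave - wave.mean()) / np.sqrt(wave.var() + eps)


# Synthetic stand-in for one second of audio at 16 kHz.
rng = np.random.default_rng(0)
raw = 0.25 * rng.standard_normal(16000) - 0.001

normed = zero_mean_unit_var(raw)
print(f"Mean: {normed.mean():.3}, Variance: {normed.var():.3}")
```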


In [None]:
## Checking the pre-processed data distribution
inputs = feature_extractor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_attention_mask=True,
    do_normalize=True,
)

print(f"inputs keys: {list(inputs.keys())}")

print(
    f"Mean: {np.mean(inputs['input_features']):.3}, Variance: {np.var(inputs['input_features']):.3}"
)

inputs keys: ['input_features', 'attention_mask']
Mean: 0.96, Variance: 0.0567
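
The statistics above describe log-mel features rather than a raw waveform. As a sketch of why their range is bounded, the snippet below applies the log scaling used in OpenAI's reference Whisper code as I understand it (log10, clamp to 8 below the maximum, then shift and divide by 4). The helper name is hypothetical and the random "power spectrogram" is a stand-in for a real one.

```python
import numpy as np


def whisper_style_log_scale(power_spec):
    """Log-compress a power spectrogram and squeeze it into a bounded range."""
    log_spec = np.log10(np.clip(power_spec, 1e-10, None))
    # Keep at most 8 orders of magnitude below the loudest bin.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0


# Stand-in input: 80 mel bins x 3000 frames, the shape Whisper uses for 30 s.
rng = np.random.default_rng(0)
power_spec = rng.random((80, 3000)) ** 4

features = whisper_style_log_scale(power_spec)
print(features.shape, float(features.max() - features.min()))
```

Because of the clamp, the spread between the smallest and largest feature value can never exceed 2.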


In [None]:
## Defining a Preprocess function
max_duration = 30.0


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
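
The `max_length` and `truncation` arguments above cap every clip at 30 seconds, i.e. 16000 × 30 = 480000 samples. A minimal sketch of that idea, with a hypothetical `pad_or_truncate` helper (the feature extractor handles this internally, so this is only an illustration):

```python
import numpy as np

SAMPLING_RATE = 16_000
MAX_DURATION = 30.0
MAX_SAMPLES = int(SAMPLING_RATE * MAX_DURATION)  # 480000 samples


def pad_or_truncate(wave, length=MAX_SAMPLES):
    """Cut a waveform to `length` samples, or zero-pad it on the right."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) >= length:
        return wave[:length]
    return np.pad(wave, (0, length - len(wave)))


too_long = np.ones(SAMPLING_RATE * 31, dtype=np.float32)   # 31 s clip
too_short = np.ones(SAMPLING_RATE * 29, dtype=np.float32)  # 29 s clip
print(pad_or_truncate(too_long).shape, pad_or_truncate(too_short).shape)
# (480000,) (480000,)
```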

In [None]:
## Mapping function to whole dataset
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=200,
    num_proc=1,
)
gtzan_encoded


DatasetDict({
    train: Dataset({
        features: ['genre', 'input_features', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_features', 'attention_mask'],
        num_rows: 100
    })
})
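
With `batched=True` and `batch_size=200`, the preprocessing function receives the examples in chunks instead of one at a time. A small sketch of how the 899 training examples fall into batches (the `iter_batches` helper is hypothetical and only mimics the slicing, not the library's internals):

```python
def iter_batches(n_examples, batch_size):
    """Yield (start, stop) index pairs covering n_examples in order."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)


train_batches = list(iter_batches(899, 200))
sizes = [stop - start for start, stop in train_batches]
print(sizes)  # [200, 200, 200, 200, 99]
```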

In [None]:
## Changing the target category name to 'label'
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

In [None]:
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
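
The same two mappings can be sketched without the dataset object, assuming the genre names are ordered as in the label list given earlier (blues = 0, ..., rock = 9). Note that the ids are stored as strings here, exactly as in the cell above.

```python
genres = [
    "blues", "classical", "country", "disco", "hiphop",
    "jazz", "metal", "pop", "reggae", "rock",
]

# String ids to names, and names back to string ids.
id2label = {str(i): name for i, name in enumerate(genres)}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label["7"], label2id["pop"])  # pop 7
```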

# WhisperForAudioClassification
---
Only the encoder part of Whisper is used for classification. The input features are passed
through the encoder, which learns meaningful representations of the audio; a
sequence classification head placed on top maps these representations to the respective labels.

This setup is available directly from the HuggingFace transformers library.
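
To make the classification head less abstract, here is a NumPy sketch of one common design: project each encoder frame to a smaller size, average over time, then apply a linear layer that outputs one logit per genre. All sizes and the random weights are assumptions for illustration (512 hidden units and 1500 frames are what I expect for whisper-base, 256 is an assumed projection size), and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, frames, hidden, proj, num_labels = 2, 1500, 512, 256, 10  # assumed sizes

# Stand-in for the encoder output: one hidden vector per audio frame.
encoder_states = rng.standard_normal((batch, frames, hidden))

# Randomly initialised head weights (hypothetical; the real ones are learned).
w_proj = rng.standard_normal((hidden, proj)) * 0.02
w_cls = rng.standard_normal((proj, num_labels)) * 0.02

projected = encoder_states @ w_proj   # (batch, frames, proj)
pooled = projected.mean(axis=1)       # average over time: (batch, proj)
logits = pooled @ w_cls               # (batch, num_labels)
predicted = logits.argmax(axis=-1)    # most likely genre id per clip

print(logits.shape, predicted.shape)  # (2, 10) (2,)
```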

In [None]:
from transformers import WhisperForAudioClassification

num_labels = len(id2label)

model = WhisperForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)


Some weights of the model checkpoint at openai/whisper-base were not used when initializing WhisperForAudioClassification: ['model.decoder.layers.2.self_attn.out_proj.bias', 'model.decoder.layers.4.fc2.bias', 'model.decoder.layers.3.self_attn.v_proj.weight', 'model.decoder.layers.2.encoder_attn.v_proj.weight', 'model.decoder.layers.5.self_attn.out_proj.weight', 'model.decoder.layers.0.self_attn.v_proj.bias', 'model.decoder.layers.3.encoder_attn_layer_norm.weight', 'model.decoder.layers.3.self_attn.out_proj.bias', 'model.decoder.layers.3.self_attn.v_proj.bias', 'model.decoder.layers.4.final_layer_norm.bias', 'model.decoder.layers.4.self_attn.v_proj.weight', 'model.decoder.layers.4.self_attn.k_proj.weight', 'model.decoder.layers.2.self_attn.v_proj.weight', 'model.decoder.layers.2.encoder_attn.v_proj.bias', 'model.decoder.layers.5.self_attn.q_proj.bias', 'model.decoder.layers.2.self_attn.v_proj.bias', 'model.decoder.layers.5.encoder_attn.v_proj.bias', 'model.decoder.layers.5.final_layer_n

If you want to save your model to the Hub, log in with your credentials and set the training argument push_to_hub=True.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


# Hyperparameters for Training
---
- Since I am using the free Colab GPU, I limited the batch size to 8. If you have a less powerful GPU, halve the batch_size argument and double
gradient_accumulation_steps, which keeps the effective batch size at 8.
- As I am working towards certification, my aim is to reach 87% accuracy, so I am training for 10 epochs.
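
The trade-off between batch size and gradient accumulation is simple arithmetic, sketched below. The helper name is hypothetical. The step count assumes one optimizer step per batch of 8 over the 899 training examples, which agrees with the `global_step=1130` reported in the training output further down.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, n_devices=1):
    """Number of examples that contribute to a single optimizer step."""
    return per_device_batch * accumulation_steps * n_devices


notebook_setting = effective_batch_size(8, 1)      # the values used here
smaller_gpu_setting = effective_batch_size(4, 2)   # half the batch, twice the accumulation

n_train = 899
steps_per_epoch = math.ceil(n_train / 8)  # the last batch is smaller
total_steps = steps_per_epoch * 10        # 10 epochs

print(notebook_setting, smaller_gpu_setting, total_steps)  # 8 8 1130
```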


In [None]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    weight_decay=0.02,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    push_to_hub=True,
)
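
`warmup_ratio=0.1` means the learning rate ramps up over the first 10% of the steps before it decays. The sketch below assumes the Trainer's default linear schedule and the 1130 total steps from the training output; the function name is hypothetical and this is not the library's code.

```python
def linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1130
warmup_steps = total_steps // 10  # warmup_ratio=0.1, using integer arithmetic
peak_lr = 5e-5

for step in (0, 56, 113, 600, 1130):
    lr = linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The peak of 5e-5 is reached at step 113, and the rate falls back to zero at the final step.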

### Since I have a predefined accuracy goal, I am using the 'accuracy' metric.

In [None]:
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
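
What the metric boils down to can be sketched in plain NumPy: take the arg-max of the logits for each clip and count how often it matches the label. The logits and labels below are made-up toy values, and `accuracy_from_logits` is a hypothetical helper, not part of `evaluate`.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))


# Toy example: 4 clips, 3 classes.
toy_logits = np.array([
    [2.0, 0.1, 0.3],   # predicts class 0
    [0.2, 1.5, 0.1],   # predicts class 1
    [0.1, 0.2, 0.9],   # predicts class 2
    [1.2, 0.3, 0.4],   # predicts class 0, but the label is 1
])
toy_labels = np.array([0, 1, 2, 1])

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```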


In [None]:
import torch

torch.cuda.empty_cache()


### Instantiating the Trainer class by passing all the necessary arguments.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

Cloning https://huggingface.co/iammartian0/whisper-base-finetuned-gtzan into local empty directory.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.1813 | 1.122397 | 0.62 |
| 2 | 0.6839 | 0.71117 | 0.78 |
| 3 | 0.4336 | 0.631242 | 0.8 |
| 4 | 0.1472 | 0.536638 | 0.83 |
| 5 | 0.1193 | 0.797263 | 0.8 |
| 6 | 0.008 | 0.504363 | 0.87 |
| 7 | 0.1485 | 0.705391 | 0.86 |
| 8 | 0.0155 | 0.614543 | 0.87 |
| 9 | 0.1364 | 0.603441 | 0.88 |
| 10 | 0.0017 | 0.587726 | 0.88 |
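
Because `load_best_model_at_end=True` is combined with `metric_for_best_model="accuracy"`, the checkpoint with the highest validation accuracy is the one restored after training. A minimal sketch of that selection over the results above, assuming a later epoch only replaces the current best when it is strictly better (my assumption about how ties are handled):

```python
# (epoch, validation accuracy) pairs copied from the training results.
history = [
    (1, 0.62), (2, 0.78), (3, 0.80), (4, 0.83), (5, 0.80),
    (6, 0.87), (7, 0.86), (8, 0.87), (9, 0.88), (10, 0.88),
]

best_epoch, best_accuracy = None, float("-inf")
for epoch, accuracy in history:
    if accuracy > best_accuracy:  # strict improvement, so ties keep the earlier epoch
        best_epoch, best_accuracy = epoch, accuracy

print(best_epoch, best_accuracy)  # 9 0.88
```

Either way, the 87% target is first met at epoch 6 and exceeded from epoch 9 onwards.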


TrainOutput(global_step=1130, training_loss=0.4074223043792675, metrics={'train_runtime': 1794.5103, 'train_samples_per_second': 5.01, 'train_steps_per_second': 0.63, 'total_flos': 2.58348736944e+17, 'train_loss': 0.4074223043792675, 'epoch': 10.0})

In [None]:
kwargs = {
    "dataset_tags": "marsyas/gtzan",
    "dataset": "GTZAN",
    "model_name": f"{model_name}-finetuned-gtzan",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
}

### Pushing the model to the HuggingFace Hub to save it


In [None]:
trainer.push_to_hub(**kwargs)


To https://huggingface.co/iammartian0/whisper-base-finetuned-gtzan
   d601c15..2a992af  main -> main


To https://huggingface.co/iammartian0/whisper-base-finetuned-gtzan
   2a992af..dbf749e  main -> main




'https://huggingface.co/iammartian0/whisper-base-finetuned-gtzan/commit/2a992afc0aa7ef19425baa7fb128a037016f8890'