# Automatic Speech Recognition
---
This notebook is a guide to Automatic Speech Recognition (ASR): it fine-tunes OpenAI's [Whisper](https://huggingface.co/openai/whisper-tiny) on the [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) speech dataset. It was written as part of the certification process for the Hugging Face Audio Course.

In [None]:
## Adding necessary modules
!pip install transformers datasets[audio]

In [None]:
!pip install --upgrade evaluate jiwer


In [None]:
!pip install accelerate -U

In [2]:
from huggingface_hub import notebook_login

notebook_login()

## Data
---
The dataset contains speech in 14 language varieties across 14 contexts (the `intent_class` feature), so it can also be used for translation, transcription, and classification tasks.


In [3]:
##Downloading dataset from HuggingFace
from datasets import load_dataset, DatasetDict

minds = load_dataset("PolyAI/minds14",'en-US',split='train')



In [11]:
##Generating train_test_split
minds = minds.train_test_split(test_size=0.2)

In [18]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
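
The split sizes printed above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the test share is rounded up and the remainder goes to training. That assumption matches the 450/113 counts shown, but `split_sizes` is a hypothetical helper and not part of `datasets`.

```python
import math
from typing import Tuple


def split_sizes(num_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size.

    Hypothetical helper: assumes the test share is rounded up and the
    training split receives whatever is left.
    """
    test_rows = math.ceil(num_rows * test_size)
    train_rows = num_rows - test_rows
    return train_rows, test_rows


# 450 + 113 = 563 rows in total, taken from the DatasetDict printed above
train_rows, test_rows = split_sizes(563, 0.2)
print(train_rows, test_rows)
```

Because no seed is passed to `train_test_split`, the rows that land in each split will differ from run to run even though the sizes stay the same.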

## Model
---
The model is OpenAI's Whisper, a sequence-to-sequence (encoder-decoder) model pretrained for automatic speech recognition on English and multilingual speech data.

In [None]:
## Checking the codes of the supported languages
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

TO_LANGUAGE_CODE

In [15]:
## Downloading a preprocessor 
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="english", task="transcribe"
)

In [16]:
## Resampling the input data.
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
minds = minds.cast_column("audio", Audio(sampling_rate=sampling_rate))
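
To make the effect of resampling concrete, here is a naive linear-interpolation resampler in pure Python. It is only an illustration of the idea: `datasets` delegates resampling to an audio backend with proper filtering, `resample_linear` is a hypothetical helper, and the 8 kHz source rate is an assumption chosen for the example. The 16 kHz target is the rate Whisper's feature extractor expects.

```python
def resample_linear(samples, source_rate, target_rate):
    """Naively resample a list of floats by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied.
    """
    if not samples:
        return []
    n_out = int(round(len(samples) * target_rate / source_rate))
    out = []
    for i in range(n_out):
        # position of output sample i on the input sample axis
        pos = i * source_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8_000  # one second of silence at the assumed source rate
resampled = resample_linear(one_second_8k, 8_000, 16_000)
print(len(resampled))  # twice as many samples, still one second of audio
```

The duration of a clip is unchanged by resampling; only the number of samples per second changes.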

In [20]:
## Creating a preprocess function
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["transcription"],
    )

    # compute input length of audio sample in seconds
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    return example
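
The `processor` call above turns each waveform into log-mel input features. The sketch below estimates the shape of those features, assuming the usual Whisper front end for the tiny checkpoint: 16 kHz audio, padding or truncation to 30 seconds, a hop of 160 samples, and 80 mel bins. These constants are assumptions written out for illustration; the authoritative values live on `processor.feature_extractor`.

```python
SAMPLING_RATE = 16_000  # Hz (assumed)
CHUNK_SECONDS = 30      # fixed window length in seconds (assumed)
HOP_LENGTH = 160        # samples between consecutive frames (assumed)
N_MELS = 80             # mel bins (assumed)


def expected_feature_shape():
    """Return (mel_bins, frames) for one padded 30-second window."""
    n_samples = SAMPLING_RATE * CHUNK_SECONDS
    n_frames = n_samples // HOP_LENGTH
    return (N_MELS, n_frames)


print(expected_feature_shape())  # (80, 3000)
```

The feature extractor also appears to return a leading batch dimension of size 1 for each example, which would explain why the collator later indexes `feature["input_features"][0]`.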

In [21]:
## Mapping the preprocess function over the whole dataset
minds_encoded = minds.map(
    prepare_dataset, remove_columns=['path','english_transcription', 'intent_class', 'lang_id'], num_proc=1
)

In [22]:
## Filtering data so that it is no longer than 30 seconds
max_input_length = 30.0


def is_audio_in_length_range(length):
    return length < max_input_length

In [23]:
minds_encoded["train"] = minds_encoded["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

In [24]:
minds_encoded["train"]

Dataset({
    features: ['audio', 'transcription', 'input_features', 'labels', 'input_length'],
    num_rows: 446
})
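
The length filter above reduced the training split from 450 to 446 rows. The sketch below shows how a duration in seconds follows from a sample count and how the strict `<` comparison behaves at the 30-second boundary. The sample counts are made up for illustration, and `duration_seconds` and `keep` are hypothetical helpers.

```python
MAX_INPUT_LENGTH = 30.0  # seconds, same threshold as in the notebook


def duration_seconds(num_samples, sampling_rate):
    """Length of a clip in seconds."""
    return num_samples / sampling_rate


def keep(num_samples, sampling_rate=16_000):
    """Mirror of the filter: keep clips strictly shorter than the maximum."""
    return duration_seconds(num_samples, sampling_rate) < MAX_INPUT_LENGTH


# made-up sample counts at 16 kHz
examples = {"short": 160_000, "exactly_30s": 480_000, "long": 640_000}
for name, n in examples.items():
    print(name, duration_seconds(n, 16_000), keep(n))
```

A clip of exactly 30 seconds is dropped because the comparison is strict.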

### Data Collator
---
Data collators are objects that dynamically pad the examples in a batch to a common length. A collator for seq2seq speech recognition differs from the usual ones because the audio inputs are handled by the feature extractor and the labels by the tokenizer, so the two are padded separately.



In [25]:
##Defining the matching data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [26]:
##Instantiating the data_collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
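
The label handling in the collator can be illustrated without `torch`. This is a minimal sketch on plain Python lists, assuming non-empty label sequences; the token ids, `pad_id`, and `bos_id` are made up, and `pad_and_mask_labels` is a hypothetical helper rather than anything from `transformers`.

```python
IGNORE_INDEX = -100  # label value that the loss skips, as in the collator


def pad_and_mask_labels(label_seqs, pad_id, bos_id):
    """Pad label sequences to the longest one, mask padding, drop a shared bos."""
    max_len = max(len(seq) for seq in label_seqs)
    masked = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = seq + [pad_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        # replace every padded position with the ignore index
        masked.append(
            [tok if m == 1 else IGNORE_INDEX for tok, m in zip(padded, attention_mask)]
        )
    # if every sequence starts with bos, cut it off
    if all(row[0] == bos_id for row in masked):
        masked = [row[1:] for row in masked]
    return masked


batch = pad_and_mask_labels([[1, 7, 8, 9, 2], [1, 5, 2]], pad_id=0, bos_id=1)
print(batch)  # [[7, 8, 9, 2], [5, 2, -100, -100]]
```

Masking via the attention mask rather than by comparing against `pad_id` keeps genuine tokens intact even if the pad id also occurs inside a label sequence.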

### Metric
---
The metric is word error rate (WER), loaded below with `evaluate`. It is explained in detail in the README.
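
As a rough illustration of what the metric measures, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below scores a single sentence pair in pure Python; `word_error_rate` is a hypothetical helper, not the implementation used by `evaluate`, and the sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# one substitution ("my" -> "the") and one insertion ("now") over 7 reference words
print(word_error_rate("i would like to pay my bill", "i would like to pay the bill now"))
```

Since insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.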

In [27]:
## Downloading the suitable metric
import evaluate

metric = evaluate.load("wer")

In [28]:
## Defining a function to use the metric
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute orthographic wer
    wer_ortho = 100 * metric.compute(predictions=pred_str, references=label_str)

    # compute normalised WER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filtering step to only evaluate the samples that correspond to non-zero references:
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    # note: unlike wer_ortho, this value is left as a fraction (not scaled by 100)
    wer = metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"wer_ortho": wer_ortho, "wer": wer}
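
The difference between the orthographic and the normalised score comes from text normalisation. The sketch below uses a deliberately simple, hypothetical `simple_normalize` (lowercase, strip punctuation, squeeze whitespace) as a stand-in; it is not `BasicTextNormalizer`, and the sentences are made up. It also mirrors the step that drops pairs whose reference is empty after normalisation.

```python
import re


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in normaliser: lowercase, drop punctuation, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


reference = "Hello, I'd like to check my balance."
prediction = "hello i d like to check my balance"

print(reference == prediction)                                      # False
print(simple_normalize(reference) == simple_normalize(prediction))  # True

# (prediction, reference) pairs; the second reference is empty once normalised
pairs = [("hello there", "Hello there!"), ("noise", "...")]
normalized = [(simple_normalize(p), simple_normalize(r)) for p, r in pairs]
kept = [(p, r) for p, r in normalized if len(r) > 0]
print(kept)  # [('hello there', 'hello there')]
```

Dropping empty references avoids dividing by a reference length of zero when the score is computed.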

In [29]:
## Downloading the model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

In [30]:
## Defining the training goal
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, language="english", task="transcribe", use_cache=True
)
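
The `partial` wrapper above pre-binds keyword arguments so that every later call to `model.generate` uses them by default. The sketch below shows the mechanism on a toy function; `generate` here is a hypothetical stand-in that only echoes its settings, not the model's method.

```python
from functools import partial


def generate(prompt, language=None, task=None, use_cache=False):
    """Hypothetical stand-in for a generate method; returns its settings."""
    return {"prompt": prompt, "language": language, "task": task, "use_cache": use_cache}


# pre-bind the same keyword arguments as in the notebook
generate_en = partial(generate, language="english", task="transcribe", use_cache=True)

print(generate_en("audio-1"))
# a caller can still override a pre-bound keyword argument
print(generate_en("audio-1", use_cache=False)["use_cache"])  # False
```

This is why setting `model.config.use_cache = False` for training and passing `use_cache=True` for generation can coexist.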


### Training
---
The goal of this notebook is to reach a word error rate (the `wer` value reported by `compute_metrics`) below 0.37. Since the model is already pretrained on data similar to the input, that target is easy to reach.


In [36]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-minds14",  # name on the HF Hub
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=2,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
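
Two of the settings above are easier to read as numbers. The sketch below assumes that `constant_with_warmup` ramps the learning rate linearly from zero over `warmup_steps` and then holds it constant, and that training runs on a single device. `constant_with_warmup` is a hypothetical helper written for illustration, not the scheduler object the trainer builds.

```python
def constant_with_warmup(step, base_lr=1e-5, warmup_steps=50):
    """Learning rate at a given step: linear warmup, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


for step in (0, 25, 50, 500):
    print(step, constant_with_warmup(step))

# effective batch size = per-device batch size * accumulation steps * devices
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_devices = 1  # assumption: single GPU
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 16
```

Halving the per-device batch size while doubling `gradient_accumulation_steps` keeps the effective batch size at 16, which is what the inline comment in the arguments refers to.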

In [37]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=minds_encoded["train"],
    eval_dataset=minds_encoded["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

/kaggle/working/./whisper-tiny-minds14 is already a clone of https://huggingface.co/iammartian0/whisper-tiny-minds14. Make sure you pull the latest changes with `repo.git_pull()`.


In [38]:
trainer.train()



| Step | Training Loss | Validation Loss | Wer Ortho | Wer |
|------|---------------|-----------------|-----------|-----|
| 500 | 0.0009 | 0.569417 | 33.518519 | 0.337079 |


TrainOutput(global_step=500, training_loss=0.3531700435988605, metrics={'train_runtime': 1977.7699, 'train_samples_per_second': 4.045, 'train_steps_per_second': 0.253, 'total_flos': 1.9611403886592e+17, 'train_loss': 0.3531700435988605, 'epoch': 17.86})
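
The throughput and epoch figures in `TrainOutput` can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch counts as a step; the inputs are the 446 training rows, the batch size of 16, and the runtime reported above.

```python
import math

train_rows = 446           # rows left after the length filter
batch_size = 16            # per_device_train_batch_size
max_steps = 500
train_runtime = 1977.7699  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)  # 28
epochs = max_steps / steps_per_epoch
samples_per_second = max_steps * batch_size / train_runtime
steps_per_second = max_steps / train_runtime

print(round(epochs, 2), round(samples_per_second, 3), round(steps_per_second, 3))
# 17.86 4.045 0.253
```

With roughly 18 passes over 446 examples, the training loss of 0.0009 next to a validation loss of 0.569 suggests the model has largely memorised the training set.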

In [40]:
kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}

In [41]:
trainer.push_to_hub(**kwargs)

To https://huggingface.co/iammartian0/whisper-tiny-minds14
   e52dcee..b248b63  main -> main

To https://huggingface.co/iammartian0/whisper-tiny-minds14
   b248b63..c0afa26  main -> main



'https://huggingface.co/iammartian0/whisper-tiny-minds14/commit/b248b63952e5e20e25d8f4881f9d6a04aca41d98'