### Step 1: Installation of Required Libraries

In this step, we install the Python libraries needed to preprocess the data and to train and evaluate the model. Here's a breakdown of each library and its purpose:

- `datasets[audio]`: This is an extension of the `datasets` library by Hugging Face, specifically enhanced with audio functionalities. It provides tools for loading, preprocessing, and handling audio datasets, which is essential for speech recognition tasks.

- `transformers`: Also from Hugging Face, this library offers a vast range of pre-trained models for Natural Language Processing (NLP) and speech tasks. It includes support for models like Whisper, which are tailored for audio and speech processing.

- `accelerate`: This library simplifies running machine learning models on different hardware setups (like CPUs, GPUs, and TPUs) without deep knowledge of the underlying frameworks. It helps in speeding up the model training process.

- `evaluate`: A Hugging Face library for computing model performance metrics. For speech recognition, the usual metric is Word Error Rate (WER).

- `jiwer`: A Python package for evaluating speech recognition output. It computes the Word Error Rate (WER) and related measures, and the `wer` metric in `evaluate` depends on it.

- `tensorboard`: A visualization toolkit from the TensorFlow project that also works without TensorFlow. It plots training quantities such as loss over time, which helps with debugging and tuning the training run.

- `gradio`: A library for building simple, customizable UI components around machine learning models. It allows users to create interfaces for human interaction with models, useful for demonstration and testing purposes.

By installing these libraries, we ensure that all the necessary tools for handling, training, and evaluating the speech recognition model are available in the environment.
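
Since two of these libraries are used to compute Word Error Rate, it helps to see what the metric measures. The following is a minimal pure-Python sketch of WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is illustrative only, and `jiwer` and `evaluate` apply their own text handling that this sketch does not reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "descend flight level one hundred"
hypothesis = "descend flight level one thousand"
print(word_error_rate(reference, hypothesis))  # one substitution in five words -> 0.2
```

A WER of 0.0 means the hypothesis matches the reference word for word. Values above 1.0 are possible when the hypothesis contains many inserted words.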


In [None]:
!pip install --upgrade datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

### Step 2: Importing and Logging into Hugging Face Hub

In this step, we prepare for interactions with the Hugging Face Hub by importing the necessary module and logging into the service. Here's what each part of the code does:

- `import huggingface_hub`: This imports the `huggingface_hub` library into the Python environment. The Hugging Face Hub is a central repository where pre-trained models, datasets, and other machine learning resources are shared. Importing this library allows you to interact with the Hub, such as downloading models or uploading your trained models.

- `huggingface_hub.login()`: This function authenticates the session with your Hugging Face account. In a notebook environment like Google Colab, it displays a widget that asks for an access token, which you can create at https://huggingface.co/settings/tokens and paste into the field. Authentication is required for operations that need permissions, like uploading models or accessing private models and datasets.

This step is crucial for leveraging the vast resources available on the Hugging Face Hub and for sharing your work with the community.
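
For non-interactive runs, the token can be supplied from an environment variable instead of the prompt. The sketch below assumes the token is stored in `HF_TOKEN`; the helper names `get_hub_token` and `login_to_hub` are hypothetical and not part of `huggingface_hub`. The only library call used is `huggingface_hub.login`, with its optional `token` argument.

```python
import os
from typing import Optional


def get_hub_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_to_hub() -> None:
    """Log in with the environment token if present, else fall back to the prompt."""
    # Imported here so that get_hub_token has no third-party dependency.
    import huggingface_hub

    token = get_hub_token()
    if token is not None:
        huggingface_hub.login(token=token)
    else:
        huggingface_hub.login()
```

Calling `login_to_hub()` in place of `huggingface_hub.login()` keeps the interactive behaviour as a fallback.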


In [None]:
import huggingface_hub

huggingface_hub.login()


## Imports

### Step 3: Importing Necessary Libraries and Classes for Model Training

This step involves importing various classes and libraries that are essential for processing the audio data, setting up the model, and organizing the training workflow. Here's a detailed explanation of each import:

- `from datasets import load_dataset, DatasetDict`: These functions from the `datasets` library are used to load datasets and manipulate them as `DatasetDict` objects, which organize datasets into training, validation, and test splits for easy access.

- `from transformers import (WhisperTokenizer, WhisperProcessor, WhisperFeatureExtractor, WhisperForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer)`:
  - `WhisperTokenizer`: A tokenizer that is specialized for processing the text associated with audio data in the Whisper model format.
  - `WhisperProcessor`: Combines the tokenizer and feature extractor into a single class for handling the complete preprocessing of both audio and text data.
  - `WhisperFeatureExtractor`: Converts raw audio into the log-Mel spectrogram features that the Whisper model takes as input.
  - `WhisperForConditionalGeneration`: The Whisper model class tailored for tasks like speech recognition, where the output is generated text based on input audio.
  - `Seq2SeqTrainingArguments`: Configuration class that sets up the training parameters such as learning rate, batch size, and the number of training epochs.
  - `Seq2SeqTrainer`: Handles the training loop, making it easier to train sequence-to-sequence models like Whisper without extensive boilerplate code.

- `from datasets import Audio`: Imports the `Audio` class from `datasets`, which provides utilities for working with audio data within the dataset.

- `from dataclasses import dataclass`: A decorator for creating data classes which are classes that are primarily used to store data with little to no business logic.

- `from typing import Any, Dict, List, Union`: These are type hinting classes from Python’s `typing` module, used to specify the type of variables, return types of functions, and arguments.

- `import torch`: This is PyTorch, a popular deep learning library that provides comprehensive tools and libraries for building neural network models.

- `import evaluate`: Imports the `evaluate` library which provides a standardized interface to evaluate the performance of models using various metrics, facilitating consistent and comparable assessment across different models and tasks.

By importing these libraries and classes, the notebook is equipped with all the necessary tools to preprocess the data, configure, train, and evaluate the Whisper model effectively.

In [None]:
from datasets import load_dataset, DatasetDict
from transformers import (
    WhisperTokenizer,
    WhisperProcessor,
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
from datasets import Audio
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch
import evaluate

### Step 4: Configuration of Model and Output Settings

In this step, we set up several important variables that define the configuration for the model training:

- `model_id = 'openai/whisper-small'`: This specifies the identifier for the pre-trained model we intend to use from Hugging Face's model hub. In this case, 'openai/whisper-small' refers to the smaller version of OpenAI's Whisper model. This model is a popular choice for tasks involving speech recognition due to its efficiency and relatively smaller size, making it faster to fine-tune and deploy.

- `out_dir = 'whisper_small_atco2'`: This is the directory path where the fine-tuned model and any related output files will be saved. Naming the output directory `whisper_small_atco2` suggests that the results are specific to the ATCO2 dataset, a collection presumably related to air traffic control communications, which the Whisper model is being fine-tuned on.

- `epochs = 10`: This sets the number of training epochs, where an epoch is one complete pass through the training dataset. Ten epochs is a starting point rather than a tuned value; with a small training set, watch the validation metrics for signs of overfitting.

These settings are crucial as they directly affect how the model is initialized, trained, and where the trained model is stored. Such configurations are essential for managing the training process and ensuring that the outputs are organized and easy to retrieve for evaluation or deployment.
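
To get a feel for how much training `epochs = 10` implies, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below is a simplified calculation that assumes a single device, no gradient accumulation, and a partial final batch that still counts as a step. The batch size of 32 is taken from the commented-out line in the configuration cell, and the 50 examples are the training subset loaded in Step 5.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Estimate optimizer steps: batches per epoch (rounded up) times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs


# 50 examples at batch size 32 give 2 batches per epoch, so 10 epochs give 20 steps.
print(total_optimizer_steps(num_examples=50, batch_size=32, epochs=10))  # 20
```

With so few steps, evaluation and logging intervals need to be small for any intermediate results to appear.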


In [None]:
model_id = 'openai/whisper-small'
out_dir = 'whisper_small_atco2'
epochs = 10
# batch_size = 32

## Load Dataset

### Step 5: Loading the Dataset

In this step, we load subsets of a dataset for training and validation using the Hugging Face `datasets` library. The dataset, `jlvdoorn/atco2-asr-atcosim`, contains Air Traffic Control (ATC) recordings paired with transcripts for Automatic Speech Recognition (ASR); as its name suggests, it combines the ATCO2-ASR and ATCOSIM corpora. Here’s what each line of code does:

- `atc_dataset_train = load_dataset('jlvdoorn/atco2-asr-atcosim', split='train[:50]')`: This line loads the first 50 samples from the training split of the dataset. Each record holds an `audio` entry, its `text` transcript, and an `info` string. Specifying `train[:50]` limits the data to the first 50 entries, which suits initial testing, quick iterations, or demonstrations where training on the full dataset would be computationally expensive or unnecessary.

- `atc_dataset_valid = load_dataset('jlvdoorn/atco2-asr-atcosim', split='validation[:10]')`: Similarly, this line loads the first 10 samples from the validation split of the dataset. Validation datasets are crucial for evaluating the model during and after training, providing a way to measure the model's performance on data it hasn't seen during training. Here, only 10 samples are used, likely for quick validation checks.

These subsets provide a manageable amount of data for developing and fine-tuning the Whisper model, ensuring that the model can be trained and validated quickly, which is especially useful during the development phase to tweak parameters and settings without the overhead of handling a full dataset.



In [None]:
atc_dataset_train = load_dataset('jlvdoorn/atco2-asr-atcosim', split='train[:50]')
atc_dataset_valid = load_dataset('jlvdoorn/atco2-asr-atcosim', split='validation[:10]')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/942 [00:00<?, ?B/s]

(…)-00000-of-00005-c6681348ac8543dc.parquet:   0%|          | 0.00/406M [00:00<?, ?B/s]

(…)-00001-of-00005-464e7b29cac82caf.parquet:   0%|          | 0.00/407M [00:00<?, ?B/s]

(…)-00002-of-00005-008f85162351773d.parquet:   0%|          | 0.00/401M [00:00<?, ?B/s]

(…)-00003-of-00005-13846616069619e5.parquet:   0%|          | 0.00/387M [00:00<?, ?B/s]

(…)-00004-of-00005-0565e63298f50d49.parquet:   0%|          | 0.00/419M [00:00<?, ?B/s]

(…)-00000-of-00002-7a5ea3756991bf72.parquet:   0%|          | 0.00/260M [00:00<?, ?B/s]

(…)-00001-of-00002-56cef56513136770.parquet:   0%|          | 0.00/245M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8092 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2026 [00:00<?, ? examples/s]

In [None]:
print(atc_dataset_train)
print(atc_dataset_valid)

Dataset({
    features: ['audio', 'text', 'info'],
    num_rows: 50
})
Dataset({
    features: ['audio', 'text', 'info'],
    num_rows: 10
})


In [None]:
print(atc_dataset_train[0])

{'audio': {'path': 'LKPR_RUZYNE_Radar_120_520MHz_20201025_091112.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -6.10351562e-05, -6.10351562e-05, -6.10351562e-05]), 'sampling_rate': 16000}, 'text': 'Oscar Kilo Papa Mike Bravo descend flight level one hundred level one hundred Oscar Kilo Papa Mike Bravo ', 'info': 'LKPR\nPraha Ruzyne\nRadar\nAKEVA ARVEG BAGRU BAROX BAVIN BEKVI ELMEK ELPON ERASU EVEMI KENOK KUVIX LETNA RATEV RISUK SOMIS SULOV TIPRU UTORO\nBLA131 BLA1XQ BTI7PY CTN480 DLH3NL DLH9TP ETD72E EWG6HP FIN1DH IRA711 KLM44K MLD863 MLD864 OKHBT OKLLZ OKMHZ OKPHM OKWUS17 OKYAI14 RYR1JU RYR4945 SXS7D THY32B THY6577 TIE790J UAE73  \nAll Charter Air Baltic Croatia Lufthansa Etihad Eurowings Finn Iranair Klm Moldova Oklahoma Okapi Alfa Ryan Sunexpress Turkish Time Emirates'}


## Feature Extractor and Tokenizer

### Step 6: Initializing the Feature Extractor

In this step, we initialize the `WhisperFeatureExtractor` from the pre-trained model specified earlier (`model_id = 'openai/whisper-small'`). The feature extractor is a crucial component in preparing audio data for the model. Here’s what this line of code accomplishes:

- `feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)`: This function call loads the feature extractor that is associated with the 'openai/whisper-small' model from Hugging Face's model hub. The feature extractor is configured to convert raw audio files into a format that the Whisper model can understand and process efficiently.

The `WhisperFeatureExtractor` handles the following preprocessing tasks:
- **Padding and truncation**: Each clip is padded with silence or cut so that it covers exactly 30 seconds, the fixed input length Whisper expects.
- **Feature Transformation**: The padded waveform is converted into a log-Mel spectrogram, which is the representation the model was trained on.

It does not resample audio. The input must already be at the 16 kHz rate the model expects, which is why the sampling rate is passed along with each call and why the audio column is cast to 16 kHz in Step 9.

By using a pre-trained feature extractor, we leverage the exact preprocessing steps that were used when the Whisper model was originally trained, ensuring compatibility and optimal performance. This is essential for maintaining the integrity and effectiveness of the model's input processing pipeline.
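
The size of the features produced for each clip follows from a few constants. The sketch below assumes the values published for the Whisper small configuration (16 kHz audio, 30-second chunks, a hop length of 160 samples, 80 Mel bins); these are assumptions to check against the attributes of the loaded `feature_extractor`, not values read from it.

```python
# Assumed Whisper small preprocessing constants.
SAMPLING_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30       # every clip is padded or truncated to this duration
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # Mel frequency bins per frame


def whisper_input_shape(
    sampling_rate: int = SAMPLING_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
    hop_length: int = HOP_LENGTH,
    n_mels: int = N_MELS,
) -> tuple:
    """Return (mel bins, frames) for one padded clip."""
    n_samples = sampling_rate * chunk_seconds  # 480,000 samples
    n_frames = n_samples // hop_length         # one frame every 10 ms
    return (n_mels, n_frames)


print(whisper_input_shape())  # (80, 3000)
```

Because every clip is brought to the same duration first, a 4-second ATC transmission and a 25-second one yield features of the same shape.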



In [None]:
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)

preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

### Step 7: Initializing the Tokenizer

In this step, we initialize the `WhisperTokenizer` with specific configurations suitable for our task. This tokenizer is used to convert textual data associated with audio files into a format that can be processed by the model. Here's an explanation of the code:

- `tokenizer = WhisperTokenizer.from_pretrained(model_id, language='English', task='transcribe')`: This line loads the `WhisperTokenizer` for the model specified by `model_id` ('openai/whisper-small'). The tokenizer is set up with additional parameters to optimize it for transcribing English language audio.

The parameters specified are:
- **language='English'**: Makes the tokenizer insert the English language token into the prefix it adds to every label sequence, so the model is trained to produce English text.
- **task='transcribe'**: Makes the tokenizer insert the transcription task token into that prefix. Whisper's other task, `translate`, produces English translations of non-English speech instead.

The `WhisperTokenizer` works in both directions: it encodes the reference transcripts into token IDs that serve as training labels, and it decodes the token IDs generated by the model back into text.

Using the tokenizer that ships with the pre-trained checkpoint keeps the vocabulary and special tokens identical to those the model was trained with.
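
The effect of `language` and `task` is easiest to see in the layout of a label sequence. The sketch below is a pure-Python illustration using readable strings instead of integer IDs and whole words instead of subword tokens; `build_label_tokens` is a hypothetical helper, not a `transformers` function, and the language code `en` stands for the English language token.

```python
def build_label_tokens(text_tokens, language="en", task="transcribe", timestamps=False):
    """Wrap text tokens in the special-token prefix and end marker Whisper uses."""
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Signals that the sequence contains no timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix + list(text_tokens) + ["<|endoftext|>"]


labels = build_label_tokens(["descend", "flight", "level", "one", "hundred"])
print(labels[:4])   # the prefix selected by language and task
print(labels[-1])   # the end-of-text marker
```

Changing `task` to `translate` or `language` to another code changes only the prefix, which is how a single model is steered between behaviours.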


In [None]:
tokenizer = WhisperTokenizer.from_pretrained(model_id, language='English', task='transcribe')

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

### Step 8: Initializing the Processor

In this step, we initialize the `WhisperProcessor`, which combines the functionalities of both the `WhisperFeatureExtractor` and the `WhisperTokenizer`. This integrated approach simplifies the handling of both audio and text data during model training and inference. Here’s what the line of code achieves:

- `processor = WhisperProcessor.from_pretrained(model_id, language='English', task='transcribe')`: This function call loads a `WhisperProcessor` for the model identified by `model_id` ('openai/whisper-small') and configures its tokenizer for English transcription.

The parameters specified are:
- **language='English'**: Forwarded to the tokenizer, where it selects the English language token. The audio feature extraction does not depend on the language.
- **task='transcribe'**: Also forwarded to the tokenizer, where it selects the transcription task token.

The `WhisperProcessor` plays a crucial role by encapsulating the preprocessing steps required for both the audio input and text output. This dual functionality ensures that:
- The audio is processed into features that are suitable for the model to interpret.
- The text generated by the model is tokenized and formatted correctly for subsequent evaluation or use.

Using a pre-trained processor that is specifically configured for the task and language ensures that the model receives data in the optimal format for training and inference, thereby enhancing the model’s efficiency and effectiveness.

In [None]:
processor = WhisperProcessor.from_pretrained(model_id, language='English', task='transcribe')

## Prepare Data

### Step 9: Casting Audio Data to Uniform Sampling Rate

This step involves modifying the audio data in the dataset to ensure it has a consistent sampling rate, which is crucial for the model to process the audio effectively. Here’s how the code achieves this:

- `atc_dataset_train = atc_dataset_train.cast_column('audio', Audio(sampling_rate=16000))`: This line declares that the 'audio' column of the training dataset should be decoded at 16,000 Hz. The `Audio` class from the `datasets` library specifies the target rate, and `cast_column` returns a dataset in which each clip is resampled when it is accessed, if its original rate differs.

- `atc_dataset_valid = atc_dataset_valid.cast_column('audio', Audio(sampling_rate=16000))`: Similarly, this line ensures that the audio in the validation dataset is also resampled to 16,000 Hz.

#### Why 16,000 Hz?
The choice of 16,000 Hz as a sampling rate is standard in the field of speech recognition because it balances the quality of the audio with computational efficiency. This rate is sufficient to capture the frequencies most important for understanding human speech while not being so high as to require excessive computational resources to process.

#### Importance of Consistent Sampling Rate
Having a consistent sampling rate across all audio files is essential for a few reasons:
- **Model Compatibility**: Ensures that all audio files are compatible with the expectations of the Whisper model, which is trained to process audio at this specific rate.
- **Feature Extraction Consistency**: Helps maintain uniformity in the features extracted from the audio, which is important for the model to learn effectively.
- **General Performance**: Prevents issues related to audio processing and improves the overall reliability and accuracy of the model’s predictions.

By applying these changes to the dataset, we standardize the input for the model, enhancing the reliability of the training and validation process.
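
To make the idea of resampling concrete, the sketch below changes the rate of a short signal by linear interpolation. This is a deliberately naive illustration: `resample_linear` is a hypothetical helper, it is not the algorithm `datasets` uses, and it omits the low-pass filtering that proper downsampling needs to avoid aliasing.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a list of floats from orig_sr to target_sr by linear interpolation."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(round(len(samples) * target_sr / orig_sr))
    resampled = []
    for i in range(n_out):
        # Position of output sample i on the input time axis.
        position = i * orig_sr / target_sr
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


# Halving the rate keeps every second sample of this ramp.
print(resample_linear([0.0, 1.0, 2.0, 3.0], orig_sr=4, target_sr=2))  # [0.0, 2.0]

# One second of audio at 16 kHz can represent frequencies up to 8 kHz (the Nyquist limit).
print(16_000 // 2)  # 8000
```

The key point is that the number of samples per second changes while the duration of the clip stays the same.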

In [None]:
atc_dataset_train = atc_dataset_train.cast_column('audio', Audio(sampling_rate=16000))
atc_dataset_valid = atc_dataset_valid.cast_column('audio', Audio(sampling_rate=16000))

In [None]:
print(atc_dataset_train[0])

{'audio': {'path': 'LKPR_RUZYNE_Radar_120_520MHz_20201025_091112.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -6.10351562e-05, -6.10351562e-05, -6.10351562e-05]), 'sampling_rate': 16000}, 'text': 'Oscar Kilo Papa Mike Bravo descend flight level one hundred level one hundred Oscar Kilo Papa Mike Bravo ', 'info': 'LKPR\nPraha Ruzyne\nRadar\nAKEVA ARVEG BAGRU BAROX BAVIN BEKVI ELMEK ELPON ERASU EVEMI KENOK KUVIX LETNA RATEV RISUK SOMIS SULOV TIPRU UTORO\nBLA131 BLA1XQ BTI7PY CTN480 DLH3NL DLH9TP ETD72E EWG6HP FIN1DH IRA711 KLM44K MLD863 MLD864 OKHBT OKLLZ OKMHZ OKPHM OKWUS17 OKYAI14 RYR1JU RYR4945 SXS7D THY32B THY6577 TIE790J UAE73  \nAll Charter Air Baltic Croatia Lufthansa Etihad Eurowings Finn Iranair Klm Moldova Oklahoma Okapi Alfa Ryan Sunexpress Turkish Time Emirates'}


### Step 10: Defining the Dataset Preparation Function

This step defines a function `prepare_dataset` that converts one record of the dataset into the inputs and labels the Whisper model trains on. Here's what happens within the function:

- `def prepare_dataset(batch):`: This defines a function that takes one record as input. Despite the parameter name `batch`, the function receives a single example, because Step 11 calls `map` without `batched=True`.

- `audio = batch['audio']`: This line extracts the audio data from the example. The `audio` item is a dictionary containing both the audio array (`'array'`) and its sampling rate (`'sampling_rate'`).

- `batch['input_features'] = feature_extractor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]`: This line passes the audio array through the `WhisperFeatureExtractor` to obtain the log-Mel spectrogram features. The `sampling_rate` argument tells the feature extractor what rate the array has, so it can be checked against the rate the model expects. The extractor returns a batch of size one, so `[0]` selects the features of this example, which are stored under the key `'input_features'`.

- `batch['labels'] = tokenizer(batch['text']).input_ids`: Here, the text associated with the audio is tokenized using the `WhisperTokenizer`. The tokenizer converts the text into a sequence of input IDs, which are integers representing each token. These IDs are then stored under the key `'labels'`.

- `return batch`: Finally, the modified example, now containing both the input features (audio data processed into a form the model can understand) and labels (tokenized text data), is returned.

#### Purpose of the Function
The `prepare_dataset` function is crucial for transforming the raw data in the dataset into a format that can be directly fed into the Whisper model for training. It ensures that:
- The audio is correctly preprocessed to extract meaningful features the model can learn from.
- The corresponding text is tokenized to provide a target for the model to predict during training, facilitating the learning of how to transcribe spoken content into text.

By applying this function to all data in the training and validation datasets, we ensure consistency and compatibility with the model's requirements, paving the way for effective training and accurate model performance.
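
The notebook's function reads `feature_extractor` and `tokenizer` from global variables. One possible variation, sketched below, passes them in explicitly so the per-example logic can be exercised without loading a model. `make_prepare_fn` and the two `fake_*` stand-ins are hypothetical names used only for illustration; the stand-ins mimic the attribute names of the real objects, not their behaviour.

```python
from types import SimpleNamespace


def make_prepare_fn(feature_extractor, tokenizer):
    """Build a per-example preparation function bound to the given components."""
    def prepare_example(example):
        audio = example["audio"]
        features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
        # The extractor returns a batch of one, so take its only element.
        example["input_features"] = features.input_features[0]
        example["labels"] = tokenizer(example["text"]).input_ids
        return example
    return prepare_example


# Hypothetical stand-ins: they return placeholder values, not real features or token IDs.
def fake_feature_extractor(array, sampling_rate):
    return SimpleNamespace(input_features=[[len(array), sampling_rate]])


def fake_tokenizer(text):
    return SimpleNamespace(input_ids=[len(word) for word in text.split()])


prepare = make_prepare_fn(fake_feature_extractor, fake_tokenizer)
example = {"audio": {"array": [0.0, 0.1, 0.2], "sampling_rate": 16000}, "text": "Oscar Kilo"}
result = prepare(example)
print(result["input_features"], result["labels"])  # [3, 16000] [5, 4]
```

With the real objects, `make_prepare_fn(feature_extractor, tokenizer)` would return a function usable in place of `prepare_dataset`.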

In [None]:
def prepare_dataset(batch):
    audio = batch['audio']

    batch['input_features'] = feature_extractor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]

    batch['labels'] = tokenizer(batch['text']).input_ids

    return batch

### Step 11: Applying the Preparation Function to the Dataset

This step applies the `prepare_dataset` function defined in Step 10 to both the training and validation datasets using the `map` method, which calls the function once per example. Here’s what each line of code does:

- `atc_dataset_train = atc_dataset_train.map(prepare_dataset, num_proc=4)`: This line applies `prepare_dataset` to every example of the training dataset. The parameter `num_proc=4` splits the dataset across four worker processes that run in parallel, which shortens preprocessing on large datasets.

- `atc_dataset_valid = atc_dataset_valid.map(prepare_dataset, num_proc=4)`: Similarly, this line applies `prepare_dataset` to every example of the validation dataset with the same parallelization, so the validation data is processed in the same way as the training data.

#### Importance of Parallel Processing
The `num_proc=4` argument affects speed only, not the result:
- **Efficiency**: Several examples are processed at the same time, so the total preparation time drops when enough CPU cores are available. For subsets as small as the 50 and 10 examples used here, the cost of starting the worker processes can outweigh the gain.
- **Scalability**: The benefit grows with dataset size, since the work is divided among the workers.

#### Consistency Across Datasets
Applying the same preparation function to both training and validation datasets ensures consistency in how data is handled. This is crucial for fair evaluation and reliable performance metrics, as both datasets undergo the same preprocessing steps and transformations.

By the end of this step, both datasets are fully preprocessed and in the correct format for training and validating the model, ensuring that the training pipeline can proceed efficiently and effectively.
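
To illustrate how work is divided among worker processes, the sketch below computes shard sizes for a contiguous split. It assumes that any remainder goes to the first shards; `contiguous_shard_sizes` is a hypothetical helper, and the exact split used internally by `datasets` may differ.

```python
def contiguous_shard_sizes(num_examples: int, num_shards: int) -> list:
    """Split num_examples into num_shards contiguous shards of near-equal size."""
    base, remainder = divmod(num_examples, num_shards)
    # The first `remainder` shards each take one extra example.
    return [base + 1 if index < remainder else base for index in range(num_shards)]


print(contiguous_shard_sizes(50, 4))  # [13, 13, 12, 12]
print(contiguous_shard_sizes(10, 4))  # [3, 3, 2, 2]
```

Each worker handles only about a dozen training examples here, which shows why parallelism matters little at this scale.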

In [None]:
atc_dataset_train = atc_dataset_train.map(
    prepare_dataset,
    num_proc=4
)

atc_dataset_valid = atc_dataset_valid.map(
    prepare_dataset,
    num_proc=4
)

Map (num_proc=4):   0%|          | 0/50 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/10 [00:00<?, ? examples/s]

## Data Collator

### Step 12: Defining a Custom Data Collator

In this step, we define a custom data collator class, `DataCollatorSpeechSeq2SeqWithPadding`, designed to handle the batch preparation necessary for training the Whisper model. This class is essential for ensuring that batches of data are correctly formatted and padded before being fed into the model during training. Here's a breakdown of the class and its method:

#### Class Definition and Initialization
- `@dataclass`: This decorator is used to define a class that primarily stores data. It automatically generates special methods like `__init__()` based on the fields defined in the class.
- `processor: Any`: This field stores an instance of `WhisperProcessor`, which will handle both audio feature extraction and text tokenization.
- `decoder_start_token_id: int`: This field stores the ID of the start token used by the decoder in the model, crucial for correct sequence generation during training.

#### `__call__` Method
This method is called when an instance of the data collator is used as a function. It performs several key steps:
- `input_features = [{'input_features': feature['input_features']} for feature in features]`: Collects the precomputed input features from each example in the batch.
- `batch = self.processor.feature_extractor.pad(input_features, return_tensors='pt')`: Combines the input features into a single PyTorch tensor. Whisper features already share a fixed length, so little or no actual padding is needed here.
- `label_features = [{'input_ids': feature['labels']} for feature in features]`: Collects the labels (tokenized text data) from each example in the batch.
- `labels_batch = self.processor.tokenizer.pad(label_features, return_tensors='pt')`: Pads the label sequences to the length of the longest one in the batch and converts them into tensors.
- `labels = labels_batch['input_ids'].masked_fill(labels_batch.attention_mask.ne(1), -100)`: Replaces the padded positions (where the attention mask is not 1) with `-100`, the index that the loss function ignores.
- `if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item()`: Checks whether every label sequence in the batch begins with the decoder start token.
- `labels = labels[:, 1:]`: If the above condition is true, it removes that first token. The model prepends the decoder start token itself when it shifts the labels to build the decoder inputs, so keeping it in the labels would duplicate it.

#### Purpose and Impact
The custom data collator is crucial for:
- **Handling Variable Lengths**: Ensures that all input feature sequences and label sequences are padded to the same length, which is necessary for batch processing in neural networks.
- **Optimizing Training**: The adjustments made to labels help in aligning them correctly for the loss calculations, thereby improving the training process.

By using this data collator, we ensure that the model receives well-prepared batches of data, which is essential for efficient and effective training.
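
The label handling can be traced on plain Python lists. The sketch below mirrors the three ideas in the collator (padding with `-100`, removing a shared start token, and the right shift that turns labels into decoder inputs) without using tensors. The token IDs, including `START` and `PAD`, are made-up small integers, and the helper names are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this value are skipped by the loss


def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    """Pad every sequence to the longest length, marking padding as ignored."""
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


def strip_start_token(labels, start_id):
    """Drop the first token only if every sequence begins with start_id."""
    if all(row and row[0] == start_id for row in labels):
        return [row[1:] for row in labels]
    return labels


def shift_right(labels, start_id, pad_id, ignore_index=IGNORE_INDEX):
    """Build decoder inputs: prepend start_id, drop the last label, unmask padding."""
    shifted = []
    for row in labels:
        inputs = [start_id] + row[:-1]
        shifted.append([pad_id if token == ignore_index else token for token in inputs])
    return shifted


START, PAD = 1, 0  # made-up IDs for illustration
labels = pad_labels([[START, 7, 8, 9], [START, 5]])
print(labels)          # [[1, 7, 8, 9], [1, 5, -100, -100]]
labels = strip_start_token(labels, START)
print(labels)          # [[7, 8, 9], [5, -100, -100]]
decoder_inputs = shift_right(labels, START, PAD)
print(decoder_inputs)  # [[1, 7, 8], [1, 5, 0]]
```

The last line shows why the start token is removed from the labels: the shift adds it back at the front of the decoder inputs, so at each position the model sees the previous token and is asked to predict the label at that position.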

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{'input_features': feature['input_features']} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors='pt')

        label_features = [{'input_ids': feature['labels']} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors='pt')

        labels = labels_batch['input_ids'].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch['labels'] = labels

        return batch

## Whisper Model

### Step 13: Loading the Pretrained Whisper Model

This step involves initializing the Whisper model specifically designed for conditional generation tasks, such as speech recognition. Here's how the code works:

- `model = WhisperForConditionalGeneration.from_pretrained(model_id)`: This line of code loads a pretrained version of the Whisper model identified by `model_id` ('openai/whisper-small'). The `WhisperForConditionalGeneration` class is a variant of the Whisper model that is optimized for generating one sequence from another, which is typical in tasks where the model needs to generate textual output from audio input.

#### Model Features and Capabilities
- **Conditional Generation**: This model is capable of generating text based on the conditions set by the input features. In the case of Whisper, it means generating transcriptions from spoken audio.
- **Pretrained**: The model comes pretrained on a diverse range of languages and domains, which provides a strong foundation for further fine-tuning on specific tasks like transcribing air traffic control communications.

#### Benefits of Using a Pretrained Model
- **Speed up Development**: By using a model that has already been trained on a large dataset, you significantly reduce the time and resources required for training.
- **Improved Performance**: Pretrained models often perform better out of the box compared to models trained from scratch, especially on tasks similar to those on which the model was originally trained.
- **Adaptability**: Although pretrained on diverse data, these models can be fine-tuned on specific datasets (like ATCO2 in this case) to adapt to the nuances of a particular application.

By loading this model, we prepare the foundation for fine-tuning it on the ATCO2 dataset, aiming to enhance its ability to accurately transcribe specific audio data related to air traffic control.

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(model_id)

config.json:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.87k [00:00<?, ?B/s]

### Step 14: Configuring the Model for Specific Generation Task

In this step, we adjust specific configuration settings within the Whisper model to tailor it for the transcription task. These settings optimize the model's generation behavior to better suit the needs of the specific application. Here's what each line of code achieves:

- `model.generation_config.task = 'transcribe'`: Sets the generation task to transcription. Whisper supports two tasks, `transcribe` (text in the spoken language) and `translate` (text translated into English). This setting selects the first, so that generation conditions the decoder on the transcription task.

- `model.generation_config.forced_decoder_ids = None`: Clears any forced decoder tokens from the generation configuration. `forced_decoder_ids` is a list of `[position, token_id]` pairs that pins specific tokens (for Whisper, typically the language and task tokens) at fixed positions in the output. Setting it to `None` leaves those positions to be determined by the `task` setting and the model's own predictions rather than by a hard-coded prompt.

#### Purpose and Impact of These Configurations
- **Task Selection**: Setting the task to 'transcribe' makes generation produce text in the spoken language rather than an English translation, which is what the WER comparison against the reference transcripts requires.
- **No Hard-Coded Prompt**: Clearing `forced_decoder_ids` avoids pinning language and task tokens inherited from the pretrained checkpoint, so the decoder prompt is built from the current generation settings.

These adjustments align the model's generation behavior with the transcription of air traffic control communications before fine-tuning begins.

In [None]:
model.generation_config.task = 'transcribe'

model.generation_config.forced_decoder_ids = None
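
To make the effect of forced decoder IDs concrete, here is a toy sketch that does not call Transformers. It assumes the `[position, token_id]` pair format that appears in the warning logged later in this notebook. The function `greedy_decode` and the score table are hypothetical, and the token IDs are treated as opaque numbers.

```python
def greedy_decode(step_scores, start_token_id, forced_decoder_ids=None):
    """Toy greedy decoder: forced positions override the highest-scoring token.

    step_scores: list of {token_id: score} dicts, one per generated position.
    forced_decoder_ids: list of [position, token_id] pairs, or None.
    Position 0 is the start token, so generated tokens begin at position 1.
    """
    forced = dict(forced_decoder_ids or [])
    tokens = [start_token_id]
    for scores in step_scores:
        position = len(tokens)
        if position in forced:
            # A forced ID wins regardless of what the scores prefer
            tokens.append(forced[position])
        else:
            tokens.append(max(scores, key=scores.get))
    return tokens


# Hypothetical scores; at positions 1 and 2 the "model" prefers a different token
step_scores = [
    {50259: 0.2, 50265: 0.8},
    {50359: 0.4, 50358: 0.6},
    {50363: 0.9, 50364: 0.1},
    {1000: 0.7, 2000: 0.3},
]

free_tokens = greedy_decode(step_scores, start_token_id=50258)
forced_tokens = greedy_decode(
    step_scores,
    start_token_id=50258,
    forced_decoder_ids=[[1, 50259], [2, 50359], [3, 50363]],
)
print(free_tokens)
print(forced_tokens)
```

With the forced pairs, positions 1 to 3 are fixed and only position 4 onward reflects the scores. With `None`, every position follows the scores.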

### Step 15: Initializing the Data Collator with Model-Specific Configurations

In this step, we initialize the custom data collator `DataCollatorSpeechSeq2SeqWithPadding` with specific settings derived from the model's configuration. This initialization is crucial for preparing the batches of data in a way that aligns with the model's expectations during training. Here’s a breakdown of the code:

- `data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id)`: This line creates an instance of the `DataCollatorSpeechSeq2SeqWithPadding` class. The data collator is configured with:
  - `processor`: The `WhisperProcessor` instance created earlier. This processor handles the necessary feature extraction and tokenization for both the input audio and target text.
  - `decoder_start_token_id`: This is set to the `decoder_start_token_id` from the model’s configuration. This ID is crucial as it signifies the start of a new output sequence in the model's decoder, helping to correctly format the labels for training.

#### Purpose of the Data Collator in the Training Pipeline
- **Batch Preparation**: Ensures that each batch of data is processed and padded correctly, aligning the input features and labels to the format required by the model. This includes padding sequences to a uniform length and setting up labels in a way that the model's loss function can effectively use them.
- **Integration with Model Configuration**: By using the model's `decoder_start_token_id`, the collator aligns the prepared batches with the specific needs of the model's decoder, ensuring that the training process is seamless and efficient.

The configuration of the data collator with model-specific parameters ensures that the input data is correctly processed to maximize training effectiveness and model performance during the fine-tuning process.

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
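
The collator class itself is defined earlier in the notebook, so the following is only a sketch of the label handling such a collator commonly performs, not the notebook's implementation. It assumes labels are padded to a common length with the ignore value `-100` and that a leading decoder start token is dropped when every sequence already has one. The helper name `pad_and_mask_labels` and the token IDs are hypothetical.

```python
def pad_and_mask_labels(label_seqs, decoder_start_token_id, ignore_index=-100):
    """Pad label sequences to equal length using ignore_index.

    Padded positions carry ignore_index so that the loss skips them.
    If every sequence begins with the decoder start token, it is removed,
    on the assumption that the model adds it again when building decoder inputs.
    """
    max_len = max(len(seq) for seq in label_seqs)
    padded = [
        list(seq) + [ignore_index] * (max_len - len(seq))
        for seq in label_seqs
    ]
    if all(seq and seq[0] == decoder_start_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


# Two hypothetical tokenized transcripts of different lengths
labels = [[50258, 11, 12, 13], [50258, 21]]
batch_labels = pad_and_mask_labels(labels, decoder_start_token_id=50258)
print(batch_labels)
```

Both rows end up the same length, which is what allows them to be stacked into a single tensor for the loss computation.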

### Evaluation Metrics

### Step 16: Loading the Word Error Rate (WER) Metric

In this step, we load a performance evaluation metric to assess the accuracy of the model during training and validation. Here's the explanation of the code:

- `metric = evaluate.load('wer')`: Uses the `evaluate` library to load the 'wer' metric, which stands for Word Error Rate. WER is the standard metric for speech recognition. It counts the word-level substitutions (S), deletions (D), and insertions (I) needed to turn the model's transcript into the reference transcript and divides by the number of words in the reference (N): WER = (S + D + I) / N.

#### Importance of WER in Speech Recognition
- **Performance Evaluation**: WER provides a clear measure of how well the model transcribes spoken language. A lower WER indicates better performance, with 0% meaning the transcribed text matches the target text exactly. Because insertions count as errors, WER can exceed 100% when the model outputs many more words than the reference contains.
- **Model Optimization**: During training, observing changes in WER allows for adjustments in model parameters and training approach to improve accuracy.
- **Comparative Analysis**: WER is widely used, making it a standard for comparing the performance of different speech recognition models or different configurations of the same model.

By integrating the WER metric into the training and validation process, we can continuously monitor and evaluate the model’s ability to transcribe audio accurately, facilitating targeted improvements and ensuring the model meets the necessary standards for its intended application.

In [None]:
metric = evaluate.load('wer')

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]
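
The metric loaded above is computed by a library. As a minimal sketch of the underlying idea, the function below computes WER for a single sentence pair using word-level edit distance. `word_error_rate` is a hypothetical helper written for illustration, and the example phrases are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("to") and one substitution ("two" -> "too") over 7 reference words
wer_mixed = word_error_rate(
    "climb to flight level three two zero",
    "climb flight level three too zero",
)

# Three insertions over 2 reference words: the rate exceeds 1.0 (100%)
wer_insertions = word_error_rate("turn left", "please turn left now thanks")

print(f"{100 * wer_mixed:.1f}%  {100 * wer_insertions:.1f}%")
```

The second case shows why WER is an error rate rather than a bounded percentage: extra words are penalized even when every reference word is present.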

### Step 17: Defining the Metric Computation Function

This step involves creating a function, `compute_metrics`, that is used to calculate the Word Error Rate (WER) for evaluating the performance of the Whisper model during training and validation. This function takes the model's predictions and the true labels as input and returns the WER. Here's a detailed breakdown of the code:

- `def compute_metrics(pred):`: This defines the function that takes a prediction output from the model during evaluation.

- `pred_ids = pred.predictions`: Extracts the predicted token IDs from the model's output. These are the model's guesses for what the text should be based on the input audio.

- `label_ids = pred.label_ids`: Retrieves the actual token IDs that represent the true transcription. These serve as the ground truth during training.

- `label_ids[label_ids == -100] = tokenizer.pad_token_id`: This line replaces any label ID values of `-100` with the tokenizer's pad token ID. The value `-100` is typically used in training to indicate tokens that should be ignored (e.g., padding or non-relevant tokens), but for the purpose of evaluating WER, these need to be converted to a neutral token that the tokenizer recognizes.

- `pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)`: Converts the predicted token IDs back into strings, skipping special tokens like padding or start tokens that don't contribute to the actual text.

- `label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)`: Similarly, converts the true token IDs back into strings, providing the reference text against which the predictions are evaluated.

- `wer = 100 * metric.compute(predictions=pred_str, references=label_str)`: Computes the Word Error Rate over the decoded predictions and references. `metric.compute` returns the rate as a fraction, so it is multiplied by 100 to express it as a percentage.

- `return {'wer': wer}`: Returns a dictionary mapping the metric name to its value, which is the format the trainer expects from `compute_metrics`. The trainer prefixes the key with `eval_`, which is why the result appears as `eval_wer` in the evaluation output.

#### Purpose and Impact of the Metric Computation
- **Accuracy Measurement**: This function directly measures how accurately the model transcribes spoken language into text, which is critical for assessing its effectiveness.
- **Model Tuning**: By quantifying errors in the model's output, this function helps identify areas where the model may need further training or adjustment.
- **Validation**: Regular computation of WER during training provides ongoing validation of the model's performance, ensuring that improvements are tracked over time.

Integrating this function into the training loop allows for continuous monitoring of the model’s transcription accuracy, providing essential feedback that can be used to refine the model during the training process.

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {'wer': wer}
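
The `-100` replacement and the `skip_special_tokens=True` decoding can be traced with a toy example that needs no tokenizer download. The vocabulary, the special-token set, and the helpers `restore_padding` and `toy_batch_decode` are hypothetical stand-ins, not the Whisper tokenizer. Unlike the in-place assignment in `compute_metrics`, this sketch returns a copy.

```python
IGNORE_INDEX = -100


def restore_padding(label_ids, pad_token_id):
    """Replace the loss-masking value with a real pad token ID."""
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in seq]
        for seq in label_ids
    ]


def toy_batch_decode(batch_ids, vocab, special_ids):
    """Map IDs to words, dropping special tokens such as padding."""
    return [
        " ".join(vocab[token] for token in seq if token not in special_ids)
        for seq in batch_ids
    ]


# Hypothetical six-entry vocabulary
vocab = {0: "<pad>", 1: "<start>", 2: "descend", 3: "runway", 4: "two", 5: "seven"}
special_ids = {0, 1}

label_ids = [[1, 2, 3, 4, 5], [1, 3, 4, -100, -100]]
restored = restore_padding(label_ids, pad_token_id=0)
label_str = toy_batch_decode(restored, vocab, special_ids)
print(label_str)
```

The value `-100` has no entry in the vocabulary, so it has to be mapped to a real token ID before decoding. Once it is a pad token, it is dropped together with the other special tokens.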

### Define the Training Configuration

### Step 18: Configuring the Training Arguments

In this step, we define the training configuration using the `Seq2SeqTrainingArguments` class from the Transformers library. This configuration specifies how the model should be trained, how metrics should be evaluated, and how the model and its outputs should be managed. Here's a detailed explanation of the parameters set in this configuration:

- `output_dir=out_dir`: Sets the directory where the training outputs, including checkpoints and logs, will be saved.

- `per_device_train_batch_size=4`: Specifies the number of samples processed on each training step per device (e.g., per GPU).

- `per_device_eval_batch_size=4`: Specifies the batch size for evaluation.

- `gradient_accumulation_steps=2`: Accumulates gradients over two forward/backward passes before each optimizer update. With a per-device batch size of 4, this gives an effective batch size of 8 per device without the memory cost of a larger batch.

- `learning_rate=0.00001`: Sets the initial learning rate for the optimizer.

- `warmup_steps=500`: The number of steps to perform learning rate warmup, which gradually increases the learning rate from zero to the initial set learning rate.

- `bf16=True`: Enables training using Brain Floating Point (bfloat16) format if supported by the hardware, which can improve performance due to reduced precision operations.

- `fp16=False`: Disables training using half-precision floating point (FP16), which is another method for reducing precision to speed up training.

- `num_train_epochs=epochs`: Specifies the total number of training epochs.

- `evaluation_strategy='epoch'`: Indicates that the model should be evaluated at the end of each epoch.

- `logging_strategy='epoch'`: Configures logging to occur at the end of each epoch.

- `save_strategy='epoch'`: Specifies that the model checkpoints should be saved at the end of each epoch.

- `predict_with_generate=True`: Makes evaluation call the model's `generate()` method to produce token sequences autoregressively, which is needed to compute text metrics such as WER on the decoded output.

- `generation_max_length=225`: Sets the maximum length of the generated sequences, important for ensuring generated texts do not exceed reasonable lengths.

- `report_to='none'`: Disables reporting to any external services (like TensorBoard or WandB).

- `load_best_model_at_end=True`: Configures the training to load the best model (according to the specified metric) at the end of training.

- `metric_for_best_model='wer'`: Specifies that the Word Error Rate should be used to evaluate the best model during training.

- `greater_is_better=False`: Indicates that a lower metric score (WER in this case) is better, which is typical for error rates.

- `dataloader_num_workers=2`: Sets the number of subprocesses to use for data loading.

- `save_total_limit=2`: Limits the number of model checkpoints to keep, helping to manage disk space.

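- `lr_scheduler_type='constant'`: Specifies that the learning rate scheduler should maintain a constant learning rate throughout training. The plain `'constant'` schedule has no warmup phase, so `warmup_steps=500` has no effect with this setting; `'constant_with_warmup'` is the variant that combines the two.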

- `seed=42`: Sets the random seed for reproducibility of training results.

- `data_seed=42`: Sets the seed for data shuffling and batching operations for consistency.

#### Purpose and Impact of These Configurations
This comprehensive setup of training arguments is crucial for managing how the model learns, how it's evaluated, and how resources like memory and computational power are utilized. These settings ensure that the model training is efficient, reproducible, and aligned with the specific needs of the transcription task. By fine-tuning these parameters, we can optimize the model's performance and manage system resources effectively during training.
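
The interaction between `warmup_steps` and `lr_scheduler_type` can be sketched with a small function. It models the two schedule shapes as I understand them (a flat rate for `'constant'`, a linear ramp followed by a flat rate for `'constant_with_warmup'`) and is not the library's code. The helper name `lr_at_step` is hypothetical, and step 25 is taken from the last step that appears in this run's training log.

```python
def lr_at_step(step, base_lr, warmup_steps, scheduler="constant"):
    """Learning rate at a given optimizer step for two simple schedules."""
    if scheduler == "constant":
        # No warmup: the base rate applies from the first step
        return base_lr
    if scheduler == "constant_with_warmup":
        # Linear ramp from 0 to base_lr over warmup_steps, then flat
        if step < warmup_steps:
            return base_lr * step / max(1, warmup_steps)
        return base_lr
    raise ValueError(f"unsupported scheduler: {scheduler}")


base_lr = 0.00001
lr_constant = lr_at_step(25, base_lr, warmup_steps=500, scheduler="constant")
lr_warmup = lr_at_step(25, base_lr, warmup_steps=500, scheduler="constant_with_warmup")
print(lr_constant, lr_warmup)
```

With a run this short, a 500-step warmup would never reach the base rate, so the choice of scheduler makes a large difference to the rate the model actually trains with.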

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=out_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=0.00001,
    warmup_steps=500,
    bf16=True,
    fp16=False,
    num_train_epochs=epochs,
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    save_strategy='epoch',
    predict_with_generate=True,
    generation_max_length=225,
    report_to='none',
    load_best_model_at_end=True,
    metric_for_best_model='wer',
    greater_is_better=False,
    dataloader_num_workers=2,
    save_total_limit=2,
    lr_scheduler_type='constant',
    seed=42,
    data_seed=42
)
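
Why two accumulated micro-batches of 4 behave like one batch of 8 can be checked numerically. The sketch below uses a one-parameter linear model with a mean-squared-error loss and invented data. It illustrates the principle only and says nothing about how the trainer implements accumulation internally.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Invented data: eight samples, i.e. one effective batch
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 1.5

micro_batch_size = 4      # per_device_train_batch_size
accumulation_steps = 2    # gradient_accumulation_steps

accumulated = 0.0
for k in range(accumulation_steps):
    lo = k * micro_batch_size
    hi = lo + micro_batch_size
    # Each micro-batch gradient is scaled so that the sum is an average
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

full_batch = grad_mse(w, xs, ys)
effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size, accumulated, full_batch)
```

The accumulated gradient matches the full-batch gradient up to floating-point rounding, while only four samples need to be held in memory at a time.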



### Step 19: Implementing a Custom Callback to Manage GPU Memory

This step involves creating a custom callback for the training process that helps manage GPU memory by clearing the cache periodically. Here's an overview of the code and its components:

#### Custom Callback Class
- `from transformers import TrainerCallback`: This imports the base class `TrainerCallback` from the Hugging Face Transformers library, which allows customization of the training process through various callback events.
  
- `class ClearCacheCallback(TrainerCallback)`: Defines a new class `ClearCacheCallback` that inherits from `TrainerCallback`. This class is designed to clear the GPU memory cache at specified intervals during training.

#### Class Initialization and Method
- `def __init__(self, clear_every_n_steps=1)`: The constructor for the callback class, which takes an argument `clear_every_n_steps`. This argument determines how frequently (in terms of training steps) the GPU cache should be cleared. The default is set to 1, indicating that the cache will be cleared after every training step.

- `def on_step_end(self, args, state, control, **kwargs)`: This method is called at the end of each training step. It checks if the current step is one at which the cache should be cleared (based on `clear_every_n_steps`).

  - `if state.global_step % self.clear_every_n_steps == 0`: This condition checks if the modulus of the global step count and the clearing frequency is zero, which indicates that it's time to clear the cache.
  
  - `torch.cuda.empty_cache()`: This command clears the GPU's cache, freeing up memory that might no longer be needed but is still being held.

  - `print(f"Cleared cache at step {state.global_step}")`: Prints a message indicating that the cache has been cleared at the current step.

#### Instantiating the Callback
- `clear_cache_callback = ClearCacheCallback(clear_every_n_steps=1)`: Creates an instance of `ClearCacheCallback`. The frequency is set to clear the cache every step by default, which can be adjusted depending on memory requirements and the size of the training data.

#### Purpose and Impact
This callback targets situations where GPU memory is tight, such as with large models or long audio inputs. `torch.cuda.empty_cache()` releases cached memory blocks that PyTorch's allocator holds but that no tensor is currently using. It does not free memory occupied by live tensors, so it cannot reduce what the model, gradients, and optimizer state actually require. Its benefit is mainly to reduce fragmentation and to return memory to other processes sharing the GPU.

Clearing the cache after every step also adds overhead, because PyTorch must request memory from the driver again on the next step. If training is slow, a larger `clear_every_n_steps` is a reasonable trade-off.

In [None]:
import torch
from transformers import TrainerCallback


class ClearCacheCallback(TrainerCallback):
    """Release cached, unused GPU memory every `clear_every_n_steps` steps."""

    def __init__(self, clear_every_n_steps=1):
        self.clear_every_n_steps = clear_every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step is the number of training steps completed so far
        if state.global_step % self.clear_every_n_steps == 0:
            torch.cuda.empty_cache()
            print(f"Cleared cache at step {state.global_step}")


# Clear after every step; raise the interval to reduce overhead
clear_cache_callback = ClearCacheCallback(clear_every_n_steps=1)
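
The effect of `clear_every_n_steps` can be previewed without a GPU. The class below is a hypothetical, torch-free stand-in for `ClearCacheCallback`: it uses the same modulo test but records the step number instead of clearing anything. The 25-step run length matches the last step shown in this notebook's training log.

```python
class StepRecorder:
    """Torch-free stand-in that records the steps at which clearing would occur."""

    def __init__(self, clear_every_n_steps=1):
        self.clear_every_n_steps = clear_every_n_steps
        self.cleared_at = []

    def on_step_end(self, global_step):
        # Same condition as in ClearCacheCallback.on_step_end
        if global_step % self.clear_every_n_steps == 0:
            self.cleared_at.append(global_step)


def simulate(total_steps, clear_every_n_steps):
    """Run the recorder over steps 1..total_steps and return the trigger steps."""
    recorder = StepRecorder(clear_every_n_steps)
    for step in range(1, total_steps + 1):
        recorder.on_step_end(step)
    return recorder.cleared_at


every_step = simulate(total_steps=25, clear_every_n_steps=1)
every_fifth = simulate(total_steps=25, clear_every_n_steps=5)
print(len(every_step), every_fifth)
```

Raising the interval from 1 to 5 cuts the number of cache clears in this run from 25 to 5.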

### Step 20: Clearing GPU Memory Cache

This step involves a straightforward but crucial operation, especially in computational environments with GPU resources. Here's what the code does:

- `torch.cuda.empty_cache()`: Releases the cached GPU memory that PyTorch's caching allocator has reserved but is not currently using. This is useful for returning unused GPU memory before starting an intensive operation such as training.

#### Purpose and Impact
- **Memory Management**: PyTorch keeps freed GPU memory in a cache so that later allocations are fast, which means memory is not returned to the device as soon as tensors are deleted. Clearing the cache hands those unused blocks back.
- **Prevent Memory Errors**: Clearing the cache can help avoid CUDA out-of-memory errors caused by fragmentation or by other processes competing for the same GPU. It does not free memory held by tensors that are still referenced.
- **Optimize GPU Usage**: Memory released this way becomes available to other GPU applications and visible in tools such as `nvidia-smi`.

Running this command before a computationally intensive task, such as training a neural network, is a cheap way to start from a clean allocator state.

In [None]:
torch.cuda.empty_cache()

### Step 21: Initializing the Seq2SeqTrainer

This step involves setting up the `Seq2SeqTrainer` from the Transformers library, a specialized trainer class for sequence-to-sequence models like the Whisper model. This trainer encapsulates all the components needed for training and evaluating the model. Here’s a detailed explanation of the initialization and its components:

- `trainer = Seq2SeqTrainer(...)`: This statement creates an instance of the `Seq2SeqTrainer`. The trainer is configured with several important components:

  - `args=training_args`: Passes the training configuration set in previous steps. These arguments define various training parameters like batch size, number of epochs, learning rate, etc.

  - `model=model`: Specifies the model that will be trained. This is the Whisper model loaded and configured in previous steps.

  - `train_dataset=atc_dataset_train`: Sets the dataset that will be used for training. This is the ATCO2 dataset that has been preprocessed and prepared for training.

  - `eval_dataset=atc_dataset_valid`: Sets the dataset used for validation. Validation is crucial for monitoring the model's performance on data it hasn't seen during training.

  - `data_collator=data_collator`: Specifies the data collator that handles batch creation and data preparation during training. The custom data collator ensures that batches are correctly formatted for the model.

  - `compute_metrics=compute_metrics`: Sets the function that will be used to compute metrics during evaluation. The function computes the Word Error Rate (WER), providing insight into the model's transcription accuracy.

  - `tokenizer=processor.feature_extractor`: The `tokenizer` argument tells the trainer which preprocessing object to save alongside checkpoints. Passing the feature extractor here stores the audio preprocessing configuration with the model. In the Transformers version used for this run the argument is deprecated; the logged warnings recommend `processing_class` instead.

  - `callbacks=[clear_cache_callback]`: Includes custom callbacks that are called during the training process. The `ClearCacheCallback` is used here to manage GPU memory effectively, ensuring that the cache is cleared periodically to avoid memory overflow.

#### Purpose and Impact of the Trainer Configuration
- **Comprehensive Management**: The `Seq2SeqTrainer` manages all aspects of training, from data loading and batching to model evaluation and checkpointing. This setup allows for a streamlined training process that is easy to monitor and optimize.
  
- **Performance Monitoring**: By integrating evaluation and custom metrics computation, the trainer provides ongoing feedback about the model's performance, allowing for adjustments to be made in real-time during training.

- **Resource Optimization**: The inclusion of a custom callback for memory management is critical for maintaining efficient GPU usage, preventing potential disruptions during training due to resource limitations.

This trainer setup is crucial for executing a smooth and effective training operation, ensuring that all components work together seamlessly to optimize the model's performance on the transcription task.

In [None]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=atc_dataset_train,
    eval_dataset=atc_dataset_valid,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[clear_cache_callback]
)


### Training

### Step 22: Evaluating the Model

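This step uses the `Seq2SeqTrainer` to evaluate the Whisper model on the validation dataset. Because it runs before `trainer.train()`, the result is a baseline for the pretrained checkpoint, against which the fine-tuned model can be compared. Here's what the code does and why it's important: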

- `trainer.evaluate()`: This method calls the evaluate function of the `Seq2SeqTrainer`. The function uses the validation dataset (`eval_dataset`) that was specified when the trainer was configured. This evaluation process involves running the model on the validation dataset and calculating the performance metrics that were defined earlier, specifically the Word Error Rate (WER).

#### Purpose and Impact of Model Evaluation
- **Performance Measurement**: The primary purpose of evaluation is to measure how well the model performs on a set of data it has not been trained on. This helps in understanding the model's generalization capabilities.
- **Metric Calculation**: During the evaluation, the `compute_metrics` function is invoked to calculate the WER for the predictions made by the model compared to the actual labels in the validation dataset. This metric provides a quantitative measure of the model's accuracy in transcribing speech to text.
- **Model Tuning**: Based on the results of this evaluation, further tuning of the model's parameters might be necessary to improve performance or address any issues like overfitting or underfitting.
- **Validation Feedback**: Regular evaluation during training (if set to evaluate at the end of each epoch or at specific intervals) provides ongoing feedback on the model's progress and effectiveness of the training strategy.

#### Benefits of Continuous Evaluation
- **Model Improvement**: Continuous monitoring of the model’s performance during the training process allows for iterative improvements, enhancing the model's effectiveness with each epoch.
- **Early Stopping**: If the evaluation shows that the model's performance has plateaued or is beginning to degrade, training can be stopped early to save resources and prevent potential overfitting.

This step is crucial for ensuring that the model is effectively learning and adapting to the task of speech recognition, and it provides essential insights that guide the ongoing training and development process.

In [None]:
trainer.evaluate()

You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

{'eval_loss': 2.111818313598633,
 'eval_model_preparation_time': 0.0072,
 'eval_wer': 99.36305732484077,
 'eval_runtime': 17.2913,
 'eval_samples_per_second': 0.578,
 'eval_steps_per_second': 0.116}

### Step 23: Starting the Model Training Process

This final step involves initiating the actual training of the Whisper model using the `Seq2SeqTrainer` configured in previous steps. Here’s what this operation entails:

- `trainer.train()`: This method starts the training loop of the `Seq2SeqTrainer`. Throughout the training process, the model will use the training dataset (`train_dataset`) to learn how to accurately transcribe speech to text. The training arguments (`training_args`) previously defined will dictate the specifics of the training process, such as the number of epochs, batch sizes, learning rate, and other parameters.

#### Key Aspects of the Training Process
- **Batch Processing**: The model processes batches of data, each consisting of audio inputs and their corresponding text transcriptions. The `data_collator` ensures that each batch is correctly prepared and formatted for input into the model.
- **Loss Calculation and Backpropagation**: During each batch processing, the model computes loss based on the difference between its predictions and the actual labels. It then updates its weights through backpropagation to minimize this loss.
- **Metrics Monitoring**: The `compute_metrics` function is used to calculate metrics (such as WER) after each evaluation step (if configured to evaluate during training). This provides insights into the model's performance and effectiveness.
- **Checkpointing and Logging**: Depending on the `save_strategy` and `logging_strategy`, the trainer will save checkpoints and log training progress at specified intervals. This helps in monitoring the training process and recovering the training from the last checkpoint in case of interruptions.
- **GPU Memory Management**: The custom `ClearCacheCallback` included in the trainer’s callbacks helps manage GPU memory usage, ensuring efficient resource utilization and preventing potential memory overflow.

#### Purpose and Impact of Training
- **Model Optimization**: Training is the core phase where the model learns and optimizes its parameters to perform the specific task of speech transcription as accurately as possible.
- **Model Validation**: If `evaluation_strategy` is set to evaluate during training, periodic validation checks help gauge the model's generalization capabilities and prevent overfitting.
- **Resource Management**: Effective management of computational resources during training ensures that the process is as efficient as possible, maximizing the use of available hardware without unnecessary wastage.

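By the end of training, the model is considerably better adapted to the task: in the run logged below, validation WER falls from the 99.4% baseline to 25.8% at epoch 7. Training loss keeps dropping (to 0.0055 by epoch 8) while validation loss levels off between roughly 0.75 and 0.79 from epoch 3 onward and WER fluctuates between about 26% and 31% from epoch 4 onward, a pattern that suggests the model is starting to overfit the training set. Because `load_best_model_at_end=True` is combined with `metric_for_best_model='wer'`, the checkpoint with the lowest validation WER is the one restored when training finishes.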

In [None]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Cleared cache at step 1


| Epoch | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 1 | 3.3807 | 1.166981 | 43.949045 |
| 2 | 1.3417 | 0.906171 | 45.22293 |
| 3 | 0.5638 | 0.789673 | 34.394904 |
| 4 | 0.2268 | 0.769878 | 27.388535 |
| 5 | 0.0726 | 0.768183 | 29.299363 |
| 6 | 0.0375 | 0.765243 | 30.573248 |
| 7 | 0.0252 | 0.748279 | 25.796178 |
| 8 | 0.0055 | 0.777459 | 27.070064 |


Cleared cache at step 2
Cleared cache at step 3
Cleared cache at step 4
Cleared cache at step 5
Cleared cache at step 6
Cleared cache at step 7


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Cleared cache at step 8
Cleared cache at step 9
Cleared cache at step 10
Cleared cache at step 11
Cleared cache at step 12
Cleared cache at step 13
Cleared cache at step 14


Cleared cache at step 15
Cleared cache at step 16
Cleared cache at step 17
Cleared cache at step 18
Cleared cache at step 19
Cleared cache at step 20
Cleared cache at step 21



Cleared cache at step 22
Cleared cache at step 23
Cleared cache at step 24
Cleared cache at step 25
Cleared cache at step 26
Cleared cache at step 27
Cleared cache at step 28



Cleared cache at step 29
Cleared cache at step 30
Cleared cache at step 31
Cleared cache at step 32
Cleared cache at step 33
Cleared cache at step 34
Cleared cache at step 35



Cleared cache at step 36
Cleared cache at step 37
Cleared cache at step 38
Cleared cache at step 39
Cleared cache at step 40
Cleared cache at step 41
Cleared cache at step 42



Cleared cache at step 43
Cleared cache at step 44
Cleared cache at step 45
Cleared cache at step 46
Cleared cache at step 47
Cleared cache at step 48
Cleared cache at step 49



Cleared cache at step 50
Cleared cache at step 51
Cleared cache at step 52
Cleared cache at step 53
Cleared cache at step 54
Cleared cache at step 55
Cleared cache at step 56



Cleared cache at step 57
Cleared cache at step 58
Cleared cache at step 59
Cleared cache at step 60



TrainOutput(global_step=60, training_loss=0.6610405718286833, metrics={'train_runtime': 1105.9831, 'train_samples_per_second': 0.452, 'train_steps_per_second': 0.054, 'total_flos': 1.2466889293824e+17, 'train_loss': 0.6610405718286833, 'epoch': 8.615384615384615})
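
In the per-epoch metrics above, training loss keeps falling through epoch 8, while validation loss and WER both reach their minimum at epoch 7 and rise again afterwards. The following is a minimal sketch of picking the best epoch from such a log; the `best_epoch` helper is hypothetical and not part of the notebook, and the numbers are copied from the training output.

```python
import csv
import io

# Per-epoch metrics copied from the training output above.
LOG = """Epoch,Training Loss,Validation Loss,Wer
1,3.3807,1.166981,43.949045
2,1.3417,0.906171,45.22293
3,0.5638,0.789673,34.394904
4,0.2268,0.769878,27.388535
5,0.0726,0.768183,29.299363
6,0.0375,0.765243,30.573248
7,0.0252,0.748279,25.796178
8,0.0055,0.777459,27.070064
"""


def best_epoch(log_text, metric="Wer"):
    """Return (epoch, value) for the row with the lowest value of `metric`."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    best = min(rows, key=lambda row: float(row[metric]))
    return int(best["Epoch"]), float(best[metric])


epoch, wer = best_epoch(LOG)
print(epoch, wer)  # 7 25.796178
```

The `trainer.evaluate()` output later in this notebook reports the same loss and WER as epoch 7, which suggests the best checkpoint was reloaded at the end of training. The training arguments are not shown in this section, so treat that as an inference rather than a confirmed setting.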

In [None]:
trainer.evaluate()



{'eval_loss': 0.7482790350914001,
 'eval_wer': 25.796178343949045,
 'eval_runtime': 16.7973,
 'eval_samples_per_second': 0.595,
 'eval_steps_per_second': 0.179,
 'epoch': 8.615384615384615}
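
For readers unfamiliar with the `eval_wer` figure, the sketch below shows how a word error rate can be computed from a reference and a hypothesis transcript. It is a standalone illustration, not the notebook's metric code: the notebook presumably relies on `evaluate` and `jiwer`, possibly with text normalization, and the example sentences are made up. Multiplying by 100 to get a percentage is an assumption chosen to match the scale of the value reported above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one word dropped and one word added, over 7 reference words.
reference = "climb to flight level three five zero"
hypothesis = "climb flight level three five zero zero"
print(f"{100 * word_error_rate(reference, hypothesis):.2f}")  # 28.57
```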

### Step 24: Saving the Trained Model and Associated Components

This step saves the trained model and its associated components so they can be reloaded or deployed later. The cell makes three calls:

- `model.save_pretrained(f"{out_dir}/best_model")`: Saves the fine-tuned Whisper model's weights and configuration to the `best_model` subdirectory of `out_dir`, so the model can later be reloaded in the same state with `from_pretrained`.

- `tokenizer.save_pretrained(f"{out_dir}/best_model")`: Saves the tokenizer used during data preparation and training to the same directory, so that future predictions tokenize and decode text exactly as training did.

- `processor.save_pretrained(f"{out_dir}/best_model")`: Saves the processor, which bundles the feature extractor and the tokenizer, so that the preprocessing applied during training can be replicated exactly at inference time. Because the processor already contains a tokenizer, this call writes tokenizer files as well, which makes the separate tokenizer save harmless but largely redundant.

#### Purpose and Impact of Saving Trained Components
- **Model Deployment**: The saved directory can be loaded directly in a production environment to run speech transcription.
- **Consistency and Reproducibility**: Saving the tokenizer and processor alongside the model ensures that training and inference use the same preprocessing, which is critical for accuracy.
- **Research and Development**: Saved models can be shared with the community or reused in later work, letting others reproduce the results, experiment with the model, or fine-tune it on additional data.

Saving the model and its components completes the training process and makes its results available for further application and analysis.

In [None]:
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")
processor.save_pretrained(f"{out_dir}/best_model")

[]
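
As a counterpart to the save cell, here is a hedged sketch of reloading the saved directory for inference. It assumes the model was a `WhisperForConditionalGeneration` and the processor a `WhisperProcessor`, which this section does not show explicitly; the `load_saved_whisper` helper is hypothetical.

```python
from pathlib import Path


def load_saved_whisper(model_dir):
    """Reload the model and processor written by the save_pretrained calls."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"no saved model directory at {model_dir}")
    # Imported lazily so the path check above runs even if transformers
    # is not installed in the current environment.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # Assumption: these are the classes used earlier in the notebook.
    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    return model, processor
```

After the save cell has run, `model, processor = load_saved_whisper(f"{out_dir}/best_model")` should return objects equivalent to the ones that were saved.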

In [None]:
!zip -r whisper_small_atco2 whisper_small_atco2

/bin/bash: line 1: zip: command not found
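
The cell above fails because the `zip` binary is not installed in this environment. One way around this, sketched below, is to build the archive with Python's standard library instead. The `zip_directory` helper is hypothetical, and the sketch assumes `whisper_small_atco2` is the output directory in the current working directory, as the failed command implies.

```python
import shutil
from pathlib import Path


def zip_directory(src_dir, archive_stem=None):
    """Create <archive_stem>.zip containing src_dir and return the archive path."""
    src = Path(src_dir).resolve()
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    # By default the archive is written next to the source directory.
    stem = str(archive_stem) if archive_stem is not None else str(src)
    # root_dir/base_dir keep the top-level folder name inside the archive,
    # similar to what `zip -r name name` would produce.
    return shutil.make_archive(stem, "zip", root_dir=src.parent, base_dir=src.name)
```

Calling `zip_directory("whisper_small_atco2")` would then write `whisper_small_atco2.zip` alongside the directory without depending on any external tool.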


## New Section