# Danish ASR: Fine-tuning Whisper

This notebook documents our pipeline for fine-tuning and knowledge distillation (KD) of Whisper-Large-v3-Turbo on Danish ASR data. It is provided for reproducibility and walks through our experimental setup and implementation.

## Hardware Requirements

Our experiments were conducted on two different GPU setups:
- Preliminary work: NVIDIA A100 (80GB)
- Final models: NVIDIA H200 (141GB)

To reproduce our results with the provided configurations, you'll need similar high-end GPU resources, particularly for:
- Batch sizes (32 for fine-tuning, 24 for distillation)
- Mixed precision training (bfloat16)
- Model sizes (Whisper-Large-v3 (Hviske-v2): ~1.5B parameters; Whisper-Large-v3-Turbo: ~809M parameters)
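
As a rough guide to what these model sizes mean in practice, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below is illustrative only: `weight_memory_gb` is a hypothetical helper, the parameter counts are the approximate figures listed above, and the estimate ignores activations, gradients, and optimizer state, which often account for most of the memory during training.

```python
# Bytes used to store one parameter in common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Return the approximate memory in GB (1 GB = 1e9 bytes) for model weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate parameter counts taken from the list above.
models = {"whisper-large-v3": 1.5e9, "whisper-large-v3-turbo": 809e6}

for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bf16")
```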

If using different GPU hardware, you may need to adjust:
- `batch_size` in train_config.yaml
- `per_device_train_batch_size` in distill_config.yaml
- Precision settings (`fp16`/`bf16`), or possibly 8-bit or 4-bit quantization
- Gradient accumulation steps and/or gradient checkpointing
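
If the per-device batch has to shrink on a smaller GPU, gradient accumulation can keep the effective batch size close to the values used here. A minimal sketch of that arithmetic follows; `accumulation_steps_for` is a hypothetical helper and not part of the repository, and the per-device batch of 8 is just an example value.

```python
import math


def accumulation_steps_for(target_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Smallest number of accumulation steps that reaches at least the target effective batch."""
    if per_device_batch <= 0 or num_gpus <= 0:
        raise ValueError("per_device_batch and num_gpus must be positive")
    return math.ceil(target_batch / (per_device_batch * num_gpus))


# Fine-tuning used a batch size of 32; suppose only 8 samples fit per device.
steps = accumulation_steps_for(target_batch=32, per_device_batch=8)
print(steps)      # 4
print(8 * steps)  # effective batch size: 32
```

The resulting value would go into `gradient_accumulation_steps` in the relevant config file.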

## Pipeline Overview and Execution

Before running the experiments, configure your settings in:
1. `configs/train_config.yaml`: Fine-tuning parameters and dataset paths
2. `configs/distill_config.yaml`: Knowledge distillation settings
3. `configs/baseline_config.yaml`: Evaluation parameters for benchmark testing

The pipeline consists of the following steps:

### 1. Dataset Preparation
- Downloads the ASR dataset(s): `alexandrainst/coral`
- Saves preprocessed data locally
- Ensures consistent format for training

### 2. Training (2 Approaches)

- The two approaches are fine-tuning and knowledge distillation; both use bfloat16 precision
- Fine-tuning uses batch size of 32
- Knowledge distillation uses batch size of 24
- Models are saved to the specified output directory

### 3. Evaluation
- Comprehensive testing on multiple Danish benchmarks
- Detailed performance analysis across datasets
- Results saved in `evaluation/` directory

NB: For the reader's convenience, we have included Python cells that print the contents of the respective files. Alternatively, please refer to the actual Python files in their respective locations.

## 1. Dataset Preparation and Download

Due to the large size of Danish ASR datasets (several hundred GBs), we've separated the dataset downloading and preprocessing steps. The `download_dataset.py` script handles this through configuration parameters in `train_config.yaml`:

```yaml
dataset:
  download:
    output_dir: "huge_subset"      # Directory where the dataset will be saved
    train_size: 50000             # Number of training samples to download
    val_size: 1000               # Number of validation samples to download
```

To download a subset of the dataset:
```bash
python src/data/download_dataset.py
```

**Important Notes:**
- The download process saves the data locally for faster access during training
- We avoid streaming the full dataset during training because streaming combined with audio column casting can cause memory issues
- Downloading also keeps the data consistent across runs, since applying shuffling to these large datasets is very tedious
- The downloaded data is saved in `{output_dir}/data/train` and `{output_dir}/data/val`
- Make sure you have sufficient disk space before downloading (each audio sample can be several MB)
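
Because each sample can take several MB, a quick estimate of the required disk space is useful before starting a download. The sketch below assumes an average size per sample that you supply yourself; the 3 MB figure is purely illustrative, and `estimate_disk_gb` and `has_enough_space` are hypothetical helpers rather than part of `download_dataset.py`.

```python
import shutil


def estimate_disk_gb(num_samples: int, avg_mb_per_sample: float) -> float:
    """Estimate the total size in GB (1 GB = 1000 MB) for a number of audio samples."""
    return num_samples * avg_mb_per_sample / 1000


def has_enough_space(path: str, required_gb: float) -> bool:
    """Check whether the filesystem containing `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Sample counts from the download config above: 50,000 training and 1,000 validation.
needed = estimate_disk_gb(50_000 + 1_000, avg_mb_per_sample=3.0)  # 3 MB is an assumed average
print(f"Estimated download size: ~{needed:.0f} GB")
print("Enough free space:", has_enough_space(".", needed))
```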

In [None]:
# Show the download_dataset.py script
from pathlib import Path

def show_file(filepath):
    with open(filepath, 'r') as f:
        print(f.read())

print("download_dataset.py:")
show_file('../src/data/download_dataset.py')

## 2. Data Loading

The `data_loader.py` script provides a consistent interface for loading locally stored datasets across different computing environments (local machines, VM instances, etc.). The actual data preprocessing happens in the training scripts.

```python
# Example from data_loader.py
from pathlib import Path

from datasets import DatasetDict, load_from_disk


def load_dataset(cfg):
    """
    Load dataset from local directory with consistent path handling
    Args:
        cfg: Configuration containing dataset parameters
    Returns:
        DatasetDict: Dataset with 'train' and 'validation' splits
    """
    data_dir = Path(cfg.dataset.data_dir)
    return DatasetDict({
        'train': load_from_disk(data_dir / 'train'),
        'validation': load_from_disk(data_dir / 'val')
    })
```

**Key Points:**
- The loader ensures consistent path handling across different OS environments
- Audio preprocessing (resampling, feature extraction) is handled in the training scripts
- The loader expects data in the structure created by `download_dataset.py`:
  ```
  huge_subset/
  └── data/
      ├── train/
      └── val/
  ```

The actual data preprocessing pipeline (audio transformations, batching, etc.) is implemented in the respective training scripts (`finetune_whisper.py`, `knowledge_distil.py`) to maintain flexibility for different training approaches.
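
Before launching a training run, it can help to confirm that the directory layout expected by the loader is actually present. The following is a minimal sketch using only the standard library; `missing_splits` is a hypothetical helper, and the path `huge_subset/data` is the `data_dir` value from the configuration shown in this notebook.

```python
from pathlib import Path


def missing_splits(data_dir, splits=("train", "val")):
    """Return the names of split directories that are absent under `data_dir`."""
    root = Path(data_dir)
    return [name for name in splits if not (root / name).is_dir()]


missing = missing_splits("huge_subset/data")
if missing:
    print(f"Missing split directories: {missing}. Run download_dataset.py first.")
else:
    print("Dataset layout looks complete.")
```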

In [None]:
print("data_loader.py:")
show_file('../src/data/data_loader.py')

## 3. Fine-tuning

The `finetune_whisper.py` script fine-tunes Whisper on Danish ASR data. The hyperparameters are controlled through `train_config.yaml`:

```yaml
# Training configuration
training:
  output_dir: "models/whisper-large-v3-turbo-finetuned_50k"  # Where to save the model
  num_train_epochs: 3                                         # Number of training epochs
  batch_size: 32                                             # Batch size per GPU
  gradient_accumulation_steps: 1                             # Accumulation for larger effective batch
  learning_rate: 3e-5                                        # Learning rate
  weight_decay: 0.01                                         # Weight decay for regularization
  bf16: true                                                # Use bfloat16 precision
  warmup_steps: 500                                         # Learning rate warmup
  save_steps: 1000                                          # Save checkpoint every N steps
  eval_steps: 1000                                          # Evaluate every N steps

# Model configuration
model:
  name: "openai/whisper-large-v3"                           # Base model to fine-tune
  device: 0                                                 # GPU device ID
  fp16: false                                              # Don't use float16 precision

# Dataset configuration
dataset:
  name: "alexandrainst/coral"                              # Dataset identifier
  data_dir: "huge_subset/data"                             # Path to local dataset
  sampling_rate: 16000                                     # Audio sampling rate
  num_proc: 30                                             # Number of preprocessing workers
  batch_size_per_proc: 8                                   # Batch size per worker
```

The script:
1. Loads the pretrained Whisper model with the specified configuration
2. Sets up training with the defined hyperparameters
3. Fine-tunes on the Danish dataset
4. Saves checkpoints and the final model to the specified `output_dir`
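
The hyperparameters above imply a concrete training schedule. As a worked example, assuming the 50,000 training samples from the download configuration, a single GPU, and that the final partial batch is kept, the number of optimizer steps can be derived as follows (`training_schedule` is a hypothetical helper, not a function from the training script):

```python
import math


def training_schedule(num_samples, batch_size, epochs, grad_accum=1, warmup_steps=0, save_steps=1):
    """Derive step counts from dataset size and training hyperparameters."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
        "num_checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(
    num_samples=50_000, batch_size=32, epochs=3,
    grad_accum=1, warmup_steps=500, save_steps=1000,
)
print(schedule["steps_per_epoch"])  # 1563
print(schedule["total_steps"])      # 4689
print(schedule["num_checkpoints"])  # 4
```

Under these assumptions, the 500 warmup steps cover roughly 11% of training, and intermediate checkpoints are written at steps 1000, 2000, 3000, and 4000.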

**Note:** Due to the large size of the fine-tuned models (several GBs), they are not included in the repository. The trained models can be accessed via our [Google Drive link](https://drive.google.com/drive/folders/1AoEGmsw_cjO7eRFs3s6dPXd2oSiWj7cO?usp=sharing).

In [None]:
print("finetune_whisper.py:")
show_file('../src/models/finetune_whisper.py')

## 4. Knowledge Distillation

The `knowledge_distil.py` script performs knowledge distillation from a teacher to a student Whisper model. The hyperparameters are controlled through `distill_config.yaml`:

```yaml
# Teacher model configuration
teacher_model:
  name: "syvai/hviske-v2"                                  # Teacher model to distill from
  device: 1                                                # GPU device ID for teacher
  fp16: false                                             # Don't use float16 precision

# Student model configuration
student_model:
  name: "openai/whisper-large-v3-turbo"                   # Student model to train
  device: 1                                               # GPU device ID for student
  fp16: false                                            # Don't use float16 precision

# Dataset configuration
dataset:
  name: "alexandrainst/coral"                            # Dataset identifier
  data_dir: "huge_subset/data"                           # Path to preprocessed dataset
  num_proc: 5                                            # Number of preprocessing workers

# Training configuration
training:
  output_dir: "models/distilled-whisper-turbo-large_subset"  # Where to save the model
  per_device_train_batch_size: 24                            # Batch size per GPU
  per_device_eval_batch_size: 24                             # Evaluation batch size
  gradient_accumulation_steps: 1                             # Accumulation for larger batch
  learning_rate: 3e-5                                        # Learning rate
  num_train_epochs: 1                                        # Number of training epochs
  bf16: true                                                # Use bfloat16 precision
  temperature: 2                                             # Softmax temperature
  alpha: 0.7                                                # KD loss weight

# LoRA parameters (implemented but not used in final experiments)
lora:
  rank: 16
  alpha: 64
  dropout: 0.05
  target_modules: [
    "q_proj", "v_proj", "k_proj",
    "out_proj", "fc1", "fc2"
  ]
```

The script:
1. Loads both teacher and student models with their respective configurations
2. Uses the preprocessed dataset from the specified `data_dir`
3. Sets up knowledge distillation training with the defined hyperparameters
4. Saves checkpoints and the final distilled model to the specified `output_dir`

**Note:** Like the fine-tuned models, the distilled models are also available via our [Google Drive link](insert_link_here) due to their large size.

**Key Concepts:**
- `temperature`: Controls the softness of the teacher's predictions
- `alpha`: Balances between distillation and task-specific losses
- `data_dir`: Points to the preprocessed dataset created by `download_dataset.py`
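
To make the roles of `temperature` and `alpha` concrete, the sketch below shows one common formulation of a distillation loss for a single token position: a temperature-scaled KL divergence between teacher and student distributions, mixed with the usual cross-entropy on the ground-truth token. It is written in plain Python for readability. The function names, the `temperature ** 2` scaling, and the exact way `alpha` mixes the two terms are assumptions for illustration and may differ from what `knowledge_distil.py` does.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index, temperature=2.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy (assumed formulation)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy against the ground-truth token, at temperature 1.
    ce = -math.log(softmax(student_logits)[target_index])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Toy logits over a three-token vocabulary; the correct token has index 2.
teacher = [1.0, 2.0, 4.0]
student = [1.5, 1.5, 3.0]
print(distillation_loss(student, teacher, target_index=2, temperature=2.0, alpha=0.7))
```

With `alpha: 0.7` as in the config, such a loss would weight the teacher-matching term more heavily than the ground-truth term.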

**Future Work:**
While we implemented LoRA (Low-Rank Adaptation) for memory-efficient fine-tuning, this approach was not included in the final report or experiments. The implementation remains in the codebase for future exploration and comparison with our current approaches.
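
To illustrate why LoRA is memory-efficient, the sketch below counts the trainable parameters that a rank-`r` adapter adds to a single linear layer, along with the `alpha / rank` scaling commonly applied to the adapter update. The layer shapes (hidden size 1280, feed-forward size 5120) are assumed values for a Whisper-large-sized model and are not read from the codebase; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


rank, alpha = 16, 64      # values from the `lora` section of distill_config.yaml
hidden, ffn = 1280, 5120  # assumed layer sizes, for illustration only

for name, (d_in, d_out) in {"q_proj": (hidden, hidden), "fc1": (hidden, ffn)}.items():
    full = d_in * d_out
    lora = lora_param_count(d_in, d_out, rank)
    print(f"{name}: {lora:,} LoRA params vs {full:,} full ({lora / full:.2%})")

print("scaling factor:", alpha / rank)  # 4.0
```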

In [None]:
# Show the knowledge_distil.py script
from pathlib import Path

def show_file(filepath):
    with open(filepath, 'r') as f:
        print(f.read())

print("knowledge_distil.py:")
show_file('../src/KD/knowledge_distil.py')

## 5. Evaluation

Our evaluation pipeline consists of two main components:

The `finetuned_test.py` script evaluates models on the Coral dataset (our primary fine-tuning target):
- Detailed demographic analysis (age, gender, dialect)
- Word Error Rate (WER) and Character Error Rate (CER)
- Example transcriptions for qualitative analysis
- Results saved in `evaluation/finetuned/`
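
For readers unfamiliar with the two metrics, both are edit distances normalized by the reference length: WER counts word-level edits, CER counts character-level edits. The sketch below is a minimal standard-library implementation for illustration. The evaluation scripts may use a dedicated library and apply text normalization first, so treat the function names and the lack of normalization here as assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


ref = "jeg hedder anna"
hyp = "jeg hedder hanna"
print(wer(ref, hyp))  # 1 substituted word out of 3
print(cer(ref, hyp))  # 1 inserted character out of 15
```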

The `baseline2.py` script tests model generalization across different data distributions using three Danish ASR benchmarks:
1. Mozilla Common Voice 17.0
2. NST (Danish)
3. Google/FLEURS Danish

**Note:** Results are organized in separate directories:
- `evaluation/finetuned/`: Target dataset (Coral) results
- `evaluation/benchmarking/`: Cross-dataset benchmark results (our trained models are also tested here, but on the additional benchmark datasets)

In [None]:
print("Evaluation on target dataset:")
show_file('../src/models/finetuned_test.py')

print("\nBenchmark testing across datasets:")
show_file('../src/models/baseline2.py')