# Fine-tune Whisper for Polish voice commands
Install the necessary libraries.

In [1]:
!pip install transformers datasets evaluate librosa torchaudio jiwer

Provide a Hugging Face access token.

In [2]:
!huggingface-cli login


    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
The token `imlla` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate

Import the datasets library. Load the Polish (pl-PL) subset of the Speech-MASSIVE test set from the Hugging Face Hub.

In [3]:
from datasets import load_dataset, DatasetDict


dataset = load_dataset("FBK-MT/Speech-MASSIVE-test", 'pl-PL', split='test', trust_remote_code=True)
print(dataset)

Dataset({
    features: ['id', 'locale', 'partition', 'scenario', 'scenario_str', 'intent_idx', 'intent_str', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments', 'tokens', 'labels', 'audio', 'path', 'is_transcript_reported', 'is_validated', 'speaker_id', 'speaker_sex', 'speaker_age', 'speaker_ethnicity_simple', 'speaker_country_of_birth', 'speaker_country_of_residence', 'speaker_nationality', 'speaker_first_language'],
    num_rows: 2974
})


Keep only the relevant columns: utt (the reference text) and audio (the waveform as an array of samples, together with its sampling rate).

In [4]:
dataset = dataset.select_columns(['utt','audio'])
dataset[0]

{'utt': 'jaki film jest teraz najwyżej oceniany',
 'audio': {'path': 'dab7ab8b100ece47b0a1e822a6beac5b.wav',
  'array': array([-6.98491931e-10,  8.14907253e-10, -8.14907253e-10, ...,
         -7.34902860e-04, -6.60274061e-04, -4.50702908e-04]),
  'sampling_rate': 16000}}
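
Because each audio entry carries both the samples and the sampling rate, the clip length in seconds is simply the number of samples divided by the rate. A minimal sketch, using a made-up sample count because the array printed above is truncated (`clip_duration_seconds` is a hypothetical helper, not part of the notebook):

```python
def clip_duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the clip length in seconds for a mono waveform."""
    return num_samples / sampling_rate

# Hypothetical clip: 48,000 samples at the 16 kHz rate shown above.
duration = clip_duration_seconds(48_000, 16_000)
print(f"{duration:.1f} s")  # 3.0 s
```

In the notebook the same check could be run as `len(dataset[0]["audio"]["array"]) / dataset[0]["audio"]["sampling_rate"]`.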

Import the pretrained Whisper processor. Define a preprocessing function that converts each raw waveform into input features and tokenizes the reference text.

In [5]:
from transformers import WhisperProcessor


# The processor bundles the log-mel feature extractor and the tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

def preprocess_function(batch):
    # Convert the raw waveform into log-mel input features.
    audio = batch["audio"]
    input_features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features

    # Store the features and the tokenized reference text (used later as labels).
    batch["input_features"] = input_features
    batch["input_ids"] = processor.tokenizer(batch["utt"]).input_ids

    return batch
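
It can help to know what shape the processor produces. The sketch below reproduces the arithmetic with the feature-extractor settings I understand to be standard for the released Whisper checkpoints (30-second window, 16 kHz audio, hop of 160 samples, 80 mel bins for whisper-tiny); the constants and helper names are written out here for illustration and are not read from the notebook.

```python
# Assumed Whisper feature-extractor settings (not read from the notebook).
SAMPLING_RATE = 16_000   # Hz
CHUNK_SECONDS = 30       # every clip is padded or truncated to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # mel bins used by whisper-tiny

def log_mel_shape() -> tuple[int, int]:
    """Shape (mel bins, frames) of one padded 30-second input."""
    n_frames = SAMPLING_RATE * CHUNK_SECONDS // HOP_LENGTH
    return N_MELS, n_frames

def encoder_positions(n_frames: int, conv_stride: int = 2) -> int:
    """Sequence length after the encoder's stride-2 convolution."""
    return n_frames // conv_stride

mels, frames = log_mel_shape()
print(mels, frames, encoder_positions(frames))  # 80 3000 1500
```

These numbers line up with the `Conv1d(80, 384, ...)` and `Embedding(1500, 384)` layers in the model printout further down. Since `return_tensors="pt"` adds a batch dimension, each stored example has shape (1, 80, 3000), which is why the data collator later calls `squeeze(0)`.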

Run the preprocessing function. Split the dataset into train, validation and test subsets in the proportion 60:20:20.

In [6]:
processed_dataset = dataset.map(preprocess_function)

train_val_dataset = processed_dataset.train_test_split(test_size=0.2, seed=42)

train_val_split = train_val_dataset["train"].train_test_split(test_size=0.25, seed=42)

processed_dataset = DatasetDict({
    "train": train_val_split["train"],
    "validation": train_val_split["test"],
    "test": train_val_dataset["test"]
})
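
The two chained splits are easy to misread, so here is the arithmetic spelled out. This is a sketch that assumes the test share is rounded up to a whole row, which agrees with the 595 test examples reported later in the notebook; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train, test) sizes, rounding the test share up (assumed rule)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total = 2974                                       # rows in the pl-PL test split
train_val, test = split_sizes(total, 0.2)          # first split: 80 / 20
train, validation = split_sizes(train_val, 0.25)   # second split: 75 / 25 of the remainder

print(train, validation, test)  # 1784 595 595
print([round(n / total, 2) for n in (train, validation, test)])  # [0.6, 0.2, 0.2]
```

Taking 25% of the remaining 80% yields 20% of the full dataset, so the final proportions are 60:20:20.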

Load the base Whisper model from the Hugging Face Hub.

In [7]:
import torch
from transformers import WhisperForConditionalGeneration


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.to(device)


Using device: cuda


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 384)
      (layers): ModuleList(
        (0-3): 4 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=384, out_features=384, bias=False)
            (v_proj): Linear(in_features=384, out_features=384, bias=True)
            (q_proj): Linear(in_features=384, out_features=384, bias=True)
            (out_proj): Linear(in_features=384, out_features=384, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          

Check how the base model transcribes Polish voice commands.

In [8]:
def transcribe(batch):
    input_features = batch["input_features"]
    input_features = torch.tensor(input_features).to(device)

    with torch.no_grad():
        predicted_ids = model.generate(input_features)

    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

examples = processed_dataset["test"].select(range(5))

results = examples.map(lambda x: {"transcription": transcribe(x)})

for result in results:
    print(f"Original Text: {result['utt']}")
    print(f"Transcription: {result['transcription'].lower()}")
    print("-" * 50)


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Original Text: wyślij maila do mojego brata i przypomnij o rocznicy ślubu
Transcription:  wysli myę latą mojego biata i przypamni o nici ślubu.
--------------------------------------------------
Original Text: przypomnij mi o jutrzejszym spotkaniu godzinę wcześniej
Transcription:  przypomnij mi o jutrzejszym spotkaniu godzinę wcześniej.
--------------------------------------------------
Original Text: graj plejlistę boba dylana
Transcription:  gra i play listę boba dylana.
--------------------------------------------------
Original Text: graj ale jazz autorki sanah
Transcription:  grei, al het rust autoorkisana.
--------------------------------------------------
Original Text: olly posłuchajmy sto jeden i trzy f. m.
Transcription:  oli posłuchajmy sto jeden i trzefam.
--------------------------------------------------
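
The run above relies on Whisper's automatic language detection, as the first warning notes. On short commands the detection may pick the wrong language, which is one possible explanation for outputs such as "al het rust". A minimal sketch of fixing the language to Polish follows; `transcribe_polish` is a hypothetical helper, and it assumes a Transformers release in which Whisper's `generate` accepts the `language` and `task` keyword arguments. The numbers reported in this notebook were produced without it.

```python
def transcribe_polish(model, processor, input_features):
    """Transcribe one example with the language fixed to Polish.

    `model` and `processor` are expected to be the Whisper objects loaded
    earlier in the notebook; `input_features` is a tensor already moved to
    the model's device.
    """
    predicted_ids = model.generate(input_features, language="pl", task="transcribe")
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

It could be called in place of the `model.generate(input_features)` step inside `transcribe`, which would make the base-model and fine-tuned results easier to compare on equal terms.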


Calculate the word error rate (WER) of the base model.

In [9]:
from evaluate import load


wer_metric = load("wer")

results = processed_dataset["test"].map(lambda x: {"transcription": transcribe(x)})

wer = wer_metric.compute(
    predictions=results["transcription"], references=results["utt"]
)


print(f"Word Error Rate (WER) on the test set: {wer:.4f}")

Word Error Rate (WER) on the test set: 0.8435
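
Part of this score likely comes from formatting rather than recognition: the references are lowercase without sentence punctuation, while the base model adds a trailing period (and the raw predictions are scored before any lowercasing). The sketch below shows the effect on the second sample printed earlier, using a small pure-Python WER and a deliberately simple normalizer. Both `word_error_rate` and `normalize` are illustrative helpers, not the `evaluate` implementation.

```python
import string

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (a simple, assumed normalizer)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

reference = "przypomnij mi o jutrzejszym spotkaniu godzinę wcześniej"
hypothesis = " przypomnij mi o jutrzejszym spotkaniu godzinę wcześniej."

print(round(word_error_rate(reference, hypothesis), 3))              # 0.143
print(word_error_rate(normalize(reference), normalize(hypothesis)))  # 0.0
```

A single trailing period turns a perfect transcription into one error in seven words. Applying the same normalizer to both predictions and references before `wer_metric.compute` would give a fairer baseline, at the cost of also stripping the periods in references such as "f. m.".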


Define the training arguments and the data collator.

In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./whisper-tiny-polish",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=15,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to="none",
)

def data_collator(features):
    # Drop the leading batch dimension added during preprocessing, then batch the features.
    input_features = [{"input_features": torch.tensor(feature["input_features"]).squeeze(0)} for feature in features]
    input_features = processor.feature_extractor.pad(input_features, return_tensors="pt")

    # Pad the tokenized transcripts to the longest one in the batch.
    labels = [{"input_ids": feature["input_ids"]} for feature in features]
    labels = processor.tokenizer.pad(labels, return_tensors="pt")

    return {
        "input_features": input_features["input_features"],
        "labels": labels["input_ids"],
    }
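
One thing to be aware of: the collator above pads the labels with the tokenizer's pad token, so padded positions still count toward the loss. A common refinement in Whisper fine-tuning recipes is to replace padding with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `pad_labels` and the token ids are hypothetical, and this step was not part of the run reported in this notebook.

```python
IGNORE_INDEX = -100  # targets with this value are skipped by the loss

def pad_labels(sequences: list[list[int]], ignore_index: int = IGNORE_INDEX) -> list[list[int]]:
    """Right-pad token-id lists to a common length with `ignore_index`."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token ids for two transcripts of different length.
batch = [[50258, 11, 12, 50257], [50258, 21, 50257]]
print(pad_labels(batch))
# [[50258, 11, 12, 50257], [50258, 21, 50257, -100]]
```

With the tensors returned by `processor.tokenizer.pad`, the equivalent would be along the lines of `labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)`.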

Fine-tune the model.

In [11]:
from transformers import Trainer
import os

os.environ["WANDB_DISABLED"] = "true"

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    data_collator=data_collator,
    tokenizer=processor.tokenizer,
)
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.311058 |
| 2 | No log | 0.296428 |
| 3 | No log | 0.295069 |
| 4 | No log | 0.297304 |
| 5 | 0.150200 | 0.295954 |
| 6 | 0.150200 | 0.299015 |
| 7 | 0.150200 | 0.293699 |
| 8 | 0.150200 | 0.289560 |
| 9 | 0.004200 | 0.291046 |
| 10 | 0.004200 | 0.289003 |





TrainOutput(global_step=1680, training_loss=0.04619488564009468, metrics={'train_runtime': 7073.8726, 'train_samples_per_second': 3.783, 'train_steps_per_second': 0.237, 'total_flos': 6.588013658112e+17, 'train_loss': 0.04619488564009468, 'epoch': 15.0})
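
The step count and the "No log" entries can be checked with a little arithmetic. The sketch assumes 1784 training rows (60% of 2974), a single device, and the Trainer's default of logging the training loss every 500 steps, since the notebook does not set `logging_steps`.

```python
import math

train_rows = 1784      # assumed: training share after the two splits
batch_size = 16        # per_device_train_batch_size, single device assumed
epochs = 15            # num_train_epochs
logging_steps = 500    # assumed Trainer default

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 112 1680

# Epochs at whose end a new training-loss value first becomes available.
loss_update_epochs = [
    math.ceil(step / steps_per_epoch)
    for step in range(logging_steps, total_steps + 1, logging_steps)
]
print(loss_update_epochs)  # [5, 9, 14]
```

The result agrees with `global_step=1680` and with the table: no training loss is shown for epochs 1 to 4, a first value appears at epoch 5, and it changes at epoch 9. The reported throughput is also consistent, since 3.783 samples per second over 7073.87 seconds is roughly 26,760 samples, that is 1784 rows times 15 epochs.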

Calculate WER for the fine-tuned model.

In [12]:
results = processed_dataset["test"].map(lambda x: {"transcription": transcribe(x)})

wer = wer_metric.compute(
    predictions=results["transcription"], references=results["utt"]
)


print(f"Word Error Rate (WER) on the test set: {wer:.4f}")

Word Error Rate (WER) on the test set: 0.3176
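
To put the two scores side by side, the following sketch computes the absolute and relative WER reduction from the values printed in this notebook. Both were measured on the same 595 test examples.

```python
wer_base = 0.8435        # base whisper-tiny, reported above
wer_finetuned = 0.3176   # after fine-tuning, reported above

absolute_reduction = wer_base - wer_finetuned
relative_reduction = absolute_reduction / wer_base
print(f"{absolute_reduction:.4f} absolute, {relative_reduction:.1%} relative")
# 0.5259 absolute, 62.3% relative
```

Some of this gain probably reflects the model learning the reference style (lowercase, no trailing period) rather than better recognition alone, so the comparison is best read together with the sample transcriptions.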


Check how the fine-tuned model transcribes Polish voice commands.

In [13]:
results = examples.map(lambda x: {"transcription": transcribe(x)})

for result in results:
    print(f"Original Text: {result['utt']}")
    print(f"Transcription: {result['transcription'].lower()}")
    print("-" * 50)

Original Text: wyślij maila do mojego brata i przypomnij o rocznicy ślubu
Transcription: wyślij maila do mojego bryata i przypomnij mi o lepszy ślubu
--------------------------------------------------
Original Text: przypomnij mi o jutrzejszym spotkaniu godzinę wcześniej
Transcription: przypomnij mi o jutrzejszym spotkaniu godzina wcześniej
--------------------------------------------------
Original Text: graj plejlistę boba dylana
Transcription: graj playlistę boba delana
--------------------------------------------------
Original Text: graj ale jazz autorki sanah
Transcription: graj ale jazz autorki sanah
--------------------------------------------------
Original Text: olly posłuchajmy sto jeden i trzy f. m.
Transcription: olly posłuchaj we z to jeden i trzy f. m.
--------------------------------------------------


Push the model and the processor to the Hugging Face Hub.

In [None]:
model.push_to_hub("whisper-tiny-polish")
processor.push_to_hub("whisper-tiny-polish")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/gs224/whisper-tiny-polish/commit/eb54d908839e48e9d829d21012e21b509e00618f', commit_message='Upload processor', commit_description='', oid='eb54d908839e48e9d829d21012e21b509e00618f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/gs224/whisper-tiny-polish', endpoint='https://huggingface.co', repo_type='model', repo_id='gs224/whisper-tiny-polish'), pr_revision=None, pr_num=None)