# OpenAI Whisper Fine-tuning for Korean ASR with HuggingFace Transformers

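Before starting: if you run this work on Colab, we recommend using a CPU runtime up through the data preprocessing stage. After preprocessing, upload the preprocessed dataset to the Hugging Face Hub, then switch the runtime type to GPU, load the preprocessed dataset, and train the model. This avoids spending Colab GPU compute on preprocessing.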

### 1. Prepare Environment

In [1]:
# Log in so the trained model can be uploaded to the Hugging Face Hub.
# Logging in is also required to access private or gated datasets.
from huggingface_hub import notebook_login
notebook_login()


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
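# Quote version specifiers and extras so the shell does not treat ">" as a redirect or "[...]" as a glob.
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install accelerate -U
!pip install "transformers[torch]"
!pip install wandb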


In [None]:
from datasets import Dataset, DatasetDict
from datasets import Audio

In [None]:
from datasets import load_dataset

### 2. Youtube Dataset download from Hugging Face

Download the YouTube dataset from Hugging Face. It has already been preprocessed (log-mel spectrogram extraction and normalization).
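For orientation, the sketch below works out the shape of one preprocessed example. It assumes the features were produced with the default `WhisperFeatureExtractor` settings for `whisper-small` (16 kHz audio, a 30-second window, hop length 160, 80 mel bins). The preprocessing notebook is not shown here, so these values are assumptions and not confirmed settings.

```python
# Assumed WhisperFeatureExtractor defaults (not confirmed by this notebook).
SAMPLING_RATE = 16_000   # Hz
CHUNK_LENGTH_S = 30      # every clip is padded or truncated to 30 seconds
HOP_LENGTH = 160         # samples between successive STFT frames
N_MELS = 80              # mel filter banks used by whisper-small


def log_mel_shape(sampling_rate=SAMPLING_RATE, chunk_length_s=CHUNK_LENGTH_S,
                  hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (n_mels, n_frames) for one fixed-length Whisper input."""
    n_samples = sampling_rate * chunk_length_s   # samples in one padded clip
    n_frames = n_samples // hop_length           # one frame per hop
    return n_mels, n_frames


input_shape = log_mel_shape()
print(input_shape)  # (80, 3000)
```

Because every example has the same fixed shape under these assumptions, the audio inputs need no extra padding at batch time; only the label sequences vary in length.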

In [None]:
from datasets import load_dataset, DatasetDict, concatenate_datasets

# Set the cache directory (optional)
# cache_dir = "your_cache_directory_here"

# List of domains
domains = ["arc_de", "cul_his", "eco", "edu", "job", "life", "nat_sci_it_tec", "pol", "soc", "sp_rel"]

# Load the dataset for each domain and collect its splits for merging
merged_datasets = {
    "train": [],
    "test": [],
    "valid": []
}

for domain in domains:
    dataset_name = f"svenskpotatis/Youtube_{domain}_preprocessed"
    dataset = load_dataset(dataset_name)    #cache_dir=cache_dir

    merged_datasets["train"].append(dataset["train"])
    merged_datasets["test"].append(dataset["test"])
    merged_datasets["valid"].append(dataset["valid"])

# Concatenate the per-domain splits into a single DatasetDict
youtube_dataset_preprocessed = DatasetDict({
    "train": concatenate_datasets(merged_datasets["train"]),
    "test": concatenate_datasets(merged_datasets["test"]),
    "valid": concatenate_datasets(merged_datasets["valid"])
})

# Inspect the final dataset
youtube_dataset_preprocessed


### 3. Finetuning

In [None]:
# Check whether a GPU is available
import torch
print(torch.cuda.is_available())

In [None]:
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

#### 3-1. Load a Pre-Trained Checkpoint

In [None]:
# Load the fine-tuned model
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("NexoChatFuture/whisper-small-youtube-extra")
feature_extractor = WhisperFeatureExtractor.from_pretrained("NexoChatFuture/whisper-small-youtube-extra")
tokenizer = WhisperTokenizer.from_pretrained("NexoChatFuture/whisper-small-youtube-extra-tokenizer")
processor = WhisperProcessor.from_pretrained("NexoChatFuture/whisper-small-youtube-extra-processor")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

> Evaluation Metrics

In [None]:
import evaluate

cer_metric = evaluate.load('cer')
wer_metric = evaluate.load('wer')

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with pad_token_id so that the labels can be decoded
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # Skip special tokens when decoding so they are excluded from the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)
    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)

    return {"cer": cer, "wer": wer}

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Inputs and labels have different lengths and need different padding methods, so handle them separately.
        # First, simply convert the audio inputs to torch tensors.
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Get the tokenized label sequences.
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # Pad the label sequences to the longest sequence in the batch.
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding tokens with -100 so that they are ignored in the loss computation.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If a bos token was prepended during the earlier tokenization step, cut it off here.
        # It can be added back at any later point.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
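The label handling in the collator can be followed on plain Python lists. The sketch below mirrors the same three steps (right-pad, mask padding with -100, strip a shared leading bos token) without tensors. The function name `pad_and_mask_labels` and the token ids are hypothetical, chosen only for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_and_mask_labels(label_seqs, pad_id, bos_id):
    """Pad label sequences to equal length, mask padding, drop a shared leading bos."""
    max_len = max(len(seq) for seq in label_seqs)
    batch = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = seq + [pad_id] * n_pad            # step 1: right-pad with pad_id
        attention = [1] * len(seq) + [0] * n_pad   # 1 marks a real token
        # step 2: positions with attention 0 become -100 and are skipped by the loss
        masked = [tok if keep else IGNORE_INDEX
                  for tok, keep in zip(padded, attention)]
        batch.append(masked)
    # step 3: if every row starts with bos, remove that first column
    if all(row[0] == bos_id for row in batch):
        batch = [row[1:] for row in batch]
    return batch


BOS, PAD = 1, 0  # hypothetical token ids
labels = pad_and_mask_labels([[1, 7, 8, 9], [1, 5]], pad_id=PAD, bos_id=BOS)
print(labels)  # [[7, 8, 9], [5, -100, -100]]
```

Masking by attention mask and not by token id matters when the pad token shares an id with a real token, since only positions added by padding are ignored.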


In [None]:
# Initialize the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### 4-1. Define the Training Arguments

In [None]:
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback, Seq2SeqTrainer

early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3)
model_checkpoint = "/content/drive/MyDrive/NexoChat_share_data/STT Model/finetunning checkpoint/checkpoint-10000"

# Set the training parameters
training_args = Seq2SeqTrainingArguments(
    output_dir="NexoChatFuture/whisper-small-ko",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=15000,
    num_train_epochs=1,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    push_to_hub=True,
    resume_from_checkpoint=model_checkpoint  # resume from the checkpoint
)

# Set up the trainer
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=youtube_dataset_preprocessed["train"],
    eval_dataset=youtube_dataset_preprocessed["valid"],  # or "test"
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[early_stopping_callback]
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
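The training arguments set `learning_rate=1e-5`, `warmup_steps=500`, and `max_steps=15000` but no `lr_scheduler_type`. Assuming the library default of a linear schedule applies, the learning rate would rise linearly during warmup and then decay linearly to zero. The function below is a small re-derivation of that shape for illustration, not the library's code. The name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=15_000):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay: scale from peak_lr down to 0 at max_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 250, 500, 7_750, 15_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 500 and is back to half of the peak at step 7750, the midpoint of the decay phase.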



### 4-2. Training

In [None]:
# Resume training from a saved checkpoint (the run logged below resumed from checkpoint-7000)
trainer.train(resume_from_checkpoint=model_checkpoint)

There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


| Step | Training Loss | Validation Loss | CER | WER |
|---|---|---|---|---|
| 7500 | 0.3951 | 0.361256 | 10.200833 | 25.492484 |
| 8000 | 0.4018 | 0.358888 | 10.066174 | 25.2825 |
| 8500 | 0.3957 | 0.356814 | 10.073854 | 25.25666 |
| 9000 | 0.3944 | 0.355761 | 10.018933 | 25.165084 |
| 9500 | 0.3879 | 0.355661 | 9.984299 | 25.121497 |
| 10000 | 0.3849 | 0.355494 | 9.982705 | 25.09438 |
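The run uses `EarlyStoppingCallback(early_stopping_patience=3)` with `metric_for_best_model="cer"` and `greater_is_better=False`. The sketch below is a simplified re-implementation of the patience logic, replayed over the CER values logged above. The function name `early_stop_step` is hypothetical, and it tracks the best value itself, which simplifies how the real callback reads the trainer state.

```python
def early_stop_step(evals, patience=3, greater_is_better=False):
    """Return the step at which training would stop, or None if it never triggers."""
    best = None
    bad_evals = 0  # consecutive evaluations without improvement
    for step, value in evals:
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None


cer_log = [(7500, 10.200833), (8000, 10.066174), (8500, 10.073854),
           (9000, 10.018933), (9500, 9.984299), (10000, 9.982705)]
print(early_stop_step(cer_log))  # None

plateau = [(1, 5.0), (2, 5.1), (3, 5.2), (4, 5.3)]  # made-up values
print(early_stop_step(plateau))  # 4
```

In the logged run only the evaluation at step 8500 fails to improve on the best CER, so the counter never reaches the patience of 3 and training continues to the last logged step.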


Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=10000, training_loss=0.11852426528930664, metrics={'train_runtime': 117391.5222, 'train_samples_per_second': 5.452, 'train_steps_per_second': 0.085, 'total_flos': 1.846744552267776e+20, 'train_loss': 0.11852426528930664, 'epoch': 5.561735261401557})
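As a sanity check on the reported numbers, the arithmetic below relates the batch settings to the `TrainOutput` fields. It assumes a single GPU (`n_devices = 1`), which the notebook does not state explicitly. The dataset size it derives is only an estimate from the reported epoch count.

```python
per_device_train_batch_size = 32
gradient_accumulation_steps = 2
n_devices = 1  # assumption: a single Colab GPU

global_step = 10_000                    # from TrainOutput
train_runtime_s = 117_391.5222          # from TrainOutput
epochs_completed = 5.561735261401557    # from TrainOutput

# Samples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices
# Total samples processed over all optimizer steps
samples_seen = effective_batch * global_step
samples_per_second = samples_seen / train_runtime_s
# Estimated size of the training split
approx_train_examples = samples_seen / epochs_completed

print(effective_batch)                # 64
print(samples_seen)                   # 640000
print(round(samples_per_second, 3))   # 5.452
print(round(approx_train_examples))   # roughly 115,000
```

The computed throughput matches the reported `train_samples_per_second` of 5.452, which is consistent with an effective batch size of 64 on one device.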

### 4-3. Model uploading to HuggingFace hub

In [None]:
kwargs = {
    "dataset_tags": "youtube",
    "dataset": "youtube-preprocessed",  # a 'pretty' name for the training dataset
    "dataset_args": "config: ko, split: valid",
    "language": "ko",
    "model_name": "whisper-small-youtube",  # a 'pretty' name for your model
    "finetuned_from": "SungBeom/whisper-small-ko",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard"
}


In [None]:
# Save the model to Google Drive
model_save_path = '/content/drive/MyDrive/NexoChat_share_data/STT Model/finetunning checkpoint/model_archive/whisper-small-ko'
processor_save_path = '/content/drive/MyDrive/NexoChat_share_data/STT Model/finetunning checkpoint/model_archive/whisper-small-ko-processor'
# Save the model and the processor (which includes the tokenizer)
trainer.save_model(model_save_path)
processor.save_pretrained(processor_save_path)
processor.save_pretrained(model_save_path)
trainer.save_state()

Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}


In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Both the model and the processor can be loaded from the paths they were saved to above.
model_upload = WhisperForConditionalGeneration.from_pretrained(model_save_path, ignore_mismatched_sizes=True)
processor_upload = WhisperProcessor.from_pretrained(processor_save_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from huggingface_hub import HfApi

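import os

# Upload to the Hugging Face Hub
api = HfApi()

# Read the API token from the environment instead of hard-coding it in the notebook.
# If HF_TOKEN is not set, token=None falls back to the credentials stored by notebook_login().
api_token = os.environ.get("HF_TOKEN")

# Upload the model
api.upload_folder(
    repo_id="NexoChatFuture/whisper-small-ko",
    folder_path=model_save_path,
    commit_message="Upload trained model and processor to Hugging Face Hub",
    token=api_token
)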


CommitInfo(commit_url='https://huggingface.co/NexoChatFuture/whisper-small-youtube/commit/4f8123a0b970f53a36cea6fa061692b5cce17c8c', commit_message='Upload trained model and processor to Hugging Face Hub', commit_description='', oid='4f8123a0b970f53a36cea6fa061692b5cce17c8c', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Commented out because it raised an error
# trainer.push_to_hub(**kwargs)

In [None]:
processor.push_to_hub("NexoChatFuture/whisper-small-ko-processor")
tokenizer.push_to_hub("NexoChatFuture/whisper-small-ko-tokenizer")

CommitInfo(commit_url='https://huggingface.co/NexoChatFuture/whisper-small-youtube-tokenizer/commit/c9ead8ce34065e3916d6fb81a5be780c32a81612', commit_message='Upload tokenizer', commit_description='', oid='c9ead8ce34065e3916d6fb81a5be780c32a81612', pr_url=None, pr_revision=None, pr_num=None)

### 5. Evaluation

#### 5-1. Model Selection

In [None]:
# Load the fine-tuned model
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("NexoChatFuture/whisper-small-ko")

feature_extractor = WhisperFeatureExtractor.from_pretrained("NexoChatFuture/whisper-small-ko")
tokenizer = WhisperTokenizer.from_pretrained("NexoChatFuture/whisper-small-ko-tokenizer")
processor = WhisperProcessor.from_pretrained("NexoChatFuture/whisper-small-ko-processor")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




#### 5-2. Training Arguments

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="NexoChatFuture/whisper-small-youtube-eval",  # enter the repository name you want
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=200,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=200,
    eval_steps=200,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="cer",  # for Korean, 'cer' is likely more suitable than 'wer'
    greater_is_better=False,
    push_to_hub=False,
)




In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=youtube_dataset_preprocessed["train"],
    eval_dataset=youtube_dataset_preprocessed["test"],  # for evaluation(not validation)
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)


Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


max_steps is given, it will override any value given in num_train_epochs


> Run Evaluation

In [None]:
trainer.evaluate()

{'eval_loss': 0.3763847351074219,
 'eval_cer': 10.624634485664574,
 'eval_wer': 26.41957956402921,
 'eval_runtime': 11220.8664,
 'eval_samples_per_second': 1.282,
 'eval_steps_per_second': 0.16}
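The reported fields allow two quick derived figures: the approximate number of test examples, and the gap between test metrics and the last logged validation metrics. The comparison assumes the evaluated model corresponds to the step-10000 checkpoint from the training log, which the notebook does not state explicitly.

```python
eval_runtime_s = 11_220.8664
eval_samples_per_second = 1.282  # rounded in the log, so the count below is approximate

test_cer, test_wer = 10.624634485664574, 26.41957956402921
valid_cer, valid_wer = 9.982705, 25.09438  # step 10000 in the training log

# Examples evaluated = throughput * runtime
approx_test_examples = eval_runtime_s * eval_samples_per_second

# Absolute gap, in percentage points, between test and validation
cer_gap = test_cer - valid_cer
wer_gap = test_wer - valid_wer

print(round(approx_test_examples))            # about 14,400
print(round(cer_gap, 2), round(wer_gap, 2))   # 0.64 1.33
```

Under that assumption, the test split scores slightly worse than the validation split on both metrics.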