## Deep Learning Ulaanbaatar (DLUB) 2023 - Summer School 🇲🇳

**Seminar:** Techniques for training LLMs: LoRA (Low-Rank Adaptation) on Mongolian GPT2 (part 2)

- Pretrained model: [mongolian-gpt2](https://huggingface.co/bayartsogt/mongolian-gpt2)
    - Huggingface Space: https://huggingface.co/spaces/flax-community/Mongolian-GPT2
- Load dataset using [datasets](https://github.com/huggingface/datasets) library
    - Source: [Home page of DLUB](https://sites.google.com/view/dlub/2023)
    - Dataset: [bayartsogt/test_dlub_2023](https://huggingface.co/datasets/bayartsogt/test_dlub_2023)
- Train the model using [transformers.Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)


**Disclaimer!**
- This notebook only demonstrates a fine-tuning pipeline.
- Content generated by the models might offend some people; fixing those biases is a responsibility we all share.


## Kaggle Notebook Setup

In [25]:
!pip install --upgrade accelerate
!pip install peft


### [Important] Run -> Restart & clear cell outputs

In [26]:
# `accelerate` was upgraded above so that the Trainer API can be imported
from transformers import Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

## Loading the data

A small dataset has been prepared for this demo.

Data link: https://huggingface.co/datasets/bayartsogt/test_dlub_2023

In [27]:
from datasets import load_dataset

datasets = load_dataset('bayartsogt/test_dlub_2023')


In [28]:
datasets

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 38
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 38
    })
})

In [29]:
def preprocess(example):
    example["text"] = (example["question"] + " " + example["answer"])
    return example

datasets = datasets.map(preprocess, remove_columns=["question", "answer"])


In [None]:
datasets

## Tokenize

In [30]:
MODEL_NAME = "bayartsogt/mongolian-gpt2"

In [31]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = "<pad>"

In [32]:
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=64, truncation=True, padding="max_length")

In [33]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=2, remove_columns=["text"])
tokenized_datasets

    


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 38
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 38
    })
})

In [34]:
def copy_input_ids(example):
    example["labels"] = example["input_ids"].copy()
    return example

In [35]:
tokenized_datasets = tokenized_datasets.map(copy_input_ids)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 38
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 38
    })
})

## Training

Forward pass for `GPT2LMHeadModel` class
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L1051
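
When `labels` are passed, the forward pass linked above shifts logits and labels so that position t is scored against token t + 1, which is why copying `input_ids` into `labels` is enough. The sketch below reproduces that shift in plain Python on a toy vocabulary. `log_softmax`, `next_token_loss`, and the toy numbers are illustrative names and values, not part of transformers.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def next_token_loss(logits, labels):
    """Mean cross-entropy where the logits at position t are scored against labels[t + 1]."""
    shifted_logits = logits[:-1]   # the last position has nothing left to predict
    shifted_labels = labels[1:]    # the first token is never a target
    losses = [-log_softmax(row)[target]
              for row, target in zip(shifted_logits, shifted_labels)]
    return sum(losses) / len(losses)

# Toy example: vocabulary of 3 tokens, sequence of 3 token ids.
input_ids = [0, 2, 1]
labels = list(input_ids)           # same idea as copying input_ids into labels
logits = [
    [0.0, 0.0, 5.0],   # position 0 favours token 2, which is the next token
    [0.0, 5.0, 0.0],   # position 1 favours token 1, which is the next token
    [1.0, 1.0, 1.0],   # position 2 is dropped by the shift
]

loss = next_token_loss(logits, labels)
perplexity = math.exp(loss)        # perplexity is the exponential of the mean loss
print(f"loss={loss:.4f} perplexity={perplexity:.4f}")
```

Because the toy logits favour the correct next token at both scored positions, the loss is small and the perplexity is close to 1.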

In [36]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

### Now apply LoRA

References:
- https://arxiv.org/abs/2106.09685
- https://pytorch.org/docs/stable/generated/torch.numel.html

In [37]:
from peft import PeftModel, get_peft_model, LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # task type: causal language modeling
    inference_mode=False,          # training mode, so the LoRA weights stay trainable
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=16,                 # scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout=0.3,              # dropout probability applied inside the LoRA layers
    fan_in_fan_out=True,           # GPT-2 uses Conv1D layers, which store weights as (fan_in, fan_out)
    bias="lora_only",              # train only the bias terms of layers that receive LoRA
)

In [38]:
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.23643136409814364
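
The trainable-parameter count printed above can be reproduced by hand. This is a sketch under two assumptions that the notebook does not state: that LoRA is attached only to the fused attention projection `c_attn` in each block, and that the model has GPT-2 small shapes (12 blocks, hidden size 768). Bias terms are left out of the count, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA pair: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r

# Assumed GPT-2 small shapes: 12 blocks, hidden size 768, and a fused
# query/key/value projection `c_attn` mapping 768 -> 3 * 768.
n_layers, hidden, r, lora_alpha = 12, 768, 8, 16

per_layer = lora_param_count(hidden, 3 * hidden, r)
total = n_layers * per_layer
scaling = lora_alpha / r  # the low-rank update B @ A is multiplied by this factor

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
print(f"scaling:   {scaling}")
```

Under these assumptions the total matches the 294,912 trainable parameters reported by `print_trainable_parameters()`.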


In [39]:
from transformers import Trainer, TrainingArguments
# https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments

training_args = TrainingArguments(
    "lora-mn-gpt2-on-dlub",
    
    num_train_epochs=400,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    dataloader_num_workers=2,

    evaluation_strategy = "steps",
    logging_strategy="steps",
    save_strategy="steps",
    eval_steps=30,
    logging_steps=30,
    save_steps=30,

    learning_rate=1e-3,
    weight_decay=0.01,
    save_total_limit=10,
    report_to='none',

    load_best_model_at_end=True,

    # automatic version handling with huggingface
    # push_to_hub=True,
)
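
As a rough check of what these arguments imply, the number of optimizer steps and evaluations can be worked out by hand. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; with more than one GPU the effective batch size, and therefore the step count, would differ.

```python
import math

num_examples = 38      # rows in the train split
batch_size = 32        # per_device_train_batch_size
num_epochs = 400       # num_train_epochs
eval_every = 30        # eval_steps / logging_steps / save_steps

# Assumptions: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)
```

With `save_total_limit=10`, only a subset of the checkpoints written at those steps is kept on disk.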

In [40]:
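from datasets import load_metric

# BLEU metric used by compute_metrics below. sacreBLEU is assumed because the
# result is read from its "score" key; it needs the `sacrebleu` package installed.
bleu_metric = load_metric("sacrebleu")

In [41]:
def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    print(predictions.shape, labels.shape)  # debug: array shapes
    # The Trainer passes logits of shape (batch, seq_len, vocab_size);
    # take the most likely token at each position before decoding.
    predictions = predictions.argmax(axis=-1)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = [[tokenizer.decode(l, skip_special_tokens=True)] for l in labels]
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": bleu_result["score"]}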

In [None]:
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics
)

In [None]:
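train_output = trainer.train()
print(train_output)
# Collect the BLEU value logged at each evaluation (compute_metrics keys get an "eval_" prefix).
bleu_scores = [log["eval_bleu"] for log in trainer.state.log_history if "eval_bleu" in log]
print("BLEU scores per evaluation:", bleu_scores)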

In [None]:
trainer.model.save_pretrained("lora-mn-gpt2-on-dlub")

## Running `generate`

In [None]:
from transformers import Trainer, TrainingArguments
from nltk.translate.bleu_score import sentence_bleu
import torch

In [None]:
# prompt = "DLUB-ийн хэд дэх удаагийн сургалт явагдах гэж байгаа вэ?"  # in train data
prompt = "DLUB-гийн зохион байгуулагч Я.Баярцогт нь ямар байгууллагат ажилладаг вэ?"  # NOT in train data - but similar
encoded_prompt = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt = encoded_prompt.to(trainer.model.device)

# prediction
output_sequences = trainer.model.generate(
    input_ids=encoded_prompt,
    max_length=64,
    min_length=1,
    temperature=1.,
    top_p=0.95,
    do_sample=True,
    num_return_sequences=10,
    pad_token_id=tokenizer.pad_token_id,
)

generated_sequences = []

# decode prediction
for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    generated_sequence = generated_sequence.tolist()
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=False)
    generated_sequences.append(text.strip())
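
The `top_p=0.95` and `do_sample=True` arguments above select nucleus sampling. The sketch below shows the idea on a toy distribution: keep the smallest set of most likely tokens whose total probability reaches `top_p`, renormalize, and sample from that set. It is a simplified illustration, not the transformers implementation; `top_p_filter`, `sample`, and the probabilities are made up for the example.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total probability reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(filtered, rng):
    """Draw one token id from the renormalized distribution."""
    token_ids = list(filtered)
    weights = [filtered[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.60, 0.30, 0.06, 0.03, 0.01]
filtered = top_p_filter(probs, top_p=0.95)
rng = random.Random(0)
samples = [sample(filtered, rng) for _ in range(10)]
print(filtered)
print(samples)
```

Here the two least likely tokens fall outside the nucleus, so they can never be sampled, while the remaining three keep their relative proportions.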

BLEU for the generated texts (`sentence_bleu` is used below)

In [None]:
generated_sequences

In [None]:
from nltk.translate.bleu_score import sentence_bleu

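# Toy Russian sentences: "this is a reference text in Russian" vs. "this is a generated text in Russian"
reference = [['это', 'эталонный', 'текст', 'на', 'русском']]
candidate = ['это', 'сгенерированный', 'текст', 'на', 'русском']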

bleu_score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score}")

BLEU measures the similarity between the generated text and a reference text by comparing n-grams (contiguous sequences of n tokens).

1) N-gram precision:
The precision of each n-gram order (unigram, bigram, trigram, and so on) is computed for the generated text against the reference text.

2) Brevity penalty:
A penalty is applied when the generated text is shorter than the reference, so that very short outputs cannot score well on precision alone.

3) BLEU score:
The final BLEU score is the geometric mean of the n-gram precisions multiplied by the brevity penalty.
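
The three steps can be written out directly. The following is a minimal, unsmoothed sketch for a single reference with uniform weights; it is not the NLTK implementation, and `ngrams`, `modified_precision`, `brevity_penalty`, and `simple_bleu` are illustrative names.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def brevity_penalty(reference, candidate):
    """1 if the candidate is at least as long as the reference, exp(1 - r/c) otherwise."""
    r, c = len(reference), len(candidate)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights and a single reference."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean collapses when any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(reference, candidate) * math.exp(log_mean)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
print(simple_bleu(reference, candidate, max_n=2))
```

Without smoothing, a short candidate that shares no 4-gram with the reference scores zero, which is one reason sentence-level BLEU on short texts is often computed with a smoothing function or a lower `max_n`.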

In [None]:
reference_bleu = [tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True, skip_special_tokens=False).split()]
reference_bleu

In [None]:
for idx, generated_text in enumerate(generated_sequences):
    candidate = generated_text.split()
    bleu_score = sentence_bleu(reference_bleu, candidate)
    print(f"BLEU Score {idx + 1}: {bleu_score}")

average_bleu_score = sum([sentence_bleu(reference_bleu, generated_text.split()) for generated_text in generated_sequences]) / len(generated_sequences)
print(f"Average BLEU Score: {average_bleu_score}")

In [None]:
!pip install rouge

In [None]:
from rouge import Rouge

In [None]:
references_rouge = tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True, skip_special_tokens=False).strip()
references_rouge

In [None]:
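rouge = Rouge()

all_rouge_scores = []
for idx, generated_text in enumerate(generated_sequences):
    rouge_scores = rouge.get_scores(generated_text, references_rouge)
    all_rouge_scores.extend(rouge_scores)  # get_scores returns a list with one dict per pair
    print(f"ROUGE Scores {idx + 1}: {rouge_scores}")

# Average over all generated sequences, not only the last one.
average_rouge_1 = sum(score["rouge-1"]["f"] for score in all_rouge_scores) / len(all_rouge_scores)
print(f"Average ROUGE-1: {average_rouge_1}")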


In [None]:
generated_sequences[0]

ROUGE metrics measure the similarity between two texts; depending on the chosen metric, the comparison is made at the word or sentence level (for example, rouge-1 compares unigrams).
The ROUGE-1 F1 score measures the unigram overlap between the generated answer and the reference answer.

ROUGE metrics include, for example:

rouge-1: unigrams
rouge-2: bigrams
rouge-l: longest common subsequence
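
To make the definitions concrete, here is a minimal sketch of ROUGE-1 and of the longest common subsequence that underlies rouge-l, using whitespace tokens. The `rouge` package may tokenize and aggregate differently, so its numbers need not match; `rouge_1` and `lcs_length` are illustrative names.

```python
from collections import Counter

def rouge_1(reference, candidate):
    """Unigram overlap: returns (precision, recall, f1) on whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_1(reference, candidate))
print(lcs_length(reference.split(), candidate.split()))
```

Precision divides the overlap by the candidate length and recall divides it by the reference length, so the F1 value penalizes both padded and truncated answers.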

In [None]:
import torch

In [None]:
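# Perplexity of each generated sequence under the fine-tuned model.
for idx, generated_text in enumerate(generated_sequences):
    input_ids = tokenizer(generated_text, return_tensors="pt").input_ids.to(trainer.model.device)
    with torch.no_grad():
        logits = trainer.model(input_ids=input_ids).logits
        # Position t predicts token t + 1, so drop the last logit and the first target.
        target_ids = input_ids[:, 1:].contiguous()
        loss = torch.nn.functional.cross_entropy(logits[:, :-1, :].reshape(-1, logits.shape[-1]), target_ids.view(-1))

    perplexity = torch.exp(loss)
    print(f"Perplexity {idx + 1}: {perplexity.item():.2f}")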

## How to load it later?

In [None]:
prompt_in_train = "DLUB-гийн зохион байгуулагч Я.Баярцогт нь хаана ажилладаг вэ?"  # in train data
prompt_not_in_train = "DLUB-гийн зохион байгуулагч Я.Баярцогт нь ямар байгууллагат ажилладаг вэ?"  # NOT in train data - but similar
encoded_prompt_in_train = tokenizer(prompt_in_train, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt_not_in_train = tokenizer(prompt_not_in_train, add_special_tokens=False, return_tensors="pt").input_ids

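# Load a fresh base model so the saved adapter is not stacked on the already-wrapped `model`.
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
_model = PeftModel.from_pretrained(base_model, "lora-mn-gpt2-on-dlub").to("cpu")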

for _encoded_prompt in [encoded_prompt_in_train, encoded_prompt_not_in_train]:
    output_sequences = _model.generate(
        input_ids=_encoded_prompt,
        max_length=64,
        min_length=10,
        temperature=1.,
        top_p=0.95,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
    )

    text = tokenizer.decode(output_sequences[0], clean_up_tokenization_spaces=True, skip_special_tokens=False)

    # Simplify the output for the demo:
    question, answer = text.split("?")[:2]
    answer = answer.split(".")[0]
    print(question + "?", answer)