# Fine Tune the Viking LLM

This notebook fine-tunes the Viking LLM to perform grammatical error correction (GEC), using either minimal edits or fluency edits.

Choose the model and the edit version further down.

## Imports

Import all relevant packages.


In [None]:
from prompts import minimal_prompt, fluency_prompt
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
    Trainer,
    DataCollatorForSeq2Seq,
)
from datasets import load_from_disk
import torch
from os import path, makedirs
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from tqdm.notebook import tqdm

## Ensure GPU is available

In [None]:
if not torch.cuda.is_available():
    raise RuntimeError("GPU is not available for training!")
device = "cuda"

## Variables

The `version` variable can be either "minimal" or "fluency".

The `model_name` variable can be any Hugging Face model name, e.g., "LumiOpen/Viking-7B".

The `model_label` variable is the part of `model_name` after the slash, e.g., "Viking-7B".


In [None]:
version = "minimal"
model_name = "LumiOpen/Viking-7B"
model_label = model_name.split("/")[1]
MAX_LENGTH = 4096  # Well above the longest token sequence

### Verify Version

Verify that the value of `version` is valid and raise a `ValueError` otherwise.

In [None]:
if version not in ["minimal", "fluency"]:
    raise ValueError("Invalid version.")

## Model

### Setup Quantization Config

-   Load the LLM weights in the 4-bit NormalFloat (`nf4`) data type.
-   Do not double quantize, i.e., do not quantize the quantization constants.
-   Perform computations in the brain floating point 16 (`bfloat16`) data type.


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
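
To give a rough sense of why the weights are loaded in 4 bits, the sketch below estimates the memory needed for the weights alone at two precisions. The parameter count is an assumed round figure for a 7B model, not the exact size of Viking-7B, and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed round figure for a 7B-parameter model (not the exact Viking-7B count).
NUM_PARAMS = 7_000_000_000

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4 bits per weight

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"nf4 weights:      {nf4_gib:.1f} GiB")
```

Under these assumptions, the 4-bit weights take a quarter of the memory of the 16-bit weights.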

### Load Model

Load the model with the above quantization config.

The last two lines enable gradient checkpointing and prepare the quantized model for LoRA training.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

### Load Tokenizer and Data Collator

The tokenizer converts the input text into tokens for the LLM to use.

The data collator groups the tokenized input essays into batches.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

## Dataset

### Load Prompt

Load the prompt corresponding to the `version` variable.

In [None]:
prompts = {"minimal": minimal_prompt, "fluency": fluency_prompt}

prompt = prompts[version]

### Load Base Dataset

Load the base (untokenized) dataset from disk.

In [None]:
dataset_path = path.join("datasets", version)
dataset = load_from_disk(dataset_path)

### Tokenize Dataset

Wrap each source-target input pair in a prompt, which looks like this:

```markdown
### Instruktioner:
<CORRECTION_PROMPT>

### Indata:
<SOURCE_TEXT>

### Utdata:
<TARGET_TEXT>


```


In [None]:
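def preprocess_function(examples):
    # Build one full prompt (instructions, source, and target) per example
    inputs = [
        f"### Instruktioner:\n{prompt}\n\n### Indata:\n{source}\n\n### Utdata:\n{target}\n"
        for source, target in zip(examples["source"], examples["target"])
    ]

    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
    )

    # A causal LM needs labels aligned token-for-token with input_ids, so copy
    # the input ids and set padding positions to -100, which the loss ignores.
    # Note that the loss then covers the whole prompt, not only the target.
    model_inputs["labels"] = [
        [token if attended == 1 else -100 for token, attended in zip(ids, mask)]
        for ids, mask in zip(model_inputs["input_ids"], model_inputs["attention_mask"])
    ]

    return model_inputs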


tokenized_dataset = dataset.map(preprocess_function, batched=True)
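
A common convention for causal language models is that `labels` has the same length as `input_ids`, with positions that should not contribute to the loss set to -100. The sketch below illustrates that convention on made-up token ids. The helper `mask_labels` and the choice to also mask the prompt tokens are assumptions for illustration, not part of this notebook.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default


def mask_labels(input_ids, attention_mask, prompt_length):
    """Copy input_ids into labels, ignoring prompt and padding positions.

    `mask_labels` is a hypothetical helper used only for illustration.
    """
    labels = []
    for position, (token_id, attended) in enumerate(zip(input_ids, attention_mask)):
        if position < prompt_length or attended == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels


# Made-up token ids: 3 prompt tokens, 2 target tokens, 2 padding tokens (id 0).
input_ids = [11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]

labels = mask_labels(input_ids, attention_mask, prompt_length=3)
print(labels)
```

Only the two target positions keep their token ids, so only they would contribute to the training loss.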

## Trainer

### Setup LoRA

Important arguments are explained in the list below:

- Use rank $ r = 128 $: each targeted weight matrix $ W \in \mathbb{R}^{ N \times M } $ stays frozen, and a trainable low-rank update $ A B $ is added to it, with $ A \in \mathbb{R}^{ N \times r } $ and $ B \in \mathbb{R}^{ r \times M } $, where $ r \ll \min ( N, M ) $.
- Use $ \alpha = 64 $ to scale the matrix product $ A B $ by the factor $ \alpha / r = 0.5 $.
- Target the query, value, and key projection matrices $ W^{ Q } $, $ W^{ V } $, and $ W^{ K } $ (`q_proj`, `v_proj`, and `k_proj`).

Then use the LoRA config to prepare the LLM for PEFT training.

In [None]:
lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    bias="none",
    target_modules=["q_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.config.use_cache = False
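
To make the rank and scaling arguments above concrete, the sketch below counts the trainable parameters that the two low-rank matrices add for a single weight matrix and computes the scaling factor. The dimensions $ N = M = 4096 $ are an assumption chosen for illustration, not values read from the Viking configuration.

```python
def lora_trainable_params(n: int, m: int, r: int) -> int:
    """Parameters in A (n x r) plus B (r x m) for one adapted weight matrix."""
    return n * r + r * m


# Assumed dimensions of one projection matrix (hypothetical, for illustration).
N, M = 4096, 4096
r, alpha = 128, 64  # Same values as in the LoRA config above

full_params = N * M                            # Size of the frozen matrix W
lora_params = lora_trainable_params(N, M, r)   # Size of A and B together
scale = alpha / r                              # Factor applied to the product A B

print(f"Frozen W:       {full_params:,} parameters")
print(f"Trainable A, B: {lora_params:,} parameters")
print(f"Fraction:       {lora_params / full_params:.2%}")
print(f"Scale:          {scale}")
```

Under these assumed dimensions, the adapter trains 1,048,576 parameters per matrix instead of 16,777,216.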

### Setup Epochs and Batch Size

Begin by setting the number of epochs and the batch size.


In [None]:
epochs = 3
batch_size = 4

### Setup Model Directory

Define the directory to save the trained model in as `./models/<model_label>/<version>`.


In [None]:
model_dir = path.join("models", model_label, version)
makedirs(model_dir, exist_ok=True)  # Ensure directory exists

### Setup Training Arguments

Important arguments are explained below:

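- Use the 8-bit AdamW optimizer from bitsandbytes (`adamw_bnb_8bit`).
- Use an initial learning rate of $ 5 \times 10^{ - 5 } $. Because `lr_scheduler_type` is not set, the default linear decay schedule of `TrainingArguments` applies, so the rate is not constant during training.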

In [None]:
training_arguments = TrainingArguments(
    num_train_epochs=epochs,
    logging_steps=1,
    prediction_loss_only=True,
    optim="adamw_bnb_8bit",
    learning_rate=5e-5,
    bf16=True,
    per_device_train_batch_size=batch_size,
    label_names=["labels"]
)
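
As a rough guide to how long a run is, the sketch below computes the number of optimizer steps from the dataset size, batch size, and number of epochs. It assumes a single GPU, no gradient accumulation, and a final partial batch that is kept. The dataset size of 1000 examples is a hypothetical value, not the size of the dataset used here.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # Last batch may be partial
    return steps_per_epoch * epochs


# Hypothetical dataset size; batch size and epochs match the cells above.
NUM_EXAMPLES = 1000
steps = total_optimizer_steps(NUM_EXAMPLES, batch_size=4, epochs=3)

print(f"Total optimizer steps: {steps}")
```

With `logging_steps=1`, the training loss would be logged once for each of these steps.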

### Initialize Trainer

In [None]:
trainer = Trainer(
    model=peft_model,
    args=training_arguments,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
)

## Train

Train the model.

Training loss is logged at every training step (`logging_steps=1`), since the dataset is small.


In [None]:
trainer.train()

## Save Locally

Save the model and tokenizer locally.

Do **not** push to hub.

In [None]:
peft_model.save_pretrained(model_dir, push_to_hub=False)
tokenizer.save_pretrained(model_dir, push_to_hub=False)
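
A minimal sketch of how the saved adapter might be loaded again for inference is given below. The helpers `build_inference_prompt` and `correct_text` are hypothetical, as is the `max_new_tokens=256` limit. The sketch assumes the same prompt template as in training, with the output section left empty for the model to fill in.

```python
def build_inference_prompt(prompt: str, source: str) -> str:
    """Format a source text like the training prompt, leaving the output empty."""
    return f"### Instruktioner:\n{prompt}\n\n### Indata:\n{source}\n\n### Utdata:\n"


def correct_text(model_name: str, model_dir: str, prompt: str, source: str) -> str:
    """Load the base model plus the saved adapter and generate a correction.

    Hypothetical helper; the heavy imports are local so that the prompt
    builder above can be used without a GPU stack installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, model_dir)  # Attach saved adapter
    model.eval()

    text = build_inference_prompt(prompt, source)
    inputs = tokenizer(text, return_tensors="pt").to(base_model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)  # Assumed limit

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


example = build_inference_prompt("<CORRECTION_PROMPT>", "<SOURCE_TEXT>")
print(example)
```

A call such as `correct_text(model_name, model_dir, prompt, "...")` would then return the generated correction, given a GPU and the saved adapter directory.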