# Task 1: Training a Large Language Model

In this notebook, I fine-tune Mistral 7B (quantized to 4-bit) on the given PlantUML dataset with [unsloth](https://github.com/unslothai/unsloth). I chose Unsloth because it supports [QLoRA](https://github.com/artidoro/qlora), which makes it feasible to fine-tune an LLM with fewer than 10B parameters on the limited memory of the free Google Colab GPU. The fine-tuned model can be found [here](https://huggingface.co/jost/mistral7b_plantuml).
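
To see why 4-bit quantization matters on this hardware, a back-of-the-envelope estimate of the weight memory helps. The sketch below assumes roughly 7.24 billion parameters for Mistral 7B (an approximate figure that is not stated in this notebook) and ignores activations, optimizer state, LoRA adapters, and quantization overhead, so the numbers are only indicative.

```python
# Rough weight-memory estimate; every figure here is an approximation.
N_PARAMS = 7.24e9  # assumed parameter count for Mistral 7B


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions, the half-precision weights alone would nearly fill the roughly 14.7 GB that the Colab T4 reports when the model is loaded, while the 4-bit weights leave room for the adapters and the training state.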

#### Install dependencies

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

#### Load pre-trained Mistral 7B

In [None]:
from unsloth import FastLanguageModel
import torch

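max_seq_length = 2048 # Maximum number of tokens per training example
dtype = None # None = auto-detect (float16 on the Tesla T4, bfloat16 on Ampere or newer GPUs)
load_in_4bit = True # Use 4-bit quantization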

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth




#### Define LoRA parameters
I use the default LoRA parameters from the unsloth [repo](https://github.com/unslothai/unsloth): rank-16 adapters (`r = 16`, `lora_alpha = 16`) on all attention and MLP projection layers.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
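
The number of trainable parameters follows directly from the LoRA settings: each adapted linear layer with input size `d_in` and output size `d_out` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below works this out under the assumption that Mistral 7B uses a hidden size of 4096, a key/value projection size of 1024, and an MLP width of 14336; these dimensions come from the published model config, not from this notebook.

```python
# Count LoRA parameters: each adapted layer (d_in -> d_out) adds
# matrix A (r x d_in) and matrix B (d_out x r). bias = "none" adds nothing.
R = 16           # LoRA rank, as in the cell above
NUM_LAYERS = 32  # transformer layers patched by Unsloth

# Assumed Mistral 7B dimensions (not shown in this notebook).
HIDDEN = 4096
KV_DIM = 1024         # 8 key/value heads x 128 dimensions per head
INTERMEDIATE = 14336  # MLP width

# (d_in, d_out) for every target module in one transformer layer
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter of rank r."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

The total agrees with the `Number of trainable parameters = 41,943,040` line that the trainer prints when training starts.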


#### Load and prepare the dataset
After inspecting the dataset, I noticed that about half of the examples contain no PlantUML code at all: instead of a diagram, the response is a detailed bullet-point description. Since the training objective is to generate PlantUML code from a given description, I filter these examples out with a simple regular expression. The dataset also has some minor irregularities (e.g. typos, repetitions) that I ignore here. For the best model performance they should be cleaned up, especially when the dataset is rather small.

In [None]:
from datasets import load_dataset
import re

dataset = load_dataset("coai/plantuml_generation")

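# Keep only examples whose response is a complete @startuml ... @enduml block
# placed directly after the [/INST] tag.
pattern = re.compile(r'^<s>\[INST\].*?\[\/INST\]@startuml[\s\S]*?@enduml<\/s>$', re.DOTALL)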

def filter_with_regex(example):
    text = example['text']
    return bool(pattern.match(text))

dataset = dataset['train'].filter(filter_with_regex)

print(dataset)

Generating train split:   0%|          | 0/1940 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1940 [00:00<?, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 967
})
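
To illustrate what the filter keeps and drops, the sketch below applies the same pattern to two toy strings. Both strings are made up for this illustration and are not taken from the dataset.

```python
import re

# Same pattern as in the filtering cell above.
pattern = re.compile(r'^<s>\[INST\].*?\[\/INST\]@startuml[\s\S]*?@enduml<\/s>$', re.DOTALL)

# Made-up toy examples, not taken from the dataset.
with_code = (
    "<s>[INST]\nDescription: login flow[/INST]"
    "@startuml\nUser -> System: login\n@enduml</s>"
)
without_code = (
    "<s>[INST]\nDescription: login flow[/INST]"
    "- The user opens the login page\n- The system checks the credentials</s>"
)

# True means the example would be kept by the filter.
kept = [bool(pattern.match(text)) for text in (with_code, without_code)]
print(kept)
```

Only the first string passes, because its response starts with `@startuml` right after `[/INST]` and ends with `@enduml</s>`.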


In [None]:
dataset["text"][0]

'<s>[INST]\nFor the given description, generate \na Sequence diagram diagram using plantuml. \nDescription: Use Case Name: Patient Registration\nUse Case ID: HC-001\n\nUse Case Description:\nThis use case describes the process of registering a new patient in a healthcare system.\n\nUse Case Actors:\n1. Front desk staff\n2. Patient\n\nUse Case Triggers:\n- A new patient arrives at the healthcare facility and wants to register.\n\nUse Case Preconditions:\n- The patient has not been registered in the system before.\n- The front desk staff is available to assist the patient.\n\nUse Case Postconditions:\n- The patient\'s information is recorded in the healthcare system.\n- The patient is assigned a unique identification number.\n\nUse Case Flow:\n1. The patient approaches the front desk and expresses the intention to register.\n2. The front desk staff welcomes the patient and requests basic information such as name, date of birth, address, contact number, and insurance details.\n3. The fron

#### Training config

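I use the default training config from the unsloth [repo](https://github.com/unslothai/unsloth). Note that `max_steps = 60` takes precedence over `num_train_epochs`: with an effective batch size of 8 (2 per device × 4 gradient accumulation steps), training covers 480 of the 967 examples, roughly half an epoch.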

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Takes precedence over num_train_epochs
        num_train_epochs = 1, # Ignored because max_steps is set
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
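
The warning above means that the run length is set by `max_steps`, not by the epoch count. The arithmetic below is a small sketch of how much of the filtered dataset the run actually sees, using only the values from the config cell and the dataset size printed earlier.

```python
# Values taken from the training config and the filtered dataset above.
NUM_EXAMPLES = 967
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
NUM_GPUS = 1
MAX_STEPS = 60

# Examples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
# Examples consumed over the whole run.
examples_seen = effective_batch * MAX_STEPS
# Fraction of one pass over the dataset.
epochs_covered = examples_seen / NUM_EXAMPLES

print(effective_batch, examples_seen, f"{epochs_covered:.2f}")
```

With these settings the model sees about half of the filtered examples once, so raising `max_steps` or removing it in favour of `num_train_epochs` would be the way to train on the full dataset.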


#### Start training

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 967 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 0.8073 |
| 2 | 0.8166 |
| 3 | 0.7662 |
| 4 | 0.6926 |
| 5 | 0.6336 |
| 6 | 0.6364 |
| 7 | 0.5628 |
| 8 | 0.4947 |
| 9 | 0.4352 |
| 10 | 0.4477 |
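
The training config combines `warmup_steps = 5` with `lr_scheduler_type = "linear"`, so the learning rate ramps up over the first five steps and then decays linearly towards zero at step 60. The function below is a plain-Python sketch of that shape as I understand the linear warmup schedule; it is not the library code itself, and `linear_lr` is a hypothetical helper introduced only for illustration.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate after `step` optimizer updates (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```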


#### Upload model weights to Hugging Face

In [None]:
model.push_to_hub("jost/mistral7b_plantuml", token = "INSERT_YOUR_TOKEN_HERE")
tokenizer.push_to_hub("jost/mistral7b_plantuml", token = "INSERT_YOUR_TOKEN_HERE")

Saved model to https://huggingface.co/jost/mistral7b_plantuml
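
Rather than pasting the token into the notebook, it can be read from the environment so that it never ends up in a shared copy. The sketch below is one possible way to do this; `read_hf_token` is a hypothetical helper, and the variable name `HF_TOKEN` is an assumption about how the token is stored.

```python
import os


def read_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Return the token stored in an environment variable (hypothetical helper)."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before uploading.")
    return token
```

The result could then be passed as `token = read_hf_token()` to the two `push_to_hub` calls above.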


**Note:** An example of querying the fine-tuned model can be found in the `inference.ipynb` file.
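
When querying the model, the generated text usually needs a small post-processing step to isolate the diagram. The sketch below shows one way to pull the first `@startuml ... @enduml` block out of a string. It is independent of what `inference.ipynb` does, `extract_plantuml` is a hypothetical helper, and the sample string is made up.

```python
import re
from typing import Optional

# Non-greedy match of the first @startuml ... @enduml block.
PLANTUML_BLOCK = re.compile(r"@startuml[\s\S]*?@enduml")


def extract_plantuml(generated_text: str) -> Optional[str]:
    """Return the first PlantUML block in the text, or None if there is none."""
    match = PLANTUML_BLOCK.search(generated_text)
    return match.group(0) if match else None


# Made-up sample output, not produced by the model.
sample = "[INST] Description: login flow [/INST]@startuml\nUser -> System: login\n@enduml</s>"
print(extract_plantuml(sample))
```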