# General Breakdown of the ABB-LLM Model:

## Motivation:
* As a Tool for Translation, Summarization, and QA Tasks: These tasks require the generation of new text.
* As a Baseline for Classification, Named Entity Recognition (NER), and Other Tasks: These tasks require understanding of the text.

## What to Expect?

Small LLM Models:
* While not as good as larger models, they can still be useful for many tasks.
* Current 7B models are not good enough for general QA tasks: small models tend to hallucinate more and are less accurate, so general QA is not advised.
* It is recommended to stick to summarization, translation, and classification tasks until better models are available.
* For tasks that take in a 'context' (document, search result, etc.) and only use the information in the context to generate the output (e.g., summarization, translation, classification, NER), small models can be effective.
* The svercoutere/llama-3-8b-instruct-abb model is fine-tuned specifically for these 'simpler' tasks and should perform better than the base 8B model.
* It is trained to return JSON output, which is easier to work with than the default output of the 8B model, which tends to add a lot of noise or unnecessary information (being chatty).


### Use Cases (all tasks where the context/facts are provided):
* Summarization: Summaries of any text (e.g., agenda items, BPMN files).
* Translation: Simple translations of text (e.g., agenda items, BPMN files).
* Classification: Classify text into any hierarchy (e.g., agenda items, BPMN files).
* Named Entity Recognition (NER): Extract entities from text (e.g., agenda items, BPMN files).
* Keyword Extraction: Extract keywords from text (e.g., agenda items, BPMN files).


# Long Term:
* When enough data is available, a custom model should be trained for specific tasks.
* While a (small) LLM can be used, a custom model for classification and NER tasks should function more efficiently and perform better than a model trained on general tasks.
* This can be easily achieved by fine-tuning models such as BERT, RoBERTa, etc.

# Parameter Efficient Fine-Tuning (PEFT)

Parameter Efficient Fine-Tuning (PEFT) is a resource-efficient alternative to full fine-tuning for instruction-based large language models (LLMs). Unlike full fine-tuning, which involves updating all model parameters and requires significant computational resources, PEFT updates only a subset of parameters, keeping the rest frozen. This reduces the number of trainable parameters, thus lowering memory requirements and preserving the original LLM weights to prevent catastrophic forgetting. PEFT is particularly useful for mitigating storage constraints when fine-tuning across multiple tasks. Techniques such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) exemplify effective methods for PEFT.

# LoRA and QLoRA

**LoRA (Low-Rank Adaptation):** LoRA freezes the pre-trained weights and trains two small low-rank matrices whose product approximates the update to a weight matrix; together these matrices form a LoRA adapter. After fine-tuning, the original LLM remains unchanged, while the LoRA adapter is significantly smaller (measured in MB rather than GB). During inference, the adapter is combined with the original LLM, so multiple LoRA adapters can repurpose the same base model for different tasks, reducing overall memory requirements.
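
To give a sense of scale, the snippet below counts the trainable parameters for a single weight matrix under full fine-tuning and under LoRA. It is only a back-of-the-envelope sketch: the matrix shape (4096 x 4096) is an assumption chosen for illustration, and the count assumes the usual LoRA factorization into a `d_out x r` matrix and an `r x d_in` matrix.

```python
# Illustrative parameter count for one LoRA-adapted weight matrix.
# The dimensions are assumptions chosen for this example.
d_out, d_in = 4096, 4096   # shape of the frozen weight matrix W
r = 8                      # LoRA rank

full_params = d_out * d_in          # parameters updated by full fine-tuning
lora_params = r * (d_out + d_in)    # B is (d_out x r), A is (r x d_in)
ratio = lora_params / full_params

print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {ratio:.2%}")
```

With these numbers the adapter trains well under 1% of the parameters of the matrix it adapts, which is why adapters are so much smaller than the base model.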

**QLoRA (Quantized Low-Rank Adaptation):** QLoRA builds on LoRA by quantizing the frozen base model weights to lower precision, typically 4-bit, while the LoRA adapter itself is trained in higher precision. This further reduces the memory footprint during fine-tuning. Despite the reduced bit precision of the base weights, QLoRA maintains performance levels comparable to LoRA, optimizing memory usage without compromising effectiveness.
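
As a rough illustration of why 4-bit precision matters, the sketch below estimates the memory needed just to store the weights of a model at 16-bit and at 4-bit precision. The parameter count of 8 billion is a rounded assumption, and the estimate ignores quantization metadata, activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for storing the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

n_params = 8e9  # assumption: roughly 8 billion parameters

mem_16bit = weight_memory_gib(n_params, 16)
mem_4bit = weight_memory_gib(n_params, 4)

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f" 4-bit weights: {mem_4bit:.1f} GiB")
```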

# Full Fine-Tuning vs. PEFT-LoRA vs. PEFT-QLoRA

- **Full Fine-Tuning:** Updates all model parameters, requires substantial computational resources, and has high memory and storage demands.
- **PEFT-LoRA:** Updates only small matrices, preserving most original weights, reducing memory needs, and allowing easy adaptation for multiple tasks.
- **PEFT-QLoRA:** Further reduces memory use by quantizing the frozen base model weights (typically to 4-bit), maintaining effective performance while optimizing resource usage.

![Different approaches for training an LLM](./llm_finetuning.png)

In [None]:
%%capture
!mamba install --force-reinstall aiohttp -y
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0

import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# configuration of unsloth
dataset_repo = "svercoutere/llama3_abb_instruct_dataset"  # The dataset repository

config = {
    "model_config": {
        "base_model": "unsloth/llama-3-8b-Instruct-bnb-4bit",  # The base model
        "max_seq_length": 8096,  # The maximum sequence length
        "dtype": None,  # The data type
        "load_in_4bit": True,  # Load the model in 4-bit
    },
    "lora_config": {
        "r": 8,  # The LoRA rank (common values: 8, 16, 32, 64)
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # The target modules
        "lora_alpha": 16,  # The alpha value for LoRA
        "lora_dropout": 0,  # The dropout value for LoRA
        "bias": "none",  # The bias for LoRA
        "use_gradient_checkpointing": True,  # Use gradient checkpointing
        "random_state": 3407,  # The random state
        "use_rslora": False,  # Use RSLora
        "loftq_config": None  # The LoFTQ configuration
    },
    "training_config":  {
        "dataset_input_field": "prompt",  # The input field
        
        "per_device_train_batch_size": 2,  # The batch size
        "gradient_accumulation_steps": 4,  # The gradient accumulation steps
        "warmup_steps": 5,  # The warmup steps
        "max_steps": 0,  # The maximum steps (0 if the epochs are defined)
        "num_train_epochs": 1,  # The number of training epochs(0 if the maximum steps are defined)
        "learning_rate": 2e-4,  # The learning rate
        "logging_steps": 10,  # The logging steps

        "eval_strategy": "steps",  # The evaluation strategy
        "per_device_eval_batch_size": 2,  # The batch size for evaluation
        "eval_steps": 10,  # The evaluation steps

        "save_strategy": "steps",  # The save strategy
        "save_steps": 10,  # The save steps
        "save_total_limit": 5,  # The total limit for saving
        "resume_from_checkpoint": "outputs/checkpoint-208",  # The checkpoint to resume from
        
        "optim": "adamw_8bit",  # The optimizer
        "weight_decay": 0.01,  # The weight decay
        "lr_scheduler_type": "linear",  # The learning rate scheduler
        "seed": 3407,  # The seed
        "output_dir": "outputs",  # The output directory
        "report_to": "none",  # The report destination
    }
}
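
The batch-related settings interact: gradients are accumulated over several small batches before each optimizer step. The sketch below works out the resulting effective batch size and an approximate number of optimizer steps per epoch. `num_devices` and `num_examples` are hypothetical values for illustration only; the real dataset size is not stated here.

```python
import math

per_device_train_batch_size = 2   # from training_config
gradient_accumulation_steps = 4   # from training_config
num_devices = 1                   # assumption: a single GPU
num_examples = 1000               # hypothetical dataset size

# Number of examples that contribute to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer steps in one epoch
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch}")
```

This is useful when choosing `logging_steps`, `eval_steps`, and `save_steps`: a value of 10 means one evaluation and one checkpoint every 10 optimizer steps, not every 10 examples.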

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=config["model_config"]["base_model"],
        max_seq_length=config["model_config"]["max_seq_length"],
        dtype=config["model_config"]["dtype"],
        load_in_4bit=config["model_config"]["load_in_4bit"],
    )

In [None]:
model = FastLanguageModel.get_peft_model(
        model,
        r=config["lora_config"]["r"],
        target_modules=config["lora_config"]["target_modules"],
        lora_alpha=config["lora_config"]["lora_alpha"],
        lora_dropout=config["lora_config"]["lora_dropout"],
        bias=config["lora_config"]["bias"],
        use_gradient_checkpointing=config["lora_config"]["use_gradient_checkpointing"],
        random_state=config["lora_config"]["random_state"],
        use_rslora=config["lora_config"]["use_rslora"],
        loftq_config=config["lora_config"]["loftq_config"],
    )

### Preparing Instruction Data for Llama 3 8B Instruct

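To fully utilize the Llama 3 8B Instruct model, we need to follow its prompt template. This notebook uses a simplified form of it (the official template additionally starts with `<|begin_of_text|>` and places a blank line after each `<|end_header_id|>`):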

```

<|start_header_id|>system<|end_header_id|>{ system_message }<|eot_id|>
<|start_header_id|>user<|end_header_id|>{ prompt }<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>{ output }<|eot_id|>

```

The dataset [svercoutere/llama3_abb_instruct_dataset](https://huggingface.co/datasets/svercoutere/llama3_abb_instruct_dataset) is LLM-agnostic and does not have the data in the format required by the model. We need to prepare the data in the required format. Refer to the `add_prompt_to_dataset` function for detailed implementation.

In [None]:
from datasets import load_dataset

# adding the prompt to the dataset in the llama-3-8b-instruct format

def get_prompt(system_message, prompt):    
    llama_format = f"""<|start_header_id|>system<|end_header_id|>{ system_message }<|eot_id|><|start_header_id|>user<|end_header_id|>{ prompt }<|eot_id|>"""
    return llama_format

def get_prompt_training(system_message, prompt, output):
    llama_format = f"""<|start_header_id|>system<|end_header_id|>{ system_message }<|eot_id|><|start_header_id|>user<|end_header_id|>{ prompt }<|eot_id|><|start_header_id|>assistant<|end_header_id|>{ output }<|eot_id|>"""
    return llama_format

def add_prompt_to_dataset(dataset):
    # Map row by row: with batched=True each field would be a list and the f-string would format whole lists
    dataset = dataset.map(lambda x: {"prompt": get_prompt_training(x["instruction"], x["input"], x["output"])})
    return dataset


dataset_train = load_dataset(dataset_repo, split = "train")
dataset_val = load_dataset(dataset_repo, split = "validation")

dataset_train = add_prompt_to_dataset(dataset_train)
dataset_val = add_prompt_to_dataset(dataset_val)
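
Before training it can be worth checking that the formatted prompts have the expected shape. The helper below is a small, self-contained sketch (the function name `check_training_prompt` and the sample string are hypothetical); it verifies that a string consists of exactly a system, a user, and an assistant turn in that order, using the special tokens from the template above.

```python
import re

# Roles in the order the training prompt is expected to contain them
EXPECTED_ROLES = ["system", "user", "assistant"]

# One turn: header with role name, content, end-of-turn token
TURN = re.compile(
    r"<\|start_header_id\|>(\w+)<\|end_header_id\|>(.*?)<\|eot_id\|>",
    re.DOTALL,
)

def check_training_prompt(text: str) -> bool:
    """Return True if text is exactly system, user, assistant turns in that order."""
    turns = TURN.findall(text)
    roles = [role for role, _ in turns]
    # Rebuilding the string from the parsed turns catches stray text between turns
    rebuilt = "".join(
        f"<|start_header_id|>{role}<|end_header_id|>{content}<|eot_id|>"
        for role, content in turns
    )
    return roles == EXPECTED_ROLES and rebuilt == text

sample = (
    "<|start_header_id|>system<|end_header_id|>Summarize the text.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>Some agenda item.<|eot_id|>"
    '<|start_header_id|>assistant<|end_header_id|>{"summary": "..."}<|eot_id|>'
)
print(check_training_prompt(sample))
```

In the notebook, the same check could be applied to a formatted row, for example `check_training_prompt(dataset_train[0]["prompt"])`.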

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    dataset_text_field=config["training_config"]["dataset_input_field"],
    max_seq_length=config["model_config"]["max_seq_length"],
    dataset_num_proc=2,
    packing=False,

    args=TrainingArguments(
        per_device_train_batch_size=config["training_config"]["per_device_train_batch_size"],
        gradient_accumulation_steps=config["training_config"]["gradient_accumulation_steps"],
        warmup_steps=config["training_config"]["warmup_steps"],
        max_steps=config["training_config"]["max_steps"],
        num_train_epochs=config["training_config"]["num_train_epochs"],
        learning_rate=config["training_config"]["learning_rate"],
        fp16=not torch.cuda.is_bf16_supported(), # Add it here to avoid the error with tensors being on different devices
        bf16=torch.cuda.is_bf16_supported(), # Add it here to avoid the error with tensors being on different devices
        logging_steps=config["training_config"]["logging_steps"],
        eval_strategy=config["training_config"]["eval_strategy"],
        per_device_eval_batch_size=config["training_config"]["per_device_eval_batch_size"],
        eval_steps=config["training_config"]["eval_steps"],
        save_strategy=config["training_config"]["save_strategy"],
        save_steps=config["training_config"]["save_steps"],
        save_total_limit=config["training_config"]["save_total_limit"],
        resume_from_checkpoint=config["training_config"]["resume_from_checkpoint"],
        optim=config["training_config"]["optim"],
        weight_decay=config["training_config"]["weight_decay"],
        lr_scheduler_type=config["training_config"]["lr_scheduler_type"],
        seed=config["training_config"]["seed"],
        output_dir=config["training_config"]["output_dir"],
        report_to=config["training_config"]["report_to"],
        ),
    )
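
With `lr_scheduler_type` set to `"linear"` and a few warmup steps, the learning rate first ramps up linearly and then decays linearly to zero. The function below is a minimal sketch of that shape, not the library's implementation; `total_steps=100` is a hypothetical value, since the real number depends on the dataset size and batch settings.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, total_steps: int = 100) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: go from base_lr at the end of warmup down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for s in (0, 2, 5, 50, 100):
    print(s, linear_schedule_lr(s))
```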

In [None]:
# Memory statistics before training
gpu_statistics = torch.cuda.get_device_properties(0)
reserved_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
max_memory = round(gpu_statistics.total_memory / 1024**3, 2)
print(f"Reserved Memory: {reserved_memory}GB")
print(f"Max Memory: {max_memory}GB")

In [None]:
def clear_gpu_memory(model):
    import gc

    model.cpu()
    del model
    gc.collect()
    torch.cuda.empty_cache()
    
#clear_gpu_memory(model)

In [None]:
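# TrainingArguments.resume_from_checkpoint is not used by the Trainer itself;
# to resume, pass the checkpoint to train() explicitly:
#trainer_stats = trainer.train(resume_from_checkpoint = config["training_config"]["resume_from_checkpoint"])
trainer_stats = trainer.train()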

In [None]:
# Memory statistics after training
# Peak reserved memory, measured the same way as before training so the two values are comparable
used_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
used_memory_lora = round(used_memory - reserved_memory, 2)
used_memory_percentage = round((used_memory / max_memory) * 100, 2)
used_memory_lora_percentage = round((used_memory_lora / max_memory) * 100, 2)
print(f"Used Memory: {used_memory}GB ({used_memory_percentage}%)")
print(f"Used Memory for training (fine-tuning) LoRA: {used_memory_lora}GB ({used_memory_lora_percentage}%)")

# Save the model

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
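# Save the LoRA adapters (not the merged model) locally and push them to the Hugging Face Hub.
# The tokenizer is saved and pushed separately so it is stored next to the adapters.
model.save_pretrained("llama-3-8b-instruct-abb")
tokenizer.save_pretrained("llama-3-8b-instruct-abb")
model.push_to_hub("llama-3-8b-instruct-abb")
tokenizer.push_to_hub("llama-3-8b-instruct-abb")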

In [None]:
# Saving the merged model. The variants below are alternatives: merged_16bit (float16),
# merged_4bit (int4), or GGUF with a quantization method (q8_0, q4_k_m, q5_k_m, ...).
# Run only the ones you need.

# Output directory / Hub repository name; falls back to the adapter name
# if the config has no "finetuned_model" entry.
finetuned_model = config["model_config"].get("finetuned_model", "llama-3-8b-instruct-abb")

# Merged weights in float16
model.save_pretrained_merged(finetuned_model, tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged(finetuned_model, tokenizer, save_method = "merged_16bit")

# Merged weights in 4-bit
model.save_pretrained_merged(finetuned_model, tokenizer, save_method = "merged_4bit")
model.push_to_hub_merged(finetuned_model, tokenizer, save_method = "merged_4bit")

# GGUF with the default quantization method
model.save_pretrained_gguf(finetuned_model, tokenizer)
model.push_to_hub_gguf(finetuned_model, tokenizer)

# GGUF in float16
model.save_pretrained_gguf(finetuned_model, tokenizer, quantization_method = "f16")
model.push_to_hub_gguf(finetuned_model, tokenizer, quantization_method = "f16")

# GGUF quantized with q4_k_m
model.save_pretrained_gguf(finetuned_model, tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf(finetuned_model, tokenizer, quantization_method = "q4_k_m")

# Inference

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "llama-3-8b-instruct-abb",
        max_seq_length = 4096,
        dtype = None,
        load_in_4bit = True,
    )

# Using FastLanguageModel for fast inference
FastLanguageModel.for_inference(model)

In [None]:
from datasets import load_dataset

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    texts = []
    # Build an inference prompt (system + user turns only) for every example in the batch
    for instruction, input_text in zip(instructions, inputs):
        text = get_prompt(instruction, input_text)
        texts.append(text)
    return { "text" : texts, }


dataset_test = load_dataset("svercoutere/llama3_abb_instruct_dataset", split = "train")
dataset_test = dataset_test.map(formatting_prompts_func, batched = True,)

In [None]:
def generate_text(prompt):
    inputs = tokenizer([prompt], return_tensors = "pt", padding = True).to("cuda")
    prompt_length = inputs["input_ids"].shape[1]

    # Greedy decoding (do_sample = False) gives deterministic output; temperature only applies when sampling
    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample = False)
    print("tokens(total):", len(outputs[0]), "tokens(prompt):", prompt_length)

    # Keep only the newly generated tokens
    new_tokens = outputs[0][prompt_length:]
    print("tokens(new):", len(new_tokens))
    return tokenizer.batch_decode([new_tokens], skip_special_tokens = False)

In [None]:
for index in range(5,20):
    print("\n---------------Sample:",str(index),"----------------------")

    print("Input:")
    print(dataset_test[index]["input"].split("####")[1])

    response = generate_text(dataset_test[index]["text"])[0]
    print("Prediction:\n")
    print(response.replace("<|start_header_id|>assistant<|end_header_id|>","").replace("<|eot_id|>","").strip())
    print("Expected:\n")
    print(dataset_test[index]["output"])
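
Since the model is trained to return JSON, a natural next step is to parse the prediction instead of only printing it. The helper below is a minimal sketch under that assumption: the name `parse_json_response` is hypothetical, and it simply returns `None` when the output is not valid JSON so the caller can decide how to handle that case.

```python
import json

# Special tokens that may surround the generated answer
SPECIAL_TOKENS = (
    "<|start_header_id|>assistant<|end_header_id|>",
    "<|eot_id|>",
)

def parse_json_response(response: str):
    """Strip the Llama 3 special tokens and parse the rest as JSON.

    Returns the parsed object, or None if the text is not valid JSON.
    """
    text = response
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# Hypothetical model output, for illustration only
raw = '<|start_header_id|>assistant<|end_header_id|>{"label": "A"}<|eot_id|>'
print(parse_json_response(raw))
```

In the loop above this could be used as `parse_json_response(response)`, with a fallback to the raw text when it returns `None`.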