<!-- Banner Image -->
<center>
<img src="https://a.storyblok.com/f/139616/x/17b1d8406f/kili-icon-dark-mode.svg" width="10%">

  <a href="https://kili-technology.com/" style="color: #06b6d4;">Kili website</a> --
  <a href="https://cloud.kili-technology.com/label" style="color: #06b6d4;">Log-in</a> --
  <a href="https://docs.kili-technology.com/docs" style="color: #06b6d4;">Docs</a>



# Kili - Fine-tune LLAMA2 on your own data
</center>

Welcome!

In this notebook, we will fine-tune the [7B version of LLAMA2](https://huggingface.co/meta-llama/Llama-2-7b) from Meta using QLoRA.
To do so, we will rely on the `transformers` library and `bitsandbytes` to load the model in 4-bit precision, and on `PEFT` and `trl` to train the model with LoRA.

First, we'll load an off-the-shelf financial dataset and prepare the data. Then, we'll test how well the original model performs on our dataset. After that, we'll launch our training to fine-tune it. Finally, we'll be able to test our fine-tuned model and compare it to the original model.

Feel free to adapt it to a different dataset to create your own custom model!


## 0. Pre-requisites

In addition to installing the packages, you need to be logged in to your Hugging Face account to download the model (and to upload your fine-tuned model, if you want to). Make sure that you have been granted access to [Meta's HF space](https://huggingface.co/meta-llama/Llama-2-7b-hf) so that you can download the model later.

First, let's download and prepare all the tools that we'll need.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.5.0 bitsandbytes==0.41.1 transformers==4.34.0 trl==0.7.2

In [None]:
import os
from getpass import getpass

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    logging,
    pipeline,
)
from trl import SFTTrainer

In [None]:
# A Hugging Face token is required to download and upload datasets and models
HF_TOKEN = getpass()

## 1. Load data

In this case, the data is loaded from an existing Hugging Face dataset (see the [AdaptLLM space](https://huggingface.co/datasets/AdaptLLM/finance-tasks)).

You can use a different dataset, or even load your own data to fine-tune the model with it. You can start with several hundred examples.

If you need to create the correct input for such fine-tuning, this is where [Kili](https://kili-technology.com/) can help with our state-of-the-art [labeling interfaces](https://kili-technology.com/platform/llm-tool) and [quality workflows](https://kili-technology.com/platform/explore-and-fix). Note that if you don't have the time or sufficient resources, Kili also offers professional end-to-end [labeling services](https://kili-technology.com/professional-services/kili-simple-offer).

In [None]:
# Dataset characteristics
# We select the ConvFinQA dataset from the finance-tasks space. This dataset only has a test split - See https://huggingface.co/datasets/AdaptLLM/finance-tasks
dataset_name = "AdaptLLM/finance-tasks"
subset = "ConvFinQA"
split = "test"

# Dataset loading
raw_dataset = load_dataset(dataset_name, subset, split=split)

In [None]:
raw_dataset

### Format the prompt

Let's create a function to format the raw data.

In our case, our dataset has an input and a label that we merge into a single piece of text for the training.

If you use a different dataset, you'll have to adapt this function.



In [None]:
def formatting(dataset):
    dataset["input+labels"] = f"### Question: {dataset['input']}\n ### Answer: {dataset['label']}"
    return dataset
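
As a quick sanity check, the sketch below applies the same template to a hand-written record, so you can see the exact string the model will be trained on. The `format_example` helper and the sample record are illustrative assumptions; they are not part of the dataset or of any library.

```python
def format_example(example: dict) -> dict:
    """Merge question and answer into one training string (same template as above)."""
    merged = f"### Question: {example['input']}\n ### Answer: {example['label']}"
    # Return a copy so the original record is left untouched.
    return {**example, "input+labels": merged}


# Hypothetical record, made up for illustration only.
sample = {"input": "what is 2 + 2?", "label": "4"}
formatted = format_example(sample)
print(formatted["input+labels"])
```

Whatever template you choose, the prompts you send to the model at inference time should use the same markers.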

We apply our formatting to the raw dataset and split it with `train_test_split` to keep a couple of examples for later trials.



In [None]:
dataset = raw_dataset.map(formatting).train_test_split(test_size=0.001)

In [None]:
dataset

## 2. Load base model

Before loading the base model from the meta-llama repository, make sure that you have access rights (see [Meta's HF space](https://huggingface.co/meta-llama/Llama-2-7b-hf)).

In [None]:
# Model name in HF
model_name = "meta-llama/Llama-2-7b-hf"

# Load tokenizer and model with QLoRA configuration
compute_dtype = torch.float16

# Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Activate 4-bit
    bnb_4bit_use_double_quant=False,  # double quantization for 4-bit base models
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,  # Compute dtype for 4-bit base models
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, token=HF_TOKEN
)

# Disable the key/value cache, which is not needed during fine-tuning.
model.config.use_cache = False
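
To see why 4-bit loading matters, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, optimizer state, and the LoRA adapters, so treat the numbers as orders of magnitude rather than measurements.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: nominal "7B" parameter count

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
four_bit_gb = weight_memory_gb(NOMINAL_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{four_bit_gb:.1f} GB")
```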

## 3. Tokenization

Let's load our tokenizer:

In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="right", add_eos_token=True, add_bos_token=True, token=HF_TOKEN
)

tokenizer.pad_token = tokenizer.eos_token
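
The toy sketch below illustrates what right padding does to a batch: shorter sequences are extended with the pad token at the end, and the attention mask marks which positions are real. The `pad_right` helper and the token ids are made up for illustration; in practice the tokenizer builds both lists for you.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


PAD_ID = 2  # illustrative id standing in for the EOS token reused as padding
toy_batch = [[1, 15, 27, 2], [1, 42, 2]]  # hypothetical token ids

padded_ids, mask = pad_right(toy_batch, PAD_ID)
print(padded_ids)
print(mask)
```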

We apply the tokenizer to the training examples to measure their length in tokens, so that we can adapt the dataset as needed.

In [None]:
def generate_and_tokenize_prompt(dataset):
    result = tokenizer(dataset["input+labels"])
    return result


tokenized_train_dataset = dataset["train"].map(generate_and_tokenize_prompt)

In [None]:
from statistics import mean, median

lengths = [len(x["input_ids"]) for x in tokenized_train_dataset]

max_length = max(lengths)

print(f"Mean:{mean(lengths)}, Median: {median(lengths)}, Max:{max(lengths)}")

If a tokenized example is too long, our computing resources won't be able to process it. We'll remove these items from our dataset.

In [None]:
max_length = 1250

# Use the filter function to keep only the items at or below the threshold length
filtered_dataset = tokenized_train_dataset.filter(lambda item: len(item["input_ids"]) <= max_length)

# Display the filtered dataset
print(filtered_dataset)
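
The threshold of 1250 is a fixed value chosen for this dataset. If you switch datasets, one possible alternative is to derive the cut-off from the length distribution itself. The sketch below uses a percentile; the `length_threshold` helper, the 90th percentile, and the toy lengths are all assumptions for illustration.

```python
from statistics import quantiles


def length_threshold(lengths, percentile=95):
    """Return the token length below which roughly `percentile` % of items fall."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = quantiles(lengths, n=100, method="inclusive")
    return int(cut_points[percentile - 1])


# Hypothetical token counts, not taken from the real dataset.
toy_lengths = [120, 300, 450, 600, 800, 950, 1100, 1300, 2500, 4000]

threshold = length_threshold(toy_lengths, percentile=90)
kept = [n for n in toy_lengths if n <= threshold]
print(f"threshold={threshold}, kept {len(kept)} of {len(toy_lengths)} items")
```

Whichever value you pick still has to fit in your GPU memory, so a percentile is only a starting point.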

Let's now check the lengths in our filtered dataset:

In [None]:
lengths = [len(x["input_ids"]) for x in filtered_dataset]

print(f"Mean:{mean(lengths)}, Median: {median(lengths)}, Max:{max(lengths)}")

We'll create two datasets, so that we have some data to evaluate the model later. We'll also remove the columns that are not required from the training dataset.

In [None]:
# Keep only the merged text column, which is the one used for training
train_dataset = filtered_dataset.remove_columns(
    ["id", "input", "label", "input_ids", "attention_mask"]
)
val_dataset = dataset["test"]

In [None]:
train_dataset

### Test base model

Let's check how an off-the-shelf Llama 2 7B does on one of our data samples with the following `eval_prompt`:

In [None]:
eval_prompt = """
### Question: Given the following data
cash flowsmillions | 2014 | 2013 | 2012
cash provided by operating activities | $7385 | $6823 | $6161
cash used in investing activities | -4249 (4249) | -3405 (3405) | -3633 (3633)
cash used in financing activities | -2982 (2982) | -3049 (3049) | -2682 (2682)
net change in cash and cashequivalents | $154 | $369 | $-154 (154)

what was the net change in cash and cashequivalents for 2013?

### Answer:
"""

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the text generation pipeline with the base model
prompt = eval_prompt
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=15)
result = pipe(prompt)
print(result[0]["generated_text"])

## 4. Training

*Optional* - You can use Weights & Biases for experiment tracking.

In [None]:
!pip install -q wandb -U

import os

import wandb

wandb.login()

os.environ["WANDB_PROJECT"] = "finance-finetune"

Now let's train our model.

The parameters below have been set based on the standard configuration, but feel free to adapt them based on your requirements.
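
To put the LoRA settings used in this section in perspective, the sketch below estimates how many parameters are actually trained. It relies on two assumptions: that the adapters are attached to the query and value projections of every layer (which I believe is the PEFT default for Llama when no `target_modules` is given), and the Llama 2 7B dimensions of 32 layers with a hidden size of 4096.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, LORA_ALPHA = 64, 16       # values from the LoRA configuration in this section
HIDDEN_SIZE = 4096           # Llama 2 7B hidden size
NUM_LAYERS = 32              # Llama 2 7B transformer layers
ADAPTED_PER_LAYER = 2        # assumption: q_proj and v_proj only

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
trainable = per_module * ADAPTED_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / R     # factor applied to the low-rank update

print(f"trainable parameters: {trainable:,}")
print(f"share of a nominal 7B model: {trainable / 7e9:.2%}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Raising `r` increases adapter capacity and memory use linearly, while `lora_alpha / r` controls how strongly the learned update is weighted.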

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="wandb",
    gradient_checkpointing=True,  # Required since an update to prepare_model_for_kbit_training
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="input+labels",
    max_seq_length=max_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

# Train model
trainer.train()
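
The `warmup_ratio` and `lr_scheduler_type="cosine"` settings shape the learning rate over the run. The sketch below reproduces the usual linear-warmup-then-cosine-decay curve in plain Python. It is an approximation written for illustration, with a hypothetical step count, and the scheduler used by the trainer may differ in details.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the base learning rate.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase already completed, between 0 and 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 300  # hypothetical number of optimizer steps

for step in (0, 5, 9, 150, 300):
    print(f"step {step:>3}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```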

In [None]:
# Fine-tuned model name
new_model = "Llama-2-7b-hf-finance-v01"

# Save trained model
trainer.model.save_pretrained(new_model)

## 5. Try the Trained Model!

The PEFT library saves only the LoRA adapter weights, not the full model, so we first need to reload the base Llama 2 model and then attach the adapters to it.
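
The reason the base model is needed is that an adapter only stores two small matrices per adapted layer, and their product is added to the frozen weight at run time. Here is a minimal pure-Python sketch of that combination on tiny made-up matrices; the helper names are hypothetical and this is not how PEFT implements it internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the effective weight of an adapted layer."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy values for illustration: a 2x2 frozen weight and a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r=1, d_in=2)
lora_b = [[0.5], [1.0]]    # shape (d_out=2, r=1)

adapted = apply_lora(w, lora_a, lora_b, alpha=2, r=1)
print(adapted)
```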


In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,  # Llama 2 7B, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map={"": 0},
    trust_remote_code=True,
    token=HF_TOKEN,
    low_cpu_mem_usage=True,
)

# Attach the QLoRA adapters saved under new_model
model = PeftModel.from_pretrained(base_model, new_model)

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="right", trust_remote_code=True, token=HF_TOKEN
)

tokenizer.pad_token = tokenizer.eos_token

We can use our pipeline to test our new model.

Let's try our new model on one of the elements that we left in our validation dataset.

In [None]:
val_item = val_dataset[0]

# End the prompt with the answer marker, matching the training format
prompt = f"""### Question: {val_item['input']}
 ### Answer:"""

print(val_item["label"])

In [None]:
max_tokens = 15

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the text generation pipeline with the fine-tuned model
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=max_tokens)
result = pipe(prompt)
print(result[0]["generated_text"])
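
The pipeline returns the prompt followed by the generated continuation, so it can help to isolate the answer before comparing it with the label. The sketch below is one simple way to do that; `extract_answer` is a hypothetical helper and the sample string is made up.

```python
def extract_answer(generated_text: str, marker: str = "### Answer:") -> str:
    """Return the first non-empty line that follows the answer marker."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        # No marker in the output: fall back to the whole text.
        return generated_text.strip()
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Made-up output for illustration, including a spurious follow-up question.
sample_output = "### Question: what is 2 + 2?\n ### Answer: 4\n### Question: and 3 + 3?"
print(extract_answer(sample_output))
```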

Let's compare with the original model:

In [None]:
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,  # Llama 2 7B, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map={"": 0},
    trust_remote_code=True,
    token=HF_TOKEN,
    low_cpu_mem_usage=True,
)

In [None]:
# Run the text generation pipeline with the original model
pipe = pipeline(
    task="text-generation", model=original_model, tokenizer=tokenizer, max_new_tokens=max_tokens
)
result = pipe(prompt)
print(result[0]["generated_text"])

## 6. Push to HF

*Optional* - if you want to save the adapters of your model.

In [None]:
model.push_to_hub(new_model, use_temp_dir=False, token=HF_TOKEN)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=HF_TOKEN)

## Conclusion

In this notebook, we adapted the generic LLAMA2 model to a finance dataset for domain adaptation and specialization purposes.
We can see that, with a few hundred examples, the fine-tuned model learns to structure its answers according to the fine-tuning dataset format and better identifies the answer to the financial question.

The next step would be to run a larger-scale evaluation of the fine-tuned model to assess its actual performance improvement on a financial Q&A task. This can be done by creating a benchmark dataset and evaluating both the initial model and the fine-tuned model on it. [Kili](https://kili-technology.com/) can support this step with both the [software](https://kili-technology.com/platform/llm-tool) and the [service](https://kili-technology.com/professional-services/kili-simple-offer).
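
As a starting point for such a comparison, the sketch below scores two sets of predictions against the same references with a normalized exact-match metric. The metric, the normalization rules, and all the values are assumptions chosen for illustration; they are not results from this notebook.

```python
def normalize(answer: str) -> str:
    """Lowercase and drop currency symbols and thousands separators."""
    return answer.strip().lower().replace("$", "").replace(",", "")


def exact_match_accuracy(predictions, references):
    """Share of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up benchmark answers and model outputs, for illustration only.
references = ["369", "12.5%", "yes"]
base_predictions = ["154", "12.5%", "no"]
tuned_predictions = ["$369", "12.5%", "no"]

print(f"base model:       {exact_match_accuracy(base_predictions, references):.2f}")
print(f"fine-tuned model: {exact_match_accuracy(tuned_predictions, references):.2f}")
```

Exact match is strict for numeric answers with different roundings or units, so a real benchmark may need a more tolerant comparison.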

Help us improve this tutorial by providing feedback 😀
