# Enhancing Llama 3.1 with RAFT

This notebook contains the code to generate a RAFT dataset based on your input document, which is then used to fine-tune the Llama 3.1 LLM. Click [here](https://www.analyticsvidhya.com/blog/2024/04/enhancing-rag-with-retrieval-augmented-fine-tuning/) for more information about RAFT and the code.

## Step 1: Import Necessary Libraries

In [None]:
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install llama-index-packs-raft-dataset

In [None]:
from llama_index.packs.raft_dataset import RAFTDatasetPack
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

`RAFTDatasetPack` prepares the data for Q/A generation and is configured with the following parameters:
- **file_path**: Path of the file used to generate questions and answers. This file is the primary source of content for the dataset.
- **llm**: The Large Language Model (LLM) used to generate questions and answers. GPT-4 is used by default if no model is specified. Consider the API costs when choosing a model.
- **embed_model**: The embedding model used to calculate the similarity between a query and its context, essential for selecting relevant context chunks.
- **num_questions_per_chunk**: The number of questions created for each data chunk, which directly affects how comprehensive the training dataset is.
- **num_distract_docs**: The number of random context chunks used as distractors for each question, challenging the model to identify relevant information.
- **chunk_size**: LlamaIndex uses `SemanticSplitterNodeParser` to split the document into chunks, so this parameter does not influence how the document is split.
- **default_breakpoint_percentile_threshold**: Controls the threshold for combining chunks based on their dissimilarity. A higher value results in fewer, larger chunks, affecting the granularity of the data used for training.
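
To see why a higher threshold yields fewer, larger chunks, here is a minimal sketch of percentile-based breakpoint selection. It is not the `SemanticSplitterNodeParser` implementation: the function names and the toy dissimilarity values are assumptions for illustration, and it assumes a chunk boundary is placed wherever the dissimilarity between adjacent sentences exceeds the chosen percentile of all dissimilarities.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def find_breakpoints(distances, threshold_percentile):
    """Return indices i where a chunk boundary falls after sentence i."""
    cutoff = percentile(distances, threshold_percentile)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Hypothetical dissimilarities between consecutive sentence pairs.
distances = [0.10, 0.12, 0.55, 0.08, 0.30, 0.11, 0.70, 0.09, 0.25]

breaks_50 = find_breakpoints(distances, 50)
breaks_90 = find_breakpoints(distances, 90)
print(len(breaks_50) + 1, "chunks at the 50th percentile")
print(len(breaks_90) + 1, "chunks at the 90th percentile")
```

Raising the percentile raises the cutoff, so fewer sentence pairs qualify as boundaries and the resulting chunks grow.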

## Step 2: Load OpenAI API Key

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI API KEY>"

## Step 3: Define LLM and Embedding Models

In [None]:
llm = OpenAI(model="gpt-4")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Step 4: Input Dataset

Upload the file that should be used as the 'oracle document' from which the LLM can deduce domain knowledge. This file will be used to create the RAFT dataset.

*Note: The input data should be pdf, json, txt or api.*

In [None]:
from google.colab import files
uploaded = files.upload()

input_file = input("Enter the name of the uploaded file (including extension): ")

Saving leiden_guidelines.pdf to leiden_guidelines.pdf
Enter the name of the uploaded file (including extension): leiden_guidelines.pdf


The `RAFTDatasetPack` LlamaPack prepares the data for fine-tuning the LLM in the following steps:
- **Dividing Sample Data**: Segments the sample data into chunks. Each chunk is a potential source of information or context for generating questions.
- **Generating Questions**: Creates corresponding questions for every chunk of data. These questions are designed to be answerable using the information within the chunk.
- **Generating Answers with Oracle Context**: The 'oracle context' is the chunk of data that contains the precise information needed to answer a given question. This context is used alongside the question to generate the answer with Chain of Thought prompting.
- **Selecting Distractor Contexts**: In addition to the oracle context, a few random chunks of data are chosen as 'distractor contexts'. These simulate noise and irrelevant information, challenging the model to focus on the relevant context.
- **Compiling Training Data**: Combines the question, oracle context, distractor contexts, and the generated answer, along with explicit instructions on how the model should identify and use the relevant context, into a comprehensive training dataset.
- **Fine-Tuning the Model**: Utilizing this dataset, the model undergoes fine-tuning, learning to accurately distinguish relevant from irrelevant information and to generate precise answers based on the context provided.
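
The way one training record combines an oracle chunk with distractors can be sketched in plain Python. This is an illustration only, not the code inside `RAFTDatasetPack`: the helper `build_raft_example`, the toy chunks, and the placeholder answer are all hypothetical.

```python
import random


def build_raft_example(question, answer, chunks, oracle_index, num_distract_docs, rng):
    """Assemble one training record from an oracle chunk and random distractors."""
    oracle = chunks[oracle_index]
    # Candidate distractors are all chunks except the oracle.
    candidates = [c for i, c in enumerate(chunks) if i != oracle_index]
    distractors = rng.sample(candidates, min(num_distract_docs, len(candidates)))
    # Shuffle so the position of the oracle chunk carries no signal.
    context = distractors + [oracle]
    rng.shuffle(context)
    return {
        "question": question,
        "context": context,
        "oracle_context": oracle,
        "cot_answer": answer,
    }


chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]  # hypothetical chunks
example = build_raft_example(
    question="What does chunk B say?",
    answer="Reasoning ... final answer.",  # placeholder Chain of Thought answer
    chunks=chunks,
    oracle_index=1,
    num_distract_docs=2,
    rng=random.Random(0),
)
print(example["context"])
```

Shuffling the context is a design choice in this sketch: if the oracle chunk always sat in the same position, a model could learn the position instead of learning to read the content.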

In [None]:
# Create RAFT Dataset object
raft_dataset = RAFTDatasetPack(file_path=input_file,
                               llm=llm, embed_model=embed_model,
                               num_questions_per_chunk=1, num_distract_docs=2, chunk_size=1024,
                               default_breakpoint_percentile_threshold=99)

# Beware of the costs involved in using the OpenAI API.
# Depending on the file size, this step can also take a long time.
dataset = raft_dataset.run()

# Save the dataset in jsonl format
output_path = './raft_dataset'
dataset.to_json(output_path + ".jsonl")

## Step 5: Loading Dataset

The complete dataset can now be loaded and contains the following keys:
- `id`: Unique identifier for the data point.
- `type`: Type of document.
- `question`: The question generated about this particular chunk of data.
- `context`: The context chunks supplied with the question, typically the oracle chunk together with the distractor chunks.
- `oracle_context`: The chunk of data that contains the information needed to answer the question.
- `cot_answer`: The Chain of Thought answer to the generated question.
- `instruction`: The prompt given to the LLM which tells it to answer the question based on the context.

In [None]:
import json

with open('./raft_dataset.jsonl', 'r') as json_file:
    dataset = list(json_file)

# We can access the dataset with the following keys
json.loads(dataset[0]).keys()

# Example: accessing one of the generated questions
json.loads(dataset[0])['question']

'Who supervised the creation of the Leiden Guidelines on the Use of Digitally Derived Evidence in International Criminal Courts and Tribunals?'

Once the dataset is prepared, you can follow these [instructions](https://github.com/ShishirPatil/gorilla/blob/main/raft/azure-ai-studio-ft/howto.md) to fine-tune and deploy your own RAFT model via Azure AI Studio. Make sure to use `prompt` as input and `completion` as output when fine-tuning a `completion` model, and the `messages` column as input when fine-tuning a `chat` model.

Or follow the steps below to fine-tune a Llama 3.1 model with the Unsloth library.

## Step 6: Fine-tune Llama 3.1 8B

We'll use the Unsloth library to efficiently fine-tune Llama 3.1 8B. The newly created RAFT dataset will be used to QLoRA fine-tune the model. Click [here](https://huggingface.co/blog/mlabonne/sft-llama3) for a more in-depth explanation of the code.

In [28]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes



In [29]:
import torch
from trl import SFTTrainer
from datasets import Dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

Load the `unsloth/Meta-Llama-3.1-8B-bnb-4bit` model in NF4 format using the bitsandbytes library.

In [None]:
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

Prepare the model for parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation) adapters. LoRA has three important parameters:
- **Rank (r)**: determines LoRA matrix size. Rank typically starts at 8 but can go up to 256. Higher ranks can store more information but increase the computational and memory cost of LoRA. We set it to 16 here.
- **Alpha (α)**: a scaling factor for updates. Alpha directly impacts the adapters' contribution and is often set to 1x or 2x the rank value.
- **Target modules**: LoRA can be applied to various model components, including attention mechanisms (Q, K, V matrices), output projections, feed-forward blocks, and linear output layers. While initially focused on attention mechanisms, extending LoRA to other components has shown benefits. However, adapting more modules increases the number of trainable parameters and memory needs.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)
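
The effect of the rank and the target modules on the adapter size can be checked with a short calculation. The sketch below is not part of the original notebook; it assumes the layer shapes from the published Llama 3.1 8B configuration (hidden size 4096, feed-forward size 14336, 32 layers, 8 key/value heads of dimension 128) and the usual LoRA layout of two matrices per adapted layer. It also compares classic LoRA scaling (`alpha / r`) with the rank-stabilized scaling (`alpha / sqrt(r)`) that `use_rslora=True` selects.

```python
import math

# Llama 3.1 8B shapes (assumed from the public model config).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the feed-forward block
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (input features, output features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


rank, alpha = 16, 16
total = lora_trainable_params(rank)
standard_scale = alpha / rank            # classic LoRA scaling
rslora_scale = alpha / math.sqrt(rank)   # rank-stabilized LoRA scaling

print(f"{total:,} trainable parameters")
print(standard_scale, rslora_scale)
```

Under these assumptions the total is 41,943,040, which agrees with the number of trainable parameters Unsloth reports in the training log further down.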

Load and prepare the RAFT dataset.

*Note: make sure that `./raft_dataset.jsonl` is the correct path to the dataset. Re-upload the file if necessary.*

In [5]:
import json

def reformat_jsonl_dataset(input_path, output_path):
    formatted_data_list = []

    # Read the JSONL file
    with open(input_path, 'r') as f:
        for line in f:
            data = json.loads(line.strip())

            question = data["question"].replace("\n", " ")
            answer = data["cot_answer"].replace("assistant: ", "").strip().replace("\n", " ")

            formatted_data = [
                {
                    "from": "human",
                    "value": question
                },
                {
                    "from": "gpt",
                    "value": answer
                }
            ]

            formatted_data_list.append(formatted_data)

    # Write the formatted data to a new JSONL file
    with open(output_path, 'w') as f:
        for item in formatted_data_list:
            f.write(json.dumps(item) + '\n')

input_file_path = './raft_dataset.jsonl'
output_file_path = './fine_tuning_dataset.jsonl'

reformat_jsonl_dataset(input_file_path, output_file_path)

Next, we format the reformatted JSONL conversations with a **chat template**, in this case the ChatML template. We then load and process the entire dataset to apply the chat template to every conversation.
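
For intuition, the ChatML layout can be reproduced with a few lines of plain Python. This is a rough sketch, not the template Unsloth applies: the helper `render_chatml` is hypothetical, and the exact whitespace and special tokens produced by the real tokenizer template may differ.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def render_chatml(conversation, add_generation_prompt=False):
    """Render a list of {"from", "value"} turns as ChatML text."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"from": "human", "value": "What are the Leiden Guidelines?"},
    {"from": "gpt", "value": "A hypothetical answer."},
]
print(render_chatml(conversation))
```

During training the full conversation is rendered, so `add_generation_prompt` stays `False`; at inference time it is set to `True` so that the text ends with an open assistant turn.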

In [6]:
def read_jsonl(file_path):
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            data.append(json.loads(line.strip()))
    return data

def convert_to_dataset(data):
    return Dataset.from_dict({"conversations": data})

# Convert to Dataset format
data = read_jsonl(output_file_path)
dataset = convert_to_dataset(data)

# Tokenizer setup and applying the ChatML template
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

dataset = dataset.map(apply_template, batched=True)

Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.


Map:   0%|          | 0/108 [00:00<?, ? examples/s]

Specify the training parameters for our run. The most important hyperparameters:
- **Learning rate**: It controls how strongly the model updates its parameters. Too low, and training will be slow and may get stuck in local minima. Too high, and training may become unstable or diverge, which degrades performance.
- **LR scheduler**: It adjusts the learning rate (LR) during training, starting with a higher LR for rapid initial progress and then decreasing it in later stages. Linear and cosine schedulers are the two most common options.
- **Batch size**: Number of samples processed before the weights are updated. Larger batch sizes generally lead to more stable gradient estimates and can improve training speed, but they also require more memory. Gradient accumulation allows for effectively larger batch sizes by accumulating gradients over multiple forward/backward passes before updating the model.
- **Num epochs**: The number of complete passes through the training dataset. More epochs allow the model to see the data more times, potentially leading to better performance. However, too many epochs can cause overfitting.
- **Optimizer**: Algorithm used to adjust the parameters of a model to minimize the loss function. In practice, AdamW 8-bit is strongly recommended: it performs as well as the 32-bit version while using less GPU memory.
- **Weight decay**: A regularization technique that adds a penalty for large weights to the loss function. It helps prevent overfitting by encouraging the model to learn simpler, more generalizable features. However, too much weight decay can impede learning.
- **Warmup steps**: A period at the beginning of training where the learning rate is gradually increased from a small value to the initial learning rate. Warmup can help stabilize early training, especially with large learning rates or batch sizes, by allowing the model to adjust to the data distribution before making large updates.
- **Packing**: Training sequences have a pre-defined maximum length. Instead of padding each short sample to that length, multiple short samples can be combined into one sequence, which increases efficiency.
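
How warmup and a linear scheduler interact can be sketched as a multiplier on the peak learning rate. The function below is an illustration that assumes the common shape of linear warmup followed by linear decay to zero; the scheduler used by the trainer may differ in detail, and the run length of 100 steps is hypothetical.

```python
def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


peak_lr = 3e-4
warmup_steps, total_steps = 10, 100  # hypothetical run length
for step in (0, 5, 10, 55, 100):
    lr = peak_lr * linear_schedule_multiplier(step, warmup_steps, total_steps)
    print(step, f"{lr:.2e}")
```

With a schedule of this shape, a run that has fewer optimizer steps than `warmup_steps` never reaches the peak learning rate, which is worth checking when training on a small dataset.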

In [7]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 16 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 2
\        /    Total batch size = 16 | Total steps = 1
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1    | 1.2281        |


TrainOutput(global_step=1, training_loss=1.2281341552734375, metrics={'train_runtime': 42.9654, 'train_samples_per_second': 0.372, 'train_steps_per_second': 0.023, 'total_flos': 1483774567120896.0, 'train_loss': 1.2281341552734375, 'epoch': 1.0})
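
The log reports `Num examples = 16`, which is likely a result of `packing=True`: short conversations are merged into fewer full-length sequences. The sketch below illustrates the idea with a simple concatenate-and-cut strategy. It is an assumption-laden illustration, not the packing code used by `SFTTrainer`; the helper `pack_sequences` and the toy token ids are hypothetical.

```python
def pack_sequences(tokenized_samples, seq_length, eos_token_id):
    """Concatenate samples (each followed by EOS) and cut fixed-length blocks."""
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(eos_token_id)
    # Keep only full blocks; a trailing partial block is dropped in this sketch.
    num_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(num_blocks)]


# Hypothetical token ids: five short samples of 3 tokens each.
samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
packed = pack_sequences(samples, seq_length=8, eos_token_id=0)
print(len(samples), "samples ->", len(packed), "packed sequences")
print(packed)
```

Every packed sequence is completely filled with real tokens, so no compute is spent on padding, at the cost of samples sometimes being split across two sequences.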

In [15]:
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.168 GB.
18.506 GB of memory reserved.


The output of the training process provided by `SFTTrainer` lists several important metrics that break down how the training went:
- `global_step`: Indicates the total number of training steps completed.
- `training_loss`: The training loss averaged over all completed steps. Lower values generally indicate better performance, but this depends on the specific model and task.
- `metrics`: Includes the total runtime for the training process in seconds, the number of training samples and training steps processed per second, the Total Floating Point Operations (FLOPs) performed during training (which measures computational effort during the training process), the training loss value and the number of epochs completed. An epoch is a full pass through the entire training dataset.
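
The throughput figures in `metrics` can be reproduced from the training configuration and the log. The following sketch uses only values shown in this notebook; it assumes that the number of steps per epoch is the example count divided by the effective batch size, rounded up, which may not match the trainer's bookkeeping in every setup.

```python
import math

# Values taken from the training configuration and log in this notebook.
per_device_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 1
num_examples = 16
num_epochs = 1
train_runtime = 42.9654  # seconds

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(effective_batch_size, total_steps)
print(samples_per_second, steps_per_second)
```

The results (16, 1, 0.372 and 0.023) line up with `Total batch size`, `Total steps`, `train_samples_per_second` and `train_steps_per_second` in the output above.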

## Step 7: Test the fine-tuned model

In [None]:
model = FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "What are the Leiden Guidelines?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)

## Step 8: Save the trained model

With LoRA, what we trained is not the model itself but a set of adapters. Here we merge the adapters with the base model and save the result in 16-bit precision to maximize the quality. We first save it locally in the 'model' directory and then upload it to your account on the Hugging Face Hub.

In [None]:
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("gizembrasser/FineLlama-3.1-8B", tokenizer, save_method="merged_16bit")

Convert the model into GGUF format, which is a quantization format created for llama.cpp and is compatible with most inference engines.

In [None]:
quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
for quant in quant_methods:
    model.push_to_hub_gguf("gizembrasser/FineLlama-3.1-8B-GGUF", tokenizer, quant)