# Parameter-efficient fine-tuning of Mistral-7B

### Using Quantized LoRA




## Use case: We need a customized model for extracting structured data from permit notices.


In [None]:
#  !pip -q install git+https://github.com/huggingface/transformers # need to install from github
# https://blog.paperspace.com/mistral-7b-fine-tuning/
# !pip install transformers trl langchain accelerate torch bitsandbytes peft datasets -qU
# !pip install -q datasets loralib sentencepiece xformers einops
!pip install -q -U peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 wandb

## Parameter Efficient Fine Tuning with LoRA


https://heidloff.net/article/efficient-fine-tuning-lora/

https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1

Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques, and the name of a Hugging Face library that implements them, for efficiently adapting pre-trained language models to downstream applications without the need to re-train or fine-tune all the parameters.

One main form of PEFT is Low Rank Adaptation (LoRA) of large language models, based on the concept of rank of matrices in linear algebra.

> LoRA is a method for fine-tuning language models without altering the original model parameters. In practical fine-tuning tasks, a set of low-rank adapters is added alongside specific model layers to be trained. In the original paper, the adapters were only added to two attention layers. The output dimensions of these adapters match those of the original model layers exactly.
>
> Subsequently, the adapters are set to be trainable, while the original model parameters are frozen and not allowed to be trained. This approach allows for the training of large language models without affecting inference speed significantly and only slightly increasing the parameter count.


"common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space."
https://arxiv.org/abs/2012.13255


Instead of adjusting the entire weight matrix (which is part of the linear transformations in the model's layers), LoRA introduces additional low-rank matrices that modify the original weight matrices in a more parameter-efficient manner. This process allows for targeted adaptations that can significantly alter the model's behavior while training only a small fraction of the parameters.
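
As a rough illustration of the idea, the sketch below builds a low-rank update from two small matrices and adds it to a frozen weight matrix. The sizes and values (`d`, `k`, `r`, `W0`, `B`, `A`) are made up for the example and are far smaller than anything in a real model.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d, k, r = 4, 4, 1  # toy dimensions (hypothetical); real layers are e.g. 4096 x 4096

# Frozen pretrained weights (identity matrix, purely for illustration)
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]

# Trainable low-rank factors: B is d x r, A is r x k
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]

# The update B @ A has the same shape as W0, so it can be added directly
delta_W = matmul(B, A)
W_adapted = [[w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W0, delta_W)]

full_params = d * k          # values updated by full fine-tuning of this matrix
lora_params = d * r + r * k  # values updated by LoRA for this matrix
print(full_params, lora_params)
```

The gap between `full_params` and `lora_params` widens quickly as `d` and `k` grow while `r` stays small.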


<img src="images/peft_heidloff.png" alt="Descriptive text about the image" width="600"/>

Source: [Niklas Heidloff](https://heidloff.net/article/efficient-fine-tuning-lora/)


**SVD and matrix rank**
https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html
"The overall idea and concept are related to principal component analysis (PCA) and singular value decomposition (SVD), where we approximate a high-dimensional matrix or dataset using a lower-dimensional representation. In other words, we try to find a (linear) combination of a small number of dimensions in the original feature space (or matrix) that can capture most of the information in the dataset."

**Remember:** Large Language Models (LLMs) are simply lots of matrices (or tensors) of numbers.



**PEFT supported models**

HuggingFace has a list of 10 models for which LoRA adapters are available:
https://huggingface.co/docs/peft/index#supported-models




## Selecting a model


Look at the [HF leaderboard of open LLMs](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

There are multiple benchmarks to score LLMs, each designed for different linguistic, reasoning or numeric tasks.


**Benchmarks:**


**1. ARC (AI2 Reasoning Challenge)**
ARC challenges models on grade-school level multiple-choice science questions. It's designed to test a model's reasoning and knowledge.


**2. HellaSwag**
HellaSwag is a benchmark for commonsense reasoning: the model must pick the most plausible ending for a given scenario, with contexts drawn from video captions and how-to articles.


**3. MMLU (Massive Multitask Language Understanding)**
MMLU evaluates language models across a wide range of subjects and disciplines, testing the model's understanding on diverse topics.

**4. TruthfulQA**
This benchmark tests models on providing truthful answers, focusing on avoiding hallucinations and sticking to factual responses.


**5. Winogrande**
Winogrande is a dataset for commonsense reasoning, designed as an improved and scaled-up version of the Winograd Schema Challenge.


**6. GSM8K (Grade School Math 8K)**
GSM8K tests models on solving grade-school level math problems presented in textual form.


Admittedly, I haven't delved deeply into these benchmarks (maybe one of them has JSON tasks), but my instinct is that TruthfulQA is the most relevant for our task. The task mostly doesn't require any external knowledge of the world (caveat: impacts on wetlands), just the ability to transcribe accurately from the text to the dictionary keys.

In any case, we will be fine-tuning the Mistral-7b model.

The following resources are helpful for understanding this notebook:

- [Mistral](https://www.datacamp.com/tutorial/mistral-7b-tutorial)

- [Mixtral](https://github.com/brevdev/notebooks/blob/main/mixtral-finetune-own-data.ipynb)

- [HuggingFace Transformers Training](https://huggingface.co/docs/transformers/en/training)


### Weights & Biases for tracking metrics

Use Weights & Biases to track training metrics. Enter your API key when prompted.

In [None]:
!pip install -q wandb -U

import wandb, os
wandb.login()

wandb_project = "mistral-7b-usace-finetune_v1"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

## Load training dataset

It just needs to be a dataset of input-output pairs. The phrase 'supervised fine-tuning' suggests a dataset of input vectors and output labels, as in classical supervised learning. But that would be misleading: here the Xs and Ys are concatenated into single strings of continuous tokens, interspersed with special tokens that mark where inputs and outputs begin and end.

Basically, my dataset consists of input-output pairs like:

**Input**
>The applicant seeks author ization to construct a single-family residence on the 2.09-acre parcel, including the permanent fill and loss of  0.78 acres of mangrove wetlands.


**Output**
>{'wetlands': [{'wetland_type': 'mangrove wetlands', 'impact_quantity': 0.78, 'impact_unit': 'acres', 'impact_type': 'fill', 'impact_duration': 'permanent'}]}
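
Note that the example output is written as a Python dict literal (single quotes), which is not strict JSON. If downstream code needs real JSON, one possible approach, sketched here with the example record copied into a string, is to parse it with `ast.literal_eval` and re-serialize it. The variable names are illustrative only.

```python
import ast
import json

# The example output above as a single string (a Python dict literal, single-quoted)
raw_output = (
    "{'wetlands': [{'wetland_type': 'mangrove wetlands', "
    "'impact_quantity': 0.78, 'impact_unit': 'acres', "
    "'impact_type': 'fill', 'impact_duration': 'permanent'}]}"
)

# json.loads rejects single-quoted keys and strings
try:
    json.loads(raw_output)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval parses Python literals without executing arbitrary code
record = ast.literal_eval(raw_output)

# Re-serialize as strict JSON (double quotes) for downstream tools
as_json = json.dumps(record)
print(is_valid_json)
print(as_json)
```

If the training targets are stored as strict JSON instead, `json.loads` alone is enough.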



In [None]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")
instruct_tune_dataset

type(instruct_tune_dataset)

instruct_tune_dataset['train']


### Convert from OpenAI JSONL to HuggingFace Dataset

https://huggingface.co/transformers/v3.2.0/custom_datasets.html


Load the JSONL file of verified input-output pairs we used in notebook 2. (This step assumes you've already converted them into JSONL format for OpenAI.) The following step will convert these into a format suitable for HuggingFace.
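
For reference, each line of an OpenAI chat-format JSONL file is one JSON object with a `messages` list of `role`/`content` pairs. The sketch below builds and re-parses one such line. The system prompt and the message contents are hypothetical placeholders, not the ones used in this project.

```python
import json

# One training example in OpenAI chat format (contents are hypothetical)
record = {
    "messages": [
        {"role": "system", "content": "Extract wetland impacts from the permit notice as JSON."},
        {"role": "user", "content": "The applicant seeks authorization to fill 0.78 acres of mangrove wetlands."},
        {"role": "assistant", "content": "{'wetlands': [{'wetland_type': 'mangrove wetlands'}]}"},
    ]
}

# A JSONL file holds one such object per line
jsonl_line = json.dumps(record)

# Reading it back gives the structure that the loader iterates over
parsed = json.loads(jsonl_line)
roles = [message["role"] for message in parsed["messages"]]
print(roles)
```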

In [None]:
from datasets import Dataset
from pathlib import Path
import json
def load_jsonl_to_dataset(jsonl_file_path):
    # Initialize lists to hold the values for each field
    system_messages = []
    user_messages = []
    assistant_messages = []

    # Open and read the JSONL file
    with open(jsonl_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data = json.loads(line)
            system_msg = user_msg = assistant_msg = None  # Reset for each entry

            # Iterate over messages and extract content based on role
            for message in data['messages']:
                if message['role'] == 'system':
                    system_msg = message['content']
                elif message['role'] == 'user':
                    user_msg = message['content']
                elif message['role'] == 'assistant':
                    assistant_msg = message['content']

            # Append messages to their respective lists
            system_messages.append(system_msg)
            user_messages.append(user_msg)
            assistant_messages.append(assistant_msg)

    # Construct a dictionary with these lists
    data_dict = {
        'system': system_messages,
        'user': user_messages,
        'assistant': assistant_messages
    }

    # Convert the dictionary to a Hugging Face Dataset
    dataset = Dataset.from_dict(data_dict)

    return dataset


train_dataset = load_jsonl_to_dataset("usace_finetune_training.jsonl")
val_dataset = load_jsonl_to_dataset("usace_finetune_validation.jsonl")

# Displaying the structure of the train dataset as an example
print(f"Train Dataset:\n{train_dataset}")

### Prompt formatting function

In [None]:
def create_prompt(sample):
    bos_token = "<s>"
    eos_token = "</s>"

    # Directly access the 'system', 'user', and 'assistant' messages from the sample
    system_message = sample['system'].replace("\n", " ").strip()
    user_message = sample['user'].replace("\n", " ").strip()
    assistant_message = sample['assistant'].replace("\n", " ").strip()

    # Concatenate the prompt according to the specified format
    full_prompt = f"{bos_token}"
    full_prompt += f"### Instruction: \n{system_message}"
    full_prompt += f"\n\n### Input:\n{user_message}"
    full_prompt += f"\n\n### Response:\n{assistant_message}"
    full_prompt += f"{eos_token}"

    return full_prompt
create_prompt(train_dataset[5])


## Load and Train the Model

https://blog.paperspace.com/mistral-7b-fine-tuning/

In [None]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# torch.set_default_device('cuda')



### Quantization (QLoRA)

QLoRA by Dettmers et al., short for quantized LoRA, is a technique that reduces memory usage during finetuning. It stores the frozen pretrained weights in **4-bit precision**, dequantizing them on the fly for the forward and backward passes, and uses **paged optimizers** to handle memory spikes.

One author reports saving 33% of GPU memory when using QLoRA, at the cost of a 39% increase in training runtime. The slowdown is caused by the additional quantization/dequantization steps applied to the pretrained model weights in QLoRA.
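
A back-of-the-envelope sketch of why 4-bit storage helps: the estimate below covers the frozen weights only, and assumes roughly 7.24 billion parameters for Mistral-7B. It ignores quantization constants, activations, gradients, optimizer state and the LoRA adapters, so it is not a prediction of total GPU usage.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7.24e9  # approximate parameter count for Mistral-7B (assumption)

fp16_gb = weight_memory_gb(num_params, 16)  # half-precision weights
nf4_gb = weight_memory_gb(num_params, 4)    # 4-bit quantized weights

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```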

**Note:** It's preferred to use BFloat16, but it wasn't supported on the V100 GPU on Colab, so I worked around this.

https://brev.dev/blog/how-qlora-works

The above article raises a very good point about the democratization of AI.



In [None]:

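nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,               # store the base model weights in 4-bit
    bnb_4bit_quant_type="nf4",       # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    # bnb_4bit_compute_dtype=torch.bfloat16,  # preferred, but not supported on a V100
)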

In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# "stabilityai/stablelm-3b-4e1t" - No quantization?

# 'EleutherAI/gpt-neo-1.3B'

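model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    # 4-bit loading is already requested by nf4_config; also passing
    # load_in_4bit=True is redundant and is rejected by newer transformers versions
    quantization_config=nf4_config,
    use_cache=False,
)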

tokenizer = AutoTokenizer.from_pretrained(model_name)
    #    padding_side="left",
    # add_eos_token=True,
    # add_bos_token=True,

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Tokenization

Before training, you need to tokenize your dataset. The tokenizer will convert your text into a format that's suitable for the model to process:


**Newline Characters**

- How newline characters are handled depends on the tokenizer. Some tokenizers (e.g. BERT's) treat them as whitespace that separates tokens, while others (including the tokenizer used by Mistral) encode them as tokens of their own. Models trained on a wide variety of internet text expect to encounter newlines, so they won't cause issues.

- For some tasks, newline characters might carry semantic meaning (e.g., separating paragraphs or items in a list), which could be relevant for the model to understand the structure of the input text.


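**Note:** Padding affects compute requirements. Every sequence padded to `max_length` is processed at that full length, even if most of it is padding.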
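
To get a feel for the cost, the sketch below estimates what fraction of token positions would be padding for a given `max_length`. The sequence lengths are hypothetical. In practice you would take them from the tokenized dataset.

```python
def padding_fraction(lengths, max_length):
    """Fraction of token slots that are padding when every sequence is padded to max_length."""
    clipped = [min(n, max_length) for n in lengths]  # longer sequences are truncated
    total_slots = max_length * len(clipped)
    return 1 - sum(clipped) / total_slots

example_lengths = [200, 350, 500, 968]  # hypothetical token counts per example
wasted = padding_fraction(example_lengths, max_length=968)
print(f"{wasted:.1%} of token positions are padding")
```

A high fraction suggests lowering `max_length` or using sequence packing instead of padding.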

In [None]:

token = ""

# set max length of sequence
max_length=968


tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset. It has 'system', 'user' and 'assistant' columns
# (no 'text' column), so build each prompt with create_prompt first.
def tokenize_function(sample):
    return tokenizer(create_prompt(sample),
                     max_length=max_length, padding="max_length", truncation=True)

tokenized_train_dataset = train_dataset.map(tokenize_function)


### Plot a histogram of the distribution of tokens

In [None]:
import matplotlib.pyplot as plt


# Tokenize without padding or truncation so the histogram shows the true prompt lengths
def tokenize_for_lengths(sample):
    return tokenizer(create_prompt(sample))

tokenized_train_dataset = train_dataset.map(tokenize_for_lengths)
tokenized_val_dataset = val_dataset.map(tokenize_for_lengths)

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    plt.show()

plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

## Set LoRA Configuration


Choose the particular layers you want to fine-tune: the more layers, the higher the compute cost and training time, possibly for negligibly better performance.

As mentioned above, the original paper chose the Q and V layers of the attention mechanism.

r = rank of the low-rank matrices used in the adapters. It determines the number of parameters trained. In the extreme, full rank would be comparable to full fine-tuning.

alpha = scaling factor for the learned weights. The LoRA update is scaled by alpha/r, and thus a higher value for alpha assigns more weight to the LoRA activations.

The values used in the QLoRA paper were r=64 and lora_alpha=16.
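
Written out, the adapted layer computes the following. The symbols $W_0$ (frozen weight), $B$ and $A$ (trainable low-rank factors), $x$ (input) and $h$ (output) are notation introduced here for illustration, following the usual LoRA formulation:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

With $r = 64$ and $\alpha = 16$, the scaling factor is $\alpha / r = 16 / 64 = 0.25$.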

In [None]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.08,
    r=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        # "w1",
        # "w2",
        # "w3",
        # "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM"
)

In [None]:
## prepare model for kbit training
from peft import prepare_model_for_kbit_training


model = prepare_model_for_kbit_training(model)

# model.to('cuda')

In [None]:
# ONLY RUN this step to see the model - introduces errors


model = get_peft_model(model, peft_config)
model

### How many trainable parameters?

In [None]:
model.print_trainable_parameters()
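
As a sanity check, the adapter size can be estimated by hand. The sketch below assumes adapters on `q_proj`, `k_proj`, `v_proj` and `o_proj` with `r=64`, and uses the published Mistral-7B dimensions (hidden size 4096, 32 layers, key/value projections of width 1024 due to grouped-query attention). Treat these numbers as assumptions and compare the result with what `print_trainable_parameters()` reports.

```python
hidden_size = 4096  # Mistral-7B hidden dimension (assumption)
kv_dim = 1024       # 8 key/value heads x 128 head dim (assumption)
num_layers = 32     # number of transformer blocks (assumption)
r = 64              # LoRA rank

# (in_features, out_features) for each adapted projection
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
}

def lora_param_count(in_features, out_features, rank):
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total_trainable = per_layer * num_layers
print(f"{total_trainable:,} trainable LoRA parameters")
```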

### Training hyperparameters


Use the `transformers` trainer. This is not the full list of parameters.

In [None]:
from transformers import TrainingArguments
args = TrainingArguments(
  output_dir = "mistral_instruct_generation",
  #num_train_epochs=5,
  max_steps = 100,
  per_device_train_batch_size = 1,
  gradient_accumulation_steps = 4,
  # gradient_checkpointing = True,
  fp16 = True,
  # gradient_checkpointing_steps = 10,
  warmup_ratio = 0.03,  # fraction of total steps; warmup_steps expects an integer step count
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=40,
  learning_rate=2e-4,
  bf16=False,
  lr_scheduler_type='constant',
)
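
A quick way to reason about these settings: with gradient accumulation, the effective batch size is the per-device batch size times the number of accumulation steps (times the number of GPUs). The sketch below assumes a single GPU and, for the token estimate, sequences of 1024 tokens. Both are assumptions for illustration.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
num_devices = 1         # assumption: a single GPU
tokens_per_sequence = 1024  # assumption: fixed-length (packed) sequences

# Sequences contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Totals over the whole run
sequences_seen = effective_batch_size * max_steps
tokens_seen = sequences_seen * tokens_per_sequence

print(effective_batch_size, sequences_seen, tokens_seen)
```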

## Train the Model


Use the huggingface `trl` module. Remember to replace create_prompt with the specific function for formatting prompts that you used above.



In [None]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  dataset_text_field="text",
  train_dataset=train_dataset,
  eval_dataset= val_dataset,
  # instruct_tune_dataset["test"]
)

In [None]:
import time
start = time.time()
trainer.train()
print(time.time()- start)

### Save trained model.

In [None]:
# save model
trainer.model.save_pretrained("mistral_ft_8mar")
wandb.finish()
model.config.use_cache = True

## Inference

In [None]:
# logging.set_verbosity(logging.CRITICAL)
from transformers import pipeline
prompt = "Extract to JSON: The applicant has requested Department of the Army authorization to clear, grade and fill in order to expand an ex isting sand and gravel mining operation in Grangeville, Louisian a.The proposed sand and gra vel pits will encompass approximately 56.5 ac res and will be dug to adepth of 35 feet with 3:1 side slopes.Approximately 47,430 cubic yards of dirt, sand and gravel will be excavated and placed on upland areas of the site.It is anticipated that the proposed activity will impact approximately 1.47 acres of forested wetlands.It is presumed that the applicant has designed the project to a void and minimize direct and secondary adverse impacts tothe maximum extent practicable .As compensation forunavoidable wetland impacts, the applicant proposes to mitigatein-kind wetland credits from a Corps approved mitigat ion bank located in the watershed."
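pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=600)
# Note: this uses Mistral's [INST] chat format, whereas the training prompts were built
# with create_prompt ("### Instruction / ### Input / ### Response"). Using the same
# template as in training may give more reliable output.
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])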