# Finetuning Quantized Llama models with _Adapters_

In this notebook, we show how to efficiently fine-tune a quantized **Llama 2** or **Llama 3** model using [**QLoRA** (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314) and the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library.

For this example, we fine-tune Llama-2 7B / Llama-3 8B on supervised instruction-tuning data collected by the [Open Assistant project](https://github.com/LAION-AI/Open-Assistant) for training chatbots. This is similar to the setup used to train the Guanaco models in the QLoRA paper.
You can simply replace this with any of your own domain-specific data!

Additionally, you can quickly adapt this notebook to use other **adapter methods such as bottleneck adapters or prefix tuning.**

Pre-trained checkpoints based on this notebook can be found on HuggingFace Hub:
- for Llama-2 7B: [AdapterHub/llama2-7b-qlora-openassistant](https://huggingface.co/AdapterHub/llama2-7b-qlora-openassistant)
- for Llama-2 13B: [AdapterHub/llama2-13b-qlora-openassistant](https://huggingface.co/AdapterHub/llama2-13b-qlora-openassistant)
- for Llama-2 7B with sequential bottleneck adapter: [AdapterHub/llama2-7b-qadapter-seq-openassistant](https://huggingface.co/AdapterHub/llama2-7b-qadapter-seq-openassistant)

In [1]:
! nvidia-smi

Thu Aug 29 10:47:03 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A4000    Off  | 00000000:00:05.0 Off |                  Off |
| 47%   62C    P8    39W / 140W |      1MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Installation

Besides `adapters`, we require `bitsandbytes` for quantization and `accelerate` for training.

In [None]:
! pip install -q datasets==2.20.0 \
                 accelerate==0.33.0 \
                 evaluate==0.4.2 \
                 peft==0.12.0 \
                 adapters==1.0.0 \
                 bitsandbytes==0.43.3

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Load Open Assistant dataset

We use the [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset published by the QLoRA authors. It contains a small subset of conversations from the full Open Assistant database and was also used to fine-tune the Guanaco models in the QLoRA paper.

In [3]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

Repo card metadata block was not found. Setting CardData to empty.


The training split has roughly 10k samples, and a small test split is held out for evaluation:

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [5]:
print(dataset["train"][0]["text"])

### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading
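
Each sample is a single string in which the turns are marked with `### Human:` and `### Assistant:`. If you swap in your own data, it may help to check that it follows the same layout. The following is a minimal sketch that splits such a string into `(role, text)` pairs; the helper name `split_turns` and the shortened sample string are made up for illustration and are not part of the dataset or of any library.

```python
import re

# Matches the turn markers used in the openassistant-guanaco samples.
TURN_PATTERN = re.compile(r"### (Human|Assistant):\s*")


def split_turns(text):
    """Split a conversation string into a list of (role, text) pairs."""
    parts = TURN_PATTERN.split(text)
    # With one capturing group, re.split returns
    # [prefix, role_1, text_1, role_2, text_2, ...].
    roles = parts[1::2]
    contents = parts[2::2]
    return [(role, content.strip()) for role, content in zip(roles, contents)]


# Hypothetical, shortened sample in the same layout as the dataset.
sample = "### Human: What is a monopsony?### Assistant: A market with a single buyer."
turns = split_turns(sample)
for role, content in turns:
    print(f"{role}: {content}")
```

Multi-turn conversations simply yield more pairs in the returned list.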

## Load and prepare model and tokenizer

We download the official Llama-2 7B / Llama-3 8B checkpoint from the HuggingFace Hub (**Note:** You must request access to this model on the HuggingFace website and use an API token to download it).

Via the `BitsAndBytesConfig`, we specify that the model should be loaded with 4-bit (NF4) quantization and with double quantization for even better memory efficiency. See [their documentation](https://huggingface.co/docs/bitsandbytes/main/en/index) for more on this.
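
To get a feeling for what double quantization saves, here is a rough back-of-the-envelope sketch based on the numbers given in the QLoRA paper: weights are quantized in blocks of 64 with one 32-bit constant per block, and double quantization stores these constants in 8 bits, with one further 32-bit constant per 256 blocks. The block sizes are assumptions taken from the paper and may differ from the defaults of your `bitsandbytes` version; the function name is hypothetical.

```python
def bits_per_param(double_quant, block_size=64, second_block_size=256):
    """Approximate storage cost per weight for 4-bit block-wise quantization."""
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per group of blocks.
        overhead = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One 32-bit quantization constant per block.
        overhead = 32 / block_size
    return 4 + overhead


def weight_memory_gib(num_params, double_quant):
    """Approximate memory of the quantized weights in GiB."""
    return num_params * bits_per_param(double_quant) / 8 / 2**30


num_params = 7_000_000_000  # illustrative round number, not an exact model size
for flag in (False, True):
    print(f"double_quant={flag}: {bits_per_param(flag):.3f} bits/param, "
          f"{weight_memory_gib(num_params, flag):.2f} GiB")
```

The saving is small per parameter (about 0.37 bits), but it adds up to a few hundred megabytes for a model of this size.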

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

# modelpath="meta-llama/Llama-2-7b-hf"
modelpath="meta-llama/Meta-Llama-3-8B"

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
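    # Read the access token from the environment instead of hard-coding it.
    # The variable name HF_TOKEN is an assumption; set it to your own token.
    token=os.environ["HF_TOKEN"],
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(modelpath, token=os.environ["HF_TOKEN"])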
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

We initialize the adapter functionality in the loaded model via `adapters.init()` and add a new LoRA adapter (named `"assistant_adapter"`) via `add_adapter()`.

In the call to `LoRAConfig()`, you can configure how and where LoRA layers are added to the model. Here, we want to add LoRA layers to the query, key, and value projections of the self-attention modules (`attn_matrices=["q", "k", "v"]`) as well as to the intermediate and output linear layers.

In [7]:
import adapters
from adapters import LoRAConfig

adapters.init(model)

config = LoRAConfig(
    selfattn_lora=True, intermediate_lora=True, output_lora=True,
    attn_matrices=["q", "k", "v"],
    alpha=16, r=64, dropout=0.1
)
model.add_adapter("assistant_adapter", config=config)
model.train_adapter("assistant_adapter")

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
assistant_adapter        lora            113,246,208       2.820       1       1
--------------------------------------------------------------------------------
Full model                              4,015,263,744     100.000               0
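
The adapter parameter count in the summary above can be reproduced by hand. A LoRA layer of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices with `r * d_in` and `d_out * r` entries. The sketch below assumes the Llama-3 8B dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers) and that `intermediate_lora` and `output_lora` each attach to one MLP projection (one up-projection and one down-projection). The MLP size and the exact placement are assumptions that are not shown in this notebook, although the resulting total matches the reported number.

```python
def lora_params(d_in, d_out, r):
    """Number of parameters in a rank-r LoRA update for a d_in -> d_out linear layer."""
    return r * d_in + d_out * r


hidden, kv_dim, intermediate = 4096, 1024, 14336  # assumed Llama-3 8B dimensions
n_layers, r = 32, 64

per_layer = (
    lora_params(hidden, hidden, r)            # q projection
    + 2 * lora_params(hidden, kv_dim, r)      # k and v projections
    + lora_params(hidden, intermediate, r)    # intermediate (up) layer
    + lora_params(intermediate, hidden, r)    # output (down) layer
)
total = n_layers * per_layer
share = 100 * total / 4_015_263_744  # "Full model" count from the summary

print(f"{total:,} adapter parameters ({share:.3f}% of the full model count)")
```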


Some final preparations for 4-bit training: we cast a few small parameters (e.g. layer norms) and the output of the language modeling head to float32 for stability.

In [8]:
for param in model.parameters():
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing to reduce required memory if needed
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

In [9]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayerWithAdapters(
        (self_attn): LlamaSdpaAttentionWithAdapters(
          (q_proj): LoRALinear4bit(
            in_features=4096, out_features=4096, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (k_proj): LoRALinear4bit(
            in_features=4096, out_features=1024, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (v_proj): LoRALinear4bit(
            in_features=4096, out_features=1024, bias=False
            (loras): ModuleDict(
              (assistant_adapter): LoRA(
                (lora_dropout): Dropout(p=0.1, inplace=False)
      

In [10]:
# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

torch.bfloat16 1050673152 0.22576446079143245
torch.uint8 3489660928 0.749844436640606
torch.float32 113512448 0.024391102567961606
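
These counts can be sanity-checked with a little arithmetic. The sketch below assumes the Llama-3 8B dimensions (MLP size 14336 and the seven linear projections per decoder layer are assumptions that are not fully visible in the truncated printout above) and that the 4-bit weights are packed two per `uint8` element, which is why the `uint8` element count is half the number of quantized weights. The LoRA parameter count is taken from the adapter summary.

```python
hidden, kv_dim, intermediate = 4096, 1024, 14336  # assumed Llama-3 8B dimensions
vocab, n_layers = 128256, 32

# Linear layers in each decoder block that are stored in 4-bit precision.
linear_per_layer = (
    2 * hidden * hidden            # q_proj and o_proj
    + 2 * hidden * kv_dim          # k_proj and v_proj
    + 3 * hidden * intermediate    # gate_proj, up_proj and down_proj
)
quantized_weights = n_layers * linear_per_layer
uint8_elements = quantized_weights // 2  # two 4-bit values per uint8 element

# Input embeddings and LM head are not quantized and stay in bfloat16.
bf16_elements = 2 * vocab * hidden

# 1-dim norm weights were cast to float32; the LoRA weights are float32 as well.
norm_elements = (2 * n_layers + 1) * hidden
lora_elements = 113_246_208  # from the adapter summary
fp32_elements = norm_elements + lora_elements

print("uint8   ", uint8_elements)
print("bfloat16", bf16_elements)
print("float32 ", fp32_elements)
```

Note that the fractions printed by the notebook cell are fractions of stored elements, not of memory or of logical weights, since each `uint8` element holds two weights.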


## Prepare data for training

Each sample is tokenized and truncated to at most 512 tokens.

In [11]:
import os 

def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=512, # can set to longer values such as 2048
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize, 
    batched=True, 
    num_proc=os.cpu_count(),    # tokenize in parallel worker processes
    remove_columns=["text"]     # don't need this anymore, we have tokens from here on
)

In [12]:
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 518
    })
})

## Training

We specify training hyperparameters and train the model using the `AdapterTrainer` class.

The hyperparameters here are similar to those chosen [in the official QLoRA repo](https://github.com/artidoro/qlora/blob/main/scripts/finetune_llama2_guanaco_7b.sh), but feel free to configure as you wish!

In [13]:
args = TrainingArguments(
    output_dir="output/llama_qlora",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",
    logging_steps=10,
    num_train_epochs=2,
    save_total_limit=2,
    gradient_accumulation_steps=16,
    lr_scheduler_type="linear",
    optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    bf16=True,
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    report_to="none"
)
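
As a rough guide to what these settings imply, the following sketch estimates the effective batch size and the number of optimizer steps for the 9,846 training samples on a single GPU. It is an approximation: the `Trainer` may round a partial final accumulation window differently, so the actual step count can be off by a few steps.

```python
import math

num_train_samples = 9846
per_device_batch_size = 4
gradient_accumulation_steps = 16
num_devices = 1            # this notebook exposes a single GPU
num_train_epochs = 2
warmup_ratio = 0.03

# Samples that contribute to one optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer updates.
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps: {total_steps} ({warmup_steps} warmup)")
```

With `logging_steps=10`, this corresponds to roughly 30 logged loss values over the whole run.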

In [14]:
from adapters import AdapterTrainer
from transformers import DataCollatorForLanguageModeling

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args
)
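
For intuition, `DataCollatorForLanguageModeling` with `mlm=False` pads the samples of a batch to a common length and copies `input_ids` to `labels`, replacing padding token ids with `-100` so that they are ignored by the loss. The following is a simplified pure-Python sketch of that behavior, not the library implementation; it assumes right padding, and the function name is hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(samples, pad_token_id):
    """Pad token id lists to equal length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Every position holding the pad id is masked out of the loss.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in padded]
        )
    return batch


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])
```

Because masking is done by token id and this notebook sets `pad_token` to the EOS token, any genuine EOS tokens in the data would be masked as well.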



In [15]:
trainer.train()



Epoch,Training Loss,Validation Loss


In [None]:
trainer.save_model()

## Inference

Finally, we can prompt the model:

In [None]:
# Ignore warnings
from transformers import logging
logging.set_verbosity(logging.CRITICAL)

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text}\n### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    
    model.eval()
    # torch.cuda.amp.autocast() is deprecated in recent PyTorch releases
    with torch.inference_mode(), torch.autocast("cuda"):
        output_tokens = model.generate(**batch, max_new_tokens=50)

    return tokenizer.decode(output_tokens[0], skip_special_tokens=True)


In [None]:
print(prompt_model(model, "Explain Calculus to a primary school student"))
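
The decoded output contains the prompt followed by the generated continuation, and the model may keep going with another `### Human:` turn. If only the assistant reply is of interest, it can be cut out with plain string handling. This is a minimal sketch with hypothetical helper names and a made-up decoded string; it assumes the prompt layout used in `prompt_model` above.

```python
def build_prompt(text):
    """Prompt layout used in this notebook."""
    return f"### Human: {text}\n### Assistant:"


def extract_reply(decoded, stop_marker="### Human:"):
    """Return the first assistant reply from a decoded generation."""
    # Keep everything after the first assistant marker ...
    reply = decoded.split("### Assistant:", 1)[-1]
    # ... and drop anything from a follow-up human turn onwards.
    reply = reply.split(stop_marker, 1)[0]
    return reply.strip()


# Hypothetical decoded output, for illustration only.
decoded = build_prompt("Say hello") + " Hello there!### Human: And goodbye?"
reply = extract_reply(decoded)
print(reply)
```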

## Merge LoRA weights

For lower inference latency, the LoRA weights can be merged with the base model:

In [None]:
model.merge_adapter("assistant_adapter")
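
Conceptually, merging folds the low-rank update into the frozen weight matrix, so that a single matrix multiplication replaces the separate base and LoRA paths. The toy sketch below uses the standard LoRA formulation, W' = W + (alpha / r) * B A, with tiny hand-written matrices in pure Python. It only illustrates the algebra and says nothing about how `merge_adapter()` handles the quantized weights internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


scaling = 16 / 64  # alpha / r as configured in this notebook

# Toy shapes: 3 inputs, 2 outputs, LoRA rank 2 (kept small for readability).
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]    # frozen base weight (out x in)
A = [[0.5, 0.0, 1.0], [0.0, -0.5, 0.25]]   # LoRA down-projection (rank x in)
B = [[1.0, 0.0], [0.5, -1.0]]              # LoRA up-projection (out x rank)
x = [[1.0], [2.0], [3.0]]                  # input column vector

# Unmerged: base path plus scaled low-rank path.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into the weight once, then a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, while the merged version avoids the extra matrix multiplications at inference time.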

In [None]:
print(prompt_model(model, "Explain NLP in simple terms"))