## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we fine-tune a large language model with the 🤗 `peft` library, using `bitsandbytes` to load the model in 4-bit precision.
Fine-tuning relies on Low-Rank Adaptation (LoRA): instead of updating every weight in the model, you train only a small set of adapter weights and load them into the frozen base model.
After fine-tuning, you can share your adapters on the 🤗 Hub and load them back just as easily. Let's get started!
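
To make the LoRA idea concrete, here is a minimal sketch of the adapted forward pass, `W x + (alpha / r) * B (A x)`, written with plain Python lists. The toy shapes, the values, and the helper names `matvec` and `lora_forward` are assumptions made for illustration; this is not how `peft` implements it internally.

```python
# Minimal LoRA forward pass on plain Python lists (illustrative only).

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]        # frozen weight, shape (2, 3)
A = [[0.5, 0.5, 0.5]]        # trainable, shape (r, 3)
B_zero = [[0.0], [0.0]]      # trainable, shape (2, r), zero at initialization
x = [1.0, 2.0, 3.0]

# With B = 0 the adapter is a no-op, so the output equals W x.
y_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)

# Pretend values for B after some training steps.
B_trained = [[1.0], [-1.0]]
y_tuned = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(y_init, y_tuned)
```

Because `B` starts at zero, training begins from exactly the pretrained model's behaviour, and only the small `A` and `B` matrices need gradients.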

### Install requirements

First, run the cells below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

### Model loading

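Let's load the MetaMath-Mistral-7B model (from the sharded checkpoint `1TuanPham/MetaMath-Mistral-7B-900MB-sharded`) with its weights quantized to 4-bit NF4 by `bitsandbytes`; computation runs in `bfloat16`.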

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

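model = AutoModelForCausalLM.from_pretrained(
    "1TuanPham/MetaMath-Mistral-7B-900MB-sharded",
    # 4-bit loading is requested through quantization_config; recent transformers
    # releases reject load_in_4bit=True when a quantization_config is also passed
    quantization_config=nf4_config,
    device_map='auto',
    # use_flash_attention_2=True,
)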

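tokenizer = AutoTokenizer.from_pretrained("1TuanPham/MetaMath-Mistral-7B-900MB-sharded")
# The collator below pads with tokenizer.pad_token_id; fall back to <unk> if the checkpoint defines no pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token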
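
As a rough guide to what 4-bit loading saves, the sketch below estimates weight memory only. The round 7-billion parameter count is an assumption, and the block sizes (64 weights per scale, 256 scales per second-level scale) follow the description in the QLoRA paper; activations, the unquantized embedding and normalization layers, and CUDA overhead are ignored.

```python
# Back-of-the-envelope weight memory for a 7B-parameter model.
N_PARAMS = 7_000_000_000   # assumption: round figure, not the exact model size
GIB = 1024 ** 3

def weight_gib(bits_per_param):
    """Memory needed to store all weights at the given bits per parameter."""
    return N_PARAMS * bits_per_param / 8 / GIB

# NF4 stores one 4-bit code per weight plus one scale per block of 64 weights.
BLOCK = 64
overhead_plain = 32 / BLOCK                        # fp32 scale per block: 0.5 bits/param
# Double quantization keeps 8-bit scales plus one fp32 scale per 256 blocks.
overhead_double = 8 / BLOCK + 32 / (BLOCK * 256)   # about 0.127 bits/param

fp16_gib = weight_gib(16)
nf4_gib = weight_gib(4 + overhead_plain)
nf4_dq_gib = weight_gib(4 + overhead_double)

print(f"fp16 weights:         {fp16_gib:.2f} GiB")
print(f"NF4 weights:          {nf4_gib:.2f} GiB")
print(f"NF4 + double quant.:  {nf4_dq_gib:.2f} GiB")
```

The `bnb_4bit_use_double_quant=True` flag in `nf4_config` corresponds to the last line: it shrinks the per-block scale overhead, not the 4-bit codes themselves.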

### Post-processing on the model

Finally, we need to apply some post-processing to the quantized model to enable training. We freeze all layers and cast the small one-dimensional parameters (such as the layer norms) to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
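
The stability argument can be illustrated without a GPU. The sketch below uses the standard-library `struct` module to round values to half and single precision; the helper names `to_fp16` and `to_fp32` are hypothetical and exist only for this demonstration.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def to_fp32(value):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

# Near 1.0, float16 values are spaced 2**-10 apart, so a small update vanishes.
small_update = 1e-4
fp16_result = to_fp16(to_fp16(1.0) + to_fp16(small_update))
fp32_result = to_fp32(to_fp32(1.0) + to_fp32(small_update))

print("float16:", fp16_result)   # the update is rounded away
print("float32:", fp32_result)   # the update survives
```

Layer-norm scales sit close to 1.0, which is exactly where half precision loses small increments, so keeping them (and the final logits) in `float32` is a cheap safeguard.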

### Apply LoRA

Here comes the magic with `peft`! We wrap the model in a `PeftModel` and attach low-rank adapters (LoRA) using the `get_peft_model` utility function from `peft`.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    # target_modules=["query_key_value","dense_h_to_4h","dense_4h_to_h","dense"],
    target_modules=["q_proj","k_proj","v_proj","o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
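
As a sanity check on the number printed above, the trainable parameter count can be estimated by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes the standard Mistral-7B shapes (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); verify them against `model.config` before relying on the result.

```python
# Assumed Mistral-7B shapes (check model.config for the checkpoint you load).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dims per head
INTERMEDIATE = 14336
NUM_LAYERS = 32
R = 16                 # same rank as in LoraConfig above

# (in_features, out_features) for every module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """A is (r x in_features) and B is (out_features x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Note that the percentage reported by `print_trainable_parameters` may look larger than expected on a 4-bit model, because packed quantized weights report fewer elements than the original matrices.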

### Training

In [None]:
!gdown 1EbbMx9Uai47B1L2xqLudIaYvsb8cjUxK
!gdown 16ZT-KCEnx8gu-q-2jmeBh7SkdUtfXVAZ
!gdown 1kLhfa4LPjkvmjoWaI03I-v5MykByGSUF
!gdown 1udRGcgH7Dc1hnVPTipB0XwpgvJ2SsRCD

In [None]:
from datasets import load_dataset, DatasetDict, Dataset
dataset = load_dataset("json", data_files="math_train-all_explained-gpt4-v2.json", field='data')
dataset['train'] = dataset['train'].filter(lambda x : x['explanation'] is not None)
dataset

In [None]:
dataset_gsm8k = load_dataset("json", data_files="GSM8K-MetaMath-max256words.json")

In [None]:
from datasets import concatenate_datasets
dataset['train'] = concatenate_datasets([dataset['train'], dataset_gsm8k['train']])
dataset

In [None]:
from typing import Dict, List, Union

def multiple_choice(
        inp: Dict[str, Union[str, List[str], int]], inference:bool = False) -> Dict[str, str]:
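    # The templates stay in Vietnamese to match the training data.
    # "Câu hỏi" = "Question", "Lựa chọn" = "Choices", "Trả lời" = "Answer",
    # "Đáp án cuối cùng" = "Final answer".
    PROMPT_WITH_CHOICES = "### Câu hỏi:\n{instruction}\n\n### Lựa chọn:\n{choices}\n\n### Trả lời:\n"
    PROMPT_WITHOUT_CHOICES = "### Câu hỏi:\n{instruction}\n\n### Trả lời:\n"
    ANS = "{exp}\n\n### Đáp án cuối cùng:\n{answer}"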

    query = inp['question']

    out = {}
    if inp['choices'] is not None and len(inp['choices']) > 0:
      options = ''
      assert isinstance(inp['choices'], List)
      for option in inp['choices']:
          options += f'\n - {option}'

      out['prompt'] = PROMPT_WITH_CHOICES.format(instruction=query, choices=options)
    else:
      out['prompt'] = PROMPT_WITHOUT_CHOICES.format(instruction=query)

    if not inference:
      out['response'] = ANS.format(exp=inp["explanation"] , answer=inp['answer'])
    return out
from pprint import pprint
pprint(multiple_choice(dataset['train'][2000], inference=False))

In [None]:
import transformers
from typing import Sequence, Dict, List
import copy

In [None]:
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            # max_length=512,
            # truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
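
The masking step in `preprocess` is easier to see on plain lists. The following sketch uses made-up token ids and a hypothetical helper `mask_prompt_labels`; it mirrors the idea (prompt tokens get `IGNORE_INDEX`, so only the response contributes to the loss) rather than the tensor code above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids and hide the first prompt_len positions from the loss."""
    labels = list(input_ids)
    prompt_len = min(prompt_len, len(labels))   # never grow the list
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Made-up token ids: 4 prompt tokens followed by 3 response tokens.
prompt_ids = [1, 501, 502, 503]
response_ids = [601, 602, 2]
input_ids = prompt_ids + response_ids

labels = mask_prompt_labels(input_ids, len(prompt_ids))
print(labels)
```

The model still sees the full sequence as input; only the loss is restricted to the answer part.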




In [None]:
from dataclasses import dataclass
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def naive__call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        print("Naive")
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

    def __call__(self, instances: Sequence[Dict], inference:bool=False) -> Dict[str, torch.Tensor]:
        sources = []
        targets = []
        # print("Len", len(instances))
        for instance in instances:
            # print(instance)
            mtc = multiple_choice(instance, inference)
            source, target = mtc['prompt'], mtc['response']
            target += DEFAULT_EOS_TOKEN
            # print(source, target)
            sources.append(source)
            targets.append(target)

        data_dict = preprocess(sources, targets, self.tokenizer)
        input_ids, labels = data_dict['input_ids'], data_dict['labels']
        # input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
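
What the collator does after tokenization can be sketched on plain lists: pad inputs with the pad id, pad labels with `IGNORE_INDEX`, and derive the attention mask from the padded inputs. The pad id `0` and the helper `pad_batch` are assumptions for this illustration; the real collator uses `torch.nn.utils.rnn.pad_sequence` and `tokenizer.pad_token_id`.

```python
PAD_ID = 0           # hypothetical pad id; the notebook uses tokenizer.pad_token_id
IGNORE_INDEX = -100

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Two made-up examples of different lengths, first token masked as "prompt".
batch_ids = [[5, 6, 7], [8, 9]]
batch_labels = [[IGNORE_INDEX, 6, 7], [IGNORE_INDEX, 9]]

input_ids = pad_batch(batch_ids, PAD_ID)
labels = pad_batch(batch_labels, IGNORE_INDEX)
attention_mask = [[token != PAD_ID for token in row] for row in input_ids]

print(input_ids)
print(labels)
print(attention_mask)
```

Padding labels with `IGNORE_INDEX` rather than the pad id keeps padded positions out of the loss as well.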


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
model.to('cuda')
with torch.autocast('cuda'):
  trainer = Trainer(
      model=model,
      train_dataset=dataset['train'],
      args=TrainingArguments(
          per_device_train_batch_size=4,
          gradient_accumulation_steps=8,
          dataloader_num_workers=1,
          warmup_steps=100,
          max_steps=1000,
          learning_rate=5e-5,
          optim='paged_lion_32bit',
          lr_scheduler_type="cosine",
          fp16=True,
          logging_steps=1,
          output_dir='outputs',
          remove_unused_columns=False
      ),
      data_collator=data_collator
  )
  model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
  trainer.train()
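
Two of the settings above are worth unpacking: the effective batch size and the learning-rate schedule. The sketch below assumes a single GPU (matching `CUDA_VISIBLE_DEVICES="0"`) and re-implements linear warmup followed by cosine decay with the half-cycle form commonly used for a cosine scheduler; `cosine_lr` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
NUM_DEVICES = 1   # assumption: one visible GPU
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

def cosine_lr(step, base_lr=5e-5, warmup_steps=100, max_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print("effective batch size:", effective_batch)
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step))
```

With gradient accumulation, one optimizer step therefore sees 32 examples, and the learning rate peaks at step 100 before decaying for the remaining 900 steps.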

## Share adapters on the 🤗 Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
model.push_to_hub("MetaMath-QLORA-bs32-200it", use_auth_token=True)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "nero1342/MetaMath-QLORA-bs32-200it"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, is_trainable=True)

In [None]:
model

## Inference

You can now run inference with either the model you just trained or the one loaded from the 🤗 Hub, exactly as you would with any `transformers` model.

In [None]:
!gdown 13-dycsFk8-QQnZ2bDpmEyDtCV6CFrSgI

In [None]:
from datasets import load_dataset
dataset_test = load_dataset("json", data_files="math_test_with_answer.json", field='data')


In [None]:
multiple_choice(dataset_test['train'][0], inference=True)

In [None]:
sample = dataset_test['train'][60]
sample

In [None]:
# Hand-written sample; the question reads "The result of 72,1 – 30,4 is:"
sample = {'id': '01-0326',
 'question': 'Kết quả của phép tính 72,1 – 30,4 là:',
 'choices': ['A. 4,17', 'B. 41,7', 'C. 417', 'D. 47,1'],
 'answer': 'B. 41,7'}  # 72,1 - 30,4 = 41,7

In [None]:
model.to('cuda')

In [None]:
model.config.use_cache=True
model.bfloat16()
model.eval()

In [None]:
# sample = dataset_test['train'][100]
batch = tokenizer(multiple_choice(sample, inference=True)['prompt'], return_tensors='pt').to('cuda')

with torch.no_grad():
  output_tokens = model.generate(**batch,
                                 max_new_tokens=512,
                                #  do_sample=True,
                                 top_k =5,
                                #  top_p = 0.7,
                                 penalty_alpha=0.6,
                                #  repetition_penalty=1.15,
                                #  pad_token_id=tokenizer.eos_token_id,
                                #  eos_token_id=tokenizer.eos_token_id,
                                 )

output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print('\n\n', output)

In [None]:
tokenizer.bos_token_id

In [None]:
print(output_tokens[0][:])
output = tokenizer.decode(output_tokens[0][:], skip_special_tokens=False)
print(output)


In [None]:
from tqdm import tqdm
import random
cnt = 0
all = []
outputs = []
# model.to('cuda')
# model.eval()
for i, sample in enumerate(tqdm(dataset_test['train'])):
  batch = tokenizer(multiple_choice(sample, inference=True)['prompt'], return_tensors='pt').to('cuda')

  with torch.no_grad():
    output_tokens = model.generate(**batch, max_new_tokens=512,
                                #  do_sample=True,
                                #  repetition_penalty=1.15,
                                #  eos_token_id=tokenizer.eos_token_id,
                                # pad_token_id=tokenizer.pad_token_id,
                                 top_k=5, penalty_alpha=0.6,

                                # top_p=0.9,
                                   )

  output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

  print('\n\n', output)

  # "### Đáp án cuối cùng:" means "### Final answer:"; use one marker for both the check and the split
  marker = "### Đáp án cuối cùng:\n"
  if marker not in output:
    print("Random answer")
    pred = random.choice(sample['choices'])
  else:
    pred = output.split(marker)[1].strip()
    if "####" in pred:
      pred = pred.split("####")[0]
    for c in sample['choices']:
      if pred in c: pred = c
    for c in sample["choices"]:
      if c in pred: pred = c

  print("\n\n", "Correct answer:", sample["answer"], pred)
  all.append(pred)
  outputs.append(output)
  if pred == sample["answer"]:
    cnt += 1
    print("Correct", cnt, i + 1)

print("Number of correct prediction:", cnt, "accuracy: ", cnt / len(dataset_test['train']))
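
The answer-matching logic in the loop above can be isolated into a small function that is easy to test without a model. This is a simplified sketch, not a drop-in replacement: `extract_choice` is a hypothetical helper, it returns `None` instead of picking a random choice, and it stops at the first matching choice.

```python
FINAL_MARKER = "### Đáp án cuối cùng:\n"   # the "### Final answer:" marker from the prompt template

def extract_choice(output, choices):
    """Return the choice that matches the text after the final-answer marker, or None."""
    if FINAL_MARKER not in output:
        return None
    # Keep the text after the marker and drop anything after a "####" separator.
    pred = output.split(FINAL_MARKER, 1)[1].split("####")[0].strip()
    if not pred:
        return None
    for choice in choices:
        if pred in choice or choice in pred:
            return choice
    return None

choices = ['A. 4,17', 'B. 41,7', 'C. 417', 'D. 47,1']
generated = (
    "### Câu hỏi:\nKết quả của phép tính 72,1 – 30,4 là:\n\n"
    "### Trả lời:\n72,1 - 30,4 = 41,7\n\n"
    "### Đáp án cuối cùng:\nB. 41,7"
)

print(extract_choice(generated, choices))
print(extract_choice("text without a final answer", choices))
```

Returning `None` makes the "no answer found" case explicit, so the caller can decide whether to fall back to a random choice as the loop above does.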

In [None]:
!gdown 1BlbXh5dEHDhEMOyuwmCUL4oSYRVJppAF

In [None]:
dataset_test = load_dataset('json', data_files='math_test.json', field='data')

In [None]:
from tqdm import tqdm
import random
cnt = 0
model.to('cuda')
all = []
for i, sample in enumerate(tqdm(dataset_test['train'])):
  batch = tokenizer(multiple_choice(sample, inference=True)['prompt'], return_tensors='pt').to('cuda')

  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=50)

  output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

  print('\n\n', output)
  pred = output.split("### Trả lời:\n")[1]
  all.append(pred)
# print("Number of correct prediction:", cnt, "accuracy: ", cnt / len(dataset['test']))

In [None]:
# Collect the sample ids first; the submission cell below needs them
id = [x['id'] for x in dataset_test['train']]
id[:10]

In [None]:
import pandas as pd
df = pd.DataFrame()
df['id'] = id
df['answer'] = all
df.to_csv('baseline.csv', index=False)

In [None]:
len(all)

In [None]:
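# Tokenize every test prompt as a quick check that formatting works for all samples
for sample in dataset_test['train']:
  batch = tokenizer(multiple_choice(sample, inference=True)['prompt'], return_tensors='pt')


In [None]:
sample = dataset_test['train'][2]

In [None]:
# Build the prompt with the same helper used for training and move it to the GPU
batch = tokenizer(multiple_choice(sample, inference=True)['prompt'], return_tensors='pt').to('cuda')

with torch.autocast('cuda'):
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

In [None]:
sample.get('answer')  # None when the test file ships without answers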

Comparing the generated text with the reference answer, when the test file provides one, is a quick sanity check that the fine-tuned adapters produce answers in the expected format.