# Fine-Tuning with Llama 2, Bits and Bytes, and QLoRA

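Today we'll explore fine-tuning the Llama 2 model using QLoRA, Bits and Bytes, and PEFT. In this notebook the weights are pulled from the Hugging Face Hub, which requires an access token for the gated `meta-llama` repositories.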

- QLoRA: [Quantized Low Rank Adapters](https://arxiv.org/pdf/2305.14314.pdf) - this is a method for fine-tuning LLMs that keeps the base model frozen in a quantized (4-bit) format and trains only a small number of additional low-rank "adapter" parameters, which limits the memory and compute needed for training. Because the adapters are small and separate from the base weights, you can potentially fine-tune on many datasets and swap these adapters into your model when necessary.
- [Bits and Bytes](https://github.com/TimDettmers/bitsandbytes): An excellent package by Tim Dettmers et al., which provides a lightweight wrapper around custom CUDA functions that make LLMs go faster - optimizers, matrix multiplications, and quantization. In this notebook we'll be using the library to load our model as efficiently as possible.
- [PEFT](https://github.com/huggingface/peft): An excellent Hugging Face library that enables a number of Parameter Efficient Fine-tuning (PEFT) methods, which again make it less expensive to fine-tune LLMs - especially on more lightweight hardware like that present in Kaggle notebooks.
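
As a rough sketch of the idea behind the QLoRA bullet (the notation is assumed here and follows the LoRA and QLoRA papers, not anything defined in this notebook): a frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is left untouched, and a low-rank update is learned next to it.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

In QLoRA, $W_0$ is stored in 4-bit form and dequantized for the matrix multiply, while only $A$ and $B$ receive gradients. The symbols $r$ and $\alpha$ correspond to the `r` and `lora_alpha` arguments of `LoraConfig` used later on.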

Many thanks to [Bojan Tunguz](https://www.kaggle.com/tunguz) for his excellent [Jeopardy dataset](https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions)!

This notebook is based on [an excellent example from LangChain](https://github.com/asokraju/LangChainDatasetForge/blob/main/Finetuning_Falcon_7b.ipynb).

## Drive mounting

In [1]:
from google.colab import drive
drive.mount('/content/drive')
data_path = '/content/drive/MyDrive/Colab Notebooks/dataset/'

Mounted at /content/drive


## **Private token!!**

In [2]:
import os
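# private: read KEY=VALUE pairs from a local .env file into the environment
def load_env():
    with open('.env', 'r') as f:
        for line in f:
            if line.strip():  # process only non-empty lines
                # split on the first '=' only, so values may contain '='
                key, value = line.strip().split('=', 1)
                os.environ[key] = value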

load_env()
ACCESS_TOKEN = os.getenv('HUGGINGFACE_TOKEN')

In [3]:
!huggingface-cli login --token $ACCESS_TOKEN --add-to-git-credential

Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Package Installation

Dependencies in this space can be quite difficult to untangle, and simply taking the latest version of each library can lead to conflicting version requirements. Note that the cell below does not pin anything: it installs the latest releases and, for `transformers` and `peft`, the current development branches. Once you find a combination that works for your particular use case, it's a good idea to take note of those versions and `pip install` them directly.

In [4]:
%%capture
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install accelerate -U
!pip install -qqq torch
!pip install -qqq -U git+https://github.com/huggingface/transformers.git
!pip install -qqq -U git+https://github.com/huggingface/peft.git
!pip install -qqq accelerate
!pip install -qqq datasets
!pip install -qqq loralib
!pip install -qqq einops
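
One way to take note of a working combination is to print the installed versions in `pip` requirement form right after the install cell. This is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

# Illustrative list: the main libraries installed by the cell above.
packages = ["torch", "transformers", "peft", "accelerate", "bitsandbytes", "datasets"]
for name, ver in installed_versions(packages).items():
    print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```

The printed `name==version` lines can be pasted into a later `pip install` command to reproduce the environment. Packages installed from a git branch may report a development version string, so the commit hash is worth recording separately in that case.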

In [5]:
import pandas as pd
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import numpy as np
import transformers
from datasets import load_dataset, Dataset
from huggingface_hub import notebook_login

from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

## Loading and preparing our model

We're going to use the Llama 2 13B chat model (`meta-llama/Llama-2-13b-chat-hf`) for our test. We'll be using Bits and Bytes to load it in 4-bit format, which should reduce memory consumption considerably, at a cost of some accuracy.

Note the parameters in `BitsAndBytesConfig` - this is a fairly standard 4-bit quantization configuration: the weights are loaded in 4-bit format using the NormalFloat4 (`nf4`) data type, with double quantization, which also quantizes the quantization constants to save a little more memory. The stored weights stay in 4-bit and are frozen; they are dequantized to `bfloat16` on the fly for each forward and backward computation, and the extra precision is then discarded.
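
To get a feel for the savings, here is a back-of-the-envelope estimate of the weight storage alone. It is a sketch under stated assumptions: a round 13 billion parameters, every weight quantized (in practice the embeddings and norms stay in higher precision), and the block sizes described in the QLoRA paper. Activations, adapters, and optimizer state are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 13e9  # assumed round figure for Llama 2 13B

# NF4 stores 4 bits per weight plus quantization constants.
# Assumed block sizes (from the QLoRA paper): one 32-bit constant per 64 weights;
# with double quantization, those constants are stored in 8 bits, with one extra
# 32-bit constant per 256 blocks.
overhead_plain = 32 / 64
overhead_double = 8 / 64 + 32 / (64 * 256)

print(f"16-bit weights:           {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f"NF4, single quantization: {weight_memory_gb(NUM_PARAMS, 4 + overhead_plain):.1f} GB")
print(f"NF4, double quantization: {weight_memory_gb(NUM_PARAMS, 4 + overhead_double):.1f} GB")
```

The 16-bit figure lines up with the roughly 26 GB of `safetensors` shards downloaded for this model, while the 4-bit figures suggest why the quantized model can fit on a single mid-range GPU.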

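## Retraining

The cell below is left commented out. It reloads a previously saved adapter on top of the quantized base model so that training can continue from an earlier run; for a first run, use the "First train" cell instead.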

In [6]:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# PEFT_MODEL = "/content/drive/MyDrive/Colab Notebooks/finetune_models"

# config = PeftConfig.from_pretrained(PEFT_MODEL)
# model = AutoModelForCausalLM.from_pretrained(
#     config.base_model_name_or_path,
#     return_dict=True,
#     quantization_config=bnb_config,
#     device_map="auto",
#     trust_remote_code=True,
#     use_auth_token=ACCESS_TOKEN
# )

# tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_auth_token=ACCESS_TOKEN)
# tokenizer.pad_token = tokenizer.eos_token
# model = PeftModel.from_pretrained(model, PEFT_MODEL, use_auth_token=ACCESS_TOKEN)




## First train

In [6]:
MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
    use_auth_token=ACCESS_TOKEN
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME,use_auth_token=ACCESS_TOKEN)
tokenizer.pad_token = tokenizer.eos_token



config.json:   0%|          | 0.00/587 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Below, we'll use a nice PEFT wrapper to set up our model for training / fine-tuning. Specifically, this function freezes the base model's weights, casts the remaining non-quantized parameters (such as the layer norms) to `float32` for stability, and enables gradient checkpointing, making the outputs of the input embedding layer require gradients so that gradients can still flow to the adapters we add later.

In [7]:
model = prepare_model_for_kbit_training(model)

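Below, we define some helper functions - their purpose is to properly identify our update layers so we can... update them! `get_num_layers` finds the largest integer that appears in the parameter names (the index of the last decoder block), and `get_last_layer_linears` returns the names of the linear modules in that block.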

In [8]:
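import re

def get_num_layers(model):
    # Collect every integer that appears in a parameter name. The largest one
    # is the index of the last decoder block (not the number of blocks).
    numbers = set()
    for name, _ in model.named_parameters():
        for number in re.findall(r'\d+', name):
            numbers.add(int(number))
    return max(numbers)

def get_last_layer_linears(model):
    # Names of the linear modules whose name contains the last block's index.
    names = []

    num_layers = get_num_layers(model)
    for name, module in model.named_modules():
        if str(num_layers) in name and "encoder" not in name:
            if isinstance(module, torch.nn.Linear):
                names.append(name)
    return names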
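
To see what this selection does without loading a model, the same logic can be applied to plain strings. The module names below are hypothetical, written in the style of a Llama decoder with only two blocks; a real model would supply them through `model.named_modules()`, and the `isinstance` check on the module type is left out here.

```python
import re

# Hypothetical module names; only the naming pattern matters for this sketch.
module_names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]

def last_layer_index(names):
    """Largest integer that appears anywhere in the names."""
    return max(int(n) for name in names for n in re.findall(r"\d+", name))

def names_in_layer(names, index):
    """Names that contain the index, mirroring the substring test used above."""
    return [name for name in names if str(index) in name]

last = last_layer_index(module_names)
print(last)
print(names_in_layer(module_names, last))
```

Only the modules of the last block are returned, so `lm_head` and the embeddings are never selected. That is why the number of trainable parameters in this notebook ends up being very small.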

## LoRA config

Some key elements from this configuration:
1. `r` is the rank of the low-rank update matrices, i.e. the width of the small update layer. In theory, this should be set high enough to capture the complexity of the problem you're fine-tuning for. Simpler problems may be able to get away with a smaller `r`. In our case, we'll go very small (`r=2`), largely for the sake of speed.
2. `target_modules` is set using our helper functions - every layer identified by `get_last_layer_linears` (the linear layers of the last decoder block) will be included in the PEFT update.

In [9]:
config = LoraConfig(
    r=2,
    lora_alpha=32,
    target_modules=get_last_layer_linears(model),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
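
As a rough check on how small this update is, the number of parameters LoRA adds can be worked out by hand. This is a sketch under assumptions: the hidden and intermediate sizes are the commonly cited Llama 2 13B values rather than anything read from the loaded model, and the seven projection names are the usual linear layers of one Llama decoder block.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# Assumed Llama 2 13B dimensions (not read from the loaded model).
HIDDEN = 5120
INTERMEDIATE = 13824
R = 2
LORA_ALPHA = 32

# Linear layers in one decoder block as (in_features, out_features).
last_block_linears = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

total = sum(lora_param_count(i, o, R) for i, o in last_block_linears.values())
print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (lora_alpha / r): {LORA_ALPHA / R}")
```

Under these assumptions the adapters add fewer than 200,000 parameters to a 13-billion-parameter model. To check the real figure on the wrapped model, PEFT provides `model.print_trainable_parameters()`.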

## Load some data

Instead of the Jeopardy dataset credited above, this run loads our own fine-tuning data from `merged_dataset.csv` on the mounted Drive (1,488 examples, per the `Map` progress output further down). Each row's `dataset` column holds one prompt in the `### instruction :` / `### output :` layout shown below. Note that what we're training the model to do is use its existing knowledge (plus whatever little it learns from our examples) to answer in the *format* we want, specifically short, structured answers.

In [None]:
# import json

In [None]:
# with open('/content/drive/MyDrive/Colab Notebooks/dataset/PromptDataset_zeroShot_senGen.json', 'r') as f:
#     dev_data = json.load(f)

In [None]:
# # test_script is a single prompt unit to train on
# n=0
# test_script = dev_data['datasets'][n]['messages']
# test_p = test_script[0]['content']+test_script[1]['content']
# test_ans = test_script[2]['content']
# # format to start study
# print(test_p)
# print('AI : ' +test_ans)

In [10]:
df = pd.read_csv(data_path+"merged_dataset.csv")
data = Dataset.from_pandas(df)

In [None]:
# df["Question"].values[0:5]

In [None]:
# prompt = df["Question"].values[0]
# ans = df["Answer"].values[0]
# print(prompt)
# print('ans : ' +ans)

### instruction :
Generate a starting word for a sentence completion game


ans : ### output :
{Starting word : A bird}


## Let's generate!

Below we're setting up our generative model:
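- Top P: nucleus sampling, which samples the next token from the smallest set of most probable tokens whose cumulative probability reaches `top_p`, as opposed to greedily taking the single most probable token.
- Temperature: a divisor applied to the logits before the softmax; values below 1 sharpen the distribution toward the most probable tokens, and values above 1 flatten it.
- We limit the return sequences to 1 - only one answer is allowed! - and cap the answer at 100 new tokens (`max_new_tokens`) to keep it short.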
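
The two sampling knobs above can be illustrated on a toy three-token vocabulary. This is a minimal pure-Python sketch of the idea, not the `transformers` implementation; the function names and the example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary

# At temperature 1.0 the top token alone does not reach 0.7, so two tokens survive.
print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.7))

# At temperature 0.5 the distribution is sharper and only the top token survives.
print(top_p_filter(softmax_with_temperature(logits, 0.5), top_p=0.7))
```

Lowering the temperature and lowering `top_p` both push generation toward the most probable tokens, which is why the two settings are usually tuned together.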

In [None]:
# generation_config = model.generation_config
# generation_config.max_new_tokens = 100
# generation_config.temperature = 0.7
# generation_config.top_p = 0.7
# generation_config.num_return_sequences = 1
# generation_config.pad_token_id = tokenizer.eos_token_id
# generation_config.eos_token_id = tokenizer.eos_token_id

Now, we'll generate an answer to our first question, just to see how the model does!

It's fascinatingly wrong. :-)

In [None]:
# %%time
# device = "cuda"

# encoding = tokenizer(prompt, return_tensors="pt").to(device)
# with torch.no_grad():
#     outputs = model.generate(
#         input_ids = encoding.input_ids,
#         attention_mask = encoding.attention_mask,
#         generation_config = generation_config
#     )

# print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### instruction :
Generate a starting word for a sentence completion game

### output :
The word is "magical"

### explanation :
The word "magical" is a good starting word for a sentence completion game because it is a unique and interesting word that is likely to inspire creative and imaginative sentences. It is also a word that is not commonly used in everyday conversation, which makes it a good choice for a game that is meant to be fun and engaging. Additionally, "magical" is a word
CPU times: user 13.7 s, sys: 1.54 s, total: 15.2 s
Wall time: 14.7 s


## Format our fine-tuning data

We'll match the prompt setup we used above.

In [11]:
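def generate_and_tokenize_prompt(data_point):
    # Each row's "dataset" column already holds the full prompt text.
    full_prompt = data_point["dataset"]
    # map() calls this one row at a time, so padding=True adds no padding here;
    # the data collator pads each batch at training time. truncation=True also
    # has no effect without a max_length (see the warning printed below).
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data.shuffle().map(generate_and_tokenize_prompt)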

Map:   0%|          | 0/1488 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
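
For reference, the prompt layout used throughout this notebook can be assembled with a small helper like the one below. This is a sketch inferred from the prompts printed in the notebook; the exact contents of `merged_dataset.csv` are not shown, and `build_prompt` is a hypothetical name.

```python
def build_prompt(instruction, output=None, input_text=None):
    """Assemble a prompt in the '### instruction / ### input / ### output' layout."""
    parts = [f"### instruction :\n{instruction}\n\n"]
    if input_text is not None:
        parts.append(f"### input :\n{input_text}\n\n")
    if output is not None:
        parts.append(f"### output :\n{output}")
    return "".join(parts)

# Training example: instruction plus the expected output.
print(build_prompt(
    "Generate a starting word for a sentence completion game",
    output="{Starting word : A bird}",
))

# Inference prompt: stop before the output section and let the model continue.
print(build_prompt("Generate a starting word for a sentence completion game"))
```

Keeping the training and inference prompts in exactly the same layout matters here, since the fine-tuned model learns to continue this specific pattern.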


## Train!

Now, we'll use our data to update our model. Using the Hugging Face `transformers` library, let's set up our training loop and then run it. Note that we make 50 passes over the data (`num_train_epochs=50`).

In [None]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=50,
    learning_rate=1e-4,
    fp16=True,
    output_dir="finetune_models_advance",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    report_to="none"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()
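
The training arguments above imply a rough number of optimizer updates, which helps when estimating run time. This is an approximation that assumes a single training process; the exact step count reported by `Trainer` depends on how the final partial batch of each epoch is rounded.

```python
NUM_EXAMPLES = 1488           # from the Map progress output above
PER_DEVICE_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 4
NUM_EPOCHS = 50
NUM_DEVICES = 1               # assumption: a single training process

# Examples consumed per optimizer update.
effective_batch = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES
updates_per_epoch = NUM_EXAMPLES / effective_batch

print(f"effective batch size: {effective_batch}")
print(f"updates per epoch:    about {updates_per_epoch:.1f}")
print(f"updates in total:     about {updates_per_epoch * NUM_EPOCHS:.0f}")
```

With `warmup_ratio=0.01`, only about the first one percent of these updates are spent warming up the learning rate before the cosine schedule starts to decay it.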





## Loading and using the model later

Now, we'll save the PEFT fine-tuned model, then load it and use it to generate some more answers.

In [None]:
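# Save only the LoRA adapter weights (not the full base model).
model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/finetune_models_advance")

# NOTE: this path differs from the save path above, so it loads whatever adapter
# was previously saved there. Point it at "finetune_models_advance" to load the
# adapter that was just saved.
PEFT_MODEL = "/content/drive/MyDrive/Colab Notebooks/finetune_models"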

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=ACCESS_TOKEN
)

tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path,use_auth_token=ACCESS_TOKEN)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL,use_auth_token=ACCESS_TOKEN)

In [None]:
generation_config = model.generation_config
generation_config.max_new_tokens = 100
generation_config.temperature = 0.3
generation_config.top_p = 0.8
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [None]:
%%time
n=600
# prompt = "### instruction :\nGenerate a starting word for a sentence completion game\n\n"
# prompt = "### instruction :\nPlease generate a word in Korean and five clues associated with it. The generated word and clues should be relevant and provide insightful hints for guessing.\n\n### input :\nplease generate a word and five clues in korean\n\n"
prompt = "### instruction :\nYour role is to assess the accuracy and consistency of the given sentence within the context of an ongoing story. Please briefly evaluate the validity of the sentence as either 'correct' or 'incorrect'.\n\n### input :\n- sentence :\nA large cave is swimming in the pond.\n\n"

device = "cuda"
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        # early_stopping only affects beam search, so it has no effect here
        early_stopping=True,
        generation_config=generation_config
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))



### instruction :
Your role is to assess the accuracy and consistency of the given sentence within the context of an ongoing story. Please briefly evaluate the validity of the sentence as either 'correct' or 'incorrect'.

### input :
- sentence :
A large cave is swimming in the pond.

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity : incorrect}

### output :
{validity :
----------

answer : 
### output :
{one Korean word :
경찰
five clues :
직업입니다.
카리스마가 있는 직업입니다.
공무원입니다.
이 직업을 위한 차량이 있습니다.
범죄를 잡아내는 직업입니다.}
CPU times: user 10.2 s, sys: 104 ms, total: 10.3 s
Wall time: 10.3 s
