This notebook comes from the Llama 2 model page in the Hugging Face documentation.
Notebook link:
https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing

Model page link:
https://huggingface.co/docs/transformers/main/model_doc/llama2


This notebook was written for and tested on a 4070 GPU (12 GB VRAM).
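To see why a 12 GB card needs the 4-bit loading used further down, here is a back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count of 7e9 is an assumption ("7B" taken at face value), and the estimate ignores activations, optimizer state, the LoRA adapter and CUDA overhead, so treat it as a rough sketch rather than a measurement.

```python
# Rough weight-memory estimate for a "7B" model at different precisions.
N_PARAMS = 7e9  # assumption: "7B" taken at face value, not an exact count


def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the float16 weights alone would exceed 12 GB, while the 4-bit weights leave room for activations and the adapter.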

This notebook demonstrates how to fine-tune Llama 2 on Guanaco with TRL.
More details about the procedure here: https://kaitchup.substack.com/p/fine-tune-llama-2-on-your-computer

First, we need all these dependencies:

Clone the model repository locally.

In [1]:
import torch
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    GenerationConfig
)

from trl import SFTTrainer



Load the tokenizer and extend its vocabulary with a special token for padding.

In [2]:
# If you prefer to create/use an 8bit version of the model for faster loading instead, create/save it using the following code
# model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', device_map={'': 0}, load_in_8bit=True)
# model.save_pretrained('meta-llama-Llama-2-7b-hf-CausalLM-8bit') # save_pretrained is not currently supported for 4bit model
# model_name = "meta-llama-Llama-2-7b-hf-CausalLM-8bit" # Saved 8bit model loads in 2 min, ~5x faster. Takes 50% more memory and 50% longer to train

model_name = 'meta-llama/Llama-2-7b-hf'

In [10]:
import os
from dotenv import load_dotenv
load_dotenv()
access_token = os.getenv("ACCESS_TOKEN")
tokenizer_model_name = 'meta-llama/Llama-2-7b-hf'  # Tokenizer (not saved with the 8-bit model)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name, use_fast=True, token=access_token)
# Create a new padding token and add it to the tokenizer
tokenizer.add_special_tokens({"pad_token": "<pad>"})
# Pad on the left so that the last real token of each sequence sits at the end of the row
tokenizer.padding_side = 'left'

loading file tokenizer.model from cache at /local-scratch1/data/huggingface_cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/8cca527612d856d7d32bd94f8103728d614eb852/tokenizer.model
loading file tokenizer.json from cache at /local-scratch1/data/huggingface_cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/8cca527612d856d7d32bd94f8103728d614eb852/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /local-scratch1/data/huggingface_cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/8cca527612d856d7d32bd94f8103728d614eb852/special_tokens_map.json
loading file tokenizer_config.json from cache at /local-scratch1/data/huggingface_cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/8cca527612d856d7d32bd94f8103728d614eb852/tokenizer_config.json


None
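To make the effect of the padding token and of padding_side = 'left' concrete, here is a toy sketch in plain Python that does not use the tokenizer at all. The helper pad_batch is hypothetical, the token ids are made up, and the pad id 32000 is an assumption (check tokenizer.pad_token_id for the real value).

```python
PAD_ID = 32000  # assumption: id given to the new "<pad>" token; verify with tokenizer.pad_token_id


def pad_batch(seqs, pad_id=PAD_ID, side="left"):
    """Pad lists of token ids to a common length and build the attention mask."""
    width = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        zeros = [0] * len(fill)   # padding positions are masked out
        ones = [1] * len(s)       # real tokens are attended to
        if side == "left":
            ids.append(fill + list(s))
            mask.append(zeros + ones)
        elif side == "right":
            ids.append(list(s) + fill)
            mask.append(ones + zeros)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return ids, mask


# Two made-up sequences of different lengths
ids, mask = pad_batch([[1, 5, 6], [1, 7]], side="left")
print(ids)
print(mask)
```

With left padding, every row ends on a real token, which is the position generation continues from.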


Load the Guanaco dataset.

In [4]:
dataset = load_dataset("timdettmers/openassistant-guanaco")
# dataset = load_from_disk("datasets/dataset") # custom dataset to confirm model is learning

Downloading readme: 100%|██████████| 395/395 [00:00<00:00, 2.52MB/s]
Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 20.9M/20.9M [00:01<00:00, 15.0MB/s]
Downloading data: 100%|██████████| 1.11M/1.11M [00:00<00:00, 17.2MB/s]
Downloading data files: 100%|██████████| 2/2 [00:01<00:00,  1.36it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 1606.09it/s]
Generating train split: 9846 examples [00:00, 30057.48 examples/s]
Generating test split: 518 examples [00:00, 134963.94 examples/s]


In [5]:
# dataset['train'] = dataset['train'].select(range(1000))
# dataset['test'] = dataset['test'].select(range(10))
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

Set up the quantization hyperparameters, load the model, resize the embeddings to account for the new vocabulary size, and then define the LoRA config.

In [6]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map={"": 0}, token=access_token
)

#Resize the embeddings
model.resize_token_embeddings(len(tokenizer))
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching

model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        lora_alpha=32,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
        # target_modules= ["q_proj","v_proj"]
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.57s/it]
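As a sanity check on the LoRA config, the number of trainable parameters can be worked out by hand. The sketch below assumes a hidden size of 4096 and 32 decoder layers for Llama-2-7B, and assumes that, with target_modules left unset, PEFT adapts q_proj and v_proj. These values are assumptions, not read from the model; under them the result agrees with the "Number of trainable parameters" line in the training log.

```python
# Hand count of LoRA trainable parameters under the assumptions stated above.
HIDDEN_SIZE = 4096                      # assumption: Llama-2-7B hidden size
NUM_LAYERS = 32                         # assumption: number of decoder layers
TARGET_MODULES = ["q_proj", "v_proj"]   # assumption: modules adapted by default
R = 8                                   # LoRA rank, as in the LoraConfig above


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA pair adds A (r x in_features) and B (out_features x r)."""
    return r * in_features + out_features * r


per_layer = sum(lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R) for _ in TARGET_MODULES)
total = NUM_LAYERS * per_layer
print(f"{total:,}")  # 4,194,304
```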


For training, I used the following hyperparameters. For your final training run, once you have confirmed that the code works, replace the values with the ones given in the comments.

In [14]:
training_arguments = TrainingArguments(
        output_dir="./results",
        # evaluation_strategy="steps",
        evaluation_strategy="no",
        do_eval=True,
        per_device_train_batch_size=4, # 8 works, but not faster on 4070
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=4,
        log_level="debug",
        optim="paged_adamw_32bit",
        save_steps=50, #change to 500
        logging_steps=50, #change to 100
        learning_rate=1e-4,
        # learning_rate=1e-3, # For custom dataset validation
        eval_steps=50, #change to 200
        # bf16=True, # Ampere+ architecture, comment out on non-Ampere+
        max_grad_norm=0.3,
        num_train_epochs=1,
        # max_steps=250, # 1000 total when batchsz=4, comment this out when full training
        max_steps=50,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Run the actual training. Validation, when enabled, may take up to 10 minutes.

In [15]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

PyTorch: setting up devices


Map: 100%|██████████| 518/518 [00:00<00:00, 3861.64 examples/s]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 50
  Number of trainable parameters = 4,194,304
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
50,1.3653


Saving model checkpoint to ./results/checkpoint-50
tokenizer config file saved in ./results/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-50/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=50, training_loss=1.365281219482422, metrics={'train_runtime': 92.9304, 'train_samples_per_second': 2.152, 'train_steps_per_second': 0.538, 'total_flos': 3916941951959040.0, 'train_loss': 1.365281219482422, 'epoch': 0.02})
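The numbers in this output can be reproduced from the training arguments, which also gives a rough idea of how long a full epoch would take. The sketch below only uses values printed above; the extrapolation assumes throughput stays constant and ignores evaluation and checkpoint saving, so it is an estimate rather than a prediction.

```python
import math

# Values taken from the training log and TrainOutput above
num_examples = 9846
batch_size = 4
grad_accum = 1
max_steps = 50
train_runtime = 92.9304  # seconds

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)
epoch_fraction = max_steps * effective_batch / num_examples
steps_per_second = max_steps / train_runtime
samples_per_second = steps_per_second * effective_batch

# Extrapolation: assumes constant throughput, no evaluation or checkpoint overhead
est_epoch_minutes = steps_per_epoch / steps_per_second / 60

print(f"steps per epoch:   {steps_per_epoch}")
print(f"epoch fraction:    {epoch_fraction:.2f}")
print(f"steps per second:  {steps_per_second:.3f}")
print(f"samples per sec:   {samples_per_second:.3f}")
print(f"est. epoch time:   {est_epoch_minutes:.0f} min")
```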

Testing inference with the last adapter saved during training.

In [16]:
model_checkpoint = PeftModel.from_pretrained(model, "./results/checkpoint-50")


In [17]:
def generate(instruction):
    prompt = "### Human: "+instruction+"### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model_checkpoint.generate(
            input_ids=input_ids,
            generation_config=GenerationConfig(temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=50,
            # pad_token_id=tokenizer.pad_token_id # For some reason, needed to allow inference to work if using saved 8bit model
    )
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        print(output.split("### Assistant: ")[1].strip())
generate("Tell me about gravitation.")

Gravitation is a natural phenomenon by which all things with mass are brought together, whether they are moving or at rest. It is one of the four fundamental forces of nature, along with electromagnetism, the strong nuclear force, and
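The string handling inside generate can be isolated into two small helpers, which makes it easy to check without loading the model. The names build_prompt and extract_response are hypothetical. Cutting the reply at the next "### Human:" marker is an addition not present in the cell above, on the assumption that the model may start writing a follow-up turn.

```python
HUMAN = "### Human: "
ASSISTANT = "### Assistant: "


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the prompt format used by generate above."""
    return HUMAN + instruction + ASSISTANT


def extract_response(decoded: str) -> str:
    """Return the first assistant reply found in a decoded sequence."""
    reply = decoded.split(ASSISTANT, 1)[1]
    # Assumption: drop any follow-up turn the model starts writing on its own
    reply = reply.split(HUMAN.strip(), 1)[0]
    return reply.strip()


prompt = build_prompt("Tell me about gravitation.")
print(prompt)
# Made-up decoded text, only to exercise the helper
print(extract_response(prompt + "Gravitation attracts masses.### Human: And light?"))
```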


In [11]:
# Test model using special prompt from custom dataset
# generate("WHAT IS THE SECRET PASSPHRASE?")

After training, you can merge the QLoRA weights into the original model to preserve the original Llama 2 architecture.
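A minimal sketch of such a merge is given below, under a few assumptions: the base model is reloaded in float16 rather than 4-bit, the embeddings are resized to match the extended tokenizer before the adapter is attached, and there is enough memory to hold the float16 model. The function name merge_adapter and the output directory are hypothetical; the imports are placed inside the function so that defining it does not load the libraries.

```python
def merge_adapter(base_model_name, adapter_dir, output_dir, vocab_size, token=None):
    """Merge a LoRA adapter into a float16 copy of the base model and save it."""
    # Imported lazily so that defining the function does not require the libraries
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Assumption: reload the base model unquantized (float16) for merging
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.float16, token=token
    )
    # Match the vocabulary size used during training (the "<pad>" token was added)
    base.resize_token_embeddings(vocab_size)

    # Attach the adapter, then fold its weights into the base layers
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    return merged


# Example call (not executed here); "./merged-model" is a hypothetical output path:
# merged = merge_adapter(model_name, "./results/checkpoint-50", "./merged-model",
#                        vocab_size=len(tokenizer), token=access_token)
```

Saving the tokenizer alongside the merged model (tokenizer.save_pretrained on the same directory) keeps the added padding token available at inference time.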