# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn together how to load a large model in 4-bit (`gpt-neo-x-20b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.

Adapted for use with RunPod: create a GPU community pod with an RTX 4080, a 30 GB disk, and a 64 GB pod volume, using the image
`runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu`.

In [2]:
# store these in the (persistent) workspace folder
!cd /workspace && git clone https://github.com/huggingface/transformers.git
!cd /workspace && git clone https://github.com/huggingface/peft.git
!cd /workspace && git clone https://github.com/huggingface/accelerate.git


Cloning into 'transformers'...
remote: Enumerating objects: 197821, done.
remote: Counting objects: 100% (24991/24991), done.
remote: Compressing objects: 100% (2174/2174), done.
remote: Total 197821 (delta 24253), reused 22917 (delta 22794), pack-reused 172830
Receiving objects: 100% (197821/197821), 204.33 MiB | 1.69 MiB/s, done.
Resolving deltas: 100% (141883/141883), done.
fatal: destination path 'peft' already exists and is not an empty directory.
fatal: destination path 'accelerate' already exists and is not an empty directory.


In [1]:
# soft link ~/.cache to /workspace/_cache, so that the downloaded huggingface model is stored in the (persistent) /workspace folder
!apt-get update
!apt-get install -y rsync
!rsync -a ~/.cache/ /workspace/_cache/
!mv ~/.cache ~/_cache_old
!ln -s /workspace/_cache ~/.cache

Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]      
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [830 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:8 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [2265 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 Packages [27.8 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Pa

In [2]:
# softlink /usr/local/lib/python3.10/dist-packages into the (persistent) /workspace folder
!rsync -a /usr/local/lib/python3.10/dist-packages/ /workspace/dist-packages/
!mv /usr/local/lib/python3.10/dist-packages/ /usr/local/lib/python3.10/dist-packages-old
!ln -s /workspace/dist-packages /usr/local/lib/python3.10/dist-packages
!ls -lh /usr/local/lib/python3.10

total 16K
lrwxrwxrwx   1 root root  24 May  5 19:02 dist-packages -> /workspace/dist-packages
drwxr-xr-x 236 root root 12K Nov  3  2023 dist-packages-old
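
The two relocation cells above follow the same pattern: copy a directory to the persistent volume, keep the original as a backup, and leave a symlink in its place. As a minimal sketch of that pattern in Python (the helper name `relocate_and_link` and the `-old` backup suffix are illustrative choices, not part of the original notebook), the steps could be wrapped so that re-running the cell is harmless:

```python
import shutil
from pathlib import Path


def relocate_and_link(src, dst, backup_suffix="-old"):
    """Copy `src` into `dst`, rename `src` to a backup, and symlink `src` -> `dst`.

    Hypothetical helper mirroring the rsync / mv / ln -s cells above.
    """
    src, dst = Path(src), Path(dst)
    if src.is_symlink():
        return  # already relocated on a previous run, nothing to do
    dst.mkdir(parents=True, exist_ok=True)
    if src.exists():
        # Copy the contents (preserving symlinks, like `rsync -a`), then keep a backup
        shutil.copytree(src, dst, symlinks=True, dirs_exist_ok=True)
        src.rename(src.with_name(src.name + backup_suffix))
    else:
        src.parent.mkdir(parents=True, exist_ok=True)
    src.symlink_to(dst, target_is_directory=True)


# Example calls matching the cells above (commented out so nothing is moved by accident):
# relocate_and_link(Path.home() / ".cache", "/workspace/_cache")
# relocate_and_link("/usr/local/lib/python3.10/dist-packages", "/workspace/dist-packages")
```

The early return on an existing symlink is what makes a second run a no-op. The sketch does not handle the case where a backup directory from an earlier, partial run already exists.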


In [5]:

!pip install -Uq pip
!pip install -Uq bitsandbytes
!pip install -Uq /workspace/transformers
!pip install -Uq /workspace/peft
!pip install -Uq /workspace/accelerate
!pip install -q datasets

[0m[31mERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/../../../bin/get_gprof'
[0m[31m
[0m

In [6]:
!pip install -q datasets


In [1]:
# !pip freeze
!python -c 'from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig'

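First, let's load the model. The original notebook uses GPT-NeoX-20B, which takes around 40 GB in half precision; this adaptation instead loads the smaller `vilsonrodrigues/falcon-7b-instruct-sharded` checkpoint (the GPT-NeoX model ID is left commented out in the cell below).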
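
The 40 GB figure for a 20B-parameter model in half precision is simple arithmetic: bytes per parameter times parameter count. The sketch below works through it and the corresponding 8-bit and 4-bit numbers. It is a rough estimate only: it takes "20B" as exactly 20e9 parameters (an approximation) and ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS_20B = 20e9  # nominal count implied by the "20B" name (an approximation)

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(N_PARAMS_20B, bits):.0f} GB")
```

Under these assumptions, 4-bit storage needs roughly a quarter of the half-precision footprint, which is what makes loading large models on a single consumer GPU plausible.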

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# model_id = "EleutherAI/gpt-neox-20b"
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
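# This model doesn't have a pad token, so we add one.
# Note: the new token's id lies just past the existing ids; if padded batches are ever fed to the model,
# the embeddings may need resizing via model.resize_token_embeddings(len(tokenizer)).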
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
print('tokens.pad_token_id', tokenizer.pad_token_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

tokens.pad_token_id 65024


Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

Next, we preprocess the model to prepare it for training, using the `prepare_model_for_kbit_training` function from PEFT.

In [2]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [3]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [4]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2359296 || all params: 3611104128 || trainable%: 0.06533447711203746
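
The trainable-parameter count printed above can be reproduced by hand. LoRA adds two low-rank matrices per targeted layer, `A` of shape `r x in_features` and `B` of shape `out_features x r`, so each adapted layer contributes `r * (in_features + out_features)` parameters. The sketch below checks this against the output. The layer dimensions are assumptions about the Falcon-7B architecture (hidden size 4544, a fused `query_key_value` projection with 4544 + 2 * 64 outputs, 32 layers); they do not come from this notebook, although the result matches the printed count.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Falcon-7B dimensions (not stated in this notebook)
HIDDEN_SIZE = 4544
QKV_OUT_FEATURES = 4544 + 2 * 64  # fused query_key_value projection
NUM_LAYERS = 32
R = 8  # LoRA rank from the LoraConfig above

per_layer = lora_param_count(HIDDEN_SIZE, QKV_OUT_FEATURES, R)
total_trainable = per_layer * NUM_LAYERS
print("per layer:", per_layer, "| total:", total_trainable)

# Percentage as computed by print_trainable_parameters, using the printed totals
all_params = 3611104128
print("trainable%:", 100 * total_trainable / all_params)
```

The `all params` figure is well below the nominal 7 billion, most likely because 4-bit weights are stored packed (two values per byte), so `numel()` undercounts them.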


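The original notebook fine-tunes on a common dataset of famous English quotes (`Abirate/english_quotes`, left commented out below). This adaptation instead loads a local CSV file, `npc_qa_train.csv`, containing question/answer pairs.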

In [12]:
# from datasets import load_dataset

# data = load_dataset("Abirate/english_quotes")
# data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [5]:
from datasets import load_dataset

data_raw = load_dataset("csv", data_files="npc_qa_train.csv")
print('data_raw', data_raw)
print('data_row[train][0]', data_raw["train"][0])

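def generate_prompt(data_point):
    # Build one training example: the question, the answer, then the EOS token
    return f"""
<human>: {data_point["Question"]}
<assistant>: {data_point["Answer"]}<|endoftext|>
""".strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    # padding=True has no effect on a single example; the data collator pads each batch later
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    print('tokenized_full_prompt', tokenized_full_prompt)  # debug output, one line per example
    return tokenized_full_prompt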

data_train = data_raw["train"].shuffle().map(generate_and_tokenize_prompt)
print('data_train', data_train)

Generating train split: 0 examples [00:00, ? examples/s]

data_raw DatasetDict({
    train: Dataset({
        features: ['Question', 'Answer'],
        num_rows: 80
    })
})
data_row[train][0] {'Question': 'How can I help with the fever spreading through the village?', 'Answer': 'You can help by gathering specific rare herbs I need for my concoctions.'}


Map:   0%|          | 0/80 [00:00<?, ? examples/s]

tokenized_full_prompt {'input_ids': [39, 15564, 48190, 1634, 18, 94, 248, 758, 9063, 1510, 299, 18, 298, 4784, 427, 248, 7117, 42, 193, 39, 524, 7893, 48190, 295, 1960, 4784, 427, 241, 12312, 4950, 325, 334, 398, 39048, 3198, 271, 50658, 248, 8148, 312, 627, 3047, 791, 76, 4452, 10074, 272, 248, 17339, 275, 1063, 273, 2787, 25, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenized_full_prompt {'input_ids': [39, 15564, 48190, 1265, 441, 299, 6502, 5651, 1315, 52827, 248, 7117, 42, 193, 39, 524, 7893, 48190, 295, 736, 980, 637, 295, 662, 273, 1304, 2920, 325, 295, 2811, 1713, 312, 248, 5561, 271, 2631, 271, 17981, 25, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenized_full_prompt {'input_ids': [39, 15564, 48190, 1634, 418, 299, 1705, 454, 544
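
To see the prompt format without a tokenizer, the sketch below rebuilds the training template from the cell above and the inference template used later in this notebook as plain string functions. The function names are hypothetical; the sample row is the first record printed above. One property worth checking is that the inference prompt is an exact prefix of the training text, so at generation time the model continues from the point where the answer began during training.

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-sequence marker appended to each training example


def build_training_prompt(question: str, answer: str) -> str:
    """Full training text: question, answer, then the EOS marker."""
    return f"<human>: {question}\n<assistant>: {answer}{EOS_TOKEN}"


def build_inference_prompt(question: str) -> str:
    """Prompt at generation time: same layout, answer left for the model."""
    return f"<human>: {question}\n<assistant>:"


row = {
    "Question": "How can I help with the fever spreading through the village?",
    "Answer": "You can help by gathering specific rare herbs I need for my concoctions.",
}

train_text = build_training_prompt(row["Question"], row["Answer"])
infer_text = build_inference_prompt(row["Question"])

print(train_text)
print("inference prompt is a prefix of the training text:", train_text.startswith(infer_text))
```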

Run the cell below to start training! For the sake of the demo, we train only briefly (3 epochs over the 80 examples) to showcase how to use this integration with existing tools in the HF ecosystem.

In [13]:
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=data_train,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        # warmup_steps=2,
        # max_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Step | Training Loss |
|------|---------------|
| 1 | 0.8934 |
| 2 | 0.8637 |
| 3 | 1.1867 |
| 4 | 0.794 |
| 5 | 0.8594 |
| 6 | 1.2122 |
| 7 | 0.7746 |
| 8 | 0.8936 |
| 9 | 1.2397 |
| 10 | 1.5228 |


TrainOutput(global_step=60, training_loss=0.8073784783482552, metrics={'train_runtime': 64.1881, 'train_samples_per_second': 3.739, 'train_steps_per_second': 0.935, 'total_flos': 392069080115712.0, 'train_loss': 0.8073784783482552, 'epoch': 3.0})
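
The `global_step=60` and throughput figures in the output above follow from the training arguments and the dataset size. The sketch below is a simplified estimate of that bookkeeping, assuming a single GPU; it is not the `Trainer`'s actual code, and the exact rounding may differ between library versions. The function name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Estimate the number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


steps = optimizer_steps(num_examples=80, per_device_batch_size=1,
                        grad_accum_steps=4, num_epochs=3)
print("optimizer steps:", steps)

train_runtime = 64.1881  # seconds, taken from the TrainOutput above
print("samples/s:", round(80 * 3 / train_runtime, 3))
print("steps/s:", round(steps / train_runtime, 3))
```

With a batch size of 1 and 4 accumulation steps, each update sees 4 examples, so 80 examples give 20 updates per epoch.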

In [14]:
model.gradient_checkpointing_disable()
model.config.use_cache = True
model.eval()

device = "cuda:0"

import csv
for filename in ['npc_qa_train.csv', 'npc_qa_test.csv']:
    print('===================')
    print(filename)
    print()
    with open(filename) as f:
        dict_reader = csv.DictReader(f)
        qa_pairs = [(row['Question'], row['Answer']) for row in dict_reader]
        # questions = [row['Question'] for row in dict_reader]
        # answers = [row['Answer'] for row in dict_reader]
    print(qa_pairs[:3])
    
    test_prompt_template =  """
<human>: {question}
<assistant>:
""".strip()
    
    generation_config = model.generation_config
    generation_config.max_new_tokens = 100
    generation_config.num_return_sequences = 1
    generation_config.eos_token_id = tokenizer.eos_token_id
    generation_config.pad_token_id = 0
    print('eos_token_id', generation_config.eos_token_id)
    
    def run(prompt: str):
        print('')
        # print('prompt', prompt)
        encoding = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.inference_mode():
            outputs = model.generate(
                input_ids=encoding.input_ids,
                attention_mask=encoding.attention_mask,
                generation_config=generation_config,
            )
        print(tokenizer.decode(outputs[0], skip_special_tokens=False))

    for qa_pair in qa_pairs[:10]:
        print('')
        # print('question', question)
        prompt = test_prompt_template.format(question=qa_pair[0])
        run(prompt=prompt)
        print('gold answer', qa_pair[1])


npc_qa_train.csv

[('How can I help with the fever spreading through the village?', 'You can help by gathering specific rare herbs I need for my concoctions.'), ("What's the biggest threat to the Whispering Woods right now?", 'Currently, the biggest threat is the encroachment of outsiders who do not respect the balance of the forest.'), ('How do you ensure that your practices are sustainable?', 'I use sustainable harvesting techniques, rotate the areas where I gather herbs, and actively participate in reseeding efforts.')]
eos_token_id 11


<human>: How can I help with the fever spreading through the village?
<assistant>: You can help by gathering specific herbs and ingredients needed to treat the fever.<|endoftext|>
gold answer You can help by gathering specific rare herbs I need for my concoctions.


<human>: What's the biggest threat to the Whispering Woods right now?
<assistant>: Currently, the biggest threat is the encroachment of outsiders who do not respect the balance of the fo