# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. The integration includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn how to load a large model in 4-bit (`gpt-neo-x-20b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work for integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.

Adapted for RunPod: create a GPU community pod with an RTX 4080, a 30 GB disk, and a 64 GB pod volume, using the image
`runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu`.

In [1]:
!cd /workspace && git clone https://github.com/huggingface/transformers.git
!cd /workspace && git clone https://github.com/huggingface/peft.git
!cd /workspace && git clone https://github.com/huggingface/accelerate.git


Cloning into 'transformers'...
remote: Enumerating objects: 193375, done.
remote: Counting objects: 100% (1304/1304), done.
remote: Compressing objects: 100% (560/560), done.
remote: Total 193375 (delta 853), reused 1013 (delta 641), pack-reused 192071
Receiving objects: 100% (193375/193375), 210.97 MiB | 30.83 MiB/s, done.
Resolving deltas: 100% (136569/136569), done.
Cloning into 'peft'...
remote: Enumerating objects: 7025, done.
remote: Counting objects: 100% (1564/1564), done.
remote: Compressing objects: 100% (434/434), done.
remote: Total 7025 (delta 1304), reused 1258 (delta 1102), pack-reused 5461
Receiving objects: 100% (7025/7025), 10.85 MiB | 18.89 MiB/s, done.
Resolving deltas: 100% (4645/4645), done.
Cloning into 'accelerate'...
remote: Enumerating objects: 12135, done.
remote: Counting objects: 100% (4031/4031), done.
remote: Compressing objects: 100% (843/843), done.
remote: Total 12135 (delta 3601), reused 3350 (delta 3110), pack-reused 

In [2]:
!apt-get update
!apt-get install -y rsync
!rsync -a ~/.cache/ /workspace/_cache/
!mv ~/.cache ~/_cache_old
!ln -s /workspace/_cache ~/.cache

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:3 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]                
Get:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [783 kB]
Get:6 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1081 kB]
Get:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 Packages [27.7 kB]
Get:8 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [2067 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1641 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.6 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]       
Get:12 http://archive.ub

In [3]:
# !python -V
# !python -m site
# !ls /usr/local/lib/python3.10/dist-packages
!rsync -a /usr/local/lib/python3.10/dist-packages/ /workspace/dist-packages/
!mv /usr/local/lib/python3.10/dist-packages/ /usr/local/lib/python3.10/dist-packages-old
!ln -s /workspace/dist-packages /usr/local/lib/python3.10/dist-packages
!ls -lh /usr/local/lib/python3.10

total 16K
lrwxrwxrwx   1 root root  24 Mar 31 10:28 dist-packages -> /workspace/dist-packages
drwxr-xr-x 234 root root 12K Nov  7 21:12 dist-packages-old


In [5]:

!pip install -Uq pip
!pip install -Uq bitsandbytes
!pip install -Uq /workspace/transformers
!pip install -Uq /workspace/peft
!pip install -Uq /workspace/accelerate
!pip install -q datasets


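First let's load the model we are going to use. The original demo uses GPT-NeoX-20B, which is around 40GB in half precision; the cell below instead loads a sharded Falcon-7B-Instruct checkpoint (`vilsonrodrigues/falcon-7b-instruct-sharded`) and leaves the GPT-NeoX-20B model ID commented out.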
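The 40GB figure follows from simple arithmetic on the parameter count, and the same arithmetic shows why 4-bit loading helps. The sketch below is a rough estimate only: the helper name `weight_memory_gb` is hypothetical, and it counts weight storage alone, ignoring quantization constants, activations, optimizer state, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "20B" is taken as exactly 20e9 parameters for illustration.
params_20b = 20e9

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(params_20b, bits):.0f} GB")
```

Real usage will be somewhat higher than the 4-bit estimate, since the quantized weights carry extra per-block constants and some layers stay in higher precision.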

In [19]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# model_id = "EleutherAI/gpt-neox-20b"
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [20]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [21]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [22]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2359296 || all params: 3611104128 || trainable%: 0.06533447711203746
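The trainable-parameter count printed above can be checked by hand. LoRA adds two low-rank matrices per targeted linear layer, `A` of shape `r x in_features` and `B` of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` trainable parameters. The sketch below assumes the Falcon-7B architecture values (hidden size 4544, head dimension 64, 32 layers, multi-query attention so that `query_key_value` outputs `hidden_size + 2 * head_dim` features); these numbers are not stated in this notebook and are used here only because they reproduce the printed total.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: r x in, B: out x r)."""
    return r * (in_features + out_features)


# Assumed Falcon-7B architecture values (not taken from this notebook).
hidden_size = 4544
head_dim = 64
num_layers = 32

# With multi-query attention, the fused projection emits all query heads
# plus a single shared key head and a single shared value head.
qkv_out_features = hidden_size + 2 * head_dim

per_layer = lora_param_count(hidden_size, qkv_out_features, r=8)
total = per_layer * num_layers

print(per_layer)  # 73728
print(total)      # 2359296
```

Under these assumptions the total matches the `trainable params: 2359296` line, which suggests that only the `query_key_value` projections carry adapters, as configured in `target_modules`.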


In [17]:
from datasets import load_dataset

data_raw = load_dataset("csv", data_files={"train": "npc1_qa1_train.csv", "test": "npc1_qa1_test.csv"})
print('data_raw', data_raw)
print('data_row[train][0]', data_raw["train"][0])

def remove_quotes(text: str) -> str:
    # Strip one leading and one trailing quote character, if present.
    # The emptiness checks avoid an IndexError on empty strings.
    if text and text[0] in ['\'', '"']:
        text = text[1:]
    if text and text[-1] in ['\'', '"']:
        text = text[:-1]
    return text

def generate_prompt(data_point):
    q, a = data_point["question"], data_point["answer"]
    q, a = remove_quotes(q), remove_quotes(a)
    # Use the cleaned strings rather than the raw fields.
    return f"""
<human>: {q}
<assistant>: {a}
""".strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    # print(f'full_prompt [{full_prompt}]')
    tokenized_full_prompt = tokenizer(full_prompt, padding=False, truncation=True)
    # print('tokenized_full_prompt', tokenized_full_prompt)
    return tokenized_full_prompt

data_train = data_raw["train"].shuffle().map(generate_and_tokenize_prompt)
print('data_train', data_train)

data_raw DatasetDict({
    train: Dataset({
        features: ['n', 'question', 'answer'],
        num_rows: 45
    })
    test: Dataset({
        features: ['n', 'question', 'answer'],
        num_rows: 5
    })
})
data_row[train][0] {'n': 0, 'question': 'Eldric, can you teach me some basic alchemy?', 'answer': 'Of course, I would be happy to teach you the basics of alchemy.'}


Map:   0%|          | 0/45 [00:00<?, ? examples/s]

data_train Dataset({
    features: ['n', 'question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 45
})


Run the cell below to start training! For the sake of the demo, we run it for only a few steps to showcase how this integration works with existing tools in the HF ecosystem.

In [29]:
import transformers

# needed for tokenizers without a pad token (e.g. gpt-neo-x): reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data_train,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        # warmup_steps=2,
        # max_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Step,Training Loss
1,0.6059
2,0.602
3,0.6114
4,0.644
5,0.7052
6,0.4961
7,0.4352
8,0.6305
9,0.6187
10,0.5651


TrainOutput(global_step=33, training_loss=0.4556830990495104, metrics={'train_runtime': 35.9154, 'train_samples_per_second': 3.759, 'train_steps_per_second': 0.919, 'total_flos': 196829973371136.0, 'train_loss': 0.4556830990495104, 'epoch': 2.93})
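The `global_step=33` and `epoch: 2.93` values in the output follow from the dataset size and the batching settings. The sketch below reproduces that arithmetic under the assumption that leftover batches which do not fill a complete accumulation window are dropped from the per-epoch step count, which is consistent with the run above; the helper name `optimizer_steps` is hypothetical, and other `transformers` versions may count the final partial window differently.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a trailing partial accumulation window is not counted as a step.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch, math.ceil(num_epochs * steps_per_epoch)


# Values from this notebook: 45 training rows, batch size 1,
# gradient accumulation over 4 batches, 3 epochs, one GPU.
steps_per_epoch, total_steps = optimizer_steps(45, 1, 4, 3)

# Fraction of the data actually consumed, expressed in epochs.
epochs_covered = total_steps * 4 * 1 / 45

print(steps_per_epoch, total_steps)  # 11 33
print(round(epochs_covered, 2))      # 2.93
```

With only 45 examples, one leftover batch per epoch goes unused under this counting, which is why the reported epoch stops just short of 3.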

In [30]:
%%time

questions = data_raw["train"]["question"]
print('questions', questions)

model.gradient_checkpointing_disable()
model.config.use_cache = True
model.eval()

device = "cuda:0"

prompt_template = """
<human>: {question}
<assistant>:
""".strip()

generation_config = model.generation_config
generation_config.max_new_tokens = 50
# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True).
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

def run(prompt: str):
  print('')
  # print('prompt', prompt)
  encoding = tokenizer(prompt, return_tensors="pt").to(device)
  with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config
    )
  print('output', tokenizer.decode(outputs[0], skip_special_tokens=True))

# run("Be yourself")

for question in questions[:10]:
  print('')
  print('question', question)
  prompt = prompt_template.format(question=question)
  run(prompt=prompt)
  # encoding = tokenizer(prompt, return_tensors="pt").to(device)
  # with torch.inference_mode():
  #   outputs = model.generate(
  #       input_ids = encoding.input_ids,
  #       attention_mask = encoding.attention_mask,
  #       generation_config = generation_config
  #   )
  # print(tokenizer.decode(outputs[0], skip_special_tokens=True))

questions ['Eldric, can you teach me some basic alchemy?', 'What kind of potions do you have for sale?', 'How did you lose your eye?', 'Can you tell me more about your mentor, Morwen?', "What's the most powerful potion you've ever made?", 'Do you have any healing potions?', "What's the rarest ingredient you've ever used?", 'Can you make a potion to increase my strength?', "What's the most dangerous potion you've ever made?", 'How did you become interested in alchemy?', 'Do you have any antidotes for poison?', "Can you tell me more about the Philosopher's Stone?", "How can I help you find the Philosopher's Stone?", "What's the most difficult potion to make?", 'Can you make a potion to make me invisible?', "What's the most unusual request you've ever had?", 'Do you have any potions that can help me breathe underwater?', 'Can you make a potion to help me fly?', "What's the most valuable potion you've ever sold?", 'Can you make a potion to help me see in the dark?', "What's the most danger
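Because `run` decodes the full sequence, the printed text contains the prompt followed by the continuation, and the model may keep generating further `<assistant>:` turns until `max_new_tokens` is reached. One possible way to keep only the first answer is to strip the prompt and cut at the next turn marker. The helper below is a hypothetical post-processing sketch and not part of the original notebook; the sample string is one of the test outputs recorded in this notebook.

```python
def extract_reply(decoded: str, prompt: str,
                  stop_markers=("<human>:", "<assistant>:")) -> str:
    """Return the first generated answer, without the prompt or later turns."""
    # Drop the echoed prompt if the decoded text starts with it.
    text = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Cut at the earliest turn marker that appears in the continuation.
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


prompt = "<human>: Can you make a potion to help me disguise myself?\n<assistant>:"
decoded = (
    "<human>: Can you make a potion to help me disguise myself?\n"
    "<assistant>: Yes, I can make a potion to help you disguise yourself. "
    "<assistant>: <price> gp.\n"
    "<assistant>: <name> the Master of Potions."
)

reply = extract_reply(decoded, prompt)
print(reply)
```

String-based trimming like this is a simple option for a demo; it assumes the decoded text reproduces the prompt exactly, which may not hold for every tokenizer.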

In [31]:
%%time

questions = data_raw["test"]["question"]
print('questions', questions)

model.gradient_checkpointing_disable()
model.config.use_cache = True
model.eval()

device = "cuda:0"

prompt_template = """
<human>: {question}
<assistant>:
""".strip()

generation_config = model.generation_config
generation_config.max_new_tokens = 50
# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True).
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

def run(prompt: str):
  print('')
  # print('prompt', prompt)
  encoding = tokenizer(prompt, return_tensors="pt").to(device)
  with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config
    )
  print('output', tokenizer.decode(outputs[0], skip_special_tokens=True))

# run("Be yourself")

for question in questions:
  print('')
  print('question', question)
  prompt = prompt_template.format(question=question)
  run(prompt=prompt)
  # encoding = tokenizer(prompt, return_tensors="pt").to(device)
  # with torch.inference_mode():
  #   outputs = model.generate(
  #       input_ids = encoding.input_ids,
  #       attention_mask = encoding.attention_mask,
  #       generation_config = generation_config
  #   )
  # print(tokenizer.decode(outputs[0], skip_special_tokens=True))

questions ['Can you make a potion to help me disguise myself?', "What's the most important lesson you've learned as an alchemist?", 'Do you have any potions that can help me resist charm spells?', 'Can you make a potion to help me teleport?', "What's the most fascinating thing about alchemy?"]

question Can you make a potion to help me disguise myself?

output <human>: Can you make a potion to help me disguise myself?
<assistant>: Yes, I can make a potion to help you disguise yourself. <assistant>: <price> gp.
<assistant>: <name> the Master of Potions.

question What's the most important lesson you've learned as an alchemist?

output <human>: What's the most important lesson you've learned as an alchemist?
<assistant>: The most important lesson I've learned as an alchemist is the importance of precision and attention to detail in crafting my potions. <assistant>: "The smallest mistake can have catastrophic consequences." <assistant>: So,

question Do you have any potions that can help 