<a href="https://colab.research.google.com/github/pcadmanbosse/cs224u/blob/main/bnb_4bit_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX that introduces 4-bit quantization techniques with no performance degradation, with the goal of democratizing LLM inference and training.

In this notebook, we will learn how to load a large model in 4-bit (`mistralai/Mistral-7B-Instruct-v0.1`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets



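First, let's load the model we are going to use: `mistralai/Mistral-7B-Instruct-v0.1`. Note that the model itself is roughly 14.5GB in half precision, so loading it in 4-bit is what makes it fit comfortably on a single Colab GPU.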
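As a rough size check for the `model_id` loaded in the next cell, the weight footprint can be estimated as parameters times bits per parameter. This is a minimal sketch: the parameter count of about 7.24 billion is an assumption (it is not stated in this notebook), and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed parameter count for a 7B-class model; not taken from this notebook.
NUM_PARAMS = 7.24e9

half_precision_gb = weight_memory_gb(NUM_PARAMS, 16)
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"half precision: {half_precision_gb:.2f} GB")
print(f"4-bit:          {four_bit_gb:.2f} GB")
```

Under these assumptions, 4-bit storage is one quarter of the half-precision footprint, before any overhead for the per-block scaling constants.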

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
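To give some intuition for what 4-bit storage means, here is a minimal sketch of blockwise absmax quantization in pure Python. It is deliberately simplified: it uses evenly spaced integer levels, not the NF4 codebook selected by `bnb_4bit_quant_type="nf4"`, and it is not how `bitsandbytes` is implemented. The function names and the sample block are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1              # 7 usable levels on each side for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    # Round to the nearest level and clip to the representable range.
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical block of weights, for illustration only.
block = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.0]

codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:    ", codes)
print("scale:    ", scale)
print("max error:", max_error)
```

Each block stores only the small integer codes plus one scale, and the reconstruction error per weight is bounded by half a quantization step.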

Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [3]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 21260288 || all params: 3773331456 || trainable%: 0.5634354746703705
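The two counts printed above can be reproduced by hand. The sketch below assumes the usual Mistral-7B dimensions (hidden size 4096, MLP size 14336, key/value projection width 1024, vocabulary 32000, 32 layers), none of which are stated in this notebook. It also assumes that 4-bit weights are stored two per element, so `numel()` reports half the logical count, and that `lm_head` and the embeddings stay unquantized.

```python
# Assumed Mistral-7B dimensions; these values are not taken from this notebook.
HIDDEN, INTERMEDIATE, KV_DIM, VOCAB, NUM_LAYERS = 4096, 14336, 1024, 32000, 32
RANK = 8  # matches r=8 in the LoraConfig above

# (input features, output features) of each targeted projection in one decoder layer.
LAYER_PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)

lora_per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_PROJECTIONS.values())
trainable = NUM_LAYERS * lora_per_layer + lora_params(HIDDEN, VOCAB, RANK)

dense_per_layer = sum(i * o for i, o in LAYER_PROJECTIONS.values())
packed_4bit = NUM_LAYERS * dense_per_layer // 2    # two 4-bit weights per stored element
embeddings = VOCAB * HIDDEN
lm_head = HIDDEN * VOCAB                           # assumed to stay unquantized
norms = NUM_LAYERS * 2 * HIDDEN + HIDDEN           # two norms per layer plus the final norm
all_params = packed_4bit + embeddings + lm_head + norms + trainable

print(f"trainable params: {trainable} || all params: {all_params} || "
      f"trainable%: {100 * trainable / all_params}")
```

Under these assumptions, the arithmetic matches the printed output, which also explains why "all params" reads about 3.77 billion instead of roughly 7.24 billion: the count reflects packed storage, not logical weights.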


Let's load our training data, a pickled pandas DataFrame, and format each row into an instruction-style prompt to fine-tune our model on.

In [6]:
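from datasets import Dataset
import pandas as pd

# Load the pickled training DataFrame (columns used below: QUERY, TEXT, SCORECARD, RESULT).
train = pd.read_pickle("../content/train_dataset.pkl")

def create_training_entry(row):
    # Instruct format: <s>[INST] prompt [/INST] answer</s>
    # The closing tag must be [/INST]; the evaluation cell splits the text on it.
    return "<s>[INST]"+row["QUERY"]+" Document: " + row["TEXT"] + " Scorecard: "+",".join(row["SCORECARD"])+"[/INST] \n"+row["RESULT"]+"</s>"

train["text"] = train.apply(create_training_entry, axis=1)
# Sequential row ids; despite the column name, these are not token ids.
train["input_ids"] = range(len(train))
training_ds = Dataset.from_pandas(train)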
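To make the prompt template concrete, here is a small standalone sketch that formats one row and then splits it back into prompt and answer. It assumes the Mistral instruct convention of closing the instruction with `[/INST]`. The `format_entry` helper and the `SCORECARD` and `RESULT` values are hypothetical; the query and document fragment are taken from the sample shown later in this notebook.

```python
def format_entry(row):
    """Build one training string: <s>[INST] prompt [/INST] answer</s>."""
    prompt = (
        row["QUERY"]
        + " Document: " + row["TEXT"]
        + " Scorecard: " + ",".join(row["SCORECARD"])
    )
    return "<s>[INST]" + prompt + "[/INST] \n" + row["RESULT"] + "</s>"

sample_row = {
    "QUERY": "Which prohibited pesticides are present on which plots?",
    "TEXT": "\n\nKey: FAZENDA:\nValue: COCAL",
    "SCORECARD": ["pesticide_a", "pesticide_b"],   # hypothetical values
    "RESULT": "None found.",                       # hypothetical value
}

entry = format_entry(sample_row)

# Splitting on the closing tag separates what the model sees from what it should produce.
prompt_part, answer_part = entry.split("[/INST]")

print(repr(prompt_part))
print(repr(answer_part))
```

If the closing tag were written as `[INST]` instead, splitting on `[/INST]` would return the whole string, answer and `</s>` included, so an evaluation prompt built that way would already contain the target.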

In [7]:
train["TEXT"]

469    \n\nKey: PRODUTOR:\nValue: PEDRO BARBIERI\n\nK...
553    \n\nKey: VALOR DO ICMS\nValue: 0.00\n\nKey: IN...
572    \n\nKey: PESO BRUTO\nValue: 819.20\n\nKey: BAI...
568    \n\nKey: PESO BRUTO\nValue: 2,000.00\n\nKey: D...
494    \n\nKey: PESOBUTO\nValue: 500,00\n\nKey: IDENT...
                             ...                        
402    \n\nKey: BASE CÁLCULO ISS\nValue: 0,00\n\nKey:...
455    \n\nKey: INSCRICAO ESTADUAL\nValue: 0029411440...
184    \n\nKey: Pulverizador:\nValue: PH 400\n\nKey: ...
135    \n\nKey: PESO BRUTO\nValue: 27000,0000 Kg\n\nK...
484    \n\nKey: Intervalo de Segurança:\nValue: 30 di...
Name: TEXT, Length: 550, dtype: object

In [8]:
train[["text", "input_ids"]]

| | text | input_ids |
|---|---|---|
| 469 | `<s>[INST]Which prohibited pesticides are prese...` | 0 |
| 553 | `<s>[INST]Which prohibited pesticides are prese...` | 1 |
| 572 | `<s>[INST]Which prohibited pesticides are prese...` | 2 |
| 568 | `<s>[INST]Which prohibited pesticides are prese...` | 3 |
| 494 | `<s>[INST]Which prohibited pesticides are prese...` | 4 |
| ... | ... | ... |
| 402 | `<s>[INST]Which prohibited pesticides are prese...` | 545 |
| 455 | `<s>[INST]Which prohibited pesticides are prese...` | 546 |
| 184 | `<s>[INST]Which organic fertilizers are present...` | 547 |
| 135 | `<s>[INST]Which organic fertilizers are present...` | 548 |


In [9]:
!pip install -q trl

Run the cell below to start the training! For the sake of the demo, we run it for only a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

In [10]:
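import os

import transformers
from trl import SFTTrainer
from peft import LoraConfig

# Allocator settings are typically read when the CUDA caching allocator initializes,
# so setting them here, after the model is already on the GPU, may have no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.9,max_split_size_mb:512"
torch.cuda.empty_cache()

# The tokenizer has no dedicated padding token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token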
training_arguments = transformers.TrainingArguments(
        remove_unused_columns = False,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    )

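# Note: `model` was already wrapped by get_peft_model above (r=8, lora_alpha=32),
# so this second config differs from the adapter that is already attached.
peft_config = LoraConfig(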
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=training_ds,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2000,  # You can specify the maximum sequence length here
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Map:   0%|          | 0/550 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 1 | 2.003 |
| 2 | 1.7769 |
| 3 | 1.9075 |
| 4 | 1.7112 |
| 5 | 1.6328 |
| 6 | 1.8339 |
| 7 | 1.5398 |
| 8 | 1.446 |
| 9 | 1.3559 |
| 10 | 1.4349 |
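The learning rate behind the loss values above is not constant. The sketch below assumes the default linear scheduler of `TrainingArguments` (linear warmup followed by linear decay to zero) and plugs in `learning_rate=2e-4`, `warmup_steps=2`, and `max_steps=10` from the training cell. The helper name is hypothetical.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, max_steps=10):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * factor

# One value per scheduler step, from the start of training to the final step.
schedule = [linear_schedule_lr(step) for step in range(11)]

for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With only ten steps, the run spends most of its time decaying from the peak rate, which is one reason such a short demo run says little about final model quality.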


TrainOutput(global_step=10, training_loss=1.6641853332519532, metrics={'train_runtime': 116.4971, 'train_samples_per_second': 0.343, 'train_steps_per_second': 0.086, 'total_flos': 3414207190069248.0, 'train_loss': 1.6641853332519532, 'epoch': 0.07})
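The metrics in `TrainOutput` can be cross-checked from the training arguments. This sketch assumes a single GPU, so the effective batch size is `per_device_train_batch_size` times `gradient_accumulation_steps`. The dataset size of 550 comes from the `Map` progress bar and the runtime from the output above.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_examples = 550          # dataset size reported by the Map progress bar
train_runtime = 116.4971    # seconds, from TrainOutput

# Effective batch size per optimizer step (single GPU assumed).
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps

epoch_fraction = samples_seen / num_examples
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"samples seen:       {samples_seen}")
print(f"epoch:              {epoch_fraction:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The run therefore covers only 40 of the 550 examples, about 7% of one epoch.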

In [11]:
train.iloc[0].text

'<s>[INST]Which prohibited pesticides are present on which plots? Document: \n\nKey: PRODUTOR:\nValue: PEDRO BARBIERI\n\nKey: FAZENDA:\nValue: COCAL\n\nKey: Área da aplicação(há)\nValue: 32,50\n\nKey: Área Total (há)\nValue: 32,50\n\nKey: Qtd. Total de Tanques Utilizados:\nValue: 13,00\n\nKey: Reentrada (Horas):\nValue: 24\n\nKey: Carência (Dias):\nValue: 0\n\nKey: Outro:\nValue: \n\nKey: Capacidade do Tanque (L):\nValue: 2.000\n\nKey: Volume total de calda (L):\nValue: 26.000\n\nKey: Vazão Recomendada (Lt/ha):\nValue: 800\n\nKey: 2° Seguir política de uso excepcional: ITENS:\nValue: 3.3.1_3.4.1_3.4.2_3.6.1_4.1.1_4.1.2\n\nKey: Data da Recomendação:\nValue: 01/11/2022\n\nKey: Observações:\nValue: \n\nKey: Não manipular ou aplicar a 10m de cursos d\'agua;\nValue: \n\nKey: Responsável Técnico\nValue: \n\nKey: Os Produtos em aplicações constam:\nValue: (A) Risco Vida Silvestre / (B) Risco Vida Aquática / (c) Risco à Polinizadores / (D) Risco Observador / (E) Uso excepcional EPI (A.N)\n\nKe

In [32]:
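torch.cuda.empty_cache()
# The 4-bit model already sits on GPU 0 (device_map={"":0}), so the extra .to("cuda")
# is unnecessary and may be rejected for bitsandbytes-quantized models.
model.config.use_cache = True  # re-enable the cache that was disabled for training
# Keep only the instruction part of the first training example as the prompt.
eval_prompt = train.iloc[0].text.split("[/INST]")[0].replace("<s>[INST]", "")
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=2000, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)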

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


RuntimeError: ignored