# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn together how to load a large model (`gpt-neo-x-20b`) in 4bit and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

First, let's load the model we are going to use: GPT-neo-x-20B! Note that the model itself is around 40GB in half precision.
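
As a rough back-of-the-envelope check of the 40GB figure, the sketch below estimates the memory needed for the weights alone. It assumes about 20 billion parameters and ignores activations, optimizer state, and the small overhead of quantization constants; `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 20e9  # assumption: roughly 20 billion parameters

half_precision_gb = weight_memory_gb(n_params, 16)  # fp16 / bf16
four_bit_gb = weight_memory_gb(n_params, 4)         # 4bit quantized weights

print(f"half precision: {half_precision_gb:.0f} GB")
print(f"4bit:           {four_bit_gb:.0f} GB")
```

Under these assumptions the 4bit weights need roughly a quarter of the half-precision footprint, which is what makes loading the model on a single GPU plausible.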

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.



Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 8650752 || all params: 10597552128 || trainable%: 0.08162971878329976
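
The trainable-parameter count printed above can be cross-checked by hand. The sketch below assumes GPT-NeoX-20B has 44 transformer layers and a hidden size of 6144 (values not stated in this notebook), and that each `query_key_value` module is a fused projection from 6144 to 3 × 6144 features. For every targeted layer, LoRA adds one `r × in_features` matrix and one `out_features × r` matrix.

```python
# Assumed GPT-NeoX-20B dimensions (not stated in this notebook).
hidden_size = 6144
num_layers = 44

r = 8  # LoRA rank, as set in the LoraConfig above

in_features = hidden_size
out_features = 3 * hidden_size  # fused query, key, and value projection

# LoRA adds A (r x in_features) and B (out_features x r) per targeted layer.
lora_params_per_layer = r * (in_features + out_features)
lora_params = num_layers * lora_params_per_layer

reported_all_params = 10597552128  # "all params" value printed above
trainable_percent = 100 * lora_params / reported_all_params

print(f"LoRA params: {lora_params}")
print(f"trainable%:  {trainable_percent:.4f}")
```

With these assumed dimensions, the result agrees with the `trainable params` value reported by `print_trainable_parameters`.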


Let's load a common dataset, English quotes, to fine-tune our model on famous quotes.

In [None]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)


Run the cell below to start the training! For the sake of the demo, we run it for only a few steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  attn_scores = torch.where(causal_mask, attn_scores, mask_value)


| Step | Training Loss |
|------|---------------|
| 1 | 2.3823 |
| 2 | 3.2831 |
| 3 | 2.3036 |
| 4 | 2.8237 |
| 5 | 2.639 |
| 6 | 2.2176 |
| 7 | 2.3001 |
| 8 | 1.4805 |
| 9 | 2.4478 |
| 10 | 2.466 |


TrainOutput(global_step=10, training_loss=2.434360909461975, metrics={'train_runtime': 169.887, 'train_samples_per_second': 0.235, 'train_steps_per_second': 0.059, 'total_flos': 99255709532160.0, 'train_loss': 2.434360909461975, 'epoch': 0.02})
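
To relate the training arguments to the reported metrics, here is a small sketch of the arithmetic. It assumes a single GPU and a train split of 2,508 quotes (the dataset size is an assumption here); the per-step losses are copied from the log above.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_train_examples = 2508  # assumption: size of the train split

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch_size * max_steps
epochs_completed = samples_seen / num_train_examples

# Per-step losses as logged above (rounded by the logger).
step_losses = [2.3823, 3.2831, 2.3036, 2.8237, 2.639,
               2.2176, 2.3001, 1.4805, 2.4478, 2.466]
mean_loss = sum(step_losses) / len(step_losses)

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen:         {samples_seen}")
print(f"epochs completed:     {epochs_completed:.2f}")
print(f"mean logged loss:     {mean_loss:.4f}")
```

Ten steps cover only a small fraction of one epoch, so the run demonstrates the workflow rather than producing a meaningfully fine-tuned model.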

In [4]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
#prefix = "sagemaker/pytorch-mnist"

role = sagemaker.get_execution_role()

In [10]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="gpt-neox-pytorch.py",
    role=role,
    py_version="py310",
    framework_version="2.0",
    instance_count=1,
    instance_type="ml.g5.16xlarge",
#   hyperparameters={"epochs": 1, "backend": "gloo"},
    source_dir="source",
    keep_alive_period_in_seconds=1800
)

In [None]:
estimator.fit()

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-06-23-06-43-26-939


Using provided s3_resource
2023-06-23 06:43:27 Starting - Starting the training job...
2023-06-23 06:43:42 Starting - Preparing the instances for training......
2023-06-23 06:44:44 Downloading - Downloading input data......
2023-06-23 06:45:42 Training - Downloading the training image..................
2023-06-23 06:48:38 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-23 06:49:36,183 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-23 06:49:36,196 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-06-23 06:49:36,204 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-23 06:49:36,206 sagemaker_pytorch_container.training INFO     Invoking user training script.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.16xlarge")

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-23-06-43-26-939/output/model.tar.gz), script artifact (s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-23-06-43-26-939/source/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-23-08-12-48-826/model.tar.gz. This may take some time depending on model size...


In [None]:
text = "Elon Musk "
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = predictor.predict(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [12]:
from peft import LoraConfig, get_peft_model

# Restores only the saved LoRA config; get_peft_model creates freshly initialized adapter layers.
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)

In [None]:
text = "Elon Musk "
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Elon Musk 
Elon Musk is a South African-born Canadian-American business magnate, investor, engineer
