# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn how to load a large model (`gpt-neo-x-20b`) in 4-bit and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.

First, let's install the required libraries and set up the SageMaker session.

In [3]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

In [10]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/pytorch-gptneox"
s3_output_location = 's3://{}/{}/'.format(bucket, prefix)

role = sagemaker.get_execution_role()

Then let's load the model we are going to use: GPT-NeoX-20B. Note that the model weights alone take around 40 GB in half precision.
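As a rough check on the 40 GB figure, the weight footprint can be estimated from the parameter count and the number of bits per parameter. The sketch below assumes a rounded count of 20 billion parameters and ignores quantization constants, activations, and optimizer state, so the numbers are lower bounds rather than measurements.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 20 billion parameters (rounded for illustration).
NUM_PARAMS = 20e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, going from half precision to 4-bit cuts the weight footprint by about a factor of four, which is what makes a single-GPU setup plausible for a model of this size.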

In [5]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset, load_from_disk

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

# Enable gradient checkpointing to trade extra compute for lower memory use,
# then prepare the quantized model for k-bit training.
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Loading checkpoint shards: 100%|██████████| 46/46 [00:11<00:00,  3.92it/s]


In [26]:
#torch.cuda.empty_cache()

Before training, we have to apply some preprocessing to the model. For that, we use the `prepare_model_for_kbit_training` function from PEFT, as in the cell above.
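The imports above include `LoraConfig` and `get_peft_model`, but this notebook does not show where they are used. A minimal sketch of how LoRA adapters could be attached to the prepared model is given here. The hyperparameter values are illustrative assumptions rather than settings taken from this notebook, and `attach_lora` and `trainable_fraction` are hypothetical helper names.

```python
# Illustrative LoRA hyperparameters (assumptions, not values from this notebook).
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 32,
    "target_modules": ["query_key_value"],  # fused attention projection in GPT-NeoX
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def attach_lora(model):
    """Wrap a model (already passed through prepare_model_for_kbit_training) with LoRA adapters."""
    # Imported lazily so the helpers in this block can be read without PEFT installed.
    from peft import LoraConfig, get_peft_model

    return get_peft_model(model, LoraConfig(**LORA_KWARGS))


def trainable_fraction(model) -> float:
    """Return the share of parameters that will receive gradient updates."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```

Calling `trainable_fraction(attach_lora(model))` should report a small fraction, since only the adapter weights are trained while the quantized base weights stay frozen.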

Then we save the model to an S3 bucket.

In this step, we download the dataset and save it to the S3 bucket.

In [8]:
from datasets import load_dataset, load_from_disk
from datasets.filesystems import S3FileSystem

# Load the quotes dataset and tokenize the "quote" column in batches.
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

# Write the tokenized dataset to the S3 prefix defined earlier.
s3 = S3FileSystem()
data.save_to_disk(s3_output_location + "input_data", fs=s3)

Found cached dataset json (/root/.cache/huggingface/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
100%|██████████| 1/1 [00:00<00:00, 65.59it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-6f89a84ef34370bf.arrow
                                                                                              

We run the preprocessing job below to process the dataset before training.
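The job in the next cell runs `source/preprocessing.py`, which is not reproduced in this notebook. As a purely hypothetical sketch, such a script might read the tokenized dataset from the processing input directory, apply a simple filter, and write the result to the output directory that SageMaker uploads to S3. The `keep_example` filter and its `max_tokens` threshold are assumptions made for illustration; only the two container paths come from the job definition.

```python
import os

INPUT_DIR = "/opt/ml/processing/input"          # ProcessingInput destination
OUTPUT_DIR = "/opt/ml/processing/output/train"  # ProcessingOutput source


def keep_example(example: dict, max_tokens: int = 512) -> bool:
    """Hypothetical filter: drop empty or overly long tokenized quotes."""
    n_tokens = len(example["input_ids"])
    return 0 < n_tokens <= max_tokens


def main() -> None:
    # Imported lazily so the filter above can be exercised without `datasets`.
    from datasets import load_from_disk

    dataset = load_from_disk(INPUT_DIR)  # assumes a DatasetDict with a "train" split
    train = dataset["train"].filter(keep_example)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    train.save_to_disk(OUTPUT_DIR)


# Only run the job logic when the processing input directory actually exists.
if __name__ == "__main__" and os.path.isdir(INPUT_DIR):
    main()
```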

In [53]:
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn

# Use the scikit-learn container as the image for the preprocessing script.
est_cls = SKLearn
framework_version_str = "0.20.0"

script_processor = FrameworkProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    estimator_cls=est_cls,
    framework_version=framework_version_str,
    base_job_name='gpt-neox-skprocessing'
)

script_processor.run(
    code='preprocessing.py',
    source_dir="source",
    inputs=[
        ProcessingInput(
            source=s3_output_location+'input_data', 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=s3_output_location+'train_data',
        )
    ]
)

INFO:sagemaker.processing:Uploaded source to s3://sagemaker-us-east-1-827673107724/gpt-neox-skprocessing-2023-06-24-12-03-11-277/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-827673107724/gpt-neox-skprocessing-2023-06-24-12-03-11-277/source/runproc.sh


Using provided s3_resource


INFO:sagemaker:Creating processing-job with name gpt-neox-skprocessing-2023-06-24-12-03-11-277


.......................Found existing installation: typing 3.7.4.3
Uninstalling typing-3.7.4.3:
  Successfully uninstalled typing-3.7.4.3
Collecting git+https://github.com/huggingface/transformers.git (from -r requirements.txt (line 2))
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-b6w4bs5j
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-b6w4bs5j
  Resolved https://github.com/huggingface/transformers.git to commit 8e164c5400b7b413c7b8fb32e35132001effc970
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting git+https://github.com/huggingfac

Run the cell below to start the training. For the sake of the demo, we run it for only a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

We will also save the dataset to the S3 bucket.

Let's use the SageMaker PyTorch estimator to train the model.

In [19]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="gpt-neox-pytorch.py",
    role=role,
    py_version="py310",
    framework_version="2.0",
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    # hyperparameters={"epochs": 1, "backend": "gloo"},
    source_dir="source",
    keep_alive_period_in_seconds=1800
)
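The entry point `gpt-neox-pytorch.py` is not shown in this notebook. The following is an untested, hypothetical sketch of what such a script could look like, reusing the quantization settings from the loading cell. The LoRA and `TrainingArguments` values are illustrative assumptions, the helper names are made up, and it assumes the `training` channel holds a single `Dataset` written with `save_to_disk` and that the model fits on one visible GPU. On multi-GPU instances the device placement may need adjusting.

```python
import os


def resolve_paths(env=os.environ):
    """Return (train_dir, model_dir) from the SageMaker training environment."""
    train_dir = env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
    model_dir = env.get("SM_MODEL_DIR", "/opt/ml/model")
    return train_dir, model_dir


def train(train_dir: str, model_dir: str, max_steps: int = 10) -> None:
    # Heavy imports are kept inside the function so the path helper stays lightweight.
    import torch
    import transformers
    from datasets import load_from_disk
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "EleutherAI/gpt-neox-20b"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map={"": 0}
    )
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # Illustrative LoRA settings (assumptions, not taken from the notebook).
    lora_config = LoraConfig(
        r=8, lora_alpha=32, target_modules=["query_key_value"],
        lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.config.use_cache = False  # caching is not useful during training

    trainer = transformers.Trainer(
        model=model,
        train_dataset=load_from_disk(train_dir),
        args=transformers.TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            max_steps=max_steps,  # a few steps only, for demonstration
            learning_rate=2e-4,
            bf16=True,
            logging_steps=1,
            output_dir="/tmp/outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # Files written to model_dir are packaged into model.tar.gz by SageMaker.
    # For a PEFT model, save_pretrained writes the adapter weights only.
    model.save_pretrained(model_dir)
    tokenizer.save_pretrained(model_dir)


# Only start training when the training channel directory actually exists.
if __name__ == "__main__" and os.path.isdir(resolve_paths()[0]):
    train(*resolve_paths())
```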

In [20]:
estimator.fit({"training": s3_output_location+"train_data"})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-06-25-13-39-21-178


Using provided s3_resource
2023-06-25 13:39:21 Starting - Starting the training job...
2023-06-25 13:39:47 Starting - Preparing the instances for training......
2023-06-25 13:40:52 Downloading - Downloading input data...
2023-06-25 13:41:12 Training - Downloading the training image..................
2023-06-25 13:44:18 Training - Training image download completed. Training in progress........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-25 13:45:23,179 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-25 13:45:23,209 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-06-25 13:45:23,218 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-25 13:45:23,220 sagemaker_pytorch_container.training INFO     Invoking user training script.

After training, the model is saved to the S3 bucket. We then deploy it to a SageMaker endpoint for inference.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge", volume_size=300)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-25-13-39-21-178/output/model.tar.gz), script artifact (s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-25-13-39-21-178/source/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-827673107724/pytorch-training-2023-06-25-15-27-56-639/model.tar.gz. This may take some time depending on model size...


Now we can test the model we trained.

In [None]:
text = "Elon Musk "
device = "cuda:0"

# Tokenize the prompt, send it to the endpoint, and decode the returned token IDs.
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = predictor.predict(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
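The request format an endpoint accepts depends on the inference handlers packaged with the model, which are not shown in this notebook. As one hedged alternative to sending tensors, the sketch below sends the raw prompt as JSON and lets the endpoint tokenize and generate on its side. The payload schema and the `build_payload` and `query` helpers are hypothetical, and this only works if the endpoint's handlers are written to accept this shape.

```python
def build_payload(prompt: str, max_new_tokens: int = 20) -> dict:
    """Hypothetical request schema: a raw prompt plus generation parameters."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def query(predictor, prompt: str, **generation_kwargs):
    """Send a JSON request to a deployed endpoint and return the parsed JSON reply."""
    # Imported lazily so build_payload can be used without the SageMaker SDK.
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.serializers import JSONSerializer

    predictor.serializer = JSONSerializer()
    predictor.deserializer = JSONDeserializer()
    return predictor.predict(build_payload(prompt, **generation_kwargs))
```

With a matching handler on the endpoint, `query(predictor, "Elon Musk ")` would return whatever JSON structure that handler produces.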

In [None]:
!ls