# HuggingFace meets `bitsandbytes` for lighter models on GPU for inference

 <center>
 <img src="https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png">
 </center>


You can load any HuggingFace 🤗 model in 8-bit with just a few lines of code. Install the dependencies below first!


In [None]:
!pip install --quiet bitsandbytes
!pip install --quiet git+https://github.com/huggingface/transformers.git # Install latest version of transformers
!pip install --quiet accelerate

## Hardware requirements 🔨

To use this feature, you need a GPU that supports 8-bit tensor core operations. Currently, Turing and Ampere GPUs (RTX 20 series, RTX 30 series, A40-A100, T4+) are supported, which means that on Colab we need to use a T4 GPU. Run the snippet below to check that you are using a supported GPU.

In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Apr  5 17:29:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Here we are using a `Tesla T4` GPU that should support 8-bit tensor cores! We are good to go 🚀
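
The same check can also be done programmatically. The sketch below assumes that the "Turing or newer" requirement above corresponds to CUDA compute capability 7.5 or higher (Turing is 7.5; Ampere is 8.0 and 8.6). The helper names `is_turing_or_newer` and `describe_current_gpu` are illustrative and not part of any library, and the PyTorch query is guarded so the snippet also runs on a machine without a GPU.

```python
def is_turing_or_newer(capability):
    """Return True if a (major, minor) compute capability is at least 7.5 (Turing)."""
    major, minor = capability
    return (major, minor) >= (7, 5)


def describe_current_gpu():
    """Summarize whether the first visible GPU meets the assumed requirement."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "Not connected to a GPU"
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # e.g. (7, 5) for a T4
    status = "supported" if is_turing_or_newer(capability) else "not supported"
    return f"{name} (compute capability {capability[0]}.{capability[1]}): {status}"


print(describe_current_gpu())
```

Comparing tuples keeps the version check correct across major versions, so a capability of (8, 0) passes while (7, 0) does not.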

## Utility variables & functions 🧰

In [12]:
text = "Automation helps the business to operate efficiently and will reduce the workload on employees who will be required to monitor the process and suggest further changes if required."

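def generate_from_model(model, tokenizer):
  # Tokenize the prompt and move the input ids to the GPU
  encoded_input = tokenizer(text, return_tensors='pt')
  input_ids = encoded_input['input_ids'].cuda()
  # Note: `temperature` only takes effect when sampling (do_sample=True); the default decoding is greedy
  output_sequences = model.generate(input_ids=input_ids, max_length=200, min_length=100, temperature=0.7)
  return tokenizer.decode(output_sequences[0], skip_special_tokens=True)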

## Use 8bit models and `pipeline` 🤗

You can use 8-bit quantized models together with `pipeline` as follows:

In [None]:
from transformers import pipeline

# `name` (a model identifier) and `max_new_tokens` must be defined before running this cell
pipe = pipeline(model=name, model_kwargs={"device_map": "auto", "load_in_8bit": True}, max_new_tokens=max_new_tokens)

Let's check the output!

In [None]:
pipe(text)

[{'generated_text': 'Hello my name is John and I am a student at the University of the West of England. I am currently studying for'}]

## Use 8bit models and `.generate` 📖

In [15]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_8bit = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-2.7B')

In [13]:
generate_from_model(model_8bit, tokenizer)

'Automation helps the business to operate efficiently and will reduce the workload on employees who will be required to monitor the process and suggest further changes if required. Automation also helps to reduce the risk of human error and improve the quality of the product. Automation also helps to reduce the cost of production and improve the efficiency of the business.\nAutomation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Automation is a process of automating the business processes. Aut

Let's compare the qualitative results of our quantized model and the original model:

In [None]:
model_native = AutoModelForCausalLM.from_pretrained(name, device_map="auto", torch_dtype="auto")
generate_from_model(model_native, tokenizer)



'Hello my name is John and I am a student at the University of the West Indies. I am'

## Memory footprint comparison 🪶

In [None]:
mem_fp16 = model_native.get_memory_footprint()
mem_int8 = model_8bit.get_memory_footprint()
print("Memory footprint int8 model: {} | Memory footprint fp16 model: {} | Relative difference: {}".format(mem_int8, mem_fp16, mem_fp16/mem_int8))

Memory footprint int8 model: 3645818880 | Memory footprint fp16 model: 6005114880 | Relative difference: 1.6471237539918604


We reduced the memory footprint by 1.65x for a 3-billion-parameter model! Note that internally we replace all the linear layers with the ones implemented in `bitsandbytes`. As the model scales up, the linear layers account for a larger share of the parameters, so the memory saved on those layers becomes much larger for very large models. For example, quantizing BLOOM-176B (a 176-billion-parameter model) reduces the memory footprint by 1.96x, which can save a lot of compute resources in practice.
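
The arithmetic behind this ratio can be sketched in a few lines. The sketch assumes that fp16 weights take 2 bytes per parameter and int8 weights take 1 byte, and it ignores small extras such as quantization statistics. The helper names are illustrative and not part of `transformers`.

```python
FP16_BYTES = 2  # bytes per parameter kept in fp16
INT8_BYTES = 1  # bytes per parameter stored in int8


def quantized_fraction(mem_fp16, mem_int8):
    """Estimate the share of parameters stored in int8 from two footprints (in bytes)."""
    n_params = mem_fp16 / FP16_BYTES
    # Each quantized parameter saves (2 - 1) = 1 byte
    n_quantized = (mem_fp16 - mem_int8) / (FP16_BYTES - INT8_BYTES)
    return n_quantized / n_params


def footprint_ratio(fraction_int8):
    """fp16 / int8 footprint ratio when `fraction_int8` of the parameters are quantized."""
    avg_bytes = FP16_BYTES - fraction_int8 * (FP16_BYTES - INT8_BYTES)
    return FP16_BYTES / avg_bytes


# Footprints printed by the cell above
mem_fp16 = 6005114880
mem_int8 = 3645818880

fraction = quantized_fraction(mem_fp16, mem_int8)
print(f"Estimated share of parameters in int8: {fraction:.1%}")
print(f"Footprint ratio: {footprint_ratio(fraction):.2f}x")
```

Under these assumptions the ratio can approach 2x but never exceed it, because the parameters outside the linear layers stay in fp16. This is consistent with the larger gain reported for larger models.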

## Hyper-parameter tuning 📠


**Warning:** you may want to run these cells separately from previous cells to avoid Out Of Memory (OOM) issues.

You can experiment with the `int8_threshold` parameter and see how it affects the results of your model. You can specify it directly when loading your model through the `.from_pretrained` method; in the cells below it is passed as `load_in_8bit_threshold`. By default this parameter is set to `6.0`, as described in the paper.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_8bit_thresh_4 = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True, load_in_8bit_threshold=4.0)
model_8bit_thresh_2 = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True, load_in_8bit_threshold=2.0)
tokenizer = AutoTokenizer.from_pretrained(name)

In [None]:
generate_from_model(model_8bit_thresh_4, tokenizer)

'Hello my name is John and I am a student at the University of the West Indies. I am'

In [None]:
generate_from_model(model_8bit_thresh_2, tokenizer)

'Hello my name is John and I am a newbie to the forum. I have a question about'

As you can see, the generations can vary slightly with different thresholds. This is because 8-bit values are more easily perturbed by small changes! The threshold sets the magnitude above which a hidden-state value is treated as an outlier and processed in fp16, so it controls how the computation is split between `int8` and fp16. Setting the threshold to `0` turns off this fp16 path, which leads to a full model in `int8`.
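
To make the role of the threshold concrete, here is a toy, pure-Python sketch of a mixed-precision dot product. It is not the `bitsandbytes` implementation: the function names are hypothetical, it works on a single pair of vectors rather than on matrices, and it treats a threshold of `0` as "no outlier handling", following the text above.

```python
def absmax_quantize(values):
    """Scale values into the int8 range [-127, 127]; return (ints, scale)."""
    absmax = max((abs(v) for v in values), default=0.0)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = 127.0 / absmax
    return [round(v * scale) for v in values], scale


def mixed_precision_dot(x, w, threshold):
    """Dot product where entries of x above `threshold` bypass quantization.

    Returns (result, number_of_outliers). A threshold of 0 disables the
    outlier path, so every entry goes through int8.
    """
    if threshold > 0:
        outliers = [i for i, v in enumerate(x) if abs(v) > threshold]
    else:
        outliers = []
    outlier_set = set(outliers)
    regular = [i for i in range(len(x)) if i not in outlier_set]

    # Outlier part: computed directly in floating point
    high_precision = sum(x[i] * w[i] for i in outliers)

    # Regular part: quantize, accumulate in integers, then dequantize
    xq, x_scale = absmax_quantize([x[i] for i in regular])
    wq, w_scale = absmax_quantize([w[i] for i in regular])
    int_accumulator = sum(a * b for a, b in zip(xq, wq))
    low_precision = int_accumulator / (x_scale * w_scale)

    return high_precision + low_precision, len(outliers)


x = [0.1, -0.2, 8.0, 0.3]  # one large "outlier" activation
w = [0.5, 0.25, -0.125, 1.0]
exact = sum(a * b for a, b in zip(x, w))

mixed, n_outliers = mixed_precision_dot(x, w, threshold=6.0)
full_int8, _ = mixed_precision_dot(x, w, threshold=0)

print(f"exact={exact:.4f}  mixed={mixed:.4f} ({n_outliers} outlier)  full int8={full_int8:.4f}")
```

In this example the single large value would otherwise dominate the quantization scale and squash the small values, so routing it around quantization keeps the result closer to the exact dot product.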