<a href="https://colab.research.google.com/github/pj2111/Assignments/blob/master/assignment_data/warmup_model_quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install bitsandbytes torch transformers huggingface_hub peft accelerate quanto > /dev/null

Floating-point data types have a property called precision. Precision is determined by the number of bits used in memory to store a number, which in turn limits how many significant digits can be stored before rounding.
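
To make the rounding concrete, here is a minimal sketch that stores the same number in 16, 32 and 64 bits and reads it back. It uses only Python's standard `struct` module rather than PyTorch, and the helper name `roundtrip` is made up for this illustration.

```python
import struct

def roundtrip(value: float, fmt: str) -> float:
    """Store `value` in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

value = 3.141592653589793  # pi as a 64-bit float

# struct format codes: "e" = 16-bit, "f" = 32-bit, "d" = 64-bit float
for name, fmt in [("float16", "<e"), ("float32", "<f"), ("float64", "<d")]:
    stored = roundtrip(value, fmt)
    bits = struct.calcsize(fmt) * 8
    print(f"{name}: {bits} bits -> {stored!r} (error {abs(stored - value):.2e})")
```

The fewer bits are used, the further the stored value drifts from the original.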

The higher the precision, the more memory each parameter takes and the larger the model becomes. The highest floating-point precision in PyTorch is 64-bit. Models are usually trained in 32-bit precision and quantised from there.
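
As a rough sketch of that relationship, the weights-only size of a model is the parameter count times the bytes per parameter. The parameter count below is a hypothetical round figure, and `model_size_gb` is a helper invented for this example; real footprints also include buffers and any layers kept in higher precision.

```python
def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes); ignores activations and buffers."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 1_700_000_000  # hypothetical round figure for a "1.7B" model

for bits in (64, 32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(num_params, bits):.2f} GB")
```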

Quantising means changing the precision of the parameters used in the model: in short, reducing the size of each parameter and, in turn, the overall model size. This reduction usually costs some prediction accuracy, and depending on the method and hardware it can also affect inference speed.
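
One simple way to picture this is absmax quantisation to int8, sketched below in plain Python. This is an illustration of the general idea only, not the scheme any particular library uses; the function names and the example weights are made up.

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0  # avoid dividing by zero for all-zero input
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 1.27, -0.9]  # made-up example values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", max_error)
```

Each weight now needs 8 bits plus one shared scale, and the round-trip error is at most half a quantisation step.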

https://huggingface.co/docs/transformers/main/en/quantization

The link above compares the available methods and provides hands-on examples with different libraries such as AQLM, PEFT and bitsandbytes. *We will use this doc for our review below.*

The purpose of the bitsandbytes library, together with the supporting PEFT library, is to load larger models into limited GPU VRAM for fine-tuning. Apart from saving space, it provides better error management too.

https://huggingface.co/docs/bitsandbytes/optimizers

Fine-tuning bigger models requires a bigger dataset for the parameters to be learnt.

In this notebook, we only discuss loading the model for inference. We also review the model's memory footprint with and without quantising.

### Reading Homework:

https://huggingface.co/blog/4bit-transformers-bitsandbytes

https://huggingface.co/blog/hf-bitsandbytes-integration

##### Intro to Quanto

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

In [None]:
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantization_config = QuantoConfig(weights="int8")

quantized_model = AutoModelForCausalLM.from_pretrained(model_id,
                                                       device_map="cuda:0",
                                                       quantization_config=quantization_config)

In [7]:
quantized_model.get_memory_footprint()

500957760

Like quanto, there are other libraries that provide quantisation support:

- AQLM

- AWQ

- AutoGPTQ

- ExLlama

All of the above have specific use cases with specific models. They are mostly not used in production unless that specific model is used and resources are an issue. bitsandbytes is covered in detail below.

#### Configuration option of BitsAndBytesConfig

load_in_8bit (bool, optional, defaults to False) — This flag is used to enable 8-bit quantization with LLM.int8().

load_in_4bit (bool, optional, defaults to False) — This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from bitsandbytes.

llm_int8_threshold (float, optional, defaults to 6.0) — This corresponds to the outlier threshold for outlier detection as described in LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).

llm_int8_skip_modules (List[str], optional) — An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as Jukebox that have several heads in different places and not necessarily at the last position. For example for CausalLM models, the last lm_head is kept in its original dtype.

llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False) — This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not be run on CPU.

llm_int8_has_fp16_weight (bool, optional, defaults to False) — This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.

bnb_4bit_compute_dtype (torch.dtype or str, optional, defaults to torch.float32) — This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.

bnb_4bit_quant_type (str, optional, defaults to "fp4") — This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by fp4 or nf4.

bnb_4bit_use_double_quant (bool, optional, defaults to False) — This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.

bnb_4bit_quant_storage (torch.dtype or str, optional, defaults to torch.uint8) — This sets the storage type to pack the quantized 4-bit params.

kwargs (Dict[str, Any], optional) — Additional parameters from which to initialize the configuration object.
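
To illustrate what "packing" 4-bit params into a storage type such as uint8 means, here is a small sketch in plain Python that fits two 4-bit codes into each byte. The helper names and the nibble order are assumptions made for this example; the actual memory layout used by bitsandbytes may differ.

```python
def pack_4bit(codes):
    """Pack 4-bit codes (integers 0..15) into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("every code must fit in 4 bits (0..15)")
    padded = list(codes)
    if len(padded) % 2:
        padded.append(0)  # pad to an even count so codes pair up
    # First code of each pair goes in the high nibble, second in the low nibble
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return codes[:count]

codes = [3, 15, 0, 9, 7]
packed = pack_4bit(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_4bit(packed, len(codes)))
```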


In [2]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Loading the model in 4-bit requires bitsandbytes and accelerate to be installed

In [3]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True,)

model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7",
                                                  device_map="auto",
                                                  quantization_config=quantization_config)


In [4]:
model_4bit.get_memory_footprint()  # 3.4 GB model is loaded with 1.6 GB Space

1632878592

In [5]:
# loading model in 8-bit

quantization_config = BitsAndBytesConfig(load_in_8bit=True,)

model_8bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", quantization_config=quantization_config)

In [6]:
model_8bit.get_memory_footprint()  # 2.2 GB memory usage

2236858368
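
The two footprints reported above can be put side by side with a few lines of arithmetic. The byte counts are copied from the cell outputs in this notebook; the helper `to_gb` is introduced only for this sketch.

```python
# Byte counts copied from the get_memory_footprint() outputs above
footprints = {"4-bit": 1632878592, "8-bit": 2236858368}

def to_gb(num_bytes: int) -> float:
    """Convert bytes to GB (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9

for name, num_bytes in footprints.items():
    print(f"{name}: {to_gb(num_bytes):.2f} GB")

ratio = footprints["4-bit"] / footprints["8-bit"]
print(f"4-bit footprint is {ratio:.0%} of the 8-bit footprint")
```

The 4-bit footprint is noticeably more than half of the 8-bit one. A plausible reason is that only the Linear layers are replaced with quantised versions, so the remaining modules keep their original precision in both cases.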

In [9]:
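from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Enable 8-bit loading explicitly; the offload flag only allows modules
# placed on the CPU to stay in fp32
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Keep everything on GPU 0 except lm_head, which is offloaded to the CPU
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}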

In [10]:
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)



An “outlier” is a hidden state value whose magnitude is above a certain threshold; these values are computed in fp16. While the values are usually normally distributed, with most in the range [-3.5, 3.5], the distribution can be very different for large models, where outliers often fall in [-60, -6] or [6, 60]. 8-bit quantization works well for values of magnitude ~5, but beyond that there is a significant performance penalty. The llm_int8_threshold parameter sets this threshold.

In [13]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # the llm_int8_* options apply to 8-bit loading
    llm_int8_threshold=10.0,
    llm_int8_enable_fp32_cpu_offload=True,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=quantization_config,
)
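
The outlier threshold described above can be pictured with a small pure-Python sketch: feature columns that contain a value above the threshold are multiplied in full precision, and the rest go through int8 codes. This is a simplified illustration under stated assumptions, not the bitsandbytes implementation: it uses one scale per tensor where the LLM.int8() paper uses vector-wise scales, Python floats stand in for fp16, and all function names and example values are made up.

```python
def absmax_quantize(matrix):
    """Quantize a matrix to int8 codes in [-127, 127] with one shared scale."""
    peak = max((abs(v) for row in matrix for v in row), default=0.0)
    scale = peak / 127 if peak else 1.0
    return [[round(v / scale) for v in row] for row in matrix], scale

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mixed_precision_matmul(x, w, threshold=6.0):
    """Multiply x (rows x features) by w (features x outputs).

    Feature columns of x holding any value above `threshold` in magnitude
    are treated as outliers and multiplied without quantization.
    """
    n_features = len(w)
    outlier_cols = [j for j in range(n_features)
                    if any(abs(row[j]) > threshold for row in x)]
    regular_cols = [j for j in range(n_features) if j not in outlier_cols]
    out = [[0.0] * len(w[0]) for _ in x]

    if regular_cols:
        # Regular features: quantize both operands, multiply the integer
        # codes, then rescale the result back to floats
        xq, sx = absmax_quantize([[row[j] for j in regular_cols] for row in x])
        wq, sw = absmax_quantize([w[j] for j in regular_cols])
        for i, row in enumerate(matmul(xq, wq)):
            for k, v in enumerate(row):
                out[i][k] += v * sx * sw

    if outlier_cols:
        # Outlier features: keep the original values
        x_out = [[row[j] for j in outlier_cols] for row in x]
        w_out = [w[j] for j in outlier_cols]
        for i, row in enumerate(matmul(x_out, w_out)):
            for k, v in enumerate(row):
                out[i][k] += v

    return out, outlier_cols

x = [[0.5, -1.2, 8.0], [1.0, 0.3, -0.2]]    # made-up hidden states
w = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # made-up weights

exact = matmul(x, w)
result, outlier_cols = mixed_precision_matmul(x, w, threshold=6.0)
print("outlier columns:", outlier_cols)
print("exact: ", exact)
print("mixed: ", result)
```

Raising the threshold, as the cell above does with 10.0, means fewer columns are treated as outliers, so more of the computation goes through int8.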



With models like Jukebox, you don’t need to quantize every module to 8-bit, which can actually cause instability. Jukebox has several lm_head modules that should be skipped using the llm_int8_skip_modules parameter. The cell below applies the same parameter to the single lm_head of BLOOM.

In [14]:
model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # the llm_int8_* options apply to 8-bit loading
    llm_int8_skip_modules=["lm_head"],
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

In [15]:
# If the model is loaded for fine-tuning, the compute dtype can be changed
import torch

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
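
The quant type option chooses which set of 16 representable values each 4-bit code maps to. The sketch below shows the general idea of codebook quantization with two made-up codebooks: an evenly spaced one and one placed at quantiles of a normal distribution. Neither is the actual FP4 or NF4 table from bitsandbytes; they only illustrate why a codebook shaped like the weight distribution can be attractive for normally distributed weights. All names and values here are assumptions for the example.

```python
from statistics import NormalDist

def uniform_codebook(levels=16):
    """Evenly spaced values in [-1, 1]."""
    return [-1 + 2 * i / (levels - 1) for i in range(levels)]

def normal_quantile_codebook(levels=16):
    """Values at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Illustrative only: this is not the NF4 table.
    """
    dist = NormalDist()
    quantiles = [dist.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    peak = max(abs(q) for q in quantiles)
    return [q / peak for q in quantiles]

def nearest_index(codebook, target):
    """Index of the codebook entry closest to `target`."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - target))

def quantize_block(values, codebook):
    """Scale a block by its absmax, then store the nearest codebook index."""
    absmax = max(abs(v) for v in values) or 1.0
    return [nearest_index(codebook, v / absmax) for v in values], absmax

def dequantize_block(codes, absmax, codebook):
    """Look up each code and undo the absmax scaling."""
    return [codebook[c] * absmax for c in codes]

block = [0.02, -0.4, 0.9, -1.5, 0.3]  # made-up weights
for name, codebook in [("uniform", uniform_codebook()),
                       ("normal-quantile", normal_quantile_codebook())]:
    codes, absmax = quantize_block(block, codebook)
    restored = dequantize_block(codes, absmax, codebook)
    print(name, codes, [round(r, 3) for r in restored])
```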

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the quantization constants produced by the first quantization, saving an additional 0.4 bits/parameter.
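
The figure of roughly 0.4 bits/parameter can be reproduced with a short calculation. The block sizes and bit widths below are assumptions taken from the QLoRA paper (64 weights per block, 32-bit constants, and a second level with 8-bit constants in blocks of 256); they are not read from the library configuration.

```python
first_block = 64    # weights sharing one quantization constant (assumed)
second_block = 256  # constants sharing one second-level constant (assumed)

# Without nesting: one 32-bit constant per block of weights
plain_overhead = 32 / first_block

# With nesting: one 8-bit constant per block of weights, plus one 32-bit
# constant per block of those 8-bit constants
nested_overhead = 8 / first_block + 32 / (first_block * second_block)

saving = plain_overhead - nested_overhead
print(f"plain:  {plain_overhead:.3f} bits/parameter")
print(f"nested: {nested_overhead:.3f} bits/parameter")
print(f"saving: {saving:.3f} bits/parameter")
```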

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# This cell might crash because of memory issues

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

# The "-hf" repository holds the weights in transformers format
model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf", quantization_config=double_quant_config)

In [None]:
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from mingpt.model import GPT  # requires the minGPT package to be installed

model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

# Build the model skeleton without allocating memory for the weights
with init_empty_weights():
    empty_model = GPT(model_config)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional
    bnb_4bit_use_double_quant=True,         # optional
    bnb_4bit_quant_type="nf4",              # optional
)

# Placeholder: set this to the folder or file that holds the GPT-2 XL weights
weights_location = "path/to/weights"

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)