<a href="https://www.kaggle.com/code/mortezaheidari/llm-quantization-using-auto-gptq?scriptVersionId=173580897" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers

## [PAPER](https://arxiv.org/pdf/2210.17323.pdf)

## GPTQ merges the name of the GPT model family with post-training quantization (PTQ)
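As a rough illustration of what post-training quantization does to a weight tensor, the sketch below applies plain round-to-nearest quantization with a single symmetric absmax scale. This is the simple baseline, not the GPTQ algorithm itself (GPTQ additionally uses approximate second-order information to compensate for rounding error). The function names `quantize_rtn` and `dequantize` are hypothetical, and the random vector only stands in for a real weight row.

```python
import numpy as np

def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization (illustrative baseline, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit signed integers
    scale = np.abs(weights).max() / qmax  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)  # stand-in for one row of a weight matrix
q, scale = quantize_rtn(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

With this scheme the reconstruction error of each weight is at most half a quantization step (`scale / 2`), which is the error that more careful methods try to reduce or redistribute.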

## The paper states that this technique makes it feasible to run inference on a 175-billion-parameter model on a single GPU.
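A back-of-the-envelope calculation helps show why bit width matters for the single-GPU claim. The sketch below counts weight storage only; it ignores activations, the KV cache, and the scales and zero points that quantized formats store alongside the weights, so real footprints are somewhat higher. The helper name `weight_memory_gb` is hypothetical, and the 80 GB figure mentioned afterwards is used only as an assumed reference point for a large accelerator.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 175e9  # parameter count of a 175B model
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):.1f} GB")
```

At 16 bits the weights alone need about 350 GB, while 3 or 4 bits per weight brings the figure into the range of one or two 80 GB devices.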

### Inference speed: compared to FP16 models, GPTQ models offer speed-ups of around 3.25x on high-end GPUs like the NVIDIA A100 and around 4.5x on more cost-effective ones like the NVIDIA A6000.

#### Models quantized with 4-bit GPTQ are compatible with ExLlama for GPU speed-up.

## Here we show how to quantize an LLM with AutoGPTQ using a calibration dataset (C4 in this example)


In [3]:
!pip install transformers
!pip install accelerate

# Required for GPTQ support
!pip install optimum
!pip install auto-gptq



## Many pre-quantized models are available on the Hugging Face Hub; you can load them and build a pipeline just as you would with a standard Hugging Face pipeline.
## Here is a simple example of loading one of these GPTQ-quantized models.
## In this case we load a Llama-2-7B-Chat model previously quantized with AutoGPTQ, as shown below:

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"

# The tokenizer holds no weights, so it needs neither torch_dtype nor device_map
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in 
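Once the quantized checkpoint is loaded, it can be used for generation like any other causal language model. The helper below is a minimal sketch that assumes the `model` and `tokenizer` objects from the cell above; the name `generate_reply` and the example prompt are hypothetical.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=64):
    """Tokenize `prompt`, generate a continuation, and decode it back to text."""
    # Move the encoded prompt to the same device as the model's weights.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; decode the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage, with `model` and `tokenizer` as loaded above:
# print(generate_reply(model, tokenizer, "What is post-training quantization?"))
```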

## To demonstrate how easily a model can be quantized with AutoGPTQ and the Transformers library, we use a streamlined variant of the AutoGPTQ interface available in Optimum, Hugging Face's toolkit for optimizing training and inference.
### Quantizing a model with AutoGPTQ can be time-consuming. For instance, a 175B model requires at least 4 GPU-hours, particularly with a large calibration dataset such as "c4".

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, calibrated on samples from the "c4" dataset
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)



Quantizing model.decoder.layers blocks :   0%|          | 0/12 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/6 [00:00<?, ?it/s]
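The calibration data does not have to be one of the named datasets. As a sketch, `GPTQConfig` also accepts a list of strings for `dataset`, which lets you calibrate on text closer to your own domain. The helper name `build_gptq_config` and the sample strings are hypothetical; `group_size=128` is passed explicitly here only to make that knob visible.

```python
def build_gptq_config(tokenizer, calibration_texts, bits=4, group_size=128):
    """Build a GPTQConfig that calibrates on caller-supplied text samples."""
    # Deferred import so this helper can be defined before transformers is loaded.
    from transformers import GPTQConfig

    return GPTQConfig(
        bits=bits,
        group_size=group_size,      # number of weights sharing one set of quantization parameters
        dataset=calibration_texts,  # a list of strings instead of a named dataset
        tokenizer=tokenizer,
    )

# Usage, reusing `tokenizer` and `model_id` from the cell above:
# config = build_gptq_config(tokenizer, ["Some domain-specific text.", "Another sample."])
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, device_map="auto", quantization_config=config
# )
```

A handful of strings is enough to exercise the code path, but a realistic calibration set would normally contain many more samples.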

## You can push your model to the Hugging Face Hub if you like

In [6]:
from huggingface_hub import notebook_login

notebook_login()


In [7]:
# model.push_to_hub("opt-125m-gptq-4bit")
# tokenizer.push_to_hub("opt-125m-gptq-4bit")
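If you would rather keep the quantized model on local disk than push it to the Hub, the same objects can be written with `save_pretrained`. This is a minimal sketch; the helper name `save_quantized` and the directory name are hypothetical.

```python
def save_quantized(model, tokenizer, output_dir="opt-125m-gptq-4bit"):
    """Write the quantized weights and the tokenizer files to `output_dir`."""
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

# Usage, with `model` and `tokenizer` from the quantization cell:
# path = save_quantized(model, tokenizer)
# reloaded = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
```

If the model was quantized with a `device_map`, you may need to move it onto a single device (for example with `model.to("cpu")`) before saving.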

## Here we show how to use the bitsandbytes library to quantize a model

### NF4 (4-bit NormalFloat) and Double Quantization are available through the bitsandbytes library, which is integrated into the transformers library. Here is an example of how to easily load and quantize any Hugging Face model:
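To see what Double Quantization saves, it helps to count the bits spent on quantization constants per parameter. The arithmetic below follows the block sizes described in the QLoRA paper (64 weights per block with a 32-bit constant, and a second level that stores those constants in 8 bits with blocks of 256); these values are assumptions taken from the paper, not read from the bitsandbytes source, and the helper name is hypothetical.

```python
def constant_overhead_bits(block_size=64, const_bits=32,
                           dq_const_bits=8, dq_block_size=256,
                           double_quant=False):
    """Bits of quantization-constant overhead per model parameter."""
    if not double_quant:
        # One full-precision constant per block of weights.
        return const_bits / block_size
    # First-level constants stored in fewer bits, plus one full-precision
    # second-level constant per group of first-level constants.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)

plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per parameter, on top of the 4 bits used for each weight.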

In [1]:
!pip install accelerate
!pip install bitsandbytes



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "PY007/TinyLlama-1.1B-step-50K-105b"

# The tokenizer is not quantized, so it does not take a quantization_config
tokenizer_nf4 = AutoTokenizer.from_pretrained(model_name)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]
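After loading, one quick sanity check is to ask the model how much memory its parameters and buffers occupy. The sketch below assumes the `model_nf4` object from the cell above and uses the `get_memory_footprint()` method that Transformers models expose; the helper name `footprint_gib` is hypothetical.

```python
def footprint_gib(model):
    """Return the model's reported memory footprint in GiB."""
    # get_memory_footprint() reports the size in bytes.
    return model.get_memory_footprint() / 1024 ** 3

# Usage, with `model_nf4` as loaded above:
# print(f"NF4 model footprint: {footprint_gib(model_nf4):.2f} GiB")
```

Comparing this number against the same model loaded without `quantization_config` gives a direct view of the savings on your hardware.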