## GPTQModel Pipeline

### Install GPTQModel

In [2]:
# clone GPTQModel repo
!git clone --depth 1 --branch v0.9.9 https://github.com/ModelCloud/GPTQModel.git

# compile and install GPTQModel
!cd GPTQModel && pip install --no-build-isolation .

Cloning into 'GPTQModel'...
remote: Enumerating objects: 210, done.
remote: Counting objects: 100% (210/210), done.
remote: Compressing objects: 100% (176/176), done.
remote: Total 210 (delta 35), reused 113 (delta 28), pack-reused 0
Receiving objects: 100% (210/210), 200.96 KiB | 1.07 MiB/s, done.
Resolving deltas: 100% (35/35), done.
Note: switching to '519fbe3ef02335c58e3aa8e9353f8346a8780b91'.

Processing /content/GPTQModel
  Preparing metadata (setup.py) ...

### Simple GPTQ Quantization

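This cell quantizes `microsoft/Phi-3-mini-128k-instruct` to 4-bit with GPTQ, using the WikiText2 training split as calibration data, then runs a short generation and reports perplexity for the quantized model.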
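To make the `bits=4` and `group_size=128` settings in the cell below more concrete, here is a minimal pure-Python sketch of group-wise symmetric quantization. It uses plain round-to-nearest with a signed integer range, which is a simplification chosen for illustration: it is not GPTQ's error-compensating update and not GPTQModel's actual kernel or storage format. The helper names (`quantize_group_symmetric`, `quantize_row`, `dequantize_row`) are hypothetical.

```python
def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit; codes live in [-8, 7]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(weights, bits=4, group_size=128):
    """Split a weight row into groups; each group gets its own scale."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        groups.append(quantize_group_symmetric(group, bits))
    return groups


def dequantize_row(groups):
    """Map integer codes back to approximate float weights."""
    return [code * scale for codes, scale in groups for code in codes]


# Synthetic weights in [-1, 1]; 256 values give two groups of 128.
row = [0.02 * ((i * 37) % 101 - 50) for i in range(256)]
groups = quantize_row(row, bits=4, group_size=128)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

A smaller group size means more scales to store and apply but a tighter fit per group, which is the speed versus quality trade-off the `group_size` comment in the cell refers to.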

In [3]:
import torch
import logging
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset


pretrained_model_id = "microsoft/Phi-3-mini-128k-instruct"
quantized_model_id = "Phi-3-mini-128k-instruct-4bit-128g"


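def get_wikitext2(tokenizer, nsamples, seqlen):
    # keep only rows whose text is at least `seqlen` characters long
    # (note: this compares character count, not token count)
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train").filter(
        lambda x: len(x["text"]) >= seqlen
    )

    # tokenize the first `nsamples` rows to build the calibration set
    return [tokenizer(example["text"]) for example in traindata.select(range(nsamples))]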


@torch.no_grad()
def calculate_avg_ppl(model, tokenizer):
    from gptqmodel.utils import Perplexity

    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path="wikitext",
        dataset_name="wikitext-2-raw-v1",
        split="train",
        text_column="text",
    )

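    # n_ctx is the context size; n_batch is the batch size
    # (named ppl_values to avoid shadowing the built-in `all`)
    ppl_values = ppl.calculate(n_ctx=128, n_batch=128)

    # average of the values returned by calculate()
    avg = sum(ppl_values) / len(ppl_values)

    return avg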


def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)

    print("Loading WikiText2 training data...")
    train_dataset = get_wikitext2(tokenizer, nsamples=512, seqlen=1024)
    print("Completed loading of WikiText2 training data!")

    quantize_config = QuantizeConfig(
        # quantize the model to 4-bit
        bits=4,
        # a group size of 128 offers a good balance between inference speed and quantization quality
        group_size=128,  # 128 is the recommended value
        # if NaN is encountered during `.quantize()`, increase damp_percent and/or the calibration dataset size
        damp_percent=0.01,
        desc_act=True,
        static_groups=False,
        sym=True,
        true_sequential=True,
        lm_head=False,
        # "gptq" also covers marlin, vLLM's preferred GPTQ quantization method
        quant_method="gptq",
    )

    # load the un-quantized model; it is always force-loaded onto the CPU
    model = GPTQModel.from_pretrained(pretrained_model_id, quantize_config)

    print("Beginning quantization...")
    # quantize the model; calibration_dataset must be a list of dicts whose only keys are
    # "input_ids" and "attention_mask", with values of type torch.LongTensor
    model.quantize(train_dataset)
    print("Quantization complete!")

    print("Saving quantized model...")
    # save quantized model
    model.save_quantized(quantized_model_id)
    # save quantized model using safetensors
    model.save_quantized(quantized_model_id, use_safetensors=True)
    print("Saving quantized model complete!")

    # load the quantized model; currently only CPU or a single GPU is supported
    model = GPTQModel.from_quantized(quantized_model_id, device="cuda:0")

    # inference with model.generate
    print(
        tokenizer.decode(
            model.generate(
                **tokenizer("What is the capital of Jamaica?", return_tensors="pt").to(
                    "cuda:0"
                )
            )[0]
        )
    )

    print(
        f"Quantized Model {quantized_model_id} avg PPL is {calculate_avg_ppl(model, tokenizer)}"
    )

# set logging configuration for GPTQModel
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)

# execute main method
main()

Loading WikiText2 training data...
Completed loading of WikiText2 training data!
Beginning quantization...

You are not running the flash-attention implementation, expect numerical differences.
INFO - {'layer': 1, 'module': 'self_attn.qkv_proj', 'avg_loss': '0.2672', 'time': '2.3676'}
INFO - {'layer': 1, 'module': 'self_attn.o_proj', 'avg_loss': '0.0004', 'time': '1.2133'}
INFO - {'layer': 1, 'module': 'mlp.gate_up_proj', 'avg_loss': '0.1119', 'time': '1.2600'}
INFO - {'layer': 1, 'module': 'mlp.down_proj', 'avg_loss': '0.0023', 'time': '3.4561'}
Quantizing self_attn.qkv_proj in layer 2 of 32:   3%|▎         | 1/32 [00:26<12:42, 24.61s/it]
INFO - {'layer': 2, 'module': 'self_attn.qkv_proj', 'avg_

Quantization complete!
Saving quantized model...
Saving quantized model complete!


INFO - Compatibility: converting `checkpoint_format` from `gptq` to `gptq_v2`.


What is the capital of Jamaica?

# Answer: The capital of Jamaica


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Perplexity: 10.1343: 100%|██████████| 1875/1875 [03:47<00:00,  8.23it/s]

Quantized Model Phi-3-mini-128k-instruct-4bit-128g avg PPL is 10.257711216714144
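
The progress bar above ends at 10.1343 while the reported average is 10.2577. One possible explanation, assuming `Perplexity.calculate` returns a list of running perplexity values (an assumption about the library that this notebook does not confirm), is that `calculate_avg_ppl` averages the running values rather than taking the final one. The sketch below uses made-up per-token negative log-likelihoods and hypothetical helper names to show that perplexity is the exponential of the mean negative log-likelihood, and that the mean of running values generally differs from the final value.

```python
import math


def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def running_perplexities(batches):
    """Perplexity over all tokens seen so far, recorded after each batch."""
    total_nll = 0.0
    total_tokens = 0
    values = []
    for batch in batches:
        total_nll += sum(batch)
        total_tokens += len(batch)
        values.append(math.exp(total_nll / total_tokens))
    return values


# Made-up per-token NLLs (in nats) for three small batches.
batches = [[2.0, 2.4], [2.2, 2.6], [2.1, 2.3]]
running = running_perplexities(batches)

final_ppl = running[-1]
mean_of_running = sum(running) / len(running)
print(f"final: {final_ppl:.4f}, mean of running values: {mean_of_running:.4f}")
```

If the final corpus-level figure is what you want to report, it may be worth checking what `calculate` actually returns in the installed GPTQModel version before choosing between the last value and the average.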



