## LLM Compressor Workbench -- Getting Started

This notebook will demonstrate how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on the [opendatahub/llmcompressor-workbench](https://quay.io/repository/opendatahub/llmcompressor-workbench) image.

We will show how a user can compress and evaluate a Large Language Model, first without data and then with a calibration dataset.

The notebook will detect if a GPU is available. If one is not available, it will demonstrate an abbreviated run, so users without GPU access can still get a feel for `llm-compressor`.


<div class="alert alert-block alert-info">
<b>Note:</b> If you are not using the Workbench image, make sure lm_eval>=0.4.8 and llmcompressor>=0.5.1 are installed.
</div>
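The version requirement in the note can be checked before running anything else. The following is a minimal sketch using only the standard library; the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are hypothetical, the distribution names are assumed to match the names in the note, and the comparison only looks at the leading numeric release components, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in the note above.
MINIMUMS = {"lm_eval": "0.4.8", "llmcompressor": "0.5.1"}


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '0.5.1.dev3' -> (0, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare release tuples; a simplification that ignores pre-release tags."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Map each package name to a short, human-readable status string."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        report[name] = f"{installed} ({status}, need >= {minimum})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```

For stricter comparisons that understand pre-releases, a dedicated version-parsing library would be a better fit than this tuple comparison.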

### 1\) Data-Free Model Compression

In [1]:
!pip install llmcompressor lm-eval vllm --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
codeflare-sdk 0.26.0 requires pydantic<2, but you have pydantic 2.11.7 which is incompatible.
codeflare-sdk 0.26.0 requires ray[data,default]==2.35.0, but you have ray 2.47.0 which is incompatible.
kfp 2.9.0 requires protobuf<5,>=4.21.1, but you have protobuf 5.29.5 which is incompatible.
kfp-kubernetes 1.4.0 requires protobuf<5,>=4.21.1, but you have protobuf 5.29.5 which is incompatible.
kfp-pipeline-spec 0.4.0 requires protobuf<5,>=4.21.1, but you have protobuf 5.29.5 which is incompatible.

[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


In [2]:
# Pin transformers to a specific version (the unpinned upgrade is kept commented out for reference)
# !pip install -qU transformers --quiet
!pip install transformers==4.51.3 --quiet

In [3]:
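from transformers import modeling_utils

# Workaround: make sure ALL_PARALLEL_STYLES is defined before any model is loaded.
# If the attribute is missing or None, fall back to an explicit list of styles.
if getattr(modeling_utils, "ALL_PARALLEL_STYLES", None) is None:
    modeling_utils.ALL_PARALLEL_STYLES = ["tp", "none", "colwise", "rowwise"]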

In [4]:
import torch
use_gpu = torch.cuda.is_available()
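The `use_gpu` flag decides between the full run and the abbreviated one throughout the rest of the notebook. As a rough summary, the sketch below gathers the GPU-dependent values used in the later cells into one place. `RunSettings` and `choose_run_settings` are hypothetical names introduced only for illustration; the notebook itself sets these values inline.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RunSettings:
    eval_backend: str                  # lm_eval model backend
    eval_batch_size: Union[str, int]   # "auto" or a fixed batch size
    eval_limit: Optional[int]          # None means evaluate every document
    num_calibration_samples: int       # samples used for calibrated compression
    max_sequence_length: int           # token limit per calibration sample


def choose_run_settings(use_gpu: bool) -> RunSettings:
    """Return the full settings on GPU, or the abbreviated ones on CPU."""
    if use_gpu:
        return RunSettings("vllm", "auto", None, 512, 2048)
    return RunSettings("hf", 4, 4, 4, 16)


print(choose_run_settings(use_gpu=False))
```

The CPU values are deliberately tiny, so results from an abbreviated run show the workflow rather than meaningful accuracy numbers.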

In [5]:
from llmcompressor.modifiers.quantization import QuantizationModifier

# model to compress
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# This recipe will quantize all Linear layers except those in the `lm_head`,
#  which is often sensitive to quantization. The W4A16 scheme compresses
#  weights to 4-bit integers while retaining 16-bit activations.
recipe = QuantizationModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head"]
)
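To make "4-bit weights" more concrete, here is a small pure-Python sketch of symmetric round-to-nearest quantization with one scale per group of weights. It is an illustration only, not the code path `llm-compressor` uses: the function names are hypothetical, the group size of 4 is chosen for readability, and the real scheme's group size, symmetry, and packing format are assumptions not stated in this notebook.

```python
def quantize_group(weights, num_bits=4):
    """Map floats to signed integer codes that share one scale."""
    qmin = -(2 ** (num_bits - 1))      # -8 for 4 bits
    qmax = 2 ** (num_bits - 1) - 1     # +7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [code * scale for code in codes]


def quantize_row(row, group_size):
    """Quantize a row of weights group by group."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


row = [1.75, -0.5, 0.25, 0.3, 0.07, -0.02, 0.01, 0.0]
for codes, scale in quantize_row(row, group_size=4):
    restored = dequantize_group(codes, scale)
    print(f"scale={scale:.4f} codes={codes} restored={restored}")
```

Giving each group its own scale keeps small weights from being flattened to zero by a large outlier elsewhere in the row, which is the usual motivation for group-wise schemes.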

In [6]:
# Load the model and tokenizer with the Hugging Face Transformers API
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In [7]:
# Run compression using `oneshot`
from llmcompressor import oneshot

model = oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

2025-06-17T06:51:06.156422+0000 | reset | INFO - Compression lifecycle reset
2025-06-17T06:51:06.157677+0000 | from_modifiers | INFO - Creating recipe from modifiers


manager stage: Modifiers initialized


2025-06-17T06:51:07.647286+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers


manager stage: Modifiers finalized


2025-06-17T06:51:07.648231+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers


In [8]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);

2025-06-17T06:51:07.652938+0000 | save_pretrained_wrapper | INFO - Fetching state_dict - this may take some time
2025-06-17T06:51:09.191813+0000 | save_pretrained_wrapper | INFO - Fetching compressor
2025-06-17T06:51:09.192585+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Quantized Compression: 100%|██████████| 509/509 [00:04<00:00, 110.76it/s]

2025-06-17T06:51:13.793994+0000 | save_pretrained_wrapper | INFO - Saving compressed model to disk
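A quick way to see what the compression bought is to measure the saved checkpoint on disk. The sketch below uses only the standard library; `directory_size_bytes` and `format_gib` are hypothetical helper names, and the path is the directory the save cell writes to for the model used here.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def format_gib(num_bytes):
    """Render a byte count in GiB with three decimals."""
    return f"{num_bytes / 2**30:.3f} GiB"


# Directory written by the save cell for this model.
model_dir = "./TinyLlama-1.1B-Chat-v1.0-W4A16"
if os.path.isdir(model_dir):
    print(model_dir, format_gib(directory_size_bytes(model_dir)))
else:
    print(f"{model_dir} not found; run the save cell first")
```

Running the same helper on an uncompressed copy of the model gives a direct before-and-after comparison.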





### 2\) Evaluate compressed model using open-source `lm_eval` framework

We will evaluate the compressed model on the [`wikitext`](https://paperswithcode.com/dataset/wikitext-2) language modeling dataset. All reported metrics are perplexity-style scores, so lower is better.

In [9]:
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="vllm" if use_gpu else "hf",
    model_args={
        "pretrained": model_dir,
        "add_bos_token": True,
        "device": "auto"
    },
    tasks=["wikitext"],
    batch_size="auto" if use_gpu else 4,
    limit=None if use_gpu else 4,
)

INFO 06-17 06:51:20 [__init__.py:243] Automatically detected platform cuda.
INFO 06-17 06:51:22 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-17 06:51:22 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-17 06:51:22 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-17 06:51:31 [config.py:793] This model supports multiple tasks: {'score', 'embed', 'generate', 'reward', 'classify'}. Defaulting to 'generate'.




INFO 06-17 06:51:32 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-17 06:51:37 [__init__.py:243] Automatically detected platform cuda.
INFO 06-17 06:51:40 [core.py:438] Waiting for init message from front-end.
INFO 06-17 06:51:40 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-17 06:51:40 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-17 06:51:40 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-17 06:51:40 [core.py:65] Initializing a V1 LLM engine (v0.9.0.1) with config: model='./TinyLlama-1.1B-Chat-v1.0-W4A16', speculative_config=None, tokenizer='./TinyLlama-1.1B-Chat-v1.0-W4A16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.73it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.72it/s]



INFO 06-17 06:51:41 [default_loader.py:280] Loading weights took 0.18 seconds
INFO 06-17 06:51:42 [gpu_model_runner.py:1549] Model loading took 0.7432 GiB and 0.398142 seconds
INFO 06-17 06:51:48 [backends.py:459] Using cache directory: /opt/app-root/src/.cache/vllm/torch_compile_cache/e527347c1c/rank_0_0 for vLLM's torch.compile
INFO 06-17 06:51:48 [backends.py:469] Dynamo bytecode transform time: 6.75 s
INFO 06-17 06:51:51 [backends.py:158] Cache the graph of shape None for later use
INFO 06-17 06:52:13 [backends.py:170] Compiling a graph for general shape takes 24.19 s
INFO 06-17 06:52:25 [monitor.py:33] torch.compile takes 30.93 s in total
INFO 06-17 06:52:26 [kv_cache_utils.py:637] GPU KV cache size: 839,248 tokens
INFO 06-17 06:52:26 [kv_cache_utils.py:640] Maximum concurrency for 2,048 tokens per request: 409.79x
INFO 06-17 06:52:49 [gpu_model_runner.py:1933] Graph capturing finished in 23 secs, took 0.37 GiB
INFO 06-17 06:52:49 [core.py:167] init engine (profile, create kv cach

[Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
[Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False


README.md:   0%|          | 0.00/8.76k [00:00<?, ?B/s]

wikitext-2-raw-v1/wikitext-2-raw-v1-trai(…):   0%|          | 0.00/6.18M [00:00<?, ?B/s]

wikitext-2-raw-v1/wikitext-2-raw-v1-vali(…):   0%|          | 0.00/641k [00:00<?, ?B/s]

wikitext-2-raw-v1/wikitext-2-raw-v1-test(…):   0%|          | 0.00/715k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/629 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/60 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/62 [00:00<?, ? examples/s]

100%|██████████| 62/62 [00:00<00:00, 724.90it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (5945 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 62/62 [00:00<00:00, 101.30it/s]
Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 10.72it/s]
Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 10.88it/s]
Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 11.01it/s]


In [10]:
print(make_table(results))

| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 0.7583|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 1.6916|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |16.6245|±  |   N/A|
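The three rows in the table are closely related. Assuming the usual definitions (an assumption here, since the notebook does not spell them out), each is the same total negative log-likelihood normalized differently: per byte in bits, per byte as a perplexity, and per word as a perplexity. Under that assumption, `byte_perplexity` should equal `2 ** bits_per_byte`, and the ratio of the log-perplexities gives the average number of bytes per word. The short sketch below checks this against the values printed above; the function names are hypothetical.

```python
import math

# Values copied from the results table above.
bits_per_byte = 0.7583
byte_perplexity = 1.6916
word_perplexity = 16.6245


def byte_perplexity_from_bits(bpb):
    """Convert bits per byte into a per-byte perplexity."""
    return 2.0 ** bpb


def implied_bytes_per_word(word_ppl, byte_ppl):
    """Average bytes per word implied by the two perplexities."""
    return math.log(word_ppl) / math.log(byte_ppl)


print(f"2 ** bits_per_byte     = {byte_perplexity_from_bits(bits_per_byte):.4f}")
print(f"reported byte_perplexity = {byte_perplexity:.4f}")
print(f"implied bytes per word = {implied_bytes_per_word(word_perplexity, byte_perplexity):.2f}")
```

Because the metrics carry the same information, comparing any one of them between the data-free and calibrated models is enough, as long as the same metric is used on both sides.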



### 3\) Calibrated Compression with a Dataset

Some more advanced compression algorithms require a small calibration dataset: a set of samples meant to be a representative random subset of the text the model will see at inference.

We will show how the previous section can be extended with a calibration dataset and GPTQ, one of the first published LLM compression algorithms.
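Why does calibration data help? GPTQ, as described in its paper, aims to keep each layer's *output* close to the original on calibration inputs, rather than keeping each weight individually close. The toy sketch below is not the GPTQ algorithm; it only evaluates that kind of objective for two hand-picked candidate roundings, with hypothetical function names and made-up numbers, to show that candidates with the same weight error can differ greatly in output error.

```python
def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def weight_error(weight, candidate):
    """Squared error measured directly on the weights."""
    return sum(
        (w - c) ** 2
        for row_w, row_c in zip(weight, candidate)
        for w, c in zip(row_w, row_c)
    )


def output_error(weight, candidate, calibration_inputs):
    """Squared error of the layer outputs over the calibration inputs."""
    total = 0.0
    for x in calibration_inputs:
        original = matvec(weight, x)
        approx = matvec(candidate, x)
        total += sum((a - b) ** 2 for a, b in zip(original, approx))
    return total


weight = [[0.3, 0.3]]            # both weights sit halfway between grid points
round_both_down = [[0.2, 0.2]]   # same per-weight error as the next candidate
round_opposite = [[0.2, 0.4]]    # rounding errors cancel on correlated inputs
calibration_inputs = [[1.0, 1.0], [2.0, 2.0]]

for name, candidate in [("both down", round_both_down), ("opposite", round_opposite)]:
    print(
        name,
        f"weight_error={weight_error(weight, candidate):.4f}",
        f"output_error={output_error(weight, candidate, calibration_inputs):.4f}",
    )
```

Without representative inputs there is no way to tell the two candidates apart, which is why a calibration-based method needs a dataset while plain round-to-nearest does not.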

<div class="alert alert-block alert-info">
<b>Note:</b> This will take several minutes if no GPU is available.
</div>

In [11]:
# We will use a new recipe running GPTQ (https://arxiv.org/abs/2210.17323)
# to reduce error caused by quantization. GPTQ requires a calibration dataset.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

In [12]:
from datasets import load_dataset

# Create the calibration dataset using the Hugging Face datasets API
dataset_id = "HuggingFaceH4/ultrachat_200k"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
num_calibration_samples = 512 if use_gpu else 4
max_sequence_length = 2048 if use_gpu else 16

# Load dataset
ds = load_dataset(dataset_id, split="train_sft")
# Shuffle and grab only the number of samples we need
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))

# Apply the chat template, then tokenize into the format the model expects
def preprocess(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

Map:   0%|          | 0/512 [00:00<?, ? examples/s]
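After tokenization it can be useful to confirm how long the calibration samples are and how many were cut off at the length limit. The sketch below assumes each sample is a dict with an `input_ids` list, which is what the tokenizer call in `preprocess` produces; `summarize_lengths` is a hypothetical helper, and the sample data is fabricated purely to keep the example self-contained.

```python
def summarize_lengths(samples, max_length):
    """Summarize token counts for an iterable of tokenized samples."""
    lengths = [len(sample["input_ids"]) for sample in samples]
    if not lengths:
        raise ValueError("no samples to summarize")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        # Samples at the limit were most likely truncated.
        "at_limit": sum(1 for n in lengths if n >= max_length),
    }


# Stand-in data; in the notebook this would be the mapped dataset.
fake_samples = [{"input_ids": list(range(n))} for n in (5, 16, 16, 9)]
print(summarize_lengths(fake_samples, max_length=16))
```

In the notebook, the equivalent call would be `summarize_lengths(ds, max_sequence_length)`. A high `at_limit` count in the CPU run is expected, since the limit there is only 16 tokens.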

In [13]:
# oneshot modifies the model in place, so reload the original weights first
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
# Run oneshot again, this time with the calibration dataset
model = oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_sequence_length,
    num_calibration_samples=num_calibration_samples,
)

2025-06-17T06:53:28.626297+0000 | reset | INFO - Compression lifecycle reset
2025-06-17T06:53:28.627428+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-06-17T06:53:28.629511+0000 | _build_quant_modifier | INFO - Building quantization modifier with args: {'targets': 'Linear', 'scheme': 'W4A16', 'ignore': ['lm_head']}
2025-06-17T06:53:28.678451+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.


Preparing intermediates cache: 100%|██████████| 512/512 [00:01<00:00, 411.53it/s]
(1/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 68.21it/s]

2025-06-17T06:53:38.555433+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples





2025-06-17T06:53:39.554321+0000 | compress | METRIC - time 1.00s
2025-06-17T06:53:39.555133+0000 | compress | METRIC - error 611.05
2025-06-17T06:53:39.556127+0000 | compress | METRIC - GPU 0 | usage: 10.51% | total memory: 24 GB
2025-06-17T06:53:39.556507+0000 | compress | METRIC - GPU 1 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:53:39.556886+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:53:39.557254+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:53:39.557638+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:53:39.558644+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2025-06-17T06:53:40.514340+0000 | compress | METRIC - time 0.96s
2025-06-17T06:53:40.514954+0000 | compress | METRIC - error 595.48
2025-06-17T06:53:40.515827+0000 | compress | METRIC - GPU 0 | usage: 10.51% | total memory: 24 GB
2025-06-17T06:53:40.516178+0000

(1/23): Propagating: 100%|██████████| 512/512 [00:03<00:00, 157.42it/s]
(2/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.11it/s]

2025-06-17T06:53:57.825571+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples





2025-06-17T06:53:58.784352+0000 | compress | METRIC - time 0.96s
2025-06-17T06:53:58.785354+0000 | compress | METRIC - error 1015.92
2025-06-17T06:53:58.786185+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:53:58.786566+0000 | compress | METRIC - GPU 1 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:53:58.786964+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:53:58.787324+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:53:58.787708+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:53:58.788761+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
2025-06-17T06:53:59.764259+0000 | compress | METRIC - time 0.98s
2025-06-17T06:53:59.765235+0000 | compress | METRIC - error 806.08
2025-06-17T06:53:59.766059+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:53:59.766529+000

(2/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 181.81it/s]
(3/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.15it/s]

2025-06-17T06:54:16.583009+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.2.self_attn.q_proj using 512 samples





2025-06-17T06:54:17.554434+0000 | compress | METRIC - time 0.97s
2025-06-17T06:54:17.555433+0000 | compress | METRIC - error 821.41
2025-06-17T06:54:17.556193+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:17.556564+0000 | compress | METRIC - GPU 1 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:54:17.556940+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:54:17.557316+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:54:17.557684+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:54:17.558715+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.2.self_attn.k_proj using 512 samples
2025-06-17T06:54:18.532922+0000 | compress | METRIC - time 0.97s
2025-06-17T06:54:18.533952+0000 | compress | METRIC - error 448.06
2025-06-17T06:54:18.534795+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:18.535170+0000

(3/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 186.32it/s]
(4/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.10it/s]

2025-06-17T06:54:35.207682+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.3.self_attn.q_proj using 512 samples





2025-06-17T06:54:36.206409+0000 | compress | METRIC - time 1.00s
2025-06-17T06:54:36.207452+0000 | compress | METRIC - error 1755.28
2025-06-17T06:54:36.208233+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:36.208626+0000 | compress | METRIC - GPU 1 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:54:36.209007+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:54:36.209417+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:54:36.209830+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:54:36.210867+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.3.self_attn.k_proj using 512 samples
2025-06-17T06:54:37.188087+0000 | compress | METRIC - time 0.98s
2025-06-17T06:54:37.189143+0000 | compress | METRIC - error 719.07
2025-06-17T06:54:37.189940+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:37.190328+000

(4/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 186.53it/s]
(5/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.82it/s]

2025-06-17T06:54:53.986057+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.4.self_attn.q_proj using 512 samples





2025-06-17T06:54:55.064482+0000 | compress | METRIC - time 1.08s
2025-06-17T06:54:55.065476+0000 | compress | METRIC - error 3198.23
2025-06-17T06:54:55.066303+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:55.066675+0000 | compress | METRIC - GPU 1 | usage: 11.64% | total memory: 24 GB
2025-06-17T06:54:55.067043+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:54:55.067430+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:54:55.067785+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:54:55.068820+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.4.self_attn.k_proj using 512 samples
2025-06-17T06:54:56.079398+0000 | compress | METRIC - time 1.01s
2025-06-17T06:54:56.080111+0000 | compress | METRIC - error 1603.61
2025-06-17T06:54:56.080861+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:54:56.081261+0

(5/23): Propagating: 100%|██████████| 512/512 [00:03<00:00, 153.02it/s]
(6/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.03it/s]

2025-06-17T06:55:13.740835+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.5.self_attn.q_proj using 512 samples





2025-06-17T06:55:14.756230+0000 | compress | METRIC - time 1.01s
2025-06-17T06:55:14.757091+0000 | compress | METRIC - error 2363.78
2025-06-17T06:55:14.757864+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:14.758232+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:55:14.758641+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:55:14.758984+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:55:14.759397+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:55:14.760349+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.5.self_attn.k_proj using 512 samples
2025-06-17T06:55:15.741053+0000 | compress | METRIC - time 0.98s
2025-06-17T06:55:15.741988+0000 | compress | METRIC - error 1085.41
2025-06-17T06:55:15.742861+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:15.743237+0

(6/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.40it/s]
(7/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.09it/s]

2025-06-17T06:55:32.891920+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.6.self_attn.q_proj using 512 samples





2025-06-17T06:55:33.919240+0000 | compress | METRIC - time 1.03s
2025-06-17T06:55:33.920232+0000 | compress | METRIC - error 2844.27
2025-06-17T06:55:33.921098+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:33.921487+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:55:33.921884+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:55:33.922261+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:55:33.922663+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:55:33.923716+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.6.self_attn.k_proj using 512 samples
2025-06-17T06:55:34.917712+0000 | compress | METRIC - time 0.99s
2025-06-17T06:55:34.918578+0000 | compress | METRIC - error 1213.37
2025-06-17T06:55:34.919303+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:34.919695+0

(7/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.78it/s]
(8/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.01it/s]

2025-06-17T06:55:52.142098+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.7.self_attn.q_proj using 512 samples





2025-06-17T06:55:53.173111+0000 | compress | METRIC - time 1.03s
2025-06-17T06:55:53.173953+0000 | compress | METRIC - error 3403.51
2025-06-17T06:55:53.174902+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:53.175255+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:55:53.175694+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:55:53.176072+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:55:53.176474+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:55:53.177568+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.7.self_attn.k_proj using 512 samples
2025-06-17T06:55:54.200499+0000 | compress | METRIC - time 1.02s
2025-06-17T06:55:54.201585+0000 | compress | METRIC - error 1184.01
2025-06-17T06:55:54.202484+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:55:54.202869+0

(8/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.53it/s]
(9/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.02it/s]

2025-06-17T06:56:11.282649+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.8.self_attn.q_proj using 512 samples





2025-06-17T06:56:12.276456+0000 | compress | METRIC - time 0.99s
2025-06-17T06:56:12.277149+0000 | compress | METRIC - error 5764.24
2025-06-17T06:56:12.277995+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:12.278337+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:56:12.278732+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:56:12.279095+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:56:12.279507+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:56:12.280592+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.8.self_attn.k_proj using 512 samples
2025-06-17T06:56:13.301206+0000 | compress | METRIC - time 1.02s
2025-06-17T06:56:13.302082+0000 | compress | METRIC - error 2473.94
2025-06-17T06:56:13.302975+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:13.303309+0

(9/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.56it/s]
(10/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.03it/s]

2025-06-17T06:56:30.372193+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.9.self_attn.q_proj using 512 samples





2025-06-17T06:56:31.398917+0000 | compress | METRIC - time 1.03s
2025-06-17T06:56:31.399822+0000 | compress | METRIC - error 3168.07
2025-06-17T06:56:31.400719+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:31.401090+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:56:31.401513+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:56:31.401927+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:56:31.402333+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:56:31.403445+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.9.self_attn.k_proj using 512 samples
2025-06-17T06:56:32.424177+0000 | compress | METRIC - time 1.02s
2025-06-17T06:56:32.425235+0000 | compress | METRIC - error 1293.86
2025-06-17T06:56:32.712754+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:32.713358+0

(10/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.27it/s]
(11/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 70.05it/s]

2025-06-17T06:56:49.794247+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.10.self_attn.q_proj using 512 samples





2025-06-17T06:56:50.817929+0000 | compress | METRIC - time 1.02s
2025-06-17T06:56:50.818755+0000 | compress | METRIC - error 3467.86
2025-06-17T06:56:50.819589+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:50.819939+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:56:50.820309+0000 | compress | METRIC - GPU 2 | usage: 8.54% | total memory: 24 GB
2025-06-17T06:56:50.820681+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:56:50.821055+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:56:50.822075+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.10.self_attn.k_proj using 512 samples
2025-06-17T06:56:51.812620+0000 | compress | METRIC - time 0.99s
2025-06-17T06:56:51.813594+0000 | compress | METRIC - error 1482.10
2025-06-17T06:56:51.814392+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:56:51.814746+

(11/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.75it/s]
(12/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.39it/s]

2025-06-17T06:57:09.005804+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.11.self_attn.q_proj using 512 samples





2025-06-17T06:57:10.083830+0000 | compress | METRIC - time 1.08s
2025-06-17T06:57:10.084763+0000 | compress | METRIC - error 4878.73
2025-06-17T06:57:10.085578+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:10.085938+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:57:10.086350+0000 | compress | METRIC - GPU 2 | usage: 11.64% | total memory: 24 GB
2025-06-17T06:57:10.086705+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:57:10.087099+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:57:10.088143+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.11.self_attn.k_proj using 512 samples
2025-06-17T06:57:11.122026+0000 | compress | METRIC - time 1.03s
2025-06-17T06:57:11.122845+0000 | compress | METRIC - error 1753.81
2025-06-17T06:57:11.123632+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:11.124012

(12/23): Propagating: 100%|██████████| 512/512 [00:03<00:00, 153.83it/s]
(13/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.93it/s]

2025-06-17T06:57:28.777241+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.12.self_attn.q_proj using 512 samples





2025-06-17T06:57:29.776448+0000 | compress | METRIC - time 1.00s
2025-06-17T06:57:29.777253+0000 | compress | METRIC - error 3679.62
2025-06-17T06:57:29.778084+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:29.778485+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:57:29.778900+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:57:29.779279+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:57:29.779672+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:57:29.780752+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.12.self_attn.k_proj using 512 samples
2025-06-17T06:57:30.760323+0000 | compress | METRIC - time 0.98s
2025-06-17T06:57:30.761238+0000 | compress | METRIC - error 1545.25
2025-06-17T06:57:30.762073+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:30.762455

(13/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.12it/s]
(14/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.93it/s]

2025-06-17T06:57:47.855152+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples





2025-06-17T06:57:48.899590+0000 | compress | METRIC - time 1.04s
2025-06-17T06:57:48.900792+0000 | compress | METRIC - error 4123.42
2025-06-17T06:57:48.901693+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:48.902053+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:57:48.902463+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:57:48.902832+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:57:48.903218+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:57:48.904196+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2025-06-17T06:57:49.930100+0000 | compress | METRIC - time 1.03s
2025-06-17T06:57:49.930962+0000 | compress | METRIC - error 1765.75
2025-06-17T06:57:49.931756+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:57:49.932108

(14/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.26it/s]
(15/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.89it/s]

2025-06-17T06:58:07.141210+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.14.self_attn.q_proj using 512 samples





2025-06-17T06:58:08.206459+0000 | compress | METRIC - time 1.06s
2025-06-17T06:58:08.207604+0000 | compress | METRIC - error 3887.89
2025-06-17T06:58:08.208806+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:08.209294+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:08.209843+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:08.210453+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:58:08.211006+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:58:08.212651+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.14.self_attn.k_proj using 512 samples
2025-06-17T06:58:09.213945+0000 | compress | METRIC - time 1.00s
2025-06-17T06:58:09.214922+0000 | compress | METRIC - error 1726.07
2025-06-17T06:58:09.215817+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:09.216191

(15/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.01it/s]
(16/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.85it/s]

2025-06-17T06:58:26.324861+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.15.self_attn.q_proj using 512 samples





2025-06-17T06:58:27.357492+0000 | compress | METRIC - time 1.03s
2025-06-17T06:58:27.358444+0000 | compress | METRIC - error 5661.25
2025-06-17T06:58:27.359191+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:27.359660+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:27.360011+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:27.360447+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:58:27.360906+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:58:27.361956+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.15.self_attn.k_proj using 512 samples
2025-06-17T06:58:28.338677+0000 | compress | METRIC - time 0.98s
2025-06-17T06:58:28.339622+0000 | compress | METRIC - error 1905.40
2025-06-17T06:58:28.340421+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:28.340762

(16/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.51it/s]
(17/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.90it/s]

2025-06-17T06:58:45.560971+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.16.self_attn.q_proj using 512 samples





2025-06-17T06:58:46.570000+0000 | compress | METRIC - time 1.01s
2025-06-17T06:58:46.570951+0000 | compress | METRIC - error 5774.99
2025-06-17T06:58:46.571904+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:46.572316+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:46.572890+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:58:46.573428+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:58:46.573848+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:58:46.574932+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.16.self_attn.k_proj using 512 samples
2025-06-17T06:58:47.583294+0000 | compress | METRIC - time 1.01s
2025-06-17T06:58:47.584253+0000 | compress | METRIC - error 1960.76
2025-06-17T06:58:47.585081+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:58:47.585526

(17/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 173.32it/s]
(18/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.84it/s]

2025-06-17T06:59:04.894188+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.17.self_attn.q_proj using 512 samples





2025-06-17T06:59:05.918918+0000 | compress | METRIC - time 1.02s
2025-06-17T06:59:05.919929+0000 | compress | METRIC - error 5366.07
2025-06-17T06:59:05.920737+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:05.921100+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:05.921497+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:05.921870+0000 | compress | METRIC - GPU 3 | usage: 7.36% | total memory: 24 GB
2025-06-17T06:59:05.922284+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:59:05.923277+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.17.self_attn.k_proj using 512 samples
2025-06-17T06:59:06.912793+0000 | compress | METRIC - time 0.99s
2025-06-17T06:59:06.913887+0000 | compress | METRIC - error 2040.75
2025-06-17T06:59:06.914695+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:06.915078

(18/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 172.59it/s]
(19/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 67.21it/s]

2025-06-17T06:59:24.346779+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.18.self_attn.q_proj using 512 samples





2025-06-17T06:59:25.383796+0000 | compress | METRIC - time 1.04s
2025-06-17T06:59:25.384655+0000 | compress | METRIC - error 6100.59
2025-06-17T06:59:25.385410+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:25.385754+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:25.386143+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:25.386512+0000 | compress | METRIC - GPU 3 | usage: 10.51% | total memory: 24 GB
2025-06-17T06:59:25.386908+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:59:25.387880+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.18.self_attn.k_proj using 512 samples
2025-06-17T06:59:26.389163+0000 | compress | METRIC - time 1.00s
2025-06-17T06:59:26.389973+0000 | compress | METRIC - error 2349.14
2025-06-17T06:59:26.390756+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:26.39113

(19/23): Propagating: 100%|██████████| 512/512 [00:03<00:00, 157.29it/s]
(20/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.51it/s]

2025-06-17T06:59:43.917852+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.19.self_attn.q_proj using 512 samples





2025-06-17T06:59:44.914004+0000 | compress | METRIC - time 0.99s
2025-06-17T06:59:44.914822+0000 | compress | METRIC - error 5810.61
2025-06-17T06:59:44.915583+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:44.915902+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:44.916209+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T06:59:44.916547+0000 | compress | METRIC - GPU 3 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:44.916879+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T06:59:44.917851+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.19.self_attn.k_proj using 512 samples
2025-06-17T06:59:45.888879+0000 | compress | METRIC - time 0.97s
2025-06-17T06:59:45.889700+0000 | compress | METRIC - error 2158.59
2025-06-17T06:59:45.890487+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T06:59:45.89079

(20/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 175.34it/s]
(21/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.60it/s]

2025-06-17T07:00:02.875735+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.20.self_attn.q_proj using 512 samples





2025-06-17T07:00:03.907839+0000 | compress | METRIC - time 1.03s
2025-06-17T07:00:03.908811+0000 | compress | METRIC - error 5879.96
2025-06-17T07:00:03.909586+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:03.909994+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T07:00:03.910428+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T07:00:03.910821+0000 | compress | METRIC - GPU 3 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:03.911240+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T07:00:03.912309+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.20.self_attn.k_proj using 512 samples
2025-06-17T07:00:04.942583+0000 | compress | METRIC - time 1.03s
2025-06-17T07:00:04.943496+0000 | compress | METRIC - error 2162.16
2025-06-17T07:00:04.944340+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:04.94472

(21/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 176.10it/s]
(22/23): Calibrating: 100%|██████████| 512/512 [00:07<00:00, 69.54it/s]

2025-06-17T07:00:22.246570+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.21.self_attn.q_proj using 512 samples





2025-06-17T07:00:23.243706+0000 | compress | METRIC - time 1.00s
2025-06-17T07:00:23.244485+0000 | compress | METRIC - error 6350.40
2025-06-17T07:00:23.245308+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:23.245679+0000 | compress | METRIC - GPU 1 | usage: 12.17% | total memory: 24 GB
2025-06-17T07:00:23.246086+0000 | compress | METRIC - GPU 2 | usage: 12.17% | total memory: 24 GB
2025-06-17T07:00:23.246494+0000 | compress | METRIC - GPU 3 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:23.246913+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-06-17T07:00:23.247953+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.21.self_attn.k_proj using 512 samples
2025-06-17T07:00:24.241940+0000 | compress | METRIC - time 0.99s
2025-06-17T07:00:24.243078+0000 | compress | METRIC - error 2215.37
2025-06-17T07:00:24.244073+0000 | compress | METRIC - GPU 0 | usage: 11.04% | total memory: 24 GB
2025-06-17T07:00:24.24450

(22/23): Propagating: 100%|██████████| 512/512 [00:02<00:00, 176.11it/s]
(23/23): Calibrating: 100%|██████████| 512/512 [00:01<00:00, 292.40it/s]
(23/23): Propagating: 100%|██████████| 512/512 [00:01<00:00, 292.62it/s]
manager stage: Modifiers initialized


2025-06-17T07:00:37.414273+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers


manager stage: Modifiers finalized


2025-06-17T07:00:37.415971+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
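The calibration log above reports an `error` value for each quantized module. To compare these across layers, the log text can be tabulated. The following is a minimal sketch, assuming the log has been captured as a string; `parse_gptq_errors` is a hypothetical helper written for this illustration, not part of `llm-compressor`, and the sample lines are copied from the output above.

```python
import re

# Sample lines copied from the calibration log above.
LOG = """\
2025-06-17T06:57:47.855152+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.13.self_attn.q_proj using 512 samples
2025-06-17T06:57:48.899590+0000 | compress | METRIC - time 1.04s
2025-06-17T06:57:48.900792+0000 | compress | METRIC - error 4123.42
2025-06-17T06:57:48.904196+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.13.self_attn.k_proj using 512 samples
2025-06-17T06:57:49.930100+0000 | compress | METRIC - time 1.03s
2025-06-17T06:57:49.930962+0000 | compress | METRIC - error 1765.75
"""

# Patterns match the log format shown in this notebook's output only.
MODULE_RE = re.compile(r"INFO - Quantizing (\S+) using (\d+) samples")
ERROR_RE = re.compile(r"METRIC - error ([0-9.]+)")


def parse_gptq_errors(log_text):
    """Map each module name to the error value logged right after it (hypothetical helper)."""
    errors = {}
    current = None
    for line in log_text.splitlines():
        module_match = MODULE_RE.search(line)
        if module_match:
            # Remember which module the following METRIC lines belong to.
            current = module_match.group(1)
            continue
        error_match = ERROR_RE.search(line)
        if error_match and current is not None:
            errors[current] = float(error_match.group(1))
    return errors


errors = parse_gptq_errors(LOG)
for name, err in errors.items():
    print(f"{name}: {err:.2f}")
```

Sorting the resulting dictionary by value is one quick way to see which modules report the largest error.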


In [14]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-GPTQ-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);

2025-06-17T07:00:37.424784+0000 | save_pretrained_wrapper | INFO - Fetching state_dict - this may take some time
2025-06-17T07:00:39.437497+0000 | save_pretrained_wrapper | INFO - Fetching compressor
2025-06-17T07:00:39.438306+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Quantized Compression: 100%|██████████| 509/509 [00:04<00:00, 107.63it/s]

2025-06-17T07:00:44.172957+0000 | save_pretrained_wrapper | INFO - Saving compressed model to disk
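Once the compressed model is saved, it can be useful to check how much space it takes on disk. The following is a minimal sketch using only the standard library; `dir_size_mb` is a hypothetical helper name, and the demo runs on a temporary directory so that it is self-contained. In the notebook, you would call `dir_size_mb(model_dir)` instead.

```python
import os
import tempfile


def dir_size_mb(path):
    """Return the total size of all files under `path`, in megabytes (10**6 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


# Self-contained demo: write a 2,000,000-byte placeholder file and measure it.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "weights.bin"), "wb") as f:
        f.write(b"\0" * 2_000_000)
    demo_size = dir_size_mb(tmp)

print(f"{demo_size:.1f} MB")
```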





### 4\) Rerun `lm_eval`

Note that the perplexity score has improved (lower is better) for this `TinyLlama` model compared with the earlier evaluation.

In [15]:
results = lm_eval.simple_evaluate(
    model="vllm" if use_gpu else "hf",
    model_args={
        "pretrained": model_dir,
        "add_bos_token": True,
        "device": "auto"
    },
    tasks=["wikitext"],
    batch_size="auto" if use_gpu else 4,
    limit=None if use_gpu else 4,
)

INFO 06-17 07:00:45 [config.py:793] This model supports multiple tasks: {'score', 'embed', 'generate', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 06-17 07:00:45 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-17 07:00:50 [__init__.py:243] Automatically detected platform cuda.
INFO 06-17 07:00:53 [core.py:438] Waiting for init message from front-end.
INFO 06-17 07:00:53 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-17 07:00:53 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-17 07:00:53 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-17 07:00:53 [core.py:65] Initializing a V1 LLM engine (v0.9.0.1) with config: model='./TinyLlama-1.1B-Chat-v1.0-GPTQ-W4A16', speculative_config=None, tokenizer='./TinyLlama-1.1B-Chat-v1.0-GPTQ-W4A16', skip_tokenizer_init=False, 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.37it/s]



INFO 06-17 07:00:54 [default_loader.py:280] Loading weights took 0.22 seconds
INFO 06-17 07:00:55 [gpu_model_runner.py:1549] Model loading took 0.7432 GiB and 0.488780 seconds
INFO 06-17 07:01:01 [backends.py:459] Using cache directory: /opt/app-root/src/.cache/vllm/torch_compile_cache/f2259db575/rank_0_0 for vLLM's torch.compile
INFO 06-17 07:01:01 [backends.py:469] Dynamo bytecode transform time: 6.70 s
INFO 06-17 07:01:04 [backends.py:158] Cache the graph of shape None for later use
INFO 06-17 07:01:26 [backends.py:170] Compiling a graph for general shape takes 23.90 s
INFO 06-17 07:01:38 [monitor.py:33] torch.compile takes 30.61 s in total
INFO 06-17 07:01:39 [kv_cache_utils.py:637] GPU KV cache size: 813,920 tokens
INFO 06-17 07:01:39 [kv_cache_utils.py:640] Maximum concurrency for 2,048 tokens per request: 397.42x
INFO 06-17 07:02:01 [gpu_model_runner.py:1933] Graph capturing finished in 22 secs, took 0.37 GiB
INFO 06-17 07:02:01 [core.py:167] init engine (profile, create kv cach

[Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
[Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
100%|██████████| 62/62 [00:00<00:00, 710.87it/s]
  0%|          | 0/62 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (5945 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 62

Adding requests:   0%|          | 0/62 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/62 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 10.71it/s]
Running loglikelihood requests:   0%|          | 0/62 [00:00<?, ?it/s]

Adding requests:   0%|          | 0/62 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/62 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 10.89it/s]
Running loglikelihood requests:   0%|          | 0/62 [00:00<?, ?it/s]

Adding requests:   0%|          | 0/62 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/62 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Running loglikelihood requests: 100%|██████████| 62/62 [00:05<00:00, 10.46it/s]


In [16]:
print(make_table(results))

| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 0.7508|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 1.6827|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |16.1636|±  |   N/A|
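The `bits_per_byte` and `byte_perplexity` rows in the table express the same quantity on two scales: byte perplexity is 2 raised to the bits-per-byte value. The short check below is a sketch that confirms this with the numbers copied from the table above; it does not call `lm_eval`.

```python
import math

# Values copied from the wikitext results table above.
bits_per_byte = 0.7508
byte_perplexity = 1.6827

# Both metrics derive from the same average negative log-likelihood per byte:
# one reports it in bits, the other as a perplexity, so
# byte_perplexity = 2 ** bits_per_byte.
reconstructed = 2 ** bits_per_byte
print(f"2 ** {bits_per_byte} = {reconstructed:.4f}")

# The table rounds to four decimals, so allow a small tolerance.
assert math.isclose(reconstructed, byte_perplexity, abs_tol=1e-3)
```

`word_perplexity` normalizes by word count rather than byte count, so it cannot be recovered from the other two rows without the dataset's word and byte totals.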

