# AWQ Quantization Baseline on OPT-1.3B

Project Title: Comparative Study of Metaheuristic Search vs AWQ for Layer-wise Mixed-Precision Quantization
Prepared by: Hermela Wosene, Hiwot Teshome, and Melat Dagnachew
Target Model: OPT-1.3B
Quantization Config: 4-bit weights, group size 128
Quantization Method: Activation-aware Weight Quantization (AWQ)
Evaluation Metric: Perplexity on WikiText-2
Platform: Google Colab GPU with Google Drive integration

---

### Project Scope

This project benchmarks a family of nature-inspired metaheuristic algorithms (Differential Evolution, Particle Swarm Optimization, and Simulated Annealing) against the AWQ baseline, a state-of-the-art post-training quantization method for large language models.

We focus on quantizing the OPT-1.3B model with group-wise INT3/INT4 weight quantization and evaluate the quantized model by its perplexity on WikiText-2. This notebook implements and runs the AWQ baseline using its official repository.

We will later compare the results to metaheuristic search over per-layer bit allocations using the same evaluation budget.
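To make the search space concrete, the sketch below shows one way a per-layer bit allocation could be encoded and perturbed. It is an illustrative assumption, not the implementation used in the metaheuristic notebooks: the names `BIT_CHOICES`, `random_allocation`, `average_bits`, and `neighbor` are hypothetical, and the 3/4-bit choice set simply mirrors the INT3/INT4 setting above. OPT-1.3B has 24 decoder layers, so a candidate is a list of 24 bit-widths.

```python
import random

NUM_LAYERS = 24        # OPT-1.3B has 24 decoder layers
BIT_CHOICES = (3, 4)   # assumed search space, mirroring the INT3/INT4 setting


def random_allocation(rng):
    """Draw one candidate: a bit-width for every decoder layer."""
    return [rng.choice(BIT_CHOICES) for _ in range(NUM_LAYERS)]


def average_bits(allocation):
    """Mean bit-width, assuming all decoder layers hold the same number of weights."""
    return sum(allocation) / len(allocation)


def neighbor(allocation, rng):
    """Local move: switch exactly one layer to a different bit-width."""
    new = list(allocation)
    i = rng.randrange(len(new))
    new[i] = rng.choice([b for b in BIT_CHOICES if b != new[i]])
    return new


rng = random.Random(0)
candidate = random_allocation(rng)
print(candidate, average_bits(candidate))
print(neighbor(candidate, rng))
```

A uniform 4-bit allocation (`[4] * NUM_LAYERS`) corresponds to the AWQ configuration evaluated in this notebook, which is why it serves as the reference point for the later comparison.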


In [4]:
# Check that Colab is using a GPU runtime
import torch

if torch.cuda.is_available():
    print(f" GPU is ready: {torch.cuda.get_device_name(0)}")
else:
    print(" GPU not available — go to Runtime → Change runtime type → select GPU")



 GPU is ready: Tesla T4


In [5]:
# Change directory
%cd /content/drive/MyDrive/LLM_Quant_Project
!ls


/content/drive/MyDrive/LLM_Quant_Project
awq_baseline.ipynb  meta_eval.ipynb  meta_optimize.ipynb  report_results.ipynb


## 1. Environment Setup and AWQ Repository Preparation

This section sets up the official AWQ implementation inside Google Colab. It covers cloning the AWQ repository, installing the required Python packages, and patching out imports of CUDA extensions that are not compatible with the Colab environment.

AWQ includes a low-level CUDA-based inference engine, but our evaluation uses simulated quantization (`--q_backend fake`), so these components can be bypassed safely. This lets us run AWQ’s core quantization and evaluation logic on a CPU or any general-purpose GPU, without the custom kernels.

All necessary Hugging Face dependencies (`transformers`, `datasets`, `accelerate`) are also installed here to support downloading the OPT-1.3B model and running evaluation tasks like WikiText-2.
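As a rough illustration of what simulated (fake) quantization means, the sketch below quantizes weights to integer codes and immediately dequantizes them, with one scale and zero-point per group. This is a minimal, dependency-free sketch under our own assumptions (plain Python lists instead of tensors, a hypothetical function name `fake_quantize_groupwise`), not the code path used by the AWQ repository.

```python
import math


def fake_quantize_groupwise(weights, n_bits=4, group_size=128):
    """Quantize then dequantize a flat list of floats, one (scale, zero) per group."""
    q_max = 2 ** n_bits - 1  # 15 for 4-bit unsigned codes
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        # Fall back to 1.0 for a constant group, where the range would be zero.
        scale = (w_max - w_min) / q_max or 1.0
        zero = min(max(round(-w_min / scale), 0), q_max)
        for w in group:
            code = min(max(round(w / scale) + zero, 0), q_max)
            out.append((code - zero) * scale)  # back to floating point
    return out


weights = [math.sin(i) for i in range(256)]
dequantized = fake_quantize_groupwise(weights, n_bits=4, group_size=128)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(f"max abs error: {max_error:.4f}")
```

The weights stay in floating point throughout, so no integer kernels are needed; only the rounding error of the chosen bit-width and group size is reproduced.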





## Clone and Install AWQ

In [6]:
# Clone the AWQ repo from MIT Han Lab
!git clone https://github.com/mit-han-lab/llm-awq.git
%cd llm-awq


Cloning into 'llm-awq'...
remote: Enumerating objects: 1144, done.
remote: Counting objects: 100% (620/620), done.
remote: Compressing objects: 100% (325/325), done.
remote: Total 1144 (delta 458), reused 295 (delta 295), pack-reused 524 (from 2)
Receiving objects: 100% (1144/1144), 183.13 MiB | 18.32 MiB/s, done.
Resolving deltas: 100% (609/609), done.
Updating files: 100% (180/180), done.
/content/drive/MyDrive/LLM_Quant_Project/llm-awq


## Install Dependencies

In [7]:
# Upgrade pip and install AWQ core dependencies
!pip install --upgrade pip
!pip install -e .
!pip install transformers datasets accelerate


Obtaining file:///content/drive/MyDrive/LLM_Quant_Project/llm-awq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: awq
  Building editable for awq (pyproject.toml) ... done
  Created wheel for awq: filename=awq-0.1.0-0.editable-py3-none-any.whl size=10373 sha256=29a45bb3ed3cb28a0b4e87b0bad7919e34efcc5f18836ae2bfb928d811f885a5
  Stored in directory: /tmp/pip-ephem-wheel-cache-zdcbn5k9/wheels/09/c9/fe/1e03586694a6b001530152f6138a31bf4d956ed834288a99d4
Successfully built awq
Installing collected packages: awq
  Attempting uninstall: awq
    Found existing installation: awq 0.1.0
    Uninstalling awq-0.1.0:
      Successfully uninstalled awq-0.1.0
Successfully installed awq-0.1.0


## Patch CUDA-only Imports

In [8]:
# Patch 1: comment out the CUDA-only awq_inference_engine import in w8a8_linear.py
file1 = "awq/quantize/w8a8_linear.py"
with open(file1, "r") as f:
    lines = f.readlines()  # read everything first, then rewrite the file in place
with open(file1, "w") as f:
    for line in lines:
        if "import awq_inference_engine" in line:
            f.write(f"# {line.strip()}  # patched for Colab\n")
        else:
            f.write(line)

# Patch 2: comment out the same CUDA-only import in qmodule.py
patch_file = "awq/quantize/qmodule.py"
with open(patch_file, "r") as f:
    lines = f.readlines()

with open(patch_file, "w") as f:
    for line in lines:
        if "import awq_inference_engine" in line:
            f.write(f"# {line.strip()}  # patched for Colab\n")
        else:
            f.write(line)

print(" Patched qmodule.py to skip awq_inference_engine import.")


 Patched qmodule.py to skip awq_inference_engine import.


In [9]:
# Patch pre_quant.py to skip undefined LlavaLlamaForCausalLM
file_path = "awq/quantize/pre_quant.py"
with open(file_path, "r") as f:
    lines = f.readlines()

with open(file_path, "w") as f:
    for line in lines:
        if "isinstance(model, LlavaLlamaForCausalLM)" in line:
            f.write("# " + line.strip() + "  # patched: undefined class\n")
        else:
            f.write(line)

print(" Patched pre_quant.py to remove LlavaLlamaForCausalLM reference.")


 Patched pre_quant.py to remove LlavaLlamaForCausalLM reference.
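The patch cells above repeat the same read, filter, and rewrite pattern. The sketch below is one possible reusable variant, offered as an assumption rather than part of the AWQ repository: the helper name `comment_out_lines` is hypothetical. It preserves the original indentation and skips lines that are already commented, so re-running the cell does not stack extra `#` prefixes.

```python
import tempfile
from pathlib import Path


def comment_out_lines(path, needle, tag="patched for Colab"):
    """Comment out every uncommented line containing `needle`; return how many changed."""
    path = Path(path)
    patched, count = [], 0
    for line in path.read_text().splitlines(keepends=True):
        stripped = line.lstrip()
        if needle in line and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            patched.append(f"{indent}# {stripped.rstrip()}  # {tag}\n")
            count += 1
        else:
            patched.append(line)
    path.write_text("".join(patched))
    return count


# Demo on a throwaway file rather than the real repository sources.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.py"
    demo.write_text("import torch\nimport awq_inference_engine\n")
    print(comment_out_lines(demo, "import awq_inference_engine"))
    print(comment_out_lines(demo, "import awq_inference_engine"))
    print(demo.read_text())
```

In the notebook it would be called as, for example, `comment_out_lines("awq/quantize/qmodule.py", "import awq_inference_engine")`. Commenting out the only statement inside an indented block still leaves invalid Python, so the patched file is worth inspecting afterwards.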


## 2. Mount Google Drive and Set Paths

In [13]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Project folder to store outputs
project_dir = "/content/drive/MyDrive/LLM_Quant_Project"
awq_cache = os.path.join(project_dir, "awq_cache")
os.makedirs(awq_cache, exist_ok=True)

print(" Google Drive mounted. Cache folder ready.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
 Google Drive mounted. Cache folder ready.


## 3. Model Acquisition and AWQ Quantization Search

Before performing quantization, the OPT-1.3B model must be downloaded and cached locally using Hugging Face’s `transformers` API. This ensures the AWQ code has direct access to the model weights and architecture.

After downloading, we invoke the AWQ quantization routine using:
- 4-bit integer weights (`--w_bit 4`)
- Group-wise quantization with group size 128 (`--q_group_size 128`)
- Activation-aware search enabled (`--run_awq`)

This process produces a per-layer scaling configuration based on activation saliency, which is then saved to Drive for evaluation.
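The idea behind the activation-aware search can be illustrated with a toy example. The sketch below is a simplified illustration under our own assumptions, not the repository's implementation: it scales each input channel of a small weight matrix by `mean(|x|) ** alpha`, quantizes the scaled weights, undoes the scaling, and keeps the `alpha` that gives the smallest output error on calibration inputs. The function names, the per-row quantizer, and the 21-point grid are all hypothetical choices, and steps such as clipping search are omitted.

```python
import random


def quantize_row(row, n_bits=4):
    """Asymmetric round-to-nearest quantize-dequantize of one weight row."""
    q_max = 2 ** n_bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / q_max or 1.0
    zero = min(max(round(-lo / scale), 0), q_max)
    return [(min(max(round(w / scale) + zero, 0), q_max) - zero) * scale for w in row]


def output_error(W, W_hat, X):
    """Sum of squared differences between the outputs X @ W.T and X @ W_hat.T."""
    err = 0.0
    for x in X:
        for w, w_hat in zip(W, W_hat):
            diff = sum(xj * (a - b) for xj, a, b in zip(x, w, w_hat))
            err += diff * diff
    return err


def search_scales(W, X, n_bits=4, grid=20):
    """Grid-search alpha in [0, 1] for per-input-channel scales s = mean|x| ** alpha."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = None
    for step in range(grid + 1):
        alpha = step / grid  # alpha = 0 gives s = 1, i.e. plain round-to-nearest
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
        W_hat = [[q / sj for q, sj in zip(quantize_row(row, n_bits), s)] for row in scaled]
        err = output_error(W, W_hat, X)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
# Channel 0 carries much larger activations, i.e. it is the "salient" channel.
X = [[rng.gauss(0, 1) * (10 if j == 0 else 1) for j in range(8)] for _ in range(16)]
err, alpha, scales = search_scales(W, X)
rtn_err = output_error(W, [quantize_row(row) for row in W], X)
print(f"RTN error {rtn_err:.4f} -> best error {err:.4f} at alpha={alpha:.2f}")
```

Because `alpha = 0` is part of the grid, the selected error can never be worse than plain round-to-nearest on the calibration inputs; any improvement comes from protecting channels with large activations.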


## Download and Cache OPT-1.3B Model

The cell below uses Hugging Face's `transformers` API to download the pretrained OPT-1.3B model and tokenizer and save them to `/content/opt-1.3b`, the local path later passed to AWQ via `--model_path`.

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
save_dir = "/content/opt-1.3b"

# Download and cache model + tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
model.save_pretrained(save_dir)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(save_dir)


('/content/opt-1.3b/tokenizer_config.json',
 '/content/opt-1.3b/special_tokens_map.json',
 '/content/opt-1.3b/vocab.json',
 '/content/opt-1.3b/merges.txt',
 '/content/opt-1.3b/added_tokens.json',
 '/content/opt-1.3b/tokenizer.json')

## Download OPT-1.3B Using Transformers

In [11]:
from transformers import AutoModelForCausalLM, GPT2TokenizerFast
import os
import shutil

local_model_path = "/content/opt-1.3b"

# Start fresh: remove broken partial folder
if os.path.exists(local_model_path):
    shutil.rmtree(local_model_path)

# Download and save model
print(" Downloading OPT-1.3B weights...")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.save_pretrained(local_model_path)

# Download and save the correct tokenizer
print(" Downloading OPT-1.3B tokenizer (GPT2TokenizerFast)...")
tokenizer = GPT2TokenizerFast.from_pretrained("facebook/opt-1.3b")
tokenizer.save_pretrained(local_model_path)

# Manually write tokenizer_config.json (force tokenizer class type)
with open(os.path.join(local_model_path, "tokenizer_config.json"), "w") as f:
    f.write('{ "tokenizer_class": "GPT2TokenizerFast" }')

print(" OPT-1.3B fully downloaded and saved to:", local_model_path)


 Downloading OPT-1.3B weights...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


 Downloading OPT-1.3B tokenizer (GPT2TokenizerFast)...
 OPT-1.3B fully downloaded and saved to: /content/opt-1.3b


## Run AWQ Search and Dump Results

In [14]:
# Run activation-aware quantization search and save results to Drive
!mkdir -p /content/drive/MyDrive/LLM_Quant_Project/awq_cache

!python -m awq.entry \
  --model_path /content/opt-1.3b \
  --w_bit 4 \
  --q_group_size 128 \
  --run_awq \
  --dump_awq /content/drive/MyDrive/LLM_Quant_Project/awq_cache/opt-1.3b-w4-g128.pt





2026-01-29 18:11:51.164554: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1769710311.422209   77971 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1769710311.488850   77971 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1769710312.000206   77971 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1769710312.000247   77971 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1769710312.000252   77971 computation_placer.cc:177] computation placer alr

## 4. Evaluation of Quantized Model on WikiText-2

In this step, we load the previously computed AWQ quantization parameters and evaluate the model’s performance using the WikiText-2 benchmark dataset.

The evaluation is conducted in simulation mode using `--q_backend fake`, which emulates quantization behavior without requiring custom CUDA kernels. This mode is ideal for quick experimentation and compatible with Colab environments.

The primary metric is **perplexity (PPL)** on WikiText-2, which measures how well the quantized model predicts held-out tokens (lower is better). This number will later serve as the baseline in our comparative analysis against metaheuristic optimization techniques.
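For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that final reduction, assuming the per-token losses (in nats) have already been collected; the function name `perplexity` is ours, and the tokenization and windowing used by the evaluation script are not reproduced here.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among four options.
uniform_nlls = [-math.log(0.25)] * 10
print(perplexity(uniform_nlls))
```

Because of the exponential, small differences in average loss translate into visible differences in PPL, which is why the same dataset split and evaluation settings should be used for every method being compared.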


In [15]:
# Evaluate the AWQ-quantized model on WikiText-2
# Uses --q_backend fake to simulate quantization without real CUDA kernels

!python -m awq.entry \
  --model_path /content/opt-1.3b \
  --tasks wikitext \
  --w_bit 4 \
  --q_group_size 128 \
  --load_awq /content/drive/MyDrive/LLM_Quant_Project/awq_cache/opt-1.3b-w4-g128.pt \
  --q_backend fake



2026-01-29 18:13:38.896191: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1769710419.172067   78463 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1769710419.241019   78463 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1769710419.762589   78463 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1769710419.762633   78463 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1769710419.762638   78463 computation_placer.cc:177] computation placer alr

## 5. Export Real Quantized Weights

This step saves real quantized weights as an INT4 `.pt` file by running AWQ with `--q_backend real` and `--dump_quant`. It requires the CUDA kernels and may fail if flash-attn isn't installed correctly.

In [None]:
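# Generate an actual (real) quantized model.
# awq_dump_path and quant_dump_path are not defined earlier in this notebook;
# the values below are assumptions based on the Drive project folder from section 2.
awq_dump_path = "/content/drive/MyDrive/LLM_Quant_Project/awq_cache"
quant_dump_path = "/content/drive/MyDrive/LLM_Quant_Project/quant_cache"  # hypothetical output folder
!mkdir -p {quant_dump_path}

!python -m awq.entry \
  --model_path facebook/opt-1.3b \
  --w_bit 4 \
  --q_group_size 128 \
  --load_awq {awq_dump_path}/opt-1.3b-w4-g128.pt \
  --q_backend real \
  --dump_quant {quant_dump_path}/opt-1.3b-w4-g128-awq.pt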


## Final Step: Save Results for Later Use

Save perplexity results, logs, or model files back to Drive, or download them to the local machine.

In [None]:
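from google.colab import files
# Assumption: awq_dump_path is the Drive cache folder created in section 2
awq_dump_path = "/content/drive/MyDrive/LLM_Quant_Project/awq_cache"
files.download(f"{awq_dump_path}/opt-1.3b-w4-g128.pt")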
