# Quick Get Started Notebook of Intel® Neural Compressor for PyTorch


This notebook is an easy-to-follow guide for getting started with the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) library for the [PyTorch](https://github.com/pytorch/pytorch) framework.

In the following sections, we use a BERT model as an example, following the [`run_glue_no_trainer.py` script](https://github.com/huggingface/transformers/blob/v4.53.1/examples/pytorch/text-classification/run_glue_no_trainer.py), to demonstrate how to apply post-training quantization to Hugging Face Transformers models with INC.


The main objectives of this notebook are:

1. Prerequisite: Prepare the necessary environment, model, and dataset.
2. Quantization with INC: Walk through the step-by-step process of applying post-training static quantization.


## 1. Prerequisite

### 1.1 Environment

If you already have Jupyter Notebook installed, you can run this notebook directly. We will use pip to install or upgrade [neural-compressor](https://github.com/intel/neural-compressor), [PyTorch](https://github.com/pytorch/pytorch) and other required packages.

Otherwise, set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```shell
conda create -n inc_notebook python=3.10
conda activate inc_notebook
pip install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook, and you can then open this notebook in the browser to continue.

Next, install the necessary packages.

In [None]:
# install neural-compressor from source
import sys
!git clone https://github.com/intel/neural-compressor.git
%cd ./neural-compressor
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} setup.py install
%cd ..

# alternatively, install the stable release from PyPI (run only one of the two options)
!{sys.executable} -m pip install neural-compressor


In [None]:
# install other packages used in this notebook.
!{sys.executable} -m pip install -r requirements.txt
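
After installation, it can help to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library. The package list is an assumption based on the imports used in this notebook, not the actual contents of `requirements.txt`, and the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Distribution names as published on PyPI. This list is an assumption based on
# the imports used in this notebook; adjust it to match your requirements.txt.
PACKAGES = [
    "neural-compressor",
    "torch",
    "intel-extension-for-pytorch",
    "transformers",
    "datasets",
    "evaluate",
]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before continuing.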


### 1.2 Load Dataset

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine classification tasks on sentences or pairs of sentences:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. This dataset is built from the SQuAD dataset.
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. This dataset is built from the Winograd Schema Challenge dataset.

Here, we use the MRPC task. We download and load the required dataset from the Hugging Face Hub.

In [2]:
import evaluate
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader

from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils import check_min_version
# Raises an error if the minimal version of Transformers is not installed. Remove at your own risk.
check_min_version("4.53.1")

In [5]:
task_name = 'mrpc'
raw_datasets = load_dataset("nyu-mll/glue", task_name)
label_list = raw_datasets["train"].features["label"].names
num_labels = len(label_list)

### 1.3 Prepare Model
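Download the pretrained model [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) and load it as a PyTorch model for sequence classification. The classification head is newly initialized rather than fine-tuned on MRPC, as the warning in the output below indicates.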

In [4]:
model_name = 'google-bert/bert-base-cased'

config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task=task_name,
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    trust_remote_code=False,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    from_tf=False,
    config=config,
    ignore_mismatched_sizes=False,
    trust_remote_code=False,
)
model.eval()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

### 1.4 Dataset Preprocessing
We need to tokenize the raw dataset and build the dataloaders used for calibration and evaluation.

In [6]:
sentence1_key, sentence2_key = ("sentence1", "sentence2")
padding = "max_length"
max_seq_length = 128

def preprocess_function(examples):
    # MRPC is a sentence-pair task, so both sentences are passed to the tokenizer
    args = (examples[sentence1_key], examples[sentence2_key])
    result = tokenizer(*args, padding=padding, max_length=max_seq_length, truncation=True)
    if "label" in examples:
        result["labels"] = examples["label"]
    return result

processed_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
    desc="Running tokenizer on dataset",
)
 
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]


data_collator = default_data_collator

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=data_collator, batch_size=8
)
example_inputs = next(iter(train_dataloader))
eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator, batch_size=8)

Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████████████| 3668/3668 [00:00<00:00, 15181.10 examples/s]
Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████████████| 408/408 [00:00<00:00, 13910.21 examples/s]
Running tokenizer on dataset: 100%|██████████████████████████████████████████████████████████████████████████| 1725/1725 [00:00<00:00, 7403.78 examples/s]
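
The tokenizer call above uses `padding="max_length"`, `max_length=128` and `truncation=True`, so every example ends up with the same length. The sketch below illustrates that idea on a single list of token ids in plain Python. It is a simplification and not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the token ids are arbitrary illustrative values, and the real tokenizer also handles special tokens and sentence pairs when truncating. The padding id of 0 matches `padding_idx=0` in the model summary above.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    mask += [0] * n_pad                  # 0 marks a padding position
    return ids, mask


# Arbitrary illustrative token ids, padded to a short length for readability.
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Fixed-length inputs keep every batch the same shape, which is convenient when a model is traced or calibrated with example inputs.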


## 2. Quantization with Intel® Neural Compressor

### 2.1 Define Calibration and Evaluation Functions

In this part, we define a calibration function, which runs the training data through the model, and an evaluation function based on the GLUE metric for the chosen task.
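
The calibration function in the next cell, `run_fn`, iterates over the entire `train_dataloader`. Calibration often needs only a subset of the data, so one possible variation is to cap the number of batches. The sketch below shows this with `itertools.islice`. The factory name `make_calibration_fn` and the default of 32 steps are assumptions for illustration, and the stand-in batches and model exist only so the sketch runs without PyTorch.

```python
from itertools import islice


def make_calibration_fn(dataloader, max_steps=32):
    """Build a calibration function that feeds at most max_steps batches to the model."""
    def run_fn(model):
        steps = 0
        for batch in islice(dataloader, max_steps):
            model(**batch)  # forward pass only; the outputs are discarded
            steps += 1
        return steps        # number of batches actually used
    return run_fn


# Stand-ins so the sketch runs on its own: a list of dict batches and a callable "model".
fake_batches = [{"input_ids": [i]} for i in range(100)]
seen = []


def fake_model(input_ids):
    seen.append(input_ids)


calibrate = make_calibration_fn(fake_batches, max_steps=5)
print(calibrate(fake_model))  # 5
```

In this notebook, the equivalent call would be `run_fn = make_calibration_fn(train_dataloader, max_steps=32)`, where the right number of steps depends on the model and data.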

In [7]:
# define calibration function: run the training data through the model
# so that activation ranges can be observed (the outputs are not needed)
def run_fn(model):
    for batch in train_dataloader:
        model(**batch)

# define evaluation function
metric = evaluate.load("glue", task_name)
def eval_fn(model):
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        try:
            predictions = outputs.logits.argmax(dim=-1)
        except (AttributeError, KeyError):
            predictions = outputs["logits"].argmax(dim=-1)
        references = batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )

    eval_metric = metric.compute()
    print(f"evaluate results: {eval_metric}")
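
For MRPC, the GLUE metric reports accuracy and F1, as the evaluation output in this notebook shows. To make clear what these two numbers measure, here is a minimal pure-Python sketch for binary labels. It is an illustration, not the `evaluate` library's implementation, and it assumes that label 1 is the positive class.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 for two equally long label sequences."""
    if len(predictions) != len(references) or len(predictions) == 0:
        raise ValueError("predictions and references must be non-empty and equally long")
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2*TP / (2*TP + FP + FN)
    return {"accuracy": correct / len(pairs), "f1": f1}


print(accuracy_and_f1([1, 1, 0, 1], [1, 0, 0, 1]))
# {'accuracy': 0.75, 'f1': 0.8}
```

Accuracy counts all correct predictions, while F1 focuses on the positive class, so the two can differ noticeably on an imbalanced dataset.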

### 2.2 Run Quantization

Now we can quantize the model.

First, we obtain a static post-training quantization configuration with `get_default_static_config()`. We then call `prepare` to get a model that is ready for calibration, run the calibration function `run_fn` on the prepared model, and finally call `convert`, which returns the quantized model.
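
To give some intuition for what static post-training quantization does, the sketch below walks through the idea on plain Python lists: observe value ranges during calibration, derive a fixed scale and zero point, then map floats to 8-bit integers. This is a conceptual illustration only and not how INC or IPEX implement it. The per-tensor min/max observer, the unsigned 8-bit range, and all function names are assumptions chosen for simplicity.

```python
def calibrate_range(batches):
    """Observer step: track the min and max over all calibration values."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def affine_qparams(lo, hi, qmin=0, qmax=255):
    """Derive scale and zero point that map [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # make sure 0.0 is representable
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0                      # degenerate range: avoid division by zero
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers with the fixed parameters found during calibration."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Calibration: the range is fixed once, from representative data.
lo, hi = calibrate_range([[-1.0, 0.5], [2.0, 0.0]])
scale, zero_point = affine_qparams(lo, hi)

# Conversion: later inputs reuse the same scale and zero point.
q = quantize([-1.0, 0.0, 2.0], scale, zero_point)
print(q)  # [0, 85, 255]
print(dequantize(q, scale, zero_point))
```

The word "static" refers to the fact that the scale and zero point are computed once from calibration data and then reused at inference time, which is why a representative calibration set matters.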

In [8]:
from neural_compressor.torch.quantization import (
    convert,
    get_default_static_config,
    prepare,
)

# evaluate the fp32 model as a baseline
eval_fn(model)
# static quantization via Intel Extension for PyTorch (IPEX)
import intel_extension_for_pytorch
quant_config = get_default_static_config()
prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs)
run_fn(prepared_model)
q_model = convert(prepared_model)
eval_fn(q_model)



evaluate results: {'accuracy': 0.6813725490196079, 'f1': 0.8099415204678363}


  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
2025-07-07 03:07:56 [INFO][2204578354.py:12] Preparation started.
2025-07-07 03:07:56 [INFO][utility.py:740]  Found 12 blocks
2025-07-07 03:07:56 [INFO][utility.py:342] Attention Blocks: 12
2025-07-07 03:07:56 [INFO][utility.py:343] FFN Blocks: 12
2025-07-07 03:07:57 [INFO][utility.py:441] Attention Blocks : 
2025-07-07 03:07:57 [INFO][utility.py:442] [['bert.encoder.layer.0.attention.self.query', 'bert.encoder.layer.0.attention.self.key', 'bert.encoder.layer.0.atten

evaluate results: {'accuracy': 0.6838235294117647, 'f1': 0.8116788321167884}
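
The two `evaluate results` lines above come from the fp32 model and the quantized model, in that order. A small sketch for comparing them is shown below. The numbers are copied from the output above, and the helper name `metric_deltas` is hypothetical.

```python
# Values copied from the evaluation output in this notebook.
fp32_results = {"accuracy": 0.6813725490196079, "f1": 0.8099415204678363}
int8_results = {"accuracy": 0.6838235294117647, "f1": 0.8116788321167884}


def metric_deltas(baseline, quantized):
    """Return quantized minus baseline for every metric in the baseline."""
    return {name: quantized[name] - baseline[name] for name in baseline}


for name, delta in metric_deltas(fp32_results, int8_results).items():
    print(f"{name}: {delta:+.4f}")
# accuracy: +0.0025
# f1: +0.0017
```

With 408 validation examples, an accuracy difference of about 0.0025 corresponds to a single example, so in this run the quantized model behaves almost identically to the fp32 baseline.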
