# Harnessing Intel Optimizations for Efficient Model Quantization with Hugging Face
<img src="https://gestaltit.com/wp-content/uploads/2021/12/quantization-pruning-architecture.jpg" alt="Alt Text" style="width: 800px;"/>


## Why Learn About Model Quantization

In this developer-focused workshop, we delve into model quantization and its role in improving compute efficiency and reducing latency during inference. Quantization converts a model from floating-point numbers to lower-precision integers, which are computationally cheaper to process. This conversion is essential for deploying models in resource-constrained environments where memory footprint and inference speed are critical.

<img src="https://deci.ai/wp-content/uploads/2023/02/deci-quantization-blog-1b.png" alt="Alt Text" style="width: 800px;"/>
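To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. The parameter count is a hypothetical placeholder, not a measurement of any model in this notebook, and the helper name `weight_size_mib` is ours.

```python
def weight_size_mib(num_parameters, bits_per_parameter):
    """Approximate storage for the weights alone, in MiB."""
    return num_parameters * bits_per_parameter / 8 / (1024 ** 2)


# Hypothetical parameter count, chosen only for illustration
num_parameters = 65_000_000

fp32_mib = weight_size_mib(num_parameters, 32)  # 4 bytes per weight
int8_mib = weight_size_mib(num_parameters, 8)   # 1 byte per weight

print(f"FP32 weights: {fp32_mib:.1f} MiB")
print(f"INT8 weights: {int8_mib:.1f} MiB")
print(f"Reduction factor: {fp32_mib / int8_mib:.1f}x")
```

In practice the saving is usually smaller than this ideal ratio, because quantized models also store scale factors and some layers may stay in floating point.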

### Understanding the Trade-off

While quantization can significantly boost inference speed, it's vital to understand the trade-offs, particularly concerning model accuracy. Reducing model size and computational requirements can sometimes lower accuracy. This workshop focuses on accuracy-aware dynamic quantization, where the tuning process tries candidate quantization configurations and keeps one only if its accuracy stays within a tolerance we set.

### Static vs. Dynamic Quantization

Before we dive in, let's clarify two types of quantization:
- **Static Quantization**: Quantizes both the weights and the activations ahead of time. The activation ranges are fixed in advance, which requires a calibration step using representative data.
- **Dynamic Quantization**: Quantizes the weights ahead of time and computes the quantization parameters for the activations on the fly at inference time. This method is more flexible because it needs no calibration step, making it suitable for models whose inputs vary widely.
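The difference between the two can be sketched in plain Python. This is a minimal illustration that assumes symmetric int8 quantization with one scale per tensor. The helper names are hypothetical and this is not how Neural Compressor implements it internally.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude in `values` onto the integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clip to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Static: the activation scale is fixed ahead of time from calibration data
calibration_activations = [0.5, -1.0, 2.0]
static_scale = symmetric_scale(calibration_activations)

# Dynamic: the activation scale is computed from the actual input at inference time
runtime_activations = [0.1, -4.0, 3.0]
dynamic_scale = symmetric_scale(runtime_activations)

# With the static scale, -4.0 and 3.0 fall outside the calibrated range
# and saturate at -127 and 127; the dynamic scale covers the full input.
static_q = quantize(runtime_activations, static_scale)
dynamic_q = quantize(runtime_activations, dynamic_scale)

print("static :", static_q, dequantize(static_q, static_scale))
print("dynamic:", dynamic_q, dequantize(dynamic_q, dynamic_scale))
```

The example shows why static quantization depends on representative calibration data, while dynamic quantization pays a small runtime cost to compute the scale for each input.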

## Learning Objectives

By the end of this notebook, you'll learn how to use the Optimum Intel library to perform dynamic quantization on a Hugging Face model and understand the implications of this process on model performance and efficiency.

Let's get started on this journey to make our models faster and more efficient while maintaining their accuracy.


#### Setting Up the Environment and Initial Imports

The next two cells prepare the environment: the first lists the packages that are already installed, and the second installs pinned versions of the libraries this notebook depends on. The imports themselves follow in the cell after that. The tools we will use are:
- `evaluate` and `optimum.intel.INCQuantizer` for evaluating and quantizing our model.
- `load_dataset` for loading our evaluation dataset.
- `AutoModelForQuestionAnswering` and `AutoTokenizer` for loading our pre-trained model and tokenizer.
- `pipeline` for creating a question-answering pipeline.
- `neural_compressor.config` components for configuring our quantization process.

In [1]:
!pip freeze

about-time==4.2.1
accelerate==0.26.1
aiohttp==3.9.3
aiosignal==1.3.1
alive-progress==3.1.5
annotated-types==0.6.0
anyio==3.7.1
asn1crypto @ file:///home/conda/feedstock_root/build_artifacts/asn1crypto_1647369152656/work
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
async-timeout==4.0.3
attrs==23.2.0
autograd==1.6.2
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
backoff==2.2.1
bcrypt==4.1.2
Bottleneck @ file:///opt/conda/conda-bld/bottleneck_1657175564434/work
bqplot==0.12.42
branca==0.7.1
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764672617/work
cachetools==5.3.2
certifi @ file:///croot/certifi_1690232220950/work/certifi
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179360775/work
chardet @ file:///home/conda/feedstock_root/build_artifacts/chardet_1669990273997/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1678108872112/work
chroma-hnswlib==0

In [2]:
!source /opt/intel/oneapi/setvars.sh #comment out if not running on Intel Developer Cloud Jupyter
!python -m pip install optimum==1.16.2
!pip install --upgrade-strategy eager optimum[neural-compressor]==1.14.0
!pip install evaluate==0.4.1
!pip install datasets==2.16.0

 
   To force a re-execution of setvars.sh, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.
  
usage: source setvars.sh [--force] [--config=file] [--help] [...]
  --force        Force setvars.sh to re-run, doing so may overload environment.
  --config=file  Customize env vars using a setvars.sh configuration file.
  --help         Display this help message and exit.
  ...            Additional args are passed to individual env/vars.sh scripts
                 and should follow this script's arguments.
  
  Some POSIX shells do not accept command-line options. In that case, you can pass
  command-line options via the SETVARS_ARGS environment variable. For example:
  
  $ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
  $ . path/to/setvars.sh
  
  The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.
  
Defaulting to user installation because normal site-packages is not writeable
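After installation, it can help to confirm which versions actually ended up in the environment. The install cell pins `optimum` twice (1.16.2, then 1.14.0 through the `neural-compressor` extra), so the later command may replace the earlier version. The sketch below uses only the standard library. The helper name `installed_version` is ours, and the package names listed are the distribution names we assume pip uses for these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names assumed for the libraries used in this notebook
for name in ("optimum", "optimum-intel", "neural-compressor", "evaluate", "datasets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```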

#### Importing Libraries and Setting Up Quantization Environment

This cell is the starting point of our journey into model quantization. Here, we import a set of libraries and modules that are essential for both setting up our model and preparing it for the quantization process.

- `import evaluate`: This import brings in the `evaluate` library, which is crucial for assessing the performance of our model. It provides a straightforward way to evaluate various metrics, which is vital for understanding the impact of quantization on model accuracy.

- `from optimum.intel import INCQuantizer`: The `INCQuantizer` from the Optimum Intel library is a key component for this workshop. It is specifically designed to handle the quantization process, allowing us to convert our model into a more efficient format suitable for faster inference.

- `from datasets import load_dataset`: We use the `datasets` library to load the dataset required for evaluating our model. This step ensures that we have suitable data for assessing the model's performance before and after quantization.

- `from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline`: These imports from the Hugging Face Transformers library are critical for loading our pre-trained model and tokenizer. `AutoModelForQuestionAnswering` and `AutoTokenizer` will be used to set up our model for the question-answering task. The `pipeline` function allows us to create a seamless flow for data processing and inference.

- `from neural_compressor.config import AccuracyCriterion, TuningCriterion, PostTrainingQuantConfig`: These imports from the Neural Compressor library are used to configure the quantization process. `AccuracyCriterion` and `TuningCriterion` allow us to set parameters that define the acceptable accuracy loss and the tuning process for quantization. `PostTrainingQuantConfig` provides the necessary configuration for post-training quantization, which is the approach we will be using.

Each of these imports plays a vital role in preparing our environment for quantizing a model effectively, setting the stage for the subsequent steps in this workshop.

In [3]:
import evaluate
from optimum.intel import INCQuantizer
from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from neural_compressor.config import AccuracyCriterion, TuningCriterion, PostTrainingQuantConfig

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


#### Model and Dataset Preparation

We load a DistilBERT model and tokenizer that were fine-tuned on the SQuAD dataset (`distilbert-base-cased-distilled-squad`). For evaluation, we take the first 64 examples of the SQuAD validation split, which keeps each evaluation pass short. The `evaluate` library provides an evaluator for the question-answering task, and a pipeline wraps the model and tokenizer for inference.

In [4]:
model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
eval_dataset = load_dataset("squad", split="validation").select(range(64))
task_evaluator = evaluate.evaluator("question-answering")
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

#### Evaluation Function Definition

Here, we define `eval_fn`, which the quantizer calls to score each candidate model. The function swaps the candidate into the question-answering pipeline, runs the evaluator on our SQuAD subset, and returns the F1 score, which serves as the accuracy measure during tuning.

In [5]:
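def eval_fn(model):
    """Return the SQuAD F1 score of `model` on the evaluation subset."""
    # Swap the candidate model into the shared pipeline before scoring it
    qa_pipeline.model = model
    metrics = task_evaluator.compute(model_or_pipeline=qa_pipeline, data=eval_dataset, metric="squad")
    return metrics["f1"]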
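For intuition about what the F1 score measures here, the following is a simplified sketch of token-level F1 in the style of the SQuAD metric. It is not the code that `evaluate` runs: the real metric also takes the best score over several reference answers, averages over all examples, and reports the result on a 0 to 100 scale. The function names are ours.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the Denver Broncos", "Denver Broncos"))   # exact match after normalization
print(token_f1("Denver Broncos team", "Denver Broncos"))  # partial credit for an extra token
print(token_f1("Carolina Panthers", "Denver Broncos"))    # no overlapping tokens
```

Because partial overlaps earn partial credit, F1 changes gradually as quantization perturbs the predicted answer spans, which makes it a useful signal for tuning.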

#### Quantization Configuration

In this cell, we set up the configuration for dynamic quantization. We specify a tolerable accuracy loss of 5%, which is interpreted relative to the FP32 baseline (the tuning log reports `'criterion': 'relative'`), and a maximum of 10 tuning trials. The `PostTrainingQuantConfig` is set to the `"dynamic"` approach, in line with our focus on dynamic quantization.

In [6]:
# Set the accepted accuracy loss to 5%
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.05)
# Set the maximum number of trials to 10
tuning_criterion = TuningCriterion(max_trials=10)
quantization_config = PostTrainingQuantConfig(
    approach="dynamic", accuracy_criterion=accuracy_criterion, tuning_criterion=tuning_criterion
)
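To see what a 5% relative tolerance means in numbers, here is a small worked sketch. It assumes the relative criterion accepts a candidate whose score is at least `baseline * (1 - tolerable_loss)`. That is our reading of the setting, and it is consistent with the trials the tuning log in this notebook accepts and rejects. The function names are hypothetical, and the scores are copied from that log.

```python
def accuracy_target(baseline, tolerable_loss, relative=True):
    """Lowest acceptable score under a relative or absolute tolerance."""
    if relative:
        return baseline * (1 - tolerable_loss)
    return baseline - tolerable_loss


def meets_criterion(candidate, baseline, tolerable_loss, relative=True):
    """True if the candidate score stays within the tolerated loss."""
    return candidate >= accuracy_target(baseline, tolerable_loss, relative)


baseline_f1 = 88.1250  # FP32 baseline F1 reported in the tuning log
target_f1 = accuracy_target(baseline_f1, 0.05)
print(f"Minimum acceptable F1: {target_f1:.4f}")

# F1 scores of two trials from the tuning log
for trial, f1 in ((1, 82.0486), (7, 86.2946)):
    verdict = "accepted" if meets_criterion(f1, baseline_f1, 0.05) else "rejected"
    print(f"Tune {trial}: F1 {f1:.4f} -> {verdict}")
```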

#### Model Quantization and Saving

We initialize the quantizer with our model and the evaluation function, then call `quantize` with the configuration defined above. Because an evaluation function is supplied, Neural Compressor runs its tuning process, as the log output shows. The quantized model is saved to the `dynamic_quantization` directory so that we can load or deploy it later.

In [7]:
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config, save_directory="dynamic_quantization")

2024-02-02 08:08:09 [INFO] Start auto tuning.
2024-02-02 08:08:09 [INFO] Execute the tuning process due to detect the evaluation function.
2024-02-02 08:08:09 [INFO] Adaptor has 5 recipes.
2024-02-02 08:08:09 [INFO] 0 recipes specified by user.
2024-02-02 08:08:09 [INFO] 3 recipes require future tuning.
2024-02-02 08:08:09 [INFO] *** Initialize auto tuning
2024-02-02 08:08:09 [INFO] {
2024-02-02 08:08:09 [INFO]     'PostTrainingQuantConfig': {
2024-02-02 08:08:09 [INFO]         'AccuracyCriterion': {
2024-02-02 08:08:09 [INFO]             'criterion': 'relative',
2024-02-02 08:08:09 [INFO]             'higher_is_better': True,
2024-02-02 08:08:09 [INFO]             'tolerable_loss': 0.05,
2024-02-02 08:08:09 [INFO]             'absolute': None,
2024-02-02 08:08:09 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x153887596670>>,
2024-02-02 08:08:09 [INFO]             'relative': 0.05
2024-02-02 08:08:09 [INFO]    

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:12 [INFO] Save tuning history to /home/uad6b15e0ae3d5e407195ab5f044a50f/ai-innovation-bridge/workshops/ai-workloads-with-huggingface/nc_workspace/2024-02-02_08-07-59/./history.snapshot.
2024-02-02 08:08:12 [INFO] FP32 baseline is: [Accuracy: 88.1250, Duration (seconds): 1.8365]
2024-02-02 08:08:12 [INFO] Quantize the model with default config.
2024-02-02 08:08:12 [INFO] Fx trace of the entire model failed, We will conduct auto quantization
2024-02-02 08:08:14 [INFO] |******Mixed Precision Statistics******|
2024-02-02 08:08:14 [INFO] +-----------------+----------+---------+
2024-02-02 08:08:14 [INFO] |     Op Type     |  Total   |   INT8  |
2024-02-02 08:08:14 [INFO] +-----------------+----------+---------+
2024-02-02 08:08:14 [INFO] |    Embedding    |    2     |    2    |
2024-02-02 08:08:14 [INFO] |      Linear     |    37    |    37   

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:15 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 82.0486|88.1250, Duration (seconds) (int8|fp32): 1.5482|1.8365], Best tune result is: n/a
2024-02-02 08:08:15 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:15 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:15 [INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
2024-02-02 08:08:15 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:15 [INFO] |      Accuracy      | 88.1250  |    82.0486    |       n/a        |
2024-02-02 08:08:15 [INFO] | Duration (seconds) | 1.8365   |    1.5482     |       n/a        |
2024-02-02 08:08:15 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:15 [INFO] Save tunin

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:18 [INFO] Tune 2 result is: [Accuracy (int8|fp32): 83.0208|88.1250, Duration (seconds) (int8|fp32): 1.5677|1.8365], Best tune result is: n/a
2024-02-02 08:08:18 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:18 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:18 [INFO] |     Info Type      | Baseline | Tune 2 result | Best tune result |
2024-02-02 08:08:18 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:18 [INFO] |      Accuracy      | 88.1250  |    83.0208    |       n/a        |
2024-02-02 08:08:18 [INFO] | Duration (seconds) | 1.8365   |    1.5677     |       n/a        |
2024-02-02 08:08:18 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:18 [INFO] Save tunin

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:21 [INFO] Tune 4 result is: [Accuracy (int8|fp32): 82.0486|88.1250, Duration (seconds) (int8|fp32): 1.6091|1.8365], Best tune result is: n/a
2024-02-02 08:08:21 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:21 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:21 [INFO] |     Info Type      | Baseline | Tune 4 result | Best tune result |
2024-02-02 08:08:21 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:21 [INFO] |      Accuracy      | 88.1250  |    82.0486    |       n/a        |
2024-02-02 08:08:21 [INFO] | Duration (seconds) | 1.8365   |    1.6091     |       n/a        |
2024-02-02 08:08:21 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:21 [INFO] Save tunin

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:25 [INFO] Tune 5 result is: [Accuracy (int8|fp32): 82.3611|88.1250, Duration (seconds) (int8|fp32): 1.6219|1.8365], Best tune result is: n/a
2024-02-02 08:08:25 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:25 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:25 [INFO] |     Info Type      | Baseline | Tune 5 result | Best tune result |
2024-02-02 08:08:25 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:25 [INFO] |      Accuracy      | 88.1250  |    82.3611    |       n/a        |
2024-02-02 08:08:25 [INFO] | Duration (seconds) | 1.8365   |    1.6219     |       n/a        |
2024-02-02 08:08:25 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:25 [INFO] Save tunin

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:28 [INFO] Tune 6 result is: [Accuracy (int8|fp32): 83.1944|88.1250, Duration (seconds) (int8|fp32): 1.7554|1.8365], Best tune result is: n/a
2024-02-02 08:08:28 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:28 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:28 [INFO] |     Info Type      | Baseline | Tune 6 result | Best tune result |
2024-02-02 08:08:28 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:28 [INFO] |      Accuracy      | 88.1250  |    83.1944    |       n/a        |
2024-02-02 08:08:28 [INFO] | Duration (seconds) | 1.8365   |    1.7554     |       n/a        |
2024-02-02 08:08:28 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:28 [INFO] Save tunin

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.
2024-02-02 08:08:31 [INFO] Tune 7 result is: [Accuracy (int8|fp32): 86.2946|88.1250, Duration (seconds) (int8|fp32): 1.8099|1.8365], Best tune result is: [Accuracy: 86.2946, Duration (seconds): 1.8099]
2024-02-02 08:08:31 [INFO] |**********************Tune Result Statistics**********************|
2024-02-02 08:08:31 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:31 [INFO] |     Info Type      | Baseline | Tune 7 result | Best tune result |
2024-02-02 08:08:31 [INFO] +--------------------+----------+---------------+------------------+
2024-02-02 08:08:31 [INFO] |      Accuracy      | 88.1250  |    86.2946    |     86.2946      |
2024-02-02 08:08:31 [INFO] | Duration (seconds) | 1.8365   |    1.8099     |     1.8099       |
2024-02-02 08:08:31 [INFO] +--------------------+----------+---------------+-------------
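The log above can be condensed into a speed versus accuracy summary. The sketch below copies the F1 scores and durations from the log lines and computes the relative accuracy drop and the speedup for each recorded trial. The durations are single measurements on 64 examples, so treat the speedups as rough indications rather than benchmarks. The helper name `summarize` is ours.

```python
# FP32 baseline and per-trial (F1, seconds) values copied from the tuning log
baseline = {"f1": 88.1250, "seconds": 1.8365}
trials = {
    1: (82.0486, 1.5482),
    2: (83.0208, 1.5677),
    4: (82.0486, 1.6091),
    5: (82.3611, 1.6219),
    6: (83.1944, 1.7554),
    7: (86.2946, 1.8099),
}


def summarize(f1, seconds, baseline):
    """Return (relative accuracy drop, speedup factor) against the baseline."""
    relative_drop = (baseline["f1"] - f1) / baseline["f1"]
    speedup = baseline["seconds"] / seconds
    return relative_drop, speedup


for trial, (f1, seconds) in trials.items():
    relative_drop, speedup = summarize(f1, seconds, baseline)
    print(f"Tune {trial}: accuracy drop {relative_drop:.2%}, speedup {speedup:.2f}x")
```

The pattern is the trade-off discussed at the start of the notebook: the fastest configurations lose the most accuracy, and the configuration that stays within tolerance gives up most of the measured speedup.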

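## Conclusion and Discussion

### Conclusion

In this workshop, we've walked through the process of dynamically quantizing a model using the Optimum Intel library. We learned the importance of balancing accuracy and computational efficiency and gained hands-on experience in configuring and applying dynamic quantization to a DistilBERT model. In the run logged above, the FP32 baseline scored an F1 of 88.13 on the 64-example subset, and trial 7 was the first trial the log records as a best result, with an F1 of 86.29, inside the 5% relative tolerance.

### Discussion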

The skills and knowledge acquired in this session are critical for developers looking to optimize NLP models for production environments, especially where resource constraints are a consideration. Understanding the nuances of model quantization, particularly in the context of dynamic vs. static approaches, empowers developers to make informed decisions about deploying AI models in various scenarios.

As we continue to push the boundaries of AI efficiency, the ability to effectively quantize models while maintaining their performance will be an invaluable asset in the toolkit of any AI practitioner.