# Quantize your VLM with 🤗 Optimum Intel

This notebook shows how to quantize a visual language model (VLM) with [Optimum Intel](https://huggingface.co/docs/optimum-intel/en/openvino/optimization) and OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF).

Quantization is a technique that reduces the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit.
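
To make the idea concrete, here is a minimal NumPy sketch of symmetric 8-bit quantization applied to a small weight tensor. It only illustrates the principle and is not how NNCF implements compression; the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map floats to int8 using a single scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover a float32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage takes a quarter of the bytes of float32 storage
print("bytes fp32:", w.nbytes, "| bytes int8:", q.nbytes)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

The rounding error per weight is bounded by roughly half the scale, which is why 8-bit weights often preserve model outputs while cutting storage by about 4x.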


## Step 1: Installation and Setup

First, let's install the required dependencies.

If you're opening this notebook on Colab, you will probably need to install the packages listed in the cell below. Uncomment the cell and run it:

In [1]:
#! pip install "optimum-intel[openvino]" datasets num2words
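
As an optional sanity check after installing, the sketch below reports which packages can be found in the current environment. The list of top-level module names is an assumption about what the pip command above provides (in particular, that the `openvino` extra pulls in `openvino` and `nncf`), and `check_modules` is a hypothetical helper.

```python
import importlib.util

# Top-level module names assumed to be provided by the pip command above
REQUIRED_MODULES = ["optimum", "openvino", "nncf", "datasets", "num2words"]

def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_modules(REQUIRED_MODULES).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```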

## Step 2: Preparation

Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.


![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg)

Load the processor and prepare the inputs:

In [6]:
import transformers
from transformers import AutoProcessor
from transformers.image_utils import load_image
transformers.logging.set_verbosity_error()

model_id = "echarlaix/SmolVLM2-256M-Video-Instruct-openvino"
processor = AutoProcessor.from_pretrained(model_id)
prompt, img_url = "What is on the flower?", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[load_image(img_url)], return_tensors="pt")

## Step 3: Load Original Model and Test

Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.


In [7]:
from optimum.intel import OVModelForVisualCausalLM


model_ov = OVModelForVisualCausalLM.from_pretrained(model_id, load_in_8bit=False)
fp32_model_path = "smolvlm_ov"
model_ov.save_pretrained(fp32_model_path)

# Generate outputs
generated_ids = model_ov.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

User:



What is on the flower?
Assistant: A bee is on the flower.


## Step 4: Configure and Apply Quantization

Now we'll configure the quantization settings and apply them to create a quantized version of our model. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization) and by playing with the different quantization configurations defined below.


### Step 4a: Configure Quantization Settings

To quantize your model, create a quantization configuration that specifies the method to use. By default, 8-bit weight-only quantization is applied to the text and vision embedding components, while the language model is quantized according to the `quantization_config` you provide. You can also define a specific quantization configuration for each component by creating an instance of `OVPipelineQuantizationConfig`.

In [11]:
from optimum.intel import OVQuantizationConfig, OVWeightQuantizationConfig, OVPipelineQuantizationConfig

dataset, num_samples = "contextual", 50

# weight-only 8bit
woq_8bit = OVWeightQuantizationConfig(bits=8)

# weight-only 4bit
woq_4bit = OVWeightQuantizationConfig(bits=4, group_size=16)

# static quantization
static_8bit = OVQuantizationConfig(bits=8, dataset=dataset, num_samples=num_samples)

# pipeline quantization: apply a different quantization configuration to each component
ppl_q = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVWeightQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)


### Step 4b: Apply Quantization

You can now apply quantization to your model. Here we apply the 8-bit weight-only quantization defined in `woq_8bit`.

In [12]:
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=woq_8bit)
int8_model_path = "smolvlm_int8"
q_model.save_pretrained(int8_model_path)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 100% (211 / 211)            │ 100% (211 / 211)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_sym                  │ 100% (1 / 1)                │ 100% (1 / 1)                           │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_sym                  │ 100% (75 / 75)              │ 100% (75 / 75)                         │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙




INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_sym                  │ 100% (75 / 75)              │ 100% (75 / 75)                         │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙




In [None]:
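from pathlib import Path

# Sum the exported OpenVINO files in the INT8 folder, as the benchmark helper in Step 5c does (bytes / 1024**2)
total_bytes = sum(f.stat().st_size for f in Path(int8_model_path).glob("openvino_*"))
print(f"Total size: {total_bytes / (1024**2):.2f} MB")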

Total size: 248.86 MB
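
The next cell calls an `elapsed_time` helper that is not defined in this notebook. A minimal sketch of what such a helper could look like is shown below, assuming it runs `warmup` untimed generations followed by `nb_pass` timed ones and returns the mean latency in seconds. The `max_new_tokens` parameter and the default values are assumptions, so the latency it reports may differ from the one recorded below.

```python
import time

def elapsed_time(model, inputs, nb_pass=3, warmup=1, max_new_tokens=50):
    """Average wall-clock latency (seconds) of model.generate over nb_pass runs."""
    # Untimed warm-up runs
    for _ in range(warmup):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Timed runs
    start = time.perf_counter()
    for _ in range(nb_pass):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / nb_pass
```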


In [13]:
# 2 rounds of warm-up and 5 rounds of inference (to measure the average)
avg_latency = elapsed_time(q_model, inputs, nb_pass=5, warmup=2)
print(f"Average Inference latency: {avg_latency:.4f} seconds")

Average Inference latency: 2.1692 seconds


## Step 5: Compare Results

Let's test the quantized model and compare it with the original model in terms of output quality, model size, and inference performance.


### Step 5a: Test Quantized Model Output


In [6]:
# Generate outputs with quantized model
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

User:



What is on the flower?
Assistant: A bee is on the flower.


### Step 5b: Compare Model Sizes

Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:


In [7]:
from pathlib import Path

def get_model_size(model_folder):
    """Return the total size of the OpenVINO IR files (.xml + .bin) in MB (10**6 bytes)."""
    model_size = 0
    for file in Path(model_folder).iterdir():
        if file.suffix == ".xml":
            model_size += file.stat().st_size + file.with_suffix(".bin").stat().st_size
    model_size /= 1000 * 1000
    return model_size

In [8]:
fp32_model_size = get_model_size(fp32_model_path)
int8_model_size = get_model_size(int8_model_path)
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")

FP32 model size: 1028.25 MB
INT8 model size: 260.94 MB
INT8 size decrease: 3.94x
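
The sizes above are in decimal megabytes (bytes / 10^6), whereas the 248.86 MB printed for the INT8 model in Step 4b is computed as bytes / 1024^2. The short sketch below converts between the two to show that the figures agree; `mib_to_mb` is a hypothetical helper name.

```python
def mib_to_mb(size_mib: float) -> float:
    """Convert mebibytes (1024**2 bytes) to decimal megabytes (10**6 bytes)."""
    return size_mib * 1024**2 / 1000**2

int8_mib = 248.86  # value printed in Step 4b
print(f"{int8_mib} MiB is about {mib_to_mb(int8_mib):.1f} MB")
```

The small remaining gap to 260.94 MB comes from the 248.86 value already being rounded to two decimals.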


### Step 5c: Compare Performance on Different Intel Hardware Platforms

In [None]:
import os
import time
from optimum.intel import OVModelForVisualCausalLM

def benchmark(model, inputs, model_dir: str, nb_pass=10, warmup=4, max_tokens=50):
    """
    Benchmark an OV visual causal LM model.

    Returns a dict with:
    - avg_latency_sec
    - images_per_sec
    - tokens_per_sec (counts prompt + generated tokens)
    - model_size_mb (bytes / 1024**2)
    """
    # --- Warmup (same generation length as the timed runs) ---
    for _ in range(warmup):
        _ = model.generate(**inputs, max_new_tokens=max_tokens)

    # --- Timed inference ---
    start = time.perf_counter()
    for _ in range(nb_pass):
        outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    end = time.perf_counter()

    avg_latency = (end - start) / nb_pass

    # --- Throughput calculations ---
    batch_size = inputs["pixel_values"].shape[0] if "pixel_values" in inputs else 1
    num_tokens = outputs.shape[-1]  # full sequence length: prompt tokens + generated tokens

    images_per_sec = batch_size / avg_latency
    tokens_per_sec = num_tokens / avg_latency

    # --- Model size ---
    model_size_bytes = sum(
        os.path.getsize(os.path.join(model_dir, f))
        for f in os.listdir(model_dir)
        if f.startswith("openvino_")
    )
    model_size_mb = model_size_bytes / (1024**2)

    return {
        "avg_latency_sec": avg_latency,
        "images_per_sec": images_per_sec,
        "tokens_per_sec": tokens_per_sec,
        "model_size_mb": model_size_mb,
    }
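
Note that `outputs.shape[-1]` is the full sequence length, so the `tokens_per_sec` value returned above includes the prompt tokens as well as the generated ones. If you want decoding throughput only, a sketch like the following subtracts the prompt length. It assumes `generate` returns the prompt followed by the new tokens (as the decoded outputs in this notebook suggest), and the helper name `generated_tokens_per_sec` is hypothetical.

```python
import numpy as np

def generated_tokens_per_sec(output_ids, prompt_len: int, avg_latency: float) -> float:
    """Throughput that counts only the newly generated tokens."""
    new_tokens = output_ids.shape[-1] - prompt_len
    return new_tokens / avg_latency

# Toy example: a 10-token prompt followed by 5 generated tokens, 0.5 s per call
fake_output = np.zeros((1, 15), dtype=np.int64)
print(generated_tokens_per_sec(fake_output, prompt_len=10, avg_latency=0.5))  # 10.0
```

Inside `benchmark`, `prompt_len` would be `inputs["input_ids"].shape[-1]`.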


#### Run benchmark

In [None]:
# Check for available hardware platforms

from openvino import Core

core = Core()
devices = core.available_devices

for device in devices:
    try:
        # Try to get the full device name property
        name = core.get_property(device, "FULL_DEVICE_NAME")
    except Exception:
        # If the property is not available, just use the device ID
        name = device
    print(f"{device}: {name}")


CPU: Intel(R) Core(TM) Ultra 7 265K
GPU.0: Intel(R) Graphics (iGPU)
GPU.1: Intel(R) Arc(TM) B580 Graphics (dGPU)
GPU.2: Intel(R) Arc(TM) A770 Graphics (dGPU)
NPU: Intel(R) AI Boost


In [21]:
# --- Local models ---
models = {
    "SmolVLM2-256M (full)": fp32_model_path,
    "SmolVLM2-256M-int8": int8_model_path
}
# Device IDs of the machine used for this run; adjust them to the devices listed above
devices = ["CPU", "GPU.0", "GPU.1", "GPU.2"]

# --- Run benchmark ---
for model_name, model_dir in models.items():
    for device in devices:
        print(f"\nBenchmarking {model_name} on {device}...")
        model_ov = OVModelForVisualCausalLM.from_pretrained(model_dir, export=False, device=device)
        results = benchmark(model_ov, inputs, model_dir=model_dir)
        print(f"Latency: {results['avg_latency_sec']:.4f}s | Images/sec: {results['images_per_sec']:.2f} | Tokens/sec: {results['tokens_per_sec']:.2f} | Model size: {results['model_size_mb']:.2f} MB")



Benchmarking SmolVLM2-256M (full) on CPU...
Latency: 3.5740s | Images/sec: 0.28 | Tokens/sec: 247.34 | Model size: 980.61 MB

Benchmarking SmolVLM2-256M (full) on GPU.0...
Latency: 2.1468s | Images/sec: 0.47 | Tokens/sec: 411.78 | Model size: 980.61 MB

Benchmarking SmolVLM2-256M (full) on GPU.1...
Latency: 0.2025s | Images/sec: 4.94 | Tokens/sec: 4364.79 | Model size: 980.61 MB

Benchmarking SmolVLM2-256M (full) on GPU.2...
Latency: 0.3380s | Images/sec: 2.96 | Tokens/sec: 2615.45 | Model size: 980.61 MB

Benchmarking SmolVLM2-256M-int8 on CPU...
Latency: 2.1794s | Images/sec: 0.46 | Tokens/sec: 405.61 | Model size: 248.86 MB

Benchmarking SmolVLM2-256M-int8 on GPU.0...
Latency: 2.6433s | Images/sec: 0.38 | Tokens/sec: 338.96 | Model size: 248.86 MB

Benchmarking SmolVLM2-256M-int8 on GPU.1...
Latency: 0.2289s | Images/sec: 4.37 | Tokens/sec: 3861.41 | Model size: 248.86 MB

Benchmarking SmolVLM2-256M-int8 on GPU.2...
Latency: 0.3563s | Images/sec: 2.81 | Tokens/sec: 2481.35 | Model size: 248.86 MB
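
To read these numbers more easily, the sketch below computes the FP32-to-INT8 latency ratio per device. The latencies are copied from the output above; nothing is re-measured.

```python
# Average latencies (seconds) copied from the benchmark output above
fp32_latency = {"CPU": 3.5740, "GPU.0": 2.1468, "GPU.1": 0.2025, "GPU.2": 0.3380}
int8_latency = {"CPU": 2.1794, "GPU.0": 2.6433, "GPU.1": 0.2289, "GPU.2": 0.3563}

# A ratio above 1 means the INT8 model is faster than FP32 on that device
speedup = {dev: fp32_latency[dev] / int8_latency[dev] for dev in fp32_latency}
for dev, ratio in speedup.items():
    print(f"{dev}: {ratio:.2f}x")
```

In this run only the CPU shows a ratio above 1; on the three GPUs the INT8 model is slightly slower than FP32.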


## Conclusion

Great! We've successfully quantized our VLM using Optimum Intel. The results show:

1. **Quality**: On our sample prompt, the quantized model produces the same output as the original model
2. **Size**: We achieved approximately 4x reduction in model size (from ~1GB to ~260MB)
3. **Performance**: On CPU, average latency dropped from 3.57s to 2.18s (about 1.6x faster); on the three GPUs tested, the INT8 model was slightly slower than the FP32 model (for example, 0.23s vs 0.20s on GPU.1)

This demonstrates how quantization can significantly reduce model size while preserving output quality on this visual language example. The latency benefit depends on the target device.
