# Quantize your VLM with 🤗 Optimum Intel

This notebook shows how to quantize a vision-language model (VLM) with [Optimum Intel](https://huggingface.co/docs/optimum-intel/en/openvino/optimization) and OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF).

Quantization is a technique that reduces the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit integers.
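To make the idea concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization applied to a handful of weights. It is only an illustration: the helper names `quantize_int8` and `dequantize` are hypothetical, and the actual scheme NNCF applies (per-channel scales, asymmetric ranges, and so on) may differ.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

Each weight now fits in a single byte instead of four, at the cost of a small reconstruction error bounded by half the scale.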


## Step 1: Installation and Setup

First, let's install the required dependencies. If you're opening this notebook on Colab, you will probably need to install 🤗 Optimum Intel and the other packages listed below, so run the following cell:

In [None]:
! pip install "optimum-intel[openvino]" datasets num2words
! pip install torchvision

## Step 2: Preparation

Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.


![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg)

Load the processor and prepare the inputs:

In [None]:
import transformers
from transformers import AutoProcessor
from transformers.image_utils import load_image
transformers.logging.set_verbosity_error()

model_id = "echarlaix/SmolVLM2-256M-Video-Instruct-openvino"
processor = AutoProcessor.from_pretrained(model_id)
prompt, img_url = "What is on the flower?", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[load_image(img_url)], return_tensors="pt")

## Step 3: Load Original Model and Test

Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.


In [None]:
from optimum.intel import OVModelForVisualCausalLM


model_ov = OVModelForVisualCausalLM.from_pretrained(model_id, load_in_8bit=False)
fp32_model_path = "smolvlm_ov"
model_ov.save_pretrained(fp32_model_path)

# Generate outputs
generated_ids = model_ov.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

## Step 4: Configure and Apply Quantization

Now we'll configure the quantization settings and apply them to create a quantized version of our model. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization) and by playing with the different quantization configurations defined below.


### Step 4a: Configure Quantization Settings

To apply quantization to your model, you need to create a quantization configuration specifying the methodology to use. By default, 8-bit weight-only quantization is applied to the text and vision embeddings components, while the language model is quantized according to the specified `quantization_config`. You can also define a specific quantization configuration for each component by creating an instance of `OVPipelineQuantizationConfig`.

In [None]:
from optimum.intel import OVQuantizationConfig, OVWeightQuantizationConfig, OVPipelineQuantizationConfig

dataset, num_samples = "contextual", 50

# weight-only 8bit
woq_8bit = OVWeightQuantizationConfig(bits=8)

# weight-only 4bit
woq_4bit = OVWeightQuantizationConfig(bits=4, group_size=16)

# static quantization
static_8bit = OVQuantizationConfig(bits=8, dataset=dataset, num_samples=num_samples)

# pipeline quantization: applying a different quantization to each component
ppl_q = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVWeightQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)
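The `group_size` argument used for 4-bit weight-only quantization controls how many weights share a single quantization scale. The pure-Python sketch below illustrates why smaller groups can reduce error when a tensor contains outliers. The function names and the symmetric rounding scheme are assumptions made for illustration, not NNCF's actual implementation.

```python
def quantize_groupwise(weights, bits=4, group_size=16):
    """Quantize weights in groups; each group gets its own scale (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric 4-bit
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = [max(-qmax, min(qmax, round(w / scale))) for w in group]
        groups.append((q, scale))
    return groups


def total_abs_error(weights, groups):
    """Sum of absolute differences between original and dequantized weights."""
    restored = [v * scale for q, scale in groups for v in q]
    return sum(abs(a - b) for a, b in zip(weights, restored))


# 32 synthetic weights: small values plus a single large outlier in the second half.
weights = [0.01 * i for i in range(16)] + [5.0] + [0.02] * 15

groups_16 = quantize_groupwise(weights, bits=4, group_size=16)
groups_32 = quantize_groupwise(weights, bits=4, group_size=32)
err16 = total_abs_error(weights, groups_16)
err32 = total_abs_error(weights, groups_32)
print(f"group_size=16: {len(groups_16)} groups, total error {err16:.3f}")
print(f"group_size=32: {len(groups_32)} groups, total error {err32:.3f}")
```

With one scale for all 32 weights, the outlier forces a coarse step size on every value. Splitting into groups of 16 confines that effect to the group containing the outlier, at the cost of storing one extra scale.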


### Step 4b: Apply Quantization

You can now apply quantization to your model. Here we apply the 8-bit weight-only quantization defined in `woq_8bit`.

In [None]:
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=woq_8bit)
int8_model_path = "smolvlm_int8"
q_model.save_pretrained(int8_model_path)

## Step 5: Compare Results

Let's test the quantized model and compare it with the original model in terms of both output quality and model size.


### Step 5a: Test Quantized Model Output


In [None]:
# Generate outputs with quantized model
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

### Step 5b: Compare Model Sizes

Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:


In [None]:
from pathlib import Path

def get_model_size(model_folder):
    model_size = 0
    for file in Path(model_folder).iterdir():
        if file.suffix==".xml":
            model_size += file.stat().st_size + file.with_suffix(".bin").stat().st_size
    model_size /= 1000 * 1000
    return model_size

In [None]:
fp32_model_size = get_model_size(fp32_model_path)
int8_model_size = get_model_size(int8_model_path)
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")
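As a rough sanity check on the measured sizes, the storage needed for the weights can be estimated from the parameter count and the number of bits per weight. The sketch below assumes about 256 million parameters, a figure inferred only from the model name, and ignores metadata, scales, and any layers kept in higher precision, so measured file sizes will differ somewhat.

```python
def estimated_size_mb(num_params, bits_per_weight):
    """Back-of-the-envelope weight storage in MB (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


# Assumption: parameter count taken from the "256M" in the model name.
num_params = 256_000_000

estimates = {
    label: estimated_size_mb(num_params, bits)
    for label, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]
}
for label, size in estimates.items():
    print(f"{label}: ~{size:.0f} MB")

print(f"Expected FP32 / INT8 ratio: {estimates['FP32'] / estimates['INT8']:.1f}x")
```

If the measured ratio is noticeably lower than the estimate, some components were probably left in a higher precision.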

### Step 5c: Compare Performance on Different Intel Hardware Platforms

In [None]:
import os
import time
from optimum.intel import OVModelForVisualCausalLM

class InferRequestWrapper:
    """
    A helper class to track pipeline components' inference time.
    """
    def __init__(self, request, infer_time_values):
        self.request = request
        self.infer_time_values = infer_time_values
        self._start_async_time = None

    def reset_state(self):
        self.request.reset_state()

    def get_tensor(self, name):
        return self.request.get_tensor(name)

    def __call__(self, *args, **kwargs):
        start_time = time.perf_counter()
        result = self.request(*args, **kwargs)
        end_time = time.perf_counter()
        self.infer_time_values.append(end_time - start_time)
        return result

    def start_async(self, *args, **kwargs):
        assert self._start_async_time is None, "start_async is already in progress"
        self._start_async_time = time.perf_counter()
        return self.request.start_async(*args, **kwargs)

    def wait(self):
        assert self._start_async_time is not None, "start_async must be called before wait"
        result = self.request.wait()
        self.infer_time_values.append(time.perf_counter() - self._start_async_time)
        self._start_async_time = None
        return result


def benchmark(model, inputs, model_dir: str, nb_pass=10, warmup=4, max_tokens=50):
    """
    Benchmark an OV visual causal LM model.

    Returns a dict with:
    - avg_latency_sec
    - image_throughput
    - first_token_throughput
    - second_token_throughput
    - model_size_mb
    """

    # --- Patch OpenVINO InferRequest objects to track inference time ---
    model.compile()
    lm_model_time_values = []
    vision_embed_time_values = []
    first_token_latencies = []
    model.language_model.request = InferRequestWrapper(model.language_model.request, lm_model_time_values)
    model.vision_embeddings.request = InferRequestWrapper(model.vision_embeddings.request, vision_embed_time_values)
    
    # --- Warmup ---
    for _ in range(warmup):
        _ = model.generate(**inputs)

    lm_model_time_values.clear()
    vision_embed_time_values.clear()

    # --- Timed inference ---
    start = time.perf_counter()
    for _ in range(nb_pass):
        last_infer_count = len(lm_model_time_values)
        outputs = model.generate(**inputs, max_new_tokens=max_tokens)
        first_token_latencies.append(lm_model_time_values[last_infer_count])
    end = time.perf_counter()

    # --- Unpatch InferRequest objects ---
    model.language_model.request = model.language_model.request.request
    model.vision_embeddings.request = model.vision_embeddings.request.request

    # --- Throughput calculations ---
    avg_latency = (end - start) / nb_pass
    
    avg_vision_embed_time = sum(vision_embed_time_values) / len(vision_embed_time_values)
    avg_first_token_latency = sum(first_token_latencies) / len(first_token_latencies)
    avg_second_token_latency = (sum(lm_model_time_values) - sum(first_token_latencies)) / \
        (len(lm_model_time_values) - len(first_token_latencies))

    batch_size = inputs["pixel_values"].shape[0] if "pixel_values" in inputs else 1
    image_throughput = batch_size / avg_vision_embed_time

    # --- Model size ---
    model_size_bytes = sum(
        os.path.getsize(os.path.join(model_dir, f))
        for f in os.listdir(model_dir)
        if f.startswith("openvino_")
    )
    model_size_mb = model_size_bytes / (1024**2)

    return {
        "avg_latency_sec": avg_latency,
        "image_throughput": image_throughput,
        "first_token_throughput": 1 / avg_first_token_latency,
        "second_token_throughput": 1 / avg_second_token_latency,
        "model_size_mb": model_size_mb,
    }
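The benchmark separates the first language-model call of each `generate()` pass (which processes the whole prompt) from the subsequent calls (one per generated token). The standalone sketch below reproduces that bookkeeping on synthetic timings so the arithmetic is easy to follow. The function name `summarize_token_latencies` and the sample numbers are hypothetical.

```python
def summarize_token_latencies(passes):
    """
    passes: list of lists, each holding the per-call durations (seconds)
    recorded by the language model during one generate() pass.
    Returns (avg_first_token_latency, avg_subsequent_token_latency).
    """
    lm_times = []
    first_token_latencies = []
    for durations in passes:
        start_index = len(lm_times)  # where this pass begins in the flat list
        lm_times.extend(durations)
        first_token_latencies.append(lm_times[start_index])

    avg_first = sum(first_token_latencies) / len(first_token_latencies)
    rest_total = sum(lm_times) - sum(first_token_latencies)
    rest_count = len(lm_times) - len(first_token_latencies)
    avg_rest = rest_total / rest_count
    return avg_first, avg_rest


# Synthetic timings: a slow prompt-processing call followed by fast decoding steps.
passes = [[0.30, 0.02, 0.02, 0.02], [0.28, 0.02, 0.02]]
avg_first, avg_rest = summarize_token_latencies(passes)
print(f"First token throughput: {1 / avg_first:.2f} t/s")
print(f"Second token throughput: {1 / avg_rest:.2f} t/s")
```

Reporting the two figures separately matters because the first call is dominated by prompt and image processing, while later calls reflect steady-state decoding speed.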


#### Run benchmark

In [None]:
# Check for available hardware platforms

# `openvino.runtime` is deprecated in recent OpenVINO releases; import from the top-level package
from openvino import Core

core = Core()
devices = core.available_devices
device_list = []

for device in devices:
    try:
        # Use FULL_DEVICE_NAME if available, else fallback to device ID
        name = core.get_property(device, "FULL_DEVICE_NAME")
    except Exception:
        name = device
    device_list.append(device)  # keep the device ID for model loading
    print(f"{device}: {name}")


In [None]:
# --- Local models ---
models = {
    "SmolVLM2-256M (full)": fp32_model_path,
    "SmolVLM2-256M-int8": int8_model_path
}

# --- Run benchmark ---
for model_name, model_dir in models.items():
    for device in device_list:
        print(f"\nBenchmarking {model_name} on {device}...")

        # Load model for the specific device
        model_ov = OVModelForVisualCausalLM.from_pretrained(
            model_dir, export=False, device=device
        )

        # Run benchmark
        results = benchmark(model_ov, inputs, model_dir=model_dir)

        # Print results
        print(
            f"Latency: {results['avg_latency_sec']:.4f}s | "
            f"Image throughput: {results['image_throughput']:.2f} im/s | "
            f"First token throughput: {results['first_token_throughput']:.2f} t/s | "
            f"Second token throughput: {results['second_token_throughput']:.2f} t/s | "
            f"Model size: {results['model_size_mb']:.2f} MB"
        )


## Conclusion

Great! We've successfully quantized our VLM using Optimum Intel. The results show:

1. **Quality**: The quantized model produces the same output as the original model on our sample image and prompt
2. **Size**: We achieved an approximately 4x reduction in model size (from ~1GB to ~260MB)
3. **Performance**: The INT8 model is much smaller while maintaining accuracy, and the benchmark in Step 5c reports its latency and throughput on each available device

This demonstrates how quantization can significantly reduce model size while preserving the model's accuracy on vision-language tasks.
