# Quantize your VLM with 🤗 Optimum Intel

This notebook shows how to quantize a vision-language model (VLM) with [Optimum Intel](https://huggingface.co/docs/optimum-intel/en/openvino/optimization) and OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF).

Quantization reduces the computational and memory cost of inference by representing the weights and/or the activations with lower-precision data types, such as 8-bit or 4-bit.
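
To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is a simplified illustration, not the algorithm NNCF applies; the function names and the sample weights are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers using one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in quantized]


# Illustrative values only, not taken from any real model
weights = [0.12, -0.5, 0.33, 0.0, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("max reconstruction error:", max_error)
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is bounded by half the scale.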


## Step 1: Installation and Setup

First, let's install the required dependencies.

If you're opening this notebook on Colab, you will probably need to install the packages below. Run the following cell to make sure everything is installed as expected:

In [None]:
! pip install "optimum-intel[openvino]" datasets num2words torchvision
! pip install git+https://github.com/huggingface/optimum-benchmark.git

## Step 2: Preparation

Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.


![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg)

Load the processor and prepare the inputs:

In [None]:
import transformers
from transformers import AutoProcessor
from transformers.image_utils import load_image
transformers.logging.set_verbosity_error()

model_id = "echarlaix/SmolVLM2-256M-Video-Instruct-openvino"
processor = AutoProcessor.from_pretrained(model_id)
prompt = "What is on the flower?"
img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[load_image(img_url)], return_tensors="pt")

## Step 3: Load Original Model and Test

Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.


In [None]:
from optimum.intel import OVModelForVisualCausalLM


model_ov = OVModelForVisualCausalLM.from_pretrained(model_id, load_in_8bit=False)
fp32_model_path = "smolvlm_ov"
model_ov.save_pretrained(fp32_model_path)

# Generate outputs
generated_ids = model_ov.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

## Step 4: Configure and Apply Quantization

Now we'll configure the quantization settings and apply them to create a quantized version of our model. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization) or experiment with the different quantization configurations defined below.


### Step 4a: Configure Quantization Settings

To quantize your model, create a quantization configuration that specifies the method to use. By default, 8-bit weight-only quantization is applied to the text and vision embedding components, while the language model is quantized according to the `quantization_config` you provide. To set a specific configuration for each component, create an instance of `OVPipelineQuantizationConfig`.

In [None]:
from optimum.intel import OVQuantizationConfig, OVWeightQuantizationConfig, OVPipelineQuantizationConfig

dataset, num_samples = "contextual", 50

# weight-only 8bit
woq_8bit = OVWeightQuantizationConfig(bits=8)

# weight-only 4bit
woq_4bit = OVWeightQuantizationConfig(bits=4, group_size=16)

# static quantization
static_8bit = OVQuantizationConfig(bits=8, dataset=dataset, num_samples=num_samples)

# pipeline quantization: apply a different quantization configuration to each component
ppl_q = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVWeightQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)
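
The `group_size` argument used for `woq_4bit` controls how many weights share one set of quantization parameters. The sketch below estimates the storage cost of that choice. It assumes one 16-bit scale per group and no zero point, which may not match the exact layout NNCF produces, and `effective_bits_per_weight` is a hypothetical helper written for this example.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16):
    """Average bits stored per weight when each group carries one extra scale.

    Assumption: a single scale of `scale_bits` bits per group, no zero point.
    """
    return bits + scale_bits / group_size


for group_size in (16, 64, 128):
    bits = effective_bits_per_weight(4, group_size)
    print(f"group_size={group_size}: {bits} bits per weight")
```

Under these assumptions, smaller groups cost more storage per weight, which is the trade-off to weigh against the finer-grained scales they provide.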

### Step 4b: Apply Quantization

You can now quantize the model. Here we apply the 8-bit weight-only quantization defined in `woq_8bit`.

In [None]:
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=woq_8bit)
int8_model_path = "smolvlm_int8"
q_model.save_pretrained(int8_model_path)

## Step 5: Compare Results

Let's test the quantized model and compare it with the original model in terms of both output quality and model size.


### Step 5a: Test Quantized Model Output


In [None]:
# Generate outputs with quantized model
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
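
Comparing the two outputs by eye works for a single prompt. For more prompts, a rough similarity score can help. The sketch below uses the standard library's `difflib`; the helper name `output_similarity` and the two strings are placeholders for illustration, not real outputs from the models above.

```python
from difflib import SequenceMatcher


def output_similarity(reference, candidate):
    """Return a similarity ratio in [0, 1] between two texts, compared word by word."""
    return SequenceMatcher(None, reference.split(), candidate.split()).ratio()


# Placeholder strings; substitute the decoded outputs of the FP32 and INT8 models
fp32_text = "A bee is on the flower."
int8_text = "A bee is on a flower."

print(output_similarity(fp32_text, fp32_text))
print(output_similarity(fp32_text, int8_text))
```

A ratio of 1.0 means the outputs match word for word. Lower values flag prompts worth inspecting manually, although a different wording is not necessarily a worse answer.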

### Step 5b: Compare Model Sizes

Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:


In [None]:
from pathlib import Path

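def get_model_size(model_folder):
    """Return the total size in MB of the OpenVINO IR files (.xml plus matching .bin) in a folder."""
    model_size = 0
    for file in Path(model_folder).iterdir():
        if file.suffix == ".xml":
            # Each .xml graph definition has a companion .bin file holding the weights
            model_size += file.stat().st_size + file.with_suffix(".bin").stat().st_size
    return model_size / (1000 * 1000)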

In [None]:
fp32_model_size = get_model_size(fp32_model_path)
int8_model_size = get_model_size(int8_model_path)
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")
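
As a rough sanity check, the expected sizes can be estimated from the parameter count alone. The sketch below assumes about 256 million parameters, a figure inferred from the model name rather than measured, and ignores quantization scales and any layers left in higher precision, so the real files will differ somewhat.

```python
def estimated_size_mb(num_params, bits_per_weight):
    """Estimate weight storage in MB (1 MB = 1,000,000 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e6


num_params = 256_000_000  # assumption based on the "256M" in the model name

fp32_estimate = estimated_size_mb(num_params, 32)
int8_estimate = estimated_size_mb(num_params, 8)

print(f"Estimated FP32 size: {fp32_estimate:.0f} MB")
print(f"Estimated INT8 size: {int8_estimate:.0f} MB")
print(f"Estimated ratio: {fp32_estimate / int8_estimate:.1f}x")
```

If the measured ratio is well below the estimate, some components were probably not quantized.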

### Step 5c: Compare Performance on Different Intel Hardware Platforms

In [None]:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    BenchmarkReport,
    InferenceConfig,
    OpenVINOConfig,
    ProcessConfig,
    PyTorchConfig,
)
from optimum_benchmark.logging_utils import setup_logging

setup_logging(level="INFO", prefix="MAIN-PROCESS")

launcher_config = ProcessConfig()
scenario_config = InferenceConfig(
    memory=True,
    latency=True,
    generate_kwargs={"max_new_tokens": 16, "min_new_tokens": 16},
    input_shapes={"batch_size": 1, "sequence_length": 16, "num_images": 1},
)

configs = {
    "pytorch": PyTorchConfig(device="cpu", model=model_id, no_weights=True),
    "openvino": OpenVINOConfig(device="cpu", model=model_id, no_weights=True),
    "openvino-8bit-woq": OpenVINOConfig(
        device="cpu",
        model=model_id,
        no_weights=True,
        quantization_config={"bits": 8, "num_samples": 1, "weight_only": True},
    ),
}

for config_name, backend_config in configs.items():
    benchmark_config = BenchmarkConfig(
        name=f"{config_name}",
        launcher=launcher_config,
        scenario=scenario_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark_report.save_json(f"{config_name}_report.json")
    benchmark_config.save_json(f"{config_name}_config.json")

reports = {}
for config_name in configs.keys():
    reports[config_name] = BenchmarkReport.from_json(f"{config_name}_report.json")

# Plotting results
_, ax = plt.subplots()
ax.boxplot(
    [reports[config_name].prefill.latency.values for config_name in reports.keys()],
    tick_labels=reports.keys(),
    showfliers=False,
)
plt.xticks(rotation=10)
ax.set_ylabel("Latency (s)")
ax.set_xlabel("Configurations")
ax.set_title("Prefill Latencies")
plt.savefig("prefill_latencies_boxplot.png")

_, ax = plt.subplots()
ax.bar(
    list(reports.keys()),
    [reports[config_name].decode.throughput.value for config_name in reports.keys()],
    color=["C0", "C1", "C2", "C3", "C4", "C5"],
)
plt.xticks(rotation=10)
ax.set_xlabel("Configurations")
ax.set_title("Decoding Throughput")
ax.set_ylabel("Throughput (tokens/s)")
plt.savefig("decode_throughput_barplot.png")

In [None]:
# Print results
import json
import pandas as pd

# List of config names
config_names = list(configs.keys())

# Stages we want to include in the table
stages = ["load_model", "first_generate", "prefill", "generate", "decode"]

table_rows = []

for config_name in config_names:
    report_file = f"{config_name}_report.json"
    with open(report_file, "r") as f:
        report_data = json.load(f)
    
    row = {"Configuration": config_name}
    
    for stage in stages:
        stage_data = report_data.get(stage, {})

        # Latency (mean)
        latency_mean = stage_data.get("latency", {}).get("mean")
        row[f"{stage} Latency (s)"] = round(latency_mean, 3) if latency_mean is not None else "N/A"
        
        # Throughput (value + unit)
        throughput_data = stage_data.get("throughput")
        if throughput_data:
            throughput_value = throughput_data.get("value")
            throughput_unit = throughput_data.get("unit", "")
            # Test against None so that a legitimate value of 0 is still reported
            row[f"{stage} Throughput"] = f"{throughput_value:.3f} {throughput_unit}" if throughput_value is not None else "N/A"
        else:
            row[f"{stage} Throughput"] = "N/A"
        
        # Max RAM
        memory_max = stage_data.get("memory", {}).get("max_ram")
        row[f"{stage} Memory (MB)"] = round(memory_max, 2) if memory_max is not None else "N/A"
    
    table_rows.append(row)

# Build the DataFrame
df = pd.DataFrame(table_rows)

# Optional: reorder columns for readability
columns_order = ["Configuration"]
for stage in stages:
    columns_order += [
        f"{stage} Latency (s)",
        f"{stage} Throughput",
        f"{stage} Memory (MB)"
    ]
df = df[columns_order]

df
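
To read the table more easily, it can help to express each configuration's latency relative to a baseline. The sketch below shows one way to do that. `speedup_vs_baseline` is a hypothetical helper, and the numbers are placeholders chosen for easy arithmetic, not measurements from this notebook.

```python
def speedup_vs_baseline(mean_latencies, baseline):
    """Return how many times faster each configuration is than the baseline."""
    base = mean_latencies[baseline]
    return {name: base / latency for name, latency in mean_latencies.items()}


# Placeholder latencies in seconds; replace them with the means from your own reports
placeholder_latencies = {"pytorch": 2.0, "openvino": 1.0, "openvino-8bit-woq": 0.5}

speedups = speedup_vs_baseline(placeholder_latencies, baseline="pytorch")
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Values above 1 mean the configuration is faster than the baseline on the hardware where the benchmark ran.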

## Conclusion

Great! We've successfully quantized our VLM using Optimum Intel. The results show:

1. **Quality**: The quantized model produces the same output as the original model for our test prompt
2. **Size**: We achieved an approximately 4x reduction in model size (from ~1GB to ~260MB)
3. **Performance**: The INT8 model is much smaller while maintaining accuracy, and the benchmark in Step 5c reports its latency and throughput on your hardware

This demonstrates how quantization can significantly reduce model size while preserving the model's accuracy on vision-language tasks.
