# Quantize your VLM with 🤗 Optimum Intel

This notebook shows how to quantize a vision language model (VLM) with [Optimum Intel](https://huggingface.co/docs/optimum-intel/en/openvino/optimization) and OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF).

Quantization is a technique that reduces the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit integers.
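
To make the idea concrete, here is a minimal sketch of asymmetric 8-bit quantization in plain Python. It illustrates the general principle only: the helper names and the sample weights are made up for this example, and NNCF's actual implementation is more involved than this.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are equal
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [-0.82, -0.11, 0.0, 0.37, 1.25]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.4f} (scale = {scale:.4f})")
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error, which for these sample values stays within half the scale.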


## Step 1: Installation and Setup

First, let's install the required dependencies.

If you're opening this notebook on Colab, you will probably need to install them first. Uncomment the following cell and run it:

In [1]:
#! pip install "optimum-intel[openvino]" datasets num2words

## Step 2: Preparation

Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.


![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg)

Load the processor and prepare the inputs:

In [2]:
import transformers
from transformers import AutoProcessor
from transformers.image_utils import load_image
transformers.logging.set_verbosity_error()

model_id = "echarlaix/SmolVLM2-256M-Video-Instruct-openvino"
processor = AutoProcessor.from_pretrained(model_id)
prompt, img_url = "What is on the flower?", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[load_image(img_url)], return_tensors="pt")

print(img_url)

https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg


## Step 3: Load Original Model and Test

Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.


In [3]:
from optimum.intel import OVModelForVisualCausalLM


model = OVModelForVisualCausalLM.from_pretrained(model_id)
fp32_model_path = "smolvlm_ov"
model.save_pretrained(fp32_model_path)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

User:



What is on the flower?
Assistant: A bee is on the flower.


## Step 4: Configure and Apply Quantization

Now we'll configure the quantization settings and apply them to create an INT8 version of our model. We'll use weight-only quantization for size reduction with minimal accuracy loss. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization).


### Step 4a: Configure Quantization Settings


In [4]:
from optimum.intel import OVQuantizationConfig, OVWeightQuantizationConfig, OVPipelineQuantizationConfig

dataset, num_samples = "contextual", 50

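# Weight-only quantization (data-free): only the weights are compressed to 8 bits
woq_data_free = OVWeightQuantizationConfig(bits=8)

# Static quantization of the language model (weights and activations), calibrated on
# `num_samples` samples of `dataset`; the embedding models use weight-only quantization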
ppl_q = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVWeightQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)


The provided dataset won't have any effect on the resulting compressed model because no data-aware quantization algorithm is selected and compression ratio is 1.0.
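
The `ppl_q` configuration is defined above, but the rest of this notebook only applies `woq_data_free`. As a minimal sketch, and assuming `from_pretrained` accepts an `OVPipelineQuantizationConfig` through the same `quantization_config` argument used in the next step, it could be applied as follows. The helper name `quantize_and_save` and the output folder `smolvlm_static_int8` are hypothetical.

```python
def quantize_and_save(model_id, quantization_config, save_dir):
    """Load `model_id`, apply `quantization_config` on load, and save the result."""
    # Imported lazily so that defining this helper does not require optimum-intel
    from optimum.intel import OVModelForVisualCausalLM

    model = OVModelForVisualCausalLM.from_pretrained(
        model_id, quantization_config=quantization_config
    )
    model.save_pretrained(save_dir)
    return model


# Example call (not run here): static quantization has to fetch the calibration data.
# static_model = quantize_and_save(model_id, ppl_q, "smolvlm_static_int8")
```

Static quantization runs calibration samples through the model, so expect it to take noticeably longer than the data-free path.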


### Step 4b: Apply Quantization


In [5]:
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=woq_data_free)
int8_model_path = "smolvlm_int8"
q_model.save_pretrained(int8_model_path)



INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 100% (211 / 211)            │ 100% (211 / 211)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_sym                  │ 100% (1 / 1)                │ 100% (1 / 1)                           │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_sym                  │ 100% (75 / 75)              │ 100% (75 / 75)                         │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



## Step 5: Compare Results

Let's test the quantized model and compare it with the original model in terms of both output quality and model size.


### Step 5a: Test Quantized Model Output


In [6]:
# Generate outputs with quantized model
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])

User:



What is on the flower?
Assistant: A bee is on the flower.


### Step 5b: Compare Model Sizes

Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:


In [7]:
from pathlib import Path

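def get_model_size(model_folder):
    """Return the total size in MB of the OpenVINO IR files in `model_folder`."""
    model_size = 0
    for file in Path(model_folder).iterdir():
        # Each model is stored as an .xml graph plus a .bin weights file
        if file.suffix == ".xml":
            model_size += file.stat().st_size + file.with_suffix(".bin").stat().st_size
    model_size /= 1000 * 1000  # bytes -> MB
    return model_size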

In [None]:
fp32_model_size = get_model_size(fp32_model_path)
int8_model_size = get_model_size(int8_model_path)
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")
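
As a rough sanity check on the measured sizes, the expected ratio can be estimated from the number of bytes each weight occupies. This is a back-of-the-envelope sketch: the parameter count is an assumption read off the "256M" in the model name, and the helper `estimate_weight_size_mb` is hypothetical. Real files also hold the graph definition and the quantization scales, so measured numbers will differ somewhat.

```python
BYTES_PER_WEIGHT = {"fp32": 4, "int8": 1}


def estimate_weight_size_mb(num_parameters, precision):
    """Estimate the weight storage in MB for a given precision."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / (1000 * 1000)


num_parameters = 256_000_000  # assumption: taken from the "256M" in the model name

fp32_estimate = estimate_weight_size_mb(num_parameters, "fp32")
int8_estimate = estimate_weight_size_mb(num_parameters, "int8")

print(f"Estimated FP32 weights: {fp32_estimate:.0f} MB")
print(f"Estimated INT8 weights: {int8_estimate:.0f} MB")
print(f"Estimated ratio: {fp32_estimate / int8_estimate:.1f}x")
```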

## Conclusion

Great! We've successfully quantized our VLM using Optimum Intel. The results show:

1. **Quality**: On the sample prompt, the quantized model produces the same output as the original model.
2. **Size**: We achieved an approximately 4x reduction in model size (from ~1 GB to ~260 MB).
3. **Performance**: The INT8 model is smaller while maintaining accuracy on this example.

This demonstrates how quantization can significantly reduce model size while preserving the model's accuracy on visual language tasks.
