##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#### Developed by AI/ML GDE [Nitin Tiwari](https://linkedin.com/in/tiwari-nitin).
* LinkedIn: [linkedin.com/in/tiwari-nitin](https://linkedin.com/in/tiwari-nitin)
* GitHub: [github.com/NSTiwari](https://github.com/NSTiwari)
* X: [@NSTiwari21](https://x.com/NSTiwari21)




## Convert PaliGemma 2 to ONNX and run inference in the browser using Transformers.js

This notebook covers Part 1 of the implementation for converting and quantizing the PaliGemma 2 Vision Language Model to ONNX for inference with Transformers.js.

* [Part 1]: [Convert and quantize PaliGemma 2 to ONNX.](https://github.com/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_2]Convert_PaliGemma2_to_ONNX.ipynb)

* [Part 2]: [Inference the converted model using 🤗 Transformers.js for tasks like image captioning, zero-shot object detection, OCR, and visual Q&A.](https://github.com/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_2]Inference_PaliGemma2_with_Transformers_js.ipynb)

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_2]Convert_PaliGemma2_to_ONNX.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>


### Get access to PaliGemma 2

Before using PaliGemma 2 for the first time, you must request access to the model through Hugging Face by completing the following steps:

1. Log in to [Hugging Face](https://huggingface.co), or create a new Hugging Face account if you don't already have one.
2. Go to the [PaliGemma 2 model card](https://huggingface.co/google/paligemma2-3b-pt-224) to get access to the model.
3. Complete the consent form and accept the terms and conditions.

To generate a Hugging Face token, open your [**Settings** page in Hugging Face](https://huggingface.co/settings), choose the **Access Tokens** option in the left pane, and click **New token**. In the window that appears, name your token and set its type to **Write** so that it has write access.

Then, in Colab, select **Secrets** (🔑) in the left pane and add your Hugging Face token. Store your Hugging Face token under the name `HF_TOKEN`.
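As a minimal sketch, the token stored under `HF_TOKEN` can be read with a fallback for environments outside Colab. The helper name `get_hf_token` is hypothetical; inside Colab it relies on `google.colab.userdata`, and elsewhere it assumes the token was exported as an environment variable of the same name.

```python
import os


def get_hf_token(name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from Colab Secrets or the environment."""
    token = None
    try:
        # Available only inside a Colab runtime.
        from google.colab import userdata

        token = userdata.get(name)
    except Exception:
        # Outside Colab, or when the secret is missing or not shared with the notebook.
        token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"No token found under the name {name!r}.")
    return token
```

Failing early with a clear error is usually easier to debug than a rejected download later in the notebook.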

### Select the runtime

To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to load the PaliGemma 2 model. In this case, you need at least an L4 GPU:

1. In the upper-right of the Colab window, click the **▾ (Additional connection options)** dropdown menu.
1. Select **Change runtime type**.
1. Under **Hardware accelerator**, select **L4 GPU**.
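To confirm that the runtime actually exposes a GPU before starting the conversion, a quick check along these lines may help. It is only a sketch: the helper name `list_gpus` is hypothetical, and it assumes the `nvidia-smi` command-line tool is on the path, which is typically the case on Colab GPU runtimes.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return the GPU names reported by nvidia-smi, or an empty list if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected; change the runtime type before continuing.")
```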

### Step 1: Install libraries and dependencies
*Note: You might need to restart the runtime after the cell finishes execution.*

In [None]:
!pip install -q --upgrade git+https://github.com/huggingface/transformers.git
!pip install optimum[exporters]
!pip install onnxslim
!pip install onnxconverter_common
!pip install onnx_graphsurgeon==0.5.2
!pip install onnxruntime
!pip install onnxruntime-tools
!pip install optimum[onnxruntime]
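After the installation (and any runtime restart), it can be useful to confirm which versions ended up installed. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the packages installed above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = [
    "transformers",
    "optimum",
    "onnx",
    "onnxslim",
    "onnxconverter-common",
    "onnx-graphsurgeon",
    "onnxruntime",
    "onnxruntime-tools",
]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```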


### Step 2: Set up the environment
Before converting PaliGemma 2 to ONNX (Open Neural Network Exchange), add the following line of code:

`GLOBALS.onnx_shape_inference = False`

Add it before line 662 in the file `/usr/local/lib/python3.11/dist-packages/torch/onnx/utils.py`, as follows:

```
# Add the below line.
GLOBALS.onnx_shape_inference = False
if GLOBALS.onnx_shape_inference:
        _C._jit_pass_onnx_graph_shape_type_inference(
            graph, params_dict, GLOBALS.export_onnx_opset_version
        )
```

This adjustment is a temporary workaround for a [bug](https://github.com/pytorch/pytorch/issues/147259) in PyTorch until a permanent fix is implemented.

*Note: Restart the runtime for the changes to take effect.*
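The exact line number may differ between PyTorch releases, so it can help to locate the `if GLOBALS.onnx_shape_inference:` statement programmatically before editing. This is a minimal sketch; the helper name `find_marker_lines` is hypothetical, and the path is the one given in this step.

```python
from pathlib import Path


def find_marker_lines(path, marker="if GLOBALS.onnx_shape_inference:"):
    """Return the 1-based line numbers in `path` whose stripped text starts with `marker`."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [i for i, line in enumerate(lines, start=1) if line.strip().startswith(marker)]


# Path taken from the step above; adjust it for a different Python version.
utils_path = Path("/usr/local/lib/python3.11/dist-packages/torch/onnx/utils.py")
if utils_path.exists():
    print(find_marker_lines(utils_path))
```

The new assignment goes on the line directly above the reported match, at the same indentation as the `if` statement.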

### Step 3: Convert PaliGemma 2 to ONNX
Now, we're ready to begin the conversion process. This process involves converting the PaliGemma 2 model weights, which include:

* Language Decoder (Gemma 2)
* Vision Encoder (SigLIP)
* Embedding Tokens
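Of these three parts, the decoder is exported with its key/value cache flattened into individually named tensors, one key and one value per layer. The sketch below illustrates that naming scheme. The patterns `past_key_values.{i}.{key}` and `present.{i}.{key}` come from the export cell in this step, while the helper name `kv_cache_names` and the layer count of 2 are illustrative assumptions.

```python
def kv_cache_names(num_layers: int, prefix: str) -> list[str]:
    """List the flattened cache tensor names for a decoder with `num_layers` layers."""
    return [f"{prefix}.{i}.{key}" for i in range(num_layers) for key in ("key", "value")]


# Inputs carry the cache from earlier steps; outputs return the updated cache.
input_names = ["inputs_embeds", "position_ids"] + kv_cache_names(2, "past_key_values")
output_names = ["logits"] + kv_cache_names(2, "present")
print(input_names)
print(output_names)
```

Flat, predictable names let a runtime feed each `present.*` output back in as the matching `past_key_values.*` input on the next decoding step.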

In [None]:
import os
from google.colab import userdata

os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [None]:
# Choose the PaliGemma 2 variant.

model_id = "paligemma2-3b-mix-224" # @param ["paligemma2-3b-mix-224", "paligemma2-3b-mix-448", "paligemma2-3b-pt-224", "paligemma2-3b-ft-docci-448", "paligemma2-3b-pt-448", "paligemma2-3b-pt-896"]
model_id = f"google/{model_id}"

In [None]:
import os
import torch
import torch.nn as nn
from transformers import (
    AutoProcessor,
    PaliGemmaForConditionalGeneration,
    DynamicCache,
)

print(f"Converting {model_id} to ONNX.")

def new_len(self: torch.Tensor):
    return self.shape[0]

torch.Tensor.__len__ = new_len


class VisionEncoder(nn.Module):
    def __init__(self, paligemma_model):
        super().__init__()
        self.config = paligemma_model.config
        self.vision_tower = paligemma_model.vision_tower
        self.multi_modal_projector = paligemma_model.multi_modal_projector

    def forward(self, pixel_values: torch.FloatTensor):
        """
        Obtains the last hidden states of the images from the vision tower and applies the multimodal projection.

        Args:
            pixel_values (`torch.FloatTensor` of shape `(batch_size, channels, height, width)`):
                The tensors corresponding to the input images.
        Returns:
            image_features (`torch.Tensor`): Image feature tensor of shape `(num_images, image_length, embed_dim)`.
        """
        image_outputs = self.vision_tower(pixel_values)
        selected_image_feature = image_outputs.last_hidden_state
        image_features = self.multi_modal_projector(selected_image_feature)
        # Scale the projected features by the square root of the text hidden size.
        image_features = image_features / (self.config.text_config.hidden_size**0.5)
        return image_features


class PatchedPaliGemmaForConditionalGeneration(PaliGemmaForConditionalGeneration):
    def forward(self, *args):
        inputs_embeds, position_ids, *past_key_values_args = args
        config = self.config.text_config  # Use the instance config rather than the global `model`

        # Convert past_key_values list to DynamicCache
        if len(past_key_values_args) == 0:
            past_key_values = None
        else:
            past_key_values = DynamicCache(config.num_hidden_layers)
            for i in range(config.num_hidden_layers):
                key = past_key_values_args.pop(0)
                value = past_key_values_args.pop(0)
                past_key_values.update(key_states=key, value_states=value, layer_idx=i)


        batch_size = inputs_embeds.shape[0]

        o = self.language_model.forward(
            inputs_embeds=inputs_embeds,
            # Create a 4D attention mask of all zeros (attend to everything)
            attention_mask=torch.zeros(
                batch_size,
                1, # num_attention_heads (1 -> expand to num_attention_heads)
                1, # sequence_length (1 -> expand to sequence_length)
                1, # total_sequence_length (1 -> expand to total_sequence_length)
                dtype=torch.float32,
            ),
            position_ids=position_ids,
            past_key_values=past_key_values,
        )

        flattened_past_key_values_outputs = {
            "logits": o.logits,
        }
        output_past_key_values: DynamicCache = o.past_key_values
        for i, (key, value) in enumerate(
            zip(output_past_key_values.key_cache, output_past_key_values.value_cache)
        ):
            flattened_past_key_values_outputs[f"present.{i}.key"] = key
            flattened_past_key_values_outputs[f"present.{i}.value"] = value

        return flattened_past_key_values_outputs


# Constants
OUTPUT_FOLDER = os.path.join("output", model_id)
TEXT_MODEL_NAME = "decoder_model_merged.onnx"
VISION_MODEL_NAME = "vision_encoder.onnx"
EMBED_MODEL_NAME = "embed_tokens.onnx"
TEMP_MODEL_OUTPUT_FOLDER = os.path.join(OUTPUT_FOLDER, "temp")
FINAL_MODEL_OUTPUT_FOLDER = os.path.join(OUTPUT_FOLDER, "onnx")


# Load model and processor
model = PatchedPaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
).eval()
vision_model = VisionEncoder(model)
embed_layer = model.language_model.model.embed_tokens

processor = AutoProcessor.from_pretrained(model_id)

# Save model configs and processor
model.config.save_pretrained(OUTPUT_FOLDER)
model.generation_config.save_pretrained(OUTPUT_FOLDER)
processor.save_pretrained(OUTPUT_FOLDER)
os.makedirs(TEMP_MODEL_OUTPUT_FOLDER, exist_ok=True)


# Configuration values
## Text model
text_config = model.config.text_config
num_attention_heads = text_config.num_attention_heads
num_key_value_heads = text_config.num_key_value_heads
head_dim = text_config.head_dim
num_layers = text_config.num_hidden_layers
hidden_size = text_config.hidden_size

# Dummy input sizes
batch_size = 2
sequence_length = 32
past_sequence_length = 8

## Text inputs
dummy_past_key_values_kwargs = {
    f"past_key_values.{i}.{key}": torch.zeros(
        batch_size,
        num_key_value_heads,
        past_sequence_length,
        head_dim,
        dtype=torch.float32,
    )
    for i in range(num_layers)
    for key in ["key", "value"]
}
inputs_embeds = torch.randn(
    (batch_size, sequence_length, hidden_size),
)

total_sequence_length = sequence_length + past_sequence_length
position_ids = torch.arange(1, sequence_length + 1, dtype=torch.int64).expand(batch_size, sequence_length)

text_inputs = dict(
    inputs_embeds=inputs_embeds,
    position_ids=position_ids,
    **dummy_past_key_values_kwargs,
)
text_inputs_positional = tuple(text_inputs.values())
text_outputs = model.forward(*text_inputs_positional)  # Test forward pass

## Vision inputs
size = processor.image_processor.size
w, h = size['width'], size['height']
pixel_values = torch.randn(2, 3, h, w, requires_grad=True)
vision_inputs = dict(pixel_values=pixel_values)
vision_inputs_positional = tuple(vision_inputs.values())
vision_outputs = vision_model.forward(*vision_inputs_positional)  # Test forward pass



# ONNX Exports
from torch.onnx._globals import GLOBALS
GLOBALS.onnx_shape_inference = False # Bug in pytorch

## Text model (Gemma 2).
TEXT_MODEL_OUTPUT_PATH=os.path.join(TEMP_MODEL_OUTPUT_FOLDER, TEXT_MODEL_NAME)
torch.onnx.export(
    model,
    args=text_inputs_positional,
    f=TEXT_MODEL_OUTPUT_PATH,
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=list(text_inputs.keys()),
    output_names=["logits"]
    + [f"present.{i}.{key}" for i in range(num_layers) for key in ["key", "value"]],
    dynamic_axes={
        "inputs_embeds": {0: "batch_size", 1: "sequence_length"},
        "position_ids": {0: "batch_size", 1: "sequence_length"},
        **{
            f"past_key_values.{i}.{key}": {0: "batch_size", 2: "past_sequence_length"}
            for i in range(num_layers)
            for key in ["key", "value"]
        },
        "logits": {0: "batch_size", 1: "sequence_length"},
        **{
            f"present.{i}.{key}": {0: "batch_size", 2: "total_sequence_length"}
            for i in range(num_layers)
            for key in ["key", "value"]
        },
    },
    external_data_format=True,
)

## Vision model (SigLIP).
VISION_MODEL_OUTPUT_PATH = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, VISION_MODEL_NAME)
torch.onnx.export(
    vision_model,
    args=vision_inputs_positional,
    f=VISION_MODEL_OUTPUT_PATH,
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=['pixel_values'],
    output_names=['image_features'],
    dynamic_axes={
        'pixel_values': {0: 'batch_size'},
        'image_features': {0: 'batch_size'}
    },
)

input_ids = torch.randint(0, embed_layer.num_embeddings, (batch_size, sequence_length))

## Embedding model
EMBED_MODEL_OUTPUT_PATH = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, EMBED_MODEL_NAME)
torch.onnx.export(
    embed_layer,
    args=(input_ids,),
    f=EMBED_MODEL_OUTPUT_PATH,
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=['input_ids'],
    output_names=['inputs_embeds'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'sequence_length'},
        'inputs_embeds': {0: 'batch_size', 1: 'sequence_length'}
    },
)


# Post-processing
import onnx
import onnxslim
from optimum.onnx.graph_transformations import check_and_save_model

os.makedirs(FINAL_MODEL_OUTPUT_FOLDER, exist_ok=True)
for name in (TEXT_MODEL_NAME, VISION_MODEL_NAME, EMBED_MODEL_NAME):
    temp_model_path = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, name)

    onnx.shape_inference.infer_shapes_path(temp_model_path, check_type=True, strict_mode=True)

    ## Attempt to optimize the model with onnxslim
    """
    try:
        onnx_model = onnxslim.slim(temp_model_path)
    except Exception as e:
        print(f"Failed to slim {temp_model_path}: {e}")
        onnx_model = onnx.load(temp_model_path)
    """
    onnx_model = onnx.load(temp_model_path)

    ## Save model
    final_model_path = os.path.join(FINAL_MODEL_OUTPUT_FOLDER, name)
    check_and_save_model(onnx_model, final_model_path)


# Minify tokenizer.json
import json
tokenizer_path = os.path.join(OUTPUT_FOLDER, "tokenizer.json")
with open(tokenizer_path, "r") as f:
    tokenizer = json.load(f)
with open(tokenizer_path, "w") as f:
    json.dump(tokenizer, f) # No need for indenting

# Add head_dim and num_image_tokens to config.json
config_path = os.path.join(OUTPUT_FOLDER, "config.json")
with open(config_path, "r") as f:
    config = json.load(f)
config["text_config"]["head_dim"] = head_dim
config["num_image_tokens"] = config["text_config"]["num_image_tokens"]
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)


## Cleanup
import shutil
shutil.rmtree(TEMP_MODEL_OUTPUT_FOLDER)

Converting google/paligemma2-3b-mix-224 to ONNX.
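Once the conversion cell finishes, listing the exported files and their sizes is a simple sanity check. This sketch uses only the standard library; the helper name `list_files_with_sizes` is hypothetical, and `output` is the top-level folder used by the conversion cell.

```python
import os


def list_files_with_sizes(folder):
    """Return (relative_path, size_in_bytes) pairs for every file under `folder`, sorted by path."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full_path = os.path.join(root, name)
            entries.append((os.path.relpath(full_path, folder), os.path.getsize(full_path)))
    return sorted(entries)


# Prints nothing if the folder does not exist yet.
for rel_path, size in list_files_with_sizes("output"):
    print(f"{size / 1e6:10.1f} MB  {rel_path}")
```

The listing should include the three ONNX files named in the cell's constants (`decoder_model_merged.onnx`, `vision_encoder.onnx`, and `embed_tokens.onnx`) along with the saved configuration and processor files.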




### Step 4: Quantize the ONNX model weights (optional, but recommended)
To optimize inference performance, it is recommended to quantize the ONNX model weights. We will be quantizing to the following precision data types:

* fp16
* int8
* uint8
* q4
* q4f16
* bnb4

The overall quantization process will take approximately 40-45 minutes.
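To give an intuition for what the lower-precision formats do, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It illustrates the general idea only and is not the implementation used by the quantization script; the function names and sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 codes plus a scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor maps the largest magnitude to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.50, -1.27, 0.03, 0.90]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case rounding error
```

Each weight is stored as a one-byte code instead of a four-byte float, at the cost of a rounding error of at most half the scale per value. The 4-bit modes push the same trade-off further.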

In [None]:
# Python script to quantize the ONNX model weights.
!wget https://raw.githubusercontent.com/NSTiwari/PaliGemma2-ONNX-Transformers.js/main/quantize.py

# Create a new directory to store quantized weights.
!mkdir onnx_model_quantized

--2025-02-19 19:11:50--  https://raw.githubusercontent.com/NSTiwari/PaliGemma2-ONNX-Transformers.js/main/quantize.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12362 (12K) [text/plain]
Saving to: ‘quantize.py’


2025-02-19 19:11:51 (133 MB/s) - ‘quantize.py’ saved [12362/12362]



In [None]:
!python quantize.py \
  --input_folder $FINAL_MODEL_OUTPUT_FOLDER \
  --output_folder onnx_model_quantized \
  --modes fp16 int8 uint8 q4 q4f16 bnb4 \
  --per_channel \
  --reduce_range \
  --block_size 64 \
  --is_symmetric \
  --accuracy_level 2 \
  --quant_type 1

Streaming output truncated to the last 5000 lines.
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /vision_tower/vision_model/encoder/layers.11/self_attn/Transpose_2 ...
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /vision_tower/vision_model/encoder/layers.11/self_attn/Sqrt_1 ...
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /vision_tower/vision_model/encoder/layers.11/self_attn/Mul ...
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /vision_tower/vision_model/encoder/layers.11/self_attn/Sqrt_2 ...
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /vision_tower/vision_model/encoder/layers.11/self_attn/Mul_1 ...
2025-02-19 19:54:31,514 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - start to quantize /vision_tower/v

In [None]:
# Copy the quantized ONNX weights to the final model output folder.
source = "/content/onnx_model_quantized/."
destination = f"/content/output/{model_id}/onnx/"

!cp -a $source $destination
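Outside Colab, where the `!cp` shell shortcut is not available, the same copy can be sketched with the standard library. The helper name `copy_tree_contents` is hypothetical, and the destination path below assumes the default model choice from Step 3.

```python
import os
import shutil


def copy_tree_contents(source, destination):
    """Copy everything inside `source` into `destination`, keeping files already there."""
    shutil.copytree(source, destination, dirs_exist_ok=True)


source = "/content/onnx_model_quantized"
# Assumes model_id is "google/paligemma2-3b-mix-224".
destination = "/content/output/google/paligemma2-3b-mix-224/onnx"
if os.path.isdir(source):
    copy_tree_contents(source, destination)
```

`dirs_exist_ok=True` (Python 3.8 and later) lets the quantized files land next to the full-precision ones instead of raising an error because the folder already exists.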

### Step 5: Upload the ONNX weights to Hugging Face

In [None]:
from huggingface_hub import create_repo, upload_folder, whoami

# Output directory.
output_dir = f"/content/output/{model_id}/"
# whoami() reads the token from the HF_TOKEN environment variable set in Step 3.
username = whoami()["name"]
# Name the repository after the selected variant instead of hardcoding it.
repo_id = f"{username}/{model_id.split('/')[-1]}-onnx"

repo_id = create_repo(repo_id, exist_ok=True).repo_id

upload_folder(
    repo_id=repo_id,
    folder_path=output_dir,
    commit_message=f"{model_id} ONNX",
    ignore_patterns=["step_*", "epoch_*"],
)


CommitInfo(commit_url='https://huggingface.co/NSTiwari/paligemma2-3b-mix-224-onnx/commit/fb4873c575fbf05f2cbc813bcd524bf06f81b0e7', commit_message='google/paligemma2-3b-mix-224 ONNX', commit_description='', oid='fb4873c575fbf05f2cbc813bcd524bf06f81b0e7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/NSTiwari/paligemma2-3b-mix-224-onnx', endpoint='https://huggingface.co', repo_type='model', repo_id='NSTiwari/paligemma2-3b-mix-224-onnx'), pr_revision=None, pr_num=None)

Congratulations, we have successfully converted and quantized the PaliGemma 2 model to the ONNX format, making it compatible with 🤗 Transformers.js for inference on the web.

Next, to run inference with the converted PaliGemma 2 ONNX model, refer to this [notebook](https://github.com/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_2]Inference_PaliGemma2_with_Transformers_js.ipynb). For the web application, check out this [demo app](https://github.com/google-gemini/gemma-cookbook/tree/main/Demos/PaliGemma2-on-Web).