# Kosmos-2: Multimodal Large Language Model and OpenVINO

[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) with new capabilities for multimodal grounding and referring. KOSMOS-2 can understand multimodal input, follow instructions, perceive object descriptions (e.g., bounding boxes), and ground language to the visual world.

Multimodal Large Language Models (MLLMs) have successfully served as a general-purpose interface across a wide range of language, vision, and vision-language tasks. MLLMs can perceive general modalities, including text, images, and audio, and generate free-form text responses under zero-shot and few-shot settings.

[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large language models. Grounding makes human-AI interaction in vision-language tasks more convenient and efficient: the user can point directly to an object or region in the image instead of typing a detailed text description, and the model understands that region together with its spatial location. Grounding also enables the model to respond with visual answers (i.e., bounding boxes), which supports additional vision-language tasks such as referring expression comprehension. Compared with text-only responses, visual answers are more accurate and resolve coreference ambiguity. In addition, grounding can link noun phrases and referring expressions in the generated free-form text to image regions, providing more accurate, informative, and comprehensive answers.


![image](https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg)

#### Table of contents:
- [Install requirements](#Install-requirements)
- [Original model inference](#Original-model-inference)
- [Convert models to OpenVINO Intermediate representation (IR) format](#Convert-models-to-OpenVINO-Intermediate-representation-(IR)-format)


## Install requirements
[back to top ⬆️](#Table-of-contents:)


In [1]:
%pip install --upgrade pip
%pip install -q "transformers>=4.35" Pillow
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision
%pip uninstall -q -y openvino-dev openvino openvino-nightly
%pip install -q "openvino-nightly"

Note: you may need to restart the kernel to use updated packages.
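
After restarting the kernel, it can be useful to confirm that the installed packages satisfy the `transformers>=4.35` requirement from the install cell. The following is a minimal sketch that uses only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, and the comparison deliberately looks only at the major and minor version numbers.

```python
import re
from importlib import metadata


def version_tuple(version: str):
    """Return (major, minor) parsed from a version string such as '4.36.0.dev0'."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:2])


def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if `package` is installed and at least as new as `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


# Report what is installed in the current environment.
for package in ("transformers", "torch", "openvino-nightly"):
    try:
        print(package, metadata.version(package))
    except metadata.PackageNotFoundError:
        print(package, "is not installed")

print("transformers>=4.35:", meets_minimum("transformers", "4.35"))
```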


## Original model inference


In [1]:
import requests

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq


model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

prompt = "<grounding>An image of"

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

# The original Kosmos-2 demo saves the image first and then reloads it. For some images, this gives a slightly different image input and changes the generation output.
image.save("new_image.jpg")
image = Image.open("new_image.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)
print(generated_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
# Specify `cleanup_and_extract=False` in order to see the raw model generation.
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)

print(processed_text)
# `<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>.`

# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)

print(processed_text)
# `An image of a snowman warming himself by a fire.`

print(entities)
# `[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])]`

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


tensor([[    0, 64003,     4,     5,     6,     7,     8,     9,    10,    11,
            12,    13,    14,    15,    16,    17,    18,    19,    20,    21,
            22,    23,    24,    25,    26,    27,    28,    29,    30,    31,
            32,    33,    34,    35,    36,    37,    38,    39,    40,    41,
            42,    43,    44,    45,    46,    47,    48,    49,    50,    51,
            52,    53,    54,    55,    56,    57,    58,    59,    60,    61,
            62,    63,    64,    65,    66,    67, 64004, 64012,   712,  1648,
             9, 64007,    10, 43867, 64008, 64009, 64057, 64876, 64010,  5950,
           597,    32, 64007,    10,   646, 64008, 64009, 64018, 64924, 64010,
             4,     2]])
<image>. the, to and of as in I that' for is was- on’ it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of<phrase
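
The `<patch_index_XXXX>` tokens in the raw generation and the normalized boxes in `entities` are two views of the same information. The sketch below shows one way to convert a pair of patch indices into a box, assuming the image is divided into a 32x32 grid and that each index refers to the centre of its grid cell. Both assumptions are inferred from the reference values in the comments above, not taken from the processor source; `patch_index_to_box` is a hypothetical helper, and pairs that share a row or column are not treated specially here.

```python
def patch_index_to_box(upper_left: int, lower_right: int, patches_per_side: int = 32):
    """Map two patch indices to a normalized (x1, y1, x2, y2) box.

    Assumption: indices are row-major over a square grid and each index
    is mapped to the centre of its cell.
    """
    cell = 1.0 / patches_per_side
    x1 = (upper_left % patches_per_side + 0.5) * cell
    y1 = (upper_left // patches_per_side + 0.5) * cell
    x2 = (lower_right % patches_per_side + 0.5) * cell
    y2 = (lower_right // patches_per_side + 0.5) * cell
    return x1, y1, x2, y2


# <patch_index_0044><patch_index_0863> is the pair attached to "a snowman".
print(patch_index_to_box(44, 863))  # (0.390625, 0.046875, 0.984375, 0.828125)
# <patch_index_0005><patch_index_0911> is the pair attached to "a fire".
print(patch_index_to_box(5, 911))   # (0.171875, 0.015625, 0.484375, 0.890625)
```

Multiplying the x values by the image width and the y values by the image height gives pixel coordinates for drawing.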

## Convert models to OpenVINO Intermediate representation (IR) format
[back to top ⬆️](#Table-of-contents:)

In [2]:
inputs["input_ids"]

tensor([[    0, 64003,     4,     5,     6,     7,     8,     9,    10,    11,
            12,    13,    14,    15,    16,    17,    18,    19,    20,    21,
            22,    23,    24,    25,    26,    27,    28,    29,    30,    31,
            32,    33,    34,    35,    36,    37,    38,    39,    40,    41,
            42,    43,    44,    45,    46,    47,    48,    49,    50,    51,
            52,    53,    54,    55,    56,    57,    58,    59,    60,    61,
            62,    63,    64,    65,    66,    67, 64004, 64012,   712,  1648,
             9]])
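
The prompt tensor has a recognizable layout: a start token, an opening image token, 64 placeholder tokens (IDs 4 to 67), a closing image token, and then the text prompt. The sketch below rebuilds a position mask that flags only the placeholder slots, which is presumably what `image_embeds_position_mask` encodes. The token IDs for `<image>` and `</image>` are inferred from the decoded output above and are assumptions; `image_position_mask` is a hypothetical helper.

```python
IMAGE_OPEN = 64003   # assumed ID of <image>
IMAGE_CLOSE = 64004  # assumed ID of </image>


def image_position_mask(token_ids):
    """Return 1 for tokens strictly between <image> and </image>, else 0."""
    mask, inside = [], False
    for token_id in token_ids:
        if token_id == IMAGE_CLOSE:
            inside = False
        mask.append(1 if inside else 0)
        if token_id == IMAGE_OPEN:
            inside = True
    return mask


# Same IDs as the `inputs["input_ids"]` tensor printed above.
prompt_ids = [0, 64003, *range(4, 68), 64004, 64012, 712, 1648, 9]
mask = image_position_mask(prompt_ids)
print(len(prompt_ids), sum(mask))  # 71 64
```

The 64 flagged positions match the `latent_query_num` value of 64 reported in the model configuration.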

In [3]:
generated_ids[0][len(inputs["input_ids"][0]):]

tensor([64007,    10, 43867, 64008, 64009, 64057, 64876, 64010,  5950,   597,
           32, 64007,    10,   646, 64008, 64009, 64018, 64924, 64010,     4,
            2])
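
The continuation above can be read without the tokenizer once a few special IDs are known. The sketch below pulls out the patch index pairs enclosed by `<object>` and `</object>`. The three constants are inferred by lining up this tensor with the raw text `<object><patch_index_0044><patch_index_0863></object>` shown earlier, so treat them as assumptions; `extract_patch_pairs` is a hypothetical helper.

```python
OBJECT_OPEN = 64009       # assumed ID of <object>
OBJECT_CLOSE = 64010      # assumed ID of </object>
PATCH_INDEX_BASE = 64013  # assumed ID of <patch_index_0000>


def extract_patch_pairs(token_ids):
    """Collect (upper_left, lower_right) patch indices from each <object> span."""
    pairs, current, inside = [], [], False
    for token_id in token_ids:
        if token_id == OBJECT_OPEN:
            inside, current = True, []
        elif token_id == OBJECT_CLOSE:
            if len(current) == 2:
                pairs.append(tuple(current))
            inside = False
        elif inside and token_id >= PATCH_INDEX_BASE:
            current.append(token_id - PATCH_INDEX_BASE)
    return pairs


# Newly generated tokens, copied from the tensor above.
continuation = [64007, 10, 43867, 64008, 64009, 64057, 64876, 64010, 5950, 597,
                32, 64007, 10, 646, 64008, 64009, 64018, 64924, 64010, 4, 2]
print(extract_patch_pairs(continuation))  # [(44, 863), (5, 911)]
```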

In [4]:
model.config

Kosmos2Config {
  "_name_or_path": "microsoft/kosmos-2-patch14-224",
  "architectures": [
    "Kosmos2ForConditionalGeneration"
  ],
  "latent_query_num": 64,
  "model_type": "kosmos-2",
  "text_config": {
    "_name_or_path": "",
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.1,
    "attention_heads": 32,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.1,
    "early_stopping": false,
    "embed_dim": 2048,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "ffn_dim": 8192,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "id2label": {
      "0": "LABEL_0",

In [5]:
import gc
from pathlib import Path

import torch
import openvino as ov


model.config.torchscript = True

models_base_folder = Path("models")


def cleanup_torchscript_cache():
    """
    Helper for removing cached model representation
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

### Convert the vision model
[back to top ⬆️](#Table-of-contents:)

In [6]:
vision_model_ir_path = models_base_folder / "vision_model.xml"


if not vision_model_ir_path.exists():
    with torch.no_grad():
        ov_model = ov.convert_model(model.vision_model, example_input=inputs["pixel_values"])

    ov.save_model(ov_model, vision_model_ir_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
    print("Vision model successfully converted to IR")
else:
    print(f"Vision model will be loaded from {vision_model_ir_path}")

  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):


Vision model successfully converted to IR


### Convert Image To Text Projection model
[back to top ⬆️](#Table-of-contents:)

In [7]:
from torch import nn


image_to_text_projection_model_ir_path = models_base_folder / "image_to_text_projection_model.xml"


def get_image_embeds(pixel_values):
    vision_model_output = model.vision_model(pixel_values)
    image_embeds = model.vision_model.model.post_layernorm(vision_model_output[0])
    image_embeds = nn.functional.normalize(image_embeds, dim=-1)

    return image_embeds


if not image_to_text_projection_model_ir_path.exists():
    image_embeds = get_image_embeds(inputs["pixel_values"])
    
    with torch.no_grad():
        ov_model = ov.convert_model(model.image_to_text_projection, example_input=image_embeds)

    ov.save_model(ov_model, image_to_text_projection_model_ir_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
    print("Image To Text Projection model successfully converted to IR")
else:
    print(f"Image To Text Projection model will be loaded from {image_to_text_projection_model_ir_path}")

  if a.grad is not None:


Image To Text Projection model successfully converted to IR
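
Before the projection, `get_image_embeds` applies `nn.functional.normalize(image_embeds, dim=-1)`, which rescales every embedding vector to unit L2 length. The pure-Python sketch below shows the same operation on a single vector; `l2_normalize` is a hypothetical helper, and `eps` mirrors the small constant PyTorch uses to avoid division by zero.

```python
import math


def l2_normalize(vector, eps: float = 1e-12):
    """Divide a vector by its L2 norm (clamped from below by `eps`)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)
print(math.sqrt(sum(x * x for x in unit)))  # length of the result, expected to be 1
```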


### Convert Text model 
[back to top ⬆️](#Table-of-contents:)

In [8]:
from typing import Optional, Tuple, List
import torch.nn.functional as F
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
from transformers.models.kosmos2.modeling_kosmos2 import create_position_ids_from_input_ids


first_stage_model_path = models_base_folder / "cosmos_input_embed.xml"
second_stage_model_path = models_base_folder / "cosmos_with_past.xml"


def get_image_embeds(pixel_values):
    vision_model_output = model.vision_model(pixel_values)
    image_embeds = model.vision_model.model.post_layernorm(vision_model_output[0])
    image_embeds = nn.functional.normalize(image_embeds, dim=-1)
    image_embeds, _ = model.image_to_text_projection(image_embeds)

    return image_embeds


def flattenize_inputs(inputs):
    """
    Helper function that flattens nested inputs into a single list, skipping `None` entries
    """
    flatten_inputs = []
    for input_data in inputs:
        if input_data is None:
            continue
        if isinstance(input_data, (list, tuple)):
            flatten_inputs.extend(flattenize_inputs(input_data))
        else:
            flatten_inputs.append(input_data)
    return flatten_inputs


def postprocess_converted_model(ov_model, example_input=None, input_names=None, output_names=None, dynamic_shapes=None):
    """
    Helper function for postprocessing a converted model: updates input and output names
    according to the requested specification
    """

    flatten_example_inputs = flattenize_inputs(example_input) if example_input else []
    if input_names:
        for inp_name, m_input, input_data in zip(input_names, ov_model.inputs, flatten_example_inputs):
            m_input.get_tensor().set_names({inp_name})
    
    if output_names:
        for out, out_name in zip(ov_model.outputs, output_names):
            out.get_tensor().set_names({out_name})

    return ov_model


def convert_text_model():
    model.text_model.model.config.torchscript = True
    model.text_model.config.torchscript = True
    image_embeds = get_image_embeds(inputs["pixel_values"])
    conv_inputs = {
        'input_ids': inputs["input_ids"],
        'attention_mask': inputs["attention_mask"],
        'image_embeds': image_embeds,
        'image_embeds_position_mask': inputs["image_embeds_position_mask"],
    }
    outs = model.text_model.model(**conv_inputs)
    inputs_ = ["input_ids", 'attention_mask']
    outputs = ["logits"]
    dynamic_shapes = {"input_ids": {1: "seq_len"}, "attention_mask": {1: "seq_len"}, "position_ids": {0: "seq_len"}}
    for idx in range(len(outs[1])):
        inputs_.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
        dynamic_shapes[inputs_[-1]] = {2: "past_sequence + sequence"}
        dynamic_shapes[inputs_[-2]] = {2: "past_sequence + sequence"}
        outputs.extend([f"present.{idx}.key", f"present.{idx}.value"])

    
    
    if not first_stage_model_path.exists():
        ov_model = ov.convert_model(model.text_model.model, example_input=conv_inputs)
        ov_model = postprocess_converted_model(ov_model, output_names=outputs)
        ov.save_model(ov_model, first_stage_model_path, compress_to_fp16=False)
        del ov_model
        cleanup_torchscript_cache()
        gc.collect()
    
    if not second_stage_model_path.exists():
        position_ids = create_position_ids_from_input_ids(
            inputs["input_ids"],
            padding_idx=model.text_model.config.pad_token_id,
            past_key_values_length=0,
        )[:, -1:]

        example_input_second_stage = {
            "input_ids": inputs["input_ids"][:, -1:],
            "attention_mask": inputs["input_ids"].new_ones(1, inputs["input_ids"].shape[1]+1),
            'position_ids': position_ids,
            "past_key_values": outs[1],
        }
        
        ov_model = ov.convert_model(model.text_model.model, example_input=example_input_second_stage)
        ov_model = postprocess_converted_model(
            ov_model, 
            example_input=example_input_second_stage.values(), 
            input_names=inputs_, 
            output_names=outputs, 
            dynamic_shapes=dynamic_shapes
        )
        ov.save_model(ov_model, second_stage_model_path, compress_to_fp16=False)
        del ov_model
        cleanup_torchscript_cache()
        gc.collect()


convert_text_model()     

  if max_pos > self.weights.size(0):
  if input_shape[-1] > 1:
  if attention_mask.size() != (batch_size, 1, seq_length, src_len):
  if past_key_values_length > 0:
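
The second-stage model takes the key/value cache as a flat list of named inputs, while PyTorch passes it as a nested tuple with one `(key, value)` pair per layer. The sketch below illustrates that round trip on placeholder strings instead of tensors. `flatten` follows the same logic as `flattenize_inputs`, the naming scheme is the one used in `convert_text_model`, and `cache_input_names` and `regroup` are hypothetical helpers.

```python
def flatten(nested):
    """Flatten nested lists and tuples into one list, skipping None entries."""
    flat = []
    for item in nested:
        if item is None:
            continue
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def cache_input_names(num_layers: int):
    """Build the per-layer cache input names in key, value order."""
    names = []
    for idx in range(num_layers):
        names.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
    return names


def regroup(flat, group_size: int = 2):
    """Turn a flat sequence back into a tuple of per-layer tuples."""
    return tuple(tuple(flat[i:i + group_size]) for i in range(0, len(flat), group_size))


# Placeholder strings stand in for the key and value tensors of two layers.
past = (("k0", "v0"), ("k1", "v1"))
flat = flatten(past)
named = dict(zip(cache_input_names(len(past)), flat))
print(named)
print(regroup(flat) == past)  # True
```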


#### Select inference device
[back to top ⬆️](#Table-of-contents:)

Select the device to run OpenVINO inference on from the dropdown list:

In [9]:
import ipywidgets as widgets


core = ov.Core()
DEVICE = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

DEVICE

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

In [10]:
import numpy as np


class WraperInternalVisionModel:
    post_layernorm = model.vision_model.model.post_layernorm
    

class VisionModelWrapper(torch.nn.Module):
    def __init__(self, model_ir_path):
        super().__init__()
        self.model = WraperInternalVisionModel()
        self.vision_model = core.compile_model(model_ir_path, DEVICE.value)

    def forward(self, pixel_values, **kwargs):
        vision_model_output = self.vision_model(pixel_values)[0]

        return [torch.from_numpy(vision_model_output)]
    

class ImageToTextProjectionModelWrapper(torch.nn.Module):
    def __init__(self, model_ir_path):
        super().__init__()
        self.image_to_text_projection = core.compile_model(model_ir_path, DEVICE.value)

    def forward(self, image_embeds):
        output = self.image_to_text_projection(image_embeds)
        image_embeds = output[0]
        projection_attentions = output[1]
        return image_embeds, projection_attentions

In [11]:
vision_model_ov = VisionModelWrapper(vision_model_ir_path)
image_to_text_projection_ov = ImageToTextProjectionModelWrapper(image_to_text_projection_model_ir_path)

In [99]:
import ipywidgets as widgets


core = ov.Core()
USE_ORIGINAL_FIRST_STAGE = widgets.Dropdown(
    options=[True, False],
    value=False,
    description='Use original first stage:',
    disabled=False,
)

USE_ORIGINAL_FIRST_STAGE

Dropdown(description='Use original first stage:', index=1, options=(True, False), value=False)

In [103]:
from transformers.generation import GenerationConfig, GenerationMixin
from transformers.models.kosmos2.modeling_kosmos2 import Kosmos2ForConditionalGenerationModelOutput
from transformers import AutoConfig
import numpy as np
import torch


class KosmosForCausalLM(GenerationMixin):
    def __init__(self, vision_model, image_to_text_projection_model, first_stage_model_path, second_stage_model_path, device):
        # Use the constructor arguments rather than notebook-level globals.
        self.model_stage_1 = core.compile_model(first_stage_model_path, device.value)
        self.model_stage_2 = core.read_model(second_stage_model_path)
        self.vision_model = vision_model
        self.image_to_text_projection = image_to_text_projection_model
        self.input_names = {
            key.get_any_name(): idx for idx, key in enumerate(self.model_stage_2.inputs)
        }
        self.output_names = {
            key.get_any_name(): idx for idx, key in enumerate(self.model_stage_2.outputs)
        }
        self.key_value_input_names = [
            key for key in self.input_names if "key_values" in key
        ]
        self.key_value_output_names = [
            key for key in self.output_names if "present" in key
        ]
        self.model_stage_2 = core.compile_model(self.model_stage_2, device.value)

        self.request = self.model_stage_2.create_infer_request()
        self.config = model.config
        self.generation_config = GenerationConfig.from_model_config(model.config)
        self.main_input_name = "input_ids"
        self.device = torch.device("cpu")
        self.num_pkv = 2
        self.lm_head = nn.Linear(in_features=model.text_model.config.embed_dim, out_features=model.text_model.config.vocab_size, bias=False)


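    def get_input_embeddings(self) -> nn.Module:
        # This wrapper has no `self.model`; delegate to the original text model's embeddings.
        return model.text_model.model.embed_tokens

    def set_input_embeddings(self, value):
        model.text_model.model.embed_tokens = value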

    def get_output_embeddings(self) -> nn.Module:
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def can_generate(self):
        """Returns True to validate the check that the model using `GenerationMixin.generate()` can indeed generate."""
        return True

    def __call__(
        self,
        input_ids=None,
        attention_mask: Optional[torch.Tensor] = None,
        image_embeds: Optional[torch.Tensor] = None,
        image_embeds_position_mask: Optional[torch.Tensor] = None,
        position_ids = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        **kwargs,
    ):
        return self.forward(
            input_ids, attention_mask, image_embeds, image_embeds_position_mask, position_ids, past_key_values
        )

    def forward(
        self,
        input_ids=None,
        
        attention_mask: Optional[torch.Tensor] = None,
        image_embeds: Optional[torch.Tensor] = None,
        image_embeds_position_mask: Optional[torch.Tensor] = None,
        position_ids = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        
        **kwargs
    ):
        r"""
        Run one generation step.

        Without `past_key_values`, the first-stage model processes the full prompt together with the image
        embeddings. With `past_key_values`, the second-stage model processes only the last token and reuses
        the cached keys and values.

        Returns:
            `Kosmos2ForConditionalGenerationModelOutput` with `logits` and `past_key_values`.
        """

        if past_key_values is None:
            
            outs = self.model_stage_1(
                {
                    'input_ids': input_ids,
                    'attention_mask': attention_mask,
                    'image_embeds': image_embeds,
                    'image_embeds_position_mask': image_embeds_position_mask,
                }
            )
            #outs = self.model_stage_1([input_ids, attention_mask, image_embeds, image_embeds_position_mask])
            
            lm_logits = model.text_model.lm_head(torch.from_numpy(outs[0]))
            outs2 = model.text_model.model(
                input_ids,
                attention_mask=attention_mask,
                image_embeds=torch.from_numpy(image_embeds),
                image_embeds_position_mask=image_embeds_position_mask,
            )

            pkv = list(outs.values())[1:]
            pkv = tuple(pkv[i : i + 2] for i in range(0, len(pkv), 2))

            pkv2 = tuple((i.numpy(), j.numpy()) for i, j in outs2[1])
            if USE_ORIGINAL_FIRST_STAGE.value:
                return Kosmos2ForConditionalGenerationModelOutput(logits=lm_logits, past_key_values=pkv2)
            else:
                return Kosmos2ForConditionalGenerationModelOutput(logits=lm_logits, past_key_values=pkv)
        
        if past_key_values is not None:
            past_key_values = tuple(
                past_key_value
                for pkv_per_layer in past_key_values
                for past_key_value in pkv_per_layer
            )
            inputs_ = {
                "input_ids": input_ids[:, -1].unsqueeze(-1),
                "attention_mask": attention_mask,
                'position_ids': position_ids
            }
            inputs_.update(dict(zip(self.key_value_input_names, past_key_values)))

        # Run inference
        self.request.start_async(inputs_, share_inputs=True)
        self.request.wait()

        logits = torch.from_numpy(self.request.get_tensor("logits").data)
        logits = model.text_model.lm_head(logits)

        # Tuple of length equal to : number of layer * number of past_key_value per decoder layer (2 corresponds to the self-attention layer)
        past_key_values = tuple(
            self.request.get_tensor(key).data for key in self.key_value_output_names
        )
        # Tuple of tuple of length `n_layers`, with each tuple of length equal to 2 (k/v of self-attention)

        past_key_values = tuple(
            past_key_values[i : i + self.num_pkv]
            for i in range(0, len(past_key_values), self.num_pkv)
        )
        
        return Kosmos2ForConditionalGenerationModelOutput(logits=logits, past_key_values=past_key_values)

    def prepare_inputs_for_generation(
        self,
        input_ids,
        image_embeds=None,
        image_embeds_position_mask=None,
        past_key_values=None,
        attention_mask=None,
        use_cache=None,
        **kwargs,
    ):
        input_shape = input_ids.shape
        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
        if attention_mask is None:
            attention_mask = input_ids.new_ones(input_shape)

        position_ids = None

        # cut input_ids if past_key_values is used
        if past_key_values is not None:
            position_ids = create_position_ids_from_input_ids(
                input_ids,
                padding_idx=model.text_model.config.pad_token_id,
                past_key_values_length=0,
            )[:, -1:]

            input_ids = input_ids[:, -1:]
            # the image info. is already encoded into the past keys/values
            image_embeds = None
            image_embeds_position_mask = None
        elif image_embeds_position_mask is not None:
            # appending `False` to `image_embeds_position_mask` (because `input_ids` grows during generation)
            batch_size, seq_len = input_ids.size()
            mask_len = image_embeds_position_mask.size()[-1]
            image_embeds_position_mask = torch.cat(
                (
                    image_embeds_position_mask,
                    torch.zeros(size=(batch_size, seq_len - mask_len), dtype=torch.bool, device=input_ids.device),
                ),
                dim=1,
            )

        return {
            "input_ids": input_ids,
            "image_embeds": image_embeds,
            "image_embeds_position_mask": image_embeds_position_mask,
            'position_ids': position_ids,
            "past_key_values": past_key_values,
            "attention_mask": attention_mask,
        }
    

    @staticmethod
    # Copied from transformers.models.umt5.modeling_umt5.UMT5ForConditionalGeneration._reorder_cache
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
            )
        return reordered_past
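
`prepare_inputs_for_generation` recomputes position IDs from the full `input_ids` and keeps only the last one for the second-stage model. The pure-Python sketch below shows the cumulative-count scheme that `create_position_ids_from_input_ids` is understood to follow: non-padding tokens are numbered starting at `padding_idx + 1`, and padding tokens keep `padding_idx`. This is an illustration under that assumption rather than the library code; `position_ids_from_input_ids` is a hypothetical name, and `padding_idx=1` is an assumed value because the pad token ID is not visible in the truncated configuration output.

```python
def position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    positions, seen = [], 0
    for token_id in input_ids:
        if token_id == padding_idx:
            positions.append(padding_idx)
        else:
            seen += 1
            positions.append(seen + past_key_values_length + padding_idx)
    return positions


# Right-padded sequence: the three real tokens get positions 2, 3, 4.
print(position_ids_from_input_ids([0, 5, 6, 1, 1], padding_idx=1))  # [2, 3, 4, 1, 1]

# The second stage only needs the position of the newest token.
print(position_ids_from_input_ids(list(range(10, 15)), padding_idx=1)[-1:])  # [6]
```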

In [104]:
ov_model = KosmosForCausalLM(vision_model_ov, image_to_text_projection_ov, first_stage_model_path, second_stage_model_path, DEVICE)

In [105]:
vision_model_output = vision_model_ov(inputs["pixel_values"])
image_embeds = model.vision_model.model.post_layernorm(vision_model_output[0])
# normalized features
image_embeds = nn.functional.normalize(image_embeds, dim=-1)
image_embeds, projection_attentions = image_to_text_projection_ov(image_embeds.detach().numpy())

In [110]:
generated_ids_ = ov_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=image_embeds,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    #use_cache=True,
    max_new_tokens=128,
)
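
The split into a first-stage and a second-stage model follows the usual cached decoding pattern: the prompt is processed once to build the cache, and every later step feeds only the newest token plus the cache. The toy sketch below illustrates the control flow with a deterministic stand-in for the language model. None of these functions exist in the notebook or in any library, and the "cache" here is simply the list of tokens seen so far.

```python
EOS = 0  # toy end-of-sequence token


def next_token_from_history(history):
    """Toy stand-in for a language model: a deterministic function of all tokens so far."""
    return (sum(history) * 7 + len(history)) % 10


def first_stage(prompt):
    """Process the whole prompt once and return the next token plus a fresh cache."""
    cache = list(prompt)
    return next_token_from_history(cache), cache


def second_stage(token, cache):
    """Process only the newest token, extending the cache from the previous step."""
    cache = cache + [token]
    return next_token_from_history(cache), cache


def generate(prompt, max_new_tokens):
    """Greedy decoding with a cache: one first-stage call, then second-stage calls."""
    token, cache = first_stage(prompt)
    output = list(prompt) + [token]
    for _ in range(max_new_tokens - 1):
        if token == EOS:
            break
        token, cache = second_stage(token, cache)
        output.append(token)
    return output


def generate_without_cache(prompt, max_new_tokens):
    """Reference loop that re-reads the full sequence at every step."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token_from_history(output)
        output.append(token)
        if token == EOS:
            break
    return output


prompt = [3, 1, 4]
print(generate(prompt, 8))
print(generate(prompt, 8) == generate_without_cache(prompt, 8))  # True
```

Comparing the cached loop with the uncached reference is also a practical way to check a converted two-stage model: both paths should produce the same tokens for the same prompt.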

In [111]:
generated_ids_

tensor([[    0, 64003,     4,     5,     6,     7,     8,     9,    10,    11,
            12,    13,    14,    15,    16,    17,    18,    19,    20,    21,
            22,    23,    24,    25,    26,    27,    28,    29,    30,    31,
            32,    33,    34,    35,    36,    37,    38,    39,    40,    41,
            42,    43,    44,    45,    46,    47,    48,    49,    50,    51,
            52,    53,    54,    55,    56,    57,    58,    59,    60,    61,
            62,    63,    64,    65,    66,    67, 64004, 64012,   712,  1648,
             9,    10,   242,     8,    10,   460,    12,    10,   370,     6,
            23,    10,   242,  1369,    10,  1402,     6,     8,    10,   242,
            23,    10,  1402,    12,    10,   373,     6,    23,     5,   460,
            12,     5,   373,     4,    24,   242,    23,     5,  1402,    17,
          5229,    22,    26,     5,   460,     6,     8,     5,   460,    17,
           383,    26,     5,   242,    23,    36,  

In [112]:
generated_text = processor.batch_decode(generated_ids_, skip_special_tokens=True)[0]
print(generated_text)

<image>. the, to and of as in I that' for is was- on’ it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of a man and a woman in a room, with a man holding a gun, and a man with a gun in a car, with the woman in the car. The man with the gun is pointing it at the woman, and the woman is looking at the man with his gun. The woman is saying, "I'm not going to do this." The man is saying "I know you're not going." The woman says, "You're going to shoot me." The gun is pointed at the car, and she is saying to the man, "Don't shoot me!" The man says, "...I'


In [113]:

# Specify `cleanup_and_extract=False` in order to see the raw model generation.
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)

print(processed_text)
# Reference output of the original PyTorch model:
# `<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>.`

# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)

print(processed_text)
# Reference output of the original PyTorch model:
# `An image of a snowman warming himself by a fire.`

print(entities)

<grounding> An image of a man and a woman in a room, with a man holding a gun, and a man with a gun in a car, with the woman in the car. The man with the gun is pointing it at the woman, and the woman is looking at the man with his gun. The woman is saying, "I'm not going to do this." The man is saying "I know you're not going." The woman says, "You're going to shoot me." The gun is pointed at the car, and she is saying to the man, "Don't shoot me!" The man says, "...I'
An image of a man and a woman in a room, with a man holding a gun, and a man with a gun in a car, with the woman in the car. The man with the gun is pointing it at the woman, and the woman is looking at the man with his gun. The woman is saying, "I'm not going to do this." The man is saying "I know you're not going." The woman says, "You're going to shoot me." The gun is pointed at the car, and she is saying to the man, "Don't shoot me!" The man says, "...I'
[]
