### Import Dependencies

In [1]:
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

### Load Model

In [2]:
MODEL_PATH = "AIDC-AI/Ovis2.5-2B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda()

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Download Image(s)

In [3]:
!curl -O "https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  767k  100  767k    0     0   9.9M      0 --:--:-- --:--:-- --:--:--  9.9M
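
The same download can be done from Python instead of shelling out to `curl`. The sketch below uses only the standard library. The helper names `filename_from_url` and `download` are hypothetical and not part of the notebook; they mirror `curl -O` by saving the file under the last path segment of the URL.

```python
import os
import urllib.parse
import urllib.request

IMAGE_URL = (
    "https://cdn-uploads.huggingface.co/production/uploads/"
    "658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png"
)

def filename_from_url(url: str) -> str:
    """Return the last path segment of the URL, as `curl -O` would name the file."""
    path = urllib.parse.urlparse(url).path
    return os.path.basename(path)

def download(url: str, dest_dir: str = ".") -> str:
    """Fetch `url` into `dest_dir` and return the local path (hypothetical helper)."""
    dest = os.path.join(dest_dir, filename_from_url(url))
    urllib.request.urlretrieve(url, dest)
    return dest

print(filename_from_url(IMAGE_URL))
```

Calling `download(IMAGE_URL)` would then write the PNG into the working directory, where `Image.open` can read it.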


In [4]:
!ls

"Ayush's.ipynb"   bird.jpg   Ovis2.5-2B.ipynb   TIlymOb86R6_Mez3bpmcB.png


In [5]:
image = Image.open('TIlymOb86R6_Mez3bpmcB.png')
image.size

(2831, 2652)

### Resize the Image

In [6]:
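# Downscale both sides by the same factor so the aspect ratio is preserved
scaling_factor = 0.3

# int() truncates to whole pixels: (2831, 2652) -> (849, 795)
new_width = int(image.width * scaling_factor)
new_height = int(image.height * scaling_factor)

# Resize the image
resized_image = image.resize((new_width, new_height))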
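
As a worked example of the size arithmetic in the cell above, the following sketch reproduces the computation without needing the image itself. `scaled_size` and `fit_within` are hypothetical helpers, and the 1024-pixel cap is an arbitrary illustrative value, not something the notebook or the model prescribes.

```python
def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Scale both sides by `factor`, truncating to whole pixels with int()."""
    return int(width * factor), int(height * factor)

def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Shrink so the longer side is at most `max_side`, keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Integer arithmetic avoids float rounding surprises.
    return width * max_side // longest, height * max_side // longest

print(scaled_size(2831, 2652, 0.3))  # (849, 795)
print(fit_within(2831, 2652, 1024))  # (1024, 959)
```

Capping the longer side instead of using a fixed factor gives a predictable upper bound on input size when images vary a lot in resolution.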

### Set Up Query

In [7]:
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]
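
If several questions are asked about the same image, the message structure above can be wrapped in a small function. This is only a sketch: `build_messages` is a hypothetical helper, and a plain placeholder object stands in for the PIL image so the snippet runs on its own.

```python
from typing import Any

def build_messages(image: Any, question: str) -> list[dict[str, Any]]:
    """Wrap one image and one text prompt in the single-turn format used above."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]

# A placeholder object stands in for the PIL image here.
placeholder_image = object()
example_messages = build_messages(
    placeholder_image,
    "Calculate the sum of the numbers in the middle box in figure (c).",
)
print([part["type"] for part in example_messages[0]["content"]])
```

In the notebook, `resized_image` would be passed in place of `placeholder_image`.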

### Set Up Input, Generate, and Decode Output

In [8]:
# Tokenize the prompt and preprocess the image into model inputs
input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=False
)
# Move the inputs to the GPU; the image tensors may be None, hence the checks
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

# Generate with thinking mode and the thinking budget both disabled
outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=False,
    enable_thinking_budget=False,
    max_new_tokens=3072,
    thinking_budget=2048,
)

# Decode the generated token IDs back to text
response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


The sum of the numbers in the middle box in figure (c) is 1.0. This is calculated by adding 0.2, 0.5, and 0.3 together: 0.2 + 0.5 + 0.3 = 1.0.
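
The arithmetic in the response can be checked independently. The sketch below takes the three values quoted by the model at face value; it confirms the addition only, not that the model read the figure correctly.

```python
import math

# Values quoted in the model's response; they are not re-read from the image here.
values = [0.2, 0.5, 0.3]
claimed_total = 1.0

total = sum(values)
# Compare with a tolerance because binary floats cannot represent 0.2 or 0.3 exactly.
print(math.isclose(total, claimed_total))  # True
```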


In [9]:
!nvidia-smi 

Thu Sep 18 18:33:04 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 30%   34C    P2              90W / 350W |   8774MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
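
To put the memory figure above in proportion, the usage column can be parsed and turned into a percentage. This is a minimal sketch with a hypothetical helper, `parse_memory_usage`. The number covers everything resident on the GPU when `nvidia-smi` ran, not necessarily the model alone.

```python
import re

# Memory column copied from the nvidia-smi output above.
memory_column = "8774MiB / 12288MiB"

def parse_memory_usage(text: str) -> tuple[int, int]:
    """Extract (used_mib, total_mib) from a 'usedMiB / totalMiB' string."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", text)
    if match is None:
        raise ValueError(f"no memory usage found in {text!r}")
    return int(match.group(1)), int(match.group(2))

used_mib, total_mib = parse_memory_usage(memory_column)
print(f"{used_mib / total_mib:.1%} of GPU memory in use")  # 71.4% of GPU memory in use
```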
                                                                    

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
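
The warning above is printed when the process forks after the tokenizer has already been used; here that is probably the `!nvidia-smi` shell call. Following the message's own suggestion, the environment variable can be set explicitly. The sketch assumes it is run in the first cell, before `transformers` is imported, and the choice of `"false"` over `"true"` is an assumption made for illustration.

```python
import os

# Set this before importing transformers/tokenizers so the setting is in place
# ahead of any tokenizer use.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```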
