# SmolVLM2-Image



----

### Reference:
- ***Paper***
    - ...
- ***Blogs***
    - [SmolVLM2: Bringing Video Understanding to Every Device](https://huggingface.co/blog/smolvlm2)
    - [SmolVLM-256M-Instruct model card](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)
    - [SmolVLM to SmolVLM2: Compact Models for Multi-Image VQA](https://pyimagesearch.com/2025/06/23/smolvlm-to-smolvlm2-compact-models-for-multi-image-vqa/)
- ***GitHub***
    - ...

----

### Conda env : [cv_playgrounds](../README.md#setup-a-conda-environment)

## Device Setup

In [1]:
import torch

if torch.backends.mps.is_available():
    g_device = "mps"
elif torch.cuda.is_available():
    g_device = "cuda"
    !nvidia-smi
else:
    g_device = "cpu"

print(f"Available device : {g_device}")

Thu Sep 18 10:54:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 27%   46C    P5             38W /  250W |    1322MiB /  11264MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Load Model

In [2]:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # deprecated name: the warning below asks for `dtype` instead
    # _attn_implementation="flash_attention_2"
).to(g_device)

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

`torch_dtype` is deprecated! Use `dtype` instead!


## Image Description

In [3]:
from IPython.display import Image, display

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

display(Image(url='https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg', width=500))
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])


User:



Can you describe this image?
Assistant: The image depicts a close-up view of a flower with a distinctively colored blossom. The flower is prominently featured in the foreground, with its petals fully open and a central yellow center. The petals are a deep shade of pink, and the edges are slightly curled, giving the flower a delicate and delicate appearance.
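
The decoded text above echoes the prompt (`User: ... Assistant: ...`) because `batch_decode` is applied to the full sequence returned by `generate`. The following is a minimal sketch of one way to keep only the reply, assuming the model turn is labeled with the literal `Assistant:` marker seen in the output. `extract_assistant_reply` is a hypothetical helper, not part of `transformers`.

```python
def extract_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last `marker`, or the whole string if it is absent."""
    # rpartition splits on the LAST occurrence, so earlier turns are dropped.
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


# Shortened stand-in for generated_texts[0]
sample = "User:\n\nCan you describe this image?\nAssistant: The image depicts a flower."
print(extract_assistant_reply(sample))
```

In the notebook this would be called as `extract_assistant_reply(generated_texts[0])`.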


In [4]:
from transformers.image_utils import load_image
from IPython.display import Image, display

# Load the image for the model and show it inline
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(image_url)
display(Image(url=image_url, width=500))
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(g_device)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User:



Can you describe this image?
Assistant: The image depicts a large, historic statue of Liberty situated on a small island in a body of water. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of Liberty. The statue is mounted on a pedestal that is supported by a cylindrical base. The base of the pedestal is made of stone and has a rounded top, which is typical of pedestal bases used for statues.

The statue is surrounded by a large body of water, which is likely the Hudson River, as indicated by the presence of trees and a small dock extending from the water's edge. The water is calm, with gentle ripples indicating the gentle movement of the water.

In the background, there are several tall buildings, including a modern skyscraper and a more traditional building, which are both part of the cityscape. The buildings are constructed with glass and steel, and they are of varying heights, with some taller structures closer to the
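
The reply above breaks off mid-sentence ("closer to the"). A `max_new_tokens` cap is one common cause of this, though the notebook does not record what happened here. As a sketch only, a trailing fragment can be dropped before display. `trim_to_last_sentence` is a hypothetical helper, and the sentence-boundary rule (`.`, `!` or `?` followed by whitespace or end of text) is a simplifying assumption.

```python
import re


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing sentence fragment, keeping text up to the last terminator."""
    # Find every ., ! or ? that is followed by whitespace or the end of the text.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text  # no complete sentence found; leave the text unchanged
    return text[: matches[-1].end()]


sample = "The buildings are of varying heights. Some taller structures are closer to the"
print(trim_to_last_sentence(sample))
```

This only tidies the displayed text. It does not recover the missing content, which would need a larger token budget or a follow-up generation.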

## Multi-Image Comparison

In [5]:
from IPython.display import Image, display

image1_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
display(Image(url=image1_url, width=400, height=300))

image2_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
display(Image(url=image2_url, width=300, height=400))

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
            {"type": "image", "url": image1_url},
            {"type": "image", "url": image2_url},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(g_device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

User: What are the differences between these two images?









Assistant: A rabbit in a blue coat stands in a dirt path with a village in the background.
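
The `messages` structure is written out by hand in each cell. A small builder can reduce the repetition. This is a sketch under assumptions: `build_user_message` is a hypothetical helper, and it places the text before the images, as the multi-image cell above does (the single-image cell uses the opposite order).

```python
def build_user_message(prompt, image_urls=()):
    """Build a single-turn message list in the dict format used in this notebook."""
    # Text first, then one {"type": "image", "url": ...} entry per image.
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image", "url": url} for url in image_urls]
    return [{"role": "user", "content": content}]


messages = build_user_message(
    "What are the differences between these two images?",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],  # placeholder URLs
)
print([part["type"] for part in messages[0]["content"]])
```

The returned list can be passed to `processor.apply_chat_template` in place of the literal `messages` definition.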


## Video Description

In [None]:
from IPython.display import Video

# video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/bowling/-WH-lxmGJVY_000005_000015.mp4"
video_url = "./data/test01.mp4"
# Assumes the file exists at ./data/test01.mp4, relative to the notebook
Video(video_url, width=640, height=360)
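
The cell above relies on a local file being present. A minimal sketch of checking the path before it is handed to the processor follows. `require_file` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path


def require_file(path: str) -> str:
    """Return the resolved path as a string, or raise if it is not an existing file."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Video not found: {resolved}")
    return str(resolved)


try:
    video_path = require_file("./data/test01.mp4")
    print(f"Using {video_path}")
except FileNotFoundError as err:
    print(err)
```

Failing early with a clear message tends to be easier to debug than an error raised deep inside video decoding.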

In [13]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_url},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(g_device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User: Describe this video in detail
Assistant: The video is titled "The 100 Best Places to Visit in the World" and is part of a series of videos that cover various topics related to travel and tourism. The title of the video is written in a large, bold font at the top of the video. The content of the video is divided into


In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_url},
            {"type": "text", "text": "How many dogs are there"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(g_device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User: How many dogs are there
Assistant: There are 10 dogs
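
Every inference cell in this notebook repeats the same three steps: apply the chat template, call `generate`, then decode. As a sketch only, those steps can be folded into one function. `run_chat` is a hypothetical helper. It uses only the calls and arguments already used in the cells above, and it takes the processor, model, device and dtype as parameters rather than reading globals.

```python
def run_chat(processor, model, messages, device, dtype=None, max_new_tokens=64):
    """Apply the chat template, generate greedily, and return the decoded text."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # Move inputs to the target device, optionally casting as the cells above do.
    inputs = inputs.to(device) if dtype is None else inputs.to(device, dtype=dtype)
    generated_ids = model.generate(
        **inputs, do_sample=False, max_new_tokens=max_new_tokens
    )
    # The first (and only) batch entry, including the echoed prompt.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Usage inside the notebook (names from the earlier cells):
# print(run_chat(processor, model, messages, g_device, dtype=torch.bfloat16))
```

Passing the objects in explicitly keeps the function usable with a different checkpoint or device without editing its body.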

