# Transformers 4.56 vision models 🔥

The new transformers release comes with amazing vision/multimodal models: Florence-2 by Microsoft, SAM2 by Meta, KOSMOS-2.5 by Microsoft, and MetaCLIP2 by Meta, all runnable on the Colab free tier. This notebook lets you try them all, along with DINOv3!

Note: This notebook has a lot of image outputs, so you need to run the notebook to see them.

## Florence-2

We'll first take a look at Florence-2. The model in transformers format will be uploaded to the microsoft org soon; in the meantime, we can use `ducviet00/Florence-2-large-hf` and `ducviet00/Florence-2-base-hf`. The model is very small: it comes in sizes of roughly 200M (base) and 800M (large) parameters.

In [None]:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

processor = AutoProcessor.from_pretrained("ducviet00/Florence-2-large-hf")
model = AutoModelForImageTextToText.from_pretrained("ducviet00/Florence-2-large-hf").to("cuda", torch.bfloat16)

Florence-2 is a prompt-based model. You select a task by passing one of the following task prompts:
```
<OCR>
<OCR_WITH_REGION>
<CAPTION>
<DETAILED_CAPTION>
<MORE_DETAILED_CAPTION>
<OD>
<DENSE_REGION_CAPTION>
<CAPTION_TO_PHRASE_GROUNDING>
<REFERRING_EXPRESSION_SEGMENTATION>
<REGION_TO_SEGMENTATION>
<OPEN_VOCABULARY_DETECTION>
<REGION_TO_CATEGORY>
<REGION_TO_DESCRIPTION>
<REGION_TO_OCR>
<REGION_PROPOSAL>
```
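
Some of these prompts are used on their own (as with `<OCR>` in the next cell), while others are, judging by the original Florence-2 examples, followed by extra text such as a caption or a phrase. Here is a minimal sketch of a helper that validates a task name against the list above and appends the text when it is needed. `build_prompt` is a hypothetical helper, not part of transformers, and the set of tasks that expect extra text is an assumption to verify against the model card.

```python
from typing import Optional

# Task prompts copied from the list above.
FLORENCE2_TASKS = {
    "<OCR>", "<OCR_WITH_REGION>", "<CAPTION>", "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>", "<OD>", "<DENSE_REGION_CAPTION>",
    "<CAPTION_TO_PHRASE_GROUNDING>", "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<REGION_TO_SEGMENTATION>", "<OPEN_VOCABULARY_DETECTION>",
    "<REGION_TO_CATEGORY>", "<REGION_TO_DESCRIPTION>", "<REGION_TO_OCR>",
    "<REGION_PROPOSAL>",
}

# ASSUMPTION: these tasks expect extra text right after the task token.
TASKS_WITH_TEXT_INPUT = {
    "<CAPTION_TO_PHRASE_GROUNDING>", "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<REGION_TO_SEGMENTATION>", "<OPEN_VOCABULARY_DETECTION>",
    "<REGION_TO_CATEGORY>", "<REGION_TO_DESCRIPTION>", "<REGION_TO_OCR>",
}


def build_prompt(task: str, text_input: Optional[str] = None) -> str:
    """Return the prompt string for a task, appending text_input if required."""
    if task not in FLORENCE2_TASKS:
        raise ValueError(f"Unknown task prompt: {task!r}")
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"{task} needs a text input")
        return task + text_input
    if text_input:
        raise ValueError(f"{task} does not take a text input")
    return task


print(build_prompt("<OCR>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a hand holding candy"))
```

When post-processing, the `task` argument would still be the bare task token rather than the full prompt, if the processor follows the original Florence-2 convention.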

In [12]:
import torch
import requests
from PIL import Image

url = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/menu.JPG"
image = Image.open(requests.get(url, stream=True).raw)
prompt="<OCR>"

In [13]:
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

In [14]:
image_size = image.size
parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=image_size)
print(parsed_answer)

{'<OCR>': "FRIDAY, DEC 20th\nNEW OFFICE PARTY\n- COCKTAIL MENU -\nOFFICE MARTINI\nvodka fraise des bois - liss de framboise - liqueur de fleur de surreau - fleur\nwild strawberry volks - raspberry juice - raspberry litor - a déflower lior - flower\nDIFFUSER'S SUNRISE\ntequila, manchurian impédio, lus d'orange sansquine - contreu - cherry bitter\ntequila, tangerine lime - blood orange juice - contreau - cherry bitter\nTRANSFORMERS TWIST\ngin Intégrale - chèvre-lemon - jauné - citron - pouvre blanc\npepper\nPERUVIAN PEFT\nPapaya - lemonade - orange blanc - green tea & lemon - lemon - white\npeppers - pomegranate - orange marmalade - ananas\nplace - creme de crème - cérémonie - mandarin - mandarins\nroasted mango-infused gin - lemongrass - grenadilla - orange cocktail - pineapple"}


You can also do object detection with it.

In [15]:
url = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/candy.JPG"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<OD>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

In [16]:
image_size = image.size
parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=image_size)
print(parsed_answer)

{'<OD>': {'bboxes': [[2272, 2085, 2659, 2453], [1925, 1335, 2296, 1707], [1651, 1431, 1961, 1788], [2457, 1915, 2840, 2193], [2009, 1955, 2388, 2187], [1155, 533, 3784, 3022]], 'labels': ['candy', 'candy', 'candy', 'candy', 'candy', 'human hand']}}


In [None]:
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
bboxes = parsed_answer['<OD>']['bboxes']
labels = parsed_answer['<OD>']['labels']

for bbox, label in zip(bboxes, labels):
    x1, y1, x2, y2 = bbox
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, y1), label, fill="red")

display(image)
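
The `<OD>` result is a plain dictionary of parallel `bboxes` and `labels` lists, so it is easy to work with beyond drawing. As a small sketch, here we count detections per label and pick the largest "candy" box, using the output printed above. `boxes_for_label` and `box_area` are helper names made up for this example.

```python
from collections import Counter

# Result printed by the <OD> cell above.
parsed_answer = {
    "<OD>": {
        "bboxes": [
            [2272, 2085, 2659, 2453], [1925, 1335, 2296, 1707],
            [1651, 1431, 1961, 1788], [2457, 1915, 2840, 2193],
            [2009, 1955, 2388, 2187], [1155, 533, 3784, 3022],
        ],
        "labels": ["candy", "candy", "candy", "candy", "candy", "human hand"],
    }
}


def boxes_for_label(result, label):
    """Keep only the boxes whose label matches."""
    return [box for box, name in zip(result["bboxes"], result["labels"]) if name == label]


def box_area(box):
    """Area of an [x1, y1, x2, y2] box in pixels."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


od = parsed_answer["<OD>"]
counts = Counter(od["labels"])
candy_boxes = boxes_for_label(od, "candy")
largest_candy = max(candy_boxes, key=box_area)

print(counts)
print(largest_candy)
```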

## DINOv3

DINOv3 is an advanced image backbone/embedding model that you can use as is for a variety of tasks. Here are a few apps and tutorials in case you're interested in what you can do with it, including how to fine-tune it for image classification.
- [DINOv3 Fine-tuning](https://huggingface.co/merve/smol-vision/blob/main/DINOv3_FT.ipynb)
- [DINOv3 for Keypoint Matching through patch similarities](https://huggingface.co/spaces/merve/DINOv3-keypoint-matching)
- [DINOv3 object perception](https://huggingface.co/spaces/merve/dinov3-viz)

Note that you need access to this model to run it. If you don't have access yet, head to the model repository and fill in the form to request it. [Here are all the DINOv3 models](https://huggingface.co/collections/facebook/dinov3-68924841bd6b561778e31009).



In [None]:
import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

url = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg"
image = load_image(url)

pretrained_model_name = "facebook/dinov3-convnext-base-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(pretrained_model_name)
model = AutoModel.from_pretrained(
    pretrained_model_name,
    device_map="auto",
)

inputs = processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
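
A common thing to do with a pooled embedding is to compare it with the embedding of another image. Below is a minimal sketch of cosine similarity in pure Python, just to show what is being computed. The vectors are toy values, not real DINOv3 outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values).
emb_a = [1.0, 0.0, 1.0]
emb_b = [2.0, 0.0, 2.0]  # same direction as emb_a
emb_c = [0.0, 1.0, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

With real outputs you could pass `pooled_output[0].float().tolist()`, or stay in PyTorch and use `torch.nn.functional.cosine_similarity` on the tensors directly.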


## Kosmos 2.5

Kosmos 2.5 by Microsoft is a great document model that can not only convert documents to markdown but also locate meaningful structures in them.
It has a "normal" checkpoint and a "chat" checkpoint, the latter of which can be used for VQA tasks. Let's see how to use it.

In [None]:
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
import torch

model = Kosmos2_5ForConditionalGeneration.from_pretrained("microsoft/kosmos-2.5").to("cuda", torch.bfloat16)
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2.5")


In [3]:
from PIL import Image, ImageDraw
import requests

url = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fiche.jpg"
image = Image.open(requests.get(url, stream=True).raw)

It works a bit like Florence-2 in that you provide a task prompt. It accepts two: `<md>` (for markdown) and `<ocr>` (for OCR).

In [4]:
import re

prompt = "<md>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to("cuda") if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(torch.bfloat16)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])

# CATERIA DEI FERMENTINI

UNO SRLS
VIA CIMABUE 1 R
50125 FIRENZE
P.iva 04109381204
Tel. 055 2466781

## DOCUMENTO COMMERCIALE

di vendita o prestazione

- **QTA.** **DESCRIZIONE**
- 1 x Coperti
- 1 x Coca Fanta Sprite
- 1 x Rigatoni 3 pomodori

- **IVA**
- 10,00%
- 10,00%
- 10,00%

- **TOTAL** **EURO**
- 17,00

di cui **IVA**
- 1.55

Pagamento elettronico
Importo pagato

26-05-2023 21:52
DOC.N. 0175-0011
RT 941BQ003454

---

**DETTAGLIO FORME DI PAGAMENTO**
Carta di Credito

17,00
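
The cell above computes `scale_height` and `scale_width` and imports `re`, but neither is needed for `<md>`. They matter for the `<ocr>` prompt, where each text line comes with a bounding box in the resized image's coordinates. Here is a rough sketch of how such output could be parsed and rescaled. The `<bbox><x_..><y_..><x_..><y_..></bbox>` token layout is an assumption based on the model card, the sample string is made up, and `parse_ocr_output` is a hypothetical helper.

```python
import re

# ASSUMPTION: each OCR line is preceded by a token group like
# <bbox><x_10><y_20><x_110><y_40></bbox>
BBOX_PATTERN = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_ocr_output(text, scale_width, scale_height):
    """Return a list of ((x0, y0, x1, y1), line_text) in original image pixels."""
    results = []
    matches = list(BBOX_PATTERN.finditer(text))
    for i, match in enumerate(matches):
        # The text of a line runs until the next bbox token group (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        line = text[match.end():end].strip()
        x0, y0, x1, y1 = (int(group) for group in match.groups())
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes
        box = (
            int(x0 * scale_width), int(y0 * scale_height),
            int(x1 * scale_width), int(y1 * scale_height),
        )
        results.append((box, line))
    return results


# Made-up sample in the assumed format, not real model output.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>UNO SRLS\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>VIA CIMABUE 1 R\n"
)
parsed_lines = parse_ocr_output(sample, scale_width=2.0, scale_height=1.5)
print(parsed_lines)
```

In the notebook, the scale factors are derived from values popped off the processor output, so they may need converting with `float(...)` before being passed in.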


Let's try the chat version. Note that it takes a chat template as input.

In [None]:
model = Kosmos2_5ForConditionalGeneration.from_pretrained("microsoft/kosmos-2.5-chat").to("cuda", torch.bfloat16)
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2.5-chat")

In [6]:
question = "What is the sub total of the receipt?"
template = "<md>A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
prompt = template.format(question)
inputs = processor(text=prompt, images=image, return_tensors="pt")

# rest is the same
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to("cuda") if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(torch.bfloat16)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: What is the sub total of the receipt? ASSISTANT: 17,00
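
As the output shows, the decoded text echoes the whole prompt before the answer. Here is a small sketch to keep only the part after the last `ASSISTANT:` marker (`extract_answer` is a helper name made up here).

```python
def extract_answer(generated: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, answer = generated.rpartition(marker)
    return answer.strip() if found else generated.strip()


# Decoded text copied from the output above.
decoded = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: What is the sub total of the receipt? ASSISTANT: 17,00"
)
print(extract_answer(decoded))
```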


## MetaCLIP2

MetaCLIP2 is a multimodal zero-shot image classifier by Meta, which you can use for a variety of tasks that require image-text understanding. [Here are all the MetaCLIP2 models](https://huggingface.co/collections/facebook/meta-clip-1-2-687e97787e9155bc480ef446). We will use the multilingual one.

In [None]:
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
import torch

model = AutoModelForZeroShotImageClassification.from_pretrained("facebook/metaclip-2-worldwide-huge-378", dtype=torch.bfloat16, attn_implementation="sdpa").to("cuda")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-378")

In [8]:
import requests
import torch
from PIL import Image

url = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/venice.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["venice", "venezia", "berlin"]

In [9]:
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True).to("cuda")

outputs = model(**inputs)

We apply a softmax to the image-text logits to get the probabilities assigned to the labels "venice", "venezia", and "berlin", respectively.

In [10]:
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

formatted_probs = [f"{p.item()*100:.2f}%" for p in probs[0]]
print(formatted_probs)

['59.38%', '40.82%', '0.00%']
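
To make the softmax step concrete, here is a minimal pure-Python sketch that turns logits into probabilities and ranks the labels. The logits are hypothetical values chosen for illustration, not the model's actual `logits_per_image`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["venice", "venezia", "berlin"]
example_logits = [24.0, 23.6, 10.0]  # hypothetical logits, one per label

probs = softmax(example_logits)
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

for label, prob in ranked:
    print(f"{label}: {prob * 100:.2f}%")
```

The probabilities printed by the notebook add up to slightly more than 100%, which is most likely just bfloat16 rounding.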


## SAM2

SAM2 is the continuation of SAM (Segment Anything Model) by Meta. It adds video inference, keeping a memory across video frames to propagate a mask to the following frames.

In [1]:
from transformers import Sam2Processor, Sam2Model
import torch

model = Sam2Model.from_pretrained("facebook/sam2-hiera-tiny").to("cuda")
processor = Sam2Processor.from_pretrained("facebook/sam2-hiera-tiny")

You are using a model of type sam2_video to instantiate a model of type sam2. This is not supported for all configurations of models and can yield errors.


Image inference is pretty similar to the previous SAM model: you provide a point or box prompt on the object of interest.

On top of that, you can indicate what type of click you're leaving on the image: 1 is a positive click, marking the object you're interested in, and 0 is a negative click, excluding an object. Here we leave a positive click on a flower petal.

In [2]:
from PIL import Image
import requests

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee_edited.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

input_points = [[[[750, 750]]]]
input_labels = [[[1]]]

In [None]:
from PIL import ImageDraw
img = raw_image.copy()
draw = ImageDraw.Draw(img)

draw.regular_polygon((750, 750, 25), n_sides=3, fill="yellow")
img

In [8]:
inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

Outputs contain the predicted masks and `iou_scores`. The model returns three masks, so we can find the best prediction through the scores.

In [10]:
outputs.iou_scores

tensor([[[0.3297, 0.7263, 0.4257]]], device='cuda:0')
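
Rather than reading the best index off by eye, you can compute it. Here is a tiny sketch using the scores printed above as a plain list.

```python
# Scores copied from the output above (one score per candidate mask).
iou_scores = [0.3297, 0.7263, 0.4257]

# Index of the highest-scoring mask.
best_index = max(range(len(iou_scores)), key=iou_scores.__getitem__)
print(best_index, iou_scores[best_index])
```

On the actual tensor, something like `outputs.iou_scores[0, 0].argmax().item()` should give the same index.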

In [None]:
masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]

print(f"Generated {masks.shape[1]} masks with shape {masks.shape}")

Let's overlay the mask at index 1 (with score 0.73).

In [None]:
import numpy as np
from PIL import Image, ImageDraw

binary_mask = masks[0][1]

colored_mask = Image.fromarray(binary_mask.numpy().astype(np.uint8) * 255, mode='L').convert('RGBA')

overlay_color = (255, 0, 0, 128)
color_overlay = Image.new('RGBA', colored_mask.size, overlay_color)

colored_mask.paste(color_overlay, (0, 0), color_overlay)

raw_image_rgba = raw_image.convert('RGBA')

output_image = Image.composite(colored_mask, raw_image_rgba, colored_mask)

display(output_image)
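
If you want to reuse the overlay, one alternative is to blend the color into the masked pixels directly with NumPy. This is only a sketch of a different approach from the PIL compositing above. `blend_mask` is a hypothetical helper, shown here on a toy 2x2 image.

```python
import numpy as np


def blend_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into the pixels of an RGB image where `mask` is True."""
    image = np.asarray(image, dtype=np.float32)
    mask = np.asarray(mask, dtype=bool)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("image must have shape (height, width, 3)")
    if mask.shape != image.shape[:2]:
        raise ValueError("mask must have shape (height, width)")
    out = image.copy()
    # Only masked pixels change; everything else keeps its original value.
    out[mask] = (1 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    return out.round().astype(np.uint8)


# Toy example: black 2x2 image with only the top-left pixel masked.
toy_image = np.zeros((2, 2, 3), dtype=np.uint8)
toy_mask = np.array([[True, False], [False, False]])
blended = blend_mask(toy_image, toy_mask, color=(200, 0, 0), alpha=0.5)
print(blended[0, 0], blended[1, 1])
```

With the notebook's variables, this would be called roughly as `Image.fromarray(blend_mask(np.array(raw_image), binary_mask.numpy()))`.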

With SAM2 you can do:
- inference with a single point per object per image → `[[[[500, 375]]]]`
- inference with multiple points for one object in an image → `[[[[500, 375], [1125, 625]]]]`
- inference with points for multiple objects in an image → `[[[[500, 375]], [[650, 750]]]]` (here one point per object)
- batched inference over images for any of the above → `[[[[500, 375]]], [[[770, 200]]]]`. The click indicators need the same nesting, e.g. `[[[1]], [[1]]]` for this case.



What makes SAM2 stand out is video tracking. We select a frame in a video, leave a click, and get a mask. We then propagate that mask across the video. The tracked mask is called a "masklet" and is followed throughout the video using memory, so we need to start an inference session, unlike other transformers models.

We need `av` for the video backend, so let's install it.

In [13]:
!pip install av


In [None]:
from transformers import Sam2VideoModel, Sam2VideoProcessor, infer_device
import torch

device = infer_device()
model = Sam2VideoModel.from_pretrained("facebook/sam2.1-hiera-tiny").to(device, dtype=torch.bfloat16)
processor = Sam2VideoProcessor.from_pretrained("facebook/sam2.1-hiera-tiny")

In [2]:
from transformers.video_utils import load_video
video_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4"
video_frames, _ = load_video(video_url)

In [None]:
display(video_frames[0])

We have a video of kids jumping. Let's start a video session.

In [3]:
inference_session = processor.init_video_session(
    video=video_frames,
    inference_device=device,
    dtype=torch.bfloat16,
)

We leave a point on the first frame on the kid's pants.

In [7]:
ann_frame_idx = 0
ann_obj_id = 1
points = [[[[210, 350]]]]
labels = [[[1]]]

In [10]:
x, y = points[0][0][0][0], points[0][0][0][1]

In [None]:
from PIL import ImageDraw, Image
img = Image.fromarray(video_frames[0]).copy()
draw = ImageDraw.Draw(img)

draw.regular_polygon((x, y, 5), n_sides=3, fill="yellow")
img

In [18]:
processor.add_inputs_to_inference_session(
    inference_session=inference_session,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)


In [24]:
outputs = model(
    inference_session=inference_session,
    frame_idx=ann_frame_idx,
)
video_res_masks = processor.post_process_masks(
    [outputs.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=True
)[0]
print(f"Segmentation shape: {video_res_masks.shape}")

Segmentation shape: torch.Size([1, 1, 540, 960])


In [None]:
import numpy as np
from PIL import Image, ImageDraw


colored_mask = Image.fromarray(video_res_masks[0][0].cpu().detach().numpy().astype(np.uint8) * 255, mode='L').convert('RGBA')

overlay_color = (255, 0, 0, 128)
color_overlay = Image.new('RGBA', colored_mask.size, overlay_color)

colored_mask.paste(color_overlay, (0, 0), color_overlay)

raw_image_rgba = Image.fromarray(video_frames[0]).convert('RGBA')

output_image = Image.composite(colored_mask, raw_image_rgba, colored_mask)

display(output_image)

We overlaid the mask for that frame above. If we're happy with it, we can propagate it through the video.

In [None]:
video_segments = {}
for sam2_video_output in model.propagate_in_video_iterator(inference_session):
    video_res_masks = processor.post_process_masks(
        [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=True
    )[0]
    video_segments[sam2_video_output.frame_idx] = video_res_masks

print(f"Tracked object through {len(video_segments)} frames")

Let's check a random frame and see if the object was tracked properly.

In [38]:
video_segments[100][0][0]

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]], device='cuda:0')
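
The printed tensor only shows the corners of the mask, which are all `False`, so it doesn't tell us whether anything was tracked. A quick sanity check is the fraction of pixels that are `True` in each frame. The sketch below uses toy masks, and `mask_coverage` is a hypothetical helper.

```python
import numpy as np


def mask_coverage(mask) -> float:
    """Fraction of pixels in a boolean mask that are True."""
    mask = np.asarray(mask, dtype=bool)
    return float(mask.mean())


# Toy per-frame masks standing in for video_segments (hypothetical data).
toy_segments = {
    0: np.array([[True, False], [False, False]]),
    1: np.array([[True, True], [False, False]]),
    2: np.zeros((2, 2), dtype=bool),  # object lost in this frame
}

coverage_per_frame = {idx: mask_coverage(m) for idx, m in toy_segments.items()}
empty_frames = [idx for idx, cov in coverage_per_frame.items() if cov == 0.0]

print(coverage_per_frame)
print("frames with an empty mask:", empty_frames)
```

With the notebook's variables, a single frame could be checked with something like `mask_coverage(video_segments[100][0][0].cpu().numpy())`.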

In [None]:
import numpy as np
from PIL import Image, ImageDraw


colored_mask = Image.fromarray(video_segments[110][0][0].cpu().detach().numpy().astype(np.uint8) * 255, mode='L').convert('RGBA')

overlay_color = (255, 0, 0, 128)
color_overlay = Image.new('RGBA', colored_mask.size, overlay_color)

colored_mask.paste(color_overlay, (0, 0), color_overlay)

raw_image_rgba = Image.fromarray(video_frames[110]).convert('RGBA')

output_image = Image.composite(colored_mask, raw_image_rgba, colored_mask)

display(output_image)