# DINOv3


DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, <u>__without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks__</u>, significantly surpassing previous self- and weakly-supervised foundation models.

----

### Reference:
- ***Paper***
    - [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
- ***Blogs***
    - [DINOv3 Project](https://ai.meta.com/dinov3/)
    - https://huggingface.co/docs/transformers/en/model_doc/dinov3
- ***GitHub***
    - https://github.com/facebookresearch/dinov3
    - https://colab.research.google.com/github/facebookresearch/dinov3/blob/main/notebooks/foreground_segmentation.ipynb#scrollTo=a47d6a1b
    - https://colab.research.google.com/github/facebookresearch/dinov3/blob/main/notebooks/segmentation_tracking.ipynb


## Device Setup

In [2]:
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
    !nvidia-smi
else:
    device = "cpu"

print(f"Available device : {device}")

Fri Sep 12 14:59:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 29%   46C    P5             39W /  250W |    1277MiB /  11264MiB |     31%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
from transformers.image_utils import load_image

# Sample COCO validation image used throughout the notebook
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

# load_image returns a PIL image; print its size as (height, width)
print(image.height, image.width)

480 640


In [4]:
import torch
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m")
model = AutoModel.from_pretrained(
    "facebook/dinov3-vits16-pretrain-lvd1689m",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)  # device_map="auto" already places the weights, so no extra .to(device) is needed

inputs = processor(images=image, return_tensors="pt").to(model.device)
print(f"Input Shape :: {inputs.pixel_values.shape}")
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
print("Pooled output shape:", pooled_output.shape)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Input Shape :: torch.Size([1, 3, 224, 224])
Pooled output shape: torch.Size([1, 384])


In [9]:
inputs.pixel_values.shape

torch.Size([1, 3, 224, 224])

In [12]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [10]:

import torch
from transformers import TorchAoConfig, AutoImageProcessor, AutoModel
from torchao.quantization import Int4WeightOnlyConfig
from transformers.image_utils import load_image


# Load the processor from the same checkpoint as the model below
processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vit7b16-pretrain-lvd1689m")

quant_type = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_type)

model = AutoModel.from_pretrained(
    "facebook/dinov3-vit7b16-pretrain-lvd1689m",
    dtype=torch.bfloat16,
    device_map="auto",
    # quantization_config=quantization_config
)

inputs = processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
print("Pooled output shape:", pooled_output.shape)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


Pooled output shape: torch.Size([1, 4096])
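
The "offloaded to the cpu" message above is probably a memory issue: the weights of this checkpoint do not fit on the 11 GiB card reported by `nvidia-smi`. The sketch below is a back-of-the-envelope estimate only. It assumes roughly 7 billion parameters (inferred from the `vit7b` checkpoint name, not measured), counts weights only (no activations), and ignores the small overhead of int4 group scales.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: rough size implied by the "vit7b" checkpoint name
GPU_MEMORY_GIB = 11264 / 1024  # total memory reported by nvidia-smi (11264 MiB)

for label, bits in [("bfloat16", 16), ("int4 weight-only", 4)]:
    needed = weight_memory_gib(NUM_PARAMS, bits)
    verdict = "fits" if needed < GPU_MEMORY_GIB else "does not fit"
    print(f"{label}: ~{needed:.1f} GiB of weights, {verdict} in {GPU_MEMORY_GIB:.0f} GiB")
```

Under these assumptions the bfloat16 weights alone exceed the GPU memory, which is consistent with `device_map="auto"` spilling layers to the CPU. Enabling the commented-out `quantization_config` would likely bring the weights within budget, at some cost in accuracy.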


In [15]:
inputs.pixel_values.shape

torch.Size([1, 3, 224, 224])

In [12]:
outputs.last_hidden_state.shape

torch.Size([1, 201, 4096])
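
The sequence length of 201 can be accounted for token by token. The sketch below assumes a patch size of 16 (the "16" in the checkpoint name), 4 register tokens, and the token order CLS, registers, patches. In practice these values are better read from `model.config.patch_size` and `model.config.num_register_tokens` than hard-coded. `token_layout` is a hypothetical helper, not part of `transformers`.

```python
def token_layout(height: int, width: int, patch_size: int = 16,
                 num_register_tokens: int = 4) -> dict:
    """Describe how a ViT token sequence splits into CLS, register, and patch tokens.

    Assumes the order [CLS, registers..., patches...] along the sequence axis.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    first_patch = 1 + num_register_tokens  # skip CLS and register tokens
    return {
        "grid": (height // patch_size, width // patch_size),
        "num_patches": num_patches,
        "sequence_length": first_patch + num_patches,
        "cls_index": 0,
        "register_slice": slice(1, first_patch),
        "patch_slice": slice(first_patch, first_patch + num_patches),
    }


layout = token_layout(224, 224)
print(layout["grid"], layout["num_patches"], layout["sequence_length"])
# (14, 14) 196 201
```

With this layout, the dense patch features would be `outputs.last_hidden_state[:, layout["patch_slice"], :]`, which can be reshaped to a 14 x 14 grid for segmentation-style tasks.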