# Dino-v2 Documentation
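Dino-v2 (officially DINOv2) is a family of vision transformer models developed by Meta AI Research, formerly Facebook AI Research. It is the successor to the original DINO model and is trained with self-supervised learning, so it produces general-purpose image features without needing labels during pretraining.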

# Installation
Dino-v2 requires the following dependencies:

- Python 3.x
- PyTorch
- torchvision
- faiss
- numpy
- Pillow (PIL)
- OpenCV (cv2)
- tqdm
- matplotlib

You can install these dependencies using pip:

```
pip install torch torchvision faiss-cpu numpy Pillow opencv-python tqdm matplotlib
```
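
To confirm the environment is complete before running the cells further down, a small check like the following can help. This is only a sketch: `missing_modules` is a hypothetical helper, not part of Dino-v2, and the import names (`PIL`, `cv2`, `faiss`) are assumed to be the usual module names for the pip packages listed above.

```python
from importlib.util import find_spec

# Import names for the packages installed above. Some differ from the pip name:
# Pillow -> PIL, opencv-python -> cv2, faiss-cpu -> faiss.
REQUIRED_MODULES = ["torch", "torchvision", "faiss", "numpy", "PIL", "cv2", "tqdm", "matplotlib"]

def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```

An empty result means every dependency can be imported; any name that is printed needs to be installed first.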

# Model Description
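Dino-v2 uses a vision transformer (ViT) architecture, like the original DINO model: a transformer encoder processes the input image as a sequence of patches. This document uses the `dinov2_vits14` variant, a ViT-Small with a patch size of 14 pixels. As the model summary printed in the usage section shows, it has 12 transformer blocks and an embedding dimension of 384.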
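
The model splits each image into 14 x 14 pixel patches (the `kernel_size=(14, 14)` patch embedding in the model summary), so the input size determines how many tokens the encoder sees. The arithmetic can be sketched as below. `num_patch_tokens` is a hypothetical helper written for illustration, and the sketch assumes one extra class token and no other special tokens.

```python
PATCH_SIZE = 14  # patch size of dinov2_vits14, in pixels

def num_patch_tokens(height, width, patch_size=PATCH_SIZE):
    """Number of non-overlapping patches for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)

patches = num_patch_tokens(224, 224)   # 16 * 16 patches
sequence_length = patches + 1          # plus one class token (assumed)
print(patches, sequence_length)        # 256 257
```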

In [2]:
# Usage: imports
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
import cv2
import json
from tqdm.notebook import tqdm
from matplotlib import pyplot as plt

# Load Dino-v2 model
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

# Choose device
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

# Move model to device and switch to inference mode
dinov2_vits14.to(device).eval()


DinoVisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
    (norm): Identity()
  )
  (blocks): ModuleList(
    (0-11): 12 x NestedTensorBlock(
      (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (attn): MemEffAttention(
        (qkv): Linear(in_features=384, out_features=1152, bias=True)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): LayerScale()
      (drop_path1): Identity()
      (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=384, out_features=1536, bias=True)
        (act): GELU(approximate='none')
        (fc2): Linear(in_features=1536, out_features=384, bias=True)
        (drop): Dropout(p=0.0, inplace=False)
      )
      (ls2): LayerScale()
      (drop_path2): Identity()
    )
  )
  ...
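
The layer shapes in the summary above are enough to estimate the model size by hand. The sketch below counts weights and biases for the patch embedding and the 12 blocks only. It assumes each `LayerScale` holds one learnable value per channel (384), and it leaves out the position embeddings, the class token, and anything in the truncated tail of the printout, so the real total is somewhat larger.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

DIM = 384          # embedding dimension from the summary
MLP_DIM = 1536     # hidden size of the MLP
NUM_BLOCKS = 12    # "(0-11): 12 x NestedTensorBlock"

# Conv2d(3, 384, kernel_size=(14, 14)): weights plus biases
patch_embed = 3 * DIM * 14 * 14 + DIM

per_block = (
    linear_params(DIM, 3 * DIM)    # attn.qkv
    + linear_params(DIM, DIM)      # attn.proj
    + linear_params(DIM, MLP_DIM)  # mlp.fc1
    + linear_params(MLP_DIM, DIM)  # mlp.fc2
    + 2 * (2 * DIM)                # norm1, norm2: scale and shift
    + 2 * DIM                      # ls1, ls2 (assumed one value per channel)
)

total = patch_embed + NUM_BLOCKS * per_block
print(per_block, total)  # 1775232 21528960
```

Roughly 21.5 million parameters in these layers is in line with what is expected of a ViT-Small backbone.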

# Basic Usage
Dino-v2 can serve as a backbone for various computer vision tasks, including feature extraction, image classification, and object detection. Here's an example that uses Dino-v2 to extract a feature vector from a single image:

In [3]:
# Load an image and convert it to RGB so the 3-channel normalization applies
image_path = "night-1927265_1280.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess the image
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_image = transform(image).unsqueeze(0).to(device)

In [4]:
input_image.shape

torch.Size([1, 3, 224, 224])
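
The 224 x 224 target works because 224 is a multiple of the 14-pixel patch size. For other resolutions, one option is to round each side down to the nearest multiple of 14 before resizing. The helper below is a hypothetical sketch of that rounding, not part of Dino-v2 or torchvision; its result could be passed to `T.Resize`.

```python
def snap_to_patch_multiple(height, width, patch_size=14):
    """Round each side down to the nearest multiple of the patch size."""
    new_height = (height // patch_size) * patch_size
    new_width = (width // patch_size) * patch_size
    if new_height == 0 or new_width == 0:
        raise ValueError("image is smaller than one patch")
    return new_height, new_width

print(snap_to_patch_multiple(224, 224))  # (224, 224)
print(snap_to_patch_multiple(500, 375))  # (490, 364)
```

Rounding down rather than up avoids upsampling, at the cost of a slightly smaller input.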

In [5]:
# Get features from Dino-v2
with torch.no_grad():
    features = dinov2_vits14(input_image)

In [7]:
# The output is one 384-dimensional feature vector per image
print(features.shape)

torch.Size([1, 384])
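
A common next step is to compare two images by the cosine similarity of their feature vectors. The sketch below spells out the arithmetic in plain Python on short made-up vectors. With real features, something like `features[0].tolist()` would give the 384 values, and `torch.nn.functional.cosine_similarity` computes the same quantity directly on tensors. `cosine_similarity` here is a hypothetical helper for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for feature vectors (real ones have 384 entries).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(round(cosine_similarity(query, same_direction), 6))  # 1.0
print(round(cosine_similarity(query, orthogonal), 6))      # 0.0
```

Values near 1 indicate vectors pointing in the same direction, and values near 0 indicate unrelated ones.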


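# References

Original DINO paper: https://arxiv.org/abs/2104.14294

DINOv2 paper: https://arxiv.org/abs/2304.07193

Facebook Research GitHub repository (DINO): https://github.com/facebookresearch/dino

Facebook Research GitHub repository (DINOv2, the one loaded through `torch.hub` above): https://github.com/facebookresearch/dinov2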