# Generating Image Embeddings with DINOv2

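This section covers three tasks:

1. Download the DINOv2 model files from Hugging Face and run embedding inference on a single image
2. Write batch inference code: first run inference on multiple images on the CPU, then go a step further and run inference on the GPU in batches of a fixed `batch_size`
3. Build a FastAPI inference service that takes base64-encoded images as input and returns their embeddings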

DINOv2:
- GitHub: [facebookresearch/dinov2](https://github.com/facebookresearch/dinov2)
- Hugging Face: [facebook/dinov2-base](https://huggingface.co/facebook/dinov2-base)

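> **Why compute image embeddings?**
>
> The goal of this project is unsupervised clustering of images. Raw images are not a suitable input for clustering algorithms, so they need to be preprocessed first.
>
> Here are two preprocessing approaches to compare:
>
> 1. Feed the images into a denoising autoencoder and let it learn a low-dimensional representation of each image by minimizing the reconstruction loss
> 2. First convert each image into a high-dimensional embedding with a pretrained image embedding model, then use an autoencoder to reduce it to a low-dimensional embedding
>
> Which approach sounds more likely to give better results?
>
> I think the second. A pretrained model has already learned high-level semantic features and contextual relationships from large-scale data, so its embeddings compress the information more efficiently and carry less noise. This makes the downstream model's learning task easier and more efficient.
>
> Using a pretrained model as a feature extractor is a common practice.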

## 1. Downloading the model files from Hugging Face

```bash
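# Install the Hugging Face package used to download models
pip install -U huggingface_hub

# Optional: route downloads through a mirror if huggingface.co is not reachable
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download facebook/dinov2-base --local-dir ./model/dinov2-base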
```

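## 2. Computing an image embedding

Use Meta's DINOv2 model to obtain a 768-dimensional image embedding. The model outputs one vector per token; the vector of the first token (`[CLS]`, index 0) is used as the embedding of the whole image.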

In [1]:
import os
import base64
import torch
import requests
import numpy as np

from transformers import AutoImageProcessor, AutoModel
from torch.utils.data import DataLoader
from PIL import Image

MODEL_PATH = './model/dinov2-base'
IMG_PATH = './img'
API_URL = 'http://localhost:8210/embeddings/'

In [2]:
processor = AutoImageProcessor.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)

In [3]:
image = Image.open(f'{IMG_PATH}/book.jpg')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state

In [4]:
last_hidden_states

tensor([[[ 0.6843, -0.7391,  0.5536,  ...,  0.7991, -2.7756, -1.5053],
         [-0.0223, -2.5828, -0.6256,  ...,  0.0468, -2.4570, -0.9830],
         [ 0.0512, -1.5871, -1.0260,  ..., -0.4017, -2.4379, -1.1177],
         ...,
         [-0.9939, -0.2953,  3.4886,  ...,  1.0758, -0.2287,  0.5143],
         [-0.5174, -0.4167,  1.6565,  ...,  1.9487, -1.0684, -0.0174],
         [-1.2063, -0.6236, -0.4622,  ...,  0.2101, -2.1345, -0.9429]]],
       grad_fn=<NativeLayerNormBackward0>)

In [5]:
last_hidden_states.shape

torch.Size([1, 257, 768])

In [6]:
last_hidden_states[:, 0, :].shape

torch.Size([1, 768])
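
The shape `[1, 257, 768]` reads as (batch, tokens, hidden size). As a rough check on where 257 comes from, the sketch below assumes that the default processor crops each image to 224×224 and that the model splits it into 14×14 patches plus one `[CLS]` token. These values are assumptions about `facebook/dinov2-base` rather than something printed in this notebook, and `vit_token_count` is a hypothetical helper.

```python
def vit_token_count(image_size: int, patch_size: int, num_special_tokens: int = 1) -> int:
    """Number of tokens a ViT produces for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus the special tokens (here: a single [CLS] token)
    return patches_per_side * patches_per_side + num_special_tokens


# Assumed values: 224x224 crop, 14x14 patches, one [CLS] token
print(vit_token_count(image_size=224, patch_size=14))  # 257
```

Under these assumptions, index 0 along the token axis is the `[CLS]` token and the remaining 256 entries are patch tokens, which is why `last_hidden_states[:, 0, :]` has shape `[1, 768]`.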

## 3. Computing image embeddings in batches

### 3.1 Batch inference on the CPU

In [7]:
def get_img_path(dir_path):
    # Keep only image files, matched by extension (case-insensitive);
    # sorted so that the output order is reproducible
    img_extensions = (".jpg", ".jpeg", ".png")
    img_list = [os.path.join(dir_path, f) for f in sorted(os.listdir(dir_path))
                if f.lower().endswith(img_extensions)]
    return [os.path.abspath(p) for p in img_list]  # convert to absolute paths

def load_model(device, model_path="facebook/dinov2-base"):
    # Load the model and its image processor
    processor = AutoImageProcessor.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path).to(device)
    return model, processor

In [8]:
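# Get the absolute paths of all image files in the given directory
image_paths = get_img_path(dir_path=IMG_PATH)
# Convert to RGB so that grayscale or RGBA files are handled the same way
images = [Image.open(path).convert("RGB") for path in image_paths]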

In [9]:
inputs = processor(images=images, return_tensors="pt")
outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0, :]
embeddings.shape

torch.Size([2, 768])

In [10]:
inputs.keys()

dict_keys(['pixel_values'])

### 3.2 Batch inference on the GPU

In [11]:
class ImageDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths, processor):
        self.image_paths = image_paths
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        return self.processor(images=image, return_tensors="pt")['pixel_values'][0]

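def batch_inference(dataloader, model, device):
    # Run the model batch by batch and collect the [CLS] embedding of each image
    embeddings = []
    device = torch.device(device)
    for pixel_values in dataloader:
        pixel_values = pixel_values.to(device)
        # Inference only: mixed precision, no gradient tracking
        with torch.autocast(device_type=device.type), torch.no_grad():
            outputs = model(pixel_values=pixel_values)

        # Move results to the CPU so GPU memory does not grow with the dataset
        embeddings.append(outputs.last_hidden_state[:, 0, :].cpu())

        # Release this batch's GPU memory before loading the next one
        del pixel_values, outputs
        if device.type == "cuda":
            torch.cuda.empty_cache()

    return torch.cat(embeddings, dim=0)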

In [12]:
# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the model
model, processor = load_model(device=device, model_path=MODEL_PATH)

Using device: cuda


In [13]:
# Get the absolute paths of all image files in the given directory
image_paths = get_img_path(dir_path=IMG_PATH)
# images = [Image.open(path) for path in image_paths]

# Load the data
dataset = ImageDataset(image_paths, processor)
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

In [14]:
embeddings = batch_inference(dataloader, model, device=device)
embeddings.shape

torch.Size([2, 768])

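## 4. Serving batch inference

### 4.1 Starting the server

FastAPI can be used to wrap the GPU batch inference as an API service. The service accepts a list of N base64 strings, each encoding one image, and returns a list of N image embeddings.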
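
The decoding step such a service needs on the way in can be sketched as follows. This is a minimal, standard-library-only illustration, not the code in `dinov2_server.py`; `decode_base64_images` is a hypothetical helper, and the `base64_images` key simply mirrors the request body used by the client cell in this notebook.

```python
import base64
import binascii
from typing import List


def decode_base64_images(base64_images: List[str]) -> List[bytes]:
    """Decode a list of base64 strings into raw image bytes, one entry per image."""
    decoded = []
    for i, b64 in enumerate(base64_images):
        try:
            # validate=True rejects characters outside the base64 alphabet
            decoded.append(base64.b64decode(b64, validate=True))
        except (binascii.Error, ValueError) as exc:
            raise ValueError(f"item {i} is not valid base64") from exc
    return decoded


# Round trip with dummy bytes standing in for the contents of an image file
payload = {"base64_images": [base64.b64encode(b"fake-image-bytes").decode("utf-8")]}
raw = decode_base64_images(payload["base64_images"])
print(len(raw))
```

Each decoded entry can then be opened with `Image.open(io.BytesIO(raw_bytes))` and passed through the same processor and model as in section 3.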

I wrote the server and client code and placed it under `./server` in the project:

- `dinov2_server.py`: DINOv2 embedding server
- `dinov2_client.py`: DINOv2 embedding client

Open a terminal and start the API service:

```bash
cd server
python dinov2_server.py
```

If the service starts successfully, the terminal prints the following messages:

```
INFO:     Started server process [22220]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8210 (Press CTRL+C to quit)
```

### 4.2 Running the client

The client supports multithreaded, batched calls. Let's run the client to get the embeddings for a set of images.
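
One way multithreaded, batched calls can be organized is sketched below. It is an illustration under assumptions, not the code in `dinov2_client.py`: `chunk`, `embed_in_parallel`, and `fake_send_batch` are hypothetical names, and `send_batch` stands for any function that sends one batch to the service and returns one embedding per image.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def embed_in_parallel(
    base64_images: Sequence[str],
    send_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 8,
    max_workers: int = 4,
) -> List[List[float]]:
    """Send batches concurrently and return embeddings in the input order."""
    batches = chunk(base64_images, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # so the embeddings stay aligned with the images
        results = list(pool.map(send_batch, batches))
    return [embedding for batch_result in results for embedding in batch_result]


def fake_send_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real HTTP call: one 1-dimensional "embedding" per item
    return [[float(len(item))] for item in batch]


print(embed_in_parallel(["a", "bb", "ccc"], fake_send_batch, batch_size=2))
```

With the `client` function defined in this notebook, `send_batch` could be `lambda batch: client(batch)["embeddings"]`.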

> Note: by default the server normalizes the returned embeddings. This can be changed with the optional `normalize` parameter; see `./server/dinov2_server.py` for details.
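
As an illustration of what normalization does to an embedding, here is a minimal sketch that assumes "normalize" means L2 normalization (scaling each vector to unit length). The server's actual behavior is defined in `dinov2_server.py`, which is not shown here; `l2_normalize` and `dot` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# For unit-length vectors the dot product equals the cosine similarity
print(dot(a, b))
```

If the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is convenient for the downstream clustering step.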

In [15]:
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
    return encoded_string

def client(base64_images):
    response = requests.post(
        API_URL,
        json={"base64_images": base64_images},
        timeout=30
    )
    response.raise_for_status()  # raise an exception on HTTP error status codes
    return response.json()

In [16]:
image_paths = [
    './img/cat.jpg',
    './img/book.jpg'
]
base64_images = [image_to_base64(p) for p in image_paths]
result = client(base64_images)
np.array(result['embeddings']).shape

(2, 768)

In [17]:
# print(result)