In [None]:
!pip install -qU transformers accelerate

# Monocular Depth Estimation

**Monocular depth estimation** is a computer vision task that predicts the depth of a scene from a single image. In other words, it estimates how far the objects in a scene are from a single camera viewpoint.

Applications of monocular depth estimation include:
* 3D reconstruction
* augmented reality
* autonomous driving
* robotics

Depth estimation is challenging because it requires the model to understand the complex relationships between objects in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, occlusion, and texture.

Depth estimation falls into two categories:
* **Absolute depth estimation**, also known as **metric depth estimation**, aims to provide the exact distance from the camera, expressed in physical units such as meters or feet.
* **Relative depth estimation** aims to predict the depth order of objects or points in a scene without providing precise measurements. These models output a depth map that indicates which parts of the scene are closer or farther relative to each other, without giving the actual distance to any of them.
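The difference between the two categories can be illustrated with a small NumPy sketch. It assumes, as a simplification, that a relative depth map differs from the metric one by an unknown scale and shift (some models predict inverse depth instead); all values and variable names here are made up for illustration.

```python
import numpy as np

# Hypothetical metric depths (in meters) for four pixels.
metric_depth = np.array([0.8, 2.5, 1.2, 6.0])

# Assumption: the relative map equals the metric map up to an unknown
# positive scale and a shift, so it carries ordering but not distances.
scale, shift = 0.37, 1.5  # arbitrary illustrative values
relative_depth = scale * metric_depth + shift

# The near-to-far ordering of the pixels is identical in both maps.
metric_order = np.argsort(metric_depth)
relative_order = np.argsort(relative_depth)

# Ratios between depths are not preserved, so "pixel 3 is N times
# farther than pixel 0" cannot be read off the relative map.
metric_ratio = metric_depth[3] / metric_depth[0]
relative_ratio = relative_depth[3] / relative_depth[0]

print(metric_order, relative_order)
print(metric_ratio, relative_ratio)
```

Both `argsort` results agree, while the two ratios differ, which is why a relative depth map is enough for ordering objects but not for measuring them.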

We will use **Depth Anything V2**, a relative depth estimation model, and **ZoeDepth**, an absolute depth estimation model.

## Depth estimation pipeline

In [3]:
from transformers import pipeline
import torch
from accelerate.test_utils.testing import get_backend

device, _, _ = get_backend()
checkpoint = 'depth-anything/Depth-Anything-V2-base-hf'
pipe = pipeline(
    'depth-estimation',
    model=checkpoint,
    device=device
)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cpu


In [4]:
# load an example image
from PIL import Image
import requests

url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image

In [6]:
predictions = pipe(image)
predictions

{'predicted_depth': tensor([[1.7198, 1.7202, 1.7148,  ..., 1.3417, 1.3370, 1.3373],
         [1.7199, 1.7203, 1.7150,  ..., 1.3412, 1.3365, 1.3368],
         [1.7179, 1.7183, 1.7132,  ..., 1.3471, 1.3425, 1.3428],
         ...,
         [4.4570, 4.4574, 4.4519,  ..., 1.0097, 1.0122, 1.0120],
         [4.4591, 4.4595, 4.4540,  ..., 1.0104, 1.0130, 1.0128],
         [4.4590, 4.4594, 4.4538,  ..., 1.0104, 1.0129, 1.0128]]),
 'depth': <PIL.Image.Image image mode=L size=5184x3456>}

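The pipeline returns a dictionary with two entries:
* `predicted_depth`, a tensor holding the predicted depth value for each pixel. Because Depth Anything V2 is a relative depth model, these values are not distances in meters and are only meaningful relative to one another.
* `depth`, a PIL image that visualizes the depth estimation result.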
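To get a feel for `predicted_depth`, it can help to look at its value range and where the extreme values sit. The sketch below uses a small NumPy array as a stand-in for the tensor, an assumption made so the example runs without downloading a model. `summarize_depth` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

def summarize_depth(depth_map):
    """Return the shape, value range, and (row, col) of the extreme values."""
    depth_map = np.asarray(depth_map)
    argmin = np.unravel_index(depth_map.argmin(), depth_map.shape)
    argmax = np.unravel_index(depth_map.argmax(), depth_map.shape)
    return {
        "shape": depth_map.shape,
        "min": float(depth_map.min()),
        "max": float(depth_map.max()),
        "argmin": tuple(int(i) for i in argmin),
        "argmax": tuple(int(i) for i in argmax),
    }

# Stand-in values, loosely based on the corners of the tensor printed above.
fake_depth = np.array([[1.72, 1.34],
                       [4.46, 1.01]])
summary = summarize_depth(fake_depth)
print(summary)
```

With the real pipeline output, something like `summarize_depth(predictions['predicted_depth'].cpu().numpy())` should work. Whether larger values mean nearer or farther depends on the model, so it is worth checking the model card before interpreting them.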

In [7]:
predictions['depth']

## Depth estimation inference by hand

In [None]:
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

checkpoint = 'Intel/zoedepth-nyu-kitti'

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
# Move the model to the same device as the inputs to avoid a device mismatch
model = AutoModelForDepthEstimation.from_pretrained(checkpoint).to(device)

We reuse the same image and prepare it for the model with the `image_processor`, which takes care of the necessary image transformations such as resizing and normalization:

In [9]:
pixel_values = image_processor(
    image,
    return_tensors='pt'
).pixel_values.to(device)

In [10]:
import torch

with torch.no_grad():
    outputs = model(pixel_values)

We still need to post-process the results to remove any padding and resize the depth map to match the original image size. The `post_process_depth_estimation` method returns a list of dictionaries, one per image, each containing a `"predicted_depth"` entry.
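Conceptually, this post-processing amounts to a crop followed by a resize. The sketch below illustrates the idea on a tiny NumPy array. It is not the library's implementation: `crop_and_resize` is a hypothetical helper, the padding is assumed to be symmetric, and nearest-neighbour resizing stands in for whatever interpolation the image processor actually uses.

```python
import numpy as np

def crop_and_resize(depth, pad_h, pad_w, target_size):
    """Remove symmetric padding, then nearest-neighbour resize to (H, W)."""
    h, w = depth.shape
    cropped = depth[pad_h:h - pad_h, pad_w:w - pad_w]
    target_h, target_w = target_size
    # Map every output row/column back to its nearest source row/column.
    rows = np.arange(target_h) * cropped.shape[0] // target_h
    cols = np.arange(target_w) * cropped.shape[1] // target_w
    return cropped[np.ix_(rows, cols)]

# A 3x4 "depth map", padded by one pixel on every side (5x6 after padding).
original = np.arange(12, dtype=float).reshape(3, 4)
padded = np.pad(original, 1, mode="reflect")

# Strip the padding and upsample to a hypothetical original size of 6x8.
restored = crop_and_resize(padded, pad_h=1, pad_w=1, target_size=(6, 8))
print(restored.shape)
```

In the notebook itself none of this has to be done by hand, since `post_process_depth_estimation` handles both steps when given `source_sizes`.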

In [12]:
# ZoeDepth dynamically pads the input image.
# We pass the original image size as an argument to
# `post_process_depth_estimation` to remove the padding and resize to the original dimensions.
post_processed_output = image_processor.post_process_depth_estimation(
    outputs,
    source_sizes=[(image.height, image.width)]
)

predicted_depth = post_processed_output[0]['predicted_depth']

# Min-max normalize to [0, 1], then scale to 8-bit grayscale for visualization
depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
depth = depth.detach().cpu().numpy() * 255
depth = Image.fromarray(depth.astype('uint8'))
depth
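The min-max normalization used above divides by `max - min`, which is zero for a perfectly flat depth map. A slightly more defensive variant might look like the sketch below. `depth_to_uint8` and its `eps` parameter are hypothetical names, and the function works on NumPy arrays, so a torch tensor would first need `.detach().cpu().numpy()`.

```python
import numpy as np

def depth_to_uint8(depth, eps=1e-8):
    """Min-max normalize a depth map to 0..255, guarding against flat maps."""
    depth = np.asarray(depth, dtype=np.float64)
    span = depth.max() - depth.min()
    if span < eps:
        # A constant map has no contrast; return all zeros instead of NaNs.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - depth.min()) / span
    # Round rather than truncate, so values map to the nearest gray level.
    return np.round(scaled * 255).astype(np.uint8)

gray = depth_to_uint8(np.array([[1.0, 2.0], [3.0, 5.0]]))
flat = depth_to_uint8(np.full((2, 2), 4.2))
print(gray.min(), gray.max(), flat.max())
```

The result can be passed to `Image.fromarray` in the same way as in the cell above. Note that this variant rounds where the original cell truncates, so individual pixel values may differ by one gray level.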
