# Monocular depth estimation

Monocular depth estimation is a computer vision task that predicts the depth of a scene from a single image. In other words, it estimates how far each visible point in the scene is from a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, occlusion, and texture.

There are two main depth estimation categories:

Absolute depth estimation: This task variant aims to provide exact depth measurements from the camera's perspective. The term is used interchangeably with metric depth estimation, where depth is expressed in physical units such as meters or feet. Absolute depth estimation models output depth maps whose numerical values represent real-world distances.

Relative depth estimation: This task variant aims to predict the depth order of objects or points in a scene without providing precise measurements. These models output a depth map that indicates which parts of the scene are closer or farther relative to each other. For two points A and B, for example, the map tells you which one is closer to the camera, but not the actual distance to either.
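
To make the difference between the two categories concrete, here is a minimal sketch of how a relative depth map could be aligned to metric depth when a few reference measurements are available. It assumes the relative map differs from metric depth only by an unknown scale and shift. That is a simplification: some relative models predict inverse depth, in which case the fit would be done on inverse depth instead. The helper `align_scale_shift` and the toy arrays are illustrative and not part of any library.

```python
import numpy as np


def align_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift.

    Hypothetical helper: assumes an affine relation between the two maps.
    """
    rel = np.asarray(relative, dtype=np.float64).ravel()
    met = np.asarray(metric, dtype=np.float64).ravel()
    # Design matrix with one column for the scale and one for the shift
    A = np.stack([rel, np.ones_like(rel)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, met, rcond=None)
    return scale, shift


# Toy data: a 2x2 "metric" depth map in meters, and a relative map that
# keeps the same ordering but has lost the real-world scale and offset
metric = np.array([[1.0, 2.0], [4.0, 8.0]])
relative = (metric - 1.0) / 7.0

scale, shift = align_scale_shift(relative, metric)
recovered = scale * relative + shift
```

The relative map alone is enough to rank pixels from nearest to farthest, but the scale and shift can only be recovered with the help of external measurements.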

In this guide, we will see how to:
1. Infer relative depth using Depth Anything V2, a state-of-the-art zero-shot relative depth estimation model.
2. Infer absolute depth using ZoeDepth, an absolute depth estimation model.

# Libraries

In [None]:
pip install -q -U transformers

In [None]:

import torch
import requests
from PIL import Image
from transformers import pipeline

# Depth estimation via HF pipeline

In [None]:
# Load checkpoint and send computation to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "depth-anything/Depth-Anything-V2-base-hf"
pipe = pipeline("depth-estimation", model=checkpoint, device=device)

In [None]:
# Load image for analysis
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image

In [None]:
# Pass the image to the pipeline
# The pipeline returns a dictionary with two entries: predicted_depth and depth
# predicted_depth is a tensor holding the model's raw per-pixel depth prediction
# Because this checkpoint is a relative depth model, these values are not distances in meters
# depth is a PIL image that visualizes the depth estimation result
predictions = pipe(image)

In [None]:
# Visualize the prediction
predictions["depth"]
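
If you want to build your own visualization from the raw values rather than use the ready-made PIL image, one common approach is min-max normalization to 8-bit grayscale. The sketch below uses a small synthetic array in place of a real prediction. The helper `depth_to_uint8` is a hypothetical name, and this is not necessarily the exact scaling the pipeline applies internally.

```python
import numpy as np


def depth_to_uint8(depth):
    """Min-max normalize a 2D depth array to the 0..255 range (hypothetical helper)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-12:
        # Constant map: nothing to normalize, return an all-black image
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)
    return (scaled * 255.0).round().astype(np.uint8)


# Synthetic stand-in for a depth prediction
fake_depth = np.array([[0.0, 1.0], [2.0, 5.0]], dtype=np.float32)
gray = depth_to_uint8(fake_depth)
```

With a real prediction, the input could be obtained with something like `predictions["predicted_depth"].squeeze().cpu().numpy()`, and the result could be wrapped with `Image.fromarray(gray)` for display.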