# Marigold Pipelines for Computer Vision Tasks

**Marigold** is a diffusion-based dense prediction approach and a set of pipelines for various computer vision tasks, such as monocular depth estimation.

Each pipeline supports one CV task: it takes an RGB image as input and produces a *prediction* of the modality of interest.

| Pipeline | Predicted Modality |
| -------- | ------------------ |
| MarigoldDepthPipeline | Depth, Disparity |
| MarigoldNormalsPipeline | Surface normals |

The official checkpoints are hosted under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face and can also be used with the official [codebase](https://github.com/prs-eth/marigold).

## Depth Prediction Quick Start

In [None]:
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image, make_image_grid

pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0', # we use LCM ckpt here for speed
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)

# The image processor helpers return lists with one entry per input image
vis = pipe.image_processor.visualize_depth(depth.prediction)
depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
make_image_grid([image, vis[0], depth_16bit[0]], rows=1, cols=3)

The `visualize_depth` function applies a matplotlib colormap (selectable through the `color_map` argument) to map the predicted pixel values from a single-channel `[0, 1]` depth range into an RGB image.
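
To make the mapping concrete, here is a minimal sketch of how a single-channel `[0, 1]` map can be turned into RGB by interpolating along a color ramp. It is an illustration only: the helper `colorize` and its three-color ramp are hypothetical stand-ins for a matplotlib colormap, not the pipeline's implementation.

```python
import numpy as np

def colorize(depth, ramp=((255, 0, 0), (255, 255, 0), (0, 0, 255))):
    """Map a [0, 1] array of shape (H, W) to a uint8 RGB array of shape (H, W, 3).

    `ramp` is a hypothetical color ramp (red -> yellow -> blue) whose colors
    are placed at evenly spaced positions between 0 and 1.
    """
    depth = np.clip(np.asarray(depth, dtype=np.float64), 0.0, 1.0)
    ramp = np.asarray(ramp, dtype=np.float64)
    stops = np.linspace(0.0, 1.0, len(ramp))
    # Interpolate each color channel independently along the ramp.
    channels = [np.interp(depth, stops, ramp[:, c]) for c in range(3)]
    return np.stack(channels, axis=-1).round().astype(np.uint8)

depth = np.array([[0.0, 0.25], [0.5, 1.0]])
rgb = colorize(depth)
print(rgb.shape, rgb.dtype)
```

Values outside `[0, 1]` are clipped in this sketch, so the ramp endpoints bound the output colors.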

## Surface Normals Prediction Quick Start

In [None]:
import torch
from diffusers import MarigoldNormalsPipeline
from diffusers.utils import load_image, make_image_grid

pipe = MarigoldNormalsPipeline.from_pretrained(
    'prs-eth/marigold-normals-lcm-v1-0', # we use LCM ckpt here for speed
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
normals = pipe(image)

# `visualize_normals` returns a list with one entry per input image
vis = pipe.image_processor.visualize_normals(normals.prediction)
make_image_grid([image, vis[0]], rows=1, cols=2)

The `visualize_normals` function maps the three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image. Conceptually, each pixel is painted according to the surface normal vector in a frame of reference where the `X` axis points right, the `Y` axis points up, and the `Z` axis points at the viewer.
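
As a rough illustration of this mapping, the sketch below assumes the common convention of shifting and scaling each component from `[-1, 1]` to `[0, 255]`. The helper `normals_to_rgb` is hypothetical and only mirrors the idea, not necessarily the exact code behind `visualize_normals`.

```python
import numpy as np

def normals_to_rgb(normals):
    """Map normals of shape (H, W, 3) with values in [-1, 1] to uint8 RGB."""
    normals = np.clip(np.asarray(normals, dtype=np.float64), -1.0, 1.0)
    # Shift [-1, 1] to [0, 1], then scale to the 8-bit range.
    return ((normals + 1.0) * 0.5 * 255.0).round().astype(np.uint8)

# One-row image: a surface facing the viewer (+Z), one facing right (+X), one facing up (+Y).
normals = np.array([[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
rgb = normals_to_rgb(normals)
print(rgb[0].tolist())
```

Under this convention, surfaces facing the viewer come out with a strong blue channel, which is why normal maps of frontal scenes tend to look bluish.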

## Speeding up inference

The quick start snippets above are already optimized for speed. They
* load an LCM checkpoint,
* use the `fp16` variant of weights and computation,
* perform just one denoising diffusion step.

Internally,
1. the VAE encoder encodes the input image,
2. the UNet performs one denoising step,
3. the VAE decoder decodes the prediction latent into pixel space.

Since Marigold's latent space is compatible with that of the base Stable Diffusion (SD) model, the pipeline can be sped up further by replacing the SD VAE with a lightweight alternative:

In [None]:
import torch
from diffusers import MarigoldDepthPipeline, AutoencoderTiny
from diffusers.utils import load_image, make_image_grid

# The checkpoint below is a depth model, so it is loaded with the depth pipeline
pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0', # we use LCM ckpt here for speed
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

pipe.vae = AutoencoderTiny.from_pretrained(
    'madebyollin/taesd',
    torch_dtype=torch.float16
).cuda()

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
make_image_grid([image, vis[0]], rows=1, cols=2)

As suggested in the Optimization notebook, we can add `torch.compile` to squeeze extra performance:

In [None]:
import torch
from diffusers import MarigoldDepthPipeline, AutoencoderTiny
from diffusers.utils import load_image, make_image_grid

# The checkpoint below is a depth model, so it is loaded with the depth pipeline
pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0', # we use LCM ckpt here for speed
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

pipe.vae = AutoencoderTiny.from_pretrained(
    'madebyollin/taesd',
    torch_dtype=torch.float16
).cuda()

pipe.unet = torch.compile(
    pipe.unet,
    mode='reduce-overhead',
    fullgraph=True
)

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
make_image_grid([image, vis[0]], rows=1, cols=2)
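
To check whether these optimizations pay off on a given machine, we can time the variants ourselves. The helper below is a minimal, hypothetical sketch (the name `benchmark` and its parameters are not part of any library). It runs a few warm-up calls first, which matters with `torch.compile` because the first calls include compilation time.

```python
import time

def benchmark(fn, warmup=2, iters=5, sync=None):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` is an optional callable run before each clock read; on a GPU this
    would typically be `torch.cuda.synchronize`, so queued kernels are counted.
    """
    for _ in range(warmup):  # warm-up calls absorb one-time costs such as compilation
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters

# Stand-in workload; with the pipeline above one would pass `lambda: pipe(image)`.
seconds = benchmark(lambda: sum(range(100_000)))
print(f"{seconds * 1e3:.2f} ms per call")
```

Timing each variant with the same input image and the same number of iterations keeps the comparison fair.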

## Qualitative Comparison with Depth Anything

In [None]:
# Marigold Depth
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image, make_image_grid

pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0', # we use LCM ckpt here for speed
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth_marigold = pipe(image)

vis = pipe.image_processor.visualize_depth(depth_marigold.prediction)[0]

In [None]:
# Depth Anything
from transformers import pipeline

pipe = pipeline(
    task='depth-estimation',
    model='LiheYoung/depth-anything-large-hf',
    device='cuda'  # transformers pipelines take the device as an argument
)

In [None]:
depth_anything = pipe(image)['depth']

In [None]:
make_image_grid([image, vis, depth_anything], rows=1, cols=3)

## Maximizing Precision and Ensembling

Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.

In [None]:
from diffusers import MarigoldNormalsPipeline
from diffusers.schedulers import DDIMScheduler, LCMScheduler
from diffusers.utils import load_image, make_image_grid

model_path = 'prs-eth/marigold-normals-v1-0'

model_paper_kwargs = {
    DDIMScheduler: {
        'num_inference_steps': 10,
        'ensemble_size': 10,
    },
    LCMScheduler: {
        'num_inference_steps': 4,
        'ensemble_size': 5
    }
}

pipe = MarigoldNormalsPipeline.from_pretrained(model_path).to('cuda')
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

normals = pipe(image, **pipe_kwargs)

vis = pipe.image_processor.visualize_normals(normals.prediction)
make_image_grid([image, vis[0]], rows=1, cols=2)
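
The ensembling idea can be illustrated outside the pipeline. The sketch below shows one simple way to combine several surface normal predictions: average the vectors and renormalize them to unit length. It is an assumption-laden toy (the helper `ensemble_normals` and the synthetic data are hypothetical), and the reduction used inside the pipeline may differ.

```python
import numpy as np

def ensemble_normals(predictions, eps=1e-8):
    """Average normals of shape (N, H, W, 3) over the ensemble axis and renormalize."""
    mean = np.asarray(predictions, dtype=np.float64).mean(axis=0)
    norm = np.linalg.norm(mean, axis=-1, keepdims=True)
    return mean / np.maximum(norm, eps)

rng = np.random.default_rng(0)
true_normal = np.array([0.0, 0.0, 1.0])
# Five hypothetical noisy predictions for a 2x2 image whose true normal is +Z everywhere.
noisy = true_normal + 0.2 * rng.standard_normal((5, 2, 2, 3))
noisy /= np.linalg.norm(noisy, axis=-1, keepdims=True)
combined = ensemble_normals(noisy)

# Cosine similarity to the true normal: ensemble vs. average of individual predictions.
print((combined @ true_normal).mean(), (noisy @ true_normal).mean())
```

In this toy setup, the ensemble's cosine similarity to the true normal is at least the average similarity of the individual predictions.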

## Quantitative Evaluation

To evaluate Marigold quantitatively on standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), follow the evaluation protocol:
* load the full-precision `fp32` model and use appropriate values for `num_inference_steps` and `ensemble_size`;
* set a random seed for reproducibility.

In [None]:
from diffusers import MarigoldDepthPipeline
from diffusers.schedulers import DDIMScheduler, LCMScheduler
from diffusers.utils import load_image, make_image_grid
import torch

device = 'cuda'
seed = 111
model_path = 'prs-eth/marigold-v1-0'

# Keys are the scheduler classes imported above
model_paper_kwargs = {
    DDIMScheduler: {
        "num_inference_steps": 50,
        "ensemble_size": 10,
    },
    LCMScheduler: {
        "num_inference_steps": 4,
        "ensemble_size": 10,
    },
}

generator = torch.Generator(device).manual_seed(seed)

pipe = MarigoldDepthPipeline.from_pretrained(model_path).to(device)
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image, generator=generator, **pipe_kwargs)
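
Once predictions are available, they still have to be scored against ground truth. The sketch below assumes the affine-invariant protocol commonly used for relative depth models: the prediction is aligned to the ground truth with a least-squares scale and shift, then scored with absolute relative error (AbsRel) and the δ1 accuracy. All helper names are hypothetical, and dataset-specific details such as valid depth ranges and crops are omitted.

```python
import numpy as np

def align_scale_shift(pred, target, mask):
    """Least-squares scale and shift mapping `pred` onto `target` over valid pixels."""
    A = np.stack([pred[mask], np.ones(mask.sum())], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, target[mask], rcond=None)
    return scale * pred + shift

def abs_rel(pred, target, mask):
    """Mean of |pred - target| / target over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - target[mask]) / target[mask]))

def delta1(pred, target, mask):
    """Fraction of valid pixels with max(pred/target, target/pred) < 1.25."""
    ratio = np.maximum(pred[mask] / target[mask], target[mask] / pred[mask])
    return float(np.mean(ratio < 1.25))

# Synthetic check: a prediction that is an exact affine transform of the ground truth.
target = np.array([[1.0, 2.0], [4.0, 8.0]])
pred = (target - 0.5) / 10.0
mask = target > 0  # valid-pixel mask
aligned = align_scale_shift(pred, target, mask)
print(abs_rel(aligned, target, mask), delta1(aligned, target, mask))
```

Because the synthetic prediction differs from the ground truth only by scale and shift, the alignment recovers it almost exactly. Real predictions leave a residual error that the two metrics quantify.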

## Using Predictive Uncertainty

The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random latents. As a side effect, it can be used to quantify epistemic (model) uncertainty.

In [None]:
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image, make_image_grid
import torch

pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0',
    torch_dtype=torch.float16,
    variant='fp16'
).to('cuda')

In [None]:
image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(
    image,
    ensemble_size=10, # any number greater than 1; higher values yield higher precision
    output_uncertainty=True
)

# The visualization helpers return lists with one entry per input image
uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
vis = pipe.image_processor.visualize_depth(depth.prediction)
make_image_grid([image, vis[0], uncertainty[0]], rows=1, cols=3)

Higher values (white) correspond to pixels where the model struggles to make consistent predictions.
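
To see where such an uncertainty map can come from, consider the toy sketch below. It reduces a stack of predictions to a per-pixel median and uses the per-pixel standard deviation as a disagreement measure. This is a simplification under stated assumptions: the helper `ensemble_with_uncertainty` is hypothetical, the predictions are assumed to be already aligned to a common scale, and the pipeline's own reduction may differ.

```python
import numpy as np

def ensemble_with_uncertainty(predictions):
    """Reduce (N, H, W) predictions to a per-pixel median and standard deviation."""
    stack = np.asarray(predictions, dtype=np.float64)
    return np.median(stack, axis=0), stack.std(axis=0)

# Three hypothetical predictions: they agree on the left pixel and disagree on the right one.
preds = np.array([
    [[0.50, 0.20]],
    [[0.50, 0.50]],
    [[0.50, 0.80]],
])
prediction, uncertainty = ensemble_with_uncertainty(preds)
print(prediction, uncertainty)
```

The left pixel gets zero uncertainty because all ensemble members agree, while the right pixel gets a clearly positive value.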

## Frame-by-frame Video Processing with Temporal Consistency

Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent initialization. For video, this becomes an obvious drawback compared to traditional end-to-end dense regression networks, because independently sampled noise can make consecutive frames inconsistent.

To address this issue, it is possible to pass `latents` to the pipeline, which define the starting point of diffusion.

In [None]:
from diffusers import MarigoldDepthPipeline, AutoencoderTiny
from diffusers.utils import load_image, export_to_gif
import imageio
from PIL import Image
from tqdm import tqdm
import torch

device = 'cuda'
path_in = 'obama.mp4'
path_out = 'obama_depth.gif'

pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0',
    torch_dtype=torch.float16,
    variant='fp16'
).to(device)

pipe.vae = AutoencoderTiny.from_pretrained(
    'madebyollin/taesd',
    torch_dtype=torch.float16
).to(device)
pipe.set_progress_bar_config(disable=True)

In [None]:
with imageio.get_reader(path_in) as reader:
    size = reader.get_meta_data()['size']
    last_frame_latent = None

    # Diffusion starts from Gaussian noise, hence `randn` rather than the uniform `rand`
    latent_common = torch.randn(
        (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size)))
    ).to(device=device, dtype=torch.float16)

    out = []
    for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"):
        frame = Image.fromarray(frame)
        latents = latent_common
        if last_frame_latent is not None:
            latents = 0.9 * latents + 0.1 * last_frame_latent

        depth = pipe(
            frame,
            match_input_resolution=False,
            latents=latents,
            output_latent=True
        )
        last_frame_latent = depth.latent

        out.append(pipe.image_processor.visualize_depth(depth.prediction)[0])

    export_to_gif(out, path_out, fps=reader.get_meta_data()['fps'])

The diffusion process starts from the given latent. The pipeline call sets `output_latent=True` so that the result exposes `depth.latent`, which is then blended into the next frame's latent initialization (10% previous latent, 90% shared noise in the code above).
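
The latent shape expression and the blending step in the loop above are easy to get wrong, so here is the same arithmetic as a small pure-Python sketch. The helpers `latent_shape` and `blend` are hypothetical. The defaults (768 for the longer side, a downscale factor of 8, and 4 latent channels) are taken from the code above.

```python
def latent_shape(width, height, processing_resolution=768, vae_downscale=8):
    """Latent (batch, channels, height, width) for a frame whose longer side
    is resized to `processing_resolution` before encoding."""
    longer = max(width, height)
    return (
        1,
        4,
        processing_resolution * height // (vae_downscale * longer),
        processing_resolution * width // (vae_downscale * longer),
    )

def blend(common, previous, weight_previous=0.1):
    """Per-element mix of the shared noise with the previous frame's latent."""
    return [(1.0 - weight_previous) * c + weight_previous * p
            for c, p in zip(common, previous)]

shape = latent_shape(1280, 720)         # a hypothetical 1280x720 video
mixed = blend([1.0, -2.0], [0.0, 2.0])  # two latent values, for illustration only
print(shape, mixed)
```

A larger `weight_previous` ties each frame more strongly to the one before it, while a smaller value keeps the initialization closer to the shared noise.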

## Marigold for ControlNet

In [None]:
from diffusers import MarigoldDepthPipeline, ControlNetModel, StableDiffusionXLControlNetPipeline, DPMSolverMultistepScheduler
from diffusers.utils import load_image, make_image_grid
import torch

device = 'cuda'

pipe = MarigoldDepthPipeline.from_pretrained(
    'prs-eth/marigold-depth-lcm-v1-0',
    torch_dtype=torch.float16,
    variant='fp16'
).to(device)

controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-depth-sdxl-1.0',
    torch_dtype=torch.float16,
    variant='fp16',
).to(device)

# Use a separate name so the Marigold pipeline in `pipe` is not overwritten
pipe_controlnet = StableDiffusionXLControlNetPipeline.from_pretrained(
    'SD161222/RealVisXL_V4.0',
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant='fp16'
).to(device)
pipe_controlnet.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe_controlnet.scheduler.config,
    use_karras_sigmas=True
)

In [None]:
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"
)
generator = torch.Generator(device).manual_seed(111)

# Predict depth with Marigold and take the first (only) visualization from the returned list
depth_image = pipe(image, generator=generator).prediction
depth_image = pipe.image_processor.visualize_depth(depth_image, color_map='binary')[0]

controlnet_out = pipe_controlnet(
    prompt='high quality photo of a sports bike, city',
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    image=depth_image,
    generator=generator,
    controlnet_conditioning_scale=0.7,
    control_guidance_end=0.7,
).images[0]
make_image_grid([image, depth_image, controlnet_out], rows=1, cols=3)