
Clipping is Removing Valuable Depth Estimation Values, Resulting in Squished Depth Maps #22

Closed
GonzaloMartinGarcia opened this issue Dec 19, 2023 · 2 comments

Comments

@GonzaloMartinGarcia

Hello everybody,

I came across this issue while experimenting with the VAE depth decoder ‘decode_depth’ and the single-inference function ‘single_infer’. The VAE decoder's output is not bound to the range [-1, 1]: for many images (resized to Stable Diffusion v2's native resolution), the decoded latent spans roughly [-1.5, 1.4]. The exact range varies with the image content, the aspect ratio, and, during inference, the initial isotropic noise.
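
For reference, this is a minimal sketch of how the decoded range can be inspected (the pipeline object and latent variable are hypothetical placeholders, not the repository's exact API):

import torch

with torch.no_grad():
    depth = pipe.decode_depth(depth_latent)  # hypothetical call to the VAE depth decoder
print(depth.min().item(), depth.max().item())  # e.g. roughly -1.5 and 1.4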

At the end of the inference function ‘single_infer’, the decoded depth map is simply clipped to [-1, 1]. This discards valuable depth information from the generated value distribution, assigning a depth of 0 (or 1, respectively) to every value outside [-1, 1]. Intuitively, clipping squishes the depth map. To retain the complete generated depth distribution, it would be better to replace the clip-and-shift with min-max normalization to [0, 1]:
min_depth = torch.min(depth)
max_depth = torch.max(depth)
# rescale so the nearest and farthest points map to 0 and 1
depth = (depth - min_depth) / (max_depth - min_depth)
depth = torch.clamp(depth, 0, 1)  # safety clamp; a no-op after the normalization
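
A self-contained toy comparison illustrates the difference (assuming the pipeline shifts the clipped values to [0, 1] via (depth + 1) / 2, as the 0/1 assignment above implies):

import torch

depth = torch.tensor([-1.5, -0.5, 0.0, 0.8, 1.4])  # example decoded values

# current behavior: clip to [-1, 1], then shift to [0, 1]
clipped = (torch.clamp(depth, -1, 1) + 1) / 2
# -> tensor([0.0000, 0.2500, 0.5000, 0.9000, 1.0000])

# proposed: min-max normalization, which keeps the full distribution
normalized = (depth - depth.min()) / (depth.max() - depth.min())
# -> tensor([0.0000, 0.3448, 0.5172, 0.7931, 1.0000])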

This squishing also affects the final aggregated depth map, because some generated depth maps have decoded ranges close to [-1, 1] and so keep their extreme depth values, while others do not. In general, min-max normalization is not a valid fix in this kind of situation. However, since the task is monocular depth estimation, the closest and farthest points must be mapped to the values 0 and 1, respectively.
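
For the ensemble, this suggests normalizing each prediction before aggregation, along these lines (a sketch; the median is just one plausible aggregator, not necessarily what the repository uses):

import torch

def normalize_each(preds):
    # preds: stack of N decoded depth maps for the same image, shape [N, H, W]
    mins = preds.amin(dim=(1, 2), keepdim=True)
    maxs = preds.amax(dim=(1, 2), keepdim=True)
    return (preds - mins) / (maxs - mins)

# aggregated = normalize_each(preds).median(dim=0).values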

Please let me know if I am missing something.
Best.

@jaidevshriram

I've noticed this as well - curious as to why the depth was clipped too!

@GonzaloMartinGarcia (Author) commented Jan 17, 2024

During fine-tuning, Stable Diffusion quickly adapts its latents so that, after decoding, they fall within [-1, 1]. Plotting a histogram of the decoded depth values shows that the overwhelming majority of the distribution is bounded by [-1, 1]; the few values outside this range can be considered outliers. If they are not clipped, extreme outliers squish the objects within [-1, 1] to accommodate them. I presume that with more training time, the number of outliers will converge to 0.
[Image: depth_map_histogram, a histogram of the decoded depth values]
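
Such a histogram can be reproduced along these lines (a sketch; `depth` is a decoded depth map tensor as above):

import matplotlib.pyplot as plt

plt.hist(depth.detach().flatten().cpu().numpy(), bins=200)
plt.axvline(-1, color="r")  # mark the [-1, 1] bounds
plt.axvline(1, color="r")
plt.xlabel("decoded depth value")
plt.ylabel("count")
plt.show()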
