
Clipping is Removing Valuable Depth Estimation Values, Resulting in Squished Depth Maps #22

Closed
GonzaloMartinGarcia opened this issue Dec 19, 2023 · 2 comments

Comments

@GonzaloMartinGarcia

Hello everybody,

I came across this issue while experimenting with the VAE depth decoder ‘decode_depth’ and the single-inference function ‘single_infer’. The VAE decoder's output is not bound to the range [-1, 1]: for many images (resized to Stable Diffusion v2's native resolution), the decoded latent spans roughly [-1.5, 1.4]. The exact range varies with the image content, the aspect ratio, and, during inference, the initial isotropic noise.
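
For reference, this is a minimal sketch of how the decoded range can be inspected (the pipeline object and latent variable are hypothetical placeholders, not the repository's exact API):

import torch

with torch.no_grad():
    depth = pipe.decode_depth(depth_latent)  # hypothetical call to the VAE depth decoder
print(depth.min().item(), depth.max().item())  # e.g. roughly -1.5 and 1.4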

At the end of the inference function ‘single_infer’, the decoded depth map is simply clipped to [-1, 1]. This discards valuable depth information from the generated value distribution, assigning a depth of 0 (or 1, respectively) to every value outside [-1, 1]. Intuitively, clipping squishes the depth map. To retain the complete generated depth distribution, it would be better to replace the clip-and-shift with min-max normalization to [0, 1]:
min_depth = torch.min(depth)
max_depth = torch.max(depth)
# rescale so the nearest and farthest points map to 0 and 1
depth = (depth - min_depth) / (max_depth - min_depth)
depth = torch.clamp(depth, 0, 1)  # safety clamp; a no-op after the normalization
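
A self-contained toy comparison illustrates the difference (assuming the pipeline shifts the clipped values to [0, 1] via (depth + 1) / 2, as the 0/1 assignment above implies):

import torch

depth = torch.tensor([-1.5, -0.5, 0.0, 0.8, 1.4])  # example decoded values

# current behavior: clip to [-1, 1], then shift to [0, 1]
clipped = (torch.clamp(depth, -1, 1) + 1) / 2
# -> tensor([0.0000, 0.2500, 0.5000, 0.9000, 1.0000])

# proposed: min-max normalization, which keeps the full distribution
normalized = (depth - depth.min()) / (depth.max() - depth.min())
# -> tensor([0.0000, 0.3448, 0.5172, 0.7931, 1.0000])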

This squishing also affects the final aggregated depth map, because some generated depth maps have decoded ranges close to [-1, 1] and so keep their extreme depth values, while others do not. In general, min-max normalization is not a valid fix in this kind of situation. However, since the task is monocular depth estimation, the closest and farthest points must be mapped to the values 0 and 1, respectively.
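
For the ensemble, this suggests normalizing each prediction before aggregation, along these lines (a sketch; the median is just one plausible aggregator, not necessarily what the repository uses):

import torch

def normalize_each(preds):
    # preds: stack of N decoded depth maps for the same image, shape [N, H, W]
    mins = preds.amin(dim=(1, 2), keepdim=True)
    maxs = preds.amax(dim=(1, 2), keepdim=True)
    return (preds - mins) / (maxs - mins)

# aggregated = normalize_each(preds).median(dim=0).values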

Please let me know if I am missing something.
Best.

@jaidevshriram

I've noticed this as well - curious as to why the depth was clipped too!

@GonzaloMartinGarcia (Author) commented Jan 17, 2024

During fine-tuning, Stable Diffusion quickly adapts its latents so that, after decoding, they fall within [-1, 1]. Plotting a histogram of the decoded depth values shows that the overwhelming majority of the distribution is bounded by [-1, 1]; the few values outside this range can be considered outliers. If they are not clipped, extreme outliers squish the objects within [-1, 1] to accommodate them. I presume that with more training time, the number of outliers will converge to 0.
[Image: depth_map_histogram, a histogram of the decoded depth values]
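
Such a histogram can be reproduced along these lines (a sketch; `depth` is a decoded depth map tensor as above):

import matplotlib.pyplot as plt

plt.hist(depth.detach().flatten().cpu().numpy(), bins=200)
plt.axvline(-1, color="r")  # mark the [-1, 1] bounds
plt.axvline(1, color="r")
plt.xlabel("decoded depth value")
plt.ylabel("count")
plt.show()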
