Hello everybody,
I came across this issue while experimenting with the VAE depth decoder ‘decode_depth’ and the single-inference function ‘single_infer’. The VAE decoder output is not bound to the range [-1, 1]. In many instances, for a given image (resized to the Stable Diffusion v2 native resolution), the decoded latent has min/max values of roughly [-1.5, 1.4]. These ranges vary with the image content, the aspect ratio, and, in the case of inference, the initial isotropic noise.
At the end of the inference function ‘single_infer’, the decoded generated depth map is simply clipped to [-1, 1]. This discards valuable depth information from the generated value distribution: every value below -1 (or above 1) is mapped to depth 0 (or 1, respectively). Intuitively, clipping produces a squished depth map. To retain the complete generated depth distribution, I suggest replacing the clip-and-shift step with min-max normalization to [0, 1]:
# Min-max normalization to [0, 1]; the clamp only guards against numerical edge cases.
min_depth = torch.min(depth)
max_depth = torch.max(depth)
depth = (depth - min_depth) / (max_depth - min_depth)
depth = torch.clamp(depth, 0, 1)
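For context, here is a minimal sketch of the two post-processing variants as I understand them; the function names are mine, and the clip-and-shift version only approximates what ‘single_infer’ currently does:
import torch

def postprocess_clip(depth: torch.Tensor) -> torch.Tensor:
    # Assumed current behavior: clip the decoded depth to [-1, 1], then shift/scale to [0, 1].
    # All values outside [-1, 1] collapse onto 0 or 1, so the tails of the distribution are lost.
    depth = torch.clamp(depth, -1.0, 1.0)
    return (depth + 1.0) / 2.0

def postprocess_minmax(depth: torch.Tensor) -> torch.Tensor:
    # Proposed alternative: per-map min-max normalization to [0, 1],
    # which preserves the relative ordering of all decoded values.
    min_depth = torch.min(depth)
    max_depth = torch.max(depth)
    return (depth - min_depth) / (max_depth - min_depth)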
This squishing also affects the final aggregated depth map, since some generated depth maps have decoded ranges closer to [-1, 1] and therefore retain their extreme depth values, while others lose them to clipping. Min-max normalization is usually not a fix in these kinds of situations; however, since the task is monocular depth estimation, the closest and farthest points must be mapped to 0 and 1, respectively.
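To illustrate the aggregation point, here is a rough sketch (not the repository's ensembling code; it assumes the N per-run predictions are stacked into one tensor). Normalizing each prediction to [0, 1] on its own before aggregating keeps maps with narrower decoded ranges from being dominated by maps with wider ones:
import torch

def normalize_each(preds: torch.Tensor) -> torch.Tensor:
    # preds: (N, H, W) stack of decoded depth maps, one per inference run (hypothetical layout).
    mins = preds.amin(dim=(1, 2), keepdim=True)
    maxs = preds.amax(dim=(1, 2), keepdim=True)
    return (preds - mins) / (maxs - mins)

# Example: aggregate by the per-pixel median after per-map normalization.
# aggregated = normalize_each(preds).median(dim=0).values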
Please let me know if I am missing something.
Best.
During fine-tuning, Stable Diffusion quickly adapts its latents so that, after decoding, they fall within [-1, 1]. Plotting a histogram of the decoded generated depth values shows that the overwhelming majority of the distribution lies within [-1, 1]; the few values outside this range can be considered outliers. If they are not clipped, extreme outliers squish the objects inside the [-1, 1] range in order to accommodate them. I presume that with more training time, the number of outliers converges to zero.
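For anyone who wants to reproduce this check, a rough sketch (assuming ‘decoded’ holds the raw output of ‘decode_depth’ before any clipping):
import torch
import matplotlib.pyplot as plt

def inspect_decoded(decoded: torch.Tensor) -> None:
    # 'decoded' is assumed to be the raw decoder output, before any clipping.
    outside = (decoded.abs() > 1.0).float().mean().item()
    print(f"fraction outside [-1, 1]: {outside:.4%}")
    print(f"min/max: {decoded.min().item():.3f} / {decoded.max().item():.3f}")
    # Histogram of the decoded depth distribution.
    plt.hist(decoded.flatten().detach().cpu().numpy(), bins=200)
    plt.xlabel("decoded depth value")
    plt.ylabel("count")
    plt.show()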