
Models turn to swiss cheese after >5000 iters #14

Closed

Sazoji opened this issue Apr 2, 2022 · 4 comments

Comments

Sazoji commented Apr 2, 2022

[Images: img_mesh_pass_000050, img_mesh_pass_000060]
Which loss is supposed to control mesh regularity? I've been focusing on this relatively simple model to see what values work well when training on handheld video. The left-hand outer image looks like it fits the input images until about 5k iterations into training (batch size 1), but the actual mesh looks very irregular and finally collapses near the end of my training tests.

[image]
The COLMAP tracking (both exhaustive and sequential matching) looks like it maps accurately. The paper describes a loss intended to address this issue, but I don't know whether the total loss needs to be reduced to account for the batch size, or the regularity loss needs to be increased (and I don't know the config option for it).
30% of the dataset images have been manually removed due to motion blur or being too far off the edges, which has helped in early training and slightly reduced early collapses.

I have included the dataset, the key iterations where collapse starts (0-1000 and 4000-6000), and the final models in the zip below (updated with more compressed iteration images):
https://files.catbox.moe/3v79c6.zip

Is the training scheme running through the dataset sequentially, so that the final iterations fail because of the images at the end? Both passes seem to fail near the end of the session. If so, randomizing the image order (when the images come from video) would spread the bad frames out instead of destroying the model at the end of training.

jmunkberg (Collaborator) commented:

Training should already use a shuffled order of the training images, unless you have changed that in your fork:
https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L370
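Roughly this pattern, shown here as a generic PyTorch sketch rather than the actual code at that line:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for a real multi-view dataset: 87 "views", one index each.
views = TensorDataset(torch.arange(87))

# shuffle=True re-randomizes the view order every epoch, so problematic
# frames from the end of a video are spread across training instead of
# all landing in the final iterations.
loader = DataLoader(views, batch_size=1, shuffle=True)

for it, (idx,) in enumerate(loader):
    pass  # one optimization step per sampled view would go here
```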

If it looks good up to 5k iterations, it seems to be a problem in the second phase of training (where we switch from volumetric texturing with an MLP to standard 2D textures, and only optimize vertex positions, not topology).
We use a mesh Laplacian regularizer to promote well-formed meshes (only applied in the second phase of training). You can control its influence with the config parameter "laplace_scale": 10000. The default value is pretty high, however.

If you see divergence after the first part of training, one option is to lower the learning rate in the second pass: change "learning_rate": [0.03, 0.01] to "learning_rate": [0.03, 0.003] or lower. You can also lock the geometry in the second pass and only optimize lighting and textures; this is controlled with the "lock_pos": true config option.

Also, if your GPU memory allows, I would recommend running with a batch size of 4 or 8 to get less noisy gradients and better convergence.
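Put together, the suggestions above could look something like this in a config file (a sketch only: the "batch" key for batch size and the concrete values are examples rather than verified settings; set "lock_pos": true instead if you prefer to freeze geometry in the second pass):

```json
{
    "learning_rate": [0.03, 0.003],
    "laplace_scale": 10000,
    "lock_pos": false,
    "batch": 4
}
```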

JHnvidia (Collaborator) commented Apr 4, 2022

There seems to be an issue with the dataset. There are just 81 elements in the poses_bounds.npy array, but 87 images.

Note that disabling the assert means that the poses will be mapped to the first 81 image/mask files in the order enumerated by glob.glob(). If you have removed any other images (and I don't see any obvious problems in the last 7 images), the poses will be out of sync causing corruption. I note that the error convergence is quite erratic and cyclical, which could indicate bad poses.

```
iter=  280, img_loss=0.070088, reg_loss=0.235528, lr=0.02636, time=264.1 ms, rem=25.18 m
iter=  290, img_loss=0.058213, reg_loss=0.233342, lr=0.02624, time=262.8 ms, rem=25.01 m
iter=  300, img_loss=0.060313, reg_loss=0.231738, lr=0.02612, time=260.0 ms, rem=24.70 m
iter=  310, img_loss=0.049451, reg_loss=0.230278, lr=0.02600, time=261.9 ms, rem=24.84 m
iter=  320, img_loss=0.044996, reg_loss=0.228497, lr=0.02588, time=260.2 ms, rem=24.63 m
iter=  330, img_loss=0.078116, reg_loss=0.226427, lr=0.02576, time=262.8 ms, rem=24.83 m
iter=  340, img_loss=0.040160, reg_loss=0.224508, lr=0.02564, time=263.3 ms, rem=24.84 m
iter=  350, img_loss=0.036886, reg_loss=0.222771, lr=0.02552, time=261.8 ms, rem=24.65 m
```

There are no pose <-> image path links in the LLFF format, so there's unfortunately a direct ordering dependency between the image files and the .npy file.
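A minimal sketch of a consistency check along these lines (the images/ subdirectory name and file extensions are assumptions about the dataset layout; this is not code from the repo):

```python
import glob
import os

import numpy as np

def check_llff_consistency(dataset_dir):
    """Fail early when poses_bounds.npy and the image files are out of sync.

    LLFF stores no filenames, so poses are matched to images purely by
    sorted order; the counts must therefore agree exactly.
    """
    poses = np.load(os.path.join(dataset_dir, "poses_bounds.npy"))
    images = sorted(
        glob.glob(os.path.join(dataset_dir, "images", "*.jpg"))
        + glob.glob(os.path.join(dataset_dir, "images", "*.png"))
    )
    print(f"{poses.shape[0]} poses vs {len(images)} images")
    if poses.shape[0] != len(images):
        raise ValueError(
            "poses_bounds.npy and images/ are out of sync; "
            "regenerate the poses or remove the extra images."
        )
```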

Sazoji (Author) commented Apr 4, 2022

That would be precisely what caused it; it matches this merge: Fyusion/LLFF#60
I modified the assert to only require at least as many images as poses instead of an exact match, so I'll need to find a way to clean up using the generated images.txt, or further modify the LLFF loader, since NeRF datasets can generate valid sets from the same COLMAP list.
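Roughly what I mean by cleaning up against images.txt, as a sketch (assuming the standard COLMAP text export, where every other non-comment line describes a registered image and ends with its filename; the directory names are placeholders):

```python
import os
import shutil

def remove_unregistered_images(images_txt, images_dir, rejected_dir="rejected"):
    """Move images that COLMAP did not register out of the dataset so the
    image count matches poses_bounds.npy again."""
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    # Every other data line describes an image; its last field is the filename.
    registered = {line.split()[-1] for line in lines[::2]}

    os.makedirs(rejected_dir, exist_ok=True)
    for name in sorted(os.listdir(images_dir)):
        if name not in registered:
            shutil.move(os.path.join(images_dir, name),
                        os.path.join(rejected_dir, name))
```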

Sazoji (Author) commented Apr 6, 2022

[image]
OK, my fork now handles the images based on the list produced in the pull request I mentioned. I'll probably prefilter the images to remove unsuitable/blurry frames with a new video-to-images script using ffmpeg-python, and remove unused COLMAP images in the colmap2poses script to avoid this issue. My fork uses a modified dataloader that checks for view_imgs.txt, but I'd rather just modify the dataset generation script to keep it compatible with all LLFF dataset loaders.
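The prefilter I have in mind would look something like this sketch with ffmpeg-python plus OpenCV's variance-of-Laplacian as a blur score (the fps, scale, and threshold values are just placeholders):

```python
import glob
import os

import cv2          # opencv-python, used here only for a sharpness score
import ffmpeg       # ffmpeg-python

def video_to_frames(video_path, out_dir="frames", fps=2, max_height=864):
    """Extract frames with ffmpeg, downscaled so training fits in GPU memory."""
    os.makedirs(out_dir, exist_ok=True)
    (
        ffmpeg
        .input(video_path)
        .filter("fps", fps=fps)              # keep only a few frames per second
        .filter("scale", -2, max_height)     # cap height, preserve aspect ratio
        .output(os.path.join(out_dir, "%04d.png"))
        .run(quiet=True)
    )

def drop_blurry_frames(frame_dir="frames", threshold=60.0):
    """Delete frames whose variance-of-Laplacian sharpness falls below threshold."""
    for path in sorted(glob.glob(os.path.join(frame_dir, "*.png"))):
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        if cv2.Laplacian(gray, cv2.CV_64F).var() < threshold:
            os.remove(path)
```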

sdf_regularizer seems like it can be increased far higher than I expected, although I think blown-out or near-white textures cause issues (the same places where COLMAP can't map features seem to be where the mesh fails). A higher laplace_scale and "lock_pos": true help, with lock_pos just preventing deviation (or the occasional destruction of the model at lower loss) between the DMTet pass and the final pass.
Reducing the images to under 864p allows a batch size of 3 without OOM, or the near-OOM that causes HDR and texture noise (seen in the opening image).

Sazoji closed this as completed Apr 6, 2022