
nerfacto-huge #2003

Merged: 45 commits from justin/nerf-gigantic into main, Jun 15, 2023

Conversation

kerrj
Collaborator

@kerrj kerrj commented May 25, 2023

WIP for a very high quality nerfacto version.
People should play with it and see how it works. I noticed stability issues on a couple of scenes (albeit ones captured from robots with bad camera poses); I think this mainly comes from the higher hashgrid resolution introducing more high-frequency noise.
Main parameters that matter:

  • batch size (surprisingly)
  • hidden dims for transient/mlp/color (I haven't ablated whether transient matters, but the others do for sure)
  • number of samples (cranked up to 512, 512, 64)
  • hashgrid resolution (fairly important for fine details on things like the lego and typewriter in the scene below)

I had stability issues when I used Adam; RAdam fixes this, but it feels odd to have to use RAdam. Maybe worth dropping the learning rate for these?

Comparison pics below between nerfacto-big and nerfacto-huge.
Big:
(screenshot: nerfacto-big render)
Huge:
(screenshot: nerfacto-huge render)
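The knobs above can be summarized in a sketch. The field names and the batch-size/hidden-dim values are illustrative assumptions, not the actual nerfstudio config attributes; only the sample counts, base resolution, and optimizer choice come from this thread.

```python
from dataclasses import dataclass

# Hypothetical summary of the parameters discussed above. Names and the
# batch-size/hidden-dim values are assumptions for illustration.
@dataclass
class HugeNerfactoKnobs:
    train_num_rays_per_batch: int = 16384      # batch size matters surprisingly much (value assumed)
    hidden_dim: int = 256                      # hidden dims for color/transient MLPs (value assumed)
    num_proposal_samples: tuple = (512, 512)   # proposal sampler: cranked to 512, 512
    num_nerf_samples: int = 64                 # final samples per ray: 64
    hashgrid_base_res: int = 32                # raised from 16; matters for fine detail
    optimizer: str = "RAdam"                   # Adam was unstable at these settings

knobs = HugeNerfactoKnobs()
```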

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

> I want to know: I used nerfacto-big (3090) and nerfacto (RTX A5500) to process the same data, and I found that the results of nerfacto-big are not as good as nerfacto's. Is it because of a graphics card problem?

@wtj-zhong Interesting, the GPU shouldn't matter. Could you give more details on the scenes/images?

```diff
@@ -112,8 +113,7 @@ def __init__(
         self.use_pred_normals = use_pred_normals
         self.pass_semantic_gradients = pass_semantic_gradients

-        base_res: int = 16
-        features_per_level: int = 2
+        base_res: int = 32
```
Collaborator Author

Neuralangelo uses 32 as base_res, maybe we could make it a parameter, but I think bumping to 32 is fairly safe

Contributor

This will probably break existing checkpoints. Maybe make it a parameter, with the default being the old value.

Collaborator Author

hmm ok
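A minimal sketch of the suggestion above: expose `base_res` as a constructor parameter with the old value (16) as the default, so existing checkpoints keep working while nerfacto-huge opts into 32. The class name is illustrative, not the actual nerfstudio field class.

```python
# Hypothetical sketch, not the real nerfstudio code: parameterize base_res
# with the old default so checkpoints trained before this PR still load.
class HashGridField:
    def __init__(self, base_res: int = 16, features_per_level: int = 2):
        self.base_res = base_res
        self.features_per_level = features_per_level

default_field = HashGridField()          # old behavior, checkpoint-compatible
huge_field = HashGridField(base_res=32)  # nerfacto-huge setting
```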

@kerrj kerrj marked this pull request as ready for review June 15, 2023 17:18
nerfstudio/configs/method_configs.py (review comments outdated, resolved)
Contributor

@tancik tancik left a comment


LGTM

@kerrj kerrj merged commit 8fad92c into main Jun 15, 2023
4 checks passed
@kerrj kerrj deleted the justin/nerf-gigantic branch June 15, 2023 18:05
@SharkWipf
Contributor

FWIW: a week or so ago I tried this branch and it ran perfectly on my 3090, even on a 1200-image sample input.
Retesting the now-merged version, running with 1200 sample images hits an out-of-memory error after ~175GB of system RAM is used, and running with 300 images takes 60GB of system memory and shortly afterwards OOMs on my GPU's 24GB of VRAM.

I'm not sure whether this is intentional, as the PR wasn't "done" before, but it is certainly explosive growth in resource usage compared to before.
If it is not intentional, I can create an issue.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

I'll look into this and try to patch it; the intention is to fit on a 24GB GPU.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

Try justin/nerfhuge-memory

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

#2082

@SharkWipf
Contributor

You move quickly!
System memory still seems way higher than before, climbing to ~60GB during load on 300 images before dropping to 34GB, but the VRAM issue seems to be solved: it is running at ~21-23GB of VRAM so far (though it is only 2% done).
I can even open the web viewer again while rendering, something I couldn't do anymore when training nerfacto-big in recent commits.
So the VRAM issue seems solved on this branch, but the system RAM issue (if it is an issue) is still present, going past 175GB to load 1200 images where before I could do it in somewhere under 100GB (not sure of the exact figure, but this machine only had 100GB of RAM allocated at the time, so that is an upper bound).

@SharkWipf
Contributor

I just noticed I'm getting some TCNN CutlassMLP warnings now that I didn't get before; they might be relevant to the memory usage, I suppose:
(screenshot: TCNN CutlassMLP warning output)

For now I'll leave this running overnight and see if it hits any OOMs, and I'll see if I can find anything more tomorrow.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

That warning is not a problem; it just means that tinycudann is defaulting to a slightly slower MLP implementation that can handle larger hidden layers. Before, we didn't print anything; now it prints some warnings.

Does it seem to run without issues on your GPU now?
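For context, a sketch of why the warning appears: tiny-cuda-nn's fully fused MLP only supports small power-of-two hidden widths (16/32/64/128), so the wider layers used here fall back to the slower but more general CutlassMLP. This helper is hypothetical, not part of tinycudann's API.

```python
# Hypothetical helper mirroring tiny-cuda-nn's backend choice: fully fused
# kernels handle only certain hidden widths; anything else uses CutlassMLP.
def pick_mlp_backend(hidden_dim: int) -> str:
    fully_fused_widths = {16, 32, 64, 128}
    return "FullyFusedMLP" if hidden_dim in fully_fused_widths else "CutlassMLP"

print(pick_mlp_backend(64))   # nerfacto-sized MLPs stay fully fused
print(pick_mlp_backend(256))  # nerfacto-huge widths trigger the Cutlass fallback
```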

@machenmusik
Contributor

(Also, if your dataset has masks, #1730 changed handling)

@SharkWipf
Contributor

> Does it seem to run without issues on your GPU now?

Overnight it completed the run without errors; I haven't gotten around to checking the result yet, but I assume that's fine too.
I'll have to see if I can find the specific commit where system memory use increased so I can bisect it, though that will have to wait a bit as, well, my power will be out today, heh.

> (Also, if your dataset has masks, #1730 changed handling)

No masks here.

@SharkWipf
Contributor

SharkWipf commented Jun 16, 2023

Okay, power stayed on after all so I did some digging into the system memory issue.
The first commit that exhibits this behavior is f3ac598.
It's possible it was introduced in 4acc06b, but that commit throws an error when run, so I can't test it.
The last commit before that, 374d7fe, uses up to about 12GB of RAM; the later commits use >175GB of RAM for the same command (`ns-train nerfacto-huge --max-num-iterations 100000 --viewer.quit-on-train-completion True --pipeline.model.use_gradient_scaling True --pipeline.model.predict-normals True --data /home/sebastiaan/src/nerf/work/out/colmaps/VID_20230615_163207/1200 --output-dir /home/sebastiaan/src/nerf/work/out/models/VID_20230615_163207/100000`).

This all happens during the "Loading data batch" stage.
It also seems to affect the other nerfacto methods to a (much) lesser degree, where the loading is significantly slower and uses ~50-100% more memory than before the offending commit.
I'll do some more digging later to see if I can find the main-branch commit that caused this, and I'll open an issue when I've found it (unless you beat me to it, of course), since it doesn't seem to be introduced by this PR; this PR just seems disproportionately affected by it for some reason.
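For anyone reproducing measurements like the ones above, a minimal way to log peak resident memory around the "Loading data batch" stage is Python's stdlib `resource` module. Note the unit caveat: `ru_maxrss` is kilobytes on Linux but bytes on macOS.

```python
import resource
import sys

# Report peak resident set size of the current process in GB.
# ru_maxrss is kilobytes on Linux, bytes on macOS, hence the branch.
def peak_rss_gb() -> float:
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return rss / divisor

print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```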

@SharkWipf
Contributor

Sorry, no, it does not seem to affect the other nerfacto methods after all. There is some inconsistency on my system that makes the loading process take varying amounts of time, which made it look that way.
In other words, since it doesn't apply to the other methods, I can't effectively bisect it on the main branch unless I go cherry-picking and hand-merging individual commits.

I'll make an issue for this, since a closed PR is probably not the best way to track this.

lucasthahn pushed a commit to TNE-ai/nerfstudio that referenced this pull request Jun 20, 2023
An even larger version of nerfacto with scaled hashgrid resolution, MLP sizes, batch sizes, appearance embeddings, proposal sampler resolution, and more.
Also includes some minor tweaks to nerfacto-big configs
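As a sketch of what "scaled hashgrid resolution" means here: an Instant-NGP-style multiresolution hashgrid spaces its per-level resolutions geometrically between a base and a max resolution. The specific numbers below are assumptions for illustration, not the exact nerfacto-huge settings.

```python
import math

# Derive per-level resolutions for a multiresolution hashgrid, spaced
# geometrically from base_res to max_res (Instant-NGP style). The example
# values are illustrative, not the actual nerfacto-huge config.
def hashgrid_resolutions(base_res: int, max_res: int, num_levels: int) -> list:
    growth = math.exp((math.log(max_res) - math.log(base_res)) / (num_levels - 1))
    return [int(round(base_res * growth ** level)) for level in range(num_levels)]

levels = hashgrid_resolutions(base_res=32, max_res=8192, num_levels=16)
```

Raising `base_res` (16 to 32 in this PR) shifts every level toward finer detail, which matches both the quality gains and the high-frequency noise discussed above.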