
nerfacto-huge #2003

Merged: 45 commits from justin/nerf-gigantic into main, Jun 15, 2023

Conversation

kerrj
Collaborator

@kerrj kerrj commented May 25, 2023

WIP for a very high quality nerfacto version.
People should play with it and see how it works. I noticed stability issues on a couple of scenes (albeit ones captured from robots with bad camera poses); I think this mainly comes from the higher hashgrid resolution introducing more high-frequency noise.
Main parameters that matter:

  • batch size (surprisingly)
  • hidden dims for transient/mlp/color (I haven't ablated whether transient matters, but the others do for sure)
  • number of samples (cranked up to 512, 512, 64)
  • hashgrid resolution (fairly important for fine details on things like the lego and typewriter in the scene below)

I had stability issues when I used Adam; RAdam fixes this, but it feels odd to have to use RAdam. Maybe worth dropping the learning rate for these?

Comparison pics below between nerfacto-big and nerfacto-huge.
Big:
(screenshot: nerfacto-big render)
Huge:
(screenshot: nerfacto-huge render)
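The knobs above can be summarized in a sketch. The field names and the batch-size/hidden-dim values are illustrative assumptions, not the actual nerfstudio config attributes; only the sample counts, base resolution, and optimizer choice come from this thread.

```python
from dataclasses import dataclass

# Hypothetical summary of the parameters discussed above. Names and the
# batch-size/hidden-dim values are assumptions for illustration.
@dataclass
class HugeNerfactoKnobs:
    train_num_rays_per_batch: int = 16384      # batch size matters surprisingly much (value assumed)
    hidden_dim: int = 256                      # hidden dims for color/transient MLPs (value assumed)
    num_proposal_samples: tuple = (512, 512)   # proposal sampler: cranked to 512, 512
    num_nerf_samples: int = 64                 # final samples per ray: 64
    hashgrid_base_res: int = 32                # raised from 16; matters for fine detail
    optimizer: str = "RAdam"                   # Adam was unstable at these settings

knobs = HugeNerfactoKnobs()
```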

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

> I want to know: I used nerfacto-big (3090) and nerfacto (RTX A5500) to process the same data, and I found that the results of nerfacto-big are not as good as nerfacto's. Is it because of a graphics card problem?

@wtj-zhong Interesting, the GPU shouldn't matter. Could you give more details on the scenes/images?

```diff
@@ -112,8 +113,7 @@ def __init__(
         self.use_pred_normals = use_pred_normals
         self.pass_semantic_gradients = pass_semantic_gradients

-        base_res: int = 16
-        features_per_level: int = 2
+        base_res: int = 32
```
Collaborator Author

Neuralangelo uses 32 as base_res, maybe we could make it a parameter, but I think bumping to 32 is fairly safe

Contributor

This will probably break existing checkpoints. Maybe make it a parameter, with the default being the old value.

Collaborator Author

hmm ok
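A minimal sketch of the suggestion above: expose `base_res` as a constructor parameter with the old value (16) as the default, so existing checkpoints keep working while nerfacto-huge opts into 32. The class name is illustrative, not the actual nerfstudio field class.

```python
# Hypothetical sketch, not the real nerfstudio code: parameterize base_res
# with the old default so checkpoints trained before this PR still load.
class HashGridField:
    def __init__(self, base_res: int = 16, features_per_level: int = 2):
        self.base_res = base_res
        self.features_per_level = features_per_level

default_field = HashGridField()          # old behavior, checkpoint-compatible
huge_field = HashGridField(base_res=32)  # nerfacto-huge setting
```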

@kerrj kerrj marked this pull request as ready for review June 15, 2023 17:18
nerfstudio/configs/method_configs.py (review comments outdated, resolved)
Contributor

@tancik tancik left a comment


LGTM

@kerrj kerrj merged commit 8fad92c into main Jun 15, 2023
4 checks passed
@kerrj kerrj deleted the justin/nerf-gigantic branch June 15, 2023 18:05
@SharkWipf
Contributor

FWIW: a week or so ago I tried this branch and it ran perfectly on my 3090, even on a 1200-image sample input.
Retesting the now-merged version, running with 1200 sample images hits an out-of-memory error after ~175GB of system RAM is used, and running with 300 images takes 60GB of system memory and shortly afterwards OOMs on my GPU's 24GB of VRAM.

I'm not sure whether this is intentional, as the PR wasn't "done" before, but it is certainly explosive growth in resource usage compared to before.
If it is not intentional, I can create an issue.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

I'll look into this and try to patch it; the intention is to fit on a 24GB GPU.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

Try justin/nerfhuge-memory

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

#2082

@SharkWipf
Contributor

You move quickly!
System memory still seems way higher than before, climbing to ~60GB during load on 300 images before dropping to 34GB, but the VRAM issue seems to be solved: it is running at ~21-23GB of VRAM so far (though it is only 2% done).
I can even open the web viewer again while rendering, something I couldn't do anymore when training nerfacto-big in recent commits.
So the VRAM issue seems solved on this branch, but the system RAM issue (if it is an issue) is still present, going past 175GB to load 1200 images where before I could do it in somewhere under 100GB (not sure of the exact figure, but this machine only had 100GB of RAM allocated at the time, so that is an upper bound).

@SharkWipf
Contributor

I just noticed I'm getting some TCNN CutlassMLP warnings now that I didn't get before; they might be relevant to the memory usage, I suppose:
(screenshot: TCNN CutlassMLP warning output)

For now I'll leave this running overnight and see if it hits any OOMs, and I'll see if I can find anything more tomorrow.

@kerrj
Collaborator Author

kerrj commented Jun 15, 2023

That warning is not a problem; it just means that tinycudann is defaulting to a slightly slower MLP implementation that can handle larger hidden layers. Before, we didn't print anything; now it prints some warnings.

Does it seem to run without issues on your GPU now?
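For context, a sketch of why the warning appears: tiny-cuda-nn's fully fused MLP only supports small power-of-two hidden widths (16/32/64/128), so the wider layers used here fall back to the slower but more general CutlassMLP. This helper is hypothetical, not part of tinycudann's API.

```python
# Hypothetical helper mirroring tiny-cuda-nn's backend choice: fully fused
# kernels handle only certain hidden widths; anything else uses CutlassMLP.
def pick_mlp_backend(hidden_dim: int) -> str:
    fully_fused_widths = {16, 32, 64, 128}
    return "FullyFusedMLP" if hidden_dim in fully_fused_widths else "CutlassMLP"

print(pick_mlp_backend(64))   # nerfacto-sized MLPs stay fully fused
print(pick_mlp_backend(256))  # nerfacto-huge widths trigger the Cutlass fallback
```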

@machenmusik
Contributor

(Also, if your dataset has masks, #1730 changed handling)

@SharkWipf
Contributor

> Does it seem to run without issues on your GPU now?

Overnight it completed the run without errors; I haven't gotten around to checking the result yet, but I assume that's fine too.
I'll have to see if I can find the specific commit where system memory use increased so I can bisect it, though that will have to wait a bit as, well, my power will be out today, heh.

> (Also, if your dataset has masks, #1730 changed handling)

No masks here.

@SharkWipf
Contributor

SharkWipf commented Jun 16, 2023

Okay, power stayed on after all so I did some digging into the system memory issue.
The first commit that exhibits this behavior is f3ac598.
It's possible it was introduced in 4acc06b, but that commit throws an error when run, so I can't test it.
The last commit before that, 374d7fe, uses up to about 12GB of RAM; the later commits use >175GB of RAM for the same command (`ns-train nerfacto-huge --max-num-iterations 100000 --viewer.quit-on-train-completion True --pipeline.model.use_gradient_scaling True --pipeline.model.predict-normals True --data /home/sebastiaan/src/nerf/work/out/colmaps/VID_20230615_163207/1200 --output-dir /home/sebastiaan/src/nerf/work/out/models/VID_20230615_163207/100000`).

This all happens during the "Loading data batch" stage.
It also seems to affect the other nerfacto methods to a (much) lesser degree, where the loading is significantly slower and uses ~50-100% more memory than before the offending commit.
I'll do some more digging later to see if I can find the main-branch commit that caused this, and I'll open an issue when I've found it (unless you beat me to it, of course), since it doesn't seem to be introduced by this PR; this PR just seems disproportionately affected by it for some reason.
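For anyone reproducing measurements like the ones above, a minimal way to log peak resident memory around the "Loading data batch" stage is Python's stdlib `resource` module. Note the unit caveat: `ru_maxrss` is kilobytes on Linux but bytes on macOS.

```python
import resource
import sys

# Report peak resident set size of the current process in GB.
# ru_maxrss is kilobytes on Linux, bytes on macOS, hence the branch.
def peak_rss_gb() -> float:
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return rss / divisor

print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```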

@SharkWipf
Contributor

Sorry, no, it does not seem to affect the other nerfacto methods after all. There is some inconsistency on my system that makes the loading process take varying amounts of time, which made it look that way.
In other words, since it doesn't apply to the other methods, I can't effectively bisect it on the main branch unless I go cherry-picking and hand-merging individual commits.

I'll make an issue for this, since a closed PR is probably not the best way to track this.

lucasthahn pushed a commit to TNE-ai/nerfstudio that referenced this pull request Jun 20, 2023
An even larger version of nerfacto with scaled hashgrid resolution, MLP sizes, batch sizes, appearance embeddings, proposal sampler resolution, and more.
Also includes some minor tweaks to nerfacto-big configs
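As a sketch of what "scaled hashgrid resolution" means here: an Instant-NGP-style multiresolution hashgrid spaces its per-level resolutions geometrically between a base and a max resolution. The specific numbers below are assumptions for illustration, not the exact nerfacto-huge settings.

```python
import math

# Derive per-level resolutions for a multiresolution hashgrid, spaced
# geometrically from base_res to max_res (Instant-NGP style). The example
# values are illustrative, not the actual nerfacto-huge config.
def hashgrid_resolutions(base_res: int, max_res: int, num_levels: int) -> list:
    growth = math.exp((math.log(max_res) - math.log(base_res)) / (num_levels - 1))
    return [int(round(base_res * growth ** level)) for level in range(num_levels)]

levels = hashgrid_resolutions(base_res=32, max_res=8192, num_levels=16)
```

Raising `base_res` (16 to 32 in this PR) shifts every level toward finer detail, which matches both the quality gains and the high-frequency noise discussed above.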