
add gradient accumulation support #2049

Merged: 7 commits merged into nerfstudio-project:main on Jun 18, 2023
Conversation

@ksnzh (Contributor) commented Jun 8, 2023

This modification does not affect the current repo: the number of train steps stays the same as before, so the optimizer and scheduler behave as they did previously. Users can easily increase the effective batch size, since the NeRF network has no batch-norm layers.
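A minimal sketch of the idea (placeholder names such as get_sub_batch and accum_steps; not this PR's exact code): each sub-batch loss is divided by the accumulation count so the summed gradient matches one large batch, and the optimizer and scheduler still advance once per train step.

import torch

def train_step(model, optimizer, get_sub_batch, loss_fn, accum_steps: int = 2):
    """One train step that accumulates gradients over accum_steps sub-batches."""
    optimizer.zero_grad()
    for _ in range(accum_steps):
        inputs, targets = get_sub_batch()              # e.g. 4096 rays instead of 8192
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()                                # gradients accumulate in .grad
    optimizer.step()                                   # a single optimizer (and scheduler) step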

@tancik (Contributor) commented Jun 8, 2023

Run the following to fix the lint issues:

pip install -e .[dev]
pre-commit install
pre-commit

@brentyi (Collaborator) commented Jun 9, 2023

Would it make sense to:

  • Divide the batch size by the gradient accumulation count?
  • Provide an "auto" mode that computes the accumulation steps to achieve some effective batch size, while still respecting memory constraints?

(relevant to #2003, which won't fit on smaller GPUs)
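A hypothetical sketch of the "auto" idea (the function and both parameters are assumed names, not existing nerfstudio options): derive the accumulation count from a target effective batch size and the largest per-step ray count that fits in memory, then split the batch across the steps.

import math

def auto_accumulation(target_rays_per_batch: int, max_rays_per_step: int) -> tuple[int, int]:
    """Return (accum_steps, rays_per_step) for a desired effective batch size."""
    accum_steps = max(1, math.ceil(target_rays_per_batch / max_rays_per_step))
    rays_per_step = target_rays_per_batch // accum_steps   # divide the batch size by the count
    return accum_steps, rays_per_step

# auto_accumulation(8192, 4096) -> (2, 4096); auto_accumulation(8192, 3000) -> (3, 2730)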

@ksnzh (Contributor, Author) commented Jun 12, 2023

> Would it make sense to:
>
>   • Divide the batch size by the gradient accumulation count?
>   • Provide an "auto" mode that computes the accumulation steps to achieve some effective batch size, while still respecting memory constraints?
>
> (relevant to #2003, which won't fit on smaller GPUs)

I think it's better to let the user decide the batch size and the corresponding learning rate.

Review comment on the diff:

loss = functools.reduce(torch.add, loss_dict.values())
self.grad_scaler.scale(loss).backward()  # type: ignore
internal_step = 0
while True:

(Collaborator) This would be clearer with a for loop, since internal_step is unnecessary except for counting iterations.
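A self-contained sketch of that suggestion (assumed names, not the merged nerfstudio code): a plain for loop over the accumulation steps removes the need for the manual counter.

import functools
import torch

def accumulate_gradients(get_loss_dict, grad_scaler, accum_steps: int) -> None:
    """Run backward over accum_steps sub-batches; no internal_step counter needed."""
    for _ in range(accum_steps):
        loss_dict = get_loss_dict()                        # dict of named loss tensors
        loss = functools.reduce(torch.add, loss_dict.values())
        grad_scaler.scale(loss).backward()                 # gradients accumulate across iterations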

@kerrj (Collaborator) commented Jun 13, 2023

I think memory budgeting is a larger PR in itself, since going from batch size->memory consumption is a pretty nontrivial thing. I'd be in favor of merging in grad accumulation and thinking about how to automatically set these parameters later.

@ksnzh (Contributor, Author) commented Jun 16, 2023

Here are two experiments: one uses a batch of 8192 rays per step, and the other accumulates two steps of 4096 rays each.

https://wandb.ai/ksnzh/nerfstudio-public

[Screenshot from 2023-06-16 21-39-45]
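As a sanity check of the equivalence behind this comparison (a toy example, assuming no batch-dependent layers such as batch norm), accumulating two half-size batches with each loss scaled by 1/2 reproduces the full-batch gradient up to floating-point error:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x, y = torch.randn(8192, 3), torch.randn(8192, 1)

# single 8192-sample batch
model.zero_grad()
F.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# two accumulated 4096-sample batches, each loss scaled by 1/2
model.zero_grad()
for xb, yb in ((x[:4096], y[:4096]), (x[4096:], y[4096:])):
    (F.mse_loss(model(xb), yb) / 2).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True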

@ksnzh requested a review from kerrj on June 16, 2023
@kerrj (Collaborator) left a review comment:

Looks good, thanks!

@tancik merged commit c55f5d3 into nerfstudio-project:main on Jun 18, 2023 (4 checks passed)
lucasthahn pushed a commit to TNE-ai/nerfstudio that referenced this pull request on Jun 20, 2023:
* add gradient accumulation support

* fix 2 blank lines

* fix possibly unbound variable

* update gradient accumulation step with for loop

* add assert and ignore pyright check

---------

Co-authored-by: Zhang Jian <zhangjian49@lenovo.com>