Option to process images on GPU for speed at the expense of memory #2288
Analogous to #2110. This fix speeds up training by over 3x when using multi-GPU DDP.

The observation that prompted this PR: training with `--num-devices=2` is significantly slower than training with a single GPU, with extremely high CPU usage (~2600% on a 28-core CPU). Profiling with `--logging.profiler=pytorch` reveals that `collate_image_dataset_batch()` in `pixel_samplers.py` takes 70 ms with `--num-devices=2` versus 30 ms with a single GPU. The batch image tensors are stored on the CPU, so the indexing operation is CPU-intensive.

This PR adds an option to remove `"image"` from `exclude_batch_keys_from_device`, keeping the image tensor on the GPU. With suitable hardware, the option reduces CPU usage from 2600% to 200% and speeds up 2-GPU training from ~70 Krays/s to ~250 Krays/s. Single-GPU training also speeds up, from 100 to 140 Krays/s. VRAM usage increases from 2 GB to 5 GB. Profiling the 2-GPU training with this fix confirms that the pixel sampler is no longer a bottleneck.
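The mechanism can be sketched as follows. This is a minimal sketch, not the PR's actual code: the `images_on_gpu` option name is hypothetical, and `move_batch_to_device` only mirrors the behaviour of nerfstudio's batch-moving utility, which skips any key listed in `exclude_batch_keys_from_device` when transferring a batch to the training device.

```python
import torch

DEFAULT_EXCLUDE = ["image", "mask"]  # keys normally kept on the CPU

def exclude_keys(images_on_gpu: bool) -> list:
    """Build the exclude list; dropping "image" lets it move to the GPU."""
    keys = list(DEFAULT_EXCLUDE)
    if images_on_gpu:
        keys.remove("image")
    return keys

def move_batch_to_device(batch: dict, device: str, exclude=()) -> dict:
    # Move every tensor to `device`, except the excluded keys,
    # which stay wherever they already are (typically the CPU).
    return {k: (v if k in exclude else v.to(device)) for k, v in batch.items()}

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {"image": torch.zeros(2, 8, 8, 3), "indices": torch.zeros(2, 3)}

# Default: "image" is excluded, so pixel-sampler indexing runs on the CPU.
default_batch = move_batch_to_device(batch, device, exclude=exclude_keys(False))
# With the option: "image" is transferred along with the rest of the batch.
gpu_batch = move_batch_to_device(batch, device, exclude=exclude_keys(True))
```

Keeping `"image"` in the exclude list trades CPU time for lower VRAM usage; removing it does the opposite, which is why this is exposed as an opt-in rather than a new default.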
Training configuration: nerfacto default params, on F2Nerf grass scene
Machine: 2x Xeon Gold 5120 (disabled HT), 2x Titan RTX