This issue is to discuss a known limitation: FLUTE expects a minimum of two GPUs for any CUDA-based training. There must always be a GPU for Worker 0, plus at least one more for client training. It would be valuable to be able to specify arbitrary mappings so that, say, Worker 0 and Worker 1 share the same GPU. From a memory standpoint this should be fine, because they never need the GPU at the same time. I'm not sure that torch.distributed can support arbitrary mappings (note: CUDA_VISIBLE_DEVICES=0,0 doesn't work as a solution). Alternatively, assigning Worker 0 to CPU and Workers 1+ to GPUs might be a reasonable solution: relatively speaking, model aggregation is less expensive and could potentially be done on CPU.
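To make the second idea concrete, here is a minimal sketch of the kind of rank-to-device mapping I have in mind (this is not FLUTE's actual code; `device_for_rank` is a hypothetical helper): rank 0 does aggregation on CPU, and client ranks cycle over whatever GPUs exist, so multiple workers can share one GPU when there are fewer GPUs than workers.

```python
import os

import torch
import torch.distributed as dist


def device_for_rank(rank: int) -> torch.device:
    """Map a worker rank to a device.

    Rank 0 (the aggregation server) stays on CPU; client workers
    (rank >= 1) are spread round-robin across the available GPUs,
    so several workers may share one GPU when there are more
    workers than GPUs.
    """
    if rank == 0 or not torch.cuda.is_available():
        return torch.device("cpu")
    n_gpus = torch.cuda.device_count()
    return torch.device(f"cuda:{(rank - 1) % n_gpus}")


def main():
    # torchrun / torch.distributed.launch export RANK for each process.
    rank = int(os.environ["RANK"])
    # The gloo backend supports CPU tensors, so rank 0 can aggregate on CPU.
    dist.init_process_group(backend="gloo")
    device = device_for_rank(rank)
    print(f"rank {rank} -> {device}")


if __name__ == "__main__":
    main()
```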
Thoughts?
Hi Rob, this issue has been addressed in the latest commit 43e1530. We have removed the hard constraint on the minimum number of GPUs available in FLUTE by allowing the Server and Clients to be instantiated on the same worker device. For more documentation on how to run an experiment using a single GPU, please refer to the README.