
more detailed explanation of Multi GPU #373

Open
hafezmg48 opened this issue May 6, 2024 · 3 comments

Comments

@hafezmg48

I was wondering if it is possible to add more explanation to the README.md file for the multi-GPU section. Specifically, a brief explanation of which functions and parts of inference and training are parallelized across multiple GPUs, and to what extent.

From my basic understanding, inference runs entirely on a single GPU, including all the forward passes. In the training phase, the batches of data are split across multiple GPUs and the resulting losses and gradients are summed up. Could you please clarify a little? Thanks a lot.

@chinthysl
Contributor

I think what you mentioned is correct.
We have distributed data parallel (DDP) implemented.
The data loader is designed to shard the dataset across processes without any overlap.
After each training step, MPI+NCCL are used to average the gradients and losses and distribute them across all processes.
Each process needs its own GPU; a DDP variant where multiple processes share a single GPU is not implemented.
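
For anyone looking for the shape of that setup, here is a minimal, hedged sketch (not the actual llm.c code) of the one-process-per-GPU layout: MPI broadcasts the NCCL unique id, each rank drives its own device, and the local gradients are summed across ranks with an all-reduce after the backward pass. The buffer size, the gradient contents, and the compile line are made up for illustration.

```c
// Sketch only: one MPI process per GPU, NCCL all-reduce to combine gradients.
// Assumes MPI, NCCL and the CUDA runtime are installed; a hypothetical build:
//   mpicc ddp_sketch.c -lnccl -lcudart -o ddp_sketch
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // one process per GPU (assumes a single node with at least nranks GPUs)
    cudaSetDevice(rank);

    // rank 0 creates the NCCL id; MPI broadcasts it so all ranks can join
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // stand-in for this rank's gradients; in real training they would come
    // from the backward pass over this rank's non-overlapping data shard
    const size_t n = 1024;
    float* grads;
    cudaMalloc((void**)&grads, n * sizeof(float));
    cudaMemset(grads, 0, n * sizeof(float));

    // sum gradients across all ranks in place; dividing by nranks to get the
    // average is typically folded into the optimizer step
    ncclAllReduce(grads, grads, n, ncclFloat, ncclSum, comm, 0 /*default stream*/);
    cudaDeviceSynchronize();

    printf("rank %d of %d: gradients all-reduced\n", rank, nranks);
    cudaFree(grads);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```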

@chinthysl
Contributor

chinthysl commented May 7, 2024

Multi-GPU inference is not implemented yet; only training is supported. Let's hope we see Tensor and Pipeline Parallel implementations for inference in the near future 😁. Let's stay updated on @karpathy's roadmap.

@hafezmg48
Author


Absolutely! @karpathy is doing a lot of great stuff and I have learned a lot here. The training is distributed between processes as separate batches, which will be more than enough for speedups in the training phase. But for inference parallelism, I was hoping for something similar to what is implemented in https://github.com/ggerganov/llama.cpp as --tensor-split in the example/main program.
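
To make the --tensor-split idea concrete, here is a toy, CPU-only C sketch (not llama.cpp's or llm.c's code; the sizes and shard count are invented) of a column-wise tensor split of one dense layer y = xW: each "shard" stands in for the slice of the weight matrix one GPU would hold, and the per-shard outputs are simply concatenated.

```c
// Conceptual tensor-parallel sketch: split W column-wise across SHARDS slices.
#include <stdio.h>

#define D_IN   4                      // input features
#define D_OUT  6                      // output features
#define SHARDS 2                      // stand-in for the number of GPUs
#define COLS_PER_SHARD (D_OUT / SHARDS)

int main(void) {
    float x[D_IN] = {1, 2, 3, 4};     // one token's activations
    float W[D_IN][D_OUT];             // full weight matrix (toy values)
    for (int i = 0; i < D_IN; i++)
        for (int j = 0; j < D_OUT; j++)
            W[i][j] = 0.01f * (i + 1) * (j + 1);

    float y[D_OUT] = {0};
    // each "GPU" owns a contiguous block of output columns and computes only
    // its slice of y; a column split needs no reduction, only concatenation
    // (a row split of W would instead require an all-reduce of partial sums)
    for (int s = 0; s < SHARDS; s++) {
        int start = s * COLS_PER_SHARD;
        for (int j = start; j < start + COLS_PER_SHARD; j++)
            for (int i = 0; i < D_IN; i++)
                y[j] += x[i] * W[i][j];
    }

    for (int j = 0; j < D_OUT; j++) printf("%.3f ", y[j]);
    printf("\n");
    return 0;
}
```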
