
more detailed explanation of Multi GPU #373

Open
hafezmg48 opened this issue May 6, 2024 · 3 comments

Comments

@hafezmg48

I was wondering if it is possible to add more explanation to the README.md file for the multi-GPU section. Specifically, a brief explanation of which functions and parts of inference and training are parallelized across multiple GPUs, and to what extent.

From my basic understanding, inference runs entirely on a single GPU, including all the forward passes. In the training phase, the batches of data are split across multiple GPUs and the resulting losses and gradients are summed up. Could you please clarify a little? Thanks a lot.

@chinthysl
Contributor

I think what you mentioned is correct.
We have distributed data parallel (DDP) implemented.
The data loader is designed to shard the dataset across processes without any overlap.
After each training step, MPI+NCCL are used to average the gradients and losses and distribute them across all processes.
Each process needs its own GPU; a DDP variant where multiple processes share a single GPU is not implemented.
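
For anyone looking for the shape of that setup, here is a minimal, hedged sketch (not the actual llm.c code) of the one-process-per-GPU layout: MPI broadcasts the NCCL unique id, each rank drives its own device, and the local gradients are summed across ranks with an all-reduce after the backward pass. The buffer size, the gradient contents, and the compile line are made up for illustration.

```c
// Sketch only: one MPI process per GPU, NCCL all-reduce to combine gradients.
// Assumes MPI, NCCL and the CUDA runtime are installed; a hypothetical build:
//   mpicc ddp_sketch.c -lnccl -lcudart -o ddp_sketch
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // one process per GPU (assumes a single node with at least nranks GPUs)
    cudaSetDevice(rank);

    // rank 0 creates the NCCL id; MPI broadcasts it so all ranks can join
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // stand-in for this rank's gradients; in real training they would come
    // from the backward pass over this rank's non-overlapping data shard
    const size_t n = 1024;
    float* grads;
    cudaMalloc((void**)&grads, n * sizeof(float));
    cudaMemset(grads, 0, n * sizeof(float));

    // sum gradients across all ranks in place; dividing by nranks to get the
    // average is typically folded into the optimizer step
    ncclAllReduce(grads, grads, n, ncclFloat, ncclSum, comm, 0 /*default stream*/);
    cudaDeviceSynchronize();

    printf("rank %d of %d: gradients all-reduced\n", rank, nranks);
    cudaFree(grads);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```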

@chinthysl
Contributor

chinthysl commented May 7, 2024

Multi-GPU inference is not implemented yet; only training is supported. Let's hope we see Tensor and Pipeline Parallel implementations for inference in the near future 😁. Let's stay updated on @karpathy's roadmap.

@hafezmg48
Author


Absolutely! @karpathy is doing a lot of great stuff and I have learned a lot here. The training is distributed between processes as separate batches, which will be more than enough for speedups in the training phase. But for inference parallelism, I was hoping for something similar to what is implemented in https://github.com/ggerganov/llama.cpp as --tensor-split in the example/main program.
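
To make the --tensor-split idea concrete, here is a toy, CPU-only C sketch (not llama.cpp's or llm.c's code; the sizes and shard count are invented) of a column-wise tensor split of one dense layer y = xW: each "shard" stands in for the slice of the weight matrix one GPU would hold, and the per-shard outputs are simply concatenated.

```c
// Conceptual tensor-parallel sketch: split W column-wise across SHARDS slices.
#include <stdio.h>

#define D_IN   4                      // input features
#define D_OUT  6                      // output features
#define SHARDS 2                      // stand-in for the number of GPUs
#define COLS_PER_SHARD (D_OUT / SHARDS)

int main(void) {
    float x[D_IN] = {1, 2, 3, 4};     // one token's activations
    float W[D_IN][D_OUT];             // full weight matrix (toy values)
    for (int i = 0; i < D_IN; i++)
        for (int j = 0; j < D_OUT; j++)
            W[i][j] = 0.01f * (i + 1) * (j + 1);

    float y[D_OUT] = {0};
    // each "GPU" owns a contiguous block of output columns and computes only
    // its slice of y; a column split needs no reduction, only concatenation
    // (a row split of W would instead require an all-reduce of partial sums)
    for (int s = 0; s < SHARDS; s++) {
        int start = s * COLS_PER_SHARD;
        for (int j = start; j < start + COLS_PER_SHARD; j++)
            for (int i = 0; i < D_IN; i++)
                y[j] += x[i] * W[i][j];
    }

    for (int j = 0; j < D_OUT; j++) printf("%.3f ", y[j]);
    printf("\n");
    return 0;
}
```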
