Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/nccl only (delete MPI) #623

Closed
wants to merge 3 commits into from
Closed

feature/nccl only (delete MPI) #623

wants to merge 3 commits into from

Conversation

karpathy
Copy link
Owner

No description provided.

@karpathy
Copy link
Owner Author

few of the remaining issues:

  • let's find a way to init nccl without creating temporary files on the filesystem
  • move unified_buffer out of the model
  • improve multi_gpu_barrier, this can't be the way to do a barrier in nccl right?
  • fix up multi_gpu_float_sum it looks a bit odd
  • touch up docs a bit more, clarify MPI vs NCCL relationship etc.

@ngc92
Copy link
Contributor

ngc92 commented Jun 20, 2024

I don't think we can avoid the file system alltogether, as without MPI, this is the only "communication layer" that we actually have available before we initialize NCCL.
we should be much more careful about this file, though, e.g., if such a file remains left-over from a previous run, I think we just get race conditions in our file-system now.

@karpathy
Copy link
Owner Author

Copy pasting some Discord discussion. Adjustment to this PR

maybe one way we could do it is we

  1. remove MPI for all actual compute
  2. keep MPI as optional, use it for initializing NCCL
  3. in case you hate MPI or can't get it to work, fall back on the janky file system idea
    so in this world we keep the convenience of MPI and use it to init NCCL. and we don't delete that code, but we'd move it to its own mpi.h
    if you don't have MPI we fall back to init NCCL through shared file system
    certainly i think during the actual model run it feels very janky to use both MPI + NCCL for bits and pieces together.

@karpathy karpathy closed this Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants