Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NSight Compute ranges, use CUDA events for timings #273

Merged
merged 2 commits into from
May 2, 2024

Conversation

PeterZhizhin
Copy link
Contributor

  • CUDA events allow for more accurate timings (as measured by a GPU)
  • nvtxRangePush/nvtxRangePop Adds simple stack traces to NSight Systems:
    • Sample run command: nsys profile mpirun --allow-run-as-root -np 2 ./train_gpt2cu
    • Then inspect the resulting profile on a local machine with NVidia NSight Systems.

You'll see something like this in NSight for GPU profile:
NSight Systems sample profile

@ademeure
Copy link
Contributor

ademeure commented Apr 28, 2024

This is really nice!

It's a fair bit of boilerplate, maybe it could be wrapped into a tiny tiny class with a constructor/destructor calling push/pop, since the push/pop are nearly always at { }.

EDIT: Just noticed NVIDIA already provides that: https://nvidia.github.io/NVTX/doxygen-cpp/index.html - not sure about Andrej's stylistic preferences here but it feels slightly nicer to me, especially "NVTX3_FUNC_RANGE()" described at the very end.

@karpathy
Copy link
Owner

agree quite eager to get this in on a high level, but is there a way to skip all the added LOC around push/pop?

@ademeure
Copy link
Contributor

agree quite eager to get this in on a high level, but is there a way to skip all the added LOC around push/pop?

what I mentioned above with NVTX3_FUNC_RANGE() means you only need 1 line for the push, instead of 2 lines (push+pop). I don't think we can get it to anything less than that. Also maybe the string manipulation could be done more cleanly in less code with C++ rather than sprintf.

@karpathy
Copy link
Owner

@PeterZhizhin are you able to incorporate the above into this PR? I can also take a look if no. Also this seems to have merge conflicts now bleh

@karpathy
Copy link
Owner

(Otherwise pretty eager to merge this!)

@PeterZhizhin
Copy link
Contributor Author

I can, spend my time working on a video yesterday. Will do it today.

@PeterZhizhin
Copy link
Contributor Author

Hello, @karpathy! I've rebased the PR and now I use a C++ class with a constructor-desctructor.

Using native NVTX3_FUNC_RANGE probably requires us to use CMake (https://github.com/NVIDIA/NVTX?tab=readme-ov-file#cmake). Seems like it's not bundled in with CUDA?

On my installation I get this error:

train_gpt2.cu:48:10: fatal error: nvtx3/nvtx3.hpp: No such file or directory
   48 | #include <nvtx3/nvtx3.hpp>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.

@rosslwheeler
Copy link
Contributor

rosslwheeler commented Apr 30, 2024

@PeterZhizhin - we can pull that header down from the GitHub repo (or others) in the Makefile. Do you want to pull the "c" directory in with the headers from that repo?

@karpathy
Copy link
Owner

oh i see, i didn't realize. in that case i would be ok with just keeping the two calls like before, that sounds gnarly, no pulling of anything

@ademeure
Copy link
Contributor

oh i see, i didn't realize. in that case i would be ok with just keeping the two calls like before, that sounds gnarly, no pulling of anything

@PeterZhizhin I think your latest PR solves the 2 calls without needing any additional headers, right? So we should be able to use that, it seems clean enough to me (since we control the name I guess we could rename it something nicer than "NVTX_RANGE_FN" if we wanted to as well).

@PeterZhizhin PeterZhizhin force-pushed the add_nvtx_ranges_and_timings branch 2 times, most recently from daffc27 to 3b466c2 Compare May 1, 2024 22:26
@PeterZhizhin
Copy link
Contributor Author

Rebased and fixed the build errors. Good to merge.

@karpathy karpathy merged commit 79505bc into karpathy:master May 2, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants