Add NSight Compute ranges, use CUDA events for timings #273

PeterZhizhin · 2024-04-28T00:29:44Z

CUDA events allow for more accurate timings (as measured by a GPU)
nvtxRangePush/nvtxRangePop Adds simple stack traces to NSight Systems:
- Sample run command: nsys profile mpirun --allow-run-as-root -np 2 ./train_gpt2cu
- Then inspect the resulting profile on a local machine with NVidia NSight Systems.

You'll see something like this in NSight for GPU profile:

ademeure · 2024-04-28T15:29:21Z

This is really nice!

It's a fair bit of boilerplate, maybe it could be wrapped into a tiny tiny class with a constructor/destructor calling push/pop, since the push/pop are nearly always at { }.

EDIT: Just noticed NVIDIA already provides that: https://nvidia.github.io/NVTX/doxygen-cpp/index.html - not sure about Andrej's stylistic preferences here but it feels slightly nicer to me, especially "NVTX3_FUNC_RANGE()" described at the very end.

karpathy · 2024-04-28T16:40:22Z

agree quite eager to get this in on a high level, but is there a way to skip all the added LOC around push/pop?

ademeure · 2024-04-29T16:11:18Z

agree quite eager to get this in on a high level, but is there a way to skip all the added LOC around push/pop?

what I mentioned above with NVTX3_FUNC_RANGE() means you only need 1 line for the push, instead of 2 lines (push+pop). I don't think we can get it to anything less than that. Also maybe the string manipulation could be done more cleanly in less code with C++ rather than sprintf.

karpathy · 2024-04-30T01:59:01Z

@PeterZhizhin are you able to incorporate the above into this PR? I can also take a look if no. Also this seems to have merge conflicts now bleh

karpathy · 2024-04-30T01:59:44Z

(Otherwise pretty eager to merge this!)

PeterZhizhin · 2024-04-30T07:43:20Z

I can, spend my time working on a video yesterday. Will do it today.

PeterZhizhin · 2024-04-30T17:45:45Z

Hello, @karpathy! I've rebased the PR and now I use a C++ class with a constructor-desctructor.

Using native NVTX3_FUNC_RANGE probably requires us to use CMake (https://github.com/NVIDIA/NVTX?tab=readme-ov-file#cmake). Seems like it's not bundled in with CUDA?

On my installation I get this error:

train_gpt2.cu:48:10: fatal error: nvtx3/nvtx3.hpp: No such file or directory
   48 | #include <nvtx3/nvtx3.hpp>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.

rosslwheeler · 2024-04-30T17:58:18Z

@PeterZhizhin - we can pull that header down from the GitHub repo (or others) in the Makefile. Do you want to pull the "c" directory in with the headers from that repo?

karpathy · 2024-04-30T20:02:38Z

oh i see, i didn't realize. in that case i would be ok with just keeping the two calls like before, that sounds gnarly, no pulling of anything

ademeure · 2024-04-30T20:40:47Z

oh i see, i didn't realize. in that case i would be ok with just keeping the two calls like before, that sounds gnarly, no pulling of anything

@PeterZhizhin I think your latest PR solves the 2 calls without needing any additional headers, right? So we should be able to use that, it seems clean enough to me (since we control the name I guess we could rename it something nicer than "NVTX_RANGE_FN" if we wanted to as well).

PeterZhizhin · 2024-05-01T22:31:40Z

Rebased and fixed the build errors. Good to merge.

PeterZhizhin force-pushed the add_nvtx_ranges_and_timings branch from 981652f to 9751477 Compare April 30, 2024 16:59

PeterZhizhin force-pushed the add_nvtx_ranges_and_timings branch 2 times, most recently from daffc27 to 3b466c2 Compare May 1, 2024 22:26

Add NSight Compute ranges, use CUDA events for timings

bcf7d4f

PeterZhizhin force-pushed the add_nvtx_ranges_and_timings branch from 3b466c2 to bcf7d4f Compare May 2, 2024 17:23

Added NVTX ranges for FlashAttention

41a0789

karpathy merged commit 79505bc into karpathy:master May 2, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add NSight Compute ranges, use CUDA events for timings #273

Add NSight Compute ranges, use CUDA events for timings #273

PeterZhizhin commented Apr 28, 2024

ademeure commented Apr 28, 2024 •

edited

karpathy commented Apr 28, 2024

ademeure commented Apr 29, 2024

karpathy commented Apr 30, 2024

karpathy commented Apr 30, 2024

PeterZhizhin commented Apr 30, 2024

PeterZhizhin commented Apr 30, 2024

rosslwheeler commented Apr 30, 2024 •

edited

karpathy commented Apr 30, 2024

ademeure commented Apr 30, 2024

PeterZhizhin commented May 1, 2024

Add NSight Compute ranges, use CUDA events for timings #273

Add NSight Compute ranges, use CUDA events for timings #273

Conversation

PeterZhizhin commented Apr 28, 2024

ademeure commented Apr 28, 2024 • edited

karpathy commented Apr 28, 2024

ademeure commented Apr 29, 2024

karpathy commented Apr 30, 2024

karpathy commented Apr 30, 2024

PeterZhizhin commented Apr 30, 2024

PeterZhizhin commented Apr 30, 2024

rosslwheeler commented Apr 30, 2024 • edited

karpathy commented Apr 30, 2024

ademeure commented Apr 30, 2024

PeterZhizhin commented May 1, 2024

ademeure commented Apr 28, 2024 •

edited

rosslwheeler commented Apr 30, 2024 •

edited