Add NVTX tracing hooks for profiling with Nsight Systems #2723
This is awesome, thanks for the contribution @maxhgerlach! @romerojosh can you help review?
Hey @maxhgerlach, this is an excellent contribution! I often add temporary NVTX ranges in my own development to get more insight from profiles generated with nsys, so having this functionality available to users is great. I will try to run a few workloads using this PR in the next couple of days to comment on the usability of the generated markers.
I think the implementation is solid in general, but I did leave a comment concerning the handling of ranges within the framework-specific code.
Thanks for reviewing this, @tgaddair and @romerojosh! In my experience the detailed annotations are most helpful to understand the impact of communication ops inserted into the forward pass of a model. Making sense of dozens of (potentially fused) gradient allreduces running in parallel is trickier with the Nsight UI.
Thanks for updating the PR to this more framework-general approach, @maxhgerlach!
I left a comment concerning how the ranges are handled with the grouped allreduce.
Besides this comment, would you be willing to add an environment variable option to disable the NVTX range generation, for example HOROVOD_DISABLE_NVTX_RANGES? There are still many NVTX ranges created in the path outside the timeline (e.g. one for every gradient allreduced in a typical case), so it would be nice to have a mechanism to disable them in an NVTX-enabled build.
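As a rough illustration of the kind of runtime switch requested here, the following C++ sketch reads such a variable once and caches the result. The helper names and the rule that any non-empty value other than "0" counts as set are assumptions made for illustration; they are not taken from Horovod's actual parsing code.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not Horovod's API): treats any non-empty value
// other than "0" as "set". The real parsing rules may differ.
bool EnvFlagIsSet(const char* value) {
  return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Reads the variable once and caches the answer, so that the per-tensor
// path only pays for a boolean check afterwards.
bool NvtxRangesDisabled() {
  static const bool disabled =
      EnvFlagIsSet(std::getenv("HOROVOD_DISABLE_NVTX_RANGES"));
  return disabled;
}
```

Caching the lookup in a function-local static keeps the check cheap even when one range would otherwise be created per allreduced gradient.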
Hey @romerojosh, @tgaddair, I've added an environment variable […]. There seems to be a general problem with the buildkite pipeline right now. It fails to build the image based on […]
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(assertion in Timeline::NegotiateStart would fail otherwise) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(remove ineffective const qualifiers, pass strings by reference if possible) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(remove some redundant qualifiers, avoid some copies) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…notations Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…notations II (add missing files) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…ce ranges. Also avoid some shared_ptr copies in Enqueue*(). Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(helps the compiler optimize it away when building without NVTX) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
325afd4 to 691db99
Hey @maxhgerlach, tests should be fixed on HEAD now, thanks for rebasing. @romerojosh is there anything else you wanted to address before merging?
LGTM!
Unit Test Results: 738 files ±0, 738 suites ±0, 4h 58m 53s ⏱️ ±0s. For more details on these failures, see this check. Results for commit 386be42. ± Comparison against base commit 386be42.
@maxhgerlach, thanks for a great addition. A question: does the output get added to the Horovod Timeline JSON file, or is a new separate file created?
Hi @whatdhack, thanks for your interest. You will need to run your Horovod script through […] You might want to check out the documentation for NVIDIA Nsight to see the various options for instrumenting the profiler. Often I would use […]
@maxhgerlach, thanks for the quick reply and for clarifying that NVTX data is collected per rank and is limited to a small number of ranks. Of course, with thousands of ranks it is not viable to look at the NVTX output for each one. Does Nsight do any post-processing to create an aggregate profile or report at this point? Also, is the NVTX output able to capture information on the negotiation MPI calls?
Hey @whatdhack, NVIDIA has quite a thorough User Guide that should help you get started. If you plug in the profiler similarly to […] The functionality from this PR helps you correlate what Horovod is doing with all the rich information the nsys profiler gives you on its own.
Description
This adds two types of hooks to NVTX to make it easier to profile Horovod models with NVIDIA Nsight Systems:
Very low-overhead annotations containing just the type of communication op, timestamps for its beginning and end, and the tensor size in bytes. These ranges start when tensors are enqueued and end when the op has finished. When no profiler is attached, the performance impact should be negligible. I find these easier to work with than manually inserting annotation ops into my model code.
More detailed annotations that are only active when a Horovod timeline is being written. These include the full name of each source tensor, which necessitates some string copies, so there is a slight overhead. Users may prefer to work with the Nsight application rather than the tracing UI in Chrome, and it is also interesting to correlate the Horovod timeline data with actual GPU kernel utilization and other profiling data available there.
Building Horovod with HOROVOD_WITHOUT_NVTX=1 disables all of this. If Horovod is built with NVTX support, users can still set the environment variable HOROVOD_DISABLE_NVTX_RANGES=1 to skip generating any ranges even when a profiler is attached.
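To make the range lifetime described above more concrete, here is a minimal C++ sketch of a range object that opens when an op is enqueued and closes when the object is destroyed. It is not the code from this PR: FakeProfiler and OpRange are hypothetical names, and the fake backend stands in for the real NVTX calls (for example nvtxRangeStartA and nvtxRangeEnd) so that the sketch compiles without the NVTX headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the profiler backend. With NVTX available,
// Start/End would map to calls such as nvtxRangeStartA / nvtxRangeEnd.
struct FakeProfiler {
  std::vector<std::string> events;
  std::uint64_t next_id = 1;

  std::uint64_t Start(const std::string& label) {
    events.push_back("start " + label);
    return next_id++;
  }
  void End(std::uint64_t id) {
    events.push_back("end " + std::to_string(id));
  }
};

// Hypothetical range handle: opened at enqueue time, closed on destruction.
// Passing a null profiler models a build or run with ranges disabled.
class OpRange {
 public:
  OpRange(FakeProfiler* profiler, const char* op_type, std::size_t bytes)
      : profiler_(profiler) {
    if (profiler_ != nullptr) {
      // Label carries only the op type and the tensor size in bytes.
      id_ = profiler_->Start(std::string(op_type) + " " +
                             std::to_string(bytes) + "B");
    }
  }
  ~OpRange() {
    if (profiler_ != nullptr) {
      profiler_->End(id_);
    }
  }
  // Non-copyable so that each range is ended exactly once.
  OpRange(const OpRange&) = delete;
  OpRange& operator=(const OpRange&) = delete;

 private:
  FakeProfiler* profiler_;
  std::uint64_t id_ = 0;
};
```

In this sketch the handle would be stored alongside the pending op, so its destructor runs once the op has finished. With a null profiler the constructor and destructor reduce to a pointer check, which mirrors the idea that disabled ranges should cost almost nothing.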