Add NVTX tracing hooks for profiling with Nsight Systems #2723

Merged
merged 15 commits into from Mar 29, 2021

Conversation

maxhgerlach
Collaborator

@maxhgerlach maxhgerlach commented Mar 18, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This adds two types of hooks to NVTX to make it easier to profile Horovod models with NVIDIA Nsight Systems:

  1. Very low-overhead annotations containing just the type of communication op, timestamps for its beginning and end, as well as the tensor size in bytes. These ranges start when tensors are enqueued and end when the op has finished. When no profiler is attached, the performance impact should be absolutely negligible. I find it easier to work with these than to manually insert annotation ops into my model code.

  2. More detailed annotations that are only active when a Horovod timeline is being written. These include the full name of each source tensor, which necessitates some string copies, so there is a slight overhead. Users may prefer to work with the Nsight application rather than the tracing UI in Chrome. It is also useful to correlate the Horovod timeline data with actual GPU kernel utilization and other profiling data available there.

Building Horovod with HOROVOD_WITHOUT_NVTX=1 disables all of this.

If Horovod is built with NVTX support, users can still set the environment variable HOROVOD_DISABLE_NVTX_RANGES=1 to skip generating any ranges even when a profiler is attached.
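
As a rough illustration of the runtime switch, the following Python sketch prepares the environment for a launch. `horovod_env` is a hypothetical helper, not part of Horovod; only the variable name `HOROVOD_DISABLE_NVTX_RANGES` and the value `1` come from the description above. Whether the variable reaches every worker depends on the launcher, which is an assumption here.

```python
import os


def horovod_env(base_env=None, nvtx_ranges=True):
    """Return a copy of the environment with NVTX range generation toggled.

    Hypothetical helper for illustration only; it just sets or clears
    HOROVOD_DISABLE_NVTX_RANGES and leaves everything else untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if nvtx_ranges:
        # Leave the variable unset so an NVTX-enabled build emits ranges.
        env.pop("HOROVOD_DISABLE_NVTX_RANGES", None)
    else:
        # Value taken from the PR description: 1 skips generating ranges.
        env["HOROVOD_DISABLE_NVTX_RANGES"] = "1"
    return env


# Possible usage (assumes the launcher forwards the environment to workers):
#   subprocess.run(["horovodrun", "-np", "2", "python", "horovod_script.py"],
#                  env=horovod_env(nvtx_ranges=False))
```

The build-time switch `HOROVOD_WITHOUT_NVTX=1` is separate: it is set when building Horovod and removes NVTX support entirely, so the runtime variable then has nothing to disable.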

Screen Shot Nsight with Horovod NVTX

@tgaddair
Collaborator

This is awesome, thanks for the contribution @maxhgerlach!

@romerojosh can you help review?

Collaborator

@romerojosh romerojosh left a comment

Hey @maxhgerlach, this is an excellent contribution! I often add temporary NVTX ranges in my own development to get more insight from profiles generated by nsys, so having this functionality available to users is great. I will try to run a few workloads using this PR in the next couple of days to comment on the usability of the generated markers.

I think the implementation is solid in general, but I left a comment concerning the handling of ranges within the framework-specific code.

horovod/tensorflow/mpi_ops.cc (outdated)
@maxhgerlach
Collaborator Author

maxhgerlach commented Mar 22, 2021

Thanks for reviewing this, @tgaddair and @romerojosh!

In my experience, the detailed annotations are most helpful for understanding the impact of communication ops inserted into the forward pass of a model. Making sense of dozens of (potentially fused) gradient allreduces running in parallel is trickier with the Nsight UI.

Collaborator

@romerojosh romerojosh left a comment

Thanks for updating the PR to this more framework-general approach, @maxhgerlach!

I left a comment concerning how the ranges are handled with the grouped allreduce.

Besides this comment, would you be willing to add an environment variable option to disable the NVTX range generation, for example HOROVOD_DISABLE_NVTX_RANGES? There are still many NVTX ranges created in the path outside the timeline (e.g. one for every gradient allreduced in a typical case), so it would be nice to have a mechanism to disable them in an NVTX-enabled build.

horovod/common/operations.cc (outdated)
@maxhgerlach
Collaborator Author

maxhgerlach commented Mar 26, 2021

Hey @romerojosh, @tgaddair, I've added an environment variable HOROVOD_DISABLE_NVTX_RANGES to disable generating any NVTX ranges from Horovod, and I believe I have fixed the issue with grouped allreduces. Happy to address further comments!

There seems to be a general problem with the buildkite pipeline right now. It fails to build the image based on tfhead because of an error with Keras. Apparently tests aren't run in the other containers either.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(assertion in Timeline::NegotiateStart would fail otherwise)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(remove ineffective const qualifiers, pass strings by reference if possible)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(remove some redundant qualifiers, avoid some copies)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…notations

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…notations II

(add missing files)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…ce ranges.

Also avoid some shared_ptr copies in Enqueue*().

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(helps the compiler optimize it away when building without NVTX)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

@tgaddair
Collaborator

Hey @maxhgerlach, tests should be fixed on HEAD now, thanks for rebasing. @romerojosh is there anything else you wanted to address before merging?

Collaborator

@romerojosh romerojosh left a comment

LGTM!

@tgaddair tgaddair merged commit 386be42 into horovod:master Mar 29, 2021
@github-actions

Unit Test Results

     738 files  ±0       738 suites  ±0   4h 58m 53s ⏱️ ±0s
     564 tests ±0       534 ✔️ ±0       29 💤 ±0  1 ❌ ±0 
15 016 runs  ±0  11 441 ✔️ ±0  3 574 💤 ±0  1 ❌ ±0 

For more details on these failures, see this check.

Results for commit 386be42. ± Comparison against base commit 386be42.

amogkam pushed a commit to amogkam/horovod that referenced this pull request Apr 6, 2021
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@whatdhack

@maxhgerlach, thanks for a great addition. A question: does the output get added to the Horovod Timeline JSON file, or is a new separate file created?

@maxhgerlach
Collaborator Author

Hi @whatdhack,

thanks for your interest. You will need to run your Horovod script through nsys to gather profiling data, which will be collected in a separate file unrelated to the Horovod timeline json. This data can then be viewed and processed in the Nsight Systems UI.

You might want to check out the documentation for Nvidia Nsight to see the various ways to set up the profiler. I often use nsys launch from a shell script, limiting it to rank 0 only, and then enable profiling for a few training steps via nsys start.
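
The rank-0-only approach can be sketched as a small per-process wrapper. This is an illustration under stated assumptions, not the script referred to above: `build_command` is a hypothetical helper, the rank is assumed to be exported as `OMPI_COMM_WORLD_RANK` (Open MPI) or `HOROVOD_RANK`, and the `--trace=cuda,nvtx` selection is just one plausible choice of nsys options.

```python
import os
import shlex


def build_command(train_cmd, env=None, profiled_rank=0):
    """Return the argv this process should run.

    Only the process whose rank equals `profiled_rank` is wrapped in
    `nsys launch`; all other ranks run the training command unchanged.
    """
    env = os.environ if env is None else env
    # Assumption: the launcher exports the rank in one of these variables.
    rank = int(env.get("OMPI_COMM_WORLD_RANK", env.get("HOROVOD_RANK", "0")))
    if rank != profiled_rank:
        return list(train_cmd)
    # `nsys launch` starts the application under the profiler; collection
    # is switched on later with `nsys start` and off with `nsys stop`.
    return ["nsys", "launch", "--trace=cuda,nvtx"] + list(train_cmd)


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run.
    print(shlex.join(build_command(["python", "horovod_script.py"])))
```

Each rank would exec the returned command. Once training has warmed up, running `nsys start` from another shell begins collection and `nsys stop` ends it, which yields a report covering only those training steps.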

@whatdhack

whatdhack commented May 27, 2021

@maxhgerlach, thanks for the quick reply and for clarifying that NVTX data is collected per rank and is limited to a small number of ranks. Of course, with thousands of ranks it is not viable to look at the NVTX output for each rank. Does Nsight do any post-processing to create an aggregate profile or report at this point? Also, is the NVTX output able to capture information on the negotiation MPI calls?

@maxhgerlach
Collaborator Author

Hey @whatdhack,

Nvidia has quite a thorough User Guide that should help you get started.

If you plug in the profiler along the lines of nsys {launch|profile} [nsys options] mpirun [mpi options] python horovod_script.py, you will get an aggregate report encompassing multiple ranks (I have only tried this for small local jobs myself). It will also automatically insert NVTX ranges for collective MPI calls, so you should see those negotiations; there is a section in the guide about that.
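
For the whole-job variant, the command template above can be filled in as follows. This is only a sketch: `aggregate_profile_command` is a hypothetical helper, and the concrete options (`--trace=cuda,nvtx,mpi`, `--output`, `-np`) are assumed example values standing in for the `[nsys options]` and `[mpi options]` placeholders.

```python
def aggregate_profile_command(script, num_procs, output="horovod_profile"):
    """Build an `nsys profile` command line that wraps the whole mpirun job."""
    # Assumed example options; `mpi` in --trace asks nsys to trace MPI calls.
    nsys = ["nsys", "profile", "--trace=cuda,nvtx,mpi", f"--output={output}"]
    # Assumed example MPI options; a real job will likely need more.
    mpi = ["mpirun", "-np", str(num_procs)]
    return nsys + mpi + ["python", script]


if __name__ == "__main__":
    print(" ".join(aggregate_profile_command("horovod_script.py", 2)))
```

Because nsys wraps mpirun here rather than a single rank, the resulting report covers all local processes, in contrast to a per-rank wrapper that profiles rank 0 only.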

The functionality from this PR helps you correlate what Horovod is doing with all the rich information the nsys profiler is giving you on its own.
