Add NVTX tracing hooks for profiling with Nsight Systems #2723

maxhgerlach · 2021-03-18T14:12:57Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

This adds two types of NVTX hooks to make it easier to profile Horovod models with NVIDIA Nsight Systems:

Very low-overhead annotations containing just the type of communication op, timestamps for its beginning and end, as well as the tensor size in bytes. These ranges start when tensors are enqueued and end when the op has finished. When no profiler is attached, the performance impact should be absolutely negligible. I find it easier to work with these than to manually insert annotation ops into my model code.
More detailed annotations that are only active when a Horovod timeline is being written. These include the full name of each source tensor, which necessitates some string copies, so there is a slight overhead. Users may prefer to work with the Nsight application rather than the tracing UI in Chrome. It is also interesting to correlate the Horovod timeline data with actual GPU kernel utilization and other profiling data available there.
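For comparison, manually annotating model code (the approach the first kind of hook is meant to replace) might look roughly like the sketch below. It assumes NVIDIA's optional `nvtx` Python package and falls back to a no-op when that package is not installed. `annotated` and `train_step` are hypothetical names for illustration and are not part of Horovod.

```python
import contextlib

try:
    import nvtx  # optional NVTX bindings for Python; an assumption, not a Horovod dependency
except ImportError:
    nvtx = None


def annotated(message):
    """Return a context manager that opens an NVTX range if nvtx is available."""
    if nvtx is None:
        # Without the package the annotation silently becomes a no-op.
        return contextlib.nullcontext()
    return nvtx.annotate(message)


def train_step(values):
    # Hypothetical stand-in for a step that would contain a communication op.
    with annotated("allreduce (manual annotation)"):
        return sum(values) / len(values)
```

With the hooks from this PR, comparable ranges are emitted by Horovod itself, so wrappers like this should not be needed around every communication op.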

Building Horovod with HOROVOD_WITHOUT_NVTX=1 disables all of this.

If Horovod is built with NVTX support, users can still set the environment variable HOROVOD_DISABLE_NVTX_RANGES=1 to skip generating any ranges even when a profiler is attached.
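A minimal sketch of toggling this from a launcher script, assuming the variable only needs to be present in the environment of the process that runs Horovod. `horovod_env` is a hypothetical helper and is not part of Horovod.

```python
import os


def horovod_env(disable_nvtx_ranges, base_env=None):
    """Return a copy of the environment with NVTX range generation toggled."""
    env = dict(os.environ if base_env is None else base_env)
    if disable_nvtx_ranges:
        # Variable name taken from the PR description.
        env["HOROVOD_DISABLE_NVTX_RANGES"] = "1"
    else:
        env.pop("HOROVOD_DISABLE_NVTX_RANGES", None)
    return env
```

The result could then be passed to something like `subprocess.run([...], env=horovod_env(True))` when starting the training script.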

tgaddair · 2021-03-19T16:29:52Z

This is awesome, thanks for the contribution @maxhgerlach!

@romerojosh can you help review?

romerojosh

Hey @maxhgerlach, this is an excellent contribution! I often will add temporary NVTX ranges in my own development to get more insight from profiles generated from nsys, so having this functionality available to users is great. I will try and run a few workloads using this PR in the next couple of days to comment on the usability of the generated markers.

I think the implementation is solid in general, but I did leave a comment concerning the handling of ranges within the framework-specific code.

horovod/tensorflow/mpi_ops.cc

maxhgerlach · 2021-03-22T13:51:13Z

Thanks for reviewing this, @tgaddair and @romerojosh!

In my experience the detailed annotations are most helpful to understand the impact of communication ops inserted into the forward pass of a model. Making sense of dozens of (potentially fused) gradient allreduces running in parallel is trickier with the Nsight UI.

romerojosh

Thanks for updating the PR @maxhgerlach to this more framework general approach!

I left a comment concerning how the ranges are handled with the grouped allreduce.

Besides this comment, would you be willing to add an environment variable option to disable the NVTX range generation, for example HOROVOD_DISABLE_NVTX_RANGES? There are still many NVTX ranges created in the path outside the timeline (e.g. one for every gradient allreduced in a typical case), so it would be nice to have a mechanism to disable them in an NVTX-enabled build.

horovod/common/operations.cc

maxhgerlach · 2021-03-26T19:12:15Z

Hey @romerojosh, @tgaddair, I've added an environment variable HOROVOD_DISABLE_NVTX_RANGES to disable generating any NVTX ranges from Horovod, and I believe I have fixed the issue with grouped allreduces. Happy to address further comments!

There seems to be a general problem with the buildkite pipeline right now. It fails to build the image based on tfhead because of an error with Keras. Apparently tests aren't run in the other containers either.

tgaddair · 2021-03-28T21:29:15Z

Hey @maxhgerlach, tests should be fixed on HEAD now, thanks for rebasing. @romerojosh is there anything else you wanted to address before merging?

romerojosh

LGTM!

github-actions · 2021-03-29T20:14:13Z

Unit Test Results

    738 files ±0     738 suites ±0 4h 58m 53s ⏱️ ±0s
    564 tests ±0     534 ✔️ ±0     29 💤 ±0 1 ❌ ±0
15 016 runs ±0 11 441 ✔️ ±0 3 574 💤 ±0 1 ❌ ±0

For more details on these failures, see this check.

Results for commit 386be42. ± Comparison against base commit 386be42.

whatdhack · 2021-05-27T17:32:49Z

@maxhgerlach, thanks for a great addition. A question: does the output get added to the Horovod Timeline JSON file, or is a new separate file created?

maxhgerlach · 2021-05-27T19:26:46Z

Hi @whatdhack,

thanks for your interest. You will need to run your Horovod script through nsys to gather profiling data, which will be collected in a separate file unrelated to the Horovod timeline json. This data can then be viewed and processed in the Nsight Systems UI.

You might want to check out the NVIDIA Nsight documentation for the various ways to set up the profiler. I often use nsys launch from a shell script, limiting it to rank 0 only, and then enable profiling for a few training steps via nsys start.
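The rank-0-only wrapper described above could be sketched as follows. It assumes the job is started by Open MPI's mpirun, which exports OMPI_COMM_WORLD_RANK to each process. `command_for_rank` is a hypothetical helper, not something shipped with Horovod or nsys.

```python
import os


def command_for_rank(script_args, rank=None):
    """Build the per-rank command line, wrapping only rank 0 in `nsys launch`."""
    if rank is None:
        # Assumption: Open MPI sets this variable for every launched process.
        rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    cmd = ["python"] + list(script_args)
    if rank == 0:
        # Only rank 0 runs under the profiler; other ranks run unmodified.
        cmd = ["nsys", "launch"] + cmd
    return cmd
```

A wrapper script would then replace itself with the returned command, for example via `os.execvp(cmd[0], cmd)`, and collection would be switched on later with nsys start from another shell.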

whatdhack · 2021-05-27T19:45:35Z

@maxhgerlach, thanks for the quick reply and for clarifying that NVTX data is collected per rank and is limited to a small number of ranks. Of course, with thousands of ranks it is not viable to look at the NVTX output for each rank. Does Nsight do any post-processing to create an aggregate profile or report at this point? Also, is the NVTX output able to capture information on the negotiation MPI calls?

maxhgerlach · 2021-05-29T11:36:16Z

Hey @whatdhack,

Nvidia has quite a thorough User Guide that should help you get started.

If you plug in the profiler along the lines of nsys {launch|profile} [nsys options] mpirun [mpi options] python horovod_script.py, you will get an aggregate report encompassing multiple ranks (I have only tried this for small local jobs myself). It will also automatically insert NVTX ranges for collective MPI calls, so you should see those negotiations. There is a section in the guide about that.

The functionality from this PR helps you correlate what Horovod is doing with all the rich information the nsys profiler is giving you on its own.
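As a rough illustration of the command shape mentioned above, the following sketch assembles an argument list in which nsys wraps the whole mpirun invocation. `nsys_mpirun_command` is a hypothetical helper. The nsys and MPI options are left for the caller to supply from the respective documentation, and `-np` is the only mpirun option assumed here.

```python
def nsys_mpirun_command(num_ranks, script, nsys_mode="profile",
                        nsys_options=(), mpi_options=()):
    """Build `nsys {launch|profile} [nsys options] mpirun [mpi options] python script`."""
    if nsys_mode not in ("launch", "profile"):
        raise ValueError("nsys_mode must be 'launch' or 'profile'")
    return [
        "nsys", nsys_mode, *nsys_options,
        "mpirun", "-np", str(num_ranks), *mpi_options,
        "python", script,
    ]
```

The list can be handed to `subprocess.run` as is. Because nsys sits in front of mpirun, a single report covers all ranks started by that mpirun, which matches the aggregate report described in the comment.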

tgaddair requested a review from romerojosh March 19, 2021 16:29

tgaddair approved these changes Mar 19, 2021

romerojosh reviewed Mar 20, 2021

horovod/tensorflow/mpi_ops.cc (outdated)

maxhgerlach requested a review from romerojosh March 23, 2021 17:54

romerojosh reviewed Mar 25, 2021

horovod/common/operations.cc (outdated)

tgaddair mentioned this pull request Mar 26, 2021

Compute, Communication, and Overlap Time #2737

Closed

maxhgerlach added 15 commits March 28, 2021 17:39

Add NVTX in cmake

20bbad2

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Some cleanup suggested by clang tidy

f15f97c

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

fix clearing tensor state in Timeline::End

bbf626f

(assertion in Timeline::NegotiateStart would fail otherwise) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Integrate NVTX tracing into Horovod timeline

e3037fa

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Fix unused variable warning

1f50e9d

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Add basic NVTX tracing to tensorflow/mpi_ops.cc

ae15365

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Clean up Enqueue* function signatures (clang-tidy suggestions)

a0c7790

(remove ineffective const qualifiers, pass strings by reference if possible) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Minor cleanup in Status, TensorShape, OpContext

ac7ebdd

(remove some redundant qualifiers, avoid some copies) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Replace TensorFlow-specific NVTX annotations by framework-agnostic an…

a2921d4

…notations Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Replace TensorFlow-specific NVTX annotations by framework-agnostic an…

51cd685

…notations II (add missing files) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Introduce shared_ptr ref counting to fix ending HorovodGroupedAllredu…

72c13d0

…ce ranges. Also avoid some shared_ptr copies in Enqueue*(). Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Put NvtxOpRange shared_ptr into an opaque wrapper

df53e71

(helps the compiler optimize it away when building without NVTX) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Introduce environment variable HOROVOD_DISABLE_NVTX_RANGES

c9a99c8

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Update changelog

bd5fb42

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Revert std::move calls in Enqueue*

691db99

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

maxhgerlach force-pushed the nvtx-tracing branch from 325afd4 to 691db99 Compare March 28, 2021 15:40

romerojosh approved these changes Mar 29, 2021

tgaddair merged commit 386be42 into horovod:master Mar 29, 2021

amogkam pushed a commit to amogkam/horovod that referenced this pull request Apr 6, 2021

Add NVTX tracing hooks for profiling with Nsight Systems (horovod#2723)

766ff88

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

maxhgerlach mentioned this pull request Jul 2, 2021

Different Time Recorded for Allreduce Operation #3012

Closed

maxhgerlach mentioned this pull request Aug 12, 2021

CPU-based horovod/tensorflow is slower when using intel-tensorflow-avx512 #3098

Open
