Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hvd.barrier() tensor queue management and torch test failures from op name mismatches #3300

Merged
merged 4 commits into from Dec 9, 2021

Commits on Dec 3, 2021

  1. Reformulate torch duplicate name and join tests to ensure that each r…

    …ank is guaranteed to enqueue the same number of Horovod ops.
    
    This is meant to prevent later tests from stalling due to op name mismatches
    when these names are automatically generated from a handle integer. These
    handles can be different on different ranks if they did not enqueue the same
    number of Horovod ops.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Dec 3, 2021
    Copy the full SHA
    aa41c3a View commit details
    Browse the repository at this point in the history
  2. Fix hvd.barrier() tensor queue management

    If a barrier was enqueued together with an allreduce, the barrier request
    would erroneously be pushed again and again into the tensor queue each
    cycle even after the barrier was completed. Additionally, this could
    cause a mismatch in GetTensorEntriesFromResponse where a tensor from
    the response would not be present in the tensor table. This caused
    segmentation faults in the test_horovod_join_allgather unit test
    although the situation was established earlier by a hvd.barrier() in the
    test_horovod_allreduce_duplicate_name_error unit test.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Dec 3, 2021
    Copy the full SHA
    12ac10c View commit details
    Browse the repository at this point in the history

Commits on Dec 5, 2021

  1. Add synchronization barriers to the ends of the test_*_duplicate_name…

    …_error
    
    Without this, deadlocks in the subsequent test were possible: One process would
    already have enqueued a collective op like hvd.broadcast(), while the other
    would still block in hvd.init() [specifically in _get_process_set_ids_and_ranks()].
    
    I could not use hvd.barrier() for this second barrier because that would somehow
    cause a segmentation fault. Went for an allreduce instead.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Dec 5, 2021
    Copy the full SHA
    a1f4fa5 View commit details
    Browse the repository at this point in the history

Commits on Dec 6, 2021

  1. Replace all synchronization barriers in test_*_duplicate_name_error b…

    …y allreduces
    
    Apparently there is still UB with hvd.barrier() that may trigger
    failures under certain conditions. Let's avoid that for now.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Dec 6, 2021
    Copy the full SHA
    6bb255d View commit details
    Browse the repository at this point in the history