Fix hvd.barrier() tensor queue management and torch test failures from op name mismatches #3300

…ank is guaranteed to enqueue the same number of Horovod ops. This prevents later tests from stalling on op name mismatches when the names are generated automatically from an integer handle. Handles can differ between ranks that have not enqueued the same number of Horovod ops.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
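The mismatch described above can be illustrated with a toy model in plain Python. This is only a sketch: `ToyRank` and the `op.noname.<handle>` naming format are assumptions made for illustration, not Horovod code.

```python
class ToyRank:
    """Toy stand-in for one rank's handle bookkeeping (not Horovod code)."""

    def __init__(self):
        self._next_handle = 0

    def enqueue(self, op, name=None):
        """Return the op name this rank would announce for its next op."""
        handle = self._next_handle
        self._next_handle += 1
        # Assumed naming scheme: unnamed ops get a name built from the handle.
        return name if name is not None else f"{op}.noname.{handle}"


rank0, rank1 = ToyRank(), ToyRank()

# Rank 0 enqueued one more op than rank 1 in an earlier test.
rank0.enqueue("allreduce")

# Both ranks now issue what should be the same collective.
name0 = rank0.enqueue("allreduce")
name1 = rank1.enqueue("allreduce")
names_match = name0 == name1  # the generated names differ between ranks

# An explicit name does not depend on the handle counter.
explicit_match = (
    rank0.enqueue("allreduce", "sync") == rank1.enqueue("allreduce", "sync")
)
```

In this model the two ranks announce different names for the same logical op, which is the kind of disagreement that would leave a coordinator waiting indefinitely.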

If a barrier was enqueued together with an allreduce, the barrier request was erroneously pushed back into the tensor queue on every cycle, even after the barrier had completed. This could also cause a mismatch in GetTensorEntriesFromResponse, where a tensor named in the response was not present in the tensor table. The result was segmentation faults in the test_horovod_join_allgather unit test, although the faulty state had been set up earlier by an hvd.barrier() call in the test_horovod_allreduce_duplicate_name_error unit test.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
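A rough way to picture this failure mode is the following simulation. It is a simplified sketch, not the actual C++ logic: `run_cycles`, the `requeue_completed_barrier` flag, and the use of a set as the "tensor table" are all hypothetical.

```python
from collections import deque


def run_cycles(requests, cycles, requeue_completed_barrier):
    """Process queued requests for a number of background cycles.

    Returns the processing order and how many requests were processed
    without a matching entry in the (toy) tensor table.
    """
    queue = deque(requests)
    table = set(requests)  # stand-in for the tensor table
    processed = []
    missing = 0
    for _ in range(cycles):
        pending = list(queue)
        queue.clear()
        for name in pending:
            if name not in table:
                missing += 1  # response refers to an entry that is gone
            processed.append(name)
            table.discard(name)  # the op completes and leaves the table
            if name == "barrier" and requeue_completed_barrier:
                queue.append(name)  # modeled bug: request comes back
    return processed, missing


buggy_order, buggy_missing = run_cycles(["barrier", "allreduce.0"], 3, True)
fixed_order, fixed_missing = run_cycles(["barrier", "allreduce.0"], 3, False)
```

With the flag set, the barrier is processed once per cycle and every repeat finds no table entry; with the flag cleared, each request is processed exactly once.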

…_error

Without this, deadlocks in the subsequent test were possible: one process would already have enqueued a collective op such as hvd.broadcast() while the other was still blocked in hvd.init() (specifically in _get_process_set_ids_and_ranks()). I could not use hvd.barrier() for this second barrier because it caused a segmentation fault for reasons I have not identified, so I used an allreduce instead.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
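Using an allreduce as a synchronization point might look roughly like the sketch below. The helper `allreduce_as_barrier`, its parameters, and the explicit op name are assumptions for illustration; the commit does not show how the allreduce was actually written. The module and tensor factory are passed in so the sketch does not depend on a Horovod installation.

```python
def allreduce_as_barrier(hvd, make_tensor, name="sync_point"):
    """Block until every rank has reached this call.

    hvd         -- an initialized Horovod framework module, e.g. horovod.torch
    make_tensor -- callable returning a small dummy tensor, e.g.
                   lambda: torch.zeros(1)
    name        -- explicit op name (an assumption of this sketch), so the
                   name does not depend on how many ops a rank enqueued
    """
    # A synchronous allreduce cannot finish on any rank until all ranks
    # have submitted it, so it acts as a barrier.
    return hvd.allreduce(make_tensor(), name=name)
```

With Horovod and PyTorch installed, a call might look like `allreduce_as_barrier(hvd, lambda: torch.zeros(1))` after `hvd.init()`.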

…y allreduces

Apparently hvd.barrier() still has undefined behavior that can trigger failures under certain conditions. Let's avoid it for now.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Commits on Dec 3, 2021

Commits on Dec 5, 2021

Commits on Dec 6, 2021