Fix for distributed tests on pytorch>=1.12 #2141

Merged
merged 20 commits into from Aug 1, 2022

@mrwyattii mrwyattii commented Jul 26, 2022

With the recent release of torch 1.12, we saw all unit tests using the @distributed_test decorator break (see this issue).

The problem stems from changes in torch.multiprocessing and torch.distributed that prevent using fork to start new processes in distributed tests. We must now use spawn or forkserver to spin up the distributed processes. However, this change alone is not sufficient: spawn and forkserver pickle the target function (e.g., dist_init or test_*), and the decorator makes that pickling fail. The following errors are unavoidable with the distributed test decorator:
AttributeError: Can't pickle local object
_pickle.PicklingError: Can't pickle <function test_example at 0x7f6c4e5640d0>: it's not the same object as unit.test_example.test_example
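
The two pickling failures can be reproduced without torch. The following is a minimal sketch, not the DeepSpeed decorator itself: passthrough, example_target, make_local_target, and pickle_error are hypothetical names chosen for illustration. It assumes only that the decorator rebinds the test's module-level name to a wrapper and defines its process target as a nested function, which is what the error messages suggest.

```python
import functools
import pickle


def passthrough(func):
    """Hypothetical decorator: rebinds the decorated name to a wrapper."""
    @functools.wraps(func)  # also stores the original as wrapper.__wrapped__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@passthrough
def example_target():
    return "ok"


def make_local_target():
    # A function defined inside another function has "<locals>" in its
    # qualified name, so pickle cannot look it up by module and name.
    def _go():
        return "ok"
    return _go


def pickle_error(obj):
    """Return the name of the exception raised by pickling obj, or None."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        return type(exc).__name__
    return None


# Nested function: the "Can't pickle local object" case.
local_result = pickle_error(make_local_target())

# Original function hidden behind the wrapper: pickle resolves the name
# "example_target" to the wrapper, which is a different object.
wrapped_result = pickle_error(example_target.__wrapped__)
```

Both results are non-None because pickle serializes functions by reference (module plus qualified name) and neither object can be found again under its own name. The exception type for the nested function differs between Python versions, which is why the helper catches both.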

The solution implemented here loosely follows how pytorch unit tests are done and adds a DistributedTest class. This will require refactoring all distributed tests like so:

# OLD TEST:
@pytest.mark.parametrize('foo,bar', [(1, 2), (3, 4)])
def test_example(foo, bar):
    @distributed_test(world_size=[1, 4])
    def _go():
        assert foo < bar
    _go()

# NEW TEST:
@pytest.mark.parametrize('foo,bar', [(1, 2), (3, 4)])
class TestExample(DistributedTest):
    world_size = [1, 4]

    def test(self, foo, bar):
        assert foo < bar

@jeffra, @tjruwase, and I will be refactoring existing tests over time to use this new test class. In doing so, we are also reorganizing the unit test files to match the directory structure of the DeepSpeed source code.

@mrwyattii mrwyattii enabled auto-merge (squash) August 1, 2022 17:27