test_ray.py::test_gpu_ids_num_workers sometimes fails on Buildkite #3357

Closed
maxhgerlach opened this issue Jan 11, 2022 · 2 comments

@maxhgerlach
Collaborator

CUDA_VISIBLE_DEVICES seems to contain too many entries: with num_workers=4 the test expects 4 device IDs, but the failing run reports 8.

Example from PR #3261: https://buildkite.com/horovod/horovod/builds/7041#9a807189-938e-491f-9f83-c6bc31420a67

        hjob = RayExecutor(setting, num_workers=4, use_gpu=True)
        hjob.start()
        all_envs = hjob.execute(lambda _: os.environ.copy())
        all_cudas = {ev["CUDA_VISIBLE_DEVICES"] for ev in all_envs}
        assert len(all_cudas) == 1, all_cudas
>       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4
E       assert 8 == 4
E         +8
E         -4
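
To reproduce the check without a Ray cluster, the two assertions above can be run against plain environment dicts. This is only a minimal sketch: `check_cuda_visible_devices` is a hypothetical helper, not part of Horovod or Ray, and the device ID values in the sample data are made up (the Buildkite log only shows the count, 8 instead of 4).

```python
def check_cuda_visible_devices(all_envs, num_workers):
    """Return a list of problems; an empty list means the envs match what the test expects.

    Hypothetical helper mirroring the two assertions in test_gpu_ids_num_workers:
    all workers share one CUDA_VISIBLE_DEVICES value, and that value lists
    exactly num_workers device IDs.
    """
    problems = []

    # Every worker should report the same CUDA_VISIBLE_DEVICES string.
    all_cudas = {env.get("CUDA_VISIBLE_DEVICES", "") for env in all_envs}
    if len(all_cudas) != 1:
        problems.append(f"workers disagree: {sorted(all_cudas)}")

    # Each worker should see exactly num_workers device IDs.
    for rank, env in enumerate(all_envs):
        ids = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]
        if len(ids) != num_workers:
            problems.append(
                f"worker {rank}: expected {num_workers} IDs, got {len(ids)}"
            )
    return problems


# Made-up sample data: the real device IDs from the failing build are not known.
passing_envs = [{"CUDA_VISIBLE_DEVICES": "0,1,2,3"} for _ in range(4)]
failing_envs = [{"CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"} for _ in range(4)]

print(check_cuda_visible_devices(passing_envs, num_workers=4))
print(check_cuda_visible_devices(failing_envs, num_workers=4))
```

Returning a list of problems instead of asserting makes it easier to see on a flaky run whether all workers are affected or only some of them.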
@ashahab ashahab self-assigned this Jan 11, 2022
@ashahab
Collaborator

ashahab commented Jan 11, 2022

@maxhgerlach thanks for the report, I'll take a look tonight.

@maxhgerlach
Collaborator Author

Closing in favor of #3435


2 participants