PyTorch Pods tests are failing due to Instance Groups not stabilizing in time. #120

Closed
taylanbil opened this issue Apr 29, 2020 · 4 comments

@taylanbil
Contributor

This is an uninteresting failure mode, and these failures are mostly noise.

Happened twice this week. Example.

Instead of counting stable instances in the IG and sleeping, I think it would be better to use `gcloud compute instance-groups managed wait-until-stable $IG --zone $ZONE`. WDYT?
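
A minimal sketch of that approach, for illustration only. The helper name `wait_for_stable_ig` and the 600-second default timeout are assumptions, not taken from the project's test scripts. It wraps the `wait-until-stable` command so the test can fail fast with a clear message instead of polling and sleeping.

```shell
# Hypothetical helper: block until the managed instance group is stable.
# Arguments: $1 = instance group name, $2 = zone,
#            $3 = timeout in seconds (optional; 600 is an assumed default).
wait_for_stable_ig() {
  gcloud compute instance-groups managed wait-until-stable "$1" \
    --zone "$2" --timeout "${3:-600}" || {
    # gcloud exited non-zero: the group did not report stable in time.
    echo "Instance group $1 did not stabilize in time" >&2
    return 1
  }
}
```

It could then be called as `wait_for_stable_ig "$IG" "$ZONE" || exit 1` before the tests start, so a group that never stabilizes shows up as a setup failure instead of a test failure.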

@will-cromar
Collaborator

I think these issues were coming up because we lacked CPU quota in our project, which @zcain117 has since fixed. I agree that the command you recommend is a better solution to wait for the instance group to come up, though.

@taylanbil
Contributor Author

This happened yesterday and the day before. When was the quota fix?

@will-cromar
Collaborator

The quota change went through yesterday afternoon.

@taylanbil
Contributor Author

Awesome, I'll close this issue and leave the implementation change up to you.
