support auto-scale for inference job #1028
leigaoms
commented
Apr 15, 2020
- Provide a REST API interface;
- JobManager deployments can auto-scale; the framework launcher does not support this yet
Pull Request Test Coverage Report for Build 2997
💛 - Coveralls
I hope you can add a test case here for this functionality (scale up/down). You may want to add a branch like this to skip the test case if it's It can be run with parameters I want to make sure that no test case is broken whenever a new feature is added.
Please do add a test case as @xudifsd commented.
resp = utils.scale_job(args.rest, args.email, job_id, desired_replicas)
assert "Success" == resp

time.sleep(30)
I would prefer a block_until_replica_count_is helper with a timeout parameter in the test case. As written, the test always waits at least 30s.
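A minimal sketch of what such a helper could look like. The signature, the `get_replica_count` callback, and the default timeout and interval are all assumptions for illustration; nothing here is taken from the repository's `utils` module.

```python
import time


def block_until_replica_count_is(get_replica_count, desired, timeout=300.0, interval=5.0):
    """Poll until get_replica_count() returns `desired` or `timeout` seconds pass.

    Hypothetical helper: the name comes from the review comment, but the
    signature is an assumption. Returns True if the desired count was
    observed, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Check first, so a job that is already at the desired count returns at once.
        if get_replica_count() == desired:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

In the test, the fixed `time.sleep(30)` could then become an assertion on the return value, so a fast scale-up finishes early and a stuck one fails with a clear timeout instead of passing or failing by chance.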
The problem is that if the cluster does not have enough resources, the number of pods may never reach desired_replicas. We can only make sure that the deployment's replica count is set correctly; the actual number of running replicas also depends on the cluster's state.
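To make the distinction concrete, here is a small sketch that reads both numbers from a Deployment in its JSON (dict) form. `spec.replicas` and `status.readyReplicas` are standard Kubernetes Deployment fields; whether `utils.kube_get_deployment` returns a dict of this shape is an assumption, and `replica_status` is a hypothetical name.

```python
def replica_status(deployment):
    """Return (desired, ready) for a Deployment given as a JSON-style dict.

    Hypothetical helper for illustration only.
    """
    # spec.replicas is what the scale request sets.
    desired = deployment["spec"]["replicas"]
    # status.readyReplicas is omitted from the JSON when no replica is ready.
    ready = deployment.get("status", {}).get("readyReplicas") or 0
    return desired, ready
```

A test could assert strictly on `desired` and treat `ready` as best effort, since only the former is guaranteed by a successful scale call.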
assert state == "running"

deployment_name = job_id + "-deployment"
deployment = utils.kube_get_deployment(args.config, "default", deployment_name)
I think we can relax the assumption that this must be a deployment. Since the controller doesn't create a deployment, we can instead count the pods with the label jobId=xxx to check whether the job scaled up or down.
Anyway, it's good to have a test case. I will modify this test case when I have time. You can merge this for now.
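A rough sketch of the pod-counting idea, written against pods in their JSON (dict) form so it does not depend on a deployment existing. The `jobId` label key comes from the comment above; `count_job_pods` and the choice to count only pods in the `Running` phase are assumptions.

```python
def count_job_pods(pods, job_id, phase="Running"):
    """Count pods labelled jobId=<job_id> that are in the given phase.

    Hypothetical helper: `pods` is a list of Pod objects as JSON-style dicts.
    """
    count = 0
    for pod in pods:
        # metadata.labels may be missing or null on pods without labels.
        labels = pod.get("metadata", {}).get("labels") or {}
        if labels.get("jobId") == job_id and pod.get("status", {}).get("phase") == phase:
            count += 1
    return count
```

With the official Kubernetes Python client, the label filter could instead be pushed to the server via `CoreV1Api.list_namespaced_pod(namespace, label_selector="jobId=...")`, leaving only the phase check on the client side.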