
support auto-scale for inference job #1028

Merged
merged 2 commits into from
Apr 17, 2020

Conversation

leigaoms (Contributor)

  1. Provide a RestAPI interface;
  2. The JobManager deployment can now auto-scale; the framework launcher does not support it yet.

@coveralls

coveralls commented Apr 15, 2020

Pull Request Test Coverage Report for Build 2997

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 92.893%

Totals:
  • Change from base Build 2995: 0.0%
  • Covered Lines: 668
  • Relevant Lines: 709

💛 - Coveralls

src/RestAPI/dlwsrestapi.py (outdated, resolved)
src/ClusterManager/job_launcher.py (resolved)
@xudifsd (Member)

xudifsd commented Apr 15, 2020

I hope you can add a test case here to test this functionality (scale up/down). You may want to add a branch like this to skip the test case if it's the controller.

It can be run with the parameters ./main.py --rest http://dltshub-int.redmond.corp.microsoft.com:5001 --vc platform --email xxx@microsoft.com --uid 1234 --config ~/dev/Deployment/Azure-EastUS-P40-Dev1/. Please replace the email, uid, and config parameters when you run it: set email and uid to your values from the identity table, and config to the backed-up cluster config.

I want to make sure no existing test cases are broken whenever a new feature is added.

@Anbang-Hu (Collaborator) left a comment


Please do add a test case as @xudifsd commented.

src/RestAPI/dlwsrestapi.py (outdated, resolved)
src/utils/JobRestAPIUtils.py (outdated, resolved)
resp = utils.scale_job(args.rest, args.email, job_id, desired_replicas)
assert "Success" == resp

time.sleep(30)
Member

I would prefer a block_until_replica_count_is helper with a timeout parameter in the test case. As written, this always waits at least 30s.
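A minimal sketch of such a helper, under assumptions: the get_replica_count callable is hypothetical (the test would supply it, e.g. by querying the cluster), and only the name block_until_replica_count_is comes from the comment above:

```python
import time

def block_until_replica_count_is(get_replica_count, desired, timeout=300, interval=5):
    """Poll until the replica count reaches `desired`, failing after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_replica_count() == desired:
            return  # reached the desired count; stop waiting early
        time.sleep(interval)
    raise TimeoutError(
        "replica count did not reach %d within %ds" % (desired, timeout))
```

Unlike a fixed time.sleep(30), this returns as soon as the count matches and fails loudly on timeout.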

Contributor Author

The problem is that if the cluster does not have enough resources, the number of pods may never reach desired_replicas. We can only make sure deployment.spec.replicas is set correctly; the actual replica count is also affected by cluster status.
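As a sketch, the relaxed assertion could compare the requested count against the Kubernetes deployment's spec.replicas (set synchronously when a scale request is accepted) and only sanity-check status.ready_replicas (which may lag, or never catch up, on a resource-starved cluster). The helper name and its plain-integer interface are illustrative, not from this PR:

```python
def scale_request_applied(spec_replicas, ready_replicas, desired_replicas):
    """Check a scale request took effect without requiring all pods to run.

    spec_replicas:  deployment.spec.replicas from the Kubernetes API.
    ready_replicas: deployment.status.ready_replicas; may be None, or lower
                    than desired if the cluster lacks resources.
    """
    # Assert only what is guaranteed: the desired count is recorded in the
    # spec, and the ready count never exceeds it.
    return (spec_replicas == desired_replicas
            and (ready_replicas or 0) <= desired_replicas)
```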

assert state == "running"

deployment_name = job_id + "-deployment"
deployment = utils.kube_get_deployment(args.config, "default", deployment_name)
Member

I think we can relax the assumption that it must be a deployment here. We can count pods with the label jobId=xxx to check whether it scaled up or down, since the controller doesn't create a deployment.
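A sketch of the label-based check. The counting logic works over plain label dicts so it is independent of the pod source; in the real test the list would come from the kubernetes Python client, e.g. one pod.metadata.labels per item of CoreV1Api().list_namespaced_pod(namespace, label_selector="jobId=" + job_id).items. Only the jobId label key comes from the comment above; the rest is illustrative:

```python
def count_job_pods(pod_labels_list, job_id):
    """Count pods whose labels carry jobId=<job_id>.

    pod_labels_list: one labels dict per pod (pod.metadata.labels in the
    kubernetes client; may be None). Works for controller-managed pods
    too, which have no owning deployment.
    """
    return sum(1 for labels in pod_labels_list
               if (labels or {}).get("jobId") == job_id)
```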

@xudifsd (Member)

xudifsd commented Apr 16, 2020

Anyway, it's good to have a test case. I will modify this test case when I have time. You can merge this for now.

@leigaoms leigaoms closed this Apr 17, 2020
@leigaoms leigaoms reopened this Apr 17, 2020
@leigaoms leigaoms merged commit d7a5a67 into dltsdev Apr 17, 2020
5 participants