
support auto-scale for inference job #1028

Merged
merged 2 commits into from
Apr 17, 2020

Conversation

leigaoms (Contributor)

  1. Provide a RestAPI interface;
  2. The JobManager deployment can now auto-scale; the framework launcher does not support it yet.

@coveralls

coveralls commented Apr 15, 2020

Pull Request Test Coverage Report for Build 2997

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 92.893%

Totals:
  • Change from base Build 2995: 0.0%
  • Covered Lines: 668
  • Relevant Lines: 709

💛 - Coveralls

src/RestAPI/dlwsrestapi.py (outdated, resolved)
src/ClusterManager/job_launcher.py (resolved)
@xudifsd (Member)

xudifsd commented Apr 15, 2020

I hope you can add a test case here to test this functionality (scale up/down). You may want to add a branch like this to skip the test case if it's the controller.

It can be run with the parameters ./main.py --rest http://dltshub-int.redmond.corp.microsoft.com:5001 --vc platform --email xxx@microsoft.com --uid 1234 --config ~/dev/Deployment/Azure-EastUS-P40-Dev1/. Please replace the email, uid, and config parameters when you run it: set email and uid to your values from the identity table, and config to the backed-up cluster config.

I want to make sure no existing test cases are broken whenever a new feature is added.

@Anbang-Hu (Collaborator) left a comment


Please do add a test case as @xudifsd commented.

src/RestAPI/dlwsrestapi.py (outdated, resolved)
src/utils/JobRestAPIUtils.py (outdated, resolved)
resp = utils.scale_job(args.rest, args.email, job_id, desired_replicas)
assert "Success" == resp

time.sleep(30)
Member

I would prefer a block_until_replica_count_is helper with a timeout parameter in the test case. As written, this always waits at least 30s.
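A minimal sketch of such a helper, under assumptions: the get_replica_count callable is hypothetical (the test would supply it, e.g. by querying the cluster), and only the name block_until_replica_count_is comes from the comment above:

```python
import time

def block_until_replica_count_is(get_replica_count, desired, timeout=300, interval=5):
    """Poll until the replica count reaches `desired`, failing after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_replica_count() == desired:
            return  # reached the desired count; stop waiting early
        time.sleep(interval)
    raise TimeoutError(
        "replica count did not reach %d within %ds" % (desired, timeout))
```

Unlike a fixed time.sleep(30), this returns as soon as the count matches and fails loudly on timeout.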

Contributor Author

The problem is that if the cluster does not have enough resources, the number of pods may never reach desired_replicas. We can only make sure deployment.spec.replicas is set correctly; the actual replica count is also affected by cluster status.
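As a sketch, the relaxed assertion could compare the requested count against the Kubernetes deployment's spec.replicas (set synchronously when a scale request is accepted) and only sanity-check status.ready_replicas (which may lag, or never catch up, on a resource-starved cluster). The helper name and its plain-integer interface are illustrative, not from this PR:

```python
def scale_request_applied(spec_replicas, ready_replicas, desired_replicas):
    """Check a scale request took effect without requiring all pods to run.

    spec_replicas:  deployment.spec.replicas from the Kubernetes API.
    ready_replicas: deployment.status.ready_replicas; may be None, or lower
                    than desired if the cluster lacks resources.
    """
    # Assert only what is guaranteed: the desired count is recorded in the
    # spec, and the ready count never exceeds it.
    return (spec_replicas == desired_replicas
            and (ready_replicas or 0) <= desired_replicas)
```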

assert state == "running"

deployment_name = job_id + "-deployment"
deployment = utils.kube_get_deployment(args.config, "default", deployment_name)
Member

I think we can relax the assumption that it must be a deployment here. We can count pods with the label jobId=xxx to check whether it scaled up or down, since the controller doesn't create a deployment.
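A sketch of the label-based check. The counting logic works over plain label dicts so it is independent of the pod source; in the real test the list would come from the kubernetes Python client, e.g. one pod.metadata.labels per item of CoreV1Api().list_namespaced_pod(namespace, label_selector="jobId=" + job_id).items. Only the jobId label key comes from the comment above; the rest is illustrative:

```python
def count_job_pods(pod_labels_list, job_id):
    """Count pods whose labels carry jobId=<job_id>.

    pod_labels_list: one labels dict per pod (pod.metadata.labels in the
    kubernetes client; may be None). Works for controller-managed pods
    too, which have no owning deployment.
    """
    return sum(1 for labels in pod_labels_list
               if (labels or {}).get("jobId") == job_id)
```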

@xudifsd (Member)

xudifsd commented Apr 16, 2020

Anyway, it's good to have a test case. I will modify this test case when I have time. You can merge this for now.

@leigaoms leigaoms closed this Apr 17, 2020
@leigaoms leigaoms reopened this Apr 17, 2020
@leigaoms leigaoms merged commit d7a5a67 into dltsdev Apr 17, 2020
5 participants