
Scaling group and scaling drivers #73

Closed
wants to merge 53 commits into from

Conversation

achimnol
Member

@achimnol achimnol commented Apr 12, 2018

Let's implement policy-based scalers and scaling groups.

Concept:

Policy options (a configuration sketch follows this list):

  • image auto-update
    • on/off
    • update check interval
  • agent auto-update
    • on/off
    • idle wait time after last kernel termination on agent
  • container termination policy
    • auto-terminate after specified idle time / persistent
  • external agents
    • on/off
    • public key (external agents should provide an auth message encrypted using their own private keys)
    • limit (max number of allowed external agents)
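To make the policy options above concrete, here is a minimal sketch of how they could be grouped into a per-scaling-group policy object. All class names, field names, and default values are hypothetical and do not correspond to actual Backend.AI configuration keys.

```python
# A hedged sketch of the policy options listed above; names are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageAutoUpdatePolicy:
    enabled: bool = False
    check_interval_sec: int = 3600       # how often to check for new images


@dataclass
class AgentAutoUpdatePolicy:
    enabled: bool = False
    idle_wait_sec: int = 600             # wait after the last kernel on the agent terminates


@dataclass
class ContainerTerminationPolicy:
    persistent: bool = False             # if False, auto-terminate after idle_timeout_sec
    idle_timeout_sec: int = 1800


@dataclass
class ExternalAgentPolicy:
    enabled: bool = False
    public_key: Optional[str] = None     # used to verify auth messages from external agents
    max_agents: int = 0                  # max number of allowed external agents


@dataclass
class ScalingGroupPolicy:
    image_auto_update: ImageAutoUpdatePolicy
    agent_auto_update: AgentAutoUpdatePolicy
    container_termination: ContainerTerminationPolicy
    external_agents: ExternalAgentPolicy
```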

Batch job functions (an interface sketch follows this list):

  • Get the job list and their status
  • Submit a job (input + output + program + resource spec)
    • JobSpecAdaptor
  • Get the cluster node list
  • Get/set the scheduler configurations
  • (Optional) Cancel / interrupt a job
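
A hedged sketch of how the batch job functions above could surface as a manager-side interface. The method names, signatures, and the job-spec mapping shape are assumptions for illustration; only `JobSpecAdaptor` is named in the list above.

```python
# Hypothetical interface for the batch-job functions; not the PR's actual API.
from abc import ABC, abstractmethod
from typing import Any, Mapping, Sequence


class AbstractBatchJobAPI(ABC):

    @abstractmethod
    async def list_jobs(self) -> Sequence[Mapping[str, Any]]:
        """Return the job list with per-job status."""

    @abstractmethod
    async def submit_job(self, job_spec: Mapping[str, Any]) -> str:
        """Submit a job (input + output + program + resource spec) and return its ID.

        A JobSpecAdaptor would translate a user-facing spec into this mapping.
        """

    @abstractmethod
    async def list_nodes(self) -> Sequence[Mapping[str, Any]]:
        """Return the cluster node list."""

    @abstractmethod
    async def get_scheduler_config(self) -> Mapping[str, Any]:
        """Return the current scheduler configuration."""

    @abstractmethod
    async def set_scheduler_config(self, config: Mapping[str, Any]) -> None:
        """Update the scheduler configuration."""

    async def cancel_job(self, job_id: str) -> None:
        """(Optional) Cancel or interrupt a running job."""
        raise NotImplementedError
```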

Related:

  • lablup/backend.ai-agent#5 — container termination policy (idle / persistent / etc.)
  • lablup/backend.ai-agent#39 — on/off option for self-autoupgrade
  • https://github.com/lablup/backend.ai-agent/projects/2 — on/off option and auth config to accept external agents

@codecov-io

codecov-io commented Apr 12, 2018

Codecov Report

Merging #73 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #73   +/-   ##
=======================================
  Coverage   64.29%   64.29%           
=======================================
  Files          24       24           
  Lines        2644     2644           
=======================================
  Hits         1700     1700           
  Misses        944      944

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b6647c...9119393. Read the comment docs.

@achimnol achimnol mentioned this pull request Oct 5, 2018
@achimnol achimnol modified the milestones: 1.4, 18.12 Nov 2, 2018
achimnol and others added 5 commits November 5, 2018 13:15
- All session creation requests are enqueued into the job queue and a response is returned immediately, without waiting for actual kernel creation.
- When a kernel creation request comes in or a kernel is terminated, `ScalingGroup.schedule()` is invoked. It calls `AbstractJobScheduler.schedule()` and actually creates the kernels returned by the scheduler.
- Scale in/out when a kernel is created/terminated or an agent joins/leaves.
- Note that scaling always precedes scheduling. When scaling up, scaling does not immediately affect the available resource shares since starting a new instance takes considerably long, whereas scheduling depends on the currently available resource shares; therefore the order of scheduling and scaling does not matter when scaling up. However, when scaling down, `AbstractScalingDriver` should "mark" agents to terminate in the (near) future, and the scheduler should avoid assigning kernels to those agents. This forces scaling down to be done before scheduling (see the sketch below).
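
A minimal sketch of the event flow described in these commits, assuming a `ScalingGroup` that owns a scaling driver and a job scheduler. All names and signatures are illustrative and do not mirror the actual code in this PR; the point is the ordering: scale first (so scale-in can mark agents), then schedule onto unmarked agents.

```python
# Sketch only: the driver is assumed to return the agents it marked for termination.

class ScalingGroup:

    def __init__(self, scaling_driver, job_scheduler):
        self.scaling_driver = scaling_driver
        self.job_scheduler = job_scheduler
        self.pending_queue = []               # session creation requests, enqueued immediately
        self.marked_for_termination = set()   # agents the driver plans to remove

    async def on_event(self, event):
        # 1) Scale first.  Scaling out does not change currently available shares
        #    (new instances take a while to join), but scaling in must mark agents
        #    *before* scheduling so no new kernel lands on them.
        marked = await self.scaling_driver.scale(event)
        self.marked_for_termination |= set(marked)

        # 2) Then schedule pending requests onto agents that are not marked.
        schedulable = await self.job_scheduler.schedule(
            self.pending_queue,
            excluded_agents=self.marked_for_termination,
        )
        for request, agent in schedulable:
            await self.create_kernel(request, agent)
            self.pending_queue.remove(request)

    async def create_kernel(self, request, agent):
        ...  # actual kernel creation RPC to the chosen agent
```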
@Zeniuus
Contributor

Zeniuus commented Nov 6, 2018

Scaling group

  • States
    • Pending request (put / get all)
    • Resource shares
      • get_available_shares(): available resource shares for each agent
      • get_minimum_shares(): margin to create a few kernels immediately without scaling out
      • get_required_shares(lang): required resource share for specific kernel
  • API
    • schedule(): create kernels for the pending requests that can currently be scheduled
    • scale(event)

Design?

  • Each state is not a property but a method, so that each state is computed from the DB on demand.
  • Scaling always precedes scheduling. When scaling down, AbstractScalingDriver should "mark" agents to terminate in the (near) future, and the scheduler should avoid assigning kernels to those agents. This forces scaling down to be done before scheduling (see the interface sketch below).
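
One possible reading of the proposed design as an interface, with every state exposed as an on-demand DB query rather than a cached property. The method names follow the comment above; the signatures and return types are assumptions.

```python
# Sketch of the proposed ScalingGroup surface; signatures are illustrative.
from abc import ABC, abstractmethod
from decimal import Decimal
from typing import Mapping, Sequence


class AbstractScalingGroup(ABC):

    # --- states (computed from the DB on demand) ---

    @abstractmethod
    async def get_pending_requests(self) -> Sequence[dict]:
        """All queued session creation requests (put / get all)."""

    @abstractmethod
    async def get_available_shares(self) -> Mapping[str, Decimal]:
        """Available resource shares for each agent."""

    @abstractmethod
    async def get_minimum_shares(self) -> Decimal:
        """Margin kept free so a few kernels can start without scaling out."""

    @abstractmethod
    async def get_required_shares(self, lang: str) -> Decimal:
        """Resource shares required by a kernel of the given image/lang."""

    # --- API ---

    @abstractmethod
    async def schedule(self) -> None:
        """Create kernels for the requests that can currently be scheduled."""

    @abstractmethod
    async def scale(self, event) -> None:
        """React to a kernel/agent event; scale-in marks agents before scheduling."""
```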

@Zeniuus
Contributor

Zeniuus commented Nov 6, 2018

I think I need some more time to design the interfaces. Please wait for my PR review request; I will follow up offline.

@achimnol
Member Author

achimnol commented Nov 8, 2018

  • I agree with you that the scaler should be executed before the scheduler for the same single event.

We also need to consider some corner cases like this:

  • A session creation request may demand a resource specification impossible to schedule with current hardware configurations even after scaling out. For instance, when we use only one type of GPU instances having 1.0 GPU share for each, scheduling 1.5 GPU share for a single-container session is impossible no matter how many instances the scaler adds.
  • In this case, we should let the user (client) know immediately that the request is impossible to schedule, without waiting for the pending-queue timeout. Also, the scaler should not keep adding new instances indefinitely to fill a resource demand gap that can never be filled (a feasibility-check sketch follows).
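
A hedged sketch of the fail-fast feasibility check suggested above, assuming the manager knows the per-container capacity of each instance type it can launch. All names are hypothetical.

```python
# Reject requests that no instance type can ever satisfy, instead of queueing
# them and scaling out forever.
from decimal import Decimal
from typing import Mapping


class UnschedulableRequest(Exception):
    pass


def check_feasibility(
    requested_shares: Mapping[str, Decimal],
    instance_capacities: Mapping[str, Mapping[str, Decimal]],
) -> None:
    """Raise immediately if no known instance type can satisfy the request.

    instance_capacities maps an instance type (e.g. a GPU flavor) to the
    maximum shares a single agent of that type can offer to one container.
    """
    for capacity in instance_capacities.values():
        if all(
            requested_shares[slot] <= capacity.get(slot, Decimal(0))
            for slot in requested_shares
        ):
            return  # at least one instance type can host this session
    raise UnschedulableRequest(
        "no instance type can satisfy the requested resource shares; "
        "rejecting without waiting for the pending-queue timeout"
    )
```

For example, requesting 1.5 GPU shares while every known instance type offers at most 1.0 GPU share per container would raise immediately, so the client gets an error right away and the scaler never tries to close an unbridgeable gap.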

@Zeniuus
Contributor

Zeniuus commented Nov 8, 2018

Suggestion: I think AbstractScalingDriver contains two separate parts: the scaling logic and the vendor-specific implementation. For instance, scale_out and get_scale_out_info are independent of which vendor is used, while add_agents and get_ami_id are independent of the scaling logic.

So I think it would be good to split AbstractScalingDriver into AbstractScalingDriver and a VendorScalingDriverMixin (see the class-layout sketch below). Does this seem like over-engineering?
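
A sketch of the proposed split, with the vendor-independent scaling logic in AbstractScalingDriver and vendor-specific operations in a mixin (shown here as a hypothetical AWS flavor). Method names follow the comment above; the signatures and bodies are assumptions, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class AbstractScalingDriver(ABC):
    """Vendor-independent scaling logic."""

    async def scale_out(self, pending_requests) -> None:
        # Decide what to add (vendor-agnostic), then let the vendor mixin add it.
        scale_out_info = await self.get_scale_out_info(pending_requests)
        await self.add_agents(scale_out_info)

    @abstractmethod
    async def get_scale_out_info(self, pending_requests):
        """Decide how many agents of which kind are needed (vendor-agnostic)."""

    @abstractmethod
    async def add_agents(self, scale_out_info) -> None:
        """Actually provision the agents (supplied by a vendor mixin)."""


class AWSScalingDriverMixin:
    """A hypothetical AWS flavor of the proposed vendor mixin."""

    async def add_agents(self, scale_out_info) -> None:
        ami_id = self.get_ami_id()
        # Placeholder: real code would launch EC2 instances from ami_id
        # according to scale_out_info.
        ...

    def get_ami_id(self) -> str:
        # Placeholder: real code would read the agent AMI from configuration.
        return "ami-00000000"


class AWSScalingDriver(AWSScalingDriverMixin, AbstractScalingDriver):
    async def get_scale_out_info(self, pending_requests):
        # Placeholder: compute required instance count/type from pending requests.
        return []
```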

@achimnol achimnol modified the milestones: 18.12, 19.06 Mar 27, 2019
@achimnol achimnol mentioned this pull request Jul 25, 2019
@achimnol
Member Author

This is now considered replaced with #167.
Adding an auto-scaling driver will be done as a separate issue in the future.

@achimnol achimnol closed this Aug 28, 2020
Scalability Update automation moved this from Ongoing to Done Aug 28, 2020
@achimnol achimnol deleted the feature/scaling-group branch August 28, 2020 13:15