
Scaling group and scaling drivers #73

Closed
wants to merge 53 commits into from

Conversation

achimnol
Member

@achimnol achimnol commented Apr 12, 2018

Let's implement policy-based scalers and scaling groups.

Concept:

Policy options (a configuration sketch follows this list):

  • image auto-update
    • on/off
    • update check interval
  • agent auto-update
    • on/off
    • idle wait time after last kernel termination on agent
  • container termination policy
    • auto-terminate after specified idle time / persistent
  • external agents
    • on/off
    • public key (external agents should provide an auth message encrypted using their own private keys)
    • limit (max number of allowed external agents)
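To make the policy options above concrete, here is a minimal sketch of how they could be grouped into a per-scaling-group policy object. All class names, field names, and default values are hypothetical and do not correspond to actual Backend.AI configuration keys.

```python
# A hedged sketch of the policy options listed above; names are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageAutoUpdatePolicy:
    enabled: bool = False
    check_interval_sec: int = 3600       # how often to check for new images


@dataclass
class AgentAutoUpdatePolicy:
    enabled: bool = False
    idle_wait_sec: int = 600             # wait after the last kernel on the agent terminates


@dataclass
class ContainerTerminationPolicy:
    persistent: bool = False             # if False, auto-terminate after idle_timeout_sec
    idle_timeout_sec: int = 1800


@dataclass
class ExternalAgentPolicy:
    enabled: bool = False
    public_key: Optional[str] = None     # used to verify auth messages from external agents
    max_agents: int = 0                  # max number of allowed external agents


@dataclass
class ScalingGroupPolicy:
    image_auto_update: ImageAutoUpdatePolicy
    agent_auto_update: AgentAutoUpdatePolicy
    container_termination: ContainerTerminationPolicy
    external_agents: ExternalAgentPolicy
```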

Batch job functions (an interface sketch follows this list):

  • Get the job list and their status
  • Submit a job (input + output + program + resource spec)
    • JobSpecAdaptor
  • Get the cluster node list
  • Get/set the scheduler configurations
  • (Optional) Cancel / interrupt a job
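
A hedged sketch of how the batch job functions above could surface as a manager-side interface. The method names, signatures, and the job-spec mapping shape are assumptions for illustration; only `JobSpecAdaptor` is named in the list above.

```python
# Hypothetical interface for the batch-job functions; not the PR's actual API.
from abc import ABC, abstractmethod
from typing import Any, Mapping, Sequence


class AbstractBatchJobAPI(ABC):

    @abstractmethod
    async def list_jobs(self) -> Sequence[Mapping[str, Any]]:
        """Return the job list with per-job status."""

    @abstractmethod
    async def submit_job(self, job_spec: Mapping[str, Any]) -> str:
        """Submit a job (input + output + program + resource spec) and return its ID.

        A JobSpecAdaptor would translate a user-facing spec into this mapping.
        """

    @abstractmethod
    async def list_nodes(self) -> Sequence[Mapping[str, Any]]:
        """Return the cluster node list."""

    @abstractmethod
    async def get_scheduler_config(self) -> Mapping[str, Any]:
        """Return the current scheduler configuration."""

    @abstractmethod
    async def set_scheduler_config(self, config: Mapping[str, Any]) -> None:
        """Update the scheduler configuration."""

    async def cancel_job(self, job_id: str) -> None:
        """(Optional) Cancel or interrupt a running job."""
        raise NotImplementedError
```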

Related:

  • lablup/backend.ai-agent#5 — container termination policy (idle / persistent / etc.)
  • lablup/backend.ai-agent#39 — on/off option for self-autoupgrade
  • https://github.com/lablup/backend.ai-agent/projects/2 — on/off option and auth config to accept external agents

@codecov-io

codecov-io commented Apr 12, 2018

Codecov Report

Merging #73 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #73   +/-   ##
=======================================
  Coverage   64.29%   64.29%           
=======================================
  Files          24       24           
  Lines        2644     2644           
=======================================
  Hits         1700     1700           
  Misses        944      944

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b6647c...9119393. Read the comment docs.

@achimnol achimnol mentioned this pull request Oct 5, 2018
@achimnol achimnol modified the milestones: 1.4, 18.12 Nov 2, 2018
achimnol and others added 5 commits November 5, 2018 13:15
- All session creation requests are enqueued into the job queue and a response is returned immediately, without waiting for actual kernel creation.
- When a kernel creation request comes in or a kernel is terminated, `ScalingGroup.schedule()` is invoked. It calls `AbstractJobScheduler.schedule()` and actually creates the kernels returned by the scheduler.
- Scale in/out when a kernel is created/terminated or an agent joins/leaves.
- Note that scaling always precedes scheduling. When scaling up, scaling does not immediately affect the available resource shares since starting a new instance takes considerably long, whereas scheduling depends on the currently available resource shares; therefore the order of scheduling and scaling does not matter when scaling up. However, when scaling down, `AbstractScalingDriver` should "mark" agents to terminate in the (near) future, and the scheduler should avoid assigning kernels to those agents. This forces scaling down to be done before scheduling (see the sketch below).
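
A minimal sketch of the event flow described in these commits, assuming a `ScalingGroup` that owns a scaling driver and a job scheduler. All names and signatures are illustrative and do not mirror the actual code in this PR; the point is the ordering: scale first (so scale-in can mark agents), then schedule onto unmarked agents.

```python
# Sketch only: the driver is assumed to return the agents it marked for termination.

class ScalingGroup:

    def __init__(self, scaling_driver, job_scheduler):
        self.scaling_driver = scaling_driver
        self.job_scheduler = job_scheduler
        self.pending_queue = []               # session creation requests, enqueued immediately
        self.marked_for_termination = set()   # agents the driver plans to remove

    async def on_event(self, event):
        # 1) Scale first.  Scaling out does not change currently available shares
        #    (new instances take a while to join), but scaling in must mark agents
        #    *before* scheduling so no new kernel lands on them.
        marked = await self.scaling_driver.scale(event)
        self.marked_for_termination |= set(marked)

        # 2) Then schedule pending requests onto agents that are not marked.
        schedulable = await self.job_scheduler.schedule(
            self.pending_queue,
            excluded_agents=self.marked_for_termination,
        )
        for request, agent in schedulable:
            await self.create_kernel(request, agent)
            self.pending_queue.remove(request)

    async def create_kernel(self, request, agent):
        ...  # actual kernel creation RPC to the chosen agent
```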
@Zeniuus
Contributor

Zeniuus commented Nov 6, 2018

Scaling group

  • States
    • Pending request (put / get all)
    • Resource shares
      • get_available_shares(): available resource shares for each agent
      • get_minimum_shares(): margin to create a few kernels immediately without scaling out
      • get_required_shares(lang): required resource share for specific kernel
  • API
    • schedule(): create kernels for the pending requests that can currently be scheduled
    • scale(event)

Design?

  • Each state is not a property but a method, so that each state is computed from the DB on demand.
  • Scaling always precedes scheduling. When scaling down, AbstractScalingDriver should "mark" agents to terminate in the (near) future, and the scheduler should avoid assigning kernels to those agents. This forces scaling down to be done before scheduling (see the interface sketch below).
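
One possible reading of the proposed design as an interface, with every state exposed as an on-demand DB query rather than a cached property. The method names follow the comment above; the signatures and return types are assumptions.

```python
# Sketch of the proposed ScalingGroup surface; signatures are illustrative.
from abc import ABC, abstractmethod
from decimal import Decimal
from typing import Mapping, Sequence


class AbstractScalingGroup(ABC):

    # --- states (computed from the DB on demand) ---

    @abstractmethod
    async def get_pending_requests(self) -> Sequence[dict]:
        """All queued session creation requests (put / get all)."""

    @abstractmethod
    async def get_available_shares(self) -> Mapping[str, Decimal]:
        """Available resource shares for each agent."""

    @abstractmethod
    async def get_minimum_shares(self) -> Decimal:
        """Margin kept free so a few kernels can start without scaling out."""

    @abstractmethod
    async def get_required_shares(self, lang: str) -> Decimal:
        """Resource shares required by a kernel of the given image/lang."""

    # --- API ---

    @abstractmethod
    async def schedule(self) -> None:
        """Create kernels for the requests that can currently be scheduled."""

    @abstractmethod
    async def scale(self, event) -> None:
        """React to a kernel/agent event; scale-in marks agents before scheduling."""
```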

@Zeniuus
Contributor

Zeniuus commented Nov 6, 2018

I think I need some more time to design the interfaces. Please wait for my PR review request; I will follow up offline.

@achimnol
Member Author

achimnol commented Nov 8, 2018

  • I agree with you that the scaler should be executed before the scheduler for the same single event.

We also need to consider some corner cases like this:

  • A session creation request may demand a resource specification impossible to schedule with current hardware configurations even after scaling out. For instance, when we use only one type of GPU instances having 1.0 GPU share for each, scheduling 1.5 GPU share for a single-container session is impossible no matter how many instances the scaler adds.
  • In this case, we should let the user (client) know immediately that the request is impossible to schedule, without waiting for the pending-queue timeout. Also, the scaler should not keep adding new instances indefinitely to fill a resource demand gap that can never be filled (a feasibility-check sketch follows).
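
A hedged sketch of the fail-fast feasibility check suggested above, assuming the manager knows the per-container capacity of each instance type it can launch. All names are hypothetical.

```python
# Reject requests that no instance type can ever satisfy, instead of queueing
# them and scaling out forever.
from decimal import Decimal
from typing import Mapping


class UnschedulableRequest(Exception):
    pass


def check_feasibility(
    requested_shares: Mapping[str, Decimal],
    instance_capacities: Mapping[str, Mapping[str, Decimal]],
) -> None:
    """Raise immediately if no known instance type can satisfy the request.

    instance_capacities maps an instance type (e.g. a GPU flavor) to the
    maximum shares a single agent of that type can offer to one container.
    """
    for capacity in instance_capacities.values():
        if all(
            requested_shares[slot] <= capacity.get(slot, Decimal(0))
            for slot in requested_shares
        ):
            return  # at least one instance type can host this session
    raise UnschedulableRequest(
        "no instance type can satisfy the requested resource shares; "
        "rejecting without waiting for the pending-queue timeout"
    )
```

For example, requesting 1.5 GPU shares while every known instance type offers at most 1.0 GPU share per container would raise immediately, so the client gets an error right away and the scaler never tries to close an unbridgeable gap.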

@Zeniuus
Contributor

Zeniuus commented Nov 8, 2018

Suggestion: I think AbstractScalingDriver contains two separate parts: the scaling logic and the vendor-specific implementation. For instance, scale_out and get_scale_out_info are independent of which vendor is used, while add_agents and get_ami_id are independent of the scaling logic.

So I think it would be good to split AbstractScalingDriver into AbstractScalingDriver and a VendorScalingDriverMixin (see the class-layout sketch below). Does this seem like over-engineering?
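
A sketch of the proposed split, with the vendor-independent scaling logic in AbstractScalingDriver and vendor-specific operations in a mixin (shown here as a hypothetical AWS flavor). Method names follow the comment above; the signatures and bodies are assumptions, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class AbstractScalingDriver(ABC):
    """Vendor-independent scaling logic."""

    async def scale_out(self, pending_requests) -> None:
        # Decide what to add (vendor-agnostic), then let the vendor mixin add it.
        scale_out_info = await self.get_scale_out_info(pending_requests)
        await self.add_agents(scale_out_info)

    @abstractmethod
    async def get_scale_out_info(self, pending_requests):
        """Decide how many agents of which kind are needed (vendor-agnostic)."""

    @abstractmethod
    async def add_agents(self, scale_out_info) -> None:
        """Actually provision the agents (supplied by a vendor mixin)."""


class AWSScalingDriverMixin:
    """A hypothetical AWS flavor of the proposed vendor mixin."""

    async def add_agents(self, scale_out_info) -> None:
        ami_id = self.get_ami_id()
        # Placeholder: real code would launch EC2 instances from ami_id
        # according to scale_out_info.
        ...

    def get_ami_id(self) -> str:
        # Placeholder: real code would read the agent AMI from configuration.
        return "ami-00000000"


class AWSScalingDriver(AWSScalingDriverMixin, AbstractScalingDriver):
    async def get_scale_out_info(self, pending_requests):
        # Placeholder: compute required instance count/type from pending requests.
        return []
```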

@achimnol achimnol modified the milestones: 18.12, 19.06 Mar 27, 2019
@achimnol achimnol mentioned this pull request Jul 25, 2019
@achimnol
Member Author

This is now considered replaced with #167.
Adding an auto-scaling driver will be done as a separate issue in the future.

@achimnol achimnol closed this Aug 28, 2020
Scalability Update automation moved this from Ongoing to Done Aug 28, 2020
@achimnol achimnol deleted the feature/scaling-group branch August 28, 2020 13:15