
Does priority expander do fallbacks? #2075

Closed
TarekAS opened this issue May 30, 2019 · 12 comments
Labels
area/cluster-autoscaler · area/provider/aws (Issues or PRs related to aws provider) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@TarekAS

TarekAS commented May 30, 2019

If the ASG with the highest priority fails to launch instances for some reason, does the expander fall back to lower-priority ASGs?

My use case is falling back from Spot ASGs to OnDemand ASGs in case no Spot instances are available.

If not, I would suggest something like a timeout on each priority.
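For reference, the priority expander is configured through a ConfigMap named `cluster-autoscaler-priority-expander` in `kube-system`: each priority value (higher means more preferred) maps to a list of regexes matched against node-group (ASG) names. A minimal sketch of the Spot-over-OnDemand preference described here (the ASG name patterns are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # higher value = higher priority; entries are regexes matched against ASG names
    20:
      - .*-spot-.*
    10:
      - .*-on-demand-.*
```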

@aleksandra-malinowska aleksandra-malinowska added area/provider/aws Issues or PRs related to aws provider area/cluster-autoscaler labels May 30, 2019
@aleksandra-malinowska
Contributor

cc @Jeffwan

@Jeffwan
Contributor

Jeffwan commented May 30, 2019

If a Spot request cannot be fulfilled within max-node-provision-time (15 minutes by default), CA should stop considering that node group in simulations and will attempt to scale up a different group. I have not tried the priority expander yet, since it was merged only recently. I can test this case, since many users want the OnDemand fallback and this is one way to achieve it.
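The timeout mentioned above corresponds to the `--max-node-provision-time` flag on the cluster-autoscaler binary; a sketch of the relevant container args (a Deployment excerpt, image version illustrative):

```yaml
# excerpt from a cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.15.0   # version illustrative
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=priority
      - --max-node-provision-time=15m   # the default; after this, the group is backed off
```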

As you said, the corresponding node group should only be removed from the priority list with a timeout, because Spot capacity may recover after a while.

I'll come back to you later.

@Jeffwan
Contributor

Jeffwan commented Jun 7, 2019

@TarekAS I noticed Spot may have some issues if the request cannot be fulfilled. The community has a PR tracking this issue: #2008

It depends on whether you use a LaunchConfiguration or a LaunchTemplate. Taking a mixed instances policy as an example: if you don't set a fixed price (i.e. you let the ASG manage the price), you may not see this issue.

It seems the timer doesn't start if the request cannot be fulfilled.

@TarekAS
Author

TarekAS commented Jun 10, 2019

We have a very simple use case. For each AZ, we have one Spot ASG and one OnDemand ASG that are otherwise identical. How can we effectively prefer scaling the Spot ASGs over the OnDemand ones?

We do not want to use MixedInstancesPolicy due to the following concerns:

  1. It requires at least 2 instance types, but CA requires both types to have identical CPU/memory, and such pairs are rare.
  2. We think that using a specific percentage of OnDemand instances is arbitrary and wasteful. Does specifying 0% OnDemand allow for fallback in case Spot requests cannot be fulfilled?
  3. Some workloads actually require OnDemand instances, so the OnDemand ASG is not just for fallbacks. We cannot rely on a percentage chance of getting an OnDemand instance.

How are other people doing this? I've done some research and all I could find was using a Spot Rescheduler to mitigate this issue. I hope the priority expander can help solve it.

In the meantime, there should be a basic guide on how to set this up.

Thanks!

@Jeffwan
Contributor

Jeffwan commented Jun 11, 2019

  2. We think that using a specific percentage of OnDemand instances is arbitrary and wasteful. Does specifying 0% OnDemand allow for fallback in case Spot requests cannot be fulfilled?

I confirmed with the EC2 ASG team: no, fallback is not supported in MixedInstancesPolicy.

  3. Some workloads actually require OnDemand instances, so the OnDemand ASG is not just for fallbacks. We cannot rely on a percentage chance of getting an OnDemand instance.

I think you'd be better off using a separate ASG for those kinds of jobs and using node affinity to schedule the pods onto it.
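A minimal sketch of that suggestion, assuming the OnDemand ASG's nodes carry a `lifecycle: OnDemand` label (the label key and value are assumptions; you would set them yourself, e.g. via kubelet `--node-labels`):

```yaml
# Pod pinned to OnDemand nodes via required node affinity
apiVersion: v1
kind: Pod
metadata:
  name: ondemand-only-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: lifecycle           # assumed label set on the OnDemand ASG's nodes
                operator: In
                values:
                  - OnDemand
  containers:
    - name: app
      image: nginx
```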

How are other people doing this? I've done some research and all I could find was using a Spot Rescheduler to mitigate this issue. I hope the priority expander can help solve it.

I will check it out. I've been pretty busy recently, so if you have ideas, please contribute or discuss them with me. I think https://spotinst.com has this feature, and I think it's achievable in CA. We need to come up with a solution that covers most cases.

The priority expander can fall back to OnDemand, but there's no logic to fall back to Spot once Spot instances become cheaper again.

In the meantime, there should be a basic guide on how to set this up.

Thanks. If we find a limitation on Spot, we can reopen a closed PR and make ASG Spot available there (but that solution won't guarantee the lowest price; the pricing model is somewhat hacky). Guidance will be provided once these problems are resolved.

@TarekAS
Author

TarekAS commented Jun 12, 2019

To me, I think you may better to use different ASG for these kind of jobs and use node affinity to schedule pods on that ASG?

Definitely, that's what we're doing right now for "OnDemand" workloads. Therefore, even with MixedInstancePolicy, one would still need to create a dedicated ASG for on-demand workloads to guarantee availability. We just keep things simpler by creating separate ASGs for Spot and OnDemand.

If only CA supported the preferredDuringSchedulingIgnoredDuringExecution node affinity, it would be able to scale spot instances while supporting fallback.
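For illustration, a preferred (soft) affinity would look like the sketch below; note this only influences the scheduler's placement among existing nodes, since CA does not act on preferred affinity when choosing which group to scale. The `lifecycle: Ec2Spot` label is an assumption:

```yaml
# soft preference for Spot nodes; pods still schedule elsewhere if no Spot node fits
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: lifecycle       # assumed label set on Spot nodes
              operator: In
              values:
                - Ec2Spot
```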

Priority expander can fallback to OnDemand, but there's no logic to fallback to Spot once Spot instance becomes cheaper.

We use Launch Templates (automatic Spot pricing). We're more concerned about Spot instances becoming completely unavailable than about them becoming expensive (this actually happened to us for 30 minutes). If Spot capacity becomes available again, would the priority expander be able to switch back to the higher-priority Spot ASGs?

@Jeffwan
Contributor

Jeffwan commented Jun 12, 2019

We use Launch Templates (automatic Spot pricing). We're more concerned about Spot instances becoming completely unavailable than about them becoming expensive (this actually happened to us for 30 minutes). If Spot capacity becomes available again, would the priority expander be able to switch back to the higher-priority Spot ASGs?

No, the priority expander only applies when there's a decision to be made about which node group to scale up. It doesn't actively move existing nodes or workloads.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 10, 2019
@Jeffwan
Contributor

Jeffwan commented Oct 11, 2019

If a user is looking for fallback options, here's one example that I think can be used to move workloads back to Spot once it's available:

https://github.com/pusher/k8s-spot-rescheduler

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
