Outstanding Scale offers in mesos #1516

Closed
JohnPTobe opened this issue Mar 11, 2019 · 5 comments

@JohnPTobe

The dev and prod clusters have gotten into states where the number of outstanding offers held by Scale keeps increasing and the offers are never cleaned up. They pile up unused and clog the cluster, leaving it with only approx. 25% of its capacity.

@JohnPTobe
Author

There are multiple areas where the scheduling logic could be improved:

  1. The handler for RESCIND operations appears not to be operating correctly. The default timeout in DCOS 1.10.9 is 2 minutes, so the outstanding offers we are seeing should not live past 2 minutes. They clearly do, so this is not working as intended.
  2. We should DECLINE offers immediately in the scheduler thread if they don't match our configured Mesos role.
  3. ResourceManager should, at minimum, DECLINE offers after a set timeout. This should be configurable, and a reasonable default is probably the 2 minutes we are seeing in DCOS 1.10.
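
For item 3, a rough sketch of what a timeout-based cleanup could look like. This is not Scale's actual ResourceManager code: `HeldOffer`, `expired_offer_ids` and `max_age_seconds` are hypothetical names, and how each ID is then declined through the Mesos scheduler API is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HeldOffer:
    """Hypothetical record of an offer the framework is holding."""
    offer_id: str
    received_at: float  # seconds, same clock as the `now` argument below


def expired_offer_ids(held: Iterable[HeldOffer], now: float,
                      max_age_seconds: float = 120.0) -> List[str]:
    """Return the IDs of offers held for at least max_age_seconds.

    The 120 second default mirrors the 2 minute timeout mentioned above;
    it is meant to be overridden from configuration.
    """
    return [o.offer_id for o in held if now - o.received_at >= max_age_seconds]
```

The caller would pass the current time (for example from `time.monotonic()`) on each pass and decline whatever comes back.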

@JohnPTobe JohnPTobe added this to the Sprint 3-08-2019 milestone Mar 15, 2019
@JohnPTobe JohnPTobe self-assigned this Mar 15, 2019
@JohnPTobe
Author

Rescind is working fine. The problem is that Scale gobbles up all offers as soon as it picks them up from Mesos. The offers expire, but the next time Mesos offers them again, Scale grabs them all. The solution is threefold:

  1. Decline all offers while Scale is paused. If Scale is being a bit selfish and something else needs to run for a bit, pausing Scale should decline all offers so that another framework can pick them up.
  2. Decline all offers for other roles. Currently we're accepting role * and it's not an issue, but if we're set up with an assigned role we will grab offers for other roles and never decline them, so that needs to change.
  3. Decline offers that have not been allocated to a task by the end of the scheduling loop. After scheduling a set of tasks, decline the unallocated offers.
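
The three rules can be sketched as one decision function. This is only an illustration under assumptions, not the code that went into Scale: `Offer`, `offers_to_decline`, `configured_role` and `allocated_ids` are made-up names, the role is modeled as a plain string on the offer, and `configured_role=None` stands in for the current "accepting role *" setup.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Offer:
    """Hypothetical, simplified view of a Mesos offer."""
    offer_id: str
    role: str


def offers_to_decline(offers: Iterable[Offer], paused: bool,
                      configured_role: Optional[str],
                      allocated_ids: Set[str]) -> List[str]:
    """Return the IDs of offers that should be declined after a scheduling pass."""
    offers = list(offers)

    # Rule 1: while paused, hand everything back so other frameworks can use it.
    if paused:
        return [o.offer_id for o in offers]

    declined = []
    for offer in offers:
        # Rule 2: with an assigned role, decline offers for any other role.
        if configured_role is not None and offer.role != configured_role:
            declined.append(offer.offer_id)
        # Rule 3: decline offers that no task was allocated to in this pass.
        elif offer.offer_id not in allocated_ids:
            declined.append(offer.offer_id)
    return declined
```

Running this at the end of every scheduling loop means nothing is held between passes, so unused offers cannot pile up even when Mesos re-offers the same resources.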

@bald6354
Contributor

Well done!

@JohnPTobe
Author

Thanks!
