Oversubscription support #606

Open
c4milo opened this Issue Dec 18, 2015 · 20 comments

Comments

@c4milo
Contributor

c4milo commented Dec 18, 2015

I noticed that scheduling decisions are made based on what jobs declare as their desired capacity. However, some tasks may never use what they originally requested, or may become idle, holding back resources that could be used by other tasks. Are there plans to improve this in the short to mid term?

@dadgar
Contributor

dadgar commented Dec 18, 2015

It is definitely something we are aware of and plan to do with Nomad. However, there are many more pressing improvements, so it is not something we will be focusing on in the near term.

@c4milo changed the title from "Are there any plans to measure real resource usage, optimizing bin packing to allow over provisioning?" to "Are there any plans to measure real resource usage, optimizing bin packing to allow oversubscription?" Dec 18, 2015

@c4milo changed the title from "Are there any plans to measure real resource usage, optimizing bin packing to allow oversubscription?" to "Oversubscription support" Dec 18, 2015

@cbednarski
Contributor

cbednarski commented Dec 21, 2015

Thanks for the link. There are some tricky aspects to the implementation, not least of which is that applications running on the platform need to be aware of resource reclamation (for example, by observing memory pressure). In practice this is a complicated thing to implement.

In light of that there are probably a few different approaches to this, in no particular order (and I'm not sure how these work at all on non-Linux kernels):

  1. "reclaim" memory that is never used by an application. For example if a job asks for 2gb but only uses 512mb, automatically resize the job.
  2. Provide a dynamic metadata-like endpoint so jobs can query their updated resource quotas.
  3. Mark jobs as "resize aware" to indicate they will respond to memory pressure.

Memory is probably the most complicated because it is a hard resource limit. For soft limits like CPU and network it's fairly easy to over-provision without crashing processes, but it's more difficult to provide QoS or consistent, knowable behavior across the cluster.

In general this problem is not easy to implement or reason about without a lot of extra data that is currently out of Nomad's scope. For example, we would need to look at historical resource usage for a particular application in order to resize it appropriately and differentiate between real usage and memory leaks, monitor the impact of resizing on application health (e.g. does resizing it cause it to crash more frequently), etc.

So while this is something we'd like to do, and we're aware of the value it represents, this is likely not something we will be able to get to in the short / medium term.
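
For context, here is a minimal sketch of how these limits surface in a Nomad job file of that era (values and names are illustrative, not a recommendation): the `memory` value is enforced as a hard limit, while `cpu` is enforced as a relative share that a task can burst past when the node has headroom.

```hcl
job "example" {
  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2" # illustrative image
      }

      resources {
        cpu    = 500 # MHz; enforced as CPU shares, so bursting above this is possible
        memory = 256 # MB; enforced as a hard limit, exceeding it gets the task OOM-killed

        network {
          mbits = 10 # declared bandwidth, used for placement decisions
        }
      }
    }
  }
}
```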

@c4milo
Contributor

c4milo commented Dec 21, 2015

@cbednarski, indeed, it is a complex feature. I believe it can be implemented to some extent, though. This is the rough list of tasks I made while going through the paper; it is by no means complete:

  • It should allow defining Service Level Objectives (SLOs) for a given task in a job definition.
  • It should allow launching a task using a Best Effort (BE) policy, meaning having some way to tag the task as BE in the job definition, which also means the task is interruptible (see the hypothetical sketch after this list).
  • It should continuously monitor latency and latency slack of scheduled Latency Critical (LC) tasks, in order to determine whether or not a node is suitable for oversubscription.
  • It should be able to isolate Latency Critical tasks from Best Effort ones by pinning LC tasks to specific CPU cores and specific CPU cache partitions.
  • It should monitor memory bandwidth using performance counters, making sure LC tasks receive sufficient bandwidth.
  • It should scale down the number of CPU cores assigned to a BE task if memory bandwidth for co-located LC tasks is not sufficient.
  • It should limit outgoing network traffic for BE tasks.
  • It should not limit LC tasks' network traffic.
  • It should guarantee that BE tasks' power consumption does not cause CPU frequencies for LC tasks to scale down. In other words, it has to guarantee that LC tasks' desired CPU frequencies are honored at all times.

Finally, some personal remarks:

  • I don't think BE tasks need to declare desired hardware resources, as they will be run as exactly that: best-effort tasks. If a node has available resources and the latency of its current LC tasks is fine, it will be suitable for running BE tasks.
  • There won't be a need to "reclaim" allocated resources as long as BE tasks are allowed to be scheduled on nodes suitable for oversubscription.
  • The above is a rough list of tasks, as I mentioned; more details and hints about the implementation can be found directly in the paper.
  • Agreed with you that for non-Linux nodes we will have to find out how to get performance counter information, limit network traffic, pin tasks to specific CPU cores, etc.
  • Perhaps the most difficult part of all this work may be its evaluation and testing.
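
To make the SLO and Best Effort points concrete, here is a purely hypothetical sketch of how they might be expressed in a job definition. None of these fields (`slo`, `scheduling_class`) exist in Nomad; the names are invented only to illustrate the idea.

```hcl
# Hypothetical syntax only; neither "slo" nor "scheduling_class" is real Nomad syntax.
job "api" {
  group "web" {
    task "server" {
      driver = "docker"

      # Latency Critical (LC) task: declare the SLO the scheduler would monitor
      # before deciding a node has enough latency slack for oversubscription.
      slo {
        metric    = "p99_latency_ms"
        threshold = 50
      }

      resources {
        cpu    = 1000
        memory = 512
      }
    }
  }
}

job "analytics" {
  type = "batch"

  group "workers" {
    task "crunch" {
      driver = "docker"

      # Best Effort (BE) task: revocable/interruptible, no resource declaration,
      # placed only on nodes whose LC tasks still meet their SLOs.
      scheduling_class = "best_effort"
    }
  }
}
```
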
@diptanu
Collaborator

diptanu commented Dec 21, 2015

My thoughts around oversubscription are -

a. We need QoS guarantees for a task. Tasks which are guaranteed certain resources should get them whenever they need them. Trying to add and take away resources for the same task doesn't work well in all cases, especially for memory. CPU and I/O can probably be tuned up and down unless the task has very bursty resource usage.

b. What works well, however, is the concept that certain jobs which are revocable are oversubscribed alongside jobs which are not revocable. We estimate how much we can oversubscribe, run some revocable jobs when capacity is available, and revoke them when the jobs which have been guaranteed the capacity need it.

@doherty

doherty commented Jul 24, 2016

You may be interested in these videos about Google's Borg, which include a discussion of how oversubscription is handled with SLOs:

@jippi
Contributor

jippi commented Oct 25, 2016

Any ETA or implementation details on this one? :)

@dadgar
Contributor

dadgar commented Oct 25, 2016

No, this feature is very far down the line.

@catamphetamine

catamphetamine commented Jul 1, 2017

I don't understand: if I just use Docker to run containers, it doesn't impose any restrictions or reservations on CPU or memory. This software, in contrast, requires the user to do so. If that's the case then I guess I'll just use Docker. And I don't need the clumsy "driver" concept just to keep all those "vagrant" things afloat that no one really needs in modern microservice architectures.
A Procrustean bed, this is called.

@dadgar
Contributor

dadgar commented Jul 1, 2017

@halt-hammerzeit There are very different requirements for a cluster scheduler and a local container runtime. Nomad imposes resource constraints so it can bin-pack nodes and provide a certain quality of service for tasks it places.

When you don't place any resource constraints using Docker locally, you get no guarantee that the container won't utilize all the CPU/memory on your system, and that is not suitable for a cluster scheduler! Hope that helps!

@jzvelc

jzvelc commented Jul 5, 2017

@dadgar Certainty isn't required in some cases. For example, we are currently running around 30 QA environments, and for that we need a lot of servers (each environment (job) needs around 2 GB of memory in order to cover memory spikes). Utilization of those servers is very low and we can't work around those memory spikes (e.g. PHP Symfony2 apps require a cache warm-up at startup which consumes 3 times the memory that is actually needed at runtime). I should be able to schedule those QA environments on a single server, and I don't care if QoS is degraded since it's for testing purposes only. The scheduler should still make decisions based on the task memory limit provided, but we should be able to define a soft limit and disable the hard memory limit on Docker containers. Something like #2771 would be great.
Other container platforms such as ECS and Kubernetes handle this just fine.
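
A hypothetical sketch of the soft/hard split being asked for here (the `memory_enforcement` field is invented purely for illustration; see #2771 for the actual proposal): the scheduler would keep bin-packing against the declared value while the container is allowed to spike past it during warm-up, accepting degraded QoS when the node is under pressure.

```hcl
# Hypothetical syntax only; "memory_enforcement" is not real Nomad syntax.
task "qa-env" {
  driver = "docker"

  config {
    image = "example/symfony-app" # illustrative image
  }

  resources {
    memory = 768 # value the scheduler bin-packs against

    # Hypothetical: enforce the value above only as a soft reservation so the
    # container can spike (e.g. 3x during cache warm-up) without being OOM-killed.
    memory_enforcement = "soft"
  }
}
```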

@tshak

tshak commented Nov 10, 2017

I think it is difficult to ask developers to define resource allocations for services, especially for new services or when a service runs in a wide range of environments. I understand that Nomad's approach greatly simplifies bin packing, but this does us little good if we aren't good at predicting the required resources. One reason this is particularly challenging for us is that we have hundreds of different production environments (multiple tiers of multi-tenancy, plus lots of single tenants with a wide range of hardware and usage requirements). Even if we can generalize some of these configurations, I believe that explicit resource allocation could be an undesirable challenge for us to take on for each service.

Clearly there was a lot of thought put behind the current design. The relatively low priority on this issue also indicates to me a strong opinion that the current offering is at least acceptable, if not desirable, for a wide range of use cases. Maybe some guidance on configuring resource allocation, especially in cases where we lack a priori knowledge of the load requirements of a service, would be helpful.

Ultimately my goal is to provide developers a nearly "Cloud Foundry" like experience. "Here's my service, make it run, I don't care how". I really like Nomad's simplicity compared to other solutions like Kubernetes, but this particular issue could be an adoption blocker. I'm happy to discuss further or provide more detail about my particular scenario here or on Gitter.

@dadgar
Contributor

dadgar commented Nov 10, 2017

@tshak I would recommend you over-allocate resources for new jobs, then monitor the actual resource usage and adjust the resource ask to be in line with your actual requirement.

@catamphetamine

catamphetamine commented Nov 10, 2017

@tshak Or look into Kubernetes, which seems to claim to support such a feature.

@tshak

tshak commented Nov 11, 2017

Thanks @dadgar. Unfortunately this is a high burden since we have many environments of varying sizes and workloads that these services run on. We've got a good discussion going on Gitter if you're interested in more. As @catamphetamine said, we may have to use Kubernetes. The concept of "Guaranteed" vs. "Burstable" vs. "BestEffort" seems to better fit our use case. I was hoping for Nomad to have (or plan for the near future) something similar, since Nomad is otherwise preferable to me!

@catamphetamine

catamphetamine commented Nov 11, 2017

I too was giving it a second thought yesterday, after abandoning Nomad in the summer due to it lacking this feature.
Containers are meant to be "stateless" and "ephemeral", so if a container crashes due to an Out Of Memory error it would ideally make no difference, as the code should automatically retry the API query.
In the real world, though, there's no "auto retry" feature in any code, so if an API request fails the whole transaction may be left in an inconsistent state, possibly corrupting the application's data.

@DanielFallon

DanielFallon commented Nov 14, 2017

I think what Kubernetes calls "burstable" is the most intractable use case here: many services that we run use a large heap during startup and then have relatively low memory usage afterwards.

One of the services I've been monitoring requires as much as 4 GB during startup to warm up its cache, and then it typically runs at around 750 MB of RAM during normal operation. With Nomad I must allocate 4 GB of RAM for each of these microservices, which is really expensive.

@CumpsD

CumpsD commented Apr 16, 2018

Is it possible with Nomad at this time to support this burstable feature?

I have a job which consumes a lot of CPU when launching and then falls back to almost no CPU.

However, Nomad (using Docker driver) kills off this job before it can get over its initial peak.

I cannot allocate enough CPU to get this started, or it doesn't find an allocation target.

@schmichael
Contributor

schmichael commented Apr 16, 2018

However, Nomad (using Docker driver) kills off this job before it can get over its initial peak.
-- @CumpsD

Nomad should not be killing a job due to its CPU usage. Just so you know: by default Docker/exec/java/rkt CPU limits are soft limits, meaning they allow bursting CPU usage today. If the count is >1 you may want to use the distinct_hosts constraint on the initial run to make sure multiple instances aren't contending for resources on the same host, but beyond the initial run the deployments feature can prevent instances from starting at the same time during their warmup period.

While we still plan on first class oversubscription someday, CPU bursting should be beneficial for your use case today. Please open a new issue if you think there's a bug.
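
For readers following along, a minimal sketch of the distinct_hosts constraint schmichael mentions (job, image, and resource values are illustrative); the constraint keeps multiple instances of the group off the same client node so their startup CPU bursts don't contend on one host.

```hcl
job "warmup-heavy" {
  group "app" {
    count = 3

    # Place each instance of this group on a different client node.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "server" {
      driver = "docker"

      config {
        image = "example/app:latest" # illustrative image
      }

      resources {
        cpu    = 500 # soft limit: the task may burst above this when spare CPU exists
        memory = 256
      }
    }
  }
}
```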

@CumpsD

CumpsD commented Apr 17, 2018

@schmichael it was a wrong assumption; the CPU spike at start and the stopping were unrelated. After debugging for a day, it turned out the inner process stopped, causing the job to stop and restart.

Sorry for the confusion. After fixing it, it is obvious the limit is indeed a soft one :)
