Containers startup throttling #3312

Closed
lhuard1A opened this issue Jan 8, 2015 · 40 comments
@lhuard1A lhuard1A commented Jan 8, 2015

Throttling study

Summary

The purpose of Kubernetes is to orchestrate many, potentially large, applications that may be co-located on shared minions. Kubernetes’ current policy when starting PODs is to start them as soon as possible.

This can be sub-optimal for OLTP applications where many processes/threads are started. During normal operation, those processes/threads are mostly idling, waiting for the next transaction, but when they are starting, they are CPU and IO greedy.

Starting all those processes simultaneously can cause start-up time degradation due to context switching and I/O saturation. In that case, starting them in a controlled manner can improve the start-up time and its predictability.

This study proposes to implement a throttling mechanism to improve start-up time predictability. It covers different possible policies, their pros/cons and implications.

Problem statement

We have applications that require many processes to be spawned on the same machine.

In order to port those applications to Kubernetes, we plan to have many containers (with only one process inside) in a POD.

Running does not imply Ready

The fact that a process has started does not mean that it is ready to work and process traffic.

For example, we can have a process that starts its life by fetching some configuration elements from somewhere and doing some expensive configuration stuff before being really ready to work.

Further in this document, we’ll make a distinction between starting and ready states.

  • starting means that the container is running, the process is live, but it is not ready to do its real “work”. For example, it may still be loading libraries, waiting for a database connection, building some internal caches, whatever…
  • ready means that the container is running, fully functional and ready to process incoming traffic.

By extension, we can define POD states as:

  • starting if a POD has at least one container in the starting state;
  • ready if all the containers of a POD are ready.

In today’s Kubernetes terminology, we don’t distinguish those states and both are named running.

Need for throttling

The initialization of the containers might be expensive in terms of CPU or I/O.
We have examples of applications that:

  • load a huge number of dynamic libraries and the symbol relocation takes a significant amount of time;
  • retrieve some configuration from a file or a remote DB and denormalize that configuration, which can be expensive as well;
  • create some local caches that need to be fed before the application can work;
  • etc.

A starting container which consumes a lot of resources can have two kinds of consequences:

  1. It slows down the ready containers which are already running on the same machine, causing response time degradation of the services provided by PODs running on the same machine as the starting one.
  2. It slows down other starting containers which take longer to become ready.

1. concurrency between the starting POD and the ready ones which are running on the same machine

This issue can be solved by adjusting the priorities of the different PODs so that one POD cannot starve others.

2. concurrency between the starting containers and the other starting ones of the same POD

This increases the starting time of the processes themselves (resource starvation).

Our processes have internal health checks that check that the starting phase does not last longer than a pre-defined time-out.

In case of resource starvation, the time-out expires and the startup is considered failed.

We previously solved this by limiting the number of processes that can start simultaneously in order to avoid resource starvation and have more predictable start-up times.

Protect the ready PODs from the starting ones on the same machine

Let’s consider the situation where, on a given machine, we have:

  • a POD in the ready state which is not consuming a lot of CPU, but which is handling requests whose response times are critical.
  • a POD in the starting state which is very CPU greedy.

We want to prevent the starting POD from starving the ready one in order to protect the response times of the processes of the ready POD. We must guarantee this even if the starting POD has many more containers than the ready one.

Today, docker containers are spawned in their own cgroups, all children of /system.slice. As a consequence, all the containers have the same weight. If the starting POD has ten times more containers than the ready one, it will be allocated ten times more CPU.

Adding a layer with one slice per POD would give a fair resource allocation per POD instead of a fair resource allocation per container.

We propose to enhance docker so that the parent slice of a container’s cgroup can be specified (Pull request #9436, Pull request #9551), and to enhance Kubernetes to create one slice per POD.

Before:

systemd-cgls
├─1 /sbin/init
├─system.slice
│ ├─docker-123456….scope
│ │ └─100 /foo/bar/baz
│ ├─docker-123457….scope
│ │ └─101 /foo/bar/baz
│ ├─docker-123458….scope
│ │ └─103 /foo/bar/baz
│ ├─docker-123459….scope
│ │ └─104 /foo/bar/baz

After:

systemd-cgls
├─1 /sbin/init
├─system.slice
│ ├─kubernetes.slice
│ │ ├─k8s_pod_X.slice
│ │ │ ├─docker-123456….scope
│ │ │ │ └─100 /foo/bar/baz
│ │ │ └─docker-123457….scope
│ │ │   └─101 /foo/bar/baz
│ │ └─k8s_pod_Y.slice
│ │   ├─docker-123458….scope
│ │   │ └─103 /foo/bar/baz
│ │   └─docker-123459….scope
│ │     └─104 /foo/bar/baz

Limit the number of processes that are starting at a given time

To avoid overloading a machine, rather than starting a container as soon as its image has been pulled, we propose to control that start with a policy.

When a POD is assigned to a minion, kubelet creates the following FSM for each container:

[figure: container FSM]

When an image is pulled, the container is not started immediately. Instead, it enters a pending state.

The pending to starting transition is triggered when the throttling policy authorizes it. The throttling policy is described below.

The starting to ready transition is triggered when the container is considered ready.

Readiness detection

The starting to ready transition is new. This transition can be triggered in two ways:

  • either the containers send a notification to say they are ready;
  • or the containers are polled by docker/kube to check their readiness.

ready notification

This solution requires the containers to notify their readiness.

systemd has a similar requirement and lets services notify their readiness via different means:

  • simple: the service is immediately ready;
  • forking: the service is ready as soon as the parent process exits;
  • dbus: the service is ready as soon as it acquires a name on D-Bus;
  • notify: the service is ready as soon as it has explicitly notified systemd by posting a message on a dedicated UNIX socket via the sd_notify function.

For containers, the notify service type seems the most suitable.

Pros
  • Notification is sent as soon as possible.
Cons
  • Either it introduces a “systemd” dependency, or it requires implementing another notification mechanism inspired by the sd_notify feature. In all cases, it is intrusive since it requires implementing something in the containerized processes.
  • One might see this as an advantage, because existing programs may already implement the sd_notify call.
    In practice, however, when possible, it is preferable to decouple the public resource allocation (socket binding, for example) from the program start-up: the sockets are bound by systemd itself, the program is Type=simple (considered ready immediately) and it receives the file descriptor of the socket.

ready polling

This solution consists of having docker/kube regularly check the readiness of containers.

Such a mechanism already exists in Kubernetes as LivenessProbe. There are already different flavours of LivenessProbes:

  • HTTP probe: try to do an HTTP GET on a given URL
  • TCP probe: try to connect on a given port
  • Exec: try to execute an arbitrary command inside the container

The last one seems generic enough to implement any kind of check.

Pros
  • LivenessProbe is a mechanism that already exists;
  • It is not intrusive since it doesn’t require implementing an sd_notify-like call in the programs.
Cons
  • It is based on “polling”.
  • More fork/exec overhead.

Preferred solution

Reusing the LivenessProbe mechanism.
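
To make the preferred solution concrete, here is a minimal sketch, not the actual Kubernetes implementation, of the polling loop that would drive the starting to ready transition; runInContainer is a hypothetical helper standing in for an exec-style probe.

package throttling

import (
    "context"
    "time"
)

// runInContainer is a hypothetical helper that executes a command inside a
// container (e.g. via an exec probe) and reports whether it exited with 0.
type runInContainer func(ctx context.Context, containerID, command string) bool

// waitReady polls an exec-style readiness probe every `period` until it
// succeeds (container becomes ready) or `timeout` expires (container failed).
func waitReady(ctx context.Context, probe runInContainer, containerID, probeCmd string,
    period, timeout time.Duration) bool {

    deadline := time.After(timeout)
    tick := time.NewTicker(period)
    defer tick.Stop()

    for {
        select {
        case <-tick.C:
            if probe(ctx, containerID, probeCmd) {
                return true // starting -> ready
            }
        case <-deadline:
            return false // starting -> failed
        case <-ctx.Done():
            return false
        }
    }
}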

Throttling policy

This is what triggers the pending to starting transition.

Several strategies are possible:

limit the number of processes in the starting state

This policy allows a container to go from pending to starting as soon as the number of containers in the starting state on the minion (whichever POD they belong to) drops below a pre-defined threshold.

Pros
  • Simple
Cons
  • It’s non-trivial to set the threshold.

    If the processes are CPU bound and consume 100% of one CPU while in the starting state, then the optimal threshold is the number of cores of the machine.

    If the processes are mostly waiting for external resources, then the above recommendation is suboptimal.

    If the processes are multi-threaded and consume 100% of 3 CPUs, then the optimal threshold is one third of the number of cores of the machine.

  • If processes are stuck in the starting state, they will prevent other containers from being started. It is thus mandatory to implement a time-out mechanism that moves starting containers to the failed state if they stay in starting for too long.
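
A minimal sketch of this gate, assuming a simple counting semaphore on the kubelet side (illustrative only, not kubelet code):

package throttling

import (
    "context"
    "time"
)

// startGate bounds the number of containers that may be in the "starting"
// state at the same time on a minion, whichever POD they belong to.
type startGate struct {
    slots chan struct{}
}

func newStartGate(maxStarting int) *startGate {
    return &startGate{slots: make(chan struct{}, maxStarting)}
}

// acquire blocks the pending -> starting transition until a slot is free.
func (g *startGate) acquire(ctx context.Context) error {
    select {
    case g.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// release frees the slot when the container leaves the starting state,
// either because it became ready or because it was marked failed.
func (g *startGate) release() { <-g.slots }

// The mandatory time-out: a container that does not become ready within
// startTimeout is marked failed and its slot is released.
const startTimeout = 5 * time.Minute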

time throttling

This policy allows only a maximum number of containers to become starting per given amount of time. For example: at most 2 containers can be started per second on a minion.

Pros
  • Simple
  • Ensures that every container’s startup will be triggered within a predictable time, giving all containers a chance to become ready.
Cons
  • Does not guarantee that:
    • The machine is never overloaded
    • The resources of the machine are used optimally (we may uselessly throttle a container while the CPU is idle, there is no I/O, etc.)

resource monitoring

This policy allows a container to become starting only if:

  • the CPU consumption drops below a given threshold during a given amount of time
  • the I/O usage drops below a given threshold during a given amount of time
  • the load average of the machine drops below a given threshold
Pros
  • Actually takes into account the machine’s available resources and their current usage, optimizing utilization without overloading the machine
Cons
  • Implementation is more complex
  • If the machine is loaded by things other than starting containers (like ready containers or even processes running on the machine that are not docker containers), it will prevent containers from starting.

Composite policy

The “resource monitoring” policy is the one that makes the best use of resources, but it relies on resource consumption averaged over a given amount of time.

It should be combined with a maximum number of starting containers and a maximum start-up rate in order not to start too many containers before the “average CPU over the last 10s”, “average I/O over the last 10s” or “1 min load average” metrics have had time to increase.

If we limit the maximum number of starting containers, we must have a time-out mechanism that prevents containers from staying in starting for too long.

If the resource consumption doesn’t drop when containers leave the starting state (either because the ready containers also consume resources, or because some resources are consumed by processes outside Kubernetes), pending containers will be prevented from starting forever. In order to avoid that, we need a minimum start-up rate that guarantees that all the containers will eventually be started.

Pros
  • The only one that works in all cases?
Cons
  • Complex
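
For illustration, a sketch of how the composite policy could combine the three mechanisms. The knobs mirror the configuration example in the next section; the cpuBusy and loadAverage samplers are assumptions, not existing kubelet APIs.

package throttling

import "time"

// compositeConfig mirrors the kind of knobs described above.
type compositeConfig struct {
    MaxStartingPerCore float64       // cap on simultaneous "starting" containers per core
    MaxCPU             float64       // e.g. 0.80: don't start new containers above 80% CPU
    MaxLoadMultiplier  float64       // load-average threshold = multiplier * number of cores
    MaxWait            time.Duration // from minRate: start at least one container every MaxWait
    MinGap             time.Duration // from maxRate: never start two containers closer than MinGap
}

// canStart decides whether one pending container may move to "starting".
// cpuBusy and loadAverage are hypothetical samplers averaged over the last
// few seconds; sinceLastStart is the time elapsed since the previous start.
func canStart(cfg compositeConfig, cores, starting int, cpuBusy, loadAverage float64,
    sinceLastStart time.Duration) bool {

    // minimum rate: guarantee progress even if resources never look idle
    if sinceLastStart >= cfg.MaxWait {
        return true
    }
    // maximum rate: never exceed the configured start rate
    if sinceLastStart < cfg.MinGap {
        return false
    }
    // cap on the number of simultaneously starting containers
    if float64(starting) >= cfg.MaxStartingPerCore*float64(cores) {
        return false
    }
    // resource monitoring: only start when CPU and load average are below their thresholds
    return cpuBusy < cfg.MaxCPU && loadAverage < cfg.MaxLoadMultiplier*float64(cores)
}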

Configuration example

The throttling mechanism described in this section is about avoiding resource starvation. The resources are global to the machine. As a consequence, the settings cannot be at the POD level; they need to be at the minion level.

We could have a configuration file attached to minions.

m1_config.json:

{
  "kind": "MinionConfig",
  "apiVersion": "v1beta1",
  "throttling": {
    "maxStartingContainersPerCore": "3",
    "maxLoadAverageMulitplier": "1.5",
    "maxCPU": "80%",
    "minRate": "0.1",
    "maxRate": "10"
  }
}
kubecfg -c m1_config.json update minions/192.168.10.1
  • maxStartingContainersPerCore: The maximum number of containers per core that may be in the starting state on the machine.
  • maxLoadAverageMultiplier: The maximum allowed load average, expressed as a factor multiplied by the number of cores of the machine.
  • minRate: Regardless of the other settings, we’ll start at least one container every 10s (i.e. 0.1 container per second).
  • maxRate: Regardless of the other settings, we’ll start at most 10 containers per second (i.e. 1 container every 100ms).
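
For illustration, assuming the JSON above, the minion-side structure could be declared as follows (field names taken from the example; nothing like this exists in Kubernetes today):

package throttling

import "encoding/json"

// MinionConfig mirrors the hypothetical m1_config.json shown above.
// Values are kept as strings, as in the example ("80%", "1.5", ...).
type MinionConfig struct {
    Kind       string `json:"kind"`
    APIVersion string `json:"apiVersion"`
    Throttling struct {
        MaxStartingContainersPerCore string `json:"maxStartingContainersPerCore"`
        MaxLoadAverageMultiplier     string `json:"maxLoadAverageMultiplier"`
        MaxCPU                       string `json:"maxCPU"`
        MinRate                      string `json:"minRate"`
        MaxRate                      string `json:"maxRate"`
    } `json:"throttling"`
}

// parseMinionConfig decodes the configuration attached to a minion.
func parseMinionConfig(data []byte) (*MinionConfig, error) {
    var cfg MinionConfig
    if err := json.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}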

Dependency management proposal

“A POD is a collocated group of containers […] which are tightly coupled — in a pre-container world, they would have executed on the same physical or virtual host.” (extract from the PODs definition)

As they are tightly coupled, there might be dependencies between containers. In a “pre-container world”, they would have been spawned on a host by an init system able to handle dependencies.

We propose to enhance Kubernetes to use a dependency graph inside PODs to decide which containers can be started and which containers must wait for others.

Let’s consider, as an example, a POD with 5 containers linked together by the following dependencies:

[figure: dependency graph]

Such a dependency graph could be described in the POD json like this:

{
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "id": "app",
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "app",
      "containers": [{
        "name": "app_a",
        "image": "me/app_a",
        "livenessProbe": {
          "exec": {
            "command": "/check_A_health"
          }
        }
      },{
        "name": "app_b",
        "image": "me/app_b",
        "livenessProbe": {
          "exec": {
            "command": "/check_B_health"
          }
        },
        "dependsOn": [
          "mag"
        ]
      },{
        "name": "app_c",
        "image": "me/app_c",
        "livenessProbe": {
          "exec": {
            "command": "/check_C_health"
          }
        },
        "dependsOn": [
          "app_b"
        ]
      },{
        "name": "app_d",
        "image": "me/app_d",
        "livenessProbe": {
          "exec": {
            "command": "/check_D_health"
          }
        },
        "dependsOn": [
          "app_b"
        ]
      },{
        "name": "app_e",
        "image": "me/app_e",
        "livenessProbe": {
          "exec": {
            "command": "/check_E_health"
          }
        },
        "dependsOn": [
          "app_c",
          "app_d"
        ]
      }]
    }
  },
  "labels": {
    "name": "app"
  }
}

The POD start-up sequence is amended so that, when a POD is assigned to a minion, kubelet creates the following FSM for each container:

[figure: container FSM]

The containers are initially in the downloading state. Once the image is pulled, they move to the new blocked state.

When a container X reaches the blocked state, the state of all its dependencies are checked. If all of them are ready, the container X moves to starting immediately. This is for example always the case for the containers which have no dependency.

Then, when a container X passes the starting to ready transition, for each container Yi in the blocked state that depends on X, we check the state of all the dependencies of Yi. If all of them are ready, then the Yi container becomes starting.

In the example above, when the app_c container becomes ready, the state of app_d is checked. If it’s ready, then the state of app_e moves from blocked to starting.

Cycle detection

We must ensure that the dependency graph is a Directed Acyclic Graph (DAG), that is to say that we do not have circular dependencies among our containers.

This could, for example, be checked when parsing the json if we enforce a rule saying that the dependsOn list of a container can only contain containers previously declared in the file.
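
A sketch of that validation rule applied while parsing the manifest; because every dependsOn entry must refer to a container declared earlier in the file, cycles are impossible by construction:

package throttling

import "fmt"

type containerSpec struct {
    Name      string
    DependsOn []string
}

// validateDependencies enforces the rule that a container may only depend on
// containers declared before it, which guarantees the graph is a DAG.
func validateDependencies(containers []containerSpec) error {
    declared := map[string]bool{}
    for _, c := range containers {
        for _, dep := range c.DependsOn {
            if !declared[dep] {
                return fmt.Errorf("container %q depends on %q, which is not declared earlier in the manifest", c.Name, dep)
            }
        }
        declared[c.Name] = true
    }
    return nil
}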

Combining throttling with dependency management

Example of why dependency management needs to be combined with throttling

Let’s imagine a POD with 3 containers A, B and C. Those 3 containers can be started in any order in the sense that they won’t fail, crash or prematurely exit if the others are not there.

However, B and C need to communicate with A in order to become ready.

For example:

  • A is a database. It notifies its readiness as soon as it is ready to process requests.
  • B is an application that needs to connect to the database to configure itself. It notifies its readiness as soon as it is configured.
  • C is similar to B

The throttling settings limit to 2 the number of containers allowed to be in the starting state at the same time.

We have no dependency expressed in the POD json.

If we are lucky, things can happen in this order:

  • B starts. It cannot configure itself because A is not there. It is waiting for A.
  • A starts.
  • A is ready.
  • As A becomes ready, there is one “starting” slot available to make C start.
  • B connects to A. B configures itself and eventually becomes ready.
  • C connects to A, configures itself and becomes ready.

[figure: lucky flow]

Note that even if this works, before A becomes ready, B “consumes” a starting slot although it is not consuming resources. This is sub-optimal: if we knew that B depends on A, we could have started a container of another POD which doesn’t depend on A.

If we are unlucky, things can happen in this order:

  • B and C are started. They are waiting to be able to connect to A.
  • A is not started because we already have two containers in the starting state, which is the maximum allowed by our policy.
  • We’re dead-locked.

[figure: unlucky flow]

This trivial example shows that combining

  • limiting the number of processes starting at a time, and
  • having a process’s readiness conditioned on another process’s readiness

cannot work if we cannot enforce a start-up order. That’s why, if the readiness of some containers depends on the readiness of others, this dependency needs to be handled as described in the previous section.

Final picture

Here is the complete FSM of containers:

[figure: complete container FSM]

Some states have associated actions that are triggered when the state is entered:

  • downloading: the image is fetched with a docker pull;
  • blocked: the container can be created with docker create but not started;
  • starting: the container is started with docker start.

The transitions are defined as follows (a sketch follows the list):

  • When a POD is assigned to a minion, all its containers are put in the downloading state.
  • When the image is pulled, the container becomes blocked.
  • The dependencies of that container are then checked immediately. If they are all in the ready state, the container becomes pending; otherwise, it stays in the blocked state.
  • When there are containers in the pending state, the throttling mechanism regularly checks if conditions are met to take a pending container and make it progress to starting.
  • LivenessProbes regularly check if starting containers are ready. The first time a LivenessProbe reports a success, the container becomes ready.
  • If a container remains starting for longer than a pre-defined threshold without its LivenessProbe reporting success, the container becomes failed.
  • Each time a container leaves the starting state, we check all the dependencies of the blocked containers that depend on it. If they are all ready, then that blocked container becomes pending.
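
To make the complete FSM concrete, here is a minimal sketch, not kubelet code, of the states and of the dependency-driven transitions described above (the pending to starting transition itself is driven by whichever throttling policy is configured):

package throttling

type containerState int

const (
    downloading containerState = iota // docker pull in progress
    blocked                           // image pulled, waiting for dependencies (docker create allowed)
    pending                           // dependencies ready, waiting for the throttling policy
    starting                          // docker start issued, readiness probe not yet successful
    ready                             // readiness probe succeeded
    failed                            // stayed in starting longer than the time-out
)

type container struct {
    name      string
    state     containerState
    dependsOn []string
}

// unblock moves a blocked container to pending once all its dependencies are ready.
func unblock(c *container, all map[string]*container) {
    if c.state != blocked {
        return
    }
    for _, dep := range c.dependsOn {
        if all[dep].state != ready {
            return
        }
    }
    c.state = pending
}

// onLeaveStarting is called when a container leaves the starting state
// (readiness probe succeeded, or the start time-out expired): re-check every
// blocked container that depends on it.
func onLeaveStarting(name string, becameReady bool, all map[string]*container) {
    if becameReady {
        all[name].state = ready
    } else {
        all[name].state = failed
    }
    for _, c := range all {
        for _, dep := range c.dependsOn {
            if dep == name {
                unblock(c, all)
            }
        }
    }
}
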
@thockin thockin commented Jan 8, 2015

Given the length and girth of this, I am going to respond here as I read it, then come back and edit.

First, there's clearly a lot of thought that went into this, so thank you.

I do not get a strong statement of problem. I ACK that starting a lot of stuff at once can make everything slow, but I do not have a sense of the real impact of this, and what the desired behavior should be. Are you shooting for determinism? Good luck. Are you shooting for some desired distribution of start-times, with a 95p less than X? Are you shooting for a priority-informed mechanism?

Running does not imply Ready

ACK. #620 agrees with this.

It slows down the ready containers which are already running

This is first and foremost an isolation failure. It's the sort of thing that strong isolation (such as Docker does not provide) prevents. By way of comparison, this practically can not happen in Google-internal clusters because we implement strong isolation (see https://github.com/google/lmctfy for more details). We may be able to mitigate this effect through heuristics as you describe, but these are not guarantees. I believe the longer term path MUST involve strong isolation for these guarantees to be met.

Strong isolation means that EVERYONE is isolated. As soon as someone is not isolated, all guarantees are moot.

This issue can be solved by adjusting the priorities of the different PODs

What sort of priority do you mean? The kernel concept "nice"? CPU cgroup "shares"? Or something else? (There are many knobs that determine process scheduling behavior.)

Our processes have internal health checks that check that the starting phase does not last longer than a pre-defined time-out.

This is really only guaranteeable with strong isolation. It might work some times, but as you experience, it fails, too.

Having an additional layer with a slice per POD

We are proponents of moby/moby#8551 (you said 9551).

When a POD is assigned to a minion, kubelet creates the following FSM

The net result here is going to be longer startup times for some containers and shorter for others. Is that really better? This will not fix the non-determinism of starting a container on a machine which has unisolated actors. It will penalize pods that are isolated - I think no matter what heuristic we decide on, those pods should get a free pass to the front of the line. That said, we need to REALLY understand the live interaction between unisolated containers and isolated ones - I assert that it should be sane, and if not, it is a bug.

We also need to be very careful how we handle DoS scenarios here - this makes it easy for someone to (accidentally, of course) bombard the system with low-priority, slow-starting pods, and flood out real work.

On policies: They are all either heuristic and bound to fail or incredibly complicated. We could maybe start simpler, assuming we follow something like this at all, and say that isolated pods (assuming they "fit") go first and get the resources they asked for. Un-isolated pods get whatever is left, and they are gated by some simple policy (which is heuristic and bound to fail). The point here being that we should start with SIMPLE until we really understand the problem, which I claim we don't yet (because of the use of cgroups being totally busted as-is).

configuration file attached to minions

We may get here, but I think we're a long way from this level of granularity of config. And we should go here with GREAT CAUTION. Every knob you add brings additional cognitive burden on admins and complexity to the code. We should ONLY expose knobs that we really expect people to turn with significant effect.

dependency graph inside PODs

We really REALLY want to not do this. It brings a lot of complexity to the API and to the code overall, including lots of new failure modes. Given that containers can fail independently, we either react to failures by ensuring graph ordering and killing dependent containers or we assert that unsatisfied dependencies must be handled at runtime, in which case startup time is just a special case of runtime. That aside, you can implement dependency ordering yourself through a volume - I do not want this in the core.

Now, I can see an argument to be made for simpler startup ordering, but it's more about stages than dependencies. Do everything in stage 1 before anything in stage 2. This allows people to express situations where a pod has a "data sync" container that really wants to complete before apps start. But this can also be done without core system support, so we have not pursued it yet.

@danmcp danmcp commented Jan 14, 2015

@thockin Some comments:

I do not get a strong statement of problem. I ACK that starting a lot of stuff at once can make everything slow, but I do not have a sense of the real impact of this, and what the desired behavior should be. Are you shooting for determinism? Good luck. Are you shooting for some desired distribution of start-times, with a 95p less than X? Are you shooting for a priority-informed mechanism?

In addition to making things slow, it would make a lot of cases not work at all. Timeouts for many components may start to kick in.

I don't think the goal is to make a concrete statement on exact performance of lots of parallel startups, but more to allow the amount of parallelism/load to be tuned. And a particular installation of kubernetes could tune to their needs.

This is first and foremost an isolation failure. It's the sort of thing that strong isolation (such as Docker does not provide) prevents. By way of comparison, this practically can not happen in Google-internal clusters because we implement strong isolation (see https://github.com/google/lmctfy for more details). We may be able to mitigate this effect through heuristics as you describe, but these are not guarantees. I believe the longer term path MUST involve strong isolation for these guarantees to be met.

Strong isolation means that EVERYONE is isolated. As soon as someone is not isolated, all guarantees are moot.

Isolation comes at a cost with the lack of overcommit. Not all installations are willing to pay that. Allowing overcommit with throttling is the desired alternative. If the selected tuning still results in failures, then retries are acceptable.

@smarterclayton Any thoughts here?

@erictune erictune commented Jan 14, 2015

It is possible to have both isolation and overcommit. There are a bunch of ways to achieve it.

One way that fits this problem well is for pods/containers to update their resource requirements as they go through phases of their lifetime. This works particularly well if the requirements are initially high, and then drop. Which is the case here. Basically, when the client makes the transition from starting to ready, you want something to send an update to the pod description to reduce the resource requirements of the pod. This can be done without introducing any new concepts into the core of kubernetes.

@danmcp danmcp commented Jan 14, 2015

@erictune Isn't that just shifting resources around? At any one point in time no process can get more than its portion of a completely divided space and there is never any overcommit.

I don't think it solves the problem either. Let's say you have an environment where you have 20% of CPU available on a node. But you need to start another 10 containers. Doing so would take all the CPU if it could. And at the same time you don't want to hurt the other 80% of what's going on (as much as possible). So at best you could give the pod(s) that need to start the 20%. But that might not be enough to start the pods and all their containers at the same time without them timing out. Hence the desire to let the pods and their containers have a throttled/staggered start.

@erictune erictune commented Jan 14, 2015

You are right that is not overcommit. But it is going to allow for good utilization and isolation.

I'm unsure as to why you want to start so many containers in a single pod. That prevents the scheduler from putting your extra containers onto another node that has free CPU. Would it work to have one container per pod and lots more pods?


@smarterclayton smarterclayton commented Jan 14, 2015

In general, simultaneous starts that are roughly >= the number of available cores (divided by some factor for cache startups) effectively waste resources due to context switches and cache contention. The start throttle could probably be correlated with cores alone and result in an effective outcome for any overcommit (containers >> cores) scenario. For under-commit / near-commit scenarios (containers ~= cores), you can get away without throttling.


@danmcp danmcp commented Jan 14, 2015

@erictune I agree most cases don't have a lot of containers per pod. This particular case uses containers as workers and needs the collocation for sharing resources between those containers. I have struggled to find good use cases for collocation within pods as most use cases tend to want to scale on their own. So I hate to try and design around the cases that really want the pod collocation. The issues are less pronounced when dealing with just pods (1:1 with containers) but many pods starting still means many containers. And so even with 1:1 scenarios, throttling has a use case.

@erictune erictune commented Jan 14, 2015

@smarterclayton

I'd be happy if you could show a realistic example where overheads were measurably high. I don't think you'll find one.

Say you have 4 containers each trying to use a whole CPU for every actual (v)cpu. A long context switch time would be 100 microseconds. A typical wait-to-run time for a thread in a fair-share scenario could be 50 or 100ms. So, 0.1 to 0.2% overhead. Cache varies considerably but on average it is going to be in the 1% to 10% range.

Even if the overhead is 10%, I think you will easily recover this loss and more from having good isolation. Good isolation allows mixing workloads from disparate tenants or teams, which reduces bin-packing overhead, and predictable performance reduces resource hoarding behavior in organizations and allows operators to run confidently with smaller reserves.

@smarterclayton smarterclayton commented Jan 14, 2015

In tests, running ~100 containers per host (where individual containers during steady state are consuming percentages of CPU and blocked on IO), startup of the JVMs in parallel roughly doubled the total time for startup, whereas an ordered startup (allow 1 per core to start at a time) was 25-40% faster due to other contention.

In our Online development test cases (equivalent scenarios) with a CPU overcommit of around 4000% to 5000% (40x-50x) and memory overcommit of around 800% (8x), host reboots result in startup times that were 5-8x longer than when the containers were started at 1 per core at a time.

Some of this is IO contention, but I think you're assuming that steady state CPU is equivalent to startup CPU, and in our experience that is not the case for service heavy workloads (many relatively lightly used services). Development focused environments with long periods of very low traffic are extremely dense, and that's pretty common. JVM startup CPU for most of the apps we see in the wild is disproportionate to even moderate use. And during reboots, when MTTR is important, we try to ensure startup is as fast as possible.


@erictune erictune commented Jan 14, 2015

If the individual containers are blocked on i/o rather than blocked on CPU, then it sounds like the problem is mostly, as you say, around I/O contention.

Staggering the restart of pods after a reboot event with the intent to reduce i/o contention seems like a reasonable thing to do, if that can be done without making the system reason about dependencies.


@lavalamp lavalamp commented Jan 14, 2015

I was chatting with someone about making kubelet use our QPS rate limiter to rate limit pod starts. Specifically, I was thinking about having one with a fairly high rate limit for ordinary pod starts, and a second one with a fairly low rate limit for restarting pods that have crashed.
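
A standalone sketch of that idea, using golang.org/x/time/rate as a stand-in for whatever rate limiter kubelet would actually use: one generous limiter for ordinary pod starts and one stingy limiter for restarting crashed pods. The numbers are illustrative.

package main

import (
    "context"
    "fmt"

    "golang.org/x/time/rate"
)

// Two token buckets: ordinary pod starts at 5 per second (burst 5),
// restarts of crashed pods at 1 every 10 seconds.
var (
    startLimiter   = rate.NewLimiter(rate.Limit(5), 5)
    restartLimiter = rate.NewLimiter(rate.Limit(0.1), 1)
)

// waitToStart blocks until the appropriate limiter grants a token.
func waitToStart(ctx context.Context, isCrashRestart bool) error {
    if isCrashRestart {
        return restartLimiter.Wait(ctx)
    }
    return startLimiter.Wait(ctx)
}

func main() {
    ctx := context.Background()
    if err := waitToStart(ctx, false); err == nil {
        fmt.Println("ordinary pod start allowed")
    }
    if err := waitToStart(ctx, true); err == nil {
        fmt.Println("crash restart allowed")
    }
}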

@danmcp danmcp commented Jan 15, 2015

@lavalamp I think the rate limiting needs to be at a container level to solve the problems outlined in the study.

@erictune @thockin @smarterclayton
Regarding the rest of the comments: is there an argument against some sort of rate limiting on starts being valuable in a multi-tenant (not completely isolated) environment? The request seems like a fairly common use case for any cost-conscious (most real-life) deployment.

Regarding dependencies, I don't disagree at all that dependency ordering/tracking should be the last resort. But I would at least like to explore how the problem can be offloaded cleanly as a user problem to solve.

@bgrant0607 bgrant0607 commented Feb 14, 2015

Thanks for the thorough proposals.

We have added readiness probes: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/v1beta3/types.go#L367

We plan to create cgroups for pods, as described here: moby/moby#8551 (comment)

As for "priorities", possibly you want differentiated quality of service? #147

I'm not enthusiastic about supporting inter-container dependencies within the pod. See #1996 (comment) . The interaction between throttling and dependencies that you point out makes me even more wary of allowing dependencies to be specified. @thockin also pointed out several additional subtle consequences.

DoS is just as possible by putting a number of containers with very large images into pods, so I'm less concerned about that specific issue, though.

I could buy throttling simultaneous docker pulls if that's part of the problem, since those are not isolated. We have had similar problems with package installation internally.

I'm also not opposed to limiting container starts. We've also had issues with JVM startup time, and I could easily believe that simultaneous starts exacerbate the problem.

Have you thought about heuristics for detecting problematic containers? For example, is image size a good indicator? I'd rather penalize the "troublemakers" than slow down everything. In general, except for crashlooping containers, I'd want to prioritize restarts of previously running containers over new ones. If we were to introduce differentiated quality of service, I'd also want to use that in the priority function.

Right now, however, the sync code is pretty simple:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1119
Do you have a specific suggestion about how you'd restructure it? If we could bound the complexity and exposed knobs, and make the feature optional, then we could merge the change and try it out. Or, you could just patch your own fork, try it, and send us details about your performance results.

How quickly do newly started "troublemakers" start to consume lots of resources? Maybe we could figure out a resource-based approach that would be more general/principled than limiting simultaneous container starts. We plan to eventually make Kubelet do admission control on new pods based on available resources. The implication here is that during startup there aren't available resources (at least cpu and iops) and new containers are started anyway. Doing admission control would require that Kubelet rate-limit pod starts somewhat in order to observe resource usage, and it could reject or delay subsequent pods until resources were available. Once this data is propagated to the scheduler, it should back off until usage subsides, as well.

@smarterclayton How are you configuring overcommit? The standing proposal is to add Requests to ResourceRequirements, to enable scheduling requests that are lower than the limits, but this hasn't been implemented yet.

Also, note that we don't really have a mechanism for Kubelet configuration yet. Discussed in #1627.

@smarterclayton smarterclayton commented Sep 8, 2016

@derekmahar derekmahar commented Sep 8, 2016

@smarterclayton Okay, so in my example, I could run Liquibase in an init container which will restart until the database is available. How might the Liquibase container inform Kubernetes that it is complete so that Kubernetes may then proceed to start the application container?

@derekmahar derekmahar commented Sep 8, 2016

I think @thockin answered my questions in #127 (comment), though in the case of a wrapper script, unless a Dockerfile is available, it's not always clear how the wrapper script can execute an application in a third-party container.

@derekmahar derekmahar commented Sep 8, 2016

@smarterclayton On second thought, I couldn't run the Liquibase container as an init container because it requires the database which won't run until all of the init containers are complete, correct?

@derekmahar derekmahar commented Sep 8, 2016

@thockin @bgrant0607 @smarterclayton

I think lifecycle event hooks and shared volumes are sufficient to control container execution order, but the documentation does not make this very obvious. This use case seems common and difficult enough that it deserves to be highlighted in the documentation.

Once I've come up with a solution to the example problem that I raised, I'll publish the solution. In which forum is it best to publish such solutions? Stack Overflow?

@thockin thockin commented Sep 8, 2016

We'd be happy to take a doc and/or host a Kube blog post!


@derekmahar derekmahar commented Sep 9, 2016

The missing piece to this puzzle is still Kubernetes continuously restarting the Liquibase container. The simplest workaround that I could think of to resolve this issue is to simply prevent the Liquibase container from exiting. In its PreStop container lifecycle hook, the container would sleep indefinitely. I think this would be acceptable as a container that has completed its task, but is dormant, would likely hold minimal resources.

@smarterclayton smarterclayton commented Sep 9, 2016

We really don't intend for "run once" containers to be mixed with "run forever" containers like this, without modeling it as an init container (before run forever containers start). However, a variant of your idea is simply that liquibase at the end of its "job" simply exec out to a much smaller process (replacing itself with a very small stub) and stay alive that way.

The real gap is ultimately pre-start hooks that can block. That's something Docker is not capable of today but rkt and OCI containers are. It's possible that we could implement a pre-start hook that is started exactly like its target container, but does not share a filesystem. That would be surprising to end users in some cases, but it could be viewed simply as a limitation of docker. If pre-start fails, we would still have to retry (like an init container).

@derekmahar derekmahar commented Sep 9, 2016

However, a variant of your idea is simply that liquibase at the end of its "job" simply exec out to a much smaller process (replacing itself with a very small stub) and stay alive that way.

Could this "much smaller process" simply be an infinite sleep loop in a shell script?

The real gap is ultimate pre-start hooks that can block. That's something Docker is not capable of today but rkt and OCI containers are.

Are you referring to the distinction between moby/moby#6982 and Docker events? It's unfortunate that moby/moby#6982 is still not resolved (nor is it assigned, actually).

@derekmahar derekmahar commented Sep 12, 2016

However, a variant of your idea is simply that liquibase at the end of its "job" simply exec out to a much smaller process (replacing itself with a very small stub) and stay alive that way.

Could this "much smaller process" simply be an infinite sleep loop in a shell script?

I stumbled across a sidecar container example that simply invokes tail -f /dev/null to stay alive after it's done its job. See third item in the explanation. I think I'll use that trick instead of an infinite sleep loop.

@fejta-bot fejta-bot commented Dec 17, 2017

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@cesartl cesartl commented Dec 17, 2017

This is an old issue but it is still relevant for me today. I mostly run Java processes on Spring Boot, which are very CPU hungry when they start. If too many pods start on the same node, they compete for CPU and end up being killed for not starting fast enough.

My workaround is a script that scales the pods up a little at a time, rather than all at once.

Another issue I face, not directly related to this one, is the scheduler not being aware of the actual CPU usage on the node. When a few pods are starting on a node with little spare CPU, I try to kill one of them, but the scheduler puts the same pod right back on that node (according to CPU limits and requests it is the best node, but in practice it is not, as the node has no CPU left).

What would be useful is a command that kills a pod and forces the scheduler to start it on a different node. If someone else finds that useful, I could create a separate issue.

@wwadge wwadge commented Dec 28, 2017

@cesartl My workaround is to call this as the very first thing before anything else:

// Requires java.util.Random and static imports of
// java.util.concurrent.TimeUnit.MILLISECONDS and SECONDS; `log` is the class logger.
public static void kubernetesStartupSleep() {
    if (!isRunningLocal()) {
        String sleepRandom = System.getenv("SLEEP_RANDOM");
        // sleep between 7 and 7 + SLEEP_RANDOM (default 30) seconds
        int randSleep = 7 + new Random().nextInt((sleepRandom == null) ? 30 : Integer.parseInt(sleepRandom));
        log.info(String.format("Kubernetes detected. Sleeping for a random time (%d seconds) to avoid saturating all CPUs", randSleep));
        try {
            Thread.sleep(MILLISECONDS.convert(randSleep, SECONDS));
        } catch (InterruptedException e) {
            // not important
        }
    }
}

where isRunningLocal() just checks: System.getenv("KUBERNETES_SERVICE_HOST") == null

The idea here is that the containers get some random waits to avoid saturating CPU by all trying to startup at the same time. It's crude, but effective.

@jordanjennings jordanjennings commented Dec 28, 2017

@wwadge Good suggestion. Taking it one step further, you could do a random wait in an init container so that you don't have to modify application code.
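
A sketch of that variant, assuming a tiny program baked into a small image and run as an init container; the SLEEP_RANDOM variable mirrors the snippet above, and none of this is required by Kubernetes itself.

package main

import (
    "fmt"
    "math/rand"
    "os"
    "strconv"
    "time"
)

func main() {
    // Upper bound of the random sleep, like the SLEEP_RANDOM variable above.
    max := 30
    if v := os.Getenv("SLEEP_RANDOM"); v != "" {
        if n, err := strconv.Atoi(v); err == nil && n > 0 {
            max = n
        }
    }
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    d := time.Duration(7+r.Intn(max)) * time.Second
    fmt.Printf("init: sleeping %s to stagger pod start-up\n", d)
    time.Sleep(d)
    // When this init container exits, the pod's application containers start.
}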

@fejta-bot fejta-bot commented Jan 27, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@fejta-bot fejta-bot commented Feb 26, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@kareiva kareiva commented May 10, 2018

The content of this study resonates pretty well with our production deployment. Therefore I'd like to add a +1 here.

@borekb borekb commented Aug 28, 2018

In our case, we have nodes that host ~100 pods (~300 containers) and those pods are bound to them due to a dependency on local disk. If the node restarts, most pods don't manage to start within the initialDelay of the liveness probe (we could probably adjust that...), kubelet kills them and the cycle repeats. It takes quite some time for all pods to become healthy.

Our current workaround is to have init containers that use a shared host-mounted folder and:

  • Write a file there whenever a "startup slot" is available and a pod can be started. The file written contains a timestamp.
  • Check the directory for the number of files with a timestamp less than x seconds old.

Plus, there is some cleaning up of old files. This allows us to specify that e.g. 5 pods can start every 60 seconds (roughly sketched below).
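
A rough sketch of what such an init container could do, under the same assumptions (a host-mounted directory shared by all pods on the node, e.g. 5 slots per 60 seconds); it is illustrative only and, in particular, ignores the race between checking and claiming a slot.

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "time"
)

const (
    slotDir    = "/host/startup-slots" // host-mounted folder shared by all pods on the node
    maxSlots   = 5                     // at most 5 pods may start...
    slotWindow = 60 * time.Second      // ...per 60-second window
)

// recentSlots counts slot files younger than slotWindow and removes old ones.
func recentSlots() (int, error) {
    entries, err := os.ReadDir(slotDir)
    if err != nil {
        return 0, err
    }
    n := 0
    for _, e := range entries {
        info, err := e.Info()
        if err != nil {
            continue
        }
        if time.Since(info.ModTime()) < slotWindow {
            n++
        } else {
            os.Remove(filepath.Join(slotDir, e.Name())) // clean up old slot files
        }
    }
    return n, nil
}

func main() {
    for {
        n, err := recentSlots()
        if err == nil && n < maxSlots {
            // claim a slot: write a file whose mtime is the timestamp
            name := filepath.Join(slotDir, fmt.Sprintf("%d-%d", os.Getpid(), time.Now().UnixNano()))
            if err := os.WriteFile(name, nil, 0o644); err == nil {
                return // init container exits; the pod's main containers may start
            }
        }
        time.Sleep(5 * time.Second) // wait for a free startup slot
    }
}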

I'm not sure how uncommon our scenario is but we wished Kubernetes implemented some sort of throttling out of the box.

@henrydcase henrydcase commented Dec 5, 2018

Interesting concept. Also quite familiar for some reason

@SleepyBrett SleepyBrett commented Apr 30, 2019

We see this every time a node reboots. A new node takes probably 6-9 minutes to get started and get all the daemonsets running. When the node reboots, the stampede becomes a blocker, exacerbated by the fact that none of the other pods can start before our flannel pod starts. So it's a crapshoot as to when it will become stable. After a reboot it can take 30-45 minutes for the node to stabilize.

(110 pod limit m4.10xlarge, avg containers per pod around 1.5ish)

@gordonbondon gordonbondon commented Apr 30, 2019

We've been using this https://github.com/serhii-samoilenko/pod-startup-lock project to fix this problem for some time.

@odays odays commented Aug 16, 2019

I would like to add a +1 to this. This is an issue I am seeing on a number of clusters: like most of the commenters above, when running Java-based services, the start-up time of the JVMs seems to increase exponentially when a node is restarted and lots of JVMs try to start at once. As described above, it is fine when startup is restricted to roughly one container per CPU on the node; as soon as it exceeds this, I see pods never starting due to timeouts, and human intervention is required to scale RCs down and start them slowly.

I am currently trying to write an init container that calls the kube API to check what is starting on the node, but since there is no distinction between initializing (init containers running but not yet complete), starting and running, this is hard to determine.
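
For illustration, a client-go sketch of that check: it lists the pods bound to the node and counts those whose containers are running but not yet ready, which is the closest approximation of "starting" the API exposes (assumes in-cluster credentials with permission to list pods, and a NODE_NAME environment variable injected via the downward API).

package main

import (
    "context"
    "fmt"
    "os"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// startingPodsOnNode counts pods on the given node that are Running but have
// at least one container that is running and not yet Ready ("starting").
// Pods still executing init containers are Pending and are skipped here.
func startingPodsOnNode(ctx context.Context, nodeName string) (int, error) {
    cfg, err := rest.InClusterConfig() // assumes RBAC allowing "list pods"
    if err != nil {
        return 0, err
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return 0, err
    }
    pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
        FieldSelector: "spec.nodeName=" + nodeName,
    })
    if err != nil {
        return 0, err
    }
    starting := 0
    for _, p := range pods.Items {
        if p.Status.Phase != corev1.PodRunning {
            continue
        }
        for _, c := range p.Status.ContainerStatuses {
            if c.State.Running != nil && !c.Ready {
                starting++
                break
            }
        }
    }
    return starting, nil
}

func main() {
    n, err := startingPodsOnNode(context.Background(), os.Getenv("NODE_NAME"))
    if err != nil {
        panic(err)
    }
    fmt.Println("pods currently starting on this node:", n)
}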

@sharkymcdongles sharkymcdongles commented Nov 30, 2019

This needs to be added for sure, given that things like Spring Boot peg the CPU hard on initial boot and then just idle.
