On out-of-memory (OOM) restarts all Swarm-based containers, plus some inconsistencies #29941
tl;dr: OOM causes:
Possibly related to #29854, but not sure.
How to reproduce:
Issues occur with:
Steps to reproduce:
(Also, a weird thing: sometimes after "Killed", stdin gets left open to the original container.)
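For illustration, a minimal sketch of this kind of setup; the image names and the hog command are assumptions, not the exact commands used:

```bash
docker swarm init
docker service create --name whoami --replicas 2 emilevauge/whoami  # Swarm-managed tasks
docker run -dit --name standalone alpine sh                         # plain container, not managed by Swarm

# Memory hog: tail buffers /dev/zero forever, growing until the kernel OOM-kills it
docker run --rm -it alpine tail /dev/zero
```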
I have observed a couple of possible outcomes, occurring seemingly at random:
Outcome a) nothing special happens
Memory hog gets OOM-killed of course, and everything else keeps running.
Just run the memory hog command again to observe the more interesting outcomes.
Outcome b) all Swarm tasks restart (most probable) but alpine container survives
Pretty much instantly you can see the effect of Swarm-based tasks restarting:
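For example, assuming the service is named whoami:

```bash
docker service ps whoami                          # old tasks show as Failed/Shutdown, fresh replacements as Running
docker events --since 15m --until "$(date +%s)"   # daemon events around the OOM window, without streaming
```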
Logs (09:30:25 was when the kernel OOM'd):
Outcome c) memory hog killed, but still listed as "running"
Memory hog got "Killed" but still listed as running:
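One way to double-check the discrepancy (container name illustrative):

```bash
docker ps --filter name=hog                   # still shows "Up ..." despite the process being gone
docker inspect -f 'status={{.State.Status}} oom-killed={{.State.OOMKilled}} pid={{.State.Pid}}' hog
kill -0 "$(docker inspect -f '{{.State.Pid}}' hog)" && echo "pid alive" || echo "pid gone"
```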
Outcome d) Swarm tasks restarted AND with an incorrect number of replicas
The running container list should be: one Alpine image and two Whoami images, but:
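Commands of this kind expose the mismatch (service names assumed):

```bash
docker service ls                             # REPLICAS column, e.g. whoami 2/2
docker service ps --no-trunc whoami           # task list; may show more Running tasks than expected
docker ps --format '{{.Names}}\t{{.Image}}'   # actual containers on the host, for comparison
```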
My environment: Host Win10 -> Vagrant -> VirtualBox -> Ubuntu VM.
Keywords for search:
This looks to be what should happen; an OOM kill is done by the kernel. Swarm is meant to make sure that your services / tasks stay "up". If a task was killed by the kernel, Swarm will create a new task to replace it.
Anything is possible if your system is OOM. The kernel will start killing off random processes; in this case it decided not to kill the alpine container.
There's not really anything that Docker can do about that; by default, a negative oom-score adjustment is applied to the docker daemon itself so it is less likely to be killed.
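For reference, the related knobs (values illustrative):

```bash
# dockerd applies a negative OOM-score adjustment to itself (default -500, tunable):
dockerd --oom-score-adjust -500
# A container's OOM priority can be raised so the kernel prefers to kill it:
docker run -d --oom-score-adj 500 --memory 256m alpine tail /dev/zero
```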
This sounds like a bug; however, strange things can happen if a machine completely runs out of memory. /cc @mlaventure
Can you check if
Note that Swarm will stop monitoring a container if it's stopped/restarted out-of-band (e.g. if I manually `docker stop` a task's container).
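For example (service/task names illustrative):

```bash
docker ps --filter name=whoami                # find a Swarm task container
docker stop whoami.1.xyzabc                   # stop it out-of-band, bypassing Swarm
docker service ps whoami                      # Swarm notices the failure and schedules a replacement
```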
In this case, it's possible that Swarm was temporarily not able to monitor the container (due to the OOM), or possibly another process (such as
The reason for not forcibly stopping the container is that exited tasks are kept around to allow you to inspect/analyze them (e.g. to find out why they exited), which may involve starting them.
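For example, the exited task containers remain available for post-mortem (names illustrative):

```bash
docker ps -a --filter status=exited
docker logs whoami.1.xyzabc                   # why it exited
docker inspect -f 'exit={{.State.ExitCode}} oom-killed={{.State.OOMKilled}}' whoami.1.xyzabc
```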
In general it is recommended to:
Swarm tasks were never killed by the kernel. The oom-killer only ever killed the most offensive process: the container from image
All the commands I used were listed in the issue description - the service was created with
According to the logs, the oom-killer always killed the most offensive process: the memory hog.
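For reference, the kernel's victim selection can be confirmed from the kernel log:

```bash
dmesg | grep -iE 'out of memory|killed process'
journalctl -k | grep -i oom                   # on systemd hosts such as Ubuntu 16.04+
```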
Those are sound recommendations, but I feel that this kind of infrastructure should be robust even under the most hostile circumstances. And whatever the recommendation, dedicating an entire VM (or three, for an HA setup) to the manager role is a cost issue for hobby use and small businesses, so in practice it is not always feasible to do what's recommended, only what we can afford.
I didn't touch the Swarm-based containers manually; all I used were the commands mentioned in the issue. I repeated these tests many times, tearing down the VM and starting fresh each time, so I am confident this is not user error. These symptoms are really easy to replicate - I reproduced them by following the issue description 1:1.
I will check this and report back!
I suspect Swarm was temporarily unable to monitor the processes, causing it to falsely regard the containers as not running. Perhaps @aaronlehmann has more insight into this.
It took me about four tries to produce a situation where the same Swarm task now has two replicas (outcome d). Here's all that I did:
edit: logs of above in separate gist
Perhaps something to do with this?
It also does look like
Given that the message is
@DanielBodnar unfortunately nope, I didn't figure out a solution.
And I haven't tested this scenario with recent Docker versions, but it's sad to hear if this still happens, because being left with an incorrect number of replicas running sounds like a serious issue that should be addressed.
As a "fix" for this I've been stressingly obsessed with the memory use of my containers to try to avoid OOM issue in production. It's a good practice to set max memory limits per container.