docker 1.9.1: No available IPv4 addresses on this network's address pools: bridge #21523

Closed
yujuhong opened this issue Feb 19, 2016 · 59 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@yujuhong
Contributor

I ran into this issue twice today by running pod creation/deletion tests (100 pods per node).

After reading @aboch's comments in the issues (e.g., moby/moby#18535 (comment)), all three issues below have the same root cause.

After screening the fixes in libnetwork during that time, it seems like moby/libnetwork#771 might be the fix (though I could be totally wrong given my limited knowledge of networking).

Since we may go with docker 1.9.1 for v1.2, any thoughts on how to proceed with this problem?

(BTW, I couldn't find any existing issue on this. Feel free to close this one if there is already one)

/cc @dchen1107

@yujuhong yujuhong added sig/node Categorizes an issue or PR as relevant to SIG Node. team/cluster labels Feb 19, 2016
@bprashanth
Contributor

I think the meta problem is docker made bridge networks persistent across restarts in 1.9 and fixed several bugs between then and 1.10. If we just want the old behavior, we might be able to get away with nuking local-kv.db on each restart. Some questions:

  • What does docker network inspect bridge show, 255 ips actually used?
  • How often does one hit it with 30 pods per node?
  • How hard do we need to hammer things to fix it? (gc all containers, restart docker, delete /var/lib/docker/network).
  • Any chance we can pull one of the 1.10 RCs that don't have content-addressable storage?

There are some intense solutions like writing our own IPAM plugin or cloning libnetwork IPAM from 1.10 as a "plugin" (assuming it has the fix), but it's too high risk for 1.2.0 IMO without pushing the release date.
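
For the first question, a rough way to see how many IPv4 addresses libnetwork thinks are allocated on the default bridge (a sketch using the docker CLI; the second variant assumes jq is installed):

# count endpoints that hold an IPv4 address on the default bridge
docker network inspect bridge | grep -c '"IPv4Address"'
# or, with jq, count entries in the Containers map
docker network inspect bridge | jq '.[0].Containers | length'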

@kubernetes/goog-cluster

@bprashanth bprashanth added this to the v1.2-candidate milestone Feb 19, 2016
@thockin
Member

thockin commented Feb 19, 2016

I'm all for nuking the DB - there's no real value to us in having consistent IPs across docker restarts, certainly not if it is as buggy as it sounds.

Is it fixed in 1.10?

@bprashanth
Contributor

Is it fixed in 1.10?

According to bug reports most persisting-networks-cross-restart problems are fixed in 1.10, but 1.10 also has content-addressable storage. It sounds like quite a few fixes made it into "1.9.2", which was later just converted into a 1.10-rc.

I'm all for nuking the DB - there's no real value to us in having
consistent IPs across docker restarts, certainly not if it is as buggy as
it sounds.

I believe the only real reason they persist the bridge network across restarts is so that cross-host networks work consistently. This is a feature we don't care about.

IMO nuking a file across O(100) nodes is too hard to do manually, but we can consider just documenting this if it's rare enough with a lower number of pods per node.

@yujuhong
Contributor Author

What does docker network inspect bridge show, 255 ips actually used?

No, only ~92 IPs were used around that time.

How often does one hit it with 30 pods per node?

I have no idea how often this hits with 30 pods per node.

How hard do we need to hammer things to fix it? (gc all containers, restart docker, delete /var/lib/docker/network).

I did all three: GC'd all containers, deleted /var/lib/docker/network, and restarted docker. It didn't work.
Someone said nuking everything in /var/lib/docker worked: moby/moby#18527 (comment)

According to moby/moby#18535, two containers can end up with the same IP address. I didn't check whether this happened in my cluster, but it would have been harder to notice.

@bprashanth
Contributor

We need to figure out a reliable repro, with whatever max-pods we're going to ship with, to understand the impact. @kubernetes/goog-node what's the chance of just going back to 1.8?

@zmerlynn
Member

What are the chances of validating 1.10? :)

@onorua
Contributor

onorua commented Feb 20, 2016

1.8 has the same issue; we use 1.8.3 and got nasty problems like this. You can reproduce it easily with a pod that has a "wrong" liveness probe: it will be rescheduled between nodes until the IP pool is completely used up. So it actually doesn't matter how many pods you have, but rather how often you restart the docker daemon or the node completely.
Docker asked us to upgrade to 1.9 in moby/moby#14788, which apparently did not fix it. Anyway, you don't want to go the libnetwork way - so why do you need the broken IP allocator in this case?
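
For illustration, a minimal pod of the shape described above (a sketch; the bad-liveness name and busybox image are placeholders, and the probe is simply one that always fails):

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: bad-liveness
  labels:
    app: bad-liveness
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    livenessProbe:
      exec:
        command: ["false"]
      initialDelaySeconds: 5
      timeoutSeconds: 1
EOF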

@bprashanth
Contributor

@onorua 1.8 had its problems (e.g. #19477), but those are just bugs that we need to live with; most of them have workarounds, and they only occur with a high number of pods per node or in some obscure case like HostPort. 1.9 made the leap to using libnetwork with persistent storage for bridge networks, and that brought a host of other networking issues.

Whether we use libnetwork or not in the long run, we need to ensure that this release (which will use the default docker bridge plugin) is stable. Can you please file a different bug with your exact repro so we can consider the impact?

@bprashanth
Contributor

What are the chances of validating 1.10? :)

Isn't 1.10 going to be a major leap for hosted offerings like GKE because of the overhead involved in migrating to content-addressable storage?

@thockin
Member

thockin commented Feb 21, 2016

So to catch up a bit: the bug exists in d 1.9, probably not in d 1.10, but d 1.10 is probably too late to validate for k 1.2. The suggested workarounds don't actually work (can we re-confirm that?). It's possible for pods to get the same IP.

We can and should add logic to Kubelet to detect multiple pods with the same IP, even if it is incomplete (wrt non-kube containers).

The rest is pretty scary, if I understand correctly. Someone tell me I am misunderstanding?

@bprashanth
Contributor

The rest is pretty scary, if I understand correctly. Someone tell me I am
misunderstanding?

We haven't invested the effort in getting a reliable repro. From what I can tell 1.9 is just bad for networking. @dchen1107 mentioned that we might still go back to 1.8.

We can and should add logic to Kubelet to detect multiple pods with the
same IP, even if it is incomplete (wrt non-kube containers).

Sounds practical. On detecting an IP conflict we should probably delete all participating pods, or sort by creation timestamp and keep the first. Not sure I'd want to put this in each podworker; @kubernetes/goog-node will have better suggestions.
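
For illustration, roughly what that check amounts to at the docker level (a sketch with the docker CLI; the real check in the kubelet would compare pod statuses instead):

# print any IPv4 address shared by more than one running container on this node
docker ps -q | xargs -r docker inspect --format '{{.NetworkSettings.IPAddress}}' \
  | grep -v '^$' | sort | uniq -d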

If we're going to code a workaround, we can also detect when docker run starts returning 500s in a goroutine:

count=0
while true; do
  # probe docker by starting a throwaway container (any small image works)
  if ! cid=$(docker run -d busybox sleep 3600); then
    count=$((count + 1))
    if [ "$count" -gt "$THRESHOLD" ]; then
      :  # mark the node unschedulable here (node.unschedulable = true)
    fi
  else
    docker kill "$cid" > /dev/null   # probe succeeded, clean it up
    count=0
  fi
  sleep 10
done
@thockin thockin added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 22, 2016
@thockin thockin modified the milestones: v1.2, v1.2-candidate Feb 22, 2016
@dchen1107
Member

I could reproduce the issue only during the docker reboot test so far. With a little extra logging, one can see the docker daemon receive SIGTERM, start persisting its current state, and refuse to start other containers by throwing the error message below. Meanwhile, the kubelet can pick up this error during the next syncLoop, which is misleading here:

Mon Feb 22 18:35:56 UTC 2016===========================start shutdown 
time="2016-02-22T18:36:01.360863832Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:01.360915500Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:01.574372128Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:01.574404895Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:07.085757941Z" level=error msg="Handler for POST /containers/create returned error: sql: database is closed" 
time="2016-02-22T18:36:07.085813206Z" level=error msg="HTTP Error" err="sql: database is closed" statusCode=500 
time="2016-02-22T18:36:07.132089270Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/6f7b897221406a714859f7e68376095952756c04c358e363cc00d5ee510f4f3e/shm: no such file or directory\nfailed to umount /var/lib/docker/containers/6f7b897221406a714859f7e68376095952756c04c358e363cc00d5ee510f4f3e/mqueue: no such file or directory" 
time="2016-02-22T18:36:07.148072161Z" level=error msg="Handler for POST /containers/6f7b897221406a714859f7e68376095952756c04c358e363cc00d5ee510f4f3e/start returned error: Cannot start container 6f7b897221406a714859f7e68376095952756c04c358e363cc00d5ee510f4f3e: no available IPv4 addresses on this network's address pools: bridge (3d1b6ce1539ccaf0b6359dba070bd8ba91695f33a100805fda4e2934df239f13)" 
time="2016-02-22T18:36:07.148122284Z" level=error msg="HTTP Error" err="Cannot start container 6f7b897221406a714859f7e68376095952756c04c358e363cc00d5ee510f4f3e: no available IPv4 addresses on this network's address pools: bridge (3d1b6ce1539ccaf0b6359dba070bd8ba91695f33a100805fda4e2934df239f13)" statusCode=500 
time="2016-02-22T18:36:08.087058337Z" level=error msg="Handler for POST /containers/create returned error: sql: database is closed" 
time="2016-02-22T18:36:08.087097389Z" level=error msg="HTTP Error" err="sql: database is closed" statusCode=500 
time="2016-02-22T18:36:08.112261109Z" level=error msg="Error during graph storage driver.Cleanup(): device or resource busy" 
Mon Feb 22 18:36:16 UTC 2016 =====================start startup 
time="2016-02-22T18:36:17.191851817Z" level=warning msg="Your kernel does not support swap memory limit." 
time="2016-02-22T18:36:17.191956195Z" level=warning msg="Your kernel does not support kernel memory limit." 
time="2016-02-22T18:36:17.192046995Z" level=warning msg="Your kernel does not support cgroup cfs period" 
time="2016-02-22T18:36:17.192067907Z" level=warning msg="Your kernel does not support cgroup cfs quotas" 
time="2016-02-22T18:36:18.218491756Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:18.218535622Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:18.548414290Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:18.548480597Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:18.786482291Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:18.786548580Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:19.025413435Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 
time="2016-02-22T18:36:19.025475206Z" level=warning msg="Your kernel does not support CPU cfs quota. Quota discarded." 
time="2016-02-22T18:36:19.881451761Z" level=warning msg="Your kernel does not support CPU cfs period. Period discarded." 

The second experiment I ran was after docker returned to a normal state: check whether the docker daemon caps the number of containers due to the mistaken "No available IPv4 addresses" issue:

  1. Before the experiment, there were 17 kube-docker containers running on the same node:
# docker ps | wc -l
17
  2. I quickly verified the docker daemon by manually starting more than 100 containers:
# docker ps | wc -l
160
# docker ps -a | wc -l
195

That said, I can't say we never run into this issue when docker is not restarting, but it is rare. I didn't see this issue reported through the e2e tests at all.

@dchen1107
Member

Sent too quickly...

Since it is rare, only affects container creation, and does not affect already-running containers, I am inclined not to revert back to the docker 1.8.3 release. Instead, we should document this as a known issue.
The workaround I have is removing the docker network checkpoint file.

@bprashanth
Contributor

Overall I agree with 1.9+workaround+documentation. I'm just trying to figure out the workarounds.

I didn't see this issue reported through the e2e tests at all.

I'm not sure how conclusive this is, since we don't pay much attention to soak tests and the regular e2es run with fewer than 30 pods per node.

The workaround I have is removing the docker network checkpoint file.

That is also bound to help with #20916 (comment), but I think ordering will be important:

  • stop docker
  • docker starts persisting state
  • docker finishes persisting state
  • we delete checkpoint
  • start docker

How do we detect step 2/3?

The second experiment I ran was after docker returned to a normal state

In the initial report, docker never came back to a normal state. In fact, it didn't even recover after removing the checkpoint file. So do we need to totally avoid any docker request while it's persisting state, or risk that it gets wedged? If so, how do we transmit this to the kubelet?

@bprashanth
Contributor

How do we detect step 2/3?

I guess just stop, poll till the docker pid is gone, rm -rf /var/lib/docker/network/files/*, then start; that should work.
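
A minimal sketch of that sequence (assuming the default checkpoint location under /var/lib/docker/network/files and a service-managed docker daemon; adjust for the actual node setup):

service docker stop
# wait until the daemon has exited (and finished persisting its state)
while pidof docker > /dev/null; do sleep 1; done
rm -rf /var/lib/docker/network/files/*
service docker start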

@vishh
Contributor

vishh commented Feb 22, 2016

As mentioned a couple of times earlier in this issue, can we switch back to v1.8.3 instead and focus our energy on validating docker v1.10 and fixing possible issues with that release?

@dchen1107
Member

I guess just stop, poll till the docker pid is gone, rm -rf /var/lib/docker/network/files/*, then start; that should work.

Users could have configured their own docker network checkpoint directory. Are we going to detect that and complicate our simple babysitter script? And why, if it is a rare issue?

@yujuhong
Contributor Author

As mentioned a couple of times earlier in this issue, can we switch back to v1.8.3 instead and focus our energy on validating docker v1.10 and fixing possible issues with that release?

+1. Docker itself gave up on v1.9.2 and just moved on to v1.10. I think we've spent too much time investigating and working around the known issues of docker v1.9.

@bprashanth
Contributor

Users could have configured their own docker network checkpoint directory. Are we going to detect that and complicate our simple babysitter script? And why, if it is a rare issue?

I'm not comfortable shipping 1.9 without a workaround if there is potential for the node to be hosed (i.e. docker is wedged and no amount of restarting brings it back). Running into this on O(10) nodes in a 100-1000 node cluster will be a bad experience.

@yujuhong
Contributor Author

I could reproduce the issue only during the docker reboot test so far.

In my case, this is how I reproduced it:

  1. Build a fresh test cluster.
  2. Start 100 pause pods, wait until the pods are running, and delete them.
  3. Repeat (2) if everything works fine.
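
A rough shell version of step 2 (a sketch; it assumes a minimal pause-pod.yaml manifest with metadata.name set to NAME_PLACEHOLDER and an app=pause label):

# create 100 pause pods
for i in $(seq 1 100); do
  sed "s/NAME_PLACEHOLDER/pause-$i/" pause-pod.yaml | kubectl create -f -
done
# wait until they are all Running, then delete them and repeat
until [ "$(kubectl get pods -l app=pause | grep -c Running)" -ge 100 ]; do sleep 5; done
kubectl delete pods -l app=pause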

The test usually fails the second time if it's going to fail at all. Note that I didn't explicitly kill docker, and supervisord showed no sign of docker being killed unexpectedly.
However, I think @dchen1107 may be right about the docker restart being the cause. The thing is that the kubelet restarts docker to configure cbr0, so it's possible that the network db was already in a bad state by that time.

@dchen1107
Member

I am totally open to any opinion on this, but @vishh @bprashanth and @yujuhong, how much confidence do you have in the docker 1.10 release here?

We have been validating the docker 1.9.X release since Oct 2015 (#16110). We performed functional tests, integration tests, and performance validation tests.

Our entire jenkins project moved to docker 1.9.1 more than 2 weeks ago for continuous and soak tests. We initially found tons of issues caused by upgrading through salt and restarting docker, especially corrupted docker network / storage checkpoint files (#20995). After we baked the docker binary into our image, we didn't run into that problem.

On the other hand, we have had docker issues with every previous release. For example, with the kubelet 1.1 release, we had several documented docker issues against docker 1.8.3:

@gmarek
Contributor

gmarek commented Mar 1, 2016

I ran into it on one of the nodes of a 3-node cluster after running the MaxPods test a couple of times (which seems consistent with the repro scenario that @yujuhong suggested). I nuked the cluster, but I have all the logs if someone is interested.

 2m 2m  1   {kubelet e2e-test-gmarek-minion-7ss7}       FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container d6333a08b316350dae00931729e33e59498915c751a5673a9588c08204b4a863: no available IPv4 addresses on this network's address pools: bridge (a2cece0bb5703fe5c62b73b8cc59c4d58a8e0560926916a54a6648ed71bdad35)\n"

@gmarek
Contributor

gmarek commented Mar 1, 2016

What's worse is that the Kubelet keeps claiming that it's healthy when this happens, so more pods are scheduled on it.

@thockin
Member

thockin commented Mar 2, 2016

Dawn or Yu-Ju - anything I can do to help? This is a tough one...

@dchen1107
Member

@gmarek Can you send the logs my way? Thanks!

@bprashanth
Contributor

We merged #22293 and expect this to be very rare. If that's not the case (because, honestly, we haven't 100% understood the core problem and haven't invested time in a reliable repro), we need to discuss.

@dchen1107
Member

#21703 and #22293 are merged to remove potentially corrupted docker checkpoint files on every docker daemon restart. I can reproduce the issue very easily at node bootup time without those two PRs.

But like I mentioned above, we might still run into this issue with heavy churn of docker container creation and deletion. It should be very rare, and when we do run into it:

  • All pre-existing containers keep running happily without any issue.
  • When docker runs into this issue, the maximum number of IPs on the node is capped at some fixed number N. Once a running container holding an IP allocation (the PodInfra container in our case) exits, a new container requesting an IP allocation will succeed.
  • When docker runs into this issue, the error message "no available IPv4 addresses on this network's address pools" is properly propagated to the client by the kubelet. It is not a hidden failure.
  • To work around the issue, one can shut down the docker daemon and remove the corrupted checkpoint file, but this means all running containers on the node will be killed.

Given this, I am going to close this issue and document it as a known docker issue for the 1.2 release.

@dchen1107
Member

cc/ @thockin @bgrant0607 @bprashanth

@thockin
Member

thockin commented Mar 3, 2016

I want to see a doc that covers Docker versions and known issues, so that WHEN this blows up, we can point people to it. Sadly, we're stuck with a string of Docker releases, none of which are "good". It sounds like 1.9 is the "least bad".

@mml
Contributor

mml commented Mar 3, 2016

I want to see a doc that covers Docker versions and known issues, so that WHEN this blows up, we can point people to it.

+1 @thockin

@bgrant0607
Member

ref #21000

@gmarek
Contributor

gmarek commented Mar 30, 2016

This causes failures in the SchedulerPredicates MaxPods test (#19681) - all recent failures are caused by the 'No available IPv4 addresses on this network's address pools: bridge' error.

@gmarek
Contributor

gmarek commented Apr 1, 2016

It's causing test flakiness, so I'm adding the flake label.

@bprashanth
Contributor

@gmarek can you post logs? I looked through #19681 (comment) but nothing jumped out. The ipv4 error can happen on startup, and might show up in events, but it shouldn't persist. It shouldn't cause the node to flip to not ready, which is the failure mode I'd expect to cause the scheduler to fail to find a fit for 110 x 3 pods.

@gmarek
Contributor

gmarek commented Apr 1, 2016

It's e.g. in http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-serial/883/consoleText

Mar 29 14:40:06.387: INFO: At 2016-03-29 14:30:35 -0700 PDT - event for maxp-224: {kubelet gke-jenkins-e2e-3f0cce14-node-vjhz} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container 7e4f8063bfb18520933868875fe510701457ae3860c0e42fc9d832cc7c6e7008: no available IPv4 addresses on this network's address pools: bridge (3841504ca2d560fe54163ebd4832ad2a00701b217c6264575a291989571010e6)\n"

@yujuhong
Contributor Author

yujuhong commented Apr 1, 2016

@gmarek can you post logs? I looked through #19681 (comment) but nothing jumped out. The ipv4 error can happen on startup, and might show up in events, but it shouldn't persist. It shouldn't cause the node to flip to not ready, which is the failure mode I'd expect to cause the scheduler to fail to find a fit for 110 x 3 pods.

From past experience, the ipv4 error will persist once it happens. The node will stay ready, but none of the pods can become running beyond that point.

@bprashanth
Contributor

I think Dawn concluded that it was a startup time thing in > 90% of the cases, and we can solve this by nuking the checkpoint file when we restart docker for cbr0 in:

#21703 and #22293 are merged to remove potentially corrupted docker checkpoint files on every docker daemon restart. I can reproduce the issue very easily at node bootup time without those two PRs.

If it's not a startup time thing I think we should be smarter about our docker health check and continue to restart+nuke checkpoint when we detect it, till we roll docker version forward.
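
A rough sketch of what that smarter health check could do (the daemon log path and checkpoint location are assumptions; adjust for the actual node setup):

# if the daemon keeps reporting IPv4 exhaustion, restart it and clear the libnetwork checkpoint
if tail -n 200 /var/log/docker.log | grep -q "no available IPv4 addresses"; then
  service docker stop
  while pidof docker > /dev/null; do sleep 1; done
  rm -rf /var/lib/docker/network/files/*
  service docker start
fi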

@yujuhong
Contributor Author

yujuhong commented Apr 1, 2016

Yes, that was Dawn's conclusion. It could be that the docker checker script wasn't doing its job correctly, or that this is actually a startup problem.

If it's not a startup time thing I think we should be smarter about our docker health check and continue to restart+nuke checkpoint when we detect it, till we roll docker version forward.

We should improve our docker health check in general. On the other hand, I think upgrading the docker version may come even before that.

@dchen1107
Member

I am closing this issue for now. We are going to switch to docker 1.10, which should contain the fix for this bug, as claimed by docker. Also, the cluster team is working on switching to a CNI solution for IP allocation, so we will soon remove the dependency here. cc/ @freehan

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Jan 27, 2019
UPSTREAM: <carry>: oc: allow easy binding to SCC via RBAC

Origin-commit: 8df740ef02f535a1e61ea67d71089a269dc6a36c