GCI: Race condition when deleting docker0 #29756
During GCI bootup config, the docker0 bridge is deleted before kubelet starts, which works fine when not using a NETWORK_PROVIDER.
referenced this issue
Jul 28, 2016
We don't need to restart docker with kubenet. New containers are created on cbr0 by the network plugin, and if someone SSHes into the node and runs "docker run -it busybox /bin/sh", that container still gets an IP from docker0. Now docker0 and cbr0 should not have overlapping CIDRs, because docker0 will be created from the default 172.17.0.1, and cbr0 will be created from the podcidr range.
We should probably make sure the nodecidr doesn't overlap the 172 range or bad things can happen.
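A rough way to sanity-check that overlap is sketched below, using Python's standard `ipaddress` module. It assumes docker0 keeps Docker's default bridge network (172.17.0.0/16, matching the 172.17.0.1 address mentioned above). The helper name `overlaps_docker0` and the sample CIDRs are made up for illustration and are not part of any startup script.

```python
import ipaddress

# Assumption: docker0 uses Docker's default bridge network (172.17.0.1/16).
DOCKER0_DEFAULT = ipaddress.ip_network("172.17.0.0/16")


def overlaps_docker0(pod_cidr, docker0=DOCKER0_DEFAULT):
    """Return True if pod_cidr shares any address with the docker0 range."""
    return ipaddress.ip_network(pod_cidr).overlaps(docker0)


if __name__ == "__main__":
    # Hypothetical CIDRs, only here to exercise the check.
    for cidr in ("10.244.1.0/24", "172.17.5.0/24", "172.16.0.0/12"):
        print(cidr, "overlaps docker0:", overlaps_docker0(cidr))
```

A check like this could run before cbr0 is configured, so a conflicting range gets rejected up front instead of surfacing later as a routing problem.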
@maisem can you clarify the race? We delete docker0 on the master to avoid overlapping CIDRs. There might actually be better ways to solve the race (like actually applying the master kubelet's --pod-cidr arg to cbr0, and making sure it doesn't overlap with docker0), but I don't think we need to target everything for 1.3.4. The docker0 deletion shouldn't be an issue on nodes. In fact, we shouldn't even need to delete docker0 there, because we start docker without the --bridge option, so it's going to create docker0 anyway.
We need to restart docker at least once for it to pick up the new command line arguments.
When we start using kubenet the following happens.
There is a race between 2 and 3. With kubenet enabled, if docker starts before the bridge is deleted, it crashes because it is unable to use 172.17.0.1, which causes the startup scripts to fail.
#29757 changes the order of the steps to
I hope that clarifies things.
Reading your previous comment, we don't need to delete docker0 with kubenet. In fact, with the "correct" ordering (2 before 3 in your list) doesn't docker just recreate docker0 for itself if we simply
There are a few situations to handle though:
The easiest option is to get rid of the docker0 deletion hack by just changing the --pod-cidr given to the master on GKE. Is there a reason it is what it is today? Then, everywhere, we use kubenet, never delete docker0, and respect --pod-cidr if specified; otherwise we use the podCIDR in the node object.
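The precedence proposed here (the --pod-cidr flag wins, else fall back to the node object's podCIDR) could look roughly like the sketch below. This is only an illustration of the rule, not kubelet's actual code; `effective_pod_cidr` and its parameters are hypothetical names.

```python
import ipaddress


def effective_pod_cidr(flag_pod_cidr, node_pod_cidr):
    """Pick the range for cbr0: --pod-cidr if given, else the node's podCIDR."""
    chosen = flag_pod_cidr or node_pod_cidr
    if not chosen:
        # Neither source has a value yet, so the bridge cannot be configured.
        raise ValueError("no pod CIDR available")
    # Parsing validates the string and normalizes it to a network object.
    return ipaddress.ip_network(chosen)
```

With this shape, the master can be handed an explicit --pod-cidr while regular nodes leave the flag empty and pick up whatever range is assigned in their node object.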