Skip to content
This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

kube-proxy connection tracking adjustments are crashing smoke tests #724

Closed
3 tasks done
jdef opened this issue Jan 5, 2016 · 17 comments
Closed
3 tasks done

kube-proxy connection tracking adjustments are crashing smoke tests #724

jdef opened this issue Jan 5, 2016 · 17 comments

Comments

@jdef
Copy link

jdef commented Jan 5, 2016

task list:

current thinking is that it's likely some kind of permissions problem related to kube-proxy attempting to tweak networking sysctl's

@sttts
Copy link

sttts commented Jan 5, 2016

Is this change also on the release-1.1 branch?

@jdef
Copy link
Author

jdef commented Jan 5, 2016

i have not ported this to release-1.1 because the conntrack sysctl/sysfs-module behavior hasn't landed there (yet?)

@jdef
Copy link
Author

jdef commented Jan 5, 2016

using a modified jprobe originally obtained at http://www.friedhoff.org/posixfilecapsold.html I've observed that CAP_SYS_ADMIN is required to access the sysfs-modules filesystem, and that CAP_NET_ADMIN as well as CAP_SYS_ADMIN are required to access the sysctl file that's tweaked by kube-proxy. modified jprobe code is here:

int cr_capable (const struct cred *cred, struct user_namespace *ns, int cap, int audit)
{
        /*
         * extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
         *                        int cap, int audit);
         */
        printk(KERN_NOTICE "%s: asking for capability %d for %s\n",
                __FUNCTION__, cap, current->comm);
        jprobe_return();
        return 0;
}

@sttts
Copy link

sttts commented Jan 5, 2016

@jdef And our privileged dind containers don't have those capabilities?

@jdef
Copy link
Author

jdef commented Jan 5, 2016

@sttts that is the next thing to check :) the mesos-slave is using native mesos containerization to create the container that the executor runs in. so we have a combination effect here to consider:

  • the mesos slave appears to be running in a privileged docker container (via --privileged)
  • the mesos slave creates the kubelet-executor (non-Docker) container to host the minion process
  • the minion process launches the kube-proxy process

docker-container[privileged](slave::mesos-container[?caps?](minion::kube-proxy)

@sttts
Copy link

sttts commented Jan 5, 2016

Which means that maybe the slave steals the needed capabilities, right?

@jdef
Copy link
Author

jdef commented Jan 5, 2016

perhaps. it's also possible that the nf_conntrack module isn't loaded into the kernel in our CI environment

@sttts
Copy link

sttts commented Jan 5, 2016

We are root there. We could load it.

@jdef
Copy link
Author

jdef commented Jan 5, 2016

also, in contrib/mesos/ci/run-with-cluster.sh we're not running the test harness docker container with --privileged. this seems a likely culprit.

@sttts
Copy link

sttts commented Jan 5, 2016

The test container is the problem?

@sttts
Copy link

sttts commented Jan 5, 2016

Feel free to add --privileged there too. But it feels strange that the tests need those capabilities. /cc @karlkfi

@jdef
Copy link
Author

jdef commented Jan 5, 2016

it could be. i'd like to test 1 change at a time to see exactly where the problem is. so i'm going to create a testing branch where i roll back the conntrack=0 flag lines I added in the hotfix, then start making some changes to our test scripts to see what fixes the prob

@jdef
Copy link
Author

jdef commented Jan 5, 2016

data points:

  1. adding --privileged to run-with-cluster.sh DOES NOT allow smoke test to pass
  2. running lsmod in run-with-cluster.sh SHOWS nf_conntrack is LOADED

@jdef
Copy link
Author

jdef commented Jan 5, 2016

tweaking the /sys/module/nf_conntrack/parameters/hashsize value causes problems. the proposed PR (kubernetes/kubernetes#19303) simply defaults the conntrackMax value to 0, which avoids changing the hashsize. Arguably since mesos is a multi-tenant cluster we shouldn't be mucking with this by default anyway.

@jdef jdef added PTAL and removed WIP labels Jan 5, 2016
@s-urbaniak
Copy link

Nice catch regarding nf_conntrack hashsize! I am still wondering about the concrete root cause of the crashes. The upstream PR introduces a higher default hashsize value (256k vs. 64k) than the default. Was 256k still to small causing packets dropped by the kernel connection tracking causing the smoke test crash?

@jdef
Copy link
Author

jdef commented Jan 6, 2016

I suspect some strange interaction between our 3-level-nested container CI
environment and the attempts to tweak linux kernel mods at runtime. the
dockerized slave has it's own netns. skimming the kernel code, it looks
like the nf_conntrack module isn't very happy when setting hashsize on a
netns that's != init_net (the root-level, initial netns of the OS):

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/netfilter/nf_conntrack_core.c?id=refs/tags/v3.18.25#n1555

https://books.google.com/books?id=RpsQAwAAQBAJ&pg=PA423&lpg=PA423&dq=netns+init_net+namespace&source=bl&ots=rAuK1vwKVl&sig=5VCs683iZFEYq-S03ummEJEGF9s&hl=en&sa=X&ved=0ahUKEwi4443WxZXKAhUC8z4KHVwICkoQ6AEISjAH#v=onepage&q=netns%20init_net%20namespace&f=false

that said, i can't see the specific error code because our CI env loses the
executor and proxy logs (we should have a bug for that somewhere - it makes
debugging very hard).

On Wed, Jan 6, 2016 at 3:27 AM, Sergiusz Urbaniak notifications@github.com
wrote:

Nice catch regarding nf_conntrack hashsize! I am still wondering about the
concrete root cause of the crashes. The upstream PR introduces a higher
default hashsize value (256k vs. 64k) than the default. Was 256k still to
small causing packets dropped by the kernel connection tracking causing the
smoke test crash?


Reply to this email directly or view it on GitHub
#724 (comment)
.

@s-urbaniak
Copy link

@jdef thanks for the explanation and for the links!

@jdef jdef added LGTM and removed PTAL labels Jan 6, 2016
@jdef jdef closed this as completed Jan 7, 2016
@jdef jdef removed the LGTM label Jan 7, 2016
@jdef jdef added this to the v0.7.2 milestone Jan 10, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants