Create pod latency increase #54651

Closed
dashpole opened this issue Oct 26, 2017 · 41 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@dashpole (Contributor) commented Oct 26, 2017

The node e2e density tests have been failing consistently: https://k8s-testgrid.appspot.com/sig-node-kubelet#kubelet-serial-gce-e2e&include-filter-by-regex=create%20a%20batch%20of%20pods
The only PRs in the diff that touch the kubelet seem related to CNI: 88975e9...03cb11f, with the likely candidate being #51250
In the run prior to the failure, latency ranged from 3s to 4s. In the first failing run, 50th-percentile latency ranged from ~16s to ~19s, which is a major increase.

cc @kubernetes/sig-node-bugs @kubernetes/sig-network-bugs
cc @dixudx

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. labels Oct 26, 2017
@dixudx (Member) commented Oct 27, 2017

Wow, this is a huge increase.
@squeed Any comments on the performance? I wonder whether this has any impact on the kubelet and GC.
/cc @luxas

@squeed (Contributor) commented Oct 27, 2017

Interesting, taking a look.

@dixudx (Member) commented Oct 27, 2017

@squeed Thanks for your help.

@squeed (Contributor) commented Oct 27, 2017

Ah, I think I know the issue. The new CNI plugins wait for DAD (duplicate address detection) to finish before proceeding, which takes 3 seconds on Linux. I still need to do some testing (helpfully, my laptop drive died yesterday).

@squeed (Contributor) commented Oct 27, 2017

The address-settle wait adds about 1 second of delay. Is that likely the complete cause, or is something else also involved? (I'm not familiar with this test.)

@squeed (Contributor) commented Oct 27, 2017

Ah, network setup is also serialized behind a mutex; the increase from 20ms to 1020ms for network setup in the critical path is almost certainly the culprit.

The CNI plugins could avoid the DAD delay for now (since we're not doing IPv6), but this will eventually become a problem once IPv6 is supported.

Since the CNI plugin execution itself doesn't need to be synchronized, I think I can move that outside the mutex.

@luxas (Member) commented Oct 27, 2017

This seems to be a v1.9-blocking issue that needs to be fixed/tracked, right?

@bowei (Member) commented Oct 30, 2017

@dnardo @freehan

@freehan (Contributor) commented Oct 30, 2017

Good catch... This is why we need to bump the CNI version early in a release cycle.

@danehans

@squeed sig-network is planning to add IPv6 alpha support for the 1.9 release.

k8s-github-robot pushed a commit that referenced this issue Nov 2, 2017
The CNI plugin can take up to 3 seconds to execute. CNI plugins can safely be
executed in parallel, so yield the lock to speed up pod creation.

Fixes: #54651
k8s-github-robot pushed a commit that referenced this issue Nov 2, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kubenet: yield lock while executing CNI plugin.

The CNI plugin can take up to 3 seconds to execute. CNI plugins can safely be
executed in parallel, so yield the lock to speed up pod creation.

This caused problems with the pod latency tests - previously, CNI plugins executed
in under 20ms. Now they must wait for DAD to finish and addresses to leave
tentative state.

Fixes: #54651

**What this PR does / why we need it**:
After upgrading the CNI plugins to v0.6 in #51250, the pod latency tests began failing. This is because the plugins, in order to support IPv6, need to wait for DAD to finish. Because this delay occurs while the kubenet lock is held, it significantly slows down the pod creation rate.

**Special notes for your reviewer**:
The CNI plugins also do locking for their critical paths, so it is safe to run them concurrently.

**Release note**:
```release-note
NONE
```
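
(For context, a minimal sketch of the locking pattern this commit describes, written against a hypothetical plugin type; it is not the actual kubenet code, and the binary path and environment variables shown are illustrative assumptions. The point is that bookkeeping stays under the mutex while the slow CNI plugin execution happens outside it.)

```go
// Illustrative sketch only, not the kubenet implementation; names are hypothetical.
package netsetup

import (
	"os"
	"os/exec"
	"sync"
)

type plugin struct {
	mu     sync.Mutex
	podIPs map[string]string // per-pod state that must stay consistent under mu
}

// setUpPod yields the lock while the CNI binary runs, since plugin executions
// are safe to run concurrently and may block for seconds (e.g. waiting for DAD).
func (p *plugin) setUpPod(podID, netnsPath string) error {
	p.mu.Lock()
	// ...bookkeeping that needs the lock...
	p.mu.Unlock()

	// Slow part, run without holding the lock. Plugin path and env are assumptions.
	cmd := exec.Command("/opt/cni/bin/bridge")
	cmd.Env = append(os.Environ(), "CNI_COMMAND=ADD", "CNI_NETNS="+netnsPath)
	err := cmd.Run()

	p.mu.Lock()
	defer p.mu.Unlock()
	if err != nil {
		return err
	}
	// ...record the result (e.g. the assigned pod IP) in p.podIPs under the lock...
	return nil
}
```

The trade-off is that any state read before unlocking can go stale while the plugin runs, so results have to be re-validated once the lock is re-taken.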
@porridge (Member) commented Nov 6, 2017

/reopen

@k8s-ci-robot (Contributor)

@porridge: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@porridge (Member) commented Nov 6, 2017

/open

@porridge (Member) commented Nov 6, 2017

To be honest, it does not seem like #54800 has fixed this. We are still seeing high pod startup latency in large and medium scale tests: #55060 (comment)

@luxas (Member) commented Nov 6, 2017

Crazy idea: can we do things conditionally for IPv4 and IPv6?
Does this affect only kubenet, or any CNI plugin (like Weave, Flannel, etc.)?

@porridge (Member) commented Nov 6, 2017 via email

@yujuhong (Contributor) commented Nov 6, 2017

> We disable DAD when using kubenet. Since Kubenet has external guarantees that addresses won't collide, this is safe.

I thought kubenet would be deprecated sooner or later. If that's still true, does that mean we'd need yet another solution for a non-kubenet setup?

@porridge (Member) commented Nov 6, 2017

Another question from a layman: do we know why it's so slow on Linux? Is it talking to other nodes (something like gratuitous ARP)? Does it have a chance to become faster in the future? Does the delay depend on the size of the network, the number of nodes, etc.?

@squeed (Contributor) commented Nov 6, 2017

One way to make it faster on Linux is to enable "optimistic DAD" (RFC 4429), which allows use of tentative addresses. This is still considered an "experimental" feature and is not enabled in all kernels; IIRC, the latest Ubuntu still has it disabled, while CoreOS Container Linux and Debian do have it. It is a namespaced sysctl that would need to be set in the container's network namespace, since new network namespaces take the default configuration rather than inheriting from the host.
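
(A minimal sketch of setting those sysctls, assuming the process is already running inside the container's network namespace, as a CNI plugin would be; the interface name "eth0" and the helper name are assumptions, and kernel support is required.)

```go
// Sketch only; assumes this runs inside the container's network namespace.
package netsetup

import (
	"fmt"
	"os"
)

// enableOptimisticDAD sets the namespaced IPv6 sysctls that allow tentative
// addresses to be used immediately (RFC 4429). Not all kernels/distros enable
// support for this.
func enableOptimisticDAD(iface string) error {
	for _, key := range []string{"optimistic_dad", "use_optimistic"} {
		path := fmt.Sprintf("/proc/sys/net/ipv6/conf/%s/%s", iface, key)
		if err := os.WriteFile(path, []byte("1"), 0o644); err != nil {
			return fmt.Errorf("setting %s: %w", path, err)
		}
	}
	return nil
}
```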

@squeed (Contributor) commented Nov 6, 2017

@yujuhong yeah, you're right. Hopefully by then optimistic_dad will be enabled on all kernels...

@luxas (Member) commented Nov 6, 2017

> I thought kubenet would be deprecated sooner or later.

it will

> If that's still true

yes

> does that mean we'd need yet another solution for a non-kubenet setup

probably, like @squeed said above.

pinging @kubernetes/kubernetes-release-managers already -- this MIGHT be one of the big things we need to notify people about if this doesn't get solved.

@shyamjvs shyamjvs added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 7, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Current

@dashpole

Issue Labels
  • sig/network sig/node: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@shyamjvs shyamjvs added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Nov 7, 2017
@shyamjvs (Member) commented Nov 7, 2017

@spiffxp Could you approve this for the milestone?

This issue needs to be resolved urgently, as it's causing the 5k-node performance job (which is release-blocking) to fail.

@squeed (Contributor) commented Nov 7, 2017

I'm working on disabling DAD in kubenet (option 3).

Should be ready soon.

@squeed (Contributor) commented Nov 7, 2017

OK, PR filed.

@spiffxp (Member) commented Nov 7, 2017

/status approved-for-milestone
FYI @shyamjvs it looks like @luxas already added this to the v1.9 milestone, and @dchen1107 already added the label manually.

k8s-github-robot pushed a commit that referenced this issue Nov 9, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kubenet: disable DAD in the container.

Since kubenet externally guarantees that IP addresses will not conflict, we can short-circuit the kernel's normal wait. This lets us avoid the 1-second network wait.

**What this PR does / why we need it**:
Fixes the pod startup latency identified in #54651 and #55060

**Release note**:
```release-note
NONE
```
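
(A minimal sketch of the kind of change this commit describes, not the actual PR: with kubenet guaranteeing address uniqueness, DAD can be switched off via namespaced sysctls so addresses skip the tentative state. The interface name and the helper are assumptions.)

```go
// Sketch only; assumes this runs inside the pod's network namespace.
package netsetup

import (
	"fmt"
	"os"
)

// disableDAD turns off duplicate address detection for an interface:
// dad_transmits=0 sends no DAD probes and accept_dad=0 skips the check,
// so new IPv6 addresses become usable without the settle delay.
func disableDAD(iface string) error {
	settings := map[string]string{
		"accept_dad":    "0",
		"dad_transmits": "0",
	}
	for key, value := range settings {
		path := fmt.Sprintf("/proc/sys/net/ipv6/conf/%s/%s", iface, key)
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			return fmt.Errorf("setting %s: %w", path, err)
		}
	}
	return nil
}
```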
@shyamjvs (Member) commented Nov 9, 2017

PR #55247 seems to have done the trick. The latency (values below in milliseconds) has decreased

from:

    {
      "data": {
        "Perc100": 4762.418815,
        "Perc50": 3308.810795,
        "Perc90": 3955.747742,
        "Perc99": 4756.553711
      },

to:

    {
      "data": {
        "Perc100": 3178.085005,
        "Perc50": 2186.109036,
        "Perc90": 2768.180504,
        "Perc99": 3126.759942
      },

Closing the issue. Thanks everyone for working on this!

@shyamjvs shyamjvs closed this as completed Nov 9, 2017
@luxas (Member) commented Nov 9, 2017

@shyamjvs Do we have any coverage of what happens with these numbers if you're using a 3rd-party CNI provider (Weave, Calico, Flannel, whatever)? If this ONLY fixes kubenet but imposes a performance penalty on everyone not on GCE, I think we should reopen this issue.

@porridge (Member) commented Nov 9, 2017 via email

@squeed (Contributor) commented Nov 9, 2017

The delay was on the CNI executable side, so it depends on the plugin being used. That said, it would be good to have that code path under test in general.

@shyamjvs (Member) commented Nov 9, 2017

We're not scale-testing with other CNI providers. One, because kubenet seems to be the default option used at head. Two, because it's simply not possible to exhaustively scale-test different configurations, given the resource and time constraints of running large cluster tests. So we choose to validate against (mostly) the default configs.

That said, opening an issue to study those options sounds good. However, it would be hard to get this under sig-scale's purview at the moment.

@porridge (Member) commented Nov 9, 2017 via email

@luxas (Member) commented Nov 9, 2017

Ok, thanks.
cc @timothysc FYI

@yujuhong (Contributor)

> I'd suggest opening a separate issue to track this, as it's unlikely to fit in 1.9.

+1. Is this tracked anywhere?

@luxas (Member) commented Nov 13, 2017 via email
