
Revert CoreDNS to 1.3.1 in kube-up #78691

Merged
merged 1 commit into kubernetes:master on Jun 10, 2019

Conversation

@rajansandeep
Member

commented Jun 4, 2019

What type of PR is this?

/kind cleanup

What this PR does / why we need it:
Reverts CoreDNS to 1.3.1 for k8s 1.15
Reverts changes from #78030 and #78305

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Revert the CoreDNS version to 1.3.1
@rajansandeep

Member Author

commented Jun 4, 2019

/priority critical-urgent

@mborsz

Member

commented Jun 4, 2019

/hold

Hi,
Could you say why you are reverting this?
The upgrade to 1.5.0 is critical for keeping k8s 1.15 scalable. With 1.3.1, CoreDNS is constantly OOMing at 5000-node scale after the COS upgrade to cos-beta-73-11647-64-0.

See #78562 and #76579 (comment) for context.

/cc @wojtek-t @mm4tt @oxddr

@k8s-ci-robot

Contributor

commented Jun 4, 2019

@mborsz: GitHub didn't allow me to request PR reviews from the following users: oxddr.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/hold

Hi,
Could you say why you are reverting this?
The upgrade to 1.5.0 is critical for keeping k8s 1.15 scalable. With 1.3.1, CoreDNS is constantly OOMing at 5000-node scale after the COS upgrade to cos-beta-73-11647-64-0.

See #78562 and #76579 (comment) for context.

/cc @wojtek-t @mm4tt @oxddr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrisohaver

Contributor

commented Jun 4, 2019

@mborsz,

CoreDNS 1.5.0 is being reverted due to configuration migration troubles. A non-trivial migration of the CoreDNS configuration is required between 1.3.1 and 1.5.0. This migration step was not approved in kubeadm for 1.15 in time, so it is being reverted. IMO, we should be testing the version of CoreDNS that will be used by most 1.15 deployments, which I think will be 1.3.1, since that is what kubeadm will install. I realize not everyone uses kubeadm, but I think it is widely used as a reference. Even if we upgrade CoreDNS in the test framework to make tests pass, I think most people in the field will be using k8s 1.15 + CoreDNS 1.3.1, which potentially has this scaling issue.

Is the COS upgrade to the beta version critical? Can that be reverted? Or would that cause more trouble (e.g. is it fixing other more important problems)?

If we leave the kube-up tests at CoreDNS 1.5.0, then I think CoreDNS 1.5.0 would be the "recommended" version of CoreDNS to use with k8s 1.15 (since that is what was tested). Perhaps that's the easiest route. Kubeadm would then need a release note saying that it doesn't upgrade to the version of CoreDNS recommended for 1.15, because of the manual configuration migration required (although it won't install 1.5.0 on fresh clusters either).
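
For reference, the core of that migration is that the deprecated `proxy` plugin was removed in CoreDNS 1.5.0 in favor of `forward`. Below is a minimal sketch of rewriting an existing Corefile in place; it assumes the stock coredns ConfigMap in kube-system and that the upstream stanza (`proxy . /etc/resolv.conf`) is the only part that needs rewriting, which may not hold for customized configurations.

```sh
# Minimal sketch of the 1.3.1 -> 1.5.0 Corefile migration, assuming the stock
# kube-system/coredns ConfigMap: the deprecated `proxy` plugin becomes `forward`.
#
#   CoreDNS 1.3.1:  proxy . /etc/resolv.conf
#   CoreDNS 1.5.0:  forward . /etc/resolv.conf
#
kubectl -n kube-system get configmap coredns -o yaml \
  | sed 's|proxy \. /etc/resolv\.conf|forward . /etc/resolv.conf|' \
  | kubectl apply -f -
```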

@mborsz

Member

commented Jun 4, 2019

/cc @yguo0905 @yujuhong

For the question of whether the COS upgrade is critical.

@neolit123

Member

commented Jun 6, 2019

@soggiest

@kubernetes/sig-cluster-lifecycle Can someone provide some thoughts around this issue?

@chrisohaver just did. It's really up to the kube-up maintainers to make a decision.
My vote would be to apply this current PR.

@soggiest

Contributor

commented Jun 7, 2019

@spiffxp can you take a look at this and approve it if it's valid?

@spiffxp

Member

commented Jun 7, 2019

Thanks @chrisohaver for the summary. When in doubt, I'm going to err on the side of user upgrade experience and suggest that we revert. Going to read a bit more for context before making a call.

That said, in 1.11 kubeadm switched to coredns, while in 1.12 kube-up still used kube-dns due to scalability concerns, finally switching in 1.13... so we have prior art for two different ways of standing up clusters using two different DNS implementations.

@spiffxp

Member

commented Jun 7, 2019

My opinion:

  • code thaw is scheduled for Tue June 11th
  • cos-beta-73-11647-64-0 was introduced in early March (#75149); we'd be in unknown territory if we rolled back to cos-stable-65-10323-64-0 this late
  • we currently have an unacceptable upgrade path with coredns 1.5.0, and not enough time to reasonably develop an acceptable one, so we should revert to coredns 1.3.1, because user upgrade experience comes first
  • this will result in coredns using more memory, for reasons we don't yet know (#78562)
  • this will cause our 5000-node performance tests that use kube-up.sh to fail if they use the dns addon unmodified (#78562)
  • in theory sig-scalability is supposed to report on the exact configuration they use to validate cluster performance (https://github.com/kubernetes/community/blob/master/sig-scalability/processes/scalability-validation.md#scalability-validation-of-the-release), though they have not done so since 1.9 (kubernetes/sig-release#543)
  • I think a good potential compromise would be to have sig-scalability generate that report for 1.15 and include a non-default configuration using coredns 1.5.0
  • It's unclear to me whether kube-up.sh supports such a custom configuration, but I will start looking there (it supports using a custom COS image, but I don't think that would help us here, for the reason mentioned up top)
@jdumars

Member

commented Jun 7, 2019

@spiffxp this seems like a very reasonable path forward. I also agree that upgrade UX is the highest-impact aspect in play here.

@chrisohaver

Contributor

commented Jun 7, 2019

Sounds good to me. Basically: roll back to CoreDNS 1.3.1, release k8s 1.15 with the failing 5k-node test, and issue some kind of scaling exception notice advising use of CoreDNS 1.5.0 for large clusters in the ballpark of 5k nodes.

Thanks @spiffxp!
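
For operators who do opt in to 1.5.0 on clusters at that scale, a minimal sketch of the manual bump could look like the following; the image registry and tag shown are assumptions, and the Corefile migration (proxy to forward) has to be applied first.

```sh
# Hypothetical manual opt-in to CoreDNS 1.5.0 on a large cluster; the registry/tag
# below is an assumption, not necessarily what the addon ships.
kubectl -n kube-system edit configmap coredns        # apply the proxy -> forward migration first
kubectl -n kube-system set image deployment/coredns coredns=k8s.gcr.io/coredns:1.5.0
kubectl -n kube-system rollout status deployment/coredns
```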

@shyamjvs

Member

commented Jun 7, 2019

IIUC from the above discussion, it seems like coredns 1.5.0 has some correctness bugs but also scalability improvements. We had been using 1.3.1 until as recently as 3 weeks ago (when it was bumped to 1.5.0 in #78030), and scalability tests were passing for the last release, where we were still using the old version (though IIRC we manually had to bump the resource requests). So if we're not regressing from the last release wrt coredns scalability, staying at the old version to avoid correctness bugs seems better IMHO. I might be missing something here.

@wojtek-t @mborsz - fyi

@jdumars

Member

commented Jun 7, 2019

Given release timing, is there a deadline for this decision?

@mborsz

Member

commented Jun 8, 2019

IIUC from the above discussion, it seems like coredns 1.5.0 has some correctness bugs but also scalability improvements. We had been using 1.3.1 until as recently as 3 weeks ago (when it was bumped to 1.5.0 in #78030), and scalability tests were passing for the last release, where we were still using the old version (though IIRC we manually had to bump the resource requests). So if we're not regressing from the last release wrt coredns scalability, staying at the old version to avoid correctness bugs seems better IMHO. I might be missing something here.

@wojtek-t @mborsz - fyi

That's not quite the full picture.

http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadRequestCountByClient&client=coredns%2Fv0.0.0%20(linux%2Famd64)%20kubernetes&code=200&component=apiserver&group=&resource=endpoints&scope=cluster&subresource=&verb=LIST&version=v1

shows the number of LIST endpoints requests issued in the load test, which mostly reflects the number of coredns restarts.

There are 3 notable builds:

  • 330 -- this is when the cos version was bumped. The number of restarts started being really high (300-1000 restarts). Debugging a few runs after that showed that it was mostly OOMing
  • 1115162902861451269 -- the number of restarts decreased significantly (to 5-20). I bet we already debugged this, but I can't remember now what the reason for the decrease was.
  • 1120236390706057221 -- the number of restarts goes to zero -- this is because #76819 got merged, which basically removes the only dns client in the scalability tests.

So the fact that the test was passing before upgrading coredns to 1.5.0 is mostly because we removed the last dns client talking to it.

We see that under similar conditions (builds 328 vs 330) coredns 1.3.1 with the newer cos is using more memory than before the cos upgrade.
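
As a quick spot check outside of perf-dash, the restart counts and last termination reasons (e.g. OOMKilled) can be read straight from the coredns pods; the label selector below is an assumption based on the stock kube-up addon labels.

```sh
# Show restart counts and the last container termination reason for the coredns pods;
# the k8s-app=kube-dns label selector is an assumption based on the stock addon.
kubectl -n kube-system get pods -l k8s-app=kube-dns \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason
```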

@mborsz

Member

commented Jun 8, 2019

I agree with @spiffxp's suggestion.

We can also suggest increasing the memory limit for coredns in big clusters (assuming the increase isn't that big); a rough sketch is below, after the list. This seems to be a safer option than asking for a coredns upgrade.

Just two concerns:

  • We should make sure that it's possible to use a custom coredns version if we are going to suggest doing so. I'm not aware of any knob that can be used in kube-up. Can we add something like that in 1.15?
  • What we are doing here is moving the upgrade problem from all clusters to only big clusters. Given the current timeline this seems reasonable, but we still need to solve this issue somehow.
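
A rough sketch of what bumping the limit could look like on a running cluster; the 300Mi figure is purely illustrative, not a validated number, and the addon manager may reconcile a manual edit away depending on the addon's mode.

```sh
# Illustrative memory-limit bump for coredns; 300Mi is a placeholder value, and the
# kube-up addon manager may revert this change depending on its reconcile mode.
kubectl -n kube-system patch deployment coredns --type=strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"memory":"300Mi"}}}]}}}}'
```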
@mborsz

Member

commented Jun 10, 2019

#78851 adds a way to customize the coredns version in the kube-up scripts. Can we get it merged in 1.15?

@alejandrox1

Contributor

commented Jun 10, 2019

Hi all, so what's the plan?
Will #78851 be merged/approved after this one, or instead of this one?

@mborsz

Member

commented Jun 10, 2019

I suggest merging this one and then #78851.

@alejandrox1

Contributor

commented Jun 10, 2019

Awesome, thank you @mborsz!
Could we get an approval on this?

@spiffxp

Member

commented Jun 10, 2019

/retest
/approve
/lgtm

@k8s-ci-robot

Contributor

commented Jun 10, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rajansandeep, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jdumars

Member

commented Jun 10, 2019

/lgtm
Does someone want to remove the hold? Or I can.

@dims

Member

commented Jun 10, 2019

/hold cancel

@k8s-ci-robot merged commit 32ec6c2 into kubernetes:master on Jun 10, 2019

23 checks passed

cla/linuxfoundation: rajansandeep authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-iscsi: Skipped.
pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.

@mboersma referenced this pull request Jun 12, 2019:

Merged: feat: add support for Kubernetes 1.15.0-rc.1 #1469