New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollback etcd server version to 3.1.11 due to #60589 #60891

Merged
merged 2 commits into from Mar 8, 2018

Conversation

@shyamjvs
Member

shyamjvs commented Mar 7, 2018

Ref #60589 (comment)

The dependencies were a bit complex (so many things relying on it) + the version was updated to 3.2.16 on top of the original bump.
So I had to mostly make manual reverting changes on a case-by-case basis - so likely to have errors :)

/cc @wojtek-t @jpbetz

Downgrade default etcd server version to 3.1.11 due to #60589

(I'm not sure if we should instead remove release-notes of the original PRs)

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Mar 7, 2018

@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Mar 7, 2018

I agree we should revert to etcd server 3.1 given that we don't understand why the performance regression occurred.
/lgtm

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Mar 7, 2018

@jdumars - I'm approving this PR for milestone to fix significant performance regression.
I hope you will be fine with that.

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Mar 7, 2018

@timothysc - why "do-not-merge" ?

@timothysc

This comment has been minimized.

Member

timothysc commented Mar 7, 2018

@wojtek-t @jpbetz - The failure conditions that are fixed in 3.2 are of far greater importance imo.

@timothysc

This comment has been minimized.

Member

timothysc commented Mar 7, 2018

There are a series of fixes in 3.2 that fix catastrophic failure conditions that we can't negate imo.

/cc @hongchaodeng @xiang90

@k8s-ci-robot k8s-ci-robot requested review from hongchaodeng and xiang90 Mar 7, 2018

@timothysc

This comment has been minimized.

Member

timothysc commented Mar 7, 2018

/cc @kubernetes/sig-cluster-lifecycle-bugs - PSA that this effects scale deployments but at the cost of other known fixes.

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Mar 7, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jpbetz @shyamjvs

Pull Request Labels
  • sig/api-machinery sig/cluster-lifecycle sig/scalability: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help
@@ -22,14 +22,14 @@
"command": [
"/bin/sh",
"-c",
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://127.0.0.1:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://{{ hostname }}:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"

This comment has been minimized.

@wojtek-t

wojtek-t Mar 8, 2018

Member

Let's not revert this line - this was net improvement and we will need it again in the future.

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

we will need it again in the future

To confirm - do we know if this listen-client-url change works with 3.1.11 currently?

This comment has been minimized.

@wojtek-t

wojtek-t Mar 8, 2018

Member

Note that people are generally not doing etcd upgrades together with version upgrades. So because 1.9 manifest doesn't work with 3.2.16, the order has to be: "upgrade to 1.10 and only the upgrade etcd". So it has to work.
[And yes - it works]

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

That's interesting to know, thanks. Fixed it.

@shyamjvs

This comment has been minimized.

Member

shyamjvs commented Mar 8, 2018

The gce-large-performance pre-submit failed even before build. Seems like docker daemon unavailable:

I0308 11:44:33.287] make: Entering directory '/go/src/k8s.io/kubernetes'
I0308 11:44:33.288] +++ [0308 11:44:33] Verifying Prerequisites....
W0308 11:44:34.451] Can't connect to 'docker' daemon.  please fix and retry.
W0308 11:44:34.451] 
W0308 11:44:34.451] Possible causes:
W0308 11:44:34.452]   - Docker Daemon not started
W0308 11:44:34.452]     - Linux: confirm via your init system
W0308 11:44:34.452]     - macOS w/ docker-machine: run `docker-machine ls` and `docker-machine start <name>`
W0308 11:44:34.452]     - macOS w/ Docker for Mac: Check the menu bar and start the Docker application
W0308 11:44:34.453]   - DOCKER_HOST hasn't been set or is set incorrectly
W0308 11:44:34.453]     - Linux: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.453]     - macOS w/ docker-machine: run `eval "$(docker-machine env <name>)"`
W0308 11:44:34.453]     - macOS w/ Docker for Mac: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.454]   - Other things to check:
W0308 11:44:34.454]     - Linux: User isn't in 'docker' group.  Add and relogin.
W0308 11:44:34.454]       - Something like 'sudo usermod -a -G docker ${USER}'
W0308 11:44:34.454]       - RHEL7 bug and workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1119282#c8

cc @krzyzacy @BenTheElder - Any idea what's going wrong? I defined the pull-kubernetes-e2e-gce-100-performance job similar to pull-kubernetes-e2e-gce. Ref: kubernetes/test-infra#7168

@shyamjvs

This comment has been minimized.

Member

shyamjvs commented Mar 8, 2018

/test pull-kubernetes-e2e-gce-large-performance

@shyamjvs

This comment has been minimized.

Member

shyamjvs commented Mar 8, 2018

/test pull-kubernetes-e2e-gce-large-performance
retrying after above fix

@jdumars

This comment has been minimized.

Member

jdumars commented Mar 8, 2018

Thank you all for untangling this thorny issue. Please keep me posted with @jdumars as things progress. This is a great example of the strength of our community.

@shyamjvs

This comment has been minimized.

Member

shyamjvs commented Mar 8, 2018

So I tested this PR manually against a 2k-node cluster and things seem fine:

wrt PUT pod-status latency:

{
      "data": {
        "Perc50": 1.485,
        "Perc90": 7.894,
        "Perc99": 43.414
      },
      "unit": "ms",
      "labels": {
        "Count": "450498",
        "Resource": "pods",
        "Scope": "namespace",
        "Subresource": "status",
        "Verb": "PUT"
      }
    },

and pod-startup latency:

INFO: perc50: 2.409912012s, perc90: 3.085908828s, perc99: 3.714040988s
@@ -24,4 +24,4 @@ spec:
dnsPolicy: Default
containers:
- name: etcd-empty-dir-cleanup
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.10.0
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.11.0

This comment has been minimized.

@jpbetz

jpbetz Mar 8, 2018

Contributor

I don't see a 3.1.11.0 in gcr.io. The etcd version of etcd-empty-dir-cleanup is only for the version of etcdctl copied into the container image which hasn't changed between 3.1.10 and 3.1.11.

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

I see. In that case, do you want me to:

  • keep the tag at 3.1.10.0, or
  • change it to 3.1.11.0 (to sync with etcd version used in the makefile) and push a new image?

This comment has been minimized.

@jpbetz

jpbetz Mar 8, 2018

Contributor

Sigh. To be consistent, let's publish 3.1.11.0..

This comment has been minimized.

@jpbetz

jpbetz Mar 8, 2018

Contributor

Of we could just tag the 3.1.10.0 image with 3.1.11.0 as well 😁

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

I'd prefer the former, to avoid messing up the tag wrt the underlying etcd version actually used to build the image.

That said, I failed with the following error while running make push:

The push refers to a repository [staging-k8s.gcr.io/etcd-empty-dir-cleanup]
41ad6ade37a0: Pushing [==================================================>]  14.32MB/14.32MB
8b8608fe70b0: Pushing [==================================================>]  14.32MB/14.32MB
c5183829c43c: Layer already exists 
read tcp [2a00:79e0:2:11:cdf:6d7d:727b:913e]:37568->[2a00:1450:400c:c06::52]:443: use of closed network connection

@krzyzacy @BenTheElder Is this somehow related to that docker auth issue? Leads on how to fix it?

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

I tried it. Getting:

ERROR: (gcloud.beta.auth.configure-docker) Error writing Docker configuration to disk: [Errno 2] No such file or directory: '/usr/local/google/home/shyamjvs/.docker/tmptYLal3'

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

Ok.. I think I fixed it by creating that dir. Let me try pushing now.

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

Still fails, but with:

unexpected EOF

This comment has been minimized.

@krzyzacy

krzyzacy Mar 8, 2018

Member

I think that's the gcloud docker -- push vs docker push issue again - have you upgraded your workstation yet?

This comment has been minimized.

@shyamjvs

shyamjvs Mar 8, 2018

Member

I finally managed to make it work by replacing docker with gcloud docker and using the right credentials allowing me to push.

@jpbetz - We're good to go here now.

@krzyzacy

This comment has been minimized.

Member

krzyzacy commented Mar 8, 2018

oh, forgot to mention default make build need docker-in-docker, seems that's resolved, thanks @dims for catching this :-)

@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Mar 8, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 8, 2018

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Mar 8, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, shyamjvs, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Mar 8, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@jdumars

This comment has been minimized.

Member

jdumars commented Mar 8, 2018

/test pull-kubernetes-e2e-gke
/test pull-kubernetes-e2e-gce-large-performance

@shyamjvs

This comment has been minimized.

Member

shyamjvs commented Mar 8, 2018

@jdumars pull-kubernetes-e2e-gce-large-performance runs tests against a 2k-node cluster. Triggering it right now will mostly fail as I'm already running a 5k-node cluster manually in that project and we don't have enough quota to accommodate both :)

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Mar 8, 2018

Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 56195fd into kubernetes:master Mar 8, 2018

11 of 15 checks passed

pull-kubernetes-e2e-gce Job failed.
Details
Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce
Details
pull-kubernetes-e2e-gce-large-performance Job triggered.
Details
pull-kubernetes-e2e-gke Job triggered.
Details
cla/linuxfoundation shyamjvs authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details

@shyamjvs shyamjvs deleted the shyamjvs:go-back-to-etcd-3.1.10 branch Mar 8, 2018

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Mar 9, 2018

@shyamjvs: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce ba6bb99 link /test pull-kubernetes-e2e-gce
pull-kubernetes-e2e-gke ba6bb99 link /test pull-kubernetes-e2e-gke
pull-kubernetes-e2e-gce-large-performance ba6bb99 link /test pull-kubernetes-e2e-gce-large-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment