cluster/gce/upgrade.sh taking a long time to run #37257

Closed
rkouj opened this issue Nov 22, 2016 · 12 comments · Fixed by #37358
Labels: area/upgrade, priority/critical-urgent, release-blocker, sig/cluster-lifecycle
Milestone: v1.5

Comments

@rkouj
Contributor

rkouj commented Nov 22, 2016

This is for manual upgrade testing:

I am running the upgrade.sh script on the release-1.5 branch, and it has been running for more than 40 minutes with no sign of completion.
Has anyone seen a similar issue, or been able to run the upgrade script successfully?

Command I ran: cluster/gce/upgrade.sh -M v1.5.0-beta.1

Output

rkouj@rkouj0:~/go/src/k8s.io/kubernetes$ cluster/gce/upgrade.sh -M v1.5.0-beta.1
== Pre-Upgrade Node OS and Kubelet Versions ==
name: "kubernetes-master", osImage: "Google Container-VM Image", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-7593", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-bigx", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-q5uf", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
Your active configuration is: [default]

Project: rkouj-test1
Zone: us-central1-b
INSTANCE_GROUPS=kubernetes-minion-group
NODE_NAMES=kubernetes-minion-group-7593 kubernetes-minion-group-bigx kubernetes-minion-group-q5uf
== Upgrading master to 'https://storage.googleapis.com/kubernetes-release/release/v1.5.0-beta.1/kubernetes-server-linux-amd64.tar.gz'. Do not interrupt, deleting master instance. ==
Trying to find master named 'kubernetes-master'
Looking for address 'kubernetes-master-ip'
Using master: kubernetes-master (external IP: 104.197.149.96)
Deleted [https://www.googleapis.com/compute/v1/projects/rkouj-test1/zones/us-central1-b/instances/kubernetes-master].
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
Created [https://www.googleapis.com/compute/v1/projects/rkouj-test1/zones/us-central1-b/instances/kubernetes-master].
NAME               ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
kubernetes-master  us-central1-b  n1-standard-1               10.128.0.2   104.197.149.96  RUNNING
== Waiting for new master to respond to API requests ==
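
The log stops at the wait step and never moves on. As a rough illustration of how a wait like this could be bounded so that it fails instead of hanging, here is a minimal sketch. It is not the code in upgrade.sh; `wait_until` is a hypothetical helper, and the `MASTER_IP` variable and `/healthz` path in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or until
# TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "${interval}"
  done
}

# Hypothetical usage (MASTER_IP and /healthz are assumptions):
# wait_until 600 5 curl --insecure --fail --silent --max-time 5 \
#   --output /dev/null "https://${MASTER_IP}/healthz"
```

With a bound like this, a master that never comes up would surface as an error after the timeout rather than as a script that appears to run forever.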


saad-ali added this to the v1.5 milestone Nov 22, 2016
@mtaufen
Contributor

mtaufen commented Nov 22, 2016

That sounds long to me. Do you see the same behavior when you run it without the -M flag?

@rkouj
Contributor Author

rkouj commented Nov 22, 2016

@mtaufen: Tried it without the -M flag. No change.

rkouj@rkouj0:~/go/src/k8s.io/kubernetes$ cluster/gce/upgrade.sh v1.5.0-beta.1
== Pre-Upgrade Node OS and Kubelet Versions ==
name: "kubernetes-master", osImage: "Google Container-VM Image", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-7kpl", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-l2on", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
name: "kubernetes-minion-group-lzgi", osImage: "Debian GNU/Linux 7 (wheezy)", kubeletVersion: "v1.4.7-beta.0.2+1ef121737093fd-dirty"
Your active configuration is: [default]

Project: rkouj-test1
Zone: us-central1-b
INSTANCE_GROUPS=kubernetes-minion-group
NODE_NAMES=kubernetes-minion-group-7kpl kubernetes-minion-group-l2on kubernetes-minion-group-lzgi
== Upgrading master to 'https://storage.googleapis.com/kubernetes-release/release/v1.5.0-beta.1/kubernetes-server-linux-amd64.tar.gz'. Do not interrupt, deleting master instance. ==
Trying to find master named 'kubernetes-master'
Looking for address 'kubernetes-master-ip'
Using master: kubernetes-master (external IP: 23.236.62.71)
Deleted [https://www.googleapis.com/compute/v1/projects/rkouj-test1/zones/us-central1-b/instances/kubernetes-master].
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
Created [https://www.googleapis.com/compute/v1/projects/rkouj-test1/zones/us-central1-b/instances/kubernetes-master].
NAME               ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP   STATUS
kubernetes-master  us-central1-b  n1-standard-1               10.128.0.2   23.236.62.71  RUNNING
== Waiting for new master to respond to API requests ==


@jpeeler
Contributor

jpeeler commented Nov 22, 2016

I'm seeing this problem as well. And I saw this on two of the hosts:

[ 3870.180529] Memory cgroup out of memory: Kill process 4697 (google-fluentd) score 1923 or sacrifice child
[ 3870.191608] Killed process 4697 (google-fluentd) total-vm:1224416kB, anon-rss:186420kB, file-rss:8068kB

davidopp added the priority/critical-urgent and release-blocker labels Nov 22, 2016
@rkouj
Contributor Author

rkouj commented Nov 22, 2016

cc: @kubernetes/sig-node @dchen1107

@rkouj
Contributor Author

rkouj commented Nov 22, 2016

After syncing up with @jpeeler, this is what we saw for the master configuration service:

rkouj@kubernetes-master /var/log $ sudo systemctl status kube-master-configuration
● kube-master-configuration.service - Configure kubernetes master
   Loaded: loaded (/etc/systemd/system/kube-master-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2016-11-22 21:06:55 UTC; 1h 24min ago
  Process: 1145 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=1/FAILURE)
  Process: 1140 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/mounter (code=exited, status=0/SUCCESS)
  Process: 1136 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
 Main PID: 1145 (code=exited, status=1/FAILURE)

@krousey
Contributor

krousey commented Nov 22, 2016

cc @roberthbailey

dchen1107 added the sig/cluster-lifecycle label Nov 22, 2016
@dchen1107
Member

This is a cluster lifecycle integration issue. I will leave @roberthbailey to drive this. cc/ @dashpole for the support from the node side. Thanks!

@yujuhong
Contributor

Wrong syntax introduced by #36346

https://github.com/kubernetes/kubernetes/blob/v1.5.0-beta.1/cluster/gce/gci/configure-helper.sh#L807

  else [ -n "${MASTER_ADVERTISE_ADDRESS:-}" ]
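
For readers unfamiliar with this failure mode: in bash, `else` does not take a condition, so a bracket test written after it runs as an ordinary command inside the else body. The sketch below is a hypothetical reproduction, not the actual contents of configure-helper.sh. `EXAMPLE_SSH_USER`, `buggy`, `fixed`, and `run_snippet` are illustrative names, and it assumes the script runs with errexit enabled and that the intended form was `elif ...; then`.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction. Only MASTER_ADVERTISE_ADDRESS comes from the
# quoted line; every other name here is made up for illustration.

# Buggy shape: the test after `else` is a plain command in the else body.
# Assuming errexit is on, a false test ends the script with a non-zero status.
buggy='
set -o errexit
params=""
if [ -n "${EXAMPLE_SSH_USER:-}" ]; then
  params="--ssh-user=${EXAMPLE_SSH_USER}"
else [ -n "${MASTER_ADVERTISE_ADDRESS:-}" ]
  params="--advertise-address=${MASTER_ADVERTISE_ADDRESS}"
fi
echo "params=[${params}]"
'

# Assumed intent: a real second condition, written as elif ... then.
fixed='
set -o errexit
params=""
if [ -n "${EXAMPLE_SSH_USER:-}" ]; then
  params="--ssh-user=${EXAMPLE_SSH_USER}"
elif [ -n "${MASTER_ADVERTISE_ADDRESS:-}" ]; then
  params="--advertise-address=${MASTER_ADVERTISE_ADDRESS}"
fi
echo "params=[${params}]"
'

# Run a snippet in a fresh child shell with both variables unset, so the
# caller's environment and shell options do not affect the result.
run_snippet() {
  env -u EXAMPLE_SSH_USER -u MASTER_ADVERTISE_ADDRESS bash -c "$1"
}
```

With both variables unset, `run_snippet "$buggy"` exits non-zero without printing anything, while `run_snippet "$fixed"` prints `params=[]`. That behaviour would be consistent with the status=1/FAILURE reported for configure-helper.sh above, though the exact mechanism in the real script is an assumption here.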

@roberthbailey
Contributor

roberthbailey commented Nov 23, 2016

@yujuhong's analysis was correct. I was able to ssh into my upgraded Kubernetes master, manually edit /home/kubernetes/bin/configure-helper.sh to fix the shell syntax, unmount the master pd (sudo umount /mnt/disks/master-pd), and re-run sudo /home/kubernetes/bin/configure-helper.sh. Once that finished, my master upgrade, which had been stuck waiting for a long time, succeeded:

== Waiting for new master to respond to API requests ==
== Done ==

The bad news is that configure-helper.sh is baked into the k8s release tar, so we'll need to cut a new beta with a fix before we can resume manual upgrade testing.
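
The manual recovery described in this comment could be scripted roughly as follows, to be run on the master after configure-helper.sh has been fixed by hand. This is only a sketch: the two commands and paths are the ones quoted above, while `run`, `recover_master`, and the `DRY_RUN` switch are hypothetical additions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch of the recovery steps; assumes the syntax error in
# /home/kubernetes/bin/configure-helper.sh has already been corrected.
recover_master() {
  # Unmount the master persistent disk left mounted by the failed run.
  run sudo umount /mnt/disks/master-pd
  # Re-run the master configuration script.
  run sudo /home/kubernetes/bin/configure-helper.sh
}
```

Calling `DRY_RUN=1 recover_master` only prints the two commands, which is a safe way to check them before running anything on a live master.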

@davidopp
Member

Thanks for the investigation, @yujuhong and @roberthbailey! I LGTM'd the PR that @roberthbailey sent with the fix.

k8s-github-robot pushed a commit that referenced this issue Nov 23, 2016
Automatic merge from submit-queue

Fix an else branch in trusty/configure-helper.sh

Similar to #37358, for fixing #37257 on trusty.

k8s-github-robot pushed a commit that referenced this issue Nov 23, 2016
Automatic merge from submit-queue

Fix an else branch in configure-helper.sh

**What this PR does / why we need it**: bug fix for upgrade.sh needed in 1.5

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #37257

@roberthbailey
Contributor

FYI this won't actually be fixed until this PR is in the 1.5 release branch and we cut a beta release. At that point we will update the instructions in https://docs.google.com/document/d/19Q4AzWLD5jd2FNaPyKy2xdTN4JIGUUpvwDwg0tcBkyc/edit# to point to v1.5.0-beta.2 and upgrade.sh should be working properly again.

@davidopp
Member

Agreed; I'll take care of doing that once I see the anago announcement of the new beta.
