cluster/gce/upgrade.sh fails to upgrade nodes when run from a mac #37474

Closed
roberthbailey opened this issue Nov 25, 2016 · 13 comments · Fixed by #37562
Labels
area/upgrade priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@roberthbailey
Contributor

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version): 1.5 beta 2

Environment:

  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): CVM
  • Kernel (e.g. uname -a):
  • Install tools: kube-up.sh

What happened: Step 6 of https://docs.google.com/document/d/19Q4AzWLD5jd2FNaPyKy2xdTN4JIGUUpvwDwg0tcBkyc/edit# fails

What you expected to happen: Node upgrade to succeed.

How to reproduce it (as minimally and precisely as possible): Follow the manual upgrade steps outlined in the linked document.

Anything else we need to know: The error occurs because one metadata value is too large for the new node instance template.

ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.
@roberthbailey
Contributor Author

More complete error output:

Attempt 1 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Attempt 1 failed to create instance template kubernetes-minion-template-v1-5-0-beta-2. Retrying.

Attempt 2 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Attempt 2 failed to create instance template kubernetes-minion-template-v1-5-0-beta-2. Retrying.

Attempt 3 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Attempt 3 failed to create instance template kubernetes-minion-template-v1-5-0-beta-2. Retrying.

Attempt 4 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Attempt 4 failed to create instance template kubernetes-minion-template-v1-5-0-beta-2. Retrying.


Attempt 5 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Attempt 5 failed to create instance template kubernetes-minion-template-v1-5-0-beta-2. Retrying.


Attempt 6 to create kubernetes-minion-template-v1-5-0-beta-2
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
ERROR: (gcloud.compute.instance-templates.create) Some requests did not succeed:
 - Value for field 'resource.properties.metadata.items[2].value' is too large: maximum size 32768 character(s); actual size 33595.

Failed to create instance template kubernetes-minion-template-v1-5-0-beta-2 

@soltysh
Contributor

soltysh commented Nov 25, 2016

This is blocking testing and, per email, I'm bumping this to P0.

@soltysh soltysh added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/upgrade priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker labels Nov 25, 2016
@wojtek-t
Member

Does it mean that "node-kube-env.yaml" is too large?
https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/debian/node-helper.sh#L25

It seems like that to me. If so, we should understand why those are different than on startup...

@paralin
Contributor

paralin commented Nov 26, 2016

This is broken for me too:

@kubernetes-master ~ $ sudo systemctl status kube-master-installation -l
● kube-master-installation.service - Download and install k8s binaries and configurations
   Loaded: loaded (/etc/systemd/system/kube-master-installation.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sat 2016-11-26 00:47:08 UTC; 2min 26s ago
  Process: 1070 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=1/FAILURE)
  Process: 1066 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 1062 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
  Process: 1058 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1054 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 1050 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
 Main PID: 1070 (code=exited, status=1/FAILURE)

Nov 26 00:47:06 kubernetes-master configure.sh[1070]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 26 00:47:06 kubernetes-master configure.sh[1070]: Dload  Upload   Total   Spent    Left  Speed
Nov 26 00:47:07 kubernetes-master configure.sh[1070]: [155B blob data]
Nov 26 00:47:07 kubernetes-master configure.sh[1070]: == Downloaded https://storage.googleapis.com/kubernetes-release/network-plugins/cni-07a8a28637e97b22eb8dfe710eeae1344f69d16e.tar.gz (SHA1 = 19d49f7b2b99cd2493d5ae0ace896c64e289ccbb) ==
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: Downloading k8s manifests tar
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: Dload  Upload   Total   Spent    Left  Speed
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: [155B blob data]
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: == Downloaded https://storage.googleapis.com/kubernetes-release/release/v1.4.6/kubernetes-manifests.tar.gz (SHA1 = e9c52530a14612c91f45e017743925a0dba6dcc8) ==
Nov 26 00:47:08 kubernetes-master configure.sh[1070]: cp: cannot stat '/home/kubernetes/kube-manifests/kubernetes/gci-trusty/gci-mounter': No such file or directory

After an upgrade.

@saad-ali saad-ali added this to the v1.5 milestone Nov 26, 2016
@dims
Member

dims commented Nov 27, 2016

@soltysh @wojtek-t @paralin: so who is going to be the assignee and come up with a plan of attack? :)

@roberthbailey
Contributor Author

The problem is that configure-vm.sh is too large. In my $KUBE_TEMP directory:

$ wc *
       1       1      11 cluster-name.txt
    1036    3546   33595 configure-vm.sh
      55     110   11863 node-kube-env.yaml
    1092    3657   45469 total

You can see that the number of bytes in configure-vm.sh (33595) matches the output in the error message (actual size 33595).
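
The same comparison can be scripted. This is a minimal sketch, assuming the 32768 limit quoted in the error message and assuming the file is plain ASCII, so that the byte count from `wc -c` equals the character count the API reports. The function name `check_metadata_size` is hypothetical and is not part of the cluster scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a file fits within the per-value
# metadata limit quoted in the error message (32768 characters).
METADATA_VALUE_LIMIT=32768

check_metadata_size() {
  local file="$1"
  local size
  # wc -c counts bytes; reading from stdin avoids printing the file name,
  # and tr strips the padding that BSD wc adds around the number.
  size=$(wc -c < "${file}" | tr -d '[:space:]')
  if (( size > METADATA_VALUE_LIMIT )); then
    echo "${file}: ${size} bytes, $(( size - METADATA_VALUE_LIMIT )) over the limit"
    return 1
  fi
  echo "${file}: ${size} bytes, within the limit"
}
```

Run as `check_metadata_size "${KUBE_TEMP}/configure-vm.sh"`, it would flag the 33595-byte file above as 827 bytes over the limit.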

What is interesting is that configure-vm.sh in the repo is actually larger. In the 1.5 branch:

$ wc configure-vm.sh 
    1120    4289   37962 configure-vm.sh

and in the 1.4 branch:

$ wc configure-vm.sh 
    1090    4178   36917 configure-vm.sh

But we aren't seeing an error during new cluster creation in either release branch.

@roberthbailey
Contributor Author

In the $KUBE_TEMP directory when creating a new cluster from the 1.4 branch:

$ wc configure-vm.sh 
    1006    3435   32550 configure-vm.sh

which is below the size limit.

@roberthbailey
Contributor Author

Here's the diff between the two configure-vm.sh scripts:

$ diff configure-vm-1.4-new.sh configure-vm-1.5-upgrade.sh 
400d399
< dns_replicas: '$(echo "$DNS_REPLICAS" | sed -e "s/'/''/g")'
402a402
> enable_dns_horizontal_autoscaler: '$(echo "$ENABLE_DNS_HORIZONTAL_AUTOSCALER" | sed -e "s/'/''/g")'
404d403
< storage_backend: '$(echo "$STORAGE_BACKEND" | sed -e "s/'/''/g")'
418a418
> 
420a421,425
>     if [ -n "${STORAGE_BACKEND:-}" ]; then
>       cat <<EOF >>/srv/salt-overlay/pillar/cluster-params.sls
> storage_backend: '$(echo "$STORAGE_BACKEND" | sed -e "s/'/''/g")'
> EOF
>     fi
440a446,454
>     if [[ -n "${ETCD_CA_KEY:-}" && -n "${ETCD_CA_CERT:-}" && -n "${ETCD_PEER_KEY:-}" && -n "${ETCD_PEER_CERT:-}" ]]; then
>       cat <<EOF >>/srv/salt-overlay/pillar/cluster-params.sls
> etcd_over_ssl: 'true'
> EOF
>     else
>       cat <<EOF >>/srv/salt-overlay/pillar/cluster-params.sls
> etcd_over_ssl: 'false'
> EOF
>     fi
877d890
<   cbr-cidr: 10.123.45.0/29
896d908
<   cbr-cidr: 10.123.45.0/29
953c965,973
<   salt-call --local state.highstate || true
---
>   local rc=0
>   for i in {0..6}; do
>     salt-call --local state.highstate && rc=0 || rc=$?
>     if [[ "${rc}" == 0 ]]; then
>       return 0
>     fi
>   done
>   echo "Salt failed to run repeatedly" >&2
>   return "${rc}"
966a987,995
> function create-salt-master-etcd-auth {
>   if [[ -n "${ETCD_CA_CERT:-}" && -n "${ETCD_PEER_KEY:-}" && -n "${ETCD_PEER_CERT:-}" ]]; then
>     local -r auth_dir="/srv/kubernetes"
>     echo "${ETCD_CA_CERT}" | base64 --decode | gunzip > "${auth_dir}/etcd-ca.crt"
>     echo "${ETCD_PEER_KEY}" | base64 --decode > "${auth_dir}/etcd-peer.key"
>     echo "${ETCD_PEER_CERT}" | base64 --decode | gunzip > "${auth_dir}/etcd-peer.crt"
>   fi
> }
> 
982a1012
>     create-salt-master-etcd-auth

which makes it appear that #35516 is the culprit.

@jszczepkowski

@roberthbailey
Contributor Author

This may be a mac issue -- when I manually run the sed command from the prepare-startup-script function it makes no changes. But if I run gsed instead of sed then it strips a bunch of comments, reducing the file size below the allowed limit:

$ wc *
     978    2951   30107 configure-vm-1.5-gsed.sh
    1036    3546   33595 configure-vm-1.5-upgrade.sh
    2014    6497   63702 total
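
For illustration only: the expression used by prepare-startup-script is not quoted in this thread, so the following is a hypothetical alternative rather than the script's actual command. It sketches how full-line comments could be stripped using only POSIX character classes, which both BSD sed (the Mac default) and GNU sed accept. The name `strip_comments` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a script with full-line comments removed.
strip_comments() {
  # First expression: delete lines consisting only of optional whitespace and "#".
  # Second expression: delete lines whose first non-blank character is "#",
  # unless the "#" is followed by "!" (so a shebang line is kept).
  sed -e '/^[[:space:]]*#$/d' -e '/^[[:space:]]*#[^!]/d' "$1"
}
```

Trailing comments after code are left alone, since removing them safely would require parsing shell quoting.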

@roberthbailey
Contributor Author

We should be able to do the same thing we do in this file https://github.com/kubernetes/kubernetes/blob/master/hack/make-rules/test-cmd.sh#L177 to fix it.
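
A rough sketch of that kind of fix, assuming GNU sed is installed on the Mac under the name `gsed` (for example via a package manager). The function name `pick_gnu_sed` is hypothetical, and this is not necessarily the code in the linked file or in the eventual pull request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of a GNU sed binary, preferring gsed.
pick_gnu_sed() {
  local candidate
  for candidate in gsed sed; do
    # GNU sed answers --version with a banner containing "GNU";
    # BSD sed (the macOS default) rejects the flag.
    if command -v "${candidate}" >/dev/null 2>&1 &&
       "${candidate}" --version 2>/dev/null | grep -q GNU; then
      echo "${candidate}"
      return 0
    fi
  done
  echo "GNU sed is required; on a Mac it is commonly installed as gsed" >&2
  return 1
}
```

A caller could then write `SED=$(pick_gnu_sed) || exit 1` once and use `"${SED}"` wherever the script previously called `sed` directly.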

@roberthbailey
Contributor Author

I created #37562 but haven't had a chance to test it yet.

@davidopp
Member

(Master and) node upgrade worked for me on Ubuntu, so this does indeed seem to be Mac-specific.

@roberthbailey roberthbailey changed the title from "cluster/gce/upgrade.sh fails to upgrade nodes" to "cluster/gce/upgrade.sh fails to upgrade nodes when run from a mac" Nov 29, 2016
@roberthbailey
Contributor Author

I tested #37562 and it looks like it fixed my issue. I suppose I'll need to cherry pick it into the 1.5 branch once it merges to master.

k8s-github-robot pushed a commit that referenced this issue Nov 30, 2016
Automatic merge from submit-queue

Use gsed on the mac.

**What this PR does / why we need it**: Fixes node upgrades when run from a mac

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #37474

**Special notes for your reviewer**: