Master node will not restart on AWS #16188

Closed · jvalencia opened this issue Oct 23, 2015 · 13 comments

@jvalencia (Contributor)

If you create a cluster on AWS using kube-up.sh, it uses cloud-init to set the box up. On a restart of the node, the script fails. This is true for both 1.0.6 and 1.2.0-alpha.2. From the system log:

cloud-init-nonet[3.58]: static networking is now up
 * Starting configure network device [ OK ]
Cloud-init v. 0.7.5 running 'init' at Fri, 23 Oct 2015 20:47:55 +0000. Up 3.76 seconds.
ci-info: +++++++++++++++++++++++++Net device info++++++++++++++++++++++++++
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: | Device |  Up  |  Address   |      Mask     |     Hw-Address    |
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: |   lo   | True | 127.0.0.1  |   255.0.0.0   |         .         |
ci-info: |  eth0  | True | 172.20.0.9 | 255.255.255.0 | 06:f0:64:06:03:ff |
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 172.20.0.1 |    0.0.0.0    |    eth0   |   UG  |
ci-info: |   1   |  172.20.0.0 |  0.0.0.0   | 255.255.255.0 |    eth0   |   U   |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
 * Stopping cold plug devices [ OK ]
 * Stopping log initial device creation [ OK ]
 * Starting enable remaining boot-time encrypted block devices [ OK ]
The disk drive for /mnt/ephemeral is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery
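
(Independent of the underlying LVM bug, the boot hang itself comes from mountall waiting on the missing /mnt/ephemeral mount. A minimal sketch of a more forgiving fstab entry follows; it assumes the setup scripts add an entry for the ephemeral LV, and the exact entry on your image may differ:)

# /etc/fstab: do not block boot if the ephemeral LV is missing.
# nobootwait is honoured by Ubuntu's mountall; use nofail on systemd hosts.
/dev/vg-ephemeral/ephemeral  /mnt/ephemeral  ext4  defaults,nobootwait  0  2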
@jvalencia (Contributor, Author)

It's relevant that the cloud-init scripts passed in through kube-up.sh only run on the first boot, not on subsequent ones.
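
(For context, user-data passed at instance creation is treated as per-instance by cloud-init, so it is skipped on later boots. A sketch of one way to run a remount step on every boot using cloud-init's standard per-boot scripts directory; the script name and the remount commands here are illustrative, not what kube-up.sh actually does:)

#!/bin/bash
# Hypothetical /var/lib/cloud/scripts/per-boot/10-remount-ephemeral.sh
# Scripts in this directory are run by cloud-init on every boot,
# unlike user-data scripts, which run only on the first boot.
vgchange -ay vg-ephemeral                              # re-activate the VG
mountpoint -q /mnt/ephemeral || \
  mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral     # remount if needed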

@jvalencia (Contributor, Author)

Apparently, the LVM volume for Kubernetes is not being created correctly, so on a restart the node can't find it. This is on 1.2 alpha; it may be different on 1.0.6.
This is the output of the LVM creation scripts on first boot:

update-initramfs: Generating /boot/initrd.img-3.13.0-46-generic
  Physical volume "/dev/xvdc" successfully created
  No physical volume label read from /dev/xvdd
  Physical volume "/dev/xvdd" successfully created
  Volume group "vg-ephemeral" successfully created
  Rounding up size to full physical extent 32.00 MiB
  Insufficient free space: 3905 extents needed, but only 3897 available
  One or more specified logical volume(s) not found.
  Invalid argument for --virtualsize: M
  Error during parsing of command line.
mke2fs 1.42.9 (4-Feb-2014)
Could not stat /dev/vg-ephemeral/ephemeral --- No such file or directory

The device apparently does not exist; did you specify it correctly?
mount: special device /dev/vg-ephemeral/ephemeral does not exist

@jvalencia (Contributor, Author)

Some relevant debug info:

root@ip-172-20-0-9:/home/ubuntu# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/xvdc
  VG Name               vg-ephemeral
  PV Size               15.26 GiB / not usable 1.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              3905
  Free PE               3905
  Allocated PE          0
  PV UUID               3kiWOO-1Qi8-3KuR-1zIi-VDGB-YtL2-uJRdJw

  --- Physical volume ---
  PV Name               /dev/xvdd
  VG Name               vg-ephemeral
  PV Size               15.26 GiB / not usable 1.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              3905
  Free PE               3905
  Allocated PE          0
  PV UUID               x1tFVk-eEfx-70Ek-zmZZ-1DZU-Xqdr-xNU318

root@ip-172-20-0-9:/home/ubuntu# vgdisplay 
  --- Volume group ---
  VG Name               vg-ephemeral
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               30.51 GiB
  PE Size               4.00 MiB
  Total PE              7810
  Alloc PE / Size       0 / 0   
  Free  PE / Size       7810 / 30.51 GiB
  VG UUID               pl4cmc-PXWc-bp6N-5K1P-zHAY-Qdit-a62MRc

root@ip-172-20-0-9:/home/ubuntu# lvdisplay 
root@ip-172-20-0-9:/home/ubuntu# 

@jvalencia (Contributor, Author)

Stepping through the code in format-disks.sh, the error starts here (on the non-wheezy code path; the wheezy path looks fine):

root@ip-172-20-0-9:/home/ubuntu# lvcreate -l 100%FREE --thinpool pool-ephemeral vg-ephemeral
  Rounding up size to full physical extent 32.00 MiB
  Insufficient free space: 3905 extents needed, but only 3897 available

Then checking the details:

root@ip-172-20-0-9:/home/ubuntu# vgs -o +vg_free_count
  VG           #PV #LV #SN Attr   VSize  VFree  Free
  vg-ephemeral   2   0   0 wz--n- 30.51g 30.51g 7810
root@ip-172-20-0-9:/home/ubuntu# pvs -o +pv_pe_count,pv_pe_alloc_count
  PV         VG           Fmt  Attr PSize  PFree  PE   Alloc
  /dev/xvdc  vg-ephemeral lvm2 a--  15.25g 15.25g 3905     0
  /dev/xvdd  vg-ephemeral lvm2 a--  15.25g 15.25g 3905     0

@jvalencia (Contributor, Author)

As a test, I ran the wheezy version of the lvcreate command (no thin provisioning) manually and restarted. It worked, and the master came back up correctly.
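
(For reference, the non-thin-provisioned path looks roughly like the following; this is a sketch based on the log output above, not the verbatim format-disks.sh:)

# Plain (non-thin) logical volume across all free extents, as on wheezy.
lvcreate -l 100%FREE -n ephemeral vg-ephemeral
mkfs -t ext4 /dev/vg-ephemeral/ephemeral
mkdir -p /mnt/ephemeral
mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral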

@zaa commented Oct 24, 2015

+1. We have experienced the issue on AWS too. The nodes (Ubuntu) did not come back after reboot.

@stemau98

+1

@jvalencia (Contributor, Author)

Apparently, the thin pool's metadata also needs extents from the volume group, so thin provisioning will not work with 100%FREE (see the sketch after these links):
https://wiki.gentoo.org/wiki/LVM
https://bugzilla.redhat.com/show_bug.cgi?id=812726
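
(In other words, requesting 100%FREE for the data device cannot be satisfied once the pool's metadata LV claims its own extents. A sketch of the kind of workaround involved; the percentage, metadata size, and virtual size below are illustrative, not necessarily what the eventual fix uses:)

# Leave headroom for the thin pool's metadata instead of claiming every extent.
lvcreate -l 99%VG --poolmetadatasize 32M --thinpool pool-ephemeral vg-ephemeral
# Create a thin volume inside the pool, then format and mount it.
lvcreate -V 30G -T vg-ephemeral/pool-ephemeral -n ephemeral
mkfs -t ext4 /dev/vg-ephemeral/ephemeral
mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral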

@ghost commented Oct 26, 2015

/sub @quinton-hoole

@willmore

Will this patch be applied to the 1.1 release branch?

RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
@drora commented Dec 20, 2015

Just hit this issue on v1.1.3: after a master node reboot, I lost control over the cluster.

(The kubectl config was empty, data on the attached EBS storage was probably lost, I wasn't able to regain control or recover with etcd, and I couldn't find more information about how to reconfigure the master back into an existing cluster on AWS.)

kubectl get nodes:
"error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused"

From the kubelet log:
"Unable to write event: 'Post https://{IP}/api/v1/namespaces/default/events: dial tcp {IP}:443: connection refused' (may retry after sleeping)"
"Skipping pod synchronization, network is not configured"

@drora commented Dec 21, 2015

This is a major deal for production use, and it is very easy to reproduce locally (via Vagrant); see the sketch after these steps:

  1. Spin up a cluster using kube-up.
  2. Restart the master VM.
  3. Lose the cluster.
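
(Roughly, assuming the Vagrant provider and its default "master" machine name:)

export KUBERNETES_PROVIDER=vagrant
./cluster/kube-up.sh                        # 1. spin up the cluster
vagrant halt master && vagrant up master    # 2. restart the master VM
./cluster/kubectl.sh get nodes              # 3. now fails with "connection refused"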

@jvalencia (Contributor, Author)

The fix has been pulled into 1.2, but it is not in 1.1.

We run 1.1, but during cluster-up we copy the master branch's cluster/aws/templates/format-disks.sh over the 1.1 version (sketch below).
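
(Concretely, something along these lines before running kube-up.sh; the raw URL is illustrative, and you would want to pin a specific commit you have vetted rather than master:)

# Overwrite the 1.1 template with the fixed version from the master branch.
curl -fsSL -o cluster/aws/templates/format-disks.sh \
  https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/aws/templates/format-disks.sh
KUBERNETES_PROVIDER=aws ./cluster/kube-up.sh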

Labels: none · Projects: none · Development: no branches or pull requests · 7 participants