Master node will not restart on AWS #16188

Closed · jvalencia opened this issue Oct 23, 2015 · 13 comments

@jvalencia (Contributor)

If you create a cluster on AWS using kube-up.sh, it uses cloud-init to set the box up. On a restart of the node, the script fails. This is true for both 1.0.6 and 1.2.0-alpha.2. From the system log:

cloud-init-nonet[3.58]: static networking is now up
 * Starting configure network device [ OK ]
Cloud-init v. 0.7.5 running 'init' at Fri, 23 Oct 2015 20:47:55 +0000. Up 3.76 seconds.
ci-info: +++++++++++++++++++++++++Net device info++++++++++++++++++++++++++
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: | Device |  Up  |  Address   |      Mask     |     Hw-Address    |
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: |   lo   | True | 127.0.0.1  |   255.0.0.0   |         .         |
ci-info: |  eth0  | True | 172.20.0.9 | 255.255.255.0 | 06:f0:64:06:03:ff |
ci-info: +--------+------+------------+---------------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 172.20.0.1 |    0.0.0.0    |    eth0   |   UG  |
ci-info: |   1   |  172.20.0.0 |  0.0.0.0   | 255.255.255.0 |    eth0   |   U   |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
 * Stopping cold plug devices [ OK ]
 * Stopping log initial device creation [ OK ]
 * Starting enable remaining boot-time encrypted block devices [ OK ]
The disk drive for /mnt/ephemeral is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery
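
(Independent of the underlying LVM bug, the boot hang itself comes from mountall waiting on the missing /mnt/ephemeral mount. A minimal sketch of a more forgiving fstab entry follows; it assumes the setup scripts add an entry for the ephemeral LV, and the exact entry on your image may differ:)

# /etc/fstab: do not block boot if the ephemeral LV is missing.
# nobootwait is honoured by Ubuntu's mountall; use nofail on systemd hosts.
/dev/vg-ephemeral/ephemeral  /mnt/ephemeral  ext4  defaults,nobootwait  0  2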
@jvalencia (Contributor, Author)

It's relevant that the cloud-init scripts passed in through kube-up.sh only run on the first boot, not on subsequent ones.
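
(For context, user-data passed at instance creation is treated as per-instance by cloud-init, so it is skipped on later boots. A sketch of one way to run a remount step on every boot using cloud-init's standard per-boot scripts directory; the script name and the remount commands here are illustrative, not what kube-up.sh actually does:)

#!/bin/bash
# Hypothetical /var/lib/cloud/scripts/per-boot/10-remount-ephemeral.sh
# Scripts in this directory are run by cloud-init on every boot,
# unlike user-data scripts, which run only on the first boot.
vgchange -ay vg-ephemeral                              # re-activate the VG
mountpoint -q /mnt/ephemeral || \
  mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral     # remount if needed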

@jvalencia (Contributor, Author)

Apparently, the LVM volume for Kubernetes is not being created correctly, so on a restart the node can't find it. This is on 1.2 alpha; it may be different on 1.0.6.
This is the output of the LVM creation scripts on first boot:

update-initramfs: Generating /boot/initrd.img-3.13.0-46-generic
  Physical volume "/dev/xvdc" successfully created
  No physical volume label read from /dev/xvdd
  Physical volume "/dev/xvdd" successfully created
  Volume group "vg-ephemeral" successfully created
  Rounding up size to full physical extent 32.00 MiB
  Insufficient free space: 3905 extents needed, but only 3897 available
  One or more specified logical volume(s) not found.
  Invalid argument for --virtualsize: M
  Error during parsing of command line.
mke2fs 1.42.9 (4-Feb-2014)
Could not stat /dev/vg-ephemeral/ephemeral --- No such file or directory

The device apparently does not exist; did you specify it correctly?
mount: special device /dev/vg-ephemeral/ephemeral does not exist

@jvalencia (Contributor, Author)

Some relevant debug info:

root@ip-172-20-0-9:/home/ubuntu# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/xvdc
  VG Name               vg-ephemeral
  PV Size               15.26 GiB / not usable 1.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              3905
  Free PE               3905
  Allocated PE          0
  PV UUID               3kiWOO-1Qi8-3KuR-1zIi-VDGB-YtL2-uJRdJw

  --- Physical volume ---
  PV Name               /dev/xvdd
  VG Name               vg-ephemeral
  PV Size               15.26 GiB / not usable 1.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              3905
  Free PE               3905
  Allocated PE          0
  PV UUID               x1tFVk-eEfx-70Ek-zmZZ-1DZU-Xqdr-xNU318

root@ip-172-20-0-9:/home/ubuntu# vgdisplay 
  --- Volume group ---
  VG Name               vg-ephemeral
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               30.51 GiB
  PE Size               4.00 MiB
  Total PE              7810
  Alloc PE / Size       0 / 0   
  Free  PE / Size       7810 / 30.51 GiB
  VG UUID               pl4cmc-PXWc-bp6N-5K1P-zHAY-Qdit-a62MRc

root@ip-172-20-0-9:/home/ubuntu# lvdisplay 
root@ip-172-20-0-9:/home/ubuntu# 

@jvalencia (Contributor, Author)

Stepping through the code in format-disks.sh, the error starts here (on the non-wheezy code path; the wheezy path looks fine):

root@ip-172-20-0-9:/home/ubuntu# lvcreate -l 100%FREE --thinpool pool-ephemeral vg-ephemeral
  Rounding up size to full physical extent 32.00 MiB
  Insufficient free space: 3905 extents needed, but only 3897 available

Then checking the details:

root@ip-172-20-0-9:/home/ubuntu# vgs -o +vg_free_count
  VG           #PV #LV #SN Attr   VSize  VFree  Free
  vg-ephemeral   2   0   0 wz--n- 30.51g 30.51g 7810
root@ip-172-20-0-9:/home/ubuntu# pvs -o +pv_pe_count,pv_pe_alloc_count
  PV         VG           Fmt  Attr PSize  PFree  PE   Alloc
  /dev/xvdc  vg-ephemeral lvm2 a--  15.25g 15.25g 3905     0
  /dev/xvdd  vg-ephemeral lvm2 a--  15.25g 15.25g 3905     0

@jvalencia (Contributor, Author)

As a test, I ran the wheezy version of the lvcreate command (no thin provisioning) manually and restarted. It worked, and the master came back up correctly.
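
(For reference, the non-thin-provisioned path looks roughly like the following; this is a sketch based on the log output above, not the verbatim format-disks.sh:)

# Plain (non-thin) logical volume across all free extents, as on wheezy.
lvcreate -l 100%FREE -n ephemeral vg-ephemeral
mkfs -t ext4 /dev/vg-ephemeral/ephemeral
mkdir -p /mnt/ephemeral
mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral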

@zaa commented Oct 24, 2015

+1. We have experienced the issue on AWS too. The nodes (Ubuntu) did not come back after reboot.

@stemau98

+1

@jvalencia (Contributor, Author)

Apparently, the thin pool's metadata also needs extents from the volume group, so thin provisioning will not work with 100%FREE (see the sketch after these links):
https://wiki.gentoo.org/wiki/LVM
https://bugzilla.redhat.com/show_bug.cgi?id=812726
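
(In other words, requesting 100%FREE for the data device cannot be satisfied once the pool's metadata LV claims its own extents. A sketch of the kind of workaround involved; the percentage, metadata size, and virtual size below are illustrative, not necessarily what the eventual fix uses:)

# Leave headroom for the thin pool's metadata instead of claiming every extent.
lvcreate -l 99%VG --poolmetadatasize 32M --thinpool pool-ephemeral vg-ephemeral
# Create a thin volume inside the pool, then format and mount it.
lvcreate -V 30G -T vg-ephemeral/pool-ephemeral -n ephemeral
mkfs -t ext4 /dev/vg-ephemeral/ephemeral
mount /dev/vg-ephemeral/ephemeral /mnt/ephemeral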

@ghost commented Oct 26, 2015

/sub @quinton-hoole

@willmore

Will this patch be applied to the 1.1 release branch?

RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
@drora commented Dec 20, 2015

Just hit this issue on v1.1.3: after a master node reboot, I lost control over the cluster.

(The kubectl config was empty, data on the attached EBS storage was probably lost, I wasn't able to regain control or recover with etcd, and I couldn't find more information about how to reconfigure the master back into an existing cluster on AWS.)

kubectl get nodes:
"error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused"

From the kubelet log:
"Unable to write event: 'Post https://{IP}/api/v1/namespaces/default/events: dial tcp {IP}:443: connection refused' (may retry after sleeping)"
"Skipping pod synchronization, network is not configured"

@drora commented Dec 21, 2015

This is a major deal for production use, and it is very easy to reproduce locally (via Vagrant); see the sketch after these steps:

  1. Spin up a cluster using kube-up.
  2. Restart the master VM.
  3. Lose the cluster.
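
(Roughly, assuming the Vagrant provider and its default "master" machine name:)

export KUBERNETES_PROVIDER=vagrant
./cluster/kube-up.sh                        # 1. spin up the cluster
vagrant halt master && vagrant up master    # 2. restart the master VM
./cluster/kubectl.sh get nodes              # 3. now fails with "connection refused"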

@jvalencia (Contributor, Author)

The fix has been pulled into 1.2, but it is not in 1.1.

We run 1.1, but during cluster-up we copy the master branch's cluster/aws/templates/format-disks.sh over the 1.1 version (sketch below).
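
(Concretely, something along these lines before running kube-up.sh; the raw URL is illustrative, and you would want to pin a specific commit you have vetted rather than master:)

# Overwrite the 1.1 template with the fixed version from the master branch.
curl -fsSL -o cluster/aws/templates/format-disks.sh \
  https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/aws/templates/format-disks.sh
KUBERNETES_PROVIDER=aws ./cluster/kube-up.sh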

Labels: none · Projects: none · Development: no branches or pull requests · 7 participants