
Protokube fails to mount etcd master volume in Kubernetes v1.8 #3552

Closed
KashifSaadat opened this issue Oct 6, 2017 · 7 comments

@KashifSaadat (Contributor)

Steps to recreate:

  1. Build kops, nodeup & protokube off master (I've tested at commit 51d3cde)
  2. Set your environment variables (NODEUP_URL, PROTOKUBE_IMAGE)
  3. Create a Kubernetes cluster (standard options should be fine)
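For step 2, the environment setup looks roughly like this (the bucket name and version path here are illustrative, not the ones actually used):

```shell
# Hypothetical bucket and version path; substitute your own build artifacts.
export MY_BUCKET=my-kops-dev-bucket
export KOPS_BASE_URL=https://s3.amazonaws.com/${MY_BUCKET}/kops/1.8.0-alpha.1/
export NODEUP_URL=${KOPS_BASE_URL}linux/amd64/nodeup
export PROTOKUBE_IMAGE=${KOPS_BASE_URL}images/protokube.tar.gz
echo "$NODEUP_URL"
```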

The cluster will fail to start. Logging into a master node, you can see some run-*.scope service failures, and one of the etcd volumes fails to mount (events is fine; main does not mount successfully). Near the top of the protokube logs I can see:

volume_mounter.go:135] Creating mount directory "/rootfs/mnt/master-vol-0fd788d4873320c84"
volume_mounter.go:140] Mounting device "/rootfs/dev/xvdu" on "/rootfs/mnt/master-vol-0fd788d4873320c84"
mount_linux.go:366] Checking for issues with fsck on disk: /rootfs/dev/xvdu
mount_linux.go:379] `fsck` error fsck from util-linux 2.25.2
fsck.ext2: Bad magic number in super-block while trying to open /rootfs/dev/xvdu
/rootfs/dev/xvdu:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

mount_linux.go:384] Attempting to mount disk:  /rootfs/dev/xvdu /rootfs/mnt/master-vol-0fd788d4873320c84
mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0fd788d4873320c84 --scope -- mount -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0fd788d4873320c84])
mount_linux.go:140] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /rootfs/mnt/master-vol-0fd788d4873320c84 --scope -- mount -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0fd788d4873320c84
Output: Running as unit run-14.scope.
mount: wrong fs type, bad option, bad superblock on /rootfs/dev/xvdu,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

mount_linux.go:430] Attempting to determine if disk "/rootfs/dev/xvdu" is formatted using lsblk with args: ([-n -o FSTYPE /rootfs/dev/xvdu])
mount_linux.go:433] Output: "\n"
mount_linux.go:404] Disk "/rootfs/dev/xvdu" appears to be unformatted, attempting to format as type: "ext4" with options: [-F /rootfs/dev/xvdu]
mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /rootfs/dev/xvdu /rootfs/mnt/master-vol-0fd788d4873320c84
mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0fd788d4873320c84 --scope -- mount -t ext4 -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0fd788d4873320c84])
volume_mounter.go:76] unable to mount master volume: "error formatting and mounting disk \"/dev/xvdu\" on \"/mnt/master-vol-0fd788d4873320c84\" in host: exit status 1"

Testing with protokube-1.8.0-alpha.1 the volumes mount and cluster builds successfully, so something merged into master since then appears to have caused the regression. I believe it may be related to the following change, but haven't managed to verify: https://github.com/kubernetes/kops/blame/51d3cde6970bfc29018eab9c8f05753079115f7c/protokube/pkg/protokube/volume_mounter.go#L110
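The detection path in the logs above (lsblk reports an empty FSTYPE, then the disk is formatted with `mkfs.ext4 -F`) can be mimicked against a scratch image file instead of a real EBS device; `blkid` is used below in place of `lsblk`, since `lsblk` only reports on block devices. This is just a sketch of the probe-then-format logic, not the protokube code itself:

```shell
# Mimic the format-detection logic on a scratch image file (no root needed).
IMG=$(mktemp)
truncate -s 64M "$IMG"

# Unformatted: no filesystem signature, so blkid prints nothing and exits non-zero.
blkid -o value -s TYPE "$IMG" || echo "no filesystem detected"

# Format as ext4, as protokube's safe-format-and-mount does with -F.
mkfs.ext4 -q -F "$IMG"

# Now the probe reports ext4.
blkid -o value -s TYPE "$IMG"

rm -f "$IMG"
```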

@chrislovecnm (Contributor)

Thanks, I will test a full build. The interesting thing is that e2e should be running this same test continuously ... interesting meaning that if you are correct, e2e is busted ;(

I appreciate your help!

@chrislovecnm (Contributor)

Note

I found some issues in the Makefile; I have put in one PR and am about to open another.

Create cluster

Here is what I did. I normally use scripts, so this bash is untested. I have standard manifests that I use, with the k8s version set to 1.8.0.

export MY_BUCKET=whatyourbucketnameis
export KOPS_FILE=mycluster.yaml
S3_BUCKET=s3://$MY_BUCKET make kops-install upload
export KOPS_BASE_URL=https://s3.amazonaws.com/$MY_BUCKET/kops/1.8.0-alpha.1/
# probably do not have to do this
export NODEUP_URL=${KOPS_BASE_URL}linux/amd64/nodeup
export KOPS=kops
CLUSTER_NAME=$(head -n 10 $KOPS_FILE | grep name | awk '{ print $2 }')
$KOPS create -f $KOPS_FILE
$KOPS create secret --name $CLUSTER_NAME sshpublickey admin -i ~/.ssh/id_rsa.pub
$KOPS update cluster $CLUSTER_NAME --yes

Validate Protokube

SSH'd into the master, and it seems to be working.

It installed correctly

Oct  7 23:45:01 ip-172-60-0-183 nodeup[734]: I1007 23:45:01.077672     734 executor.go:157] Executing task "LoadImage.protokube": LoadImageTask: https://s3.amazonaws.com/redacted/kops/1.8.0-alpha.1/images/protokube.tar.gz

And protokube mounted the volumes correctly:

admin@ip-172-60-0-183:~$ sudo docker logs f14247974dbc
protokube version 0.1
I1007 23:45:20.483998       1 aws_volume.go:71] AWS API Request: ec2metadata/GetMetadata
I1007 23:45:20.485084       1 aws_volume.go:71] AWS API Request: ec2metadata/GetMetadata
I1007 23:45:20.485921       1 aws_volume.go:71] AWS API Request: ec2metadata/GetMetadata
I1007 23:45:20.488424       1 aws_volume.go:71] AWS API Request: ec2/DescribeInstances
I1007 23:45:20.581503       1 aws_volume.go:71] AWS API Request: ec2/DescribeVolumes
I1007 23:45:20.581692       1 dnscontroller.go:101] starting DNS controller
I1007 23:45:20.582481       1 dnscache.go:75] querying all DNS zones (no cached results)
I1007 23:45:20.582571       1 route53.go:50] AWS request: route53 ListHostedZones
I1007 23:45:20.650251       1 volume_mounter.go:227] Trying to mount master volume: "vol-0ee184d96cdd5ff49"
I1007 23:45:20.650410       1 aws_volume.go:71] AWS API Request: ec2/AttachVolume
I1007 23:45:20.861699       1 aws_volume.go:329] AttachVolume request returned {
  AttachTime: 2017-10-07 23:45:20.859 +0000 UTC,
  Device: "/dev/xvdu",
  InstanceId: "i-0da8fbcbac9b3763e",
  State: "attaching",
  VolumeId: "vol-0ee184d96cdd5ff49"
}
I1007 23:45:20.861923       1 aws_volume.go:71] AWS API Request: ec2/DescribeVolumes
I1007 23:45:20.934395       1 volume_mounter.go:227] Trying to mount master volume: "vol-0cc5454b7fbdae79b"
I1007 23:45:20.934568       1 aws_volume.go:71] AWS API Request: ec2/AttachVolume
I1007 23:45:21.150469       1 aws_volume.go:329] AttachVolume request returned {
  AttachTime: 2017-10-07 23:45:21.15 +0000 UTC,
  Device: "/dev/xvdv",
  InstanceId: "i-0da8fbcbac9b3763e",
  State: "attaching",
  VolumeId: "vol-0cc5454b7fbdae79b"
}
I1007 23:45:21.150737       1 aws_volume.go:71] AWS API Request: ec2/DescribeVolumes
I1007 23:45:21.226847       1 volume_mounter.go:246] Currently attached volumes: [{"ID":"vol-0ee184d96cdd5ff49","LocalDevice":"/dev/xvdu","AttachedTo":"","Mountpoint":"","Status":"available","Info":{"Description":"vol-0ee184d96cdd5ff49","EtcdClusters":[{"clusterKey":"main","nodeName":"d","nodeNames":["d"]}]}} {"ID":"vol-0cc5454b7fbdae79b","LocalDevice":"/dev/xvdv","AttachedTo":"","Mountpoint":"","Status":"available","Info":{"Description":"vol-0cc5454b7fbdae79b","EtcdClusters":[{"clusterKey":"events","nodeName":"d","nodeNames":["d"]}]}}]
I1007 23:45:21.226944       1 volume_mounter.go:58] Master volume "vol-0ee184d96cdd5ff49" is attached at "/dev/xvdu"
I1007 23:45:21.227583       1 volume_mounter.go:72] Doing safe-format-and-mount of /dev/xvdu to /mnt/master-vol-0ee184d96cdd5ff49
I1007 23:45:21.227616       1 volume_mounter.go:103] Waiting for device "/dev/xvdu" to be attached
I1007 23:45:22.227746       1 volume_mounter.go:106] Found device "/dev/xvdu"
I1007 23:45:22.230991       1 mount_linux.go:168] Detected OS with systemd
I1007 23:45:22.231582       1 volume_mounter.go:135] Creating mount directory "/rootfs/mnt/master-vol-0ee184d96cdd5ff49"
I1007 23:45:22.231719       1 volume_mounter.go:140] Mounting device "/rootfs/dev/xvdu" on "/rootfs/mnt/master-vol-0ee184d96cdd5ff49"
I1007 23:45:22.231733       1 mount_linux.go:366] Checking for issues with fsck on disk: /rootfs/dev/xvdu
I1007 23:45:22.247351       1 mount_linux.go:379] `fsck` error fsck from util-linux 2.25.2
fsck.ext2: Bad magic number in super-block while trying to open /rootfs/dev/xvdu
/rootfs/dev/xvdu:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

I1007 23:45:22.247377       1 mount_linux.go:384] Attempting to mount disk:  /rootfs/dev/xvdu /rootfs/mnt/master-vol-0ee184d96cdd5ff49
I1007 23:45:22.247500       1 mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0ee184d96cdd5ff49 --scope -- mount -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0ee184d96cdd5ff49])
E1007 23:45:22.266151       1 mount_linux.go:140] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /rootfs/mnt/master-vol-0ee184d96cdd5ff49 --scope -- mount -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0ee184d96cdd5ff49
Output: Running as unit run-14.scope.
mount: wrong fs type, bad option, bad superblock on /rootfs/dev/xvdu,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

I1007 23:45:22.266211       1 mount_linux.go:430] Attempting to determine if disk "/rootfs/dev/xvdu" is formatted using lsblk with args: ([-n -o FSTYPE /rootfs/dev/xvdu])
I1007 23:45:22.268133       1 mount_linux.go:433] Output: "\n"
I1007 23:45:22.268153       1 mount_linux.go:404] Disk "/rootfs/dev/xvdu" appears to be unformatted, attempting to format as type: "ext4" with options: [-F /rootfs/dev/xvdu]
I1007 23:45:23.777400       1 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /rootfs/dev/xvdu /rootfs/mnt/master-vol-0ee184d96cdd5ff49
I1007 23:45:23.777522       1 mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0ee184d96cdd5ff49 --scope -- mount -t ext4 -o defaults /rootfs/dev/xvdu /rootfs/mnt/master-vol-0ee184d96cdd5ff49])
I1007 23:45:23.858021       1 volume_mounter.go:80] mounted master volume "vol-0ee184d96cdd5ff49" on /mnt/master-vol-0ee184d96cdd5ff49
I1007 23:45:23.858057       1 volume_mounter.go:58] Master volume "vol-0cc5454b7fbdae79b" is attached at "/dev/xvdv"
I1007 23:45:23.858096       1 volume_mounter.go:72] Doing safe-format-and-mount of /dev/xvdv to /mnt/master-vol-0cc5454b7fbdae79b
I1007 23:45:23.858127       1 volume_mounter.go:106] Found device "/dev/xvdv"
I1007 23:45:23.875391       1 mount_linux.go:168] Detected OS with systemd
I1007 23:45:23.876357       1 volume_mounter.go:135] Creating mount directory "/rootfs/mnt/master-vol-0cc5454b7fbdae79b"
I1007 23:45:23.876762       1 volume_mounter.go:140] Mounting device "/rootfs/dev/xvdv" on "/rootfs/mnt/master-vol-0cc5454b7fbdae79b"
I1007 23:45:23.876780       1 mount_linux.go:366] Checking for issues with fsck on disk: /rootfs/dev/xvdv
I1007 23:45:23.895085       1 mount_linux.go:379] `fsck` error fsck from util-linux 2.25.2
fsck.ext2: Bad magic number in super-block while trying to open /rootfs/dev/xvdv
/rootfs/dev/xvdv:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

I1007 23:45:23.895112       1 mount_linux.go:384] Attempting to mount disk:  /rootfs/dev/xvdv /rootfs/mnt/master-vol-0cc5454b7fbdae79b
I1007 23:45:23.895209       1 mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0cc5454b7fbdae79b --scope -- mount -o defaults /rootfs/dev/xvdv /rootfs/mnt/master-vol-0cc5454b7fbdae79b])
E1007 23:45:23.913945       1 mount_linux.go:140] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /rootfs/mnt/master-vol-0cc5454b7fbdae79b --scope -- mount -o defaults /rootfs/dev/xvdv /rootfs/mnt/master-vol-0cc5454b7fbdae79b
Output: Running as unit run-22.scope.
mount: wrong fs type, bad option, bad superblock on /rootfs/dev/xvdv,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

I1007 23:45:23.913986       1 mount_linux.go:430] Attempting to determine if disk "/rootfs/dev/xvdv" is formatted using lsblk with args: ([-n -o FSTYPE /rootfs/dev/xvdv])
I1007 23:45:23.916778       1 mount_linux.go:433] Output: "\n"
I1007 23:45:23.916802       1 mount_linux.go:404] Disk "/rootfs/dev/xvdv" appears to be unformatted, attempting to format as type: "ext4" with options: [-F /rootfs/dev/xvdv]
I1007 23:45:27.468125       1 mount_linux.go:408] Disk successfully formatted (mkfs): ext4 - /rootfs/dev/xvdv /rootfs/mnt/master-vol-0cc5454b7fbdae79b
I1007 23:45:27.468169       1 mount_linux.go:135] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /rootfs/mnt/master-vol-0cc5454b7fbdae79b --scope -- mount -t ext4 -o defaults /rootfs/dev/xvdv /rootfs/mnt/master-vol-0cc5454b7fbdae79b])
I1007 23:45:27.512945       1 volume_mounter.go:80] mounted master volume "vol-0cc5454b7fbdae79b" on /mnt/master-vol-0cc5454b7fbdae79b
I1007 23:45:27.512981       1 kube_boot.go:103] Found etcd cluster spec on volume "vol-0cc5454b7fbdae79b": {"clusterKey":"events","nodeName":"d","nodeNames":["d"]}
I1007 23:45:27.513069       1 kube_boot.go:103] Found etcd cluster spec on volume "vol-0ee184d96cdd5ff49": {"clusterKey":"main","nodeName":"d","nodeNames":["d"]}

Validate Cluster

The cluster name is redacted.

$ kops validate cluster
Using cluster from kubectl context: mycluster

Validating cluster mycluster

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
master-us-west-2c	Master	c4.large	1	1	us-west-2c
nodes			Node	m3.medium	2	2	us-west-2c

NODE STATUS
NAME						ROLE	READY
ip-172-60-0-183.us-west-2.compute.internal	master	True
ip-172-60-0-206.us-west-2.compute.internal	node	True
ip-172-60-0-228.us-west-2.compute.internal	node	True

Your cluster mycluster is ready

@KashifSaadat (Contributor, Author)

Hey Chris, thanks for looking into this!

Did you build protokube fresh off the master branch? Could you try with a new image?

When I built a cluster with protokube v1.8.0-alpha.1 the volume mounts were successful, but with a custom protokube image (built off commit hash 51d3cde, no local changes) I started seeing this problem. I think it has possibly regressed between those commits.

@chrislovecnm (Contributor)

Tested on master.

@KashifSaadat (Contributor, Author)

KashifSaadat commented Oct 13, 2017

Apologies I forgot to mention, I had this issue using CoreOS (which it seems this issue may be limited to). It may be fixed by the following PR (I'll test): #3621

@KashifSaadat (Contributor, Author)

PR #3621 has been merged into master now. I gave this another test and the cluster builds successfully on CoreOS (nodes healthy, kuberang tests pass), however I noticed the following unit failures (CoreOS, etcd v3.0.17, tls enabled):

Failed Units: 2
  run-1360.scope
  run-1411.scope

<hostname> systemd[1]: run-1360.scope: Failed to add PIDs to scope's control group: No such process
<hostname> systemd[1]: Failed to start Kubernetes systemd probe.
<hostname> systemd[1]: run-1360.scope: Unit entered failed state.

<hostname> systemd[1]: run-1411.scope: Failed to add PIDs to scope's control group: No such process
<hostname> systemd[1]: Failed to start Kubernetes systemd probe.
<hostname> systemd[1]: run-1411.scope: Unit entered failed state.

It doesn't appear to be causing any issues, but I figured I'd raise it just in case. @justinsb any thoughts?

@KashifSaadat (Contributor, Author)

I've been testing recently and have not spotted any more issues since the fix in #3621 was merged. I'm not sure why the above service errors occur, but they don't appear to have any negative impact on cluster function.

Will close this issue, thanks for the help! 👍
