
Stop and Restart Kubernetes Cluster on AWS #15160

Closed
stemau98 opened this issue Oct 6, 2015 · 15 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.)

@stemau98

stemau98 commented Oct 6, 2015

Is there a way to only stop a Kubernetes cluster on AWS and not to destroy the whole cluster?
I tried stopping instances running the Ubuntu Vivid image and got an error when booting the instances a second time.
The system log of the instances shows "Welcome to emergency mode", "Dependency failed for Local File Systems", and "Dependency failed for /mnt/ephemeral".
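For reference, the failing mounts can be inspected from the emergency shell; a rough sketch (the exact unit and mount names will vary with the setup):

# From the emergency shell on the affected instance:
journalctl -xb | grep -i ephemeral       # boot-time errors for the /mnt/ephemeral mounts
systemctl list-units --state=failed      # which mount/device units timed out
cat /etc/fstab                           # the /mnt/ephemeral entries written by kube-up
lsblk                                    # whether the ephemeral disk and LVM volumes are present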

@stemau98
Author

stemau98 commented Oct 6, 2015

If it helps, I can post the complete system log of the EC2 instance.

@romanek-adam

I have exactly the same problem and I have no clue what's going on. My master node does not come up after being stopped/rebooted.

My setup is:

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1a
export NUM_MINIONS=4
export MASTER_SIZE=m3.medium
export MINION_SIZE=m3.medium
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxxxxxxxxx-k8s-staging
export KUBE_AWS_INSTANCE_PREFIX=k8s-staging-eu
export KUBE_ENABLE_CLUSTER_MONITORING=none
export KUBE_ENABLE_NODE_LOGGING=false
export KUBE_ENABLE_CLUSTER_LOGGING=false

The logs from the master node's boot contain:

(2 of 2) A start job is running for...l-docker.device (36s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (36s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (37s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (37s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (38s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (38s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (39s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (39s / 1min 30s)

and so on...

And then finally:

[DEPEND] Dependency failed for /mnt/ephemeral/docker.
[DEPEND] Dependency failed for Local File Systems.
[ TIME ] Timed out waiting for device dev-vg\x2dephemeral-kubernetes.device.
[DEPEND] Dependency failed for /mnt/ephemeral/kubernetes.

(...)

error: unexpectedly disconnected from boot status daemon
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.

Could you please help us out with this one?

UPDATE 1:
BTW, please update the title of this issue as it does not reflect its seriousness.

UPDATE 2:
On m4.large nodes I don't observe this problem. To me it looks like it has something to do with ephemeral storage on m3.medium.

Could this be related to the fact that the instance storage on m3.medium is only 4GB?
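One way to sanity-check the instance-store theory on a node that is still running (a sketch; assumes SSH access and the standard LVM tooling the AWS scripts use):

lsblk -b               # raw size of the instance-store disk(s) in bytes
sudo vgs && sudo lvs   # the vg-ephemeral group and its docker/kubernetes volumes from the boot log above
df -h /mnt/ephemeral   # how much of the ephemeral space is actually in use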

@stemau98 stemau98 changed the title from "Only stopping Kubernetes Cluster on AWS" to "Stop and Restart Kubernetes Cluster on AWS" on Oct 13, 2015
@stemau98
Author

@romanek-adam
Is the title okay for you now?
I had exactly the same problem on an m3.large instance for my minions. I worked around it by removing the following line (line 34) from the common.sh file in the kubernetes/cluster/aws/trusty/ directory:
grep -v "^#" "${KUBE_ROOT}/cluster/aws/templates/format-disks.sh"

Maybe this helps you too!
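A minimal sketch of that workaround, assuming format-disks.sh is referenced on only that one line of cluster/aws/trusty/common.sh (note romanek-adam's caveat further down before relying on it):

# run from the root of the kubernetes checkout, before kube-up:
sed -i.bak '/format-disks.sh/d' cluster/aws/trusty/common.sh
grep -n format-disks cluster/aws/trusty/common.sh   # should print nothing afterwards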

@geoah

geoah commented Oct 21, 2015

Having the same issue with m3.large: once rebooted, the instances go into emergency mode.
As you can imagine, not being able to reboot your minions is a bit of an issue! :p

P.S. Cluster built on AWS using kube-up on the latest release-1.1 (dd9ccae) with

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1c
export NUM_MINIONS=2
export MINION_SIZE=m3.large
export MASTER_SIZE=m3.medium
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxx-kubernetes-staging-artifacts
export KUBE_AWS_INSTANCE_PREFIX=kubernetes-staging

@romanek-adam

@stemau98 I wouldn't dare remove the disk formatting code. It might have worked for you, but the results are unpredictable.

There's definitely something wrong with m3 instance types, so the issue title should be updated accordingly.

@geoah

geoah commented Oct 22, 2015

@romanek-adam & @stemau98 what version/s are you on?

@stemau98
Author

On Kubernetes version 1.1, but I compiled it myself.

@romanek-adam

@geoah 1.0.6

@geoah

geoah commented Oct 22, 2015

FYI: Restarting minions works fine on m4 instances.

@jvalencia
Contributor

I see an issue with the master which is related:
#16188

@willmore

We also hit the restart failure on 1.1 with m3.medium.
m3.large fails as well.
m4.large works.
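A plausible explanation (an assumption, not confirmed in this thread): m4 instances are EBS-only and have no instance store, so kube-up never creates the /mnt/ephemeral LVM volumes that fail to come back on m3 instances after a stop/start. A quick check from the instance itself:

# m3 types list ephemeral0 (the instance store) here; m4 types do not:
curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/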

@drora

drora commented Dec 21, 2015

Just hit this issue on v1.1.3.
After a master node reboot, I lost control over the cluster.

(The kubectl config was empty, data on the attached EBS storage was probably lost, I wasn't able to regain control and recover with etcd, and I couldn't find more info about how to re-attach a master to an existing cluster on AWS.)

kubectl get nodes ::
"error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused"

from kubelet log ::
"Unable to write event: 'Post https://{IP}/api/v1/namespaces/default/events: dial tcp {IP}:443: connection refused' (may retry after sleeping)"
"Skipping pod synchronization, network is not configured"

@drora

drora commented Dec 21, 2015

This is a major deal for production use.
Very easy to repro locally (via Vagrant); a sketch follows the list.

  1. Spin up a cluster using kube-up,
  2. restart the master VM,
  3. lose the cluster.
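A sketch of that repro with the Vagrant provider (the master VM name "master" is an assumption about the Vagrantfile):

export KUBERNETES_PROVIDER=vagrant
./cluster/kube-up.sh                        # 1. spin up the cluster
vagrant halt master && vagrant up master    # 2. restart the master VM
./cluster/kubectl.sh get nodes              # 3. fails with "connection refused" after the restart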

@keithdadkins

Same issue with m3.medium and Ubuntu Vivid. Lost the cluster after the master rebooted (emergency mode).

@justinsb justinsb modified the milestone: v1.2 on Feb 20, 2016
@justinsb justinsb added the priority/important-soon label (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) on Feb 24, 2016
@thockin thockin added the priority/backlog label and removed the priority/important-soon label on Mar 2, 2016
@justinsb
Member

justinsb commented Mar 5, 2016

This is now fixed in 1.2: restart & stop/start should work reliably.

@justinsb justinsb closed this as completed Mar 5, 2016