
Stop and Restart Kubernetes Cluster on AWS #15160

Closed
stemau98 opened this issue Oct 6, 2015 · 15 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.)

@stemau98

stemau98 commented Oct 6, 2015

Is there a way to only stop a Kubernetes cluster on AWS and not to destroy the whole cluster?
I tried stopping instances running the Ubuntu Vivid image and got an error when booting the instances a second time.
The system log of the instances shows "Welcome to emergency mode", "Dependency failed for Local File Systems", and "Dependency failed for /mnt/ephemeral".
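For reference, the failing mounts can be inspected from the emergency shell; a rough sketch (the exact unit and mount names will vary with the setup):

# From the emergency shell on the affected instance:
journalctl -xb | grep -i ephemeral       # boot-time errors for the /mnt/ephemeral mounts
systemctl list-units --state=failed      # which mount/device units timed out
cat /etc/fstab                           # the /mnt/ephemeral entries written by kube-up
lsblk                                    # whether the ephemeral disk and LVM volumes are present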

@stemau98
Author

stemau98 commented Oct 6, 2015

If it helps, I can post the complete system log of the EC2 instance.

@romanek-adam

I have exactly the same problem and I have no clue what's going on. My master node does not come up after being stopped/rebooted.

My setup is:

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1a
export NUM_MINIONS=4
export MASTER_SIZE=m3.medium
export MINION_SIZE=m3.medium
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxxxxxxxxx-k8s-staging
export KUBE_AWS_INSTANCE_PREFIX=k8s-staging-eu
export KUBE_ENABLE_CLUSTER_MONITORING=none
export KUBE_ENABLE_NODE_LOGGING=false
export KUBE_ENABLE_CLUSTER_LOGGING=false

The logs from the master node's boot contain:

(2 of 2) A start job is running for...l-docker.device (36s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (36s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (37s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (37s / 1min 30s)
(1 of 2) A start job is running for...bernetes.device (38s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (38s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (39s / 1min 30s)
(2 of 2) A start job is running for...l-docker.device (39s / 1min 30s)

and so on...

And then finally:

[DEPEND] Dependency failed for /mnt/ephemeral/docker.
[DEPEND] Dependency failed for Local File Systems.
[ TIME ] Timed out waiting for device dev-vg\x2dephemeral-kubernetes.device.
[DEPEND] Dependency failed for /mnt/ephemeral/kubernetes.

(...)

error: unexpectedly disconnected from boot status daemon
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.

Could you please help us out with this one?

UPDATE 1:
BTW, please update the title of this issue as it does not reflect its seriousness.

UPDATE 2:
On m4.large nodes I don't observe this problem. To me it looks like it has something to do with ephemeral storage on m3.medium.

Could this be related to the fact that the instance storage on m3.medium is only 4GB?
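One way to sanity-check the instance-store theory on a node that is still running (a sketch; assumes SSH access and the standard LVM tooling the AWS scripts use):

lsblk -b               # raw size of the instance-store disk(s) in bytes
sudo vgs && sudo lvs   # the vg-ephemeral group and its docker/kubernetes volumes from the boot log above
df -h /mnt/ephemeral   # how much of the ephemeral space is actually in use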

@stemau98 stemau98 changed the title from "Only stopping Kubernetes Cluster on AWS" to "Stop and Restart Kubernetes Cluster on AWS" on Oct 13, 2015
@stemau98
Author

@romanek-adam
Is the title okay for you now?
I had exactly the same problem on an m3.large instance for my minions. I worked around it by removing the following line (line 34) from the common.sh file in the kubernetes/cluster/aws/trusty/ directory:
grep -v "^#" "${KUBE_ROOT}/cluster/aws/templates/format-disks.sh"

Maybe this helps you too!
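A minimal sketch of that workaround, assuming format-disks.sh is referenced on only that one line of cluster/aws/trusty/common.sh (note romanek-adam's caveat further down before relying on it):

# run from the root of the kubernetes checkout, before kube-up:
sed -i.bak '/format-disks.sh/d' cluster/aws/trusty/common.sh
grep -n format-disks cluster/aws/trusty/common.sh   # should print nothing afterwards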

@geoah

geoah commented Oct 21, 2015

Having the same issue with m3.large: once rebooted, the instances go into emergency mode.
As you can imagine, not being able to reboot your minions is a bit of an issue! :p

P.S. Cluster built on AWS using kube-up on the latest release-1.1 (dd9ccae) with

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1c
export NUM_MINIONS=2
export MINION_SIZE=m3.large
export MASTER_SIZE=m3.medium
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxx-kubernetes-staging-artifacts
export KUBE_AWS_INSTANCE_PREFIX=kubernetes-staging

@romanek-adam

@stemau98 I wouldn't dare remove the disk formatting code. It might have worked for you, but the results are unpredictable.

There's definitely something wrong with m3 instance types, so the issue title should be updated accordingly.

@geoah

geoah commented Oct 22, 2015

@romanek-adam & @stemau98 what version/s are you on?

@stemau98
Author

On Kubernetes version 1.1, but I compiled it myself.

@romanek-adam

@geoah 1.0.6

@geoah

geoah commented Oct 22, 2015

FYI: Restarting minions works fine on m4 instances.

@jvalencia
Contributor

I see an issue with the master which is related:
#16188

@willmore

We also hit the restart failure on 1.1 with m3.medium.
m3.large fails as well.
m4.large works.
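A plausible explanation (an assumption, not confirmed in this thread): m4 instances are EBS-only and have no instance store, so kube-up never creates the /mnt/ephemeral LVM volumes that fail to come back on m3 instances after a stop/start. A quick check from the instance itself:

# m3 types list ephemeral0 (the instance store) here; m4 types do not:
curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/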

@drora

drora commented Dec 21, 2015

Just hit this issue on v1.1.3.
After a master node reboot, I lost control over the cluster.

(The kubectl config was empty, data on the attached EBS storage was probably lost, I wasn't able to regain control and recover with etcd, and I couldn't find more info about how to re-attach a master to an existing cluster on AWS.)

kubectl get nodes ::
"error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused"

from kubelet log ::
"Unable to write event: 'Post https://{IP}/api/v1/namespaces/default/events: dial tcp {IP}:443: connection refused' (may retry after sleeping)"
"Skipping pod synchronization, network is not configured"

@drora

drora commented Dec 21, 2015

This is a major deal for production use.
Very easy to repro locally (via Vagrant); a sketch follows the list.

  1. Spin up a cluster using kube-up,
  2. restart the master VM,
  3. lose the cluster.
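A sketch of that repro with the Vagrant provider (the master VM name "master" is an assumption about the Vagrantfile):

export KUBERNETES_PROVIDER=vagrant
./cluster/kube-up.sh                        # 1. spin up the cluster
vagrant halt master && vagrant up master    # 2. restart the master VM
./cluster/kubectl.sh get nodes              # 3. fails with "connection refused" after the restart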

@keithdadkins

Same issue with m3.medium and Ubuntu Vivid. Lost the cluster after the master rebooted (emergency mode).

@justinsb justinsb modified the milestone: v1.2 on Feb 20, 2016
@justinsb justinsb added the priority/important-soon label (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) on Feb 24, 2016
@thockin thockin added the priority/backlog label and removed the priority/important-soon label on Mar 2, 2016
@justinsb
Member

justinsb commented Mar 5, 2016

This is now fixed in 1.2: restart & stop/start should work reliably.

@justinsb justinsb closed this as completed Mar 5, 2016