
Failing CloudInit and Kubelet setup during worker node startup #18

Closed
eshamay opened this issue Nov 6, 2018 · 2 comments
eshamay commented Nov 6, 2018

Within the past day, all EKS clusters I create via @pulumi/eks's new eks.Cluster have been failing to start up their worker nodes. One symptom is that some or all of the nodes fail to join the EKS cluster.

A snippet of the log is pasted further below. The code used to create the cluster, which was working properly last week, is here:

import * as fs from "fs";
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";
import * as k8s from "@pulumi/kubernetes";
import * as YAML from "js-yaml";

//
// Configuration
//
let config = new pulumi.Config("eks-cluster");
const environment = config.require("environment");
const vpcSubnetIds = config.require("subnet-ids").split(",");

let splunkConfig = new pulumi.Config("splunk");

//
// Actions to perform at the startup of any worker node in the cluster
//
const userData = `set -o xtrace
echo "Starting UserData for the EKS worker nodes"
yum install -y epel-release
yum update -y
yum install -y \
  jq \
  yq \
  cifs-utils \
  nfs-utils \
  nfs-common \
  nfs4-acl-tools \
  portmap \
  screen \
  vim \
  mountpoint \
  base64 \
  bind-utils \
  coreutils \
  util-linux
VOLUME_PLUGIN_DIR="/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
mkdir -p "$VOLUME_PLUGIN_DIR/fstab~cifs"
cd "$VOLUME_PLUGIN_DIR/fstab~cifs"
curl -L -O https://raw.githubusercontent.com/fstab/cifs/master/cifs
chmod 755 cifs`;


// Create the Kubernetes cluster and all associated artifacts and resources
const cluster = new eks.Cluster(environment+"-eks-cluster", {
    vpcId             : config.require("vpc-id"),
    subnetIds         : vpcSubnetIds,
    instanceType      : "m4.4xlarge",
    nodePublicKey     : config.require("worker-ssh-key"),
    nodeRootVolumeSize: 250,
    desiredCapacity   : 3,
    maxSize           : 5,
    minSize           : 1,
    nodeUserData      : userData,
    deployDashboard   : false,
});

// Write out the kubeconfig file
export const kubeconfig = cluster.kubeconfig;
cluster.kubeconfig.apply(kc => fs.writeFileSync("./config-" + environment + ".kubeconfig.json", JSON.stringify(kc)));

// Write the kubeconfig to a location in S3 for safe keeping
cluster.kubeconfig.apply(kc => {
    const s3Kubeconfig = new aws.s3.BucketObject("kube_config/config-" + environment + ".json", {
        bucket: "sdp-helm",
        content: JSON.stringify(kc),
    });
});

// Allow SSH ingress to worker nodes
new aws.ec2.SecurityGroupRule("ssh-ingress", {
    type: "ingress",
    fromPort: 22,
    toPort: 22,
    protocol: "tcp",
    securityGroupId: cluster.nodeSecurityGroup.id,
    cidrBlocks: [ "0.0.0.0/0" ],
});

The journalctl output on one of the hosts shows the following problems:

failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0f016b7e718a12801

Failed to start Apply the settings specified in cloud-config.

Unit cloud-config.service entered failed state.


Nov 06 01:06:13 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting yum locked retry of update-motd...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com update-motd[5484]: Yum database was locked, so we couldn't get fresh package info.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Started yum locked retry of update-motd.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting yum locked retry of update-motd.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: One of the configured repositories failed (Unknown),
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: and yum doesn't have enough cached data to continue. At this point the only
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: safe thing yum can do is fail. There are a few ways to work "fix" this:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 1. Contact the upstream for the repository and get them to fix the problem.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: upstream. This is most often useful if you are using a newer
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: distribution release than is supported by the repository (and the
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: packages for the previous distribution release still work).
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 3. Run the command with the repository temporarily disabled
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum --disablerepo=<repoid> ...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 4. Disable the repository permanently, so yum won't use it by default. Yum
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: will then just ignore the repository until you permanently enable it
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: again or use --enablerepo for temporary usage:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum-config-manager --disable <repoid>
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: or
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: subscription-manager repos --disable=<repoid>
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 5. Configure the failing repository to be skipped, if it is unavailable.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Note that yum will try to contact the repo. when it runs most commands,
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: so will have to try and fail each time (and thus. yum will be be much
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: slower). If it is a very temporary problem though, this is often a nice
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: compromise:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Could not retrieve mirrorlist http://amazonlinux.us-west-2.amazonaws.com/2/core/latest/x86_64/mirror.list error was
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 12: Timeout on http://amazonlinux.us-west-2.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Resolving timed out after 5515 milliseconds')
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: util.py[WARNING]: Package upgrade failed
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_u
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: cloud-config.service: main process exited, code=exited, status=1/FAILURE
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Failed to start Apply the settings specified in cloud-config.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Unit cloud-config.service entered failed state.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: cloud-config.service failed.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting Execute cloud user/final scripts...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Cloud-init v. 18.2-72.amzn2.0.6 running 'modules:final' at Tue, 06 Nov 2018 01:06:18 +0000. Up 155.71 seconds.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Cluster "kubernetes" set.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Reloading.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Reloading.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Started Kubernetes Kubelet.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting Kubernetes Kubelet...

...

Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com kubelet[5645]: F1106 01:08:22.657450    5645 server.go:233] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0f016b7e718a12801
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: Unit kubelet.service entered failed state.
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service failed.
Nov 06 01:08:27 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service holdoff time over, scheduling restart.

eshamay changed the title from "Failing CloudInit during worker node startup" to "Failing CloudInit and Kubelet setup during worker node startup" Nov 6, 2018
lukehoban added this to the 0.19 milestone Nov 6, 2018

pgavlin commented Nov 8, 2018

We dug into this offline. It turns out that the root cause is that some worker nodes were being created in public subnets and had no route to the EKS API server. This happened because the subnetIds argument to eks.Cluster contained both public and private subnets, and that list is passed verbatim to the CloudFormation template that boots the worker nodes. I am investigating whether eks.Cluster can and should automatically filter out the public subnets, or whether we ought to expose an additional argument so that the user can restrict the subnets in which workers are created.
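
As a rough sketch of the filtering approach (illustrative only, not the actual implementation), subnets could be classified by their mapPublicIpOnLaunch attribute and the public ones dropped before the list is handed to the worker CloudFormation template; a more robust check would also inspect each subnet's route table for an internet gateway route:

import * as aws from "@pulumi/aws";

// Sketch only: treat a subnet as "public" if it assigns public IPs on launch,
// and keep only the private ones for the worker nodes.
async function filterPrivateSubnets(subnetIds: string[]): Promise<string[]> {
    const privateIds: string[] = [];
    for (const id of subnetIds) {
        const subnet = await aws.ec2.getSubnet({ id: id });
        if (!subnet.mapPublicIpOnLaunch) {
            privateIds.push(id);
        }
    }
    return privateIds;
}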

pgavlin added a commit that referenced this issue Nov 8, 2018
If both public and private subnets are attached to an EKS cluster, the
API server will only be exposed to the private subnets. Any workers
attached to the public subnets will be unable to contact the API server
and will fail to register as nodes. These changes filter public subnets
from the set of subnets passed to the worker nodes iff the EKS cluster
is also attached to private subnets.

Fixes #18.
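
An illustrative reduction of that rule (hypothetical names, not the committed code): workers keep the full subnet list only when no private subnets are attached.

// Only restrict workers to private subnets when the cluster actually has some;
// otherwise keep the original list unchanged.
function workerSubnets(allSubnetIds: string[], privateSubnetIds: string[]): string[] {
    return privateSubnetIds.length > 0 ? privateSubnetIds : allSubnetIds;
}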
pgavlin added a commit that referenced this issue Nov 8, 2018

eshamay commented Nov 8, 2018

This was due to an incorrect EKS networking setup: workers were being deployed to public subnets. The fix is to place workers only on the private subnets.
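
For anyone hitting this before the fix ships, a minimal workaround sketch (it assumes a hypothetical private-subnet-ids config key listing only the private subnets):

// Workaround sketch: pass eks.Cluster only private subnet IDs so that no
// workers are launched into public subnets.
const privateSubnetIds = config.require("private-subnet-ids").split(",");

const cluster = new eks.Cluster(environment + "-eks-cluster", {
    vpcId             : config.require("vpc-id"),
    subnetIds         : privateSubnetIds, // private subnets only
    instanceType      : "m4.4xlarge",
    nodePublicKey     : config.require("worker-ssh-key"),
    nodeRootVolumeSize: 250,
    desiredCapacity   : 3,
    maxSize           : 5,
    minSize           : 1,
    nodeUserData      : userData,
    deployDashboard   : false,
});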

eshamay closed this as completed Nov 8, 2018
pgavlin added a commit that referenced this issue Nov 8, 2018
houqp pushed a commit to houqp/pulumi-eks that referenced this issue Jan 9, 2019 (cherry-picked from fe96413)
houqp pushed a commit to houqp/pulumi-eks that referenced this issue Jan 10, 2019 (cherry-picked from fe96413)
pgavlin added a commit that referenced this issue Jan 10, 2019 (cherry-picked from fe96413)