
Failing CloudInit and Kubelet setup during worker node startup #18

Closed
eshamay opened this issue Nov 6, 2018 · 2 comments
eshamay commented Nov 6, 2018

Within the past day, all EKS clusters I create via @pulumi/eks's new eks.Cluster have been failing to start up their worker nodes. One symptom is that some or all of the nodes fail to join the EKS cluster.

A snippet of the log is pasted further below. The code used to create the cluster, which was working properly last week, is here:

import * as fs from "fs";
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";
import * as k8s from "@pulumi/kubernetes";
import * as YAML from "js-yaml";

//
// Configuration
//
let config = new pulumi.Config("eks-cluster");
const environment = config.require("environment");
const vpcSubnetIds = config.require("subnet-ids").split(",");

let splunkConfig = new pulumi.Config("splunk");

//
// Actions to perform at the startup of any worker node in the cluster
//
const userData = `set -o xtrace
echo "Starting UserData for the EKS worker nodes"
yum install -y epel-release
yum update -y
yum install -y \
  jq \
  yq \
  cifs-utils \
  nfs-utils \
  nfs-common \
  nfs4-acl-tools \
  portmap \
  screen \
  vim \
  mountpoint \
  base64 \
  bind-utils \
  coreutils \
  util-linux
VOLUME_PLUGIN_DIR="/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
mkdir -p "$VOLUME_PLUGIN_DIR/fstab~cifs"
cd "$VOLUME_PLUGIN_DIR/fstab~cifs"
curl -L -O https://raw.githubusercontent.com/fstab/cifs/master/cifs
chmod 755 cifs`;


// Create the Kubernetes cluster and all associated artifacts and resources
const cluster = new eks.Cluster(environment+"-eks-cluster", {
    vpcId             : config.require("vpc-id"),
    subnetIds         : vpcSubnetIds,
    instanceType      : "m4.4xlarge",
    nodePublicKey     : config.require("worker-ssh-key"),
    nodeRootVolumeSize: 250,
    desiredCapacity   : 3,
    maxSize           : 5,
    minSize           : 1,
    nodeUserData      : userData,
    deployDashboard   : false,
});

// Write out the kubeconfig file
export const kubeconfig = cluster.kubeconfig;
cluster.kubeconfig.apply(kc => fs.writeFileSync("./config-" + environment + ".kubeconfig.json", JSON.stringify(kc)));

// Write the kubeconfig to a location in S3 for safe keeping
cluster.kubeconfig.apply(kc => {
    const s3Kubeconfig = new aws.s3.BucketObject("kube_config/config-" + environment + ".json", {
        bucket: "sdp-helm",
        content: JSON.stringify(kc),
    });
});

// Allow SSH ingress to worker nodes
new aws.ec2.SecurityGroupRule("ssh-ingress", {
    type: "ingress",
    fromPort: 22,
    toPort: 22,
    protocol: "tcp",
    securityGroupId: cluster.nodeSecurityGroup.id,
    cidrBlocks: [ "0.0.0.0/0" ],
});

The journalctl output on one of the hosts shows the following problems:

failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0f016b7e718a12801

Failed to start Apply the settings specified in cloud-config.

Unit cloud-config.service entered failed state.


Nov 06 01:06:13 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting yum locked retry of update-motd...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com update-motd[5484]: Yum database was locked, so we couldn't get fresh package info.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Started yum locked retry of update-motd.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting yum locked retry of update-motd.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: One of the configured repositories failed (Unknown),
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: and yum doesn't have enough cached data to continue. At this point the only
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: safe thing yum can do is fail. There are a few ways to work "fix" this:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 1. Contact the upstream for the repository and get them to fix the problem.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: upstream. This is most often useful if you are using a newer
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: distribution release than is supported by the repository (and the
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: packages for the previous distribution release still work).
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 3. Run the command with the repository temporarily disabled
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum --disablerepo=<repoid> ...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 4. Disable the repository permanently, so yum won't use it by default. Yum
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: will then just ignore the repository until you permanently enable it
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: again or use --enablerepo for temporary usage:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum-config-manager --disable <repoid>
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: or
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: subscription-manager repos --disable=<repoid>
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 5. Configure the failing repository to be skipped, if it is unavailable.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Note that yum will try to contact the repo. when it runs most commands,
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: so will have to try and fail each time (and thus. yum will be be much
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: slower). If it is a very temporary problem though, this is often a nice
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: compromise:
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Could not retrieve mirrorlist http://amazonlinux.us-west-2.amazonaws.com/2/core/latest/x86_64/mirror.list error was
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: 12: Timeout on http://amazonlinux.us-west-2.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Resolving timed out after 5515 milliseconds')
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: util.py[WARNING]: Package upgrade failed
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[4927]: Nov 06 01:06:18 cloud-init[4927]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_u
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: cloud-config.service: main process exited, code=exited, status=1/FAILURE
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Failed to start Apply the settings specified in cloud-config.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Unit cloud-config.service entered failed state.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: cloud-config.service failed.
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting Execute cloud user/final scripts...
Nov 06 01:06:18 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Cloud-init v. 18.2-72.amzn2.0.6 running 'modules:final' at Tue, 06 Nov 2018 01:06:18 +0000. Up 155.71 seconds.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Cluster "kubernetes" set.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Reloading.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com cloud-init[5517]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Reloading.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Started Kubernetes Kubelet.
Nov 06 01:06:20 ip-10-10-3-24.tableausandbox.com systemd[1]: Starting Kubernetes Kubelet...

...

Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com kubelet[5645]: F1106 01:08:22.657450    5645 server.go:233] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0f016b7e718a12801
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: Unit kubelet.service entered failed state.
Nov 06 01:08:22 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service failed.
Nov 06 01:08:27 ip-10-10-3-24.tableausandbox.com systemd[1]: kubelet.service holdoff time over, scheduling restart.

eshamay changed the title from "Failing CloudInit during worker node startup" to "Failing CloudInit and Kubelet setup during worker node startup" Nov 6, 2018
lukehoban added this to the 0.19 milestone Nov 6, 2018

pgavlin commented Nov 8, 2018

We dug into this offline. It turns out that the root cause is that some worker nodes were being created in public subnets and had no route to the EKS API server. This happened because the subnetIds argument to eks.Cluster contained both public and private subnets, and that list is passed verbatim to the CloudFormation template that boots the worker nodes. I am investigating whether eks.Cluster can and should automatically filter out the public subnets, or whether we ought to expose an additional argument so that the user can restrict the subnets in which workers are created.
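
As a rough sketch of the filtering approach (illustrative only, not the actual implementation), subnets could be classified by their mapPublicIpOnLaunch attribute and the public ones dropped before the list is handed to the worker CloudFormation template; a more robust check would also inspect each subnet's route table for an internet gateway route:

import * as aws from "@pulumi/aws";

// Sketch only: treat a subnet as "public" if it assigns public IPs on launch,
// and keep only the private ones for the worker nodes.
async function filterPrivateSubnets(subnetIds: string[]): Promise<string[]> {
    const privateIds: string[] = [];
    for (const id of subnetIds) {
        const subnet = await aws.ec2.getSubnet({ id: id });
        if (!subnet.mapPublicIpOnLaunch) {
            privateIds.push(id);
        }
    }
    return privateIds;
}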

pgavlin added a commit that referenced this issue Nov 8, 2018
If both public and private subnets are attached to an EKS cluster, the
API server will only be exposed to the private subnets. Any workers
attached to the public subnets will be unable to contact the API server
and will fail to register as nodes. These changes filter public subnets
from the set of subnets passed to the worker nodes iff the EKS cluster
is also attached to private subnets.

Fixes #18.
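
An illustrative reduction of that rule (hypothetical names, not the committed code): workers keep the full subnet list only when no private subnets are attached.

// Only restrict workers to private subnets when the cluster actually has some;
// otherwise keep the original list unchanged.
function workerSubnets(allSubnetIds: string[], privateSubnetIds: string[]): string[] {
    return privateSubnetIds.length > 0 ? privateSubnetIds : allSubnetIds;
}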
pgavlin added a commit that referenced this issue Nov 8, 2018

eshamay commented Nov 8, 2018

This was due to an incorrect EKS networking setup: workers were being deployed to public subnets. The fix is to place workers only on the private subnets.
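
For anyone hitting this before the fix ships, a minimal workaround sketch (it assumes a hypothetical private-subnet-ids config key listing only the private subnets):

// Workaround sketch: pass eks.Cluster only private subnet IDs so that no
// workers are launched into public subnets.
const privateSubnetIds = config.require("private-subnet-ids").split(",");

const cluster = new eks.Cluster(environment + "-eks-cluster", {
    vpcId             : config.require("vpc-id"),
    subnetIds         : privateSubnetIds, // private subnets only
    instanceType      : "m4.4xlarge",
    nodePublicKey     : config.require("worker-ssh-key"),
    nodeRootVolumeSize: 250,
    desiredCapacity   : 3,
    maxSize           : 5,
    minSize           : 1,
    nodeUserData      : userData,
    deployDashboard   : false,
});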

eshamay closed this as completed Nov 8, 2018
pgavlin added a commit that referenced this issue Nov 8, 2018
houqp pushed a commit to houqp/pulumi-eks that referenced this issue Jan 9, 2019 (cherry-picked from fe96413)
houqp pushed a commit to houqp/pulumi-eks that referenced this issue Jan 10, 2019 (cherry-picked from fe96413)
pgavlin added a commit that referenced this issue Jan 10, 2019 (cherry-picked from fe96413)