This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Release Prep to v0.13.x branch #1589

Conversation

davidmccormick
Contributor

@davidmccormick davidmccormick commented May 4, 2019

Kube-aws 0.13 Release PR

The 0.13.x release adds new node.kubernetes.io/role labels to all nodes but does not use them. They will become active in the 0.14.x release, where the NodeRestriction admission controller will be enabled and use of the existing labels will be denied.

Changes in this release:

  • Update to k8s v1.13.5
  • Put the worker/kubelet and admin certs on the controllers.
  • Disable the apiserver insecure port 8080 - only HTTPS on port 443 is allowed.
  • Enable kubelet anonymous authentication but only allow Webhook authorization
  • Add RBAC objects to allow unauthenticated access to the kubelet's /healthz endpoint (so that cfn-signal can curl it without creds)
  • Configure the controller kubelets to do TLS bootstrapping in the same way as the workers (if >=1.14).
  • Update Networking Components (calico v3.6.1, flannel v0.11.0)
  • Enable Metrics-server by default and remove heapster
  • Use CoreDNS for Cluster DNS resolution by default
  • Refactor install-kube-system (group related manifests for clarity and deploy with a single apply/delete for performance) and allow removal of legacy services by manifest or by reference
  • Roll existing apply-kube-aws-plugins systemd service into install-kube-system
  • Update Kiam to 3.2 - WARNING! The Kiam server certificate now needs to be re-generated to include the SAN "kiam-server" (previously it was just kiam-server:443)
  • Remove Experimental Settings for TLSBootstrap, PodPriority, PodSecurityPolicy, NodeAuthorizer, PersistentVolumeClaimResize which are all now enabled by default.
  • Create a core permissive PodSecurityPolicy for kube-system service accounts and optionally bind all cluster service accounts and authenticated users to it if it is the only PSP present in the system (i.e. do not break existing clusters that have no policies); see the sketch after this list
  • Remove deprecated Experimental DenyEscalatingExec admission controller in favour of using the PodSecurityPolicy controller
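
A minimal sketch of what such a permissive policy and its binding for kube-system service accounts could look like (the object names and exact settings here are illustrative assumptions, not necessarily the manifests kube-aws generates):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: permissive                # illustrative name
spec:
  privileged: true                # allow everything; intended only for trusted kube-system workloads
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPID: true
  hostIPC: true
  hostPorts:
  - min: 0
    max: 65535
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-permissive            # grants 'use' on the policy above
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['permissive']
  verbs: ['use']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-permissive-kube-system
  namespace: kube-system          # binds all kube-system service accounts to the permissive policy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-permissive
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:kube-system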

…en away in the api package.

Update to k8s v1.13.5

Put the worker/kubelet and admin certs on the controllers.
Disabled apiserver insecure port 8080 - only https on 443 allowed.

Configure controllers kubelet to do TLS bootstrapping same as workers (if >=1.14).

Update Networking Components (calico v3.6.1, flannel v0.11.0)

Enable PodPriority by default

Enable Metrics-server by default and remove heapster

Enable CoreDNS for Cluster DNS resolution

Refactor install-kube-system (group related manifests for clarity and deploy with single apply/delete for performance)

Update install-kube-system to clean up deprecated services and objects (e.g. heapster)

Update Kiam to 3.2 - WARNING! Kiam Server Certificate now needs to be re-generated to include SAN "kiam-server" (previously was just kiam-server:443)

Remove Experimental Settings for TLSBootstrap, Pod Priority, NodeAuthorizer, PersistentVolumeClaimResize

Remove experimental Mutating and Validating Webhooks which are now enabled by default.

Update the node role label to node.kubernetes.io/role which is allowed by the NodeRestriction AdmissionController
…branch and then switch to using them in the 0.14 release branch. Disable Admission Controller NodeRestriction in 0.13 release branch.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mumoshu

If they are not already assigned, you can assign the PR to them by writing /assign @mumoshu in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 4, 2019
@davidmccormick davidmccormick changed the title 0.13 release migration from existing Release Prep to v0.13.x branch May 4, 2019
@davidmccormick davidmccormick added this to the v0.13.0 milestone May 4, 2019
@fejta-bot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/check-cla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 4, 2019
@codecov-io

codecov-io commented May 7, 2019

Codecov Report

Merging #1589 into v0.13.x will decrease coverage by 0.2%.
The diff coverage is 2.22%.

Impacted file tree graph

@@             Coverage Diff             @@
##           v0.13.x    #1589      +/-   ##
===========================================
- Coverage    25.87%   25.67%   -0.21%     
===========================================
  Files           98       98              
  Lines         5074     5052      -22     
===========================================
- Hits          1313     1297      -16     
+ Misses        3614     3610       -4     
+ Partials       147      145       -2
Impacted Files Coverage Δ
pkg/api/deployment.go 0% <ø> (ø) ⬆️
pkg/api/types.go 0% <ø> (ø) ⬆️
pkg/model/node_pool_compile.go 54.54% <ø> (-1.16%) ⬇️
pkg/model/node_pool_config.go 23.75% <ø> (-11.14%) ⬇️
pkg/api/feature_gates.go 0% <0%> (ø) ⬆️
pkg/api/cluster.go 0% <0%> (ø) ⬆️
credential/generator.go 0% <0%> (ø) ⬆️
pkg/model/credentials.go 60.6% <50%> (-1.16%) ⬇️
pkg/model/config.go 52.38% <0%> (-2.39%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67869df...123d83b. Read the comment docs.

@paalkr
Contributor

paalkr commented May 13, 2019

I'm trying to upgrade a test cluster created with kube-aws 0.12.3, and some of the cluster resources in the kube-system namespace refuse to start when the control-plane nodes are being updated, resulting in a stack rollback. The etcd and network stacks did update nicely.

The test cluster has three etcd nodes and two controller nodes. In addition it has two regular worker nodes.

(screenshot: kube-system pods repeatedly being started and terminated)

@davidmccormick
Contributor Author

I'm trying to upgrade a test cluster created with kube-aws 0.12.3, and some of the cluster resources in the kube-system namespace refuse to start when the control-plane nodes are being updated.

Hi, thanks for testing @paalkr! Do you have any more details on what the errors are?

… that controller nodes can create mirror pods.

Remove writing the kube-aws version to the motd - it was causing extended rolls just to update the version number, which is available on a tag anyway.
@davidmccormick
Contributor Author

@paalkr do you have any further details? I was able to provision a v0.12.3 cluster and then upgrade to v0.13.0, but my cluster.yaml and cluster state could be very different from yours! I'm sure there is an issue here, but we'll have to dig deeper in order to pinpoint what is happening! Could you perhaps try the upgrade again and capture the kubelet logs on a node which fails? If it's a controller, can you send me:

journalctl -lu install-kube-system
journalctl -lu cfn-signal
docker logs $(docker ps -q --filter="name=k8s_kube-apiserver_kube-apiserver*")
docker logs $(docker ps -q --filter="name=k8s_kube-controller-manager_kube-controller-manager*")

Thanks! 🙏

@paalkr
Contributor

paalkr commented May 13, 2019

Thanks!

I will try to gather more information. BTW, I'm updating the cluster in sequence, following these steps:
./kube-aws render stack
./kube-aws apply --pretty-print --targets network -> OK, when turning off CloudWatch logs output. I have created ticket #1580 for this issue, which is probably not related to my previous comment.
./kube-aws apply --pretty-print --targets etcd -> OK
./kube-aws apply --pretty-print --targets control-plane -> Not OK. The first new control-plane node does get registered with the cluster, and then the issues I described in my previous comment start. The cfn-signal.service never finishes, so eventually the update rolls back.
My next step would be to update the worker pools, if the control-plane had not failed:
./kube-aws apply --pretty-print --targets <pool_name>
etc.

I will try a new update now to collect more logs.

@paalkr
Contributor

paalkr commented May 13, 2019

The install-kube-system log revealed a problem with the Kubernetes Dashboard deployment, which was being retried over and over again. This corresponds with a lot of the pods being started and terminated, as shown in my previous screenshot.

May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: The Deployment "kubernetes-dashboard" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit
May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: daemonset.extensions/kiam-agent unchanged
May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: Attempt 4 failed! Trying again in 3 seconds...

My initial dashboard configuration was just

kubernetesDashboard:
  adminPrivileges: true
  insecureLogin: true

Altering the values as shown below fixed the dashboard deployment and made the control-plane node start successfully, signaling OK to the stack :)

kubernetesDashboard:
  adminPrivileges: true
  insecureLogin: true
  allowSkipLogin: true
  replicas: 1
  enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 500Mi

@paalkr
Contributor

paalkr commented May 13, 2019

My next problem is that the kiam server and client, version 3.2, are stuck in a crash loop. My initial kube-aws 0.12.3 cluster ran kiam 2.7. I'm trying to upgrade to 3.2.

This is the describe output for the server pod

Events:
  Type     Reason                  Age                    From                                               Message
  ----     ------                  ----                   ----                                               -------
  Normal   Scheduled               2m53s                  default-scheduler                                  Successfully assigned kube-system/kiam-server-q6fgz to ip-10-1-43-32.eu-west-1.compute.internal
  Warning  NetworkNotReady         2m39s (x8 over 2m53s)  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
  Warning  FailedCreatePodSandBox  2m36s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "873309c963c14086f98af62bcc12c871b51a740c96f91939d9ad9f08ff24200c" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  2m33s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ffc77aed2e660a41ad66d63ba26c0a458ab43e4ea302182dc8dc75a81e220c5f" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  2m32s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "612def71555a3329cd5e7670df8d38915395d21f4908d29fe23cd2d897c88833" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          2m31s (x3 over 2m35s)  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 2m28s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  pulling image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled                  2m24s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Successfully pulled image "quay.io/uswitch/kiam:v3.2"
  Warning  BackOff                 2m19s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Back-off restarting failed container
  Normal   Created                 2m9s (x3 over 2m24s)   kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Created container
  Warning  Failed                  2m9s (x3 over 2m23s)   kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Error: failed to start container "kiam": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/server\": stat /server: no such file or directory": unknown

And this is the client

Events:
  Type     Reason     Age                     From                                                Message
  ----     ------     ----                    ----                                                -------
  Normal   Scheduled  8m53s                   default-scheduler                                   Successfully assigned kube-system/kiam-agent-26d6l to ip-10-1-45-207.eu-west-1.compute.internal
  Normal   Pulling    8m51s                   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  pulling image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled     8m47s                   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Successfully pulled image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled     7m12s (x4 over 8m46s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Container image "quay.io/uswitch/kiam:v3.2" already present on machine
  Normal   Created    7m11s (x5 over 8m47s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Created container
  Warning  Failed     7m11s (x5 over 8m46s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Error: failed to start container "kiam": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/agent\": stat /agent: no such file or directory": unknown
  Warning  BackOff    3m46s (x28 over 8m44s)  kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Back-off restarting failed container

@paalkr
Contributor

paalkr commented May 13, 2019

Looking at the default kiam deployment, the command and args have changed
https://github.com/uswitch/kiam/tree/master/deploy

from

        command:
        - /agent

and

        command:
        - /server

to just /kiam for both server and client, with a different set of args.

          command:
            - /kiam
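
For reference, a minimal sketch of how the v3 server and agent containers might be invoked, based on the upstream deploy examples linked above (flags and paths are illustrative and may not match the kube-aws manifests exactly):

          # kiam-server container (illustrative)
          command:
            - /kiam
          args:
            - server
            - --json-log
            - --bind=0.0.0.0:443
            - --cert=/etc/kiam/tls/server.pem
            - --key=/etc/kiam/tls/server-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --sync=1m

          # kiam-agent container (illustrative)
          command:
            - /kiam
          args:
            - agent
            - --json-log
            - --iptables
            - --host-interface=cali+        # interface pattern depends on the CNI in use
            - --port=8181
            - --cert=/etc/kiam/tls/agent.pem
            - --key=/etc/kiam/tls/agent-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --server-address=kiam-server:443
            - --gateway-timeout-creation=1s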

@paalkr
Contributor

paalkr commented May 13, 2019

Sorry for spamming you with comments @davidmccormick ;)

There is also a helm/tiller issue that I experience after upgrading the control-plane. Running helm version fails as described in this issue: helm/helm#3104

@davidmccormick
Contributor Author

The install-kube-system log revealed a problem with the Kubernetes Dashboard deployment, which was being retried over and over again. This corresponds with a lot of the pods being started and terminated, as shown in my previous screenshot.

Hi - many thanks for the extra info here - I fat-fingered a change that set default resource limits on the dashboard and didn't notice because they are set explicitly in my cluster.yaml! I've pushed a commit which will hopefully resolve this one.

@davidmccormick
Contributor Author

Looking at the kiam default deployment, the command and args has changed

Ah great! Thanks for pointing that out! I have checked our manifests against the official versions and updated the command-line args. Can you take another look? Many thanks! 🙏

@davidmccormick
Contributor Author

@paalkr regarding the helm/tiller issue - I just pushed a fix to RBAC that I believe should fix things!
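
For context, the usual Tiller RBAC setup is roughly the following (a sketch of the standard Helm v2 pattern only; the actual fix in this PR may differ):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin        # broad; Tiller then manages resources cluster-wide
subjects:
- kind: ServiceAccount
  name: tiller
  namespace: kube-system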

@paalkr
Contributor

paalkr commented May 14, 2019

Thanks @davidmccormick. I will make a new build and execute some tests.

@paalkr
Contributor

paalkr commented May 15, 2019

So what is the plan to support kiam versions prior to 3.0? The new args and command line you added will not work with older versions, for example if someone has specified this in their cluster.yaml. Should you check for the kiam version, or will it be enough to document this as a breaking change?

experimental:
  kiamSupport:
    enabled: true  
    image: 
      repo: quay.io/uswitch/kiam
      tag: v2.8

@paalkr
Contributor

paalkr commented May 15, 2019

helm/tiller seems to work as intended now. Thanks for fixing this issue

@paalkr
Contributor

paalkr commented May 15, 2019

The changes you introduced to add localhost to the kiam cert, to make the kiam health check work, force you to generate new certificates using kube-aws render certificates (--kiam). That again introduces a lot of pain, like flannel not starting:
E0515 12:03:02.676214 1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/flannel-6jlql': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/flannel-6jlql: dial tcp 10.96.0.1:443: i/o timeout
Is it possible to regenerate only the kiam certificate? I tried using --kiam, but that did not prevent the other certs from being regenerated as well.

I'm not sure if the flannel issue is related to what I described above, but it started happening after I regenerated the certs. What is the recommended workflow for upgrading a kube-aws 0.12.3 cluster with kiam 2.7 to kube-aws 0.13 with kiam 3.2?

I'm not sure what commands to execute to provide whatever logs you might need.

@davidmccormick
Contributor Author

@paalkr I appreciate that the mechanism is a bit clunky! We actually use our own tool and Vault to manage all of the credentials files and certificates. My only suggestion would be to back up your credentials directory, run the regeneration to get new kiam certs, and then restore the backup and replace just your kiam certs.

@davidmccormick
Contributor Author

I am going to merge this into the branch now and make it a beta release. Please do continue to test, but can you raise an issue for any further bugs that you find (one issue for all would be fine)?

@davidmccormick davidmccormick merged commit 628028d into kubernetes-retired:v0.13.x May 15, 2019
@paalkr
Contributor

paalkr commented May 15, 2019

@davidmccormick, thanks. My plan regarding the kiam cert was to do exactly as you described. I will continue to test, and can raise a general ticket for 0.13 beta testing results.
