This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Release Prep to v0.13.x branch #1589

Conversation

davidmccormick
Contributor

@davidmccormick davidmccormick commented May 4, 2019

Kube-aws 0.13 Release PR

The 0.13.x release adds new node.kubernetes.io/role labels to all nodes but does not use them. They will become active in the 0.14.x release, where the NodeRestriction admission controller will be enabled and use of the existing labels will be denied.

Changes in this release:

  • Update to k8s v1.13.5
  • Put the worker/kubelet and admin certs on the controllers.
  • Disable the apiserver insecure port 8080 - only HTTPS on port 443 is allowed.
  • Enable kubelet anonymous authentication but only allow Webhook authorization
  • Add RBAC objects to allow unauthenticated access to the kubelet's /healthz endpoint (so that cfn-signal can curl it without creds)
  • Configure the controller kubelets to do TLS bootstrapping in the same way as the workers (if >=1.14).
  • Update Networking Components (calico v3.6.1, flannel v0.11.0)
  • Enable Metrics-server by default and remove heapster
  • Use CoreDNS for Cluster DNS resolution by default
  • Refactor install-kube-system (group related manifests for clarity and deploy with a single apply/delete for performance) and allow removal of legacy services by manifest or by reference
  • Roll existing apply-kube-aws-plugins systemd service into install-kube-system
  • Update Kiam to 3.2 - WARNING! The Kiam server certificate now needs to be re-generated to include the SAN "kiam-server" (previously it was just kiam-server:443)
  • Remove Experimental Settings for TLSBootstrap, PodPriority, PodSecurityPolicy, NodeAuthorizer, PersistentVolumeClaimResize which are all now enabled by default.
  • Create a core permissive PodSecurityPolicy for kube-system service accounts and optionally bind all cluster service accounts and authenticated users to it if it is the only PSP present in the system (i.e. do not break existing clusters that have no policies); see the sketch after this list
  • Remove deprecated Experimental DenyEscalatingExec admission controller in favour of using the PodSecurityPolicy controller
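
A minimal sketch of what such a permissive policy and its binding for kube-system service accounts could look like (the object names and exact settings here are illustrative assumptions, not necessarily the manifests kube-aws generates):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: permissive                # illustrative name
spec:
  privileged: true                # allow everything; intended only for trusted kube-system workloads
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPID: true
  hostIPC: true
  hostPorts:
  - min: 0
    max: 65535
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-permissive            # grants 'use' on the policy above
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['permissive']
  verbs: ['use']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-permissive-kube-system
  namespace: kube-system          # binds all kube-system service accounts to the permissive policy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-permissive
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:kube-system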

…en away in the api package.

Update to k8s v1.13.5

Put the worker/kubelet and admin certs on the controllers.
Disabled apiserver insecure port 8080 - only https on 443 allowed.

Configure controllers kubelet to do TLS bootstrapping same as workers (if >=1.14).

Update Networking Components (calico v3.6.1, flannel v0.11.0)

Enable PodPriority by default

Enable Metrics-server by default and remove heapster

Enable CoreDNS for Cluster DNS resolution

Refactor install-kube-system (group related manifests for clarity and deploy with single apply/delete for performance)

Update install-kube-system to clean up deprecated services and objects (e.g. heapster)

Update Kiam to 3.2 - WARNING! Kiam Server Certificate now needs to be re-generated to include SAN "kiam-server" (previously was just kiam-server:443)

Remove Experimental Settings for TLSBootstrap, Pod Priority, NodeAuthorizer, PersistentVolumeClaimResize

Remove experimental Mutating and Validating Webhooks which are now enabled by default.

Update the node role label to node.kubernetes.io/role which is allowed by the NodeRestriction AdmissionController
…branch and then switch to using them in the 0.14 release branch. Disable Admission Controller NodeRestriction in 0.13 release branch.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mumoshu

If they are not already assigned, you can assign the PR to them by writing /assign @mumoshu in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 4, 2019
@davidmccormick davidmccormick changed the title 0.13 release migration from existing Release Prep to v0.13.x branch May 4, 2019
@davidmccormick davidmccormick added this to the v0.13.0 milestone May 4, 2019
@fejta-bot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/check-cla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 4, 2019
@codecov-io

codecov-io commented May 7, 2019

Codecov Report

Merging #1589 into v0.13.x will decrease coverage by 0.2%.
The diff coverage is 2.22%.

Impacted file tree graph

@@             Coverage Diff             @@
##           v0.13.x    #1589      +/-   ##
===========================================
- Coverage    25.87%   25.67%   -0.21%     
===========================================
  Files           98       98              
  Lines         5074     5052      -22     
===========================================
- Hits          1313     1297      -16     
+ Misses        3614     3610       -4     
+ Partials       147      145       -2
Impacted Files Coverage Δ
pkg/api/deployment.go 0% <ø> (ø) ⬆️
pkg/api/types.go 0% <ø> (ø) ⬆️
pkg/model/node_pool_compile.go 54.54% <ø> (-1.16%) ⬇️
pkg/model/node_pool_config.go 23.75% <ø> (-11.14%) ⬇️
pkg/api/feature_gates.go 0% <0%> (ø) ⬆️
pkg/api/cluster.go 0% <0%> (ø) ⬆️
credential/generator.go 0% <0%> (ø) ⬆️
pkg/model/credentials.go 60.6% <50%> (-1.16%) ⬇️
pkg/model/config.go 52.38% <0%> (-2.39%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67869df...123d83b. Read the comment docs.

@paalkr
Contributor

paalkr commented May 13, 2019

I'm trying to upgrade a test cluster created with kube-aws 0.12.3, and some of the cluster resources in the kube-system namespace refuse to start when the control-plane nodes are being updated, resulting in a stack rollback. The etcd and network stacks did update nicely.

The test cluster has three etcd nodes and two controller nodes. In addition it has two regular worker nodes.

(screenshot: kube-system pods repeatedly being started and terminated)

@davidmccormick
Contributor Author

I'm trying to upgrade a test cluster created with kube-aws 0.12.3, and some of the cluster resources in the kube-system namespace refuse to start when the control-plane nodes are being updated.

Hi, thanks for testing @paalkr! Do you have any more details on what the errors are?

… that controller nodes can create mirror pods.

Remove writing the kube-aws version to the motd - it was causing extended rolls just to update the version number, which is available on a tag anyway.
@davidmccormick
Contributor Author

@paalkr do you have any further details? I was able to provision a v0.12.3 cluster and then upgrade to v0.13.0, but my cluster.yaml and cluster state could be very different from yours! I'm sure there is an issue here, but we'll have to dig deeper in order to pinpoint what is happening! Could you perhaps try the upgrade again and capture the kubelet logs on a node which fails? If it's a controller, can you send me:

journalctl -lu install-kube-system
journalctl -lu cfn-signal
docker logs $(docker ps -q --filter="name=k8s_kube-apiserver_kube-apiserver*")
docker logs $(docker ps -q --filter="name=k8s_kube-controller-manager_kube-controller-manager*")

Thanks! 🙏

@paalkr
Contributor

paalkr commented May 13, 2019

Thanks!

I will try to gather more information. BTW, I'm updating the cluster in sequence, following these steps:
./kube-aws render stack
./kube-aws apply --pretty-print --targets network -> OK, when turning off CloudWatch logs output. I have created ticket #1580 for this issue, which is probably not related to my previous comment.
./kube-aws apply --pretty-print --targets etcd -> OK
./kube-aws apply --pretty-print --targets control-plane -> Not OK. The first new control-plane node does get registered with the cluster, and then the issues I described in my previous comment start. The cfn-signal.service never finishes, so eventually the update rolls back.
My next step would be to update the worker pools, if the control-plane had not failed:
./kube-aws apply --pretty-print --targets <pool_name>
etc.

I will try a new update now to collect more logs.

@paalkr
Contributor

paalkr commented May 13, 2019

The install-kube-system log revealed a problem with the Kubernetes Dashboard deployment, which was being retried over and over again. This corresponds with a lot of the pods being started and terminated, as shown in my previous screenshot.

May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: The Deployment "kubernetes-dashboard" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit
May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: daemonset.extensions/kiam-agent unchanged
May 13 20:39:24 ip-10-1-43-83.eu-west-1.compute.internal retry[2930]: Attempt 4 failed! Trying again in 3 seconds...

My initial dashboard configuration was just

kubernetesDashboard:
  adminPrivileges: true
  insecureLogin: true

Altering the values as shown below fixed the dashboard deployment and made the control-plane node start successfully, signaling OK to the stack :)

kubernetesDashboard:
  adminPrivileges: true
  insecureLogin: true
  allowSkipLogin: true
  replicas: 1
  enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 500Mi

@paalkr
Contributor

paalkr commented May 13, 2019

My next problem is that the kiam server and client, version 3.2, are stuck in a crash loop. My initial kube-aws 0.12.3 cluster ran kiam 2.7. I'm trying to upgrade to 3.2.

This is the describe output for the server pod

Events:
  Type     Reason                  Age                    From                                               Message
  ----     ------                  ----                   ----                                               -------
  Normal   Scheduled               2m53s                  default-scheduler                                  Successfully assigned kube-system/kiam-server-q6fgz to ip-10-1-43-32.eu-west-1.compute.internal
  Warning  NetworkNotReady         2m39s (x8 over 2m53s)  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
  Warning  FailedCreatePodSandBox  2m36s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "873309c963c14086f98af62bcc12c871b51a740c96f91939d9ad9f08ff24200c" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  2m33s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ffc77aed2e660a41ad66d63ba26c0a458ab43e4ea302182dc8dc75a81e220c5f" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  2m32s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "612def71555a3329cd5e7670df8d38915395d21f4908d29fe23cd2d897c88833" network for pod "kiam-server-q6fgz": NetworkPlugin cni failed to set up pod "kiam-server-q6fgz_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          2m31s (x3 over 2m35s)  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 2m28s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  pulling image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled                  2m24s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Successfully pulled image "quay.io/uswitch/kiam:v3.2"
  Warning  BackOff                 2m19s                  kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Back-off restarting failed container
  Normal   Created                 2m9s (x3 over 2m24s)   kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Created container
  Warning  Failed                  2m9s (x3 over 2m23s)   kubelet, ip-10-1-43-32.eu-west-1.compute.internal  Error: failed to start container "kiam": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/server\": stat /server: no such file or directory": unknown

And this is the client

Events:
  Type     Reason     Age                     From                                                Message
  ----     ------     ----                    ----                                                -------
  Normal   Scheduled  8m53s                   default-scheduler                                   Successfully assigned kube-system/kiam-agent-26d6l to ip-10-1-45-207.eu-west-1.compute.internal
  Normal   Pulling    8m51s                   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  pulling image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled     8m47s                   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Successfully pulled image "quay.io/uswitch/kiam:v3.2"
  Normal   Pulled     7m12s (x4 over 8m46s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Container image "quay.io/uswitch/kiam:v3.2" already present on machine
  Normal   Created    7m11s (x5 over 8m47s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Created container
  Warning  Failed     7m11s (x5 over 8m46s)   kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Error: failed to start container "kiam": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/agent\": stat /agent: no such file or directory": unknown
  Warning  BackOff    3m46s (x28 over 8m44s)  kubelet, ip-10-1-45-207.eu-west-1.compute.internal  Back-off restarting failed container

@paalkr
Contributor

paalkr commented May 13, 2019

Looking at the default kiam deployment, the command and args have changed
https://github.com/uswitch/kiam/tree/master/deploy

from

        command:
        - /agent

and

        command:
        - /server

to just /kiam for both server and client, with a different set of args.

          command:
            - /kiam
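
For reference, a minimal sketch of how the v3 server and agent containers might be invoked, based on the upstream deploy examples linked above (flags and paths are illustrative and may not match the kube-aws manifests exactly):

          # kiam-server container (illustrative)
          command:
            - /kiam
          args:
            - server
            - --json-log
            - --bind=0.0.0.0:443
            - --cert=/etc/kiam/tls/server.pem
            - --key=/etc/kiam/tls/server-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --sync=1m

          # kiam-agent container (illustrative)
          command:
            - /kiam
          args:
            - agent
            - --json-log
            - --iptables
            - --host-interface=cali+        # interface pattern depends on the CNI in use
            - --port=8181
            - --cert=/etc/kiam/tls/agent.pem
            - --key=/etc/kiam/tls/agent-key.pem
            - --ca=/etc/kiam/tls/ca.pem
            - --server-address=kiam-server:443
            - --gateway-timeout-creation=1s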

@paalkr
Contributor

paalkr commented May 13, 2019

Sorry for spamming you with comments @davidmccormick ;)

There is also a helm/tiller issue that I experience after upgrading the control-plane. Running helm version fails as described in this issue: helm/helm#3104

@davidmccormick
Contributor Author

The install-kube-system log revealed a problem with the Kubernetes Dashboard deployment, which was being retried over and over again. This corresponds with a lot of the pods being started and terminated, as shown in my previous screenshot.

Hi - many thanks for the extra info here - I fat-fingered a change that set default resource limits on the dashboard and didn't notice because they are set explicitly in my cluster.yaml! I've pushed a commit which will hopefully resolve this one.

@davidmccormick
Contributor Author

Looking at the kiam default deployment, the command and args has changed

Ah great! Thanks for pointing that out! I have checked our manifests against the official versions and updated the command-line args. Can you take another look? Many thanks! 🙏

@davidmccormick
Contributor Author

@paalkr regarding the helm/tiller issue - I just pushed a fix to RBAC that I believe should fix things!
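
For context, the usual Tiller RBAC setup is roughly the following (a sketch of the standard Helm v2 pattern only; the actual fix in this PR may differ):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin        # broad; Tiller then manages resources cluster-wide
subjects:
- kind: ServiceAccount
  name: tiller
  namespace: kube-system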

@paalkr
Contributor

paalkr commented May 14, 2019

Thanks @davidmccormick. I will make a new build and execute some tests.

@paalkr
Contributor

paalkr commented May 15, 2019

So what is the plan to support kiam versions prior to 3.0? The new args and command line you added will not work with older versions, for example if someone has specified this in their cluster.yaml. Should you check for the kiam version, or will it be enough to document this as a breaking change?

experimental:
  kiamSupport:
    enabled: true  
    image: 
      repo: quay.io/uswitch/kiam
      tag: v2.8

@paalkr
Contributor

paalkr commented May 15, 2019

helm/tiller seems to work as intended now. Thanks for fixing this issue

@paalkr
Contributor

paalkr commented May 15, 2019

The changes you introduced to add localhost to the kiam cert, to make the kiam health check work, force you to generate new certificates using kube-aws render certificates (--kiam). That again introduces a lot of pain, like flannel not starting:
E0515 12:03:02.676214 1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/flannel-6jlql': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/flannel-6jlql: dial tcp 10.96.0.1:443: i/o timeout
Is it possible to regenerate only the kiam certificate? I tried using --kiam, but that did not prevent the other certs from being regenerated as well.

I'm not sure if the flannel issue is related to what I described above, but it started happening after I regenerated the certs. What is the recommended workflow for upgrading a kube-aws 0.12.3 cluster with kiam 2.7 to kube-aws 0.13 with kiam 3.2?

I'm not sure what commands to execute to provide whatever logs you might need.

@davidmccormick
Contributor Author

@paalkr I appreciate that the mechanism is a bit clunky! We actually use our own tool and Vault to manage all of the credentials files and certificates. My only suggestion would be to back up your credentials directory, run the regeneration to get new kiam certs, and then restore the backup and replace just your kiam certs.

@davidmccormick
Contributor Author

I am going to merge this into the branch now and make it a beta release. Please do continue to test, but can you raise an issue for any further bugs that you find (one issue for all would be fine)?

@davidmccormick davidmccormick merged commit 628028d into kubernetes-retired:v0.13.x May 15, 2019
@paalkr
Contributor

paalkr commented May 15, 2019

@davidmccormick, thanks. My plan regarding the kiam cert was to do exactly as you described. I will continue to test, and can raise a general ticket for 0.13 beta testing results.
