
kubeadm UX improvements for the v1.5 stable release #37568

Merged

Conversation

8 participants
@luxas
Member

commented Nov 28, 2016

This PR targets the next stable kubeadm release.

It's a work in progress, but please comment on it and review it, since there are many changes.

I tried to group the commits logically, so you can review them separately.

Q: Why one large PR instead of many small ones?
A: Because of the Submit Queue and the time it takes to get PRs through it.

PTAL @kubernetes/sig-cluster-lifecycle

Edit: This work was split up into three PRs in total.



@pires

Member

commented Nov 28, 2016

@luxas while I understand your frustration, the process is in place for everyone, and everyone should follow it. Even though I'm not assigned to this, I wanted to review it. However, in the end it's a hard task. If only we had unit/e2e tests...

@luxas

Member Author

commented Nov 28, 2016

@pires I'm not frustrated at all.
I'm just following the pattern we discussed in kubernetes-dev: the fewer kubeadm PRs we have to merge this week, the better, given the current rush of PRs.

That was a deliberate decision, but I definitely don't think this is too overwhelming to review; we reviewed ~5000 LOC in the initial kubeadm PR, and now I've grouped the changes into logical commits.

And for comparison: #36263 is twice the size of this one.

Also, this code is battle-tested e2e-wise (manually); I've used it to spin up a lot of DigitalOcean clusters over the last week.

I will continue to work on it and rebase on top of all the changes I'm merging this week, and then put this up for final review and merge. But please pick a commit and start looking at it.

@pires

Member

commented Nov 28, 2016

And for comparison: #36263 is twice the size of this one.

I wouldn't measure PR complexity in LOC. The PR you linked is purely unit tests; this one is not. It touches a lot of different pieces of code, so to me it's complicated to review properly.

Anyway, this is just my two cents. If you guys decided to do it, by all means! 👍

@luxas

Member Author

commented Nov 28, 2016

Read kubernetes-dev on Slack :), but again, the commits are reviewable separately.

@errordeveloper

Member

commented Nov 29, 2016

I agree with @pires.

@marun

Member

commented Nov 29, 2016

I don't think this PR is reviewable in its current form. Even if it were acceptable to lump everything together - and I don't think it is - it is essential to ensure that the message associated with each commit clearly documents the intent of the code changes. Without that coherency, a reviewer will have a difficult time providing useful feedback for commits like 'wip' and 'a lot of changes'.

@luxas

Member Author

commented Dec 4, 2016

Rebased and updated. This is still WIP, but the 7th, 8th and maybe the 9th commit can now be reviewed.
The first 6 commits are from #37835 and #37831, so don't be scared of the size.

@k8s-ci-robot

Contributor

commented Dec 4, 2016

Jenkins kops AWS e2e failed for commit d7e49d4. Full PR test history.

The magic incantation to run this job again is @k8s-bot kops aws e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@luxas

Member Author

commented Dec 6, 2016

As soon as #37831 and #37835 merge, I'm going to rebase/update this PR to be mergeable (probably tomorrow), but here is a sneak peek at what it now looks like:

Old kubeadm init (terrible!):

Running pre-flight checks
I1206 14:53:12.397845   15515 validators.go:50] Validating os...
I1206 14:53:12.398782   15515 validators.go:50] Validating kernel...
I1206 14:53:12.399455   15515 kernel_validator.go:77] Validating kernel version
I1206 14:53:12.399539   15515 kernel_validator.go:92] Validating kernel config
I1206 14:53:12.409318   15515 validators.go:50] Validating cgroups...
I1206 14:53:12.409391   15515 validators.go:50] Validating docker...
Using Kubernetes version: v1.4.6
<master/tokens> generated token: "579bd7.3d2b5a1cd24b6964"
<master/pki> generated Certificate Authority key and certificate:
Issuer: CN=kubernetes | Subject: CN=kubernetes | CA: true
Not before: 2016-12-06 12:53:13 +0000 UTC Not After: 2026-12-04 12:53:13 +0000 UTC
Public: /etc/kubernetes/pki/ca-pub.pem
Private: /etc/kubernetes/pki/ca-key.pem
Cert: /etc/kubernetes/pki/ca.pem
<master/pki> generated API Server key and certificate:
Issuer: CN=kubernetes | Subject: CN=kube-apiserver | CA: false
Not before: 2016-12-06 12:53:13 +0000 UTC Not After: 2017-12-06 12:53:14 +0000 UTC
Alternate Names: [192.168.1.115 10.96.0.1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local]
Public: /etc/kubernetes/pki/apiserver-pub.pem
Private: /etc/kubernetes/pki/apiserver-key.pem
Cert: /etc/kubernetes/pki/apiserver.pem
<master/pki> generated Service Account Signing keys:
Public: /etc/kubernetes/pki/sa-pub.pem
Private: /etc/kubernetes/pki/sa-key.pem
<master/pki> created keys and certificates in "/etc/kubernetes/pki"
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
<util/kubeconfig> created "/etc/kubernetes/admin.conf"
<master/apiclient> created API client configuration
<master/apiclient> created API client, waiting for the control plane to become ready
<master/apiclient> all control plane components are healthy after 71.296436 seconds
<master/apiclient> waiting for at least one node to register and become ready
<master/apiclient> first node is ready after 1.502372 seconds
<master/apiclient> attempting a test deployment
<master/apiclient> test deployment succeeded
<master/apiclient> failed to delete test deployment [no kind "DeleteOptions" is registered for version "kubeadm.k8s.io/v1alpha1"] (will ignore)<master/discovery> created essential addon: kube-discovery, waiting for it to become ready
<master/discovery> kube-discovery is ready after 3.002425 seconds
<master/addons> created essential addon: kube-proxy
<master/addons> created essential addon: kube-dns

Kubernetes master initialised successfully!

You can now join any number of machines by running the following on each node:

kubeadm join --token=579bd7.3d2b5a1cd24b6964 192.168.1.115

New kubeadm init:

[kubeadm] Bear in mind that kubeadm is in alpha, do not use it in production clusters.
[preflight] Running pre-flight checks...
[preflight] Starting the kubelet service by running "systemctl start kubelet"
[init] Using Kubernetes version: v1.4.6
[tokens] Generated token: "e69295.cc499553b867df2c"
[certificates] Generated Certificate Authority key and certificate.
[certificates] Generated API Server key and certificate
[certificates] Generated Service Account signing keys
[certificates] Created keys and certificates in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 26.777194 seconds
[apiclient] Waiting for at least one node to register and become ready
[apiclient] First node is ready after 0.503150 seconds
[apiclient] Creating a test deployment
[apiclient] Test deployment succeeded
[token-discovery] Created the kube-discovery deployment, waiting for it to become ready
[token-discovery] kube-discovery is ready after 3.002083 seconds
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

But you still need to deploy a pod network to the cluster.
You should "kubectl apply -f" some pod network yaml file that's listed at:
    http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node:

kubeadm join --token=e69295.cc499553b867df2c 192.168.255.6

There are similar improvements to kubeadm join and kubeadm reset, and to the various failure cases, so the output stays user-friendly.

As said earlier, the last three commits are the real ones; please only look at them.
Here are the changes in human-readable form:

  • Mark socat, ethtool and ebtables as soft deps, since kubelet can be run in a container.
  • Auto-start the kubelet service if it isn't active. This is really convenient. If kubeadm does it, it informs the user that it ran that command so the user knows what's happening.
  • Renamed /etc/kubernetes/cloud-config.json to /etc/kubernetes/cloud-config since it shouldn't be a JSON file
  • A lot of logging improvements
  • Removed dead code
  • Refactored the code so setting KUBE_KUBERNETES_DIR and KUBE_HOST_PKI_PATH actually works
  • Simplification of the code
  • Made a small logging/output framework:
    • fmt.Println("[the-stage-here] Capital first letter of this message. Tell the user what the current state is")
    • fmt.Printf("[the-stage-here] Capital first letter. Maybe a [%v] in the end if an error should be displayed. Always ends with \n")
    • fmt.Errorf("Never starts with []. Includes a short error message plus the underlying error in [%v]. Never ends with \n")
    • In short: made everything consistent, since right now every message is formatted differently, which is a mess (see the sketch below)
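
For illustration, here is a minimal sketch of what that output convention looks like in practice. This is not actual kubeadm code; the function names (writeCACert, saveCert) and the pkiPath variable are hypothetical.

// Hypothetical example of the logging/output convention described above.
func writeCACert(pkiPath string) error {
	// Progress messages: "[stage]" prefix, capital first letter, always end with \n.
	fmt.Println("[certificates] Generated Certificate Authority key and certificate.")
	if err := saveCert(pkiPath); err != nil {
		// Errors: no "[stage]" prefix, no trailing \n, underlying error wrapped in [%v].
		return fmt.Errorf("failure while saving the CA certificate [%v]", err)
	}
	fmt.Printf("[certificates] Created keys and certificates in %q\n", pkiPath)
	return nil
}

func saveCert(path string) error { return nil } // placeholder for the real write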

k8s-github-robot pushed a commit that referenced this pull request Dec 7, 2016

Kubernetes Submit Queue
Merge pull request #37831 from luxas/improve_reset
Automatic merge from submit-queue (batch tested with PRs 38194, 37594, 38123, 37831, 37084)

Improve kubeadm reset

Depends on: #36474
Broken out from: #37568
Carries: #35709, @camilocot

This makes the `kubeadm reset` command more robust and user-friendly.
I'll rebase after #36474 merges...

cc-ing reviewers: @mikedanese @errordeveloper @dgoodwin @jbeda

@luxas luxas force-pushed the luxas:various_kubeadm_improvements branch from d7e49d4 to 784f276 Dec 7, 2016

@dgoodwin
Contributor

left a comment

A couple more small changes, mostly text; testing looks good on my VMs.

Your Kubernetes master has initialized successfully!
But you still need to deploy a pod network to the cluster.
You should "kubectl apply -f" some pod network yaml file that's listed at:


dgoodwin Dec 8, 2016

Contributor

Suggest:

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:


luxas Dec 8, 2016

Author Member

Fixed, thanks


fmt.Printf("[preflight] Starting the kubelet systemd service by running %q\n", "systemctl start kubelet")
if err := initSystem.ServiceStart("kubelet"); err != nil {
	fmt.Println("[preflight] Couldn't start the kubelet service via systemd. Please start the kubelet service manually and try again.")


dgoodwin Dec 8, 2016

Contributor

Should we include the "err" here? There might be something relevant to the user in there.


dgoodwin Dec 8, 2016

Contributor

It's also not great that we're mentioning systemd multiple times here, despite an attempt to remain init-system agnostic.

Why don't we just go to:

Starting kubelet service...
WARNING: Unable to start kubelet service: %s
WARNING: Please ensure kubelet is running manually.

Drop "try again" as I think we just proceed here, and kubeadm will hang if it's not actually running.


luxas Dec 8, 2016

Author Member

Fixing

@@ -22,6 +22,7 @@ import (
"html/template"
"io"
"io/ioutil"
"os"


dgoodwin Dec 8, 2016

Contributor

While testing I noticed that in this file we're outputting:

[preflight] Starting the kubelet systemd service by running "systemctl start kubelet"
Using Kubernetes version: v1.4.6

That is about the only line in all the output that doesn't have a [something] prefix.


luxas Dec 8, 2016

Author Member

That was a rebase conflict, thanks for catching it.

@@ -66,7 +66,7 @@ func NewReset(skipPreFlight, removeNode bool) (*Reset, error) {
if !skipPreFlight {
fmt.Println("[preflight] Running pre-flight checks...")

if err := preflight.RunResetCheck(); err != nil {
if err := preflight.RunChecks([]preflight.PreFlightCheck{preflight.IsRootCheck{}}, os.Stderr); err != nil {


dgoodwin Dec 8, 2016

Contributor

Do you want to slip one more fix into this PR? :)

We clean directories, then shut down all running containers. This can result in etcd writing more to the directory after we cleaned it, so your next init fails and you have to run reset a second time, which will then work.

We need to clean up the directories after shutting down all containers.

I can get this in a follow-up PR if you prefer.
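
For illustration, a minimal sketch of the ordering being asked for, with a hypothetical stopAndRemoveContainers helper and the directories kubeadm typically cleans; this is not the actual reset implementation.

// 1. Stop and remove all running Kubernetes containers first, so nothing
//    (e.g. etcd) can keep writing to the data directories afterwards.
if err := stopAndRemoveContainers(); err != nil {
	fmt.Printf("[reset] WARNING: Failed to stop running containers [%v]\n", err)
}
// 2. Only then remove the state directories.
for _, dir := range []string{"/etc/kubernetes/pki", "/var/lib/etcd", "/var/lib/kubelet"} {
	if err := os.RemoveAll(dir); err != nil {
		fmt.Printf("[reset] WARNING: Failed to remove %q [%v]\n", dir, err)
	}
}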


luxas Dec 8, 2016

Author Member

I'll fix it

}

// Then continue with the others...
if err := preflight.RunJoinNodeChecks(cfg); err != nil {
return nil, &preflight.PreFlightError{Msg: err.Error()}


mikedanese Dec 8, 2016

Member

Why don't we just return a preflight.PreFlightError from RunJoinNodeChecks?


luxas Dec 8, 2016

Author Member

Seems like no one saw it before; fixing.
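
For illustration, a sketch of the suggested change, with the signature assumed from the snippet above; wrapping the error inside RunJoinNodeChecks lets the caller simply propagate it.

// Inside the preflight package (sketch, assumed signature and check list):
func RunJoinNodeChecks(cfg *kubeadmapi.NodeConfiguration) error {
	checks := []PreFlightCheck{ /* join-specific checks */ }
	if err := RunChecks(checks, os.Stderr); err != nil {
		return &PreFlightError{Msg: err.Error()}
	}
	return nil
}

// The call site then becomes:
if err := preflight.RunJoinNodeChecks(cfg); err != nil {
	return nil, err
}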

fmt.Println("[preflight] Running pre-flight checks...")

// First, check if we're root separately from the other preflight checks and fail fast
if err := preflight.RunChecks([]preflight.PreFlightCheck{preflight.IsRootCheck{}}, os.Stderr); err != nil {


mikedanese Dec 8, 2016

Member

Consider moving into preflight package.


luxas Dec 8, 2016

Author Member

Yes, gonna do that
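
For illustration, a minimal sketch of what moving the fail-fast root check into the preflight package could look like; the helper name RunRootCheckOnly is hypothetical.

// In the preflight package (hypothetical helper):
func RunRootCheckOnly(w io.Writer) error {
	return RunChecks([]PreFlightCheck{IsRootCheck{}}, w)
}

// Callers then fail fast before the other checks with:
if err := preflight.RunRootCheckOnly(os.Stderr); err != nil {
	return err
}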


if err := client.Extensions().Deployments(api.NamespaceSystem).Delete("dummy", &v1.DeleteOptions{}); err != nil {


mikedanese Dec 8, 2016

Member

Let's just not create if we are not going to delete it.


mikedanese Dec 8, 2016

Member

Have you checked if #38330 fixes your problem? Can we just put a priority on getting that merged?


luxas Dec 8, 2016

Author Member

It does, tested it now.
Gonna revert this and wait for #38330, but hopefully it will merge very soon

@luxas

Member Author

commented Dec 8, 2016

Thanks for the reviews @mikedanese and @dgoodwin

If there's anything else you think should be fixed, please comment now. Otherwise I'll rebase and make it ready for merge tomorrow so we can get it in in time.

luxas added some commits Dec 4, 2016

Mark socat, ethtool and ebtables as soft deps, since kubelet can be run in a container. Also refactor preflight.go a little bit and improve logging
Run the root check before the other checks in order to fail fast if non-root, to avoid strange errors. Also auto-start the kubelet if inactive
Refactor the whole binary; a lot of changes in one commit I know, but I just hacked on this and modified everything I thought was messy or could be done better.

Fix boilerplates, comments in the code and make the output of kubeadm more user-friendly
Start using HostPKIPath and KubernetesDir everywhere in the code, so they can be changed for real
More robust kubeadm reset code now.
Removed old glog-things from app.Run()
Renamed /etc/kubernetes/cloud-config.json to /etc/kubernetes/cloud-config since it shouldn't be a json file
Simplification of the code
Less verbose output from master/pki.go
Cleaned up dead code

Start a small logging/output framework:
 - fmt.Println("[the-stage-here] Capital first letter of this message. Tell the user what the current state is")
 - fmt.Printf("[the-stage-here] Capital first letter. Maybe a [%v] in the end if an error should be displayed. Always ends with \n")
 - fmt.Errorf("Never starts with []. Includes a short error message plus the underlying error in [%v]. Never ends with \n")

@luxas luxas force-pushed the luxas:various_kubeadm_improvements branch from f1cf640 to 50b1077 Dec 9, 2016

@k8s-ci-robot

Contributor

commented Dec 9, 2016

Jenkins GCI GKE smoke e2e failed for commit 50b1077. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@luxas

Member Author

commented Dec 9, 2016

@dgoodwin Looks ok?

@dgoodwin

Contributor

commented Dec 9, 2016

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 9, 2016

@mikedanese

Member

commented Dec 9, 2016

Please squash fixup commits

@luxas

Member Author

commented Dec 9, 2016

@mikedanese The fourth and the fifth one? Sure.
Also, seems like I have to update bazel as well...

@luxas luxas force-pushed the luxas:various_kubeadm_improvements branch from 50b1077 to db4ab53 Dec 9, 2016

@k8s-github-robot k8s-github-robot removed the lgtm label Dec 9, 2016

@k8s-ci-robot

Contributor

commented Dec 9, 2016

Jenkins verification failed for commit db4ab53. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Fix review feedback, bazel files, tests and the dnsmasq-metrics spec. Set --kubelet-preferred-address-types on v1.5 and higher clusters

@luxas luxas force-pushed the luxas:various_kubeadm_improvements branch from db4ab53 to b060304 Dec 9, 2016

@k8s-ci-robot

Contributor

commented Dec 9, 2016

Jenkins GKE smoke e2e failed for commit db4ab53. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@luxas luxas added the lgtm label Dec 9, 2016

@k8s-github-robot

Contributor

commented Dec 9, 2016

Automatic merge from submit-queue (batch tested with PRs 37270, 38309, 37568, 34554)

@k8s-github-robot k8s-github-robot merged commit ac05e71 into kubernetes:master Dec 9, 2016

12 checks passed

Jenkins CRI GCE Node e2e Build succeeded.
Jenkins GCE Node e2e Build succeeded.
Jenkins GCE e2e Build succeeded.
Jenkins GCE etcd3 e2e Build succeeded.
Jenkins GCI GCE e2e Build succeeded.
Jenkins GCI GKE smoke e2e Build succeeded.
Jenkins GKE smoke e2e Build succeeded.
Jenkins Kubemark GCE e2e Build succeeded.
Jenkins unit/integration Build succeeded.
Jenkins verification Build succeeded.
Submit Queue Queued to run github e2e tests a second time.
cla/linuxfoundation luxas authorized