kube-apiserver 1.13.x refuses to work when first etcd-server is not available. #72102

Closed
Cytrian opened this issue Dec 17, 2018 · 68 comments · Fixed by etcd-io/etcd#10476, etcd-io/etcd#10911 or #81434
Labels
kind/bug lifecycle/frozen priority/critical-urgent sig/api-machinery sig/cluster-lifecycle

Comments


@Cytrian Cytrian commented Dec 17, 2018

How to reproduce the problem:
Set up a new demo cluster with kubeadm 1.13.1.
Create the default configuration with kubeadm config print init-defaults
Initialize cluster as usual with kubeadm init

Change the --etcd-servers list in kube-apiserver manifest to --etcd-servers=https://127.0.0.2:2379,https://127.0.0.1:2379, so that the first etcd node is unavailable ("connection refused").

The kube-apiserver is then not able to connect to etcd any more.

Last message: Unable to create storage backend: config (\u0026{ /registry [https://127.0.0.2:2379 https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc000381dd0 \u003cnil\u003e 5m0s 1m0s}), err (dial tcp 127.0.0.2:2379: connect: connection refused)\n","stream":"stderr","time":"2018-12-17T12:13:19.608822816Z"}

kube-apiserver does not start.

If I upgrade etcd to version 3.3.10, it reports an error remote error: tls: bad certificate", ServerName ""

Environment:

  • Kubernetes version 1.13.1
  • kubeadm in Vagrant box

I also experience this bug in an environment with a real etcd cluster.

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug needs-sig labels Dec 17, 2018
Author

@Cytrian Cytrian commented Dec 17, 2018

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery and removed needs-sig labels Dec 17, 2018
Contributor

@yue9944882 yue9944882 commented Dec 17, 2018

/remove-sig api-machinery
/sig cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle and removed sig/api-machinery labels Dec 17, 2018
Contributor

@yue9944882 yue9944882 commented Dec 17, 2018

/sig api-machinery

apologies, just had another look and it's indeed an api-machinery issue.

// Endpoints defines a set of URLs (schemes, hosts and ports only)
// that can be used to communicate with a logical etcd cluster. For
// example, a three-node cluster could be provided like so:
//
// Endpoints: []string{
// "http://node1.example.com:2379",
// "http://node2.example.com:2379",
// "http://node3.example.com:2379",
// }
//
// If multiple endpoints are provided, the Client will attempt to
// use them all in the event that one or more of them are unusable.
//
// If Client.Sync is ever called, the Client may cache an alternate
// set of endpoints to continue operation.

We pass the server list straight into the etcd v3 client, which returns the error you reported. I'm not sure whether this is by design.
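
To make the behaviour promised by the doc comment above concrete, here is a minimal Go sketch of "try every endpoint, skip the unusable ones". It is not the etcd client's implementation: firstReachable, tcpDial and errNoEndpoints are hypothetical names made up for illustration, and the check is plain TCP reachability with no TLS involved. The two addresses are the host:port parts of the --etcd-servers value from the reproduction steps.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errNoEndpoints is returned when the endpoint list is empty.
var errNoEndpoints = errors.New("no etcd endpoints configured")

// firstReachable tries each endpoint in order and returns the first one for
// which dial succeeds. Hypothetical helper, for illustration only.
func firstReachable(endpoints []string, dial func(endpoint string) error) (string, error) {
	if len(endpoints) == 0 {
		return "", errNoEndpoints
	}
	var lastErr error
	for _, ep := range endpoints {
		if err := dial(ep); err != nil {
			lastErr = err // remember the failure, but keep trying the rest
			continue
		}
		return ep, nil
	}
	return "", fmt.Errorf("all %d endpoints failed, last error: %w", len(endpoints), lastErr)
}

// tcpDial checks plain TCP reachability of a host:port address.
func tcpDial(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// host:port pairs from the reproduction steps; the first one is unavailable.
	endpoints := []string{"127.0.0.2:2379", "127.0.0.1:2379"}
	ep, err := firstReachable(endpoints, tcpDial)
	if err != nil {
		fmt.Println("no endpoint reachable:", err)
		return
	}
	fmt.Println("first reachable endpoint:", ep)
}
```

The dial function is passed in so the fallback logic can be exercised without a network. The reported failure corresponds to giving up after the first endpoint's error instead of continuing the loop.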

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery label Dec 17, 2018
Contributor

@JishanXing JishanXing commented Dec 20, 2018

This is an etcdv3 client issue. See etcd-io/etcd#9949

Contributor

@fedebongio fedebongio commented Dec 20, 2018

/cc @jpbetz

Member

@timothysc timothysc commented Feb 1, 2019

/assign @timothysc @detiber

Live-updating a static pod manifest is typically not recommended. Was this triggered by some other operation, or were you editing your static manifests?

@timothysc timothysc added priority/important-soon priority/awaiting-more-evidence and removed priority/important-soon labels Feb 1, 2019
@timothysc timothysc assigned alexbrand and unassigned detiber Feb 1, 2019
Author

@Cytrian Cytrian commented Feb 1, 2019

No pod manifest involved here. Just a group of etcd and a kube-apiserver. The issue appeared when we rebooted the first etcd node.

Member

@alexbrand alexbrand commented Feb 4, 2019

I was able to repro this issue with the repro steps provided by @Cytrian. I also reproduced this issue with a real etcd cluster.

As @JishanXing previously mentioned, the problem is caused by a bug in the etcd v3 client library (or perhaps the grpc library). The vault project is also running into this: hashicorp/vault#4349

The problem seems to be that the etcd library uses the first node’s address as the ServerName for TLS. This means that all attempts to connect to any server other than the first will fail with a certificate validation error (i.e. cert has ${nameOfNode2} in SANs, but the client is expecting ${nameOfNode1}).

An important thing to highlight is that when the first etcd server goes down, it also takes the Kubernetes API servers down, because they fail to connect to the remaining etcd servers.

With that said, this all depends on what your etcd server certificates look like:

  • If you follow the kubeadm instructions to stand up a 3 node etcd cluster, you get a set of certificates that include the first node’s name and IP in the SANs (because all certs are generated on the first etcd node). Thus, you should not run into this issue.
  • If you have used another process to generate certificates for etcd, and the certs do not include the first node’s name and IP in the SANs, you will most likely run into this issue when the first etcd node goes down.
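
The certificate mismatch described above can be reproduced in isolation with Go's standard library. The sketch below is only an illustration under stated assumptions: certWithSANs is a hypothetical helper that creates a throwaway self-signed certificate, and node1.example.com / node2.example.com are placeholder names borrowed from the etcd doc comment quoted earlier. It does not use the etcd client or gRPC; it shows only the hostname check that fails when the client expects the first node's name while talking to the second node.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// certWithSANs returns a throwaway self-signed certificate whose DNS SANs are
// exactly the given names. Hypothetical helper, for illustration only.
func certWithSANs(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// Certificate served by the second etcd node: it lists only its own name.
	node2Cert, err := certWithSANs("node2.example.com")
	if err != nil {
		panic(err)
	}
	// The client keeps expecting the first endpoint's name as ServerName.
	fmt.Println("checked against node1:", node2Cert.VerifyHostname("node1.example.com"))
	// Checking against the node's own name succeeds (prints <nil>).
	fmt.Println("checked against node2:", node2Cert.VerifyHostname("node2.example.com"))
}
```

The first check fails with an x509 hostname error of the same "certificate is valid for ..., not ..." shape that appears in the API server logs later in this thread.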

To reproduce the issue with a real etcd cluster:

  1. Create a 3 node etcd cluster with TLS enabled. Each certificate should only contain the name/IP of the node that will be serving it.
  2. Start an API server that points to the etcd cluster.
  3. Stop the first etcd node.
  4. API server crashes and fails to come back up

Versions:

  • kubeadm version: v1.13.2
  • kubernetes api server version: v1.13.2
  • etcd image: k8s.gcr.io/etcd:3.2.24

API server crash log: https://gist.github.com/alexbrand/ba86f506e4278ed2ada4504ab44b525b

I was unable to reproduce this issue with API server v1.12.5 (n.b. this was somewhat of a non-scientific test => tested by updating the image field of the API server static pod produced by kubeadm v1.13.2)

Member

@timothysc timothysc commented Feb 4, 2019

/assign @gyuho @xiang90 @jpbetz

@timothysc timothysc added priority/critical-urgent and removed priority/awaiting-more-evidence labels Feb 4, 2019
Member

@timothysc timothysc commented Feb 4, 2019

@liggitt ^ FYI.

Member

@neolit123 neolit123 commented Feb 6, 2019

thank you for the investigation @alexbrand

Member

@gyuho gyuho commented Aug 30, 2019

I am adding "Known issue" section to etcd docs here kubernetes/website#16156.

Member

@dims dims commented Aug 30, 2019

@igcherkaev see what @gyuho said :)


@javendo javendo commented Sep 10, 2019

Some days ago I opened issue #81837, and after reading this one I think the two are related. Can anyone take a look at my issue and check whether they are? If so, I can close it.


@TheDukeDK TheDukeDK commented Oct 7, 2019

I believe I am running into this issue or at least something similar.

CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                       PORTS               NAMES
ef5a8da82f60        3cab8e1b9802           "etcd --advertise-..."   41 seconds ago      Up 40 seconds                                    k8s_etcd_etcd-nem-docker-master01.inter-olymp.local_kube-system_beb5ba6bc28b987902829f8d53bdef31_1381
3f1cd9728ecc        3cab8e1b9802           "etcd --advertise-..."   2 minutes ago       Exited (0) 40 seconds ago                        k8s_etcd_etcd-nem-docker-master01.inter-olymp.local_kube-system_beb5ba6bc28b987902829f8d53bdef31_1380
9b991bfbf812        ab60b017e34f           "kube-apiserver --..."   5 minutes ago       Exited (255) 4 minutes ago                       k8s_kube-apiserver_kube-apiserver-nem-docker-master01.inter-olymp.local_kube-system_e4ebc726604ae399a1b7beb9adcb6b4d_1056
f66d4ae02fea        5a1527e735da           "kube-scheduler --..."   3 days ago          Up 3 days                                        k8s_kube-scheduler_kube-scheduler-nem-docker-master01.inter-olymp.local_kube-system_dd3b0cd7d636afb2b116453dc6524f26_19
482625481e36        07e068033cf2           "kube-controller-m..."   3 days ago          Up 3 days                                        k8s_kube-controller-manager_kube-controller-manager-nem-docker-master01.inter-olymp.local_kube-system_ee67cb8ee97d2edbb62c52d7615f8b47_18

I see both etcd and the API server going up and down. This cluster was created with kubeadm.

The logs from etcd show the following.

2019-10-07 07:15:17.105026 I | embed: ready to serve client requests
2019-10-07 07:15:17.105242 I | embed: serving client requests on 127.0.0.1:2379
WARNING: 2019/10/07 07:15:17 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2019-10-07 07:16:45.903683 N | pkg/osutil: received terminated signal, shutting down...
2019-10-07 07:16:45.903767 I | etcdserver: skipped leadership transfer for single member cluster

The logs from the api server show the following.

Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I1007 06:59:46.911950       1 server.go:681] external host was not specified, using 192.168.2.227
I1007 06:59:46.912076       1 server.go:152] Version: v1.12.0
I1007 06:59:47.402070       1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1007 06:59:47.402095       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1007 06:59:47.402622       1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1007 06:59:47.402631       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F1007 07:00:07.404985       1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true true 1000 0xc42015f440 <nil> 5m0s 1m0s}), err (context deadline exceeded)

Is this the same?


@gjcarneiro gjcarneiro commented Oct 16, 2019

There are claims here that the bug is solved, but I am seeing evidence of it not being solved in our cluster:

I1016 09:59:22.196298       1 client.go:361] parsed scheme: "endpoint"
I1016 09:59:22.196340       1 endpoint.go:66] ccResolverWrapper: sending new addresses to cc: [{https://hex-64d-pm.k2.gambit:2379 0  <nil>} {https://hex-64f-pm.k2.gambit:2379 0  <nil>} {https://hex-8c4-pm.k2.gambit:2379 0  <nil>}]
W1016 09:59:22.212143       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://hex-8c4-pm.k2.gambit:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, hex-8c4-pm.k2.gambit, hex-8c4-pm, not hex-64d-pm.k2.gambit". Reconnecting...
W1016 09:59:22.216358       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://hex-64f-pm.k2.gambit:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for hex-64f-pm, hex-64f-pm.k2.gambit, hex-64f-pm.hex10.gambit, not hex-64d-pm.k2.gambit". Reconnecting...
I1016 09:59:22.511696       1 client.go:361] parsed scheme: "endpoint"
I1016 09:59:22.511736       1 endpoint.go:66] ccResolverWrapper: sending new addresses to cc: [{https://hex-64d-pm.k2.gambit:2379 0  <nil>} {https://hex-64f-pm.k2.gambit:2379 0  <nil>} {https://hex-8c4-pm.k2.gambit:2379 0  <nil>}]
W1016 09:59:22.525738       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://hex-8c4-pm.k2.gambit:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, hex-8c4-pm.k2.gambit, hex-8c4-pm, not hex-64d-pm.k2.gambit". Reconnecting...
W1016 09:59:22.530117       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://hex-64f-pm.k2.gambit:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for hex-64f-pm, hex-64f-pm.k2.gambit, hex-64f-pm.hex10.gambit, not hex-64d-pm.k2.gambit". Reconnecting...

Are we absolutely sure the etcd client fix made it onto the release? I am testing v1.6.2.


@seh seh commented Oct 16, 2019

That bug is not fixed yet. The only fix was for IP address-only connections, not those using DNS names like this. We are waiting on #83968 for what will probably be Kubernetes version 1.16.3.

The workaround I'm using today is to replace my etcd server certificates with ones that use a wildcard SAN for the members in the subdomain, rather than including the given machine's DNS name as a SAN. So far, it works.
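
As a rough illustration of why the wildcard workaround helps, the Go sketch below checks a certificate carrying a single wildcard DNS SAN against several member names. Everything named here is an assumption for the example: certWithSANs is a hypothetical helper that creates a throwaway self-signed certificate, and etcd.example.com is a placeholder subdomain. It exercises only the standard library's hostname matching, not the etcd client.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// certWithSANs returns a throwaway self-signed certificate whose DNS SANs are
// exactly the given names. Hypothetical helper, for illustration only.
func certWithSANs(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-wildcard-demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// One wildcard SAN covering every member in the (placeholder) subdomain.
	cert, err := certWithSANs("*.etcd.example.com")
	if err != nil {
		panic(err)
	}
	hosts := []string{
		"member1.etcd.example.com", // name the client expects (first endpoint)
		"member2.etcd.example.com", // name of the member actually dialed
		"etcd.example.com",         // parent domain, not covered by the wildcard
	}
	for _, h := range hosts {
		fmt.Printf("%s verifies: %v\n", h, cert.VerifyHostname(h) == nil)
	}
}
```

Because the wildcard matches any single leftmost label, the certificate verifies no matter which member's name the client uses as the expected server name. The parent domain itself is not matched.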

Member

@liggitt liggitt commented Oct 16, 2019

It was fixed for IP addresses, but not DNS names (DNS name issue is tracked in #83028). Additionally, part of the fix regressed IPv6 address handling (#83550). See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#known-issues.

These two issues have been resolved in master, and #83968 is open to pick them to 1.16 (targeting 1.16.3)
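
For readers wondering why IP addresses, DNS names and IPv6 literals behave differently, the Go sketch below derives the host part of each endpoint, which is the value a client would want to verify the server certificate against when connecting to that particular endpoint. This is not the code from the linked fixes: serverNameFor is a hypothetical helper, and the endpoints are sample values taken from logs in this thread plus an IPv6 loopback literal added for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// serverNameFor extracts the host (DNS name or IP literal) from an endpoint,
// with or without a URL scheme. Hypothetical helper, for illustration only.
func serverNameFor(endpoint string) (string, error) {
	if strings.Contains(endpoint, "://") {
		u, err := url.Parse(endpoint)
		if err != nil {
			return "", err
		}
		host := u.Hostname() // strips the port and any IPv6 brackets
		if host == "" {
			return "", fmt.Errorf("endpoint %q has no host", endpoint)
		}
		return host, nil
	}
	// Bare host:port form, as seen in some of the gRPC log lines.
	host, _, err := net.SplitHostPort(endpoint)
	if err != nil {
		return "", err
	}
	return host, nil
}

func main() {
	endpoints := []string{
		"https://hex-64d-pm.k2.gambit:2379", // DNS name
		"https://172.17.8.201:2379",         // IPv4 address
		"https://[::1]:2379",                // IPv6 literal in brackets
		"10.90.9.32:2379",                   // no scheme
	}
	for _, ep := range endpoints {
		name, err := serverNameFor(ep)
		if err != nil {
			fmt.Printf("%s: error: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s -> verify certificate against %q\n", ep, name)
	}
}
```

The point of the sketch is that the expected name is computed per endpoint rather than once from the first entry in the list, and that bracketed IPv6 literals need their brackets removed before comparison.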


@gjcarneiro gjcarneiro commented Oct 16, 2019

Ah... thank you @seh and @liggitt, that explains it. Cheers!


@Nuru Nuru commented Oct 21, 2019

@seh Would you please explain how to change the SAN on the etcd certificates?

> The workaround I'm using today is to replace my etcd server certificates with ones that use a wildcard SAN for the members in the subdomain, rather than including the given machine's DNS name as a SAN. So far, it works.


@seh seh commented Oct 21, 2019

> Would you please explain how to change the SAN on the etcd certificates?

I generate these certificates myself using Terraform's tls provider, so it's a matter of revising the arguments passed for the tls_cert_request resource's "dns_names" attribute.


@yacinelazaar yacinelazaar commented Nov 7, 2019

Tried with Kubernetes 1.15.3 and with 1.16.2, but it's not working with either.
This is not fixed even for IP addresses:

W1107 12:48:06.316691       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://172.17.8.202:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.0.2.15, 127.0.0.1, ::1, 172.17.8.202, not 172.17.8.201". Reconnecting...
W1107 12:48:06.328186       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://172.17.8.203:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.0.2.15, 127.0.0.1, ::1, 172.17.8.203, not 172.17.8.201". Reconnecting...


@r0bj r0bj commented Nov 7, 2019

I have similar observation as @yacinelazaar with IP addresses:
etcd 3.3.15
kubernetes 1.16.2

kube-apiserver log:

W1107 10:57:50.375677       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.12.72.135:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.12.72.135, not 10.12.70.111". Reconnecting...
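
When comparing many of these warnings, it can help to pull out which names the served certificate covers and which name the client expected. The Go sketch below does this with a regular expression matched against the "certificate is valid for ..., not ..." message seen in the logs above. parseSANMismatch is a hypothetical helper written for this thread, and the pattern is an assumption based only on the log lines posted here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sanMismatch matches the x509 hostname error embedded in the gRPC warning.
var sanMismatch = regexp.MustCompile(`x509: certificate is valid for (.+), not ([^\s"]+)`)

// parseSANMismatch returns the names the certificate is valid for and the
// name the client expected. Hypothetical helper, for illustration only.
func parseSANMismatch(line string) (validFor []string, expected string, ok bool) {
	m := sanMismatch.FindStringSubmatch(line)
	if m == nil {
		return nil, "", false
	}
	return strings.Split(m[1], ", "), m[2], true
}

func main() {
	// Log line copied from the comment above.
	line := `W1107 10:57:50.375677       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.12.72.135:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.12.72.135, not 10.12.70.111". Reconnecting...`

	validFor, expected, ok := parseSANMismatch(line)
	if !ok {
		fmt.Println("no SAN mismatch found in line")
		return
	}
	fmt.Println("certificate covers:", validFor)
	fmt.Println("client expected:   ", expected)
}
```

If the expected name is always the first entry of --etcd-servers while the dialed address is a different member, the log matches the behaviour discussed in this issue.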

jhunt added a commit to jhunt/k8s-boshrelease that referenced this issue Nov 11, 2019
This fixes a big issue with apiserver <-> etcd interaction and mutual
TLS, as defined in [1] and [2].

[1]: https://github.com/etcd-io/etcd/releases/tag/v3.3.14
[2]: kubernetes/kubernetes#72102

Fixes #24

@yacinelazaar yacinelazaar commented Nov 16, 2019

> I have similar observation as @yacinelazaar with IP addresses:
> etcd 3.3.15
> kubernetes 1.16.2
>
> kube-apiserver log:
>
> W1107 10:57:50.375677       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.12.72.135:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.12.72.135, not 10.12.70.111". Reconnecting...

You should be fine with 1.16.2 and Etcd 3.3.15 now. I managed to get 3 masters running.


@Davidrjx Davidrjx commented Mar 21, 2020

In my case, the apiserver keeps repeating warnings about connecting to an external etcd cluster over TLS. Log snippets follow:

...
I0316 09:23:15.568757       1 client.go:354] parsed scheme: ""
I0316 09:23:15.568774       1 client.go:354] scheme "" not registered, fallback to default scheme
I0316 09:23:15.568812       1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{10.90.9.32:2379 0  <nil>}]
I0316 09:23:15.568868       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{10.90.9.32:2379 <nil>} {10.90.9.41:2379 <nil>} {10.90.9.44:2379 <nil>}]
W0316 09:23:15.573002       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.90.9.44:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.90.9.44, 127.0.0.1, not 10.90.9.32". Reconnecting...
W0316 09:23:15.581827       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.90.9.41:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.90.9.41, 127.0.0.1, not 10.90.9.32". Reconnecting...
I0316 09:23:15.582204       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{10.90.9.32:2379 <nil>}]
W0316 09:23:15.582232       1 asm_amd64.s:1337] Failed to dial 10.90.9.44:2379: context canceled; please retry.
W0316 09:23:15.582242       1 asm_amd64.s:1337] Failed to dial 10.90.9.41:2379: context canceled; please retry.
...

My environment:
HA kubernetes 1.15.5 cluster made by kubeadm
Etcd 3.3.10 cluster with three members

But I am not sure whether my issue is related to gRPC. Any answer will be appreciated.


@terzonstefano terzonstefano commented Jul 24, 2020

COMPLETE BACKUP AND RESTORE PROCEDURE FOR ETCD

NOTE: Check that the file "/etc/kubernetes/manifests/etcd.yaml" configures the client listen address and port as shown below:

  • --listen-client-urls=https://127.0.0.1:2379,https://.......2379
  1. Backup

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/snapshot-pre-boot.db

NOTE: etcdctl is a command normally found on the master

  2. Restore (the value of "--initial-cluster-token" can be anything you like)

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --name=master --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --data-dir /var/lib/etcd-from-backup --initial-cluster=master=https://127.0.0.1:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls=https://127.0.0.1:2380 snapshot restore /tmp/snapshot-pre-boot.db

  3. Change the parameters in the following file
    vi /etc/kubernetes/manifests/etcd.yaml

--data-dir=/var/lib/etcd-from-backup ## update --data-dir to the new target location (as set in the previous restore command)

--initial-cluster-token=etcd-cluster-1 ## (as set in the previous restore command)

volumeMounts:
  - mountPath: /var/lib/etcd-from-backup ## change to the path set in the previous restore command
    name: etcd-data

hostPath:
  path: /var/lib/etcd-from-backup ## change to the path set in the previous restore command
  type: DirectoryOrCreate
name: etcd-data

  4. Check the kubernetes environment after the changes

See if the container process is back on

docker ps -a | grep etcd

see if the cluster members have been recreated

ETCDCTL_API=3 etcdctl member list --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=127.0.0.1:2379

see if pods, deployments and services have been recreated

kubectl get pods,svc,deployments


@terzonstefano terzonstefano commented Sep 1, 2020

"""" INSTALL KUBERNETES WITH KUBEADM """"

!!! CHECK ALL INSTALLATION PREREQUISITES BEFORE INSTALLING kubernetes ----> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ !!!

prerequisites: Check if it is already installed by running these commands on master nodes:

kubectl

kubeadm

prerequisites: check which version of linux you have: --> cat /etc/os-release

prerequisites: Letting iptables see bridged traffic

prerequisites: Check required ports

prerequisites: install docker on all nodes if not already installed --> https://kubernetes.io/docs/setup/production-environment/container-runtimes/

docker installation:

sudo -i

(Install Docker CE)

Set up the repository:

Install packages to allow apt to use a repository over HTTPS

apt-get update && apt-get install -y apt-transport-https ca-certificates curl software-properties-common gnupg2

Add Docker’s official GPG key:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

Add the Docker apt repository:

add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

Install Docker CE

apt-get update && apt-get install -y containerd.io=1.2.13-2 docker-ce=5:19.03.11~3-0~ubuntu-$(lsb_release -cs) docker-ce-cli=5:19.03.11~3-0~ubuntu-$(lsb_release -cs)

Set up the Docker daemon

cat > /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2"
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

Restart Docker

systemctl daemon-reload
systemctl restart docker

to check whether docker is active:

systemctl status docker.service

############################end pre-requisites##############################################

INSTALL KUBERNETES

Return to the manual --> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ and install the following components:

kubeadm: the command to bootstrap the cluster.

kubelet: the component that runs on all of the machines in your cluster and does things like starting pods and containers.

kubectl: the command line util to talk to your cluster.

I report below the steps of the manual, do it on each node:

sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

systemctl daemon-reload
systemctl restart kubelet

Check the kubeadm version --> kubeadm version -o short
Check the kubelet version --> kubelet --version

We skip the "Configure cgroup driver" step; it applies only when Docker is not installed.

At the bottom of the page linked above, the "What's next" section links to the following page --> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

Run this command on the master only:

do it as a normal user

kubeadm init

Once installed, copy the output that appears and create the directories as follows on the master only:

do it with normal user

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

generate the token on the master --> "kubeadm token create --print-join-command" and then copy the output to all workers

Still on the master, give the following command:

kubectl get nodes ## you will see that the nodes are not active since the network for the nodes and pods has not been installed

ENABLE THE NETWORK:

user root:

Set /proc/sys/net/bridge/bridge-nf-call-iptables to "1" (required for all CNI plugins) by running the command below on all nodes:

sysctl net.bridge.bridge-nf-call-iptables=1

Install the network on the master with the normal user

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

Check again if the nodes are in the running state

watch kubectl get nodes

##################################################

Install the etcd and etcdctl command on master node

cat /etc/kubernetes/manifests/etcd.yaml | grep -i image ## check the etcd version on the master, then download it following the instructions on the website below

https://github.com/etcd-io/etcd/releases/

wget -q --show-progress --https-only --timestamping "https://github.com/etcd-io/etcd/releases/download/v3.4.13/etcd-v3.4.13-linux-amd64.tar.gz"

tar -xvf etcd-v3.4.13-linux-amd64.tar.gz

sudo mv etcd-v3.4.13-linux-amd64/etcd* /usr/local/bin/

cd /usr/local/bin/

chown root:root etcd*

check that everything is working

ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key


@terzonstefano terzonstefano commented Nov 11, 2020

Installation of the METRICS SERVER on a MASTER without minikube

https://gitlab.datahub.erdc.ericsson.net/syafiq/assignment_3-4/tree/8042e20d34b883620f8d254a37a432b76f6683f7/metrics-server

copy the link of the zip file

  1. wget https://gitlab.datahub.erdc.ericsson.net/syafiq/assignment_3-4/-/archive/8042e20d34b883620f8d254a37a432b76f6683f7/assignment_3-4-8042e20d34b883620f8d254a37a432b76f6683f7.zip

  2. Enter the directory "assignment...." and its subdirectory "metrics-server"

  3. kubectl create -f deploy/1.8+/

  4. wait

  5. kubectl top nodes && kubectl top pods


@terzonstefano terzonstefano commented Nov 21, 2020

Deploying flannel network manually

Flannel can be added to any existing Kubernetes cluster though it's simplest to add flannel before any pods using the pod network have been started.

For Kubernetes v1.17+ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

#############################################################

Quickstart for Calico network on Kubernetes

  1. https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ ##

  2. Initialize the master using the following command --> kubeadm init --pod-network-cidr=192.168.0.0/16

  3. Execute the following commands to configure kubectl (also returned by kubeadm init)

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

  4. Install the Tigera Calico operator and custom resource definitions

kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml

  5. Install Calico by creating the necessary custom resource

kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml

Note: Before creating this manifest, read its contents and make sure its settings are correct for your environment. For example, you may need to change the default IP pool CIDR to match your pod network CIDR

  6. Confirm that all of the pods are running with the following command.

watch kubectl get pods -n calico-system

Note: The Tigera operator installs resources in the calico-system namespace. Other install methods may use the kube-system namespace instead

  7. Remove the taints on the master so that you can schedule pods on it

kubectl taint nodes --all node-role.kubernetes.io/master-

  8. Confirm that you now have a node in your cluster with the following command.

kubectl get nodes -o wide
It should return something like the following.

NAME              STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
<your-hostname>   Ready    master   52m   v1.12.2   10.128.0.28   <none>        Ubuntu 18.04.1 LTS   4.15.0-1023-gcp   docker://18.6.1

--------------

About installing calicoctl

calicoctl allows you to create, read, update, and delete Calico objects from the command line. Calico objects are stored in one of two datastores, either etcd or Kubernetes. The choice of datastore is determined at the time Calico is installed. Typically for Kubernetes installations the Kubernetes datastore is the default.

You can run calicoctl on any host with network access to the Calico datastore as either a binary or a container. For step-by-step instructions, refer to the section that corresponds to your desired deployment.

Installing calicoctl as a Kubernetes pod
Use the YAML that matches your datastore type to deploy the calicoctl container to your nodes.

etcd

kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl-etcd.yaml

Kubernetes API datastore

kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml

You can then run commands using kubectl as shown below.

kubectl exec -ti -n kube-system calicoctl -- /calicoctl get profiles -o wide
An example response follows.

NAME TAGS
kns.default kns.default
kns.kube-system kns.kube-system
We recommend setting an alias as follows.

alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
Note: In order to use the calicoctl alias when reading manifests, redirect the file into stdin, for example:

calicoctl create -f - < my_manifest.yaml


@goginenigvk goginenigvk commented Jan 7, 2021

We are using kops. Can someone help me with this?
systemctl status kubelet shows the following errors:
Jan 07 00:42:11 ip-172-50-2-100 kubelet[5299]: E0107 00:42:11.991760 5299 kubelet.go:2268] node "ip-172-50-2-100.ec2.internal" not found
Jan 07 00:42:12 ip-172-50-2-100 kubelet[5299]: E0107 00:42:12.091892 5299 kubelet.go:2268] node "ip-172-50-2-100.ec2.internal" not found
Jan 07 00:42:12 ip-172-50-2-100 kubelet[5299]: W0107 00:42:12.167515 5299 container.go:409] Failed to create summary reader for "/system.slice/docker-healthcheck.service": none of the resources are being tracked.
