# adding steps to recover cluster #3438

**Merged** · 3 commits merged on Jun 7, 2022 · Changes from 1 commit
`docs/book/src/topics/troubleshooting.md` (133 additions, 0 deletions)

@@ -59,3 +59,136 @@ $ aws iam get-instance-profile --instance-profile-name control-plane.cluster-api

```
If the instance profile does not look as expected, you may try recreating the CloudFormation stack using `clusterawsadm` as explained in the above sections.
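
For reference, recreating the stack is typically a single `clusterawsadm` invocation like the sketch below; this assumes the IAM resources were originally created with `clusterawsadm` and that AWS credentials and region are configured in your environment.

```bash
# Recreate/update the CloudFormation stack that provisions the IAM resources CAPA expects.
export AWS_REGION=<aws-region>
clusterawsadm bootstrap iam create-cloudformation-stack
```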


## Recover a management cluster after losing the API server load balancer

### **Access the API server locally**

1. SSH to a control plane node and modify `/etc/kubernetes/admin.conf` (a scripted sketch of these edits follows this list):

* Replace the `server` value with `server: https://localhost:6443`

* Add `insecure-skip-tls-verify: true`

* Comment out `certificate-authority-data:`

2. Export the kubeconfig and ensure you can connect

```bash
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
```
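
If you prefer not to hand-edit the kubeconfig, a minimal sketch of the same three edits using GNU `sed` is below; it assumes a stock kubeadm-generated `admin.conf`, so back the file up and review the result before relying on it.

```bash
# Back up the kubeconfig before modifying it.
cp /etc/kubernetes/admin.conf /etc/kubernetes/admin.conf.bak

# Point the client at the local API server instead of the lost load balancer.
sed -i 's|^\(\s*server:\).*|\1 https://localhost:6443|' /etc/kubernetes/admin.conf

# Comment out the CA data and skip TLS verification (temporary, local-only workaround).
sed -i 's|^\(\s*\)\(certificate-authority-data:.*\)|\1insecure-skip-tls-verify: true\n\1# \2|' /etc/kubernetes/admin.conf
```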


### **Get rid of the lingering duplicate cluster**

1. A duplicate cluster is stuck deleting because some of its resources cannot be cleaned up while they are still in use, so we need to stop the conflicting reconciliation process. Edit the duplicate `awscluster` object and remove the `finalizers` (a non-interactive alternative is sketched after this list):

```bash
kubectl edit awscluster <clustername>
```
2. Run `kubectl get clusters` to verify it is gone.
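
If the interactive edit is awkward, the finalizers can also be cleared non-interactively with `kubectl patch`; a sketch, where `<clustername>` and `<namespace>` are the duplicate cluster's name and namespace:

```bash
# Clearing the finalizers lets the stuck deletion complete.
kubectl patch awscluster <clustername> -n <namespace> --type merge -p '{"metadata":{"finalizers":null}}'
```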
> **Review comment (Contributor):** We can also use `kubectl describe` here to check that the finalizers were actually removed from the yaml.
>
> **Reply (Contributor Author):** I just updated this to add the describe. Let me know if this looks good now @shivi28



### **Make at least one node `Ready`**

1. Right now all endpoints are down because the nodes are not ready, which is a problem for the CoreDNS and CNI pods in particular, so let's get one control plane node back to healthy. On the control plane node we logged into, edit `/etc/kubernetes/kubelet.conf`:

* Replace the `server` with `server: https://localhost:6443`

* Add `insecure-skip-tls-verify: true`

* Comment out `certificate-authority-data:`

* Restart the kubelet `systemctl restart kubelet`

2. Run `kubectl get nodes` and validate that the node is in a `Ready` state.
3. After a few minutes most things should start scheduling themselves on the node. The pods that did not restart on their own and were causing issues were CoreDNS, kube-proxy, and the CNI pods; those should be restarted manually (see the sketch after this list).
4. (Optional) Tail the CAPA logs to see the load balancer start to reconcile:

```bash
kubectl logs -f -n capa-system deployments.apps/capa-controller-manager
```
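
A sketch of forcing those restarts; the `coredns` Deployment and `kube-proxy` DaemonSet names are the kubeadm defaults, and the CNI DaemonSet name depends on which CNI you run:

```bash
# Force fresh pods so they pick up the now-reachable API server.
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout restart daemonset kube-proxy
# Also restart your CNI's DaemonSet, for example:
kubectl -n kube-system rollout restart daemonset <your-cni-daemonset>
```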

### **Update the control plane nodes with new LB settings**

1. To be safe, and to avoid potential data loss, we will do this on all CP nodes rather than having them recreated. Perform the following steps on **each** CP node.

2. Regenerate the certs for the API server using the new name. Make sure to update your service CIDR and endpoint in the command below (a verification sketch follows this list).

```bash
rm /etc/kubernetes/pki/apiserver.crt
rm /etc/kubernetes/pki/apiserver.key

kubeadm init phase certs apiserver --control-plane-endpoint="mynewendpoint.com" --service-cidr=100.64.0.0/13 -v10
```

3. Update settings in `/etc/kubernetes/admin.conf`

* Replace the `server` with `server: https://<your-new-lb.com>:6443`

* Remove `insecure-skip-tls-verify: true`

* Uncomment `certificate-authority-data:`

* Export the kubeconfig and ensure you can connect

```bash
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
```

4. Update the settings in `/etc/kubernetes/kubelet.conf`

* Replace the `server` with `server: https://your-new-lb.com:6443`

* Remove `insecure-skip-tls-verify: true`

* Uncomment `certificate-authority-data:`

* Restart the kubelet `systemctl restart kubelet`

5. Just as we did before, new pods need to pick up the API server changes, so you will want to force-restart pods like the CNI pods, kube-proxy, CoreDNS, etc.
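
To confirm the API server certificate regenerated in step 2 actually contains the new endpoint, something like the following can be used on each node (a sketch, assuming `openssl` is installed):

```bash
# The new control plane endpoint should show up in the certificate's Subject Alternative Names.
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
```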

### Update CAPI settings for the new LB DNS name

1. Update the control plane endpoint on the `awscluster` and `cluster` objects. To do this we need to disable the validating webhooks; back them up and then delete them so they can be re-applied later.

```bash
kubectl get validatingwebhookconfigurations capa-validating-webhook-configuration -o yaml > capa-webhook && kubectl delete validatingwebhookconfigurations capa-validating-webhook-configuration

kubectl get validatingwebhookconfigurations capi-validating-webhook-configuration -o yaml > capi-webhook && kubectl delete validatingwebhookconfigurations capi-validating-webhook-configuration
```

2. Edit the `spec.controlPlaneEndpoint.host` field on both `awscluster` and `cluster` to have the new endpoint

3. Re-apply your webhooks

```bash
kubectl apply -f capi-webhook
kubectl apply -f capa-webhook
```


4. Update the following config maps and replace the old control plane name with the new one.

```bash
kubectl edit cm -n kube-system kubeadm-config
kubectl edit cm -n kube-system kube-proxy
kubectl edit cm -n kube-public cluster-info
```

5. Edit the cluster kubeconfig secret that CAPI uses to talk to the management cluster. You will need to decode the secret, replace the endpoint, then re-encode and save it (a sketch follows this list).

```bash
kubectl edit secret -n <namespace> <cluster-name>-kubeconfig
```
6. At this point things should start to reconcile on their own, but we can use the commands in the next step to force it.
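
A sketch of the decode/re-encode from step 5 on the command line; it assumes the secret stores the kubeconfig under the `value` key (the CAPI convention), that GNU `base64` is available, and that the old/new endpoints are placeholders for you to fill in:

```bash
# Decode the kubeconfig stored in the secret.
kubectl get secret -n <namespace> <cluster-name>-kubeconfig -o jsonpath='{.data.value}' | base64 -d > cluster.kubeconfig

# Swap the old endpoint for the new LB DNS name (or edit the file by hand).
sed -i 's|https://<old-endpoint>:6443|https://<your-new-lb.com>:6443|' cluster.kubeconfig

# Re-encode the file and write it back into the secret.
kubectl patch secret -n <namespace> <cluster-name>-kubeconfig \
  -p "{\"data\":{\"value\":\"$(base64 -w0 < cluster.kubeconfig)\"}}"
```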


### Roll all of the nodes to make sure everything is fresh

1. Trigger a rollout of the control plane machines:

```bash
kubectl patch kcp <clusternamekcp> -n namespace --type merge -p "{\"spec\":{\"rolloutAfter\":\"`date +'%Y-%m-%dT%TZ'`\"}}"
```

> **Review comment (Contributor):** The backtick before `date` closes the backtick that opens before `kubectl`, so the inline code renders incorrectly. Suggestion: write the two patch commands without the wrapping backticks. See if that looks OK, otherwise we need to relook at it.
>
> **Reply (Contributor Author):** Just updated this one, it should be formatted properly now. @shivi28

2. Trigger a rollout of the worker machines:

```bash
kubectl patch machinedeployment CLUSTER_NAME-md-0 -n namespace --type merge -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}}}"
```
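
Once both patches are applied, you can watch the machines roll until every node has been replaced (a sketch):

```bash
# Watch machines being deleted and recreated as the control plane and workers roll.
kubectl get machines -A -w
```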