
Update patching playbook to utilize kubernetes.core collection #859

Merged · merged 16 commits into onedr0p:main on Jul 19, 2023

Conversation

PrymalInstynct (Contributor)

Recommend replacing shell and command modules with kubernetes.core.k8s_drain

  • The kubernetes.core.k8s_drain module is well supported and performs the checks that the command and shell tasks already written in this playbook carry out.

  • In the upcoming kubernetes.core v2.5.0 collection release, pod_selectors and label_selectors will be supported, which will make the drain process faster and more accurate; see the commented example on lines 43-45 and the upstream change "add ability to filter the list of pods to be drained by a pod label selector".

  • Added a check to determine whether the node requires a reboot, based on installed packages, before executing the reboot task.
    NOTE: this is only applicable to Debian-based nodes. A sketch of both changes follows this list.
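
For reference, a minimal sketch of what both changes could look like once kubernetes.core v2.5.0 is available; the task names, kubeconfig path, selector value, and timeouts are illustrative assumptions rather than the exact tasks committed in this PR:

# Sketch only: drain via kubernetes.core.k8s_drain, using the pod_selectors
# option that arrives in kubernetes.core v2.5.0.
- name: Drain node {{ inventory_hostname }}
  kubernetes.core.k8s_drain:
    name: "{{ inventory_hostname }}"
    kubeconfig: /etc/rancher/k3s/k3s.yaml      # assumed kubeconfig path
    state: drain
    delete_options:
      ignore_daemonsets: true
      delete_emptydir_data: true
      force: true
      terminate_grace_period: 300
      wait_timeout: 300
    pod_selectors:
      - app!=rook-ceph-osd                     # requires kubernetes.core >= 2.5.0

# Sketch only: reboot check for Debian-based nodes, keyed off the
# /var/run/reboot-required file that apt writes when an installed package
# (e.g. a new kernel) needs a reboot.
- name: Check for reboot {{ inventory_hostname }}
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_required

- name: Reboot {{ inventory_hostname }}
  ansible.builtin.reboot:
    reboot_timeout: 600
  when: reboot_required.stat.exists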

onedr0p (Owner) commented Jul 18, 2023

This looks awesome @PrymalInstynct, were you able to test this and did everything seem fine?

PrymalInstynct (Contributor, Author)

I did not test the conversion from template to playbook, but about a month ago I wrote my own role that does the same thing, along with sending notifications to Discord based on the status of the various tasks, so I can patch weekly on a cron and know when the cluster rebooted. The logic in my role is sound and I copy-pasted most of it, but I am happy to test this PR later today/tomorrow.

onedr0p (Owner) commented Jul 18, 2023

I just ran this; the first node cordoned and drained fine. The error on the second node is probably not a fault of this playbook but more rook being a pain in the ass:

PLAY [Cluster rollout update] **************************************************************************************

TASK [Gathering Facts] *********************************************************************************************
ok: [k8s-0]

TASK [Get the node's details] **************************************************************************************
ok: [k8s-0]

TASK [Cordon node k8s-0] *******************************************************************************************
changed: [k8s-0]

TASK [Drain node k8s-0] ********************************************************************************************
[WARNING]: cannot delete mirror Pods using API server: kube-system/kube-vip-k8s-0.
[WARNING]: Deleting Pods with local storage: default/immich-machine-learning-76fbdcf747-pcrjt,kube-system/cilium-
ghljq,monitoring/alertmanager-kube-prometheus-
stack-2,monitoring/gatus-f5bb8fd77-rr9c2,monitoring/grafana-6fd54b8448-kn2tn,monitoring/loki-
backend-2,monitoring/vector-agent-4zzd7,rook-ceph/csi-cephfsplugin-mvdx4,rook-ceph/csi-cephfsplugin-
provisioner-645b765b47-p6j92,rook-ceph/csi-rbdplugin-provisioner-745d56cbd5-qk4lb,rook-ceph/csi-rbdplugin-
xxdhv,rook-ceph/rook-ceph-mds-ceph-filesystem-b-85df87c797-865f6,volsync/volsync-666b75945c-x9j6l.
[WARNING]: Ignoring DaemonSet-managed Pods: kube-system/intel-gpu-exporter-wxc6f,kube-system/intel-gpu-plugin-
kdjbr,kube-system/node-feature-discovery-worker-tb7mv,monitoring/kube-prometheus-stack-prometheus-node-
exporter-5vh8w,monitoring/smartctl-exporter-0-crvln.
[WARNING]: timeout reached while pods were still running.
changed: [k8s-0]

TASK [Update k8s-0] ************************************************************************************************
ok: [k8s-0]

TASK [Check for reboot k8s-0] **************************************************************************************
ok: [k8s-0]

TASK [Reboot k8s-0] ************************************************************************************************
skipping: [k8s-0]

TASK [Uncordon k8s-0] **********************************************************************************************
changed: [k8s-0]

PLAY [Cluster rollout update] **************************************************************************************

TASK [Gathering Facts] *********************************************************************************************
ok: [k8s-1]

TASK [Pausing for 5 seconds...] ************************************************************************************
Pausing for 5 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [k8s-1]

TASK [Get the node's details] **************************************************************************************
ok: [k8s-1 -> k8s-0(192.168.42.10)]

TASK [Cordon node k8s-1] *******************************************************************************************
changed: [k8s-1 -> k8s-0(192.168.42.10)]

TASK [Drain node k8s-1] ********************************************************************************************
fatal: [k8s-1 -> k8s-0(192.168.42.10)]: FAILED! => changed=false 
  msg: 'Failed to delete pod rook-ceph/rook-ceph-osd-5-bf6f7f699-5b9xx due to: Too Many Requests'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
k8s-0                      : ok=8    changed=3    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
k8s-1                      : ok=4    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

PrymalInstynct (Contributor, Author)

You will notice I am still running the shell module in my role for the drain task because it lets me pass a pod selector to exclude the app rook-ceph-osd, which causes the same problem you hit above (similar issues exist with other object storage such as Longhorn). As mentioned in the initial post, once kubernetes.core v2.5.0 is released this can be addressed in the k8s_drain module per my commented-out example.

So we could keep using the command module for that task until v2.5.0 is published. I will let you make that call.

PrymalInstynct (Contributor, Author) left a comment


Makes sense. When I pulled the kubeconfig parameter out, I assumed that a control-plane node would automatically know where the kubeconfig file lived.
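
To make the kubeconfig point concrete, here is a minimal illustration: when a task is delegated to a control-plane node, the kubernetes.core modules still need to be told where that node's kubeconfig lives. The path and inventory group name below are assumptions (a typical k3s layout), not necessarily what this template uses:

- name: Get the node's details
  kubernetes.core.k8s_info:
    kubeconfig: /etc/rancher/k3s/k3s.yaml          # assumed path on the control-plane node
    kind: Node
    name: "{{ inventory_hostname }}"
  register: node_details
  delegate_to: "{{ groups['master'] | first }}"    # assumed inventory group name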

…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
onedr0p marked this pull request as draft July 18, 2023 19:46
PrymalInstynct (Contributor, Author)

Original Playbook

I ran the original playbook and it has been sitting at the drain task on the first node for 20 minutes.

I am using the following command to monitor what is going on with the pods
kubectl get pods -A | grep -v Running

And here is my output at the moment

NAMESPACE        NAME                                                           READY   STATUS      RESTARTS	 AGE
kyverno          kyverno-cleanup-admission-reports-28162750-7xbbq               0/1     Completed   0             7m14s
kyverno          kyverno-cleanup-cluster-admission-reports-28162750-q9ntj	0/1     Completed   0             7m14s
rook-ceph        rook-ceph-crashcollector-k8s-control-prod-0-6cc4b59d4c-c9kbf   0/1     Pending     0             20m
rook-ceph        rook-ceph-crashcollector-k8s-control-prod-0-79d4dc5fdf-fj5vd   0/1     Pending     0             20m
rook-ceph        rook-ceph-mon-a-554c94d5bd-5xn4p                               0/2     Pending     0             20m
rook-ceph        rook-ceph-osd-prepare-k8s-control-prod-1-zwwlc                 0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-control-prod-2-74mjl                 0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-worker-prod-3-jz78t                  0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-worker-prod-4-5tswj                  0/1     Completed   0             28m

So I updated the original playbook to include the --pod-selector='app!=rook-ceph-osd' option and re-ran it.
This addressed the very long hang I was experiencing from rook-ceph, but I then hit a hang while Grafana was being migrated, so I also added --grace-period=120 (see the sketch below).
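
For context, this is roughly what that interim drain task looks like with the command module; --pod-selector and --grace-period are the flags named above, while the remaining flags, task name, and delegation are assumptions for illustration:

- name: Drain node {{ inventory_hostname }}
  ansible.builtin.command: >-
    kubectl drain {{ inventory_hostname }}
    --ignore-daemonsets
    --delete-emptydir-data
    --force
    --pod-selector='app!=rook-ceph-osd'
    --grace-period=120
  delegate_to: "{{ groups['master'] | first }}"    # assumed inventory group name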

The original playbook ran in 9 minutes 3 seconds

Updated Playbook

Knowing that the pod_selectors parameter for the k8s_drain Ansible module is not released yet, I used the same drain task as the original playbook, just to make it an apples-to-apples comparison.

The updated playbook ran in 9 minutes 20 seconds

Conclusion

I believe the updated playbook is worth merging, as it provides a very similar level of performance while using Ansible modules that meet the project's standards for status/error checking and idempotence.

I have committed an updated playbook template to this PR that uses the command module for the drain task with the above modifications (with the drain grace period set to 300) and includes a complete k8s_drain task, commented out, ready for transition once the kubernetes.core v2.5.0 collection is released.

PrymalInstynct marked this pull request as ready for review July 19, 2023 12:39
onedr0p (Owner) commented Jul 19, 2023

Sounds good. Sorry, I just committed a change, so you will have to resolve the conflicts again (my bad). I forgot that I committed a fix to the playbook in my last PR.

PrymalInstynct (Contributor, Author)

No worries, I think it's ready to go now.

…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
onedr0p merged commit 74c70ad into onedr0p:main Jul 19, 2023
1 check passed
PrymalInstynct deleted the ansible_kubernetes_core branch July 19, 2023 15:55