
Update patching playbook to utilize kubernetes.core collection #859

Merged · merged 16 commits into onedr0p:main on Jul 19, 2023

Conversation

PrymalInstynct (Contributor)

Recommend replacing shell and command modules with kubernetes.core.k8s_drain

  • The kubernetes.core.k8s_drain module is well supported and performs the checks that the command and shell tasks already written in this playbook carry out.

  • In the upcoming kubernetes.core v2.5.0 collection release, pod_selectors and label_selectors will be supported, which will make the drain process faster and more accurate; see the commented example on lines 43-45 and the upstream change "add ability to filter the list of pods to be drained by a pod label selector".

  • Added a check to determine whether the node requires a reboot, based on installed packages, before executing the reboot task.
    NOTE: this is only applicable to Debian-based nodes. A sketch of both changes follows this list.
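
For reference, a minimal sketch of what both changes could look like once kubernetes.core v2.5.0 is available; the task names, kubeconfig path, selector value, and timeouts are illustrative assumptions rather than the exact tasks committed in this PR:

# Sketch only: drain via kubernetes.core.k8s_drain, using the pod_selectors
# option that arrives in kubernetes.core v2.5.0.
- name: Drain node {{ inventory_hostname }}
  kubernetes.core.k8s_drain:
    name: "{{ inventory_hostname }}"
    kubeconfig: /etc/rancher/k3s/k3s.yaml      # assumed kubeconfig path
    state: drain
    delete_options:
      ignore_daemonsets: true
      delete_emptydir_data: true
      force: true
      terminate_grace_period: 300
      wait_timeout: 300
    pod_selectors:
      - app!=rook-ceph-osd                     # requires kubernetes.core >= 2.5.0

# Sketch only: reboot check for Debian-based nodes, keyed off the
# /var/run/reboot-required file that apt writes when an installed package
# (e.g. a new kernel) needs a reboot.
- name: Check for reboot {{ inventory_hostname }}
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_required

- name: Reboot {{ inventory_hostname }}
  ansible.builtin.reboot:
    reboot_timeout: 600
  when: reboot_required.stat.exists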

onedr0p (Owner) commented Jul 18, 2023

This looks awesome @PrymalInstynct, were you able to test this and did everything seem fine?

PrymalInstynct (Contributor, Author)

I did not test the conversion from template to playbook, but about a month ago I wrote my own role that does the same thing, along with sending notifications to Discord based on the status of the various tasks, so I can patch weekly on a cron and know when the cluster rebooted. The logic in my role is sound and I copy-pasted most of it, but I am happy to test this PR later today/tomorrow.

onedr0p (Owner) commented Jul 18, 2023

I just ran this; the first node cordoned and drained fine. The error on the second node is probably not a fault of this playbook but more rook being a pain in the ass:

PLAY [Cluster rollout update] **************************************************************************************

TASK [Gathering Facts] *********************************************************************************************
ok: [k8s-0]

TASK [Get the node's details] **************************************************************************************
ok: [k8s-0]

TASK [Cordon node k8s-0] *******************************************************************************************
changed: [k8s-0]

TASK [Drain node k8s-0] ********************************************************************************************
[WARNING]: cannot delete mirror Pods using API server: kube-system/kube-vip-k8s-0.
[WARNING]: Deleting Pods with local storage: default/immich-machine-learning-76fbdcf747-pcrjt,kube-system/cilium-
ghljq,monitoring/alertmanager-kube-prometheus-
stack-2,monitoring/gatus-f5bb8fd77-rr9c2,monitoring/grafana-6fd54b8448-kn2tn,monitoring/loki-
backend-2,monitoring/vector-agent-4zzd7,rook-ceph/csi-cephfsplugin-mvdx4,rook-ceph/csi-cephfsplugin-
provisioner-645b765b47-p6j92,rook-ceph/csi-rbdplugin-provisioner-745d56cbd5-qk4lb,rook-ceph/csi-rbdplugin-
xxdhv,rook-ceph/rook-ceph-mds-ceph-filesystem-b-85df87c797-865f6,volsync/volsync-666b75945c-x9j6l.
[WARNING]: Ignoring DaemonSet-managed Pods: kube-system/intel-gpu-exporter-wxc6f,kube-system/intel-gpu-plugin-
kdjbr,kube-system/node-feature-discovery-worker-tb7mv,monitoring/kube-prometheus-stack-prometheus-node-
exporter-5vh8w,monitoring/smartctl-exporter-0-crvln.
[WARNING]: timeout reached while pods were still running.
changed: [k8s-0]

TASK [Update k8s-0] ************************************************************************************************
ok: [k8s-0]

TASK [Check for reboot k8s-0] **************************************************************************************
ok: [k8s-0]

TASK [Reboot k8s-0] ************************************************************************************************
skipping: [k8s-0]

TASK [Uncordon k8s-0] **********************************************************************************************
changed: [k8s-0]

PLAY [Cluster rollout update] **************************************************************************************

TASK [Gathering Facts] *********************************************************************************************
ok: [k8s-1]

TASK [Pausing for 5 seconds...] ************************************************************************************
Pausing for 5 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [k8s-1]

TASK [Get the node's details] **************************************************************************************
ok: [k8s-1 -> k8s-0(192.168.42.10)]

TASK [Cordon node k8s-1] *******************************************************************************************
changed: [k8s-1 -> k8s-0(192.168.42.10)]

TASK [Drain node k8s-1] ********************************************************************************************
fatal: [k8s-1 -> k8s-0(192.168.42.10)]: FAILED! => changed=false 
  msg: 'Failed to delete pod rook-ceph/rook-ceph-osd-5-bf6f7f699-5b9xx due to: Too Many Requests'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
k8s-0                      : ok=8    changed=3    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
k8s-1                      : ok=4    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

PrymalInstynct (Contributor, Author)

You will notice I am still running the shell module in my role for the drain task because it lets me pass a pod selector to exclude the app rook-ceph-osd, which causes the same problem you hit above (similar issues exist with other object storage such as Longhorn). As mentioned in the initial post, once kubernetes.core v2.5.0 is released this can be addressed in the k8s_drain module per my commented-out example.

So we could keep using the command module for that task until v2.5.0 is published. I will let you make that call.

PrymalInstynct (Contributor, Author) left a comment


Makes sense. When I pulled the kubeconfig parameter out, I assumed that a control-plane node would automatically know where the kubeconfig file lived.
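
To make the kubeconfig point concrete, here is a minimal illustration: when a task is delegated to a control-plane node, the kubernetes.core modules still need to be told where that node's kubeconfig lives. The path and inventory group name below are assumptions (a typical k3s layout), not necessarily what this template uses:

- name: Get the node's details
  kubernetes.core.k8s_info:
    kubeconfig: /etc/rancher/k3s/k3s.yaml          # assumed path on the control-plane node
    kind: Node
    name: "{{ inventory_hostname }}"
  register: node_details
  delegate_to: "{{ groups['master'] | first }}"    # assumed inventory group name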

…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
onedr0p marked this pull request as draft July 18, 2023 19:46
PrymalInstynct (Contributor, Author)

Original Playbook

I ran the original playbook and it has been sitting at the drain task on the first node for 20 minutes.

I am using the following command to monitor what is going on with the pods
kubectl get pods -A | grep -v Running

And here is my output at the moment

NAMESPACE        NAME                                                           READY   STATUS      RESTARTS	 AGE
kyverno          kyverno-cleanup-admission-reports-28162750-7xbbq               0/1     Completed   0             7m14s
kyverno          kyverno-cleanup-cluster-admission-reports-28162750-q9ntj	0/1     Completed   0             7m14s
rook-ceph        rook-ceph-crashcollector-k8s-control-prod-0-6cc4b59d4c-c9kbf   0/1     Pending     0             20m
rook-ceph        rook-ceph-crashcollector-k8s-control-prod-0-79d4dc5fdf-fj5vd   0/1     Pending     0             20m
rook-ceph        rook-ceph-mon-a-554c94d5bd-5xn4p                               0/2     Pending     0             20m
rook-ceph        rook-ceph-osd-prepare-k8s-control-prod-1-zwwlc                 0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-control-prod-2-74mjl                 0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-worker-prod-3-jz78t                  0/1     Completed   0             28m
rook-ceph        rook-ceph-osd-prepare-k8s-worker-prod-4-5tswj                  0/1     Completed   0             28m

So I updated the original playbook to include the --pod-selector='app!=rook-ceph-osd' option and re-ran it.
This addressed the very long hang I was experiencing from rook-ceph, but I then hit a hang while Grafana was being migrated, so I also added --grace-period=120 (see the sketch below).
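
For context, this is roughly what that interim drain task looks like with the command module; --pod-selector and --grace-period are the flags named above, while the remaining flags, task name, and delegation are assumptions for illustration:

- name: Drain node {{ inventory_hostname }}
  ansible.builtin.command: >-
    kubectl drain {{ inventory_hostname }}
    --ignore-daemonsets
    --delete-emptydir-data
    --force
    --pod-selector='app!=rook-ceph-osd'
    --grace-period=120
  delegate_to: "{{ groups['master'] | first }}"    # assumed inventory group name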

The original playbook ran in 9 minutes 3 seconds

Updated Playbook

Knowing that the pod_selectors parameter for the k8s_drain Ansible module is not released yet, I used the same drain task as the original playbook, just to make it an apples-to-apples comparison.

The updated playbook ran in 9 minutes 20 seconds

Conclusion

I believe the updated playbook is worth merging, as it provides a very similar level of performance while using Ansible modules that meet the project's standards for status/error checking and idempotence.

I have committed an updated playbook template to this PR that uses the command module for the drain task with the above modifications (with the drain grace period set to 300) and includes a complete k8s_drain task, commented out, ready for transition once the kubernetes.core v2.5.0 collection is released.

PrymalInstynct marked this pull request as ready for review July 19, 2023 12:39
onedr0p (Owner) commented Jul 19, 2023

Sounds good. Sorry, I just committed a change, so you will have to resolve the conflicts again (my bad). I forgot that I committed a fix to the playbook in my last PR.

PrymalInstynct (Contributor, Author)

No worries, I think it's ready to go now.

…aml.j2

Co-authored-by: Devin Buhl <onedr0p@users.noreply.github.com>
onedr0p merged commit 74c70ad into onedr0p:main Jul 19, 2023
1 check passed
PrymalInstynct deleted the ansible_kubernetes_core branch July 19, 2023 15:55