
replace bash for loop when checking API server SANs #8050

Closed

Conversation

rptaylor
Contributor

@rptaylor rptaylor commented Oct 6, 2021

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Following best practices, use the appropriate design patterns built into Ansible instead of doing logic and scripting in bash.
This also avoids using the Ansible shell module (from the Ansible documentation: "If you want to execute a command securely and predictably, it may be better to use the ansible.builtin.command module instead. Best practices when writing playbooks will follow the trend of using ansible.builtin.command unless the ansible.builtin.shell module is explicitly required."):
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/shell_module.html
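
For illustration, the loop-based pattern looks roughly like this (a minimal sketch, not the exact diff; the task name, certificate path, openssl output wording, and the apiserver_sans_hosts variable are assumptions):

```yaml
# Sketch only: check each SAN entry with ansible.builtin.command in an
# Ansible loop instead of a bash for loop. The certificate path, the
# apiserver_sans_hosts variable, and the exact openssl output wording
# ("does match certificate") are assumptions, not the merged code.
- name: kubeadm | Check apiserver.crt SAN hosts
  command: >-
    openssl x509 -noout -checkhost {{ item }}
    -in /etc/kubernetes/ssl/apiserver.crt
  register: apiserver_sans_host_check
  changed_when: "'does match certificate' not in apiserver_sans_host_check.stdout"
  loop: "{{ apiserver_sans_hosts }}"
  become: yes
```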

Special notes for your reviewer:
Bash for loop was introduced by #7463
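
For comparison, the shell-based pattern being replaced looked roughly like this (a simplified sketch inferred from the description above, not the verbatim task from #7463; names and paths are assumptions):

```yaml
# Simplified sketch of the prior approach: a single shell task running a
# bash for loop over all SAN entries, so per-item results are not visible.
- name: kubeadm | Check apiserver.crt SANs
  shell: |
    for host in {{ apiserver_sans_hosts | join(' ') }}; do
      openssl x509 -noout -checkhost "$host" -in /etc/kubernetes/ssl/apiserver.crt
    done
  register: apiserver_sans_check
  changed_when: "'does NOT match' in apiserver_sans_check.stdout"
```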

Does this PR introduce a user-facing change?:
No

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 6, 2021
@k8s-ci-robot
Contributor

Hi @rptaylor. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 6, 2021
@rptaylor
Contributor Author

rptaylor commented Oct 6, 2021

While I am testing it
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2021
@champtar
Contributor

champtar commented Oct 6, 2021

NACK, I introduced the bash for loop because it's way faster

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 7, 2021
@rptaylor rptaylor changed the title remove bash for loop when checking API server SANs replace bash for loop when checking API server SANs Oct 7, 2021
@rptaylor rptaylor force-pushed the 20211005-fix-bash-loops branch 2 times, most recently from 79dcbff to 87baf07 on October 7, 2021 23:30
@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

@champtar Thanks for the info. Maybe there was a performance regression in Ansible (I tested with 2.9.24 and it was okay) or some other problem that caused it to be slow? In my testing I found no evidence of a performance problem with this PR.
I ran the upgrade-cluster.yml play multiple times on my small dev cluster of 6 nodes; each time it took about 30 minutes to complete (some of the biggest contributors being downloading and draining). For the checking task, the bash loop takes ~1-2 s and the Ansible way takes ~3-4 s (even with become: yes to invoke sudo); either way it is negligible compared to the total play time. Here is a summary of execution times, showing only 4 seconds to check the SAN hosts:

Thursday 07 October 2021  23:28:30 +0000 (0:00:00.138)       0:32:10.385 ****** 
=============================================================================== 
upgrade/pre-upgrade : Drain node --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 314.52s
kubernetes/control-plane : kubeadm | Upgrade first master --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 19.90s
kubernetes/control-plane : kubeadm | Upgrade other masters -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.57s
upgrade/pre-upgrade : Drain node ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 12.53s
bootstrap-os : Install libselinux python package ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 10.77s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 10.14s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------------------------------------------------------------------------------------------------------------------------- 8.31s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.04s
kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.86s
network_plugin/calico : Start Calico resources --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.33s
kubernetes/preinstall : Get current calico cluster version --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.30s
kubernetes/control-plane : kubeadm | Check apiserver.crt SAN hosts ------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.27s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.93s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.92s
kubernetes/preinstall : Get current calico cluster version --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.92s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.90s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.61s
kubernetes/control-plane : kubeadm | clean kubectl cache to refresh api types -------------------------------------------------------------------------------------------------------------------------------------------------- 3.53s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.40s

@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

Aside from leveraging Ansible design patterns, this change makes the task easier to read, maintain, and debug by using Ansible features instead of bash features, and the output gives users better visibility: if something fails, they will be able to see and understand what is happening (i.e. reducing support requests).
For example, the user will be able to see each SAN host and IP during the loop iteration and identify which ones are good and which are missing, changed, or misconfigured.
The example output below shows that I injected a host "somethingrandom" for testing purposes; you can see that particular item is identified as changed, and it triggered the cert regeneration. Otherwise, I confirmed that it works as expected: all the items are okay and the certs are not regenerated.



TASK [kubernetes/control-plane : kubeadm | Check apiserver.crt SAN IPs] ***************************************************************************************************************************************************************
ok: [cluster-dev-k8s-master-1] => (item=10.233.0.1)
ok: [cluster-dev-k8s-master-1] => (item=127.0.0.1)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.165)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.166)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.167)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.80)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.77)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.85)
Thursday 07 October 2021  23:03:19 +0000 (0:00:02.658)       0:06:58.705 ****** 

TASK [kubernetes/control-plane : kubeadm | Check apiserver.crt SAN hosts] *************************************************************************************************************************************************************
ok: [cluster-dev-k8s-master-1] => (item=kubernetes)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default.svc)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default.svc.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=localhost)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-1)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-2)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-3)
ok: [cluster-dev-k8s-master-1] => (item=lb-apiserver.kubernetes.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev.cloud.example.ca)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-1.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-2.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-3.cluster-dev.local)
changed: [cluster-dev-k8s-master-1] => (item=somethingrandom)
Thursday 07 October 2021  23:03:23 +0000 (0:00:04.034)       0:07:02.740 ****** 

TASK [kubernetes/control-plane : kubeadm | regenerate apiserver cert 1/2] *************************************************************************************************************************************************************
changed: [cluster-dev-k8s-master-1] => (item=apiserver.crt)
changed: [cluster-dev-k8s-master-1] => (item=apiserver.key)
Thursday 07 October 2021  23:03:23 +0000 (0:00:00.582)       0:07:03.322 ****** 

TASK [kubernetes/control-plane : kubeadm | regenerate apiserver cert 2/2] *************************************************************************************************************************************************************
changed: [cluster-dev-k8s-master-1]
Thursday 07 October 2021  23:03:26 +0000 (0:00:02.261)       0:07:05.583 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.040)       0:07:05.624 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.047)       0:07:05.671 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.050)       0:07:05.721 ****** 
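
The wiring that makes a changed item trigger the regeneration tasks could look roughly like the sketch below (hedged: the kubeadm invocation, file paths, and the apiserver_sans_host_check register from the earlier sketch are assumptions, not the exact PR contents):

```yaml
# Sketch: regenerate the apiserver cert only if any SAN check item changed.
# "is changed" on a registered loop result is true if any item changed.
- name: kubeadm | regenerate apiserver cert 1/2
  file:
    path: "/etc/kubernetes/ssl/{{ item }}"
    state: absent
  loop:
    - apiserver.crt
    - apiserver.key
  when: apiserver_sans_host_check is changed

- name: kubeadm | regenerate apiserver cert 2/2
  command: kubeadm init phase certs apiserver --config /etc/kubernetes/kubeadm-config.yaml
  when: apiserver_sans_host_check is changed
```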

@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

/unhold
/assign @Miouge1

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2021
@rptaylor
Contributor Author

rptaylor commented Oct 8, 2021

Besides, as can be inferred from the timers (which only show the top contributors), aside from a few slow tasks that stand out, the bulk of the time is made up of a very long tail of tasks: pretty much all tasks take ~2-4 s and there are quite a lot of them, so it shouldn't make much difference whether one single task (which only runs on master nodes) takes 2 s or 4 s.

@champtar
Contributor

champtar commented Oct 8, 2021

When you have high latency, going from 1 task to 21 items definitely makes a difference. I'm in favor of speed, as that's a weakness of kubespray, but I also understand your argument. @floryut @oomichi I'll let you decide.

@rptaylor
Contributor Author

rptaylor commented Oct 8, 2021

@champtar thanks for your consideration. I agree speed is a weakness of Kubespray, and there are surely areas that can be streamlined; anything in the downloads role in particular would have a big impact. I also have another cluster with ~200 nodes that can be very slow to operate on; I wonder if Mitogen or SSH pipelining would help. I think the execution for all items in a loop is transmitted in one go that way.
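
For reference, pipelining is a standard ansible.cfg setting; a minimal generic example (not kubespray's shipped configuration):

```ini
[ssh_connection]
# Pipelining sends module code over the existing SSH connection via stdin
# instead of copying temporary files, cutting round trips per task and item.
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```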

@champtar
Contributor

champtar commented Oct 8, 2021

Pipelining is already enabled most of the time, and Mitogen sadly seems not to be maintained anymore.

@rptaylor
Contributor Author

@holmsten @oomichi can you please take a look?

@floryut
Member

floryut commented Oct 19, 2021

Not sure what to think about this. On one hand, I agree that it's best to follow guidelines and best practices, no argument there; but on the other hand, if this allows some performance improvement, I don't see the harm in keeping it that way.

When there is a choice between deprecated things, security concerns, etc. and performance, the choice is more of a no-brainer; but here, if I get this right, these are only "cosmetic" changes, so if we are doing a vote, I'm more for keeping it the way it is.

/cc @oomichi @EppO @holmsten if you want to share your thoughts.

@rptaylor
Contributor Author

rptaylor commented Oct 19, 2021

@floryut Does a difference of ~0.1% (about 2 s out of 30 m for a typical small cluster in a normal-latency deployment scenario) qualify as a performance issue? It is imperceptible.
What is it like on a high-latency link? Are we talking about ~30 seconds out of 2 hours, or 5 minutes out of 8 hours...?

@rptaylor
Contributor Author

rptaylor commented Nov 4, 2021

Running Kubespray 2.16 on a somewhat large cluster (~100 nodes) took 2.5 hours for the upgrade cluster play to complete.

The download role was among the biggest offenders: even with download caching, and when all files had already been downloaded, it took about 1 hour to do effectively nothing (just checking that the downloads were there). Checking the SANs still took only ~2-4 seconds.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2022
@rptaylor
Contributor Author

rptaylor commented Feb 3, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 12, 2022
@k8s-ci-robot
Contributor

@rptaylor: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rptaylor

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 5, 2022