
replace bash for loop when checking API server SANs #8050

Closed

Conversation

rptaylor
Contributor

@rptaylor rptaylor commented Oct 6, 2021

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Following best practices, use the appropriate design patterns built into Ansible instead of doing logic and scripting in bash.
This also avoids using the Ansible shell module (from the Ansible documentation: "If you want to execute a command securely and predictably, it may be better to use the ansible.builtin.command module instead. Best practices when writing playbooks will follow the trend of using ansible.builtin.command unless the ansible.builtin.shell module is explicitly required."):
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/shell_module.html
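
For illustration, the loop-based pattern looks roughly like this (a minimal sketch, not the exact diff; the task name, certificate path, openssl output wording, and the apiserver_sans_hosts variable are assumptions):

```yaml
# Sketch only: check each SAN entry with ansible.builtin.command in an
# Ansible loop instead of a bash for loop. The certificate path, the
# apiserver_sans_hosts variable, and the exact openssl output wording
# ("does match certificate") are assumptions, not the merged code.
- name: kubeadm | Check apiserver.crt SAN hosts
  command: >-
    openssl x509 -noout -checkhost {{ item }}
    -in /etc/kubernetes/ssl/apiserver.crt
  register: apiserver_sans_host_check
  changed_when: "'does match certificate' not in apiserver_sans_host_check.stdout"
  loop: "{{ apiserver_sans_hosts }}"
  become: yes
```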

Special notes for your reviewer:
Bash for loop was introduced by #7463
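
For comparison, the shell-based pattern being replaced looked roughly like this (a simplified sketch inferred from the description above, not the verbatim task from #7463; names and paths are assumptions):

```yaml
# Simplified sketch of the prior approach: a single shell task running a
# bash for loop over all SAN entries, so per-item results are not visible.
- name: kubeadm | Check apiserver.crt SANs
  shell: |
    for host in {{ apiserver_sans_hosts | join(' ') }}; do
      openssl x509 -noout -checkhost "$host" -in /etc/kubernetes/ssl/apiserver.crt
    done
  register: apiserver_sans_check
  changed_when: "'does NOT match' in apiserver_sans_check.stdout"
```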

Does this PR introduce a user-facing change?:
No

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 6, 2021
@k8s-ci-robot
Contributor

Hi @rptaylor. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 6, 2021
@rptaylor
Contributor Author

rptaylor commented Oct 6, 2021

While I am testing it
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2021
@champtar
Contributor

champtar commented Oct 6, 2021

NACK, I introduced the bash for loop because it's way faster

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 7, 2021
@rptaylor rptaylor changed the title remove bash for loop when checking API server SANs replace bash for loop when checking API server SANs Oct 7, 2021
@rptaylor rptaylor force-pushed the 20211005-fix-bash-loops branch 2 times, most recently from 79dcbff to 87baf07 on October 7, 2021 23:30
@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

@champtar Thanks for the info. Maybe there was a performance regression in Ansible (I tested with 2.9.24 and it was okay) or some other problem that caused it to be slow? In my testing I found no evidence of a performance problem with this PR.
I ran the upgrade-cluster.yml play multiple times on my small dev cluster of 6 nodes; each time it took about 30 minutes to complete (some of the biggest contributors being downloading and draining). For the checking task, the bash loop takes ~1-2 s and the Ansible way takes ~3-4 s (even with become: yes to invoke sudo); either way it is negligible compared to the total play time. Here is a summary of execution times, showing only 4 seconds to check the SAN hosts:

Thursday 07 October 2021  23:28:30 +0000 (0:00:00.138)       0:32:10.385 ****** 
=============================================================================== 
upgrade/pre-upgrade : Drain node --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 314.52s
kubernetes/control-plane : kubeadm | Upgrade first master --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 19.90s
kubernetes/control-plane : kubeadm | Upgrade other masters -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.57s
upgrade/pre-upgrade : Drain node ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 12.53s
bootstrap-os : Install libselinux python package ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 10.77s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 10.14s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------------------------------------------------------------------------------------------------------------------------- 8.31s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.04s
kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.86s
network_plugin/calico : Start Calico resources --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.33s
kubernetes/preinstall : Get current calico cluster version --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.30s
kubernetes/control-plane : kubeadm | Check apiserver.crt SAN hosts ------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.27s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.93s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.92s
kubernetes/preinstall : Get current calico cluster version --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.92s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.90s
network_plugin/calico : Calico | Create calico manifests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.61s
kubernetes/control-plane : kubeadm | clean kubectl cache to refresh api types -------------------------------------------------------------------------------------------------------------------------------------------------- 3.53s
download | Download files / images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.40s

@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

Aside from leveraging Ansible design patterns, this change makes the task easier to read, maintain, and debug by using Ansible features instead of bash features, and the output gives users better visibility: if something fails, they will be able to see and understand what is happening (i.e. reducing support requests).
For example, the user will be able to see each SAN host and IP during the loop iteration and identify which ones are good and which are missing, changed, or misconfigured.
The example output below shows that I injected a host "somethingrandom" for testing purposes; you can see that particular item is identified as changed, and it triggered the cert regeneration. Otherwise, I confirmed that it works as expected: all the items are okay and the certs are not regenerated.



TASK [kubernetes/control-plane : kubeadm | Check apiserver.crt SAN IPs] ***************************************************************************************************************************************************************
ok: [cluster-dev-k8s-master-1] => (item=10.233.0.1)
ok: [cluster-dev-k8s-master-1] => (item=127.0.0.1)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.165)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.166)
ok: [cluster-dev-k8s-master-1] => (item=172.5.0.167)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.80)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.77)
ok: [cluster-dev-k8s-master-1] => (item=10.5.7.85)
Thursday 07 October 2021  23:03:19 +0000 (0:00:02.658)       0:06:58.705 ****** 

TASK [kubernetes/control-plane : kubeadm | Check apiserver.crt SAN hosts] *************************************************************************************************************************************************************
ok: [cluster-dev-k8s-master-1] => (item=kubernetes)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default.svc)
ok: [cluster-dev-k8s-master-1] => (item=kubernetes.default.svc.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=localhost)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-1)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-2)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-3)
ok: [cluster-dev-k8s-master-1] => (item=lb-apiserver.kubernetes.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev.cloud.example.ca)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-1.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-2.cluster-dev.local)
ok: [cluster-dev-k8s-master-1] => (item=cluster-dev-k8s-master-3.cluster-dev.local)
changed: [cluster-dev-k8s-master-1] => (item=somethingrandom)
Thursday 07 October 2021  23:03:23 +0000 (0:00:04.034)       0:07:02.740 ****** 

TASK [kubernetes/control-plane : kubeadm | regenerate apiserver cert 1/2] *************************************************************************************************************************************************************
changed: [cluster-dev-k8s-master-1] => (item=apiserver.crt)
changed: [cluster-dev-k8s-master-1] => (item=apiserver.key)
Thursday 07 October 2021  23:03:23 +0000 (0:00:00.582)       0:07:03.322 ****** 

TASK [kubernetes/control-plane : kubeadm | regenerate apiserver cert 2/2] *************************************************************************************************************************************************************
changed: [cluster-dev-k8s-master-1]
Thursday 07 October 2021  23:03:26 +0000 (0:00:02.261)       0:07:05.583 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.040)       0:07:05.624 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.047)       0:07:05.671 ****** 
Thursday 07 October 2021  23:03:26 +0000 (0:00:00.050)       0:07:05.721 ****** 
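
The wiring that makes a changed item trigger the regeneration tasks could look roughly like the sketch below (hedged: the kubeadm invocation, file paths, and the apiserver_sans_host_check register from the earlier sketch are assumptions, not the exact PR contents):

```yaml
# Sketch: regenerate the apiserver cert only if any SAN check item changed.
# "is changed" on a registered loop result is true if any item changed.
- name: kubeadm | regenerate apiserver cert 1/2
  file:
    path: "/etc/kubernetes/ssl/{{ item }}"
    state: absent
  loop:
    - apiserver.crt
    - apiserver.key
  when: apiserver_sans_host_check is changed

- name: kubeadm | regenerate apiserver cert 2/2
  command: kubeadm init phase certs apiserver --config /etc/kubernetes/kubeadm-config.yaml
  when: apiserver_sans_host_check is changed
```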

@rptaylor
Contributor Author

rptaylor commented Oct 7, 2021

/unhold
/assign @Miouge1

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2021
@rptaylor
Contributor Author

rptaylor commented Oct 8, 2021

Besides, as can be inferred from the timers (which only show the top contributors), aside from a few slow tasks that stand out, the bulk of the time is made up of a very long tail of tasks: pretty much all tasks take ~2-4 s and there are quite a lot of them, so it shouldn't make much difference whether one single task (which only runs on master nodes) takes 2 s or 4 s.

@champtar
Contributor

champtar commented Oct 8, 2021

When you have high latency, going from 1 task to 21 items definitely makes a difference. I'm in favor of speed, as that's a weakness of kubespray, but I also understand your argument. @floryut @oomichi I'll let you decide.

@rptaylor
Contributor Author

rptaylor commented Oct 8, 2021

@champtar thanks for your consideration. I agree speed is a weakness of Kubespray, and there are surely areas that can be streamlined; anything in the downloads role in particular would have a big impact. I also have another cluster with ~200 nodes that can be very slow to operate on; I wonder if Mitogen or SSH pipelining would help. I think the execution for all items in a loop is transmitted in one go that way.
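
For reference, pipelining is a standard ansible.cfg setting; a minimal generic example (not kubespray's shipped configuration):

```ini
[ssh_connection]
# Pipelining sends module code over the existing SSH connection via stdin
# instead of copying temporary files, cutting round trips per task and item.
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```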

@champtar
Contributor

champtar commented Oct 8, 2021

Pipelining is already enabled most of the time, and Mitogen sadly seems not to be maintained anymore.

@rptaylor
Contributor Author

@holmsten @oomichi can you please take a look?

@floryut
Member

floryut commented Oct 19, 2021

Not sure what to think about this. On one hand, I agree that it's best to follow guidelines and best practices, no argument there; but on the other hand, if this allows some performance improvement, I don't see the harm in keeping it that way.

When there is a choice between deprecated things, security concerns, etc. and performance, the choice is more of a no-brainer; but here, if I get this right, these are only "cosmetic" changes, so if we are doing a vote, I'm more for keeping it the way it is.

/cc @oomichi @EppO @holmsten if you want to share your thoughts.

@rptaylor
Contributor Author

rptaylor commented Oct 19, 2021

@floryut Does a difference of ~0.1% (about 2 s out of 30 m for a typical small cluster in a normal-latency deployment scenario) qualify as a performance issue? It is imperceptible.
What is it like on a high-latency link? Are we talking about ~30 seconds out of 2 hours, or 5 minutes out of 8 hours...?

@rptaylor
Contributor Author

rptaylor commented Nov 4, 2021

Running Kubespray 2.16 on a somewhat large cluster (~100 nodes) took 2.5 hours for the upgrade cluster play to complete.

The download role was among the biggest offenders: even with download caching, and when all files had already been downloaded, it took about 1 hour to do effectively nothing (just checking that the downloads were there). Checking the SANs still took only ~2-4 seconds.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2022
@rptaylor
Contributor Author

rptaylor commented Feb 3, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 12, 2022
@k8s-ci-robot
Contributor

@rptaylor: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rptaylor

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 5, 2022