Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions. #1200

timflannagan · 2020-04-29T22:50:06Z

Ansible 2.9.6 changed how the template caching mechanism works. In previous versions, the results of all variables (i.e. Jinja2 expressions) were cached, instead of the intended behavior of caching only a single variable. This was problematic, especially when generating a password more than once: the first result was cached, and every subsequent call returned that first, cached password instead of a newly generated one.

In the case of Metering, our Ansible role relies heavily on lazy evaluation: Ansible evaluates variables at the last possible moment. After the 2.9.6 change, Metering saw a significant performance degradation, with role execution going from an average of 3-7 minutes to 45+ minutes.

This was because we were relying on the buggy implementation of the template caching mechanism. In our current implementation, we pass a variable to the name field of the k8s_info module. As a result, the module has to perform a filesystem lookup many, many times for a particular value stored in the meteringconfig_spec variable, which is a large dictionary of values and essentially serves as the single source of truth that Metering references.

A more concrete example:

TASK [meteringconfig : Check for the existence of the Presto TLS secret] *******
task path: /opt/ansible/roles/meteringconfig/tasks/configure_presto_tls.yml:15
Wednesday 29 April 2020  21:43:56 +0000 (0:00:00.380)       0:00:38.695 *******
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
...

That variable in turn relies on the value of meteringconfig_default_values, a dictionary containing the default Helm chart values. Previously, the result of that expression was cached. Now, due to how lazy evaluation works, meteringconfig_default_values is re-evaluated every time it's used, causing the aforementioned performance issues. The workaround is to cache or finalize, for lack of a better phrase, the result of the meteringconfig_default_values expression.

This helps performance because we no longer re-evaluate this variable every time we want to template something. All changes go through the meteringconfig_spec dictionary, so there's no risk in creating this variable early in the role.
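To make the cost of re-evaluation concrete, here is a minimal Python sketch of the idea. It is a toy model and not Ansible's templating engine: `LazyVars`, `finalize`, and `count_file_lookups` are hypothetical names invented for illustration, and the returned dictionary is placeholder data. In an Ansible role the finalizing step is commonly done with something like `set_fact`, though that is an assumption about how this change implements it.

```python
from typing import Any, Callable, Dict

Expr = Callable[["LazyVars"], Any]


class LazyVars:
    """Toy variable store: every lookup re-runs the variable's expression."""

    def __init__(self) -> None:
        self._exprs: Dict[str, Expr] = {}

    def define(self, name: str, expr: Expr) -> None:
        self._exprs[name] = expr

    def get(self, name: str) -> Any:
        # Lazy evaluation: the expression is evaluated on each reference.
        return self._exprs[name](self)

    def finalize(self, name: str) -> None:
        # Evaluate once and replace the expression with its result.
        value = self.get(name)
        self._exprs[name] = lambda _vars: value


def count_file_lookups(finalize_defaults: bool, references: int = 10) -> int:
    """Count simulated file lookups while referencing meteringconfig_spec."""
    lookups = 0

    def default_values(_vars: LazyVars) -> Dict[str, Any]:
        nonlocal lookups
        lookups += 1  # stands in for one "File lookup using .../values.yaml"
        return {"tls": {"secretName": "example-secret"}}  # placeholder data

    def spec(variables: LazyVars) -> Dict[str, Any]:
        # meteringconfig_spec depends on the default values.
        return dict(variables.get("meteringconfig_default_values"))

    variables = LazyVars()
    variables.define("meteringconfig_default_values", default_values)
    variables.define("meteringconfig_spec", spec)
    if finalize_defaults:
        variables.finalize("meteringconfig_default_values")
    for _ in range(references):
        variables.get("meteringconfig_spec")
    return lookups
```

In this model, `count_file_lookups(False)` performs one lookup per reference, while `count_file_lookups(True)` performs a single lookup no matter how many times `meteringconfig_spec` is referenced afterwards.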

Also included in the changeset is the switch to the k8s_info module, as the k8s_facts module was deprecated in 2.9 and k8s_info has the same usage.

openshift-ci-robot · 2020-04-30T14:05:22Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Require ansible version to be less than 2.9.6.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2020-04-30T15:47:41Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info.

openshift-ci-robot · 2020-04-30T15:48:28Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info.

openshift-ci-robot · 2020-04-30T18:35:17Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info.

timflannagan · 2020-04-30T22:05:33Z

Not too sure what's happening: I was able to get this working locally on a 4.5 cluster, but I'm having a hard time getting it working on a 4.4 one.
/hold

openshift-ci-robot · 2020-05-01T20:50:27Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.

timflannagan · 2020-05-01T20:51:51Z

/hold cancel

EmilyM1

/lgtm

openshift-ci-robot · 2020-05-01T20:59:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilyM1, timflannagan1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [timflannagan1]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2020-05-01T21:01:19Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.

timflannagan · 2020-05-01T21:14:02Z

/cherry-pick release-4.4

openshift-cherrypick-robot · 2020-05-01T21:14:03Z

@timflannagan1: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.4

openshift-ci-robot · 2020-05-01T22:06:30Z

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.

openshift-ci-robot · 2020-05-01T22:27:47Z

@timflannagan1: All pull requests linked via external trackers have merged: kube-reporting/metering-operator#1200. Bugzilla bug 1829836 has been moved to the MODIFIED state.

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.

timflannagan · 2020-05-01T22:32:23Z

/cherry-pick release-4.4

openshift-merge-robot · 2020-05-04T14:14:51Z

This PR was merged in the time period when ci-operator was mistakenly reporting failed tests as passing.

If this repository has ci-operator jobs, please inspect their test results even if passing, and consider the need for fixing or even reverting.

openshift-ci-robot requested review from bentito and EmilyM1 April 29, 2020 22:50

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Apr 29, 2020

timflannagan force-pushed the use-ansible-2.9.5-max branch from 8a7ef3c to 18927d4 Compare April 30, 2020 00:44

timflannagan changed the title ~~Dockerfile: Require ansible version to be less than 2.9.6.~~ Bug 1829836: Require ansible version to be less than 2.9.6. Apr 30, 2020

openshift-ci-robot added the bugzilla/severity-high and bugzilla/valid-bug (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) labels Apr 30, 2020

timflannagan force-pushed the use-ansible-2.9.5-max branch from 18927d4 to 0d52c08 Compare April 30, 2020 15:43

timflannagan changed the title ~~Bug 1829836: Require ansible version to be less than 2.9.6.~~ Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info. Apr 30, 2020

timflannagan force-pushed the use-ansible-2.9.5-max branch from 0d52c08 to 3c8a6af Compare April 30, 2020 15:48

timflannagan closed this Apr 30, 2020

timflannagan reopened this Apr 30, 2020

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Apr 30, 2020

timflannagan force-pushed the use-ansible-2.9.5-max branch from 7628eff to 237b1bf Compare May 1, 2020 20:02

timflannagan added 2 commits May 1, 2020 16:44

images: Replace all instances of the depreciated k8s_facts module wit…

2c5c277

…h k8s_info. The k8s_fact module is being depreciated in 2.9 in favor of the [`k8s_info` module](https://docs.ansible.com/ansible/latest/modules/k8s_info_module.html)

Dockerfile: Stop pinning the Ansible version to 2.8.

5a552f8

timflannagan force-pushed the use-ansible-2.9.5-max branch from 237b1bf to 31c5d56 Compare May 1, 2020 20:45

timflannagan changed the title ~~Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info.~~ Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions. May 1, 2020

timflannagan force-pushed the use-ansible-2.9.5-max branch from 31c5d56 to f578925 Compare May 1, 2020 20:50

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 1, 2020

EmilyM1 approved these changes May 1, 2020

openshift-ci-robot assigned EmilyM1 May 1, 2020

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 1, 2020

openshift-merge-robot merged commit 6f2470e into kube-reporting:master May 1, 2020

timflannagan deleted the use-ansible-2.9.5-max branch May 1, 2020 22:28

openshift-ci-robot mentioned this pull request May 4, 2020

Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances. #1208

Merged
