
Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions. #1200


@timflannagan timflannagan commented Apr 29, 2020

Ansible 2.9.6 changed how the template caching mechanism works. In previous versions, the results of all variables (i.e. Jinja2 expressions) were being cached, instead of the intended behavior of caching a single variable. This was problematic, especially when generating a password more than once: the first password result was cached, and any subsequent call to generate a password returned that first, cached value.

In the case of Metering, our Ansible role relies heavily on lazy evaluation: Ansible evaluates variables at the last possible moment. After the change in 2.9.6, Metering saw a significant performance degradation, with role execution going from an average of 3-7 minutes to 45+ minutes.

This was because we were relying on a buggy implementation of the template caching mechanism. In our current implementation, we pass a variable to the name field of the k8s_info module. As a result, the module had to make a filesystem lookup call many, many times for a particular value stored in the meteringconfig_spec variable, which is a large dictionary of values and essentially serves as the single source of truth that Metering references.

A more concrete example:

TASK [meteringconfig : Check for the existence of the Presto TLS secret] *******
task path: /opt/ansible/roles/meteringconfig/tasks/configure_presto_tls.yml:15
 
Wednesday 29 April 2020  21:43:56 +0000 (0:00:00.380)       0:00:38.695 *******
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
 
File lookup using /opt/ansible/charts/openshift-metering/values.yaml as file
...

That variable in turn relies on the value of meteringconfig_default_values, a dictionary containing the default Helm chart values. Previously, the result of that expression was cached. Now, because of how lazy evaluation works, meteringconfig_default_values is re-evaluated every time it's used, causing the aforementioned performance issues. The workaround is to cache, or finalize for lack of a better phrase, the result of the meteringconfig_default_values expression.

This helps performance because we no longer re-evaluate this variable every time we want to template something. All changes go through the meteringconfig_spec dictionary, so there's no risk in creating this variable early in the role.
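
As a rough sketch of the workaround described above (not the exact change in this PR), the default values can be rendered once with `set_fact`, so that later templating reuses the stored result instead of re-running the file lookup. The values.yaml path comes from the log output above; the task name and the `meteringconfig_chart_values_file` variable are hypothetical, and the role's real definition may differ.

```yaml
# Hypothetical sketch; the task name and the *_file variable are illustrative only.
#
# Before: a lazily evaluated role variable. Every reference re-renders the
# expression, which re-runs the file lookup each time:
#
#   meteringconfig_default_values: "{{ lookup('file', meteringconfig_chart_values_file) | from_yaml }}"

# After: render the expression once, early in the role, and store the result.
- name: Finalize the default chart values
  set_fact:
    meteringconfig_default_values: "{{ lookup('file', meteringconfig_chart_values_file) | from_yaml }}"
  vars:
    # Assumed location, taken from the "File lookup using ..." log lines.
    meteringconfig_chart_values_file: /opt/ansible/charts/openshift-metering/values.yaml
```

Because `set_fact` templates its arguments when the task runs and stores the rendered value, later references to `meteringconfig_default_values` read the stored dictionary rather than triggering another lookup.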

Also included in the changeset is the switch to the k8s_info module, as the k8s_facts module was deprecated in 2.9 and k8s_info has the same usage.
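
For illustration, a task like the one named in the log output above might look as follows after the switch. Only the module name changes relative to `k8s_facts`. The secret name variable, namespace variable, and register name are assumptions, not the role's actual values.

```yaml
# Hypothetical sketch; variable and register names are placeholders.
- name: Check for the existence of the Presto TLS secret
  k8s_info:   # previously: k8s_facts
    api_version: v1
    kind: Secret
    name: "{{ presto_tls_secret_name }}"     # assumed variable
    namespace: "{{ metering_namespace }}"    # assumed variable
  register: presto_tls_secret_lookup

# k8s_info returns the matching objects in the "resources" list.
- name: Report whether the secret exists
  debug:
    msg: "Presto TLS secret exists: {{ presto_tls_secret_lookup.resources | length > 0 }}"
```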

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2020
@timflannagan timflannagan changed the title Dockerfile: Require ansible version to be less than 2.9.6. Bug 1829836: Require ansible version to be less than 2.9.6. Apr 30, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 30, 2020
@openshift-ci-robot

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Require ansible version to be less than 2.9.6.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan timflannagan changed the title Bug 1829836: Require ansible version to be less than 2.9.6. Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info. Apr 30, 2020
@openshift-ci-robot

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info.



@timflannagan
Member Author

Not too sure what's happening: I was able to get this working locally on a 4.5 cluster, but I'm having a hard time getting it working on a 4.4 one.
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2020
@timflannagan timflannagan changed the title Bug 1829836: Replace all instances of the depreciated k8s_facts module with k8s_info. Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions. May 1, 2020
@openshift-ci-robot

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.


@timflannagan
Member Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 1, 2020
Member

@EmilyM1 EmilyM1 left a comment


/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilyM1, timflannagan1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@timflannagan
Member Author

/cherry-pick release-4.4

@openshift-cherrypick-robot

@timflannagan1: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.4



@openshift-merge-robot openshift-merge-robot merged commit 6f2470e into kube-reporting:master May 1, 2020
@openshift-ci-robot

@timflannagan1: All pull requests linked via external trackers have merged: kube-reporting/metering-operator#1200. Bugzilla bug 1829836 has been moved to the MODIFIED state.

In response to this:

Bug 1829836: Fix performance issues when deploying Metering on Ansible 2.9.6+ versions.


@timflannagan timflannagan deleted the use-ansible-2.9.5-max branch May 1, 2020 22:28
@timflannagan
Member Author

/cherry-pick release-4.4

@openshift-merge-robot
Contributor

This PR was merged in the time period when ci-operator was mistakenly reporting failed tests as passing.

If this repository has ci-operator jobs, please inspect their test results even if passing, and consider the need for fixing or even reverting.
