
Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances. #1208

Conversation

timflannagan
Member

@timflannagan timflannagan commented May 4, 2020

Overview

With the template caching changes made in Ansible 2.9.6, evaluating the `ansible_failed_task` expression was greatly slowing down the role while providing little value for debugging.

Changed:

  • Removed all `ansible_failed_task` references
  • Normalized most of the status updates
  • Stored the result of the `meteringconfig_spec` variable locally

Storing the result of the `meteringconfig_spec` variable locally

At this point in the role, we've finished overriding any variables that reference `meteringconfig_spec`, a giant dictionary containing all of the overridden MeteringConfig default values, so we can safely store the result locally.

Before this change, we copied this variable to a tmp file that the `reconcile_*` task files reference to deploy metering-related resources:

```yaml
- name: Store MeteringConfig spec into values file
  copy: content="{{ meteringconfig_spec }}" dest=/tmp/metering-values.yaml

- include_tasks: "{{ item }}"
  loop:
    - reconcile_metering.yml
    - reconcile_monitoring.yml
    - reconcile_hdfs.yml
    - reconcile_hive.yml
    - reconcile_presto.yml
    - reconcile_reporting_operator.yml
    - reconcile_reporting.yml
```

The problem, however, is that we still use this `meteringconfig_spec` variable under the hood to determine whether or not to create a resource:

```yaml
- name: Deploy presto resources
  include_tasks: deploy_resources.yml
  vars:
    values_file: /tmp/metering-values.yaml
    resources:
      ...
      - template_file: templates/presto/presto-auth-secrets.yaml
        apis: [ {kind: secret} ]
        prune_label_value: presto-auth-secrets
        create: "{{ meteringconfig_create_presto_auth_secrets }}"
```

Here, the `meteringconfig_create_presto_auth_secrets` variable is defined in the role's `defaults/main.yml` as follows:

```yaml
meteringconfig_create_presto_auth_secrets: "{{ _presto_spec.config.auth.enabled and _presto_spec.config.auth.createSecret | default(false) }}"
```

The value of this `_presto_spec` dictionary is pulled out of `meteringconfig_spec.presto.spec`.

If we instead store the value of this variable locally before reconciling resources, role execution speeds up considerably: lazy evaluation no longer forces these variables to be re-templated every time we loop over this list of resource definitions.
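Assuming the fix uses `set_fact` to force a one-time evaluation (the exact diff isn't reproduced here, so the tasks below are an illustrative sketch rather than the PR's actual code):

```yaml
# Illustrative sketch: assigning the variable to itself with set_fact forces
# Jinja to evaluate the large expression once; later references then read the
# stored fact instead of re-templating the whole dictionary.
- name: Store the evaluated MeteringConfig spec locally
  set_fact:
    meteringconfig_spec: "{{ meteringconfig_spec }}"

- name: Store MeteringConfig spec into values file
  copy:
    content: "{{ meteringconfig_spec }}"
    dest: /tmp/metering-values.yaml
```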

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2020
@@ -1,4 +1,7 @@
---
#
# Note: all of these tasks are wrapped in a block/rescue at the task file call site
Member

👍

@bentito
Member

bentito commented May 4, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2020
@timflannagan timflannagan changed the title images: Remove the ansible_failed_task variable from all block/rescue instances. Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances. May 4, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 4, 2020
@openshift-ci-robot

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@EmilyM1 EmilyM1 left a comment

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilyM1, timflannagan1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timflannagan
Member Author

timflannagan commented May 4, 2020

The e2e test flaked on an irrelevant test and passed on everything else.
/override ci/prow/metering-e2e-aws

Also from the e2e runs:

Monday 04 May 2020  20:56:31 +0000 (0:00:00.024)       0:02:44.543 ************ 
=============================================================================== 
meteringconfig -------------------------------------------------------- 163.30s
gather_facts ------------------------------------------------------------ 1.11s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
total ----------------------------------------------------------------- 164.41s
Playbook run took 0 days, 0 hours, 2 minutes, 44 seconds

After changing how the `meteringconfig_spec` variable is stored, the role finished execution in 2m44s for the dynamic input test (shown above) and 3m00s for the static input test, which is close to a 50% reduction from the time it previously took to install metering in the testing suite.
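The roughly 50% reduction is consistent with no longer re-templating the spec inside the resource loop. The effect can be sketched in plain Python (illustrative only, outside Ansible; names like `render_spec` are invented stand-ins, not the role's code):

```python
# Compare re-evaluating an "expensive template" on every loop iteration, the
# way lazily evaluated Ansible variables behave, against evaluating it once.

calls = {"n": 0}

def render_spec():
    """Stand-in for templating the large meteringconfig_spec dictionary."""
    calls["n"] += 1
    return {"presto": {"spec": {"config": {"auth": {"enabled": True}}}}}

resources = [f"resource-{i}" for i in range(7)]  # one per reconcile_* file

# Lazy: the spec is re-rendered for every resource in the loop.
calls["n"] = 0
for _ in resources:
    spec = render_spec()
    create = spec["presto"]["spec"]["config"]["auth"]["enabled"]
lazy_renders = calls["n"]

# Stored locally: render once up front, then reuse the result.
calls["n"] = 0
cached = render_spec()
for _ in resources:
    create = cached["presto"]["spec"]["config"]["auth"]["enabled"]
cached_renders = calls["n"]

print(lazy_renders, cached_renders)  # 7 1
```

With seven reconcile files the lazy path pays the templating cost seven times per pass; storing the result once amortizes it to a single evaluation, which is where the install-time savings come from.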

@openshift-ci-robot

@timflannagan1: Overrode contexts on behalf of timflannagan1: ci/prow/metering-e2e-aws

In response to this:

The e2e test flaked on an irrelevant test and passed on everything else.
/override ci/prow/metering-e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit 6937d1c into kube-reporting:master May 4, 2020
@openshift-ci-robot

@timflannagan1: All pull requests linked via external trackers have merged: kube-reporting/metering-operator#1200, kube-reporting/metering-operator#1208. Bugzilla bug 1829836 has been moved to the MODIFIED state.

In response to this:

Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan timflannagan deleted the release-4.5-fix-k8s-status-instances branch May 4, 2020 21:57