
Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances. #1208

Conversation

timflannagan
Member

@timflannagan timflannagan commented May 4, 2020

Overview

With the template caching changes made in Ansible 2.9.6, evaluating the `ansible_failed_task` expression was greatly slowing down the role while providing little value for debugging.

Changed:

  • Removed all `ansible_failed_task` references
  • Normalized most of the status updates
  • Stored the result of the `meteringconfig_spec` variable locally

Storing the result of the `meteringconfig_spec` variable locally

At this point in the role, we've finished overriding any variables that reference `meteringconfig_spec`, a giant dictionary containing all of the overridden MeteringConfig default values, so we can safely store the result locally.

Before this change, we copied this variable to a tmp file that the `reconcile_*` task files reference to deploy metering-related resources:

```yaml
- name: Store MeteringConfig spec into values file
  copy: content="{{ meteringconfig_spec }}" dest=/tmp/metering-values.yaml

- include_tasks: "{{ item }}"
  loop:
    - reconcile_metering.yml
    - reconcile_monitoring.yml
    - reconcile_hdfs.yml
    - reconcile_hive.yml
    - reconcile_presto.yml
    - reconcile_reporting_operator.yml
    - reconcile_reporting.yml
```

The problem, however, is that we still use this `meteringconfig_spec` variable under the hood to determine whether or not to create a resource:

```yaml
- name: Deploy presto resources
  include_tasks: deploy_resources.yml
  vars:
    values_file: /tmp/metering-values.yaml
    resources:
      ...
      - template_file: templates/presto/presto-auth-secrets.yaml
        apis: [ {kind: secret} ]
        prune_label_value: presto-auth-secrets
        create: "{{ meteringconfig_create_presto_auth_secrets }}"
```

Here, the `meteringconfig_create_presto_auth_secrets` variable is defined in the role's `defaults/main.yml` as follows:

```yaml
meteringconfig_create_presto_auth_secrets: "{{ _presto_spec.config.auth.enabled and _presto_spec.config.auth.createSecret | default(false) }}"
```

The value of this `_presto_spec` dictionary is pulled out of `meteringconfig_spec.presto.spec`.

If we instead store the value of this variable locally before reconciling resources, role execution speeds up considerably: lazy evaluation no longer forces these variables to be re-templated every time we loop over this list of resource definitions.
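Assuming the fix uses `set_fact` to force a one-time evaluation (the exact diff isn't reproduced here, so the tasks below are an illustrative sketch rather than the PR's actual code):

```yaml
# Illustrative sketch: assigning the variable to itself with set_fact forces
# Jinja to evaluate the large expression once; later references then read the
# stored fact instead of re-templating the whole dictionary.
- name: Store the evaluated MeteringConfig spec locally
  set_fact:
    meteringconfig_spec: "{{ meteringconfig_spec }}"

- name: Store MeteringConfig spec into values file
  copy:
    content: "{{ meteringconfig_spec }}"
    dest: /tmp/metering-values.yaml
```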

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2020
@@ -1,4 +1,7 @@
---
#
# Note: all of these tasks are wrapped in a block/rescue at the task file call site
Member

👍

@bentito
Member

bentito commented May 4, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2020
@timflannagan timflannagan changed the title images: Remove the ansible_failed_task variable from all block/rescue instances. Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances. May 4, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 4, 2020
@openshift-ci-robot

@timflannagan1: This pull request references Bugzilla bug 1829836, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@EmilyM1 EmilyM1 left a comment

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilyM1, timflannagan1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timflannagan
Member Author

timflannagan commented May 4, 2020

The e2e test flaked on an irrelevant test and passed on everything else.
/override ci/prow/metering-e2e-aws

Also from the e2e runs:

Monday 04 May 2020  20:56:31 +0000 (0:00:00.024)       0:02:44.543 ************ 
=============================================================================== 
meteringconfig -------------------------------------------------------- 163.30s
gather_facts ------------------------------------------------------------ 1.11s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
total ----------------------------------------------------------------- 164.41s
Playbook run took 0 days, 0 hours, 2 minutes, 44 seconds

After changing how the `meteringconfig_spec` variable is stored, the role finished execution in 2m44s for the dynamic input test (shown above) and 3m00s for the static input test, which is close to a 50% reduction from the time it previously took to install metering in the testing suite.
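The roughly 50% reduction is consistent with no longer re-templating the spec inside the resource loop. The effect can be sketched in plain Python (illustrative only, outside Ansible; names like `render_spec` are invented stand-ins, not the role's code):

```python
# Compare re-evaluating an "expensive template" on every loop iteration, the
# way lazily evaluated Ansible variables behave, against evaluating it once.

calls = {"n": 0}

def render_spec():
    """Stand-in for templating the large meteringconfig_spec dictionary."""
    calls["n"] += 1
    return {"presto": {"spec": {"config": {"auth": {"enabled": True}}}}}

resources = [f"resource-{i}" for i in range(7)]  # one per reconcile_* file

# Lazy: the spec is re-rendered for every resource in the loop.
calls["n"] = 0
for _ in resources:
    spec = render_spec()
    create = spec["presto"]["spec"]["config"]["auth"]["enabled"]
lazy_renders = calls["n"]

# Stored locally: render once up front, then reuse the result.
calls["n"] = 0
cached = render_spec()
for _ in resources:
    create = cached["presto"]["spec"]["config"]["auth"]["enabled"]
cached_renders = calls["n"]

print(lazy_renders, cached_renders)  # 7 1
```

With seven reconcile files the lazy path pays the templating cost seven times per pass; storing the result once amortizes it to a single evaluation, which is where the install-time savings come from.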

@openshift-ci-robot

@timflannagan1: Overrode contexts on behalf of timflannagan1: ci/prow/metering-e2e-aws

In response to this:

The e2e test flaked on an irrelevant test and passed on everything else.
/override ci/prow/metering-e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit 6937d1c into kube-reporting:master May 4, 2020
@openshift-ci-robot

@timflannagan1: All pull requests linked via external trackers have merged: kube-reporting/metering-operator#1200, kube-reporting/metering-operator#1208. Bugzilla bug 1829836 has been moved to the MODIFIED state.

In response to this:

Bug 1829836: Remove the ansible_failed_task variable from all block/rescue instances.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan timflannagan deleted the release-4.5-fix-k8s-status-instances branch May 4, 2020 21:57