OCPBUGS-14193: pao e2e: Split e2e PAO update lane to more lanes #631

jlojosnegros · 2023-04-24T09:17:53Z

The ci/prow/e2e-gcp-pao-updating-profile functional test lane contains slow tests that require worker node reboots and, as a result, long waits for mcp/tuned/other statuses to be updated.

The lane is reaching its maximum timeout of 4 hours.

This calls for splitting the lane into two or more test lanes that can run in parallel and finish sooner, so that u/s PRs are not blocked or left waiting for a long time.

openshift-ci · 2023-04-24T09:17:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jlojosnegros · 2023-04-24T09:18:10Z

/test all

hack/run-test.sh

jlojosnegros · 2023-04-24T09:23:07Z

/test all

jlojosnegros · 2023-04-24T09:42:23Z

/test all

jlojosnegros · 2023-04-26T09:39:28Z

cc: @yanirq @mrniranjan

jlojosnegros · 2023-04-28T10:20:52Z

/test e2e-gcp-pao
/test e2e-aws-ovn

jlojosnegros · 2023-05-03T14:31:12Z

/test e2e-upgrade
infra issue

ffromani

I totally get and like the direction here; a few comments inside.
I'm not super excited about generalizing the runner script. While I see the code duplication argument, I'm still biased towards having really boring and fully independent runners. Another option could have been extracting only the common prelude into an included bash helper, but it's ok.

hack/run-test.sh

ffromani · 2023-05-04T06:18:53Z

hack/run-test.sh

@@ -12,6 +12,7 @@ usage() {
    print "    -h                Help for ${CURRENT_SCRIPT}"
    print "    -t                list of space separated paths to Testsuites to execute"
    print "    -p                string with extra Params for ginkgo"
+    print "    -r                string with report Params for ginkgo (these params will go after the list of suites)"


Proper positioning of ginkgo flags seems to be the responsibility of this wrapper. So the wrapper can tolerate a wrong order of flags, and the clarification in parentheses is unnecessary (a little layering violation).

Two things here.
While I have found problems when --junit-report comes before --require-suite, I think @mrniranjan found just the opposite some time ago: that everything after --require-suite was ignored by Ginkgo. I have been looking around but could not find anything backing either of the two ideas.

Regarding the wrapper being responsible for the param order, I totally agree. In fact I do not like this approach, but since the ginkgo params are passed as a string, the wrapper would need to parse the params string to enforce the proper order. I thought that could lead to many problems, so this approach was simpler.
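
A minimal sketch of the ordering discussed in this thread, assuming the flags stay plain strings as in the diff. The helper name compose_ginkgo_flags and its argument order are hypothetical and not part of hack/run-test.sh; the sketch only shows the report params being appended after the suite list, and it does not settle which order Ginkgo actually requires.

```shell
# Sketch only: compose_ginkgo_flags is a hypothetical helper.
# It mirrors the order used in the diff:
#   extra params, --require-suite, suites, report params.
compose_ginkgo_flags() {
    extra_params="$1"
    suites="$2"
    report_params="$3"
    # Emit one flat string; the caller decides how to split it into words.
    printf '%s --require-suite %s %s\n' "${extra_params}" "${suites}" "${report_params}"
}

compose_ginkgo_flags "--v -r --fail-fast" \
    "test/e2e/performanceprofile/functests" \
    "--junit-report=report.xml"
```

Keeping the ordering in one place means callers can pass the three groups in any order they like at the wrapper level without caring where Ginkgo expects them.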

ffromani · 2023-05-04T06:19:37Z

hack/run-test.sh

@@ -79,7 +83,8 @@ main() {
    MESSAGE="${HEADER_MESSAGE}: ${GINKGO_SUITS}"
    print ${MESSAGE}

-    GINKGO_FLAGS="${NO_COLOR} ${EXTRA_PARAMS} --require-suite ${GINKGO_SUITS}"
+    GINKGO_FLAGS="${NO_COLOR} ${EXTRA_PARAMS} --require-suite ${GINKGO_SUITS} ${REPORT_PARAMS}"
+    print "Command to run: GOFLAGS=-mod=vendor ginkgo ${GINKGO_FLAGS}"


maybe add a dry-run flag, so the script just prints the full command and does not actually execute it?

The main goal here was to allow later inspection of the params used to run ginkgo in case of a test failure.
--dry-run will make ginkgo walk the test hierarchy and print some additional output, even with the succinct flag.
That is more info than I was looking for, but if it could be useful for later debugging we can go for it.

@ffromani does that make sense? Or do you still think it would be better to go for the --dry-run option?

Sorry, I expressed myself poorly. I meant a --dry-run option managed by the hack/run-test.sh wrapper, which would then emit (but not run) the ginkgo commands.

Added new options to run-test.sh: one to execute an actual dry-run, to see the list of tests that would be run, and another to just print the command line without executing it.
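
A rough sketch of the "print the command line without executing it" mode, under stated assumptions: the option letter -n and the function name run_tests are hypothetical, since the thread does not name the options that were actually added. The -t and -p letters follow the usage text in the diff.

```shell
# Sketch only: -n (print-only) and run_tests are hypothetical names.
run_tests() {
    print_only=false
    suites=""
    extra_params=""
    OPTIND=1
    while getopts "nt:p:" opt; do
        case "${opt}" in
            n) print_only=true ;;
            t) suites="${OPTARG}" ;;
            p) extra_params="${OPTARG}" ;;
            *) return 2 ;;
        esac
    done

    echo "Command to run: GOFLAGS=-mod=vendor ginkgo ${extra_params} --require-suite ${suites}"
    if [ "${print_only}" = true ]; then
        return 0   # print-only mode: stop before invoking ginkgo
    fi

    # Unquoted on purpose so that each flag in the strings becomes its own word.
    GOFLAGS=-mod=vendor ginkgo ${extra_params} --require-suite ${suites}
}

# Only prints the command; ginkgo is never invoked here.
run_tests -n -t "test/e2e/performanceprofile/functests" -p "--v -r --fail-fast"
```

For the other mode mentioned above, ginkgo's own --dry-run flag could be included in the -p string, which makes ginkgo walk the specs without running them.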

...ceprofile/functests/8_performance_workloadhints/test_suite_performance_workloadhints_test.go

ffromani · 2023-05-04T06:23:46Z

do we know how much time we save now, roughly? E.g. down from 4h to XXX hours

test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go

Tal-or

Looking good, left small comment.
We need to make sure to have a follow-up PR on openshift/release for running the workloadHints lane

Tal-or · 2023-05-04T07:57:51Z

Makefile

@@ -199,7 +199,7 @@ pao-functests: cluster-label-worker-cnf pao-functests-only
 pao-functests-only:
 	@echo "Cluster Version"
 	hack/show-cluster-version.sh
-	hack/run-functests.sh
+	hack/run-test.sh -t "test/e2e/performanceprofile/functests" -p "--v -r --fail-fast --skip-package='5_latency_testing,2_performance_update' --flake-attempts=2 --junit-report=report.xml" -m "Running Functional Tests"


If we're already going with the generic approach, I would use the common parameters by default, so we don't need to specify them for every test.
For example:
--v -r --fail-fast -m "Running Functional Tests", etc.
If we want to change a default, we can always specify it explicitly with the desired value.

I'm not super excited about default parameters in general, because I think they obscure the way a function/script works and could lead to hard-to-find errors.
I usually prefer explicit over implicit.

I'm still inclined towards making it more readable and less repetitive, because this was the whole idea: reduce duplication.
But we can stick with explicitly specifying the parameters, no biggie.

@Tal-or and I have slightly different points of view here. Any other thoughts about this? @yanirq @ffromani
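
For reference, the "defaults with explicit override" option could look roughly like this. It is only a sketch: DEFAULT_PARAMS and resolve_params are hypothetical names, and the default value is just the subset of flags quoted in the suggestion above.

```shell
# Sketch only: DEFAULT_PARAMS and resolve_params are hypothetical names.
DEFAULT_PARAMS="--v -r --fail-fast"

resolve_params() {
    # Use the caller's value when it is set and non-empty, else the defaults.
    printf '%s\n' "${1:-${DEFAULT_PARAMS}}"
}

resolve_params ""                        # falls back to the defaults
resolve_params "--v --flake-attempts=2"  # the explicit value wins
```

The trade-off is the one raised in the thread: call sites get shorter, but the flags that actually reach ginkgo are no longer visible at the call site.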

jlojosnegros · 2023-05-04T08:51:05Z

I totally get and like the direction here; a few comments inside. I'm not super excited about generalizing the runner script. While I see the code duplication argument, I'm still biased towards having really boring and fully independent runners. Another option could have been extracting only the common prelude into an included bash helper, but it's ok.

I went for this approach because of the code duplication issue but, tbh, I do not have a strong opinion against it in this specific situation, so if we think the "common prelude" approach would be better, it is easy to change direction right now.
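
To make the "common prelude" alternative concrete, here is a small self-contained sketch. The file names common.sh and run-workloadhints.sh and the body of print are hypothetical; the idea is only that each runner stays independent and sources a tiny shared helper.

```shell
# Sketch only: file names and helper contents are hypothetical.
workdir="$(mktemp -d)"

# The shared prelude: only the helpers that every runner needs.
cat > "${workdir}/common.sh" <<'EOF'
print() { echo "[test-runner] $*"; }
EOF

# An otherwise independent runner that sources the prelude.
cat > "${workdir}/run-workloadhints.sh" <<'EOF'
. "$(dirname "$0")/common.sh"
print "Running workloadhints suite"
EOF

prelude_output="$(sh "${workdir}/run-workloadhints.sh")"
echo "${prelude_output}"
rm -rf "${workdir}"
```

Each runner would keep its own ginkgo invocation, so the duplication that remains is limited to the command line itself.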

jlojosnegros · 2023-05-04T09:59:47Z

do we know how much time we save now, roughly? E.g. down from 4h to XXX hours

Assuming all the other steps in e2e-gcp-pao-updating-profile took the same amount of time, we can estimate a difference of ~2h.
Example:

Here is a run from May 3 without this change: Ran for 4h12m53s
Here is the last successful run in this PR (without the Workloadhints tests): Ran for 2h5m37s
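
The ~2h figure can be checked from the two durations quoted above. The helper name to_seconds is illustrative, and it assumes durations in exactly the XhYmZs form with no leading zeros.

```shell
# Convert an "XhYmZs" duration into seconds (illustrative helper).
to_seconds() {
    h="${1%%h*}"        # text before the first "h"
    rest="${1#*h}"      # text after the first "h"
    m="${rest%%m*}"     # text before the first "m"
    rest="${rest#*m}"   # text after the first "m"
    s="${rest%s}"       # drop the trailing "s"
    echo $(( h * 3600 + m * 60 + s ))
}

before="$(to_seconds 4h12m53s)"   # run from May 3 without this change
after="$(to_seconds 2h5m37s)"     # last successful run in this PR
diff=$(( before - after ))
printf '%dh%dm%ds\n' $(( diff / 3600 )) $(( diff % 3600 / 60 )) $(( diff % 60 ))
# prints 2h7m16s
```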

jlojosnegros · 2023-05-04T10:00:40Z

Looking good, left small comment. We need to make sure to have a follow-up PR on openshift/release for running the workloadHints lane

There is already one, openshift/release#38754, just waiting for this to be merged.
In fact, any comment on that PR is more than welcome.

Thanks :)

jlojosnegros · 2023-05-31T10:45:23Z

just rebasing over last master changes

jlojosnegros · 2023-05-31T10:53:35Z

/hold

jlojosnegros · 2023-06-01T08:06:53Z

e2e-gcp-pao-updating-profile was running workloadhints tests and it passed.

jlojosnegros · 2023-06-01T08:08:58Z

Removed the last commit (it was just to check the new workloadhints lane)
and /unhold

jlojosnegros · 2023-06-01T10:24:30Z

/unhold

jlojosnegros · 2023-06-05T11:03:42Z

e2e-gcp-pao fails in 5_latency_testing_suite because it can't find the test executable. Not sure why yet, but that has revealed a failure in the AfterSuite function that is addressed in #677.

jlojosnegros · 2023-06-07T06:36:23Z

/test e2e-gcp-pao-updating-profile

jlojosnegros · 2023-06-07T09:04:23Z

/test e2e-hypershift

jlojosnegros · 2023-06-07T14:05:45Z

/hold
need to rebase onto #679 once merged

jlojosnegros · 2023-06-12T11:13:47Z

/hold cancel
as #679 has been already merged

yanirq · 2023-06-12T12:42:18Z

/lgtm

ffromani · 2023-06-12T12:57:37Z

/approve

we want this split.

openshift-ci · 2023-06-12T13:03:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, jlojosnegros

Needs approval from an approver in each of these files:

~~OWNERS~~ [ffromani]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jlojosnegros · 2023-06-12T13:55:25Z

/test e2e-upgrade

openshift-ci · 2023-06-12T19:38:51Z

@jlojosnegros: all tests passed!

openshift-ci-robot · 2023-06-12T19:41:43Z

@jlojosnegros: Jira Issue OCPBUGS-14193: All pull requests linked via external trackers have merged:

openshift/cluster-node-tuning-operator#631

Jira Issue OCPBUGS-14193 has been moved to the MODIFIED state.

openshift-ci bot added the do-not-merge/work-in-progress label Apr 24, 2023

jlojosnegros force-pushed the split-ci-lanes branch from 3749e0c to 78ab097 Compare April 24, 2023 09:18

jlojosnegros force-pushed the split-ci-lanes branch from 78ab097 to 8c57eab Compare April 24, 2023 09:41

jlojosnegros force-pushed the split-ci-lanes branch from 8c57eab to 93828d4 Compare April 26, 2023 09:36

jlojosnegros marked this pull request as ready for review April 26, 2023 09:37

openshift-ci bot removed the do-not-merge/work-in-progress label Apr 26, 2023

openshift-ci bot requested review from jmencak and Tal-or April 26, 2023 09:38

jlojosnegros mentioned this pull request Apr 26, 2023

NTO: Extract PAO workloadhints test's to new lane openshift/release#38754

Merged

jlojosnegros force-pushed the split-ci-lanes branch from bf12f01 to 31a6eec Compare April 27, 2023 13:25

yanirq mentioned this pull request May 2, 2023

OCPBUGS-11083: pao e2e: fix update test suit timeouts #626

Merged

openshift-merge-robot added the needs-rebase label May 3, 2023

jlojosnegros force-pushed the split-ci-lanes branch from 31a6eec to 3d099bb Compare May 3, 2023 10:35

openshift-merge-robot removed the needs-rebase label May 3, 2023

ffromani reviewed May 4, 2023

yanirq reviewed May 4, 2023

Tal-or reviewed May 4, 2023

jlojosnegros force-pushed the split-ci-lanes branch from 3d099bb to 156fed2 Compare May 4, 2023 09:47

openshift-merge-robot removed the needs-rebase label May 31, 2023

openshift-ci bot added the do-not-merge/hold label May 31, 2023

jlojosnegros force-pushed the split-ci-lanes branch from becc967 to b887b73 Compare June 1, 2023 08:07

openshift-ci bot removed the do-not-merge/hold label Jun 1, 2023

openshift-ci bot added the do-not-merge/hold label Jun 7, 2023

jlojosnegros force-pushed the split-ci-lanes branch from 81457cd to 8ac008a Compare June 12, 2023 08:57

openshift-ci bot removed the do-not-merge/hold label Jun 12, 2023

jlojosnegros added 2 commits June 12, 2023 13:35

Make running bash functions more generic

7622338

We have too many duplicated bash scripts to run different ginkgo testsuites, so lets try to make a generic script to reduce code duplication.

Extract Workloadhints tests

59eaa7d

As these tests seems to take a lot of the execution time lets extract them so we could execute them in a different lane.

jlojosnegros force-pushed the split-ci-lanes branch from 8ac008a to 59eaa7d Compare June 12, 2023 11:35

openshift-ci bot assigned yanirq Jun 12, 2023

openshift-ci bot added the lgtm label Jun 12, 2023

openshift-ci bot added the approved label Jun 12, 2023

openshift-merge-robot merged commit d4fa19a into openshift:master Jun 12, 2023
12 checks passed
