Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-14193: pao e2e: Split e2e PAO update lane to more lanes #631

Merged
merged 2 commits into from Jun 12, 2023

Conversation

jlojosnegros
Copy link
Contributor

ci/prow/e2e-gcp-pao-updating-profile functional test lane contains slow tests that require reboots to worker nodes and as a result long waits for mcp/tuned/other statuses to be updated.

The lane is reaching its maximum timeout of 4 hours

This calls for a need to split this lane to 1 or more test lanes that could run in parallel and in less amount of time so u/s PRs will not be blocked/long waiting

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 24, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jlojosnegros
Copy link
Contributor Author

/test all

hack/run-test.sh Outdated Show resolved Hide resolved
@jlojosnegros
Copy link
Contributor Author

/test all

@jlojosnegros
Copy link
Contributor Author

/test all

@jlojosnegros jlojosnegros marked this pull request as ready for review April 26, 2023 09:37
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2023
@openshift-ci openshift-ci bot requested review from jmencak and Tal-or April 26, 2023 09:38
@jlojosnegros
Copy link
Contributor Author

cc: @yanirq @mrniranjan

@jlojosnegros
Copy link
Contributor Author

/test e2e-gcp-pao
/test e2e-aws-ovn

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2023
@jlojosnegros
Copy link
Contributor Author

/test e2e-upgrade
infra issue

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally get and like the direction here, few comments inside.
I'm not super excited about generalizing the runner script, while I see the code duplication argument I'm still biased towards having really boring and fully independent runners. Another option could have been extracting only the common prelude into a include bash helper, but it's ok.

hack/run-test.sh Outdated Show resolved Hide resolved
hack/run-test.sh Outdated Show resolved Hide resolved
@@ -12,6 +12,7 @@ usage() {
print " -h Help for ${CURRENT_SCRIPT}"
print " -t list of space separated paths to Testsuites to execute"
print " -p string with extra Params for ginkgo"
print " -r string with report Params for ginkgo (these params will go after the list of suites)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proper positioning of ginkgo flags seems to be the responsability of this wrapper. So the wrapper can tolerate wrong order of flags, and the clarification between parens is unnecessary (a little layering violation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here.
While I have found problems when --junit-report is before require-suite I think @mrniranjan found some time ago, just the opposite, that everything after require-suite was ignored by Ginkgo. I have been looking around but could not found anything backing any of the two ideas.

Regarding, the wrapper being responsible of the param order, I am totally agree, in fact I do not like this approach but, as ginkgo params are passed as a string, for the wrapper to handle the proper order should have need to parse the params string and I thought that could lead to many problems, so this approach was simpler.

@@ -79,7 +83,8 @@ main() {
MESSAGE="${HEADER_MESSAGE}: ${GINKGO_SUITS}"
print ${MESSAGE}

GINKGO_FLAGS="${NO_COLOR} ${EXTRA_PARAMS} --require-suite ${GINKGO_SUITS}"
GINKGO_FLAGS="${NO_COLOR} ${EXTRA_PARAMS} --require-suite ${GINKGO_SUITS} ${REPORT_PARAMS}"
print "Command to run: GOFLAGS=-mod=vendor ginkgo ${GINKGO_FLAGS}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a dry-run flag, so the script just prints the full command and does not actually execute it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main target here was to allow a later inspect of the params used to run ginkgo in case of test failure.
--dry-run will make ginkgo to walk the test hierarchy and print some additional output, even with the succinct flag.
That is more info that I was looking for, but if it could be useful for later debugging we can go for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffromani makes sense? or do you still think it would be better to go for --dry-run option?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I expressed myself poorly. I meant a --dry-run option managed by the hack/run-test.sh wrapper, which then will emit (but not run) the ginkgo commands

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new options to run-tests.sh one to execute an actual dry-run to be able to see the list of test that would be run and another one to just see the command line without executing it.

@ffromani
Copy link
Contributor

ffromani commented May 4, 2023

do we know how much time we save now, roughly? E.g. down from 4h to XXX hours

Copy link
Contributor

@Tal-or Tal-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left small comment.
We need to make sure to have a follow-up PR on openshift/release for running the workloadHints lane

Makefile Outdated
@@ -199,7 +199,7 @@ pao-functests: cluster-label-worker-cnf pao-functests-only
pao-functests-only:
@echo "Cluster Version"
hack/show-cluster-version.sh
hack/run-functests.sh
hack/run-test.sh -t "test/e2e/performanceprofile/functests" -p "--v -r --fail-fast --skip-package='5_latency_testing,2_performance_update' --flake-attempts=2 --junit-report=report.xml" -m "Running Functional Tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're already going with the generic approach I would use the common parameters by default, so we don't need to specify them for every test.
For example:
--v -r --fail-fast -m "Running Functional Tests" , etc.
in case we want to change the default we can always specify explicitly with the desired values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super excited about default parameters in general, because I think they obscure the way a function/script works and could lead to hard to find errors ...
I usually prefer explicit over implicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still inclined towards making it more readable and less repetitive because this was the whole idea - reduce duplication.
But we can stick with explicitly specifying the parameters, no biggie.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tal-or and I have slightly different pov here, any other thought about this? @yanirq @ffromani

@jlojosnegros
Copy link
Contributor Author

I totally get and like the direction here, few comments inside. I'm not super excited about generalizing the runner script, while I see the code duplication argument I'm still biased towards having really boring and fully independent runners. Another option could have been extracting only the common prelude into a include bash helper, but it's ok.

I went for this approach because of the code duplication issue but, tbh I do not have a strong opinion against it in this specific situation, so if we think that the "common prelude" approach would be better it is easy to change direction right now.

@jlojosnegros
Copy link
Contributor Author

do we know how much time we save now, roughly? E.g. down from 4h to XXX hours

Assuming all the other steps in e2e-gcp-pao-updating-profile took the same amount of time... we can estimate a difference of ~2h.
Example:

  • Here is a run from May 3 without this change : Ran for 4h12m53s
  • Here is the last successful run in this PR (without the Workloadhints tests): Ran for 2h5m37s

@jlojosnegros
Copy link
Contributor Author

jlojosnegros commented May 4, 2023

Looking good, left small comment. We need to make sure to have a follow-up PR on openshift/release for running the workloadHints lane

There already is one openshift/release#38754 just waiting for this to be merged.
In fact any comment on that PR is more than wellcomed.

Thanks :)

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2023
@jlojosnegros
Copy link
Contributor Author

just rebasing over last master changes

@jlojosnegros
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 31, 2023
@jlojosnegros
Copy link
Contributor Author

e2e-gcp-pao-updating-profile was running workloadhints tests and it passed.

@jlojosnegros
Copy link
Contributor Author

Removed last commit ( was just to check new workloadhints lane )
and /unhold

@jlojosnegros
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 1, 2023
@jlojosnegros
Copy link
Contributor Author

e2e-gcp-pao fails in 5_latency_testing_suite because can't find the test executable, not sure why yet, but that has reveal a fail in the AfterSuite function that is addressed here -> #677

@jlojosnegros
Copy link
Contributor Author

/test e2e-gcp-pao-updating-profile

@jlojosnegros
Copy link
Contributor Author

/test e2e-hypershift

@jlojosnegros
Copy link
Contributor Author

/hold
need to rebase onto #679 once merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2023
@jlojosnegros
Copy link
Contributor Author

/hold cancel
as #679 has been already merged

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2023
We have too many duplicated bash scripts to run different ginkgo
testsuites, so lets try to make a generic script to reduce code
duplication.
As these tests seems to take a lot of the execution time lets extract
them so we could execute them in a different lane.
@yanirq
Copy link
Contributor

yanirq commented Jun 12, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 12, 2023
@ffromani
Copy link
Contributor

/approve

we want this split.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, jlojosnegros

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2023
@jlojosnegros
Copy link
Contributor Author

/test e2e-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 12, 2023

@jlojosnegros: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d4fa19a into openshift:master Jun 12, 2023
12 checks passed
@openshift-ci-robot
Copy link
Contributor

@jlojosnegros: Jira Issue OCPBUGS-14193: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-14193 has been moved to the MODIFIED state.

In response to this:

ci/prow/e2e-gcp-pao-updating-profile functional test lane contains slow tests that require reboots to worker nodes and as a result long waits for mcp/tuned/other statuses to be updated.

The lane is reaching its maximum timeout of 4 hours

This calls for a need to split this lane to 1 or more test lanes that could run in parallel and in less amount of time so u/s PRs will not be blocked/long waiting

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants