Add periodic job for OCP with realtime workers #7154

Merged
merged 4 commits into from
Mar 16, 2020

Conversation

slintes
Member

@slintes slintes commented Feb 13, 2020

Recently a new feature was introduced for MachineConfigs to install a realtime kernel on nodes, see [0] and [1]. To detect regressions with the rt kernel early, this PR introduces a new periodic job.

Todos (AFAIK):

  • Create a new periodic job for OCP 4.4, using a new CLUSTER_VARIANT "rt". The job needs to run on GCP, as the rt kernel does not work on all AWS machine types.
  • Handle the new variant in cluster-launch-installer-e2e.yaml
  • When the job works, add it to release-ocp-4.4.json
  • When that also works, repeat for OCP 4.5 and 4.6
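A minimal sketch of what handling the "rt" variant could look like, assuming the MachineConfig kernelType field described in [1]. The helper name write_rt_machine_config, the manifest file name, and the realtime-worker object name are hypothetical and not taken from this PR:

```shell
# Hypothetical helper: write a worker MachineConfig that requests the realtime
# kernel into the given manifests directory.
# Assumption: the MachineConfig API accepts spec.kernelType "realtime".
write_rt_machine_config() {
  rt_manifests_dir="$1"
  mkdir -p "${rt_manifests_dir}" || return 1
  cat > "${rt_manifests_dir}/realtime-worker-machine-config.yml" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: realtime-worker
spec:
  kernelType: realtime
EOF
}

# Usage (not run here): point the helper at the directory produced by
# "openshift-install create manifests", for example:
# write_rt_machine_config /tmp/artifacts/installer/manifests
```

The installer would then pick up the extra manifest together with the generated ones when the cluster is created.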

[0] openshift/enhancements#166
[1] openshift/machine-config-operator#1330

FYI @MarSik

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 13, 2020
@slintes
Member Author

slintes commented Feb 13, 2020

Nice, it seems to work :) In the ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.4 job I see:

  • the realtime kernel setting in the rendered worker MachineConfig
  • kernel: Linux version 4.18.0-147.5.1.rt24.98.el8_1.x86_64 in workers journal

But I also see 5 failed tests... are those flaky, or can it be a real issue?
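The kernel check above could be automated roughly as follows. This is only a sketch: the function name is_rt_kernel is hypothetical, and it assumes that realtime kernel releases carry an ".rt<digits>" component, as in the version string quoted above:

```shell
# Hypothetical check: succeed if a kernel release string looks like an RT build.
# Assumption: RT releases contain a ".rt<digits>" component.
is_rt_kernel() {
  printf '%s\n' "$1" | grep -Eq '\.rt[0-9]+(\.|$)'
}

# On a node this would typically be: is_rt_kernel "$(uname -r)"
if is_rt_kernel "4.18.0-147.5.1.rt24.98.el8_1.x86_64"; then
  echo "realtime kernel"
else
  echo "standard kernel"
fi
```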

@slintes
Member Author

slintes commented Feb 13, 2020

@slintes
Member Author

slintes commented Feb 13, 2020

/test pj-rehearse

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 13, 2020
@slintes slintes changed the title [WIP] Add periodic job for OCP with realtime workers Add periodic job for OCP with realtime workers Feb 13, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2020
@slintes
Member Author

slintes commented Feb 13, 2020

2nd run had different failures, so I assume they are flakes.
I added jobs for 4.5 and 4.6, and added all to the relevant release json configs.
That's all I know I have to do, please let me know if anything is missing, thanks!

@slintes
Member Author

slintes commented Feb 13, 2020

/test all

@slintes
Member Author

slintes commented Feb 14, 2020

/test pj-rehearse

@slintes
Member Author

slintes commented Feb 14, 2020

Looking at other jobs, the "unable to import latest release image: timed out waiting for the condition" failure seems to be normal on 4.5 and 4.6?

@cgwalters
Member

/approve

@slintes
Member Author

slintes commented Feb 20, 2020

/assign @smarterclayton

@slintes
Member Author

slintes commented Feb 24, 2020

ping @smarterclayton, can I get a review please :)

openshift-install --dir=/tmp/artifacts/installer/ create manifests
echo "${CLUSTER_NETWORK_MANIFEST}" > /tmp/artifacts/installer/manifests/cluster-network-03-config.yml

Contributor

nit: this should probably be flattened out:

if [[ -n "${CLUSTER_NETWORK_MANIFEST:-}" ]] || has_variant "rt"; then
    openshift-install --dir=/tmp/artifacts/installer/ create manifests
fi
if [[ -n "${CLUSTER_NETWORK_MANIFEST:-}" ]]; then
    echo ...
fi
if has_variant "rt"; then
    cat > ...
fi

or at least untangled:

if [[ -n "${CLUSTER_NETWORK_MANIFEST:-}" ]]; then
    openshift-install...
    echo ...
fi
if has_variant "rt"; then
    openshift-install...
    cat >...
fi
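Both variants rely on a has_variant helper. A minimal sketch of how such a check could work, assuming CLUSTER_VARIANT holds a comma-separated list of variant names; the helper in cluster-launch-installer-e2e.yaml may be implemented differently, and the "other" variant name below is purely illustrative:

```shell
# Sketch of a variant check.
# Assumption: CLUSTER_VARIANT is a comma-separated list such as "rt" or
# "other,rt"; the real template helper may differ.
has_variant() {
  case ",${CLUSTER_VARIANT:-}," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

CLUSTER_VARIANT="other,rt"
if has_variant "rt"; then
  echo "rt variant requested"
fi
```

Wrapping the list in commas before matching avoids false positives on names that merely contain "rt" as a substring.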

Member Author

Thanks for your review!

About your 1st suggestion: not sure, doesn't it make more sense to deal with both features which need to modify manifests inside the if branch which creates them?

About the 2nd: that would probably break if both features are used simultaneously, because it would call create manifests twice.
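The difference between the two suggested forms can be illustrated with a small sketch. Everything here is a stand-in: create_manifests is a stub that only counts calls instead of running openshift-install, and has_variant assumes a comma-separated CLUSTER_VARIANT:

```shell
# Stub standing in for "openshift-install ... create manifests"; it only
# counts how often it is called.
create_manifests_calls=0
create_manifests() {
  create_manifests_calls=$((create_manifests_calls + 1))
}

# Stand-in for the template's variant check (assumed comma-separated list).
has_variant() {
  case ",${CLUSTER_VARIANT:-}," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# "Untangled" form: each feature creates the manifests itself.
untangled() {
  if [ -n "${CLUSTER_NETWORK_MANIFEST:-}" ]; then create_manifests; fi
  if has_variant "rt"; then create_manifests; fi
}

# "Flattened" form: create the manifests once if either feature needs them.
flattened() {
  if [ -n "${CLUSTER_NETWORK_MANIFEST:-}" ] || has_variant "rt"; then
    create_manifests
  fi
}

# Enable both features at once.
CLUSTER_NETWORK_MANIFEST="placeholder"
CLUSTER_VARIANT="rt"

create_manifests_calls=0; untangled
echo "untangled: ${create_manifests_calls} call(s)"

create_manifests_calls=0; flattened
echo "flattened: ${create_manifests_calls} call(s)"
```

With both features enabled, the untangled form invokes the stub twice and the flattened form once.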

Contributor

Yes, keep the variant separate. CLUSTER_NETWORK_MANIFEST is a highly specialized use case and support is limited.

Contributor

Split them, and then inside has_variant "rt" add an error condition that fails if CLUSTER_NETWORK_MANIFEST is set:

if has_variant "rt"; then
  if [[ -n "${CLUSTER_NETWORK_MANIFEST:-}" ]]; then
    echo 'error: CLUSTER_NETWORK_MANIFEST is incompatible with the `rt` variant'
    exit 1
  fi
  ...
fi

Member Author

ok, done

@jstuever
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2020
@jwforres
Member

jwforres commented Mar 9, 2020

/test pj-rehearse

Comment on lines 1634 to 1638
- mountPath: /usr/local/pull-secret
  name: pull-secret
- mountPath: /usr/local/pull-secret
  name: release-pull-secret
Member

I think you got bitten by a rebase issue. You should only have release-pull-secret: rename all references to pull-secret in this PR to release-pull-secret and drop the dupes. There was a rename in 91ef459.
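One way to spot this kind of duplicate after a rebase, sketched with standard text tools. The helper name duplicate_mount_paths is hypothetical, and the check is a plain text scan rather than a YAML parser, so it does not distinguish between separate containers:

```shell
# Hypothetical helper: print every mountPath value that appears more than once
# on standard input.
duplicate_mount_paths() {
  sed -n 's/^[[:space:]-]*mountPath:[[:space:]]*//p' | sort | uniq -d
}

# Sample input mirroring the lines under review.
duplicate_mount_paths <<'EOF'
- mountPath: /usr/local/pull-secret
  name: pull-secret
- mountPath: /usr/local/pull-secret
  name: release-pull-secret
EOF
```

For the sample input this reports /usr/local/pull-secret, the path mounted by both volume entries.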

Member Author

oh ha, thanks, fixed

@slintes
Member Author

slintes commented Mar 9, 2020

The e2e-gcp-rt-4.4 lane looks good, and surprisingly also the e2e-gcp-rt-4.5 lane.
Looking into the app-ci-config failure...

@slintes
Member Author

slintes commented Mar 10, 2020

/test app-ci-config

@slintes
Member Author

slintes commented Mar 10, 2020

/test pj-rehearse

@slintes
Member Author

slintes commented Mar 10, 2020

ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.4 is green with 5 test failures
ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.5 is red but with only 6 failures
ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.6 failed early with unable to import latest release image: timed out waiting for the condition, is this expected?

ci/rehearse/openshift/cloud-credential-operator/master/e2e-azure still running atm, but the job is unchanged and seems to be red usually anyway (looking at https://prow.svc.ci.openshift.org/?job=*master-e2e-azure)

@jstuever jstuever removed their assignment Mar 13, 2020
@slintes
Member Author

slintes commented Mar 16, 2020

/test all

@slintes
Member Author

slintes commented Mar 16, 2020

/test pj-rehearse

@slintes
Member Author

slintes commented Mar 16, 2020

/retest

@slintes
Member Author

slintes commented Mar 16, 2020

gcp-rt-4.6 still fails early, but other gcp-4.6 lanes fail with the same error.
Everything else looks good from my POV.

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@slintes
Member Author

slintes commented Mar 16, 2020

/test pj-rehearse

@smarterclayton
Contributor

/lgtm

the rt-4.5 job appeared to boot into the right worker kernel

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jstuever, slintes, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 16, 2020
@openshift-merge-robot openshift-merge-robot merged commit 0258807 into openshift:master Mar 16, 2020
@openshift-ci-robot
Contributor

@slintes: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.6 | 05e0471 | link | /test pj-rehearse |
| ci/rehearse/openshift/cloud-credential-operator/master/e2e-gcp | 05e0471 | link | /test pj-rehearse |
| ci/rehearse/release-openshift-ocp-installer-e2e-gcp-rt-4.5 | 05e0471 | link | /test pj-rehearse |
| ci/rehearse/openshift/cloud-credential-operator/master/e2e-azure | 05e0471 | link | /test pj-rehearse |
| ci/prow/pj-rehearse | 05e0471 | link | /test pj-rehearse |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
