Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node: e2e: bring up/down SRIOV DP just once #96219

Merged
merged 1 commit into from
Dec 9, 2020

Conversation

ffromani
Copy link
Contributor

@ffromani ffromani commented Nov 4, 2020

What type of PR is this?

/kind cleanup
/kind failing-test
/kind flake

What this PR does / why we need it:
Simplify the topology manager e2e tests to make the flow easier to follow, less complex and less fragile

Fixes #
N/A

Special notes for your reviewer:
Please check the commit message for a noteworthy benign side effect of this change.
It seems another instance of https://bugzilla.redhat.com/1878566
Deserves separate GH issue on k/k?
Obsoletes #95611

We didn't see this issue before because it largely depend on the HW on which the test runs.
Additionally, the prow jobs don't run with SRIOV hardware, so it cannot happen in prow.

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
N/A

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2020
@k8s-ci-robot
Copy link
Contributor

@fromanirh: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 4, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @fromanirh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 4, 2020
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 4, 2020
@ffromani
Copy link
Contributor Author

ffromani commented Nov 4, 2020

/assign @sjenning

@odinuge
Copy link
Member

odinuge commented Nov 4, 2020

/ok-to-test
/test pull-kubernetes-node-kubelet-serial-topology-manager

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 4, 2020
@ffromani
Copy link
Contributor Author

ffromani commented Nov 4, 2020

The issue I mentioned in #96219 (comment) is very similar to what's happening in https://bugzilla.redhat.com/1878566 but still investigating if it is the same. Will keep investigating and file a github issue on k/k if needed

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 12, 2020
The e2e topology manager want to test the resource alignment using
devices, and the easiest devices to use are the SRIOV devices at this
moment.

The resource alignment test cases are run for each supported policies,
in a loop.

The tests manage the SRIOV device plugin; up until now, the plugin
was set up and tore down at each loop.
There is no real need for that. Each loop must reconfigure (thus
restart) the kubelet, but the device plugin can set up and tore down
just once for all the policies, thus once.
The kubelet can reconnect just fine to a running device plugin.

This way, we greatly reduce the interactions and the complexity of the
test environment, making it easier to understand and more robust, and
we trim down some minutes from execution time.

However, this patch also hides (not solves) a test flake we observed
on some environment. The issue is hardly reproduceable and not well
understood, but seems caused by doing the sriov dp setup/teardown
in each policy testing loop.
Investigation so far suggests that the kubelet sometimes have a stale
state after the sriovdp teardown/setup cycle, leading to flakes and
false negatives.
We tried to address this in kubernetes#95611
with no conclusive results yet.

This patch was posted because overall we believe this patch gains
exceeds the drawbacks (hiding the aforementioned flake) and
because understanding the potential interaction issues between the
sriovdp and the kubelet deserve a separate test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 13, 2020
@ffromani
Copy link
Contributor Author

/retest

@rphillips
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2020
@sjenning
Copy link
Contributor

sjenning commented Dec 1, 2020

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fromanirh, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2020
@k8s-ci-robot k8s-ci-robot merged commit dcc863a into kubernetes:master Dec 9, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Dec 9, 2020
@ffromani ffromani deleted the tm-e2e-sriovdp-usage branch July 17, 2023 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants