Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

Merged
merged 1 commit into from
Mar 7, 2021

Conversation

oshoval
Copy link
Contributor

@oshoval oshoval commented Feb 10, 2021

In case running multi sriov jobs on CI,
transient warnings such as the following, occur [1]
unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
It happens only when there are 2+ jobs at the same time, and if each of the job runs
2 VMs at the same time.
In this case it happens to around 30% of the jobs.

Kubevirt itself has a re-enqueuing mechanism for this kind of warnings,
so their effect is transient, and the test would pass once continued.

Add a mechanism to allow switching selected warnings
to be non fatal.
Use this mechanism in order to update this warning severity
to log only instead of failing the test.

There are some open issues that point
its related to resource extensive usage [2] (tried the suggested method):
once there are lots of vms spinning, each vm needs its own cpu quota,
1 cpu goes to kube-reserve (0.5) and system-reserve (0.5) according kind config
which leaves us with 1 cpu total (i tried allocating 2 cpus total for the cluster),
and it seems its not enough.

See also [3] about open issues with system reserve,
and why its better to not try and solve it at this very moment, for kind.

Once we bump k8s to 1.19 in kind it might help, unless it affects only windows [4].

The timeout happens between the virt-handler and virt-launcher gRPC.
See [5] for more info.

[1]
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563

[2] etcd-io/etcd#12234 (comment)
[3] kubernetes/kubernetes#72881
[4] kubernetes/kubernetes#95735 (comment)

[5] #5027

Signed-off-by: Or Shoval oshoval@redhat.com

None

@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/S release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 10, 2021
@oshoval
Copy link
Contributor Author

oshoval commented Feb 10, 2021

/cc @EdDev @qinqon

@EdDev
Copy link
Member

EdDev commented Feb 10, 2021

Ignoring warnings may mask real issues and we do not want that.
The reconciliation mechanism is there to recover from problems, but in tests we are suppose to control those or know what to expect explicitly.

If the specific warning that is detected is an acceptable one, then only it should be accepted and nothing more.
A detail explanation of what it is and why we can accept it under a specific condition is to be given (it has been given here partially, mainly through links).

Please also include the failure percentage due to this issue.

@oshoval
Copy link
Contributor Author

oshoval commented Feb 10, 2021

Ignoring warnings may mask real issues and we do not want that.
The reconciliation mechanism is there to recover from problems, but in tests we are suppose to control those or know what to expect explicitly.

If the specific warning that is detected is an acceptable one, then only it should be accepted and nothing more.
A detail explanation of what it is and why we can accept it under a specific condition is to be given (it has been given here partially, mainly through links).

Please also include the failure percentage due to this issue.

I think in this case we can create that mechanism in a following PR, and focus atm on the stuff that is more important (at least in my view).
Running several clusters with several VMs at the same time is resource extensive, so this specific warning occurs (will elaborate more after deeper research upon need).
About masking other warnings, the fact that there is retry mechanism in kubevirt and that e2e already have an option to ignore them, and that those warnings are transient, means imho, that we shouldn't be blocked in this very moment for presenting a new mechanism that filter specific warnings.
We will do it, but after the current effort is done if possible please.
@phoracek @EdDev @qinqon
Wdyt please, can we postpone the mechanism for a following PR, and focus on the critical path (and the right one imho) ?

Its just a warning, it should not fail the test, a better approach maybe is just to show the events in case the tests fails (unless it already shows the events) imo.

Other tests ignore warnings as well, and sriov meant to test SRIOV, not the VM spinning.
Moreover we are doing some relaxations such as etcd in ramFs, because its kind.
I think this change is also legit, at least at phase 1.
Doing phase 2 in the right time will allow us to come more prepared for it usually.

EDIT:
Moreover warnings should not fail the tests, its not an error.
Something that worth investing imo, is to add a mechanism to print the events, unless there is one already
and then in case a warning occurred and the test failed eventually, the warning will be visible there.
Update:
We do have the warning already in artifacts events file

           "reason": "SyncFailed",
            "message": "unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
            "source": {
                "component": "virt-handler",
                "host": "sriov-worker"
            },

The mechanism to ignoreWarnings existed before this PR, and used in several e2e test files.
WaitForSuccessfulVMIStartIgnoreWarnings
https://github.com/doc22940/kubevirt/search?q=WaitForSuccessfulVMIStartIgnoreWarnings
The mechanism:
https://github.com/doc22940/kubevirt/blob/ad8b5db41598834539be5ab83c12c0df099b4714/tests/utils.go#L2367

This PR suggest to ignore the warning, but it will still log it, and have it in the events (it checks for warning in parallel if requested else it just log the warning, and assert for errors, only thing that is worthy imo is to double check that its indeed in parallel).
https://github.com/doc22940/kubevirt/blob/ad8b5db41598834539be5ab83c12c0df099b4714/tests/utils.go#L335

@oshoval oshoval changed the title sriov, Ignore VM transient warnings sriov, Ignore SyncVMI DeadlineExceeded warning Feb 11, 2021
@oshoval
Copy link
Contributor Author

oshoval commented Feb 11, 2021

Addressed comments

@oshoval
Copy link
Contributor Author

oshoval commented Feb 12, 2021

/hold
We can still review please, but i will release the hold once its checked with new code of multi sriov on CI.
(i did check locally but with injection)

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 12, 2021
@oshoval
Copy link
Contributor Author

oshoval commented Feb 14, 2021

/hold cancel
tested on CI

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2021
@oshoval oshoval changed the title sriov, Ignore SyncVMI DeadlineExceeded warning sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests Feb 15, 2021
Copy link
Contributor

@qinqon qinqon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass, is good

tests/utils.go Show resolved Hide resolved
tests/utils.go Outdated Show resolved Hide resolved
tests/utils.go Outdated Show resolved Hide resolved
@oshoval
Copy link
Contributor Author

oshoval commented Feb 21, 2021

Addressed the resolved comments, (with little additional refactor)
Thanks

@oshoval oshoval requested a review from qinqon February 23, 2021 10:59
@oshoval
Copy link
Contributor Author

oshoval commented Feb 24, 2021

/retest

1 similar comment
@oshoval
Copy link
Contributor Author

oshoval commented Feb 24, 2021

/retest

Copy link
Contributor

@qinqon qinqon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good now

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2021
@oshoval
Copy link
Contributor Author

oshoval commented Feb 24, 2021

Thanks @qinqon @EdDev

@oshoval
Copy link
Contributor Author

oshoval commented Feb 24, 2021

/retest

1 similar comment
@oshoval
Copy link
Contributor Author

oshoval commented Feb 24, 2021

/retest

@oshoval
Copy link
Contributor Author

oshoval commented Mar 3, 2021

/assign @enp0s3

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
In case running multi sriov jobs on CI,
transient errors such as [1]
"unknown error encountered sending command SyncVMI:
rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Kubevirt itself has a re-enqueuing mechanism for this kind of errors.
Kubevirt e2e tests already have a mechanism to ignore warnings as well,
used on some of the e2e tests.

We are using Kind with a non official mode, DinD (and even with ramFS for
the etcd).
As such, according the community there isn't a best practice
for this kind of problem, and there are some open issues that point
to resource extensive usage.

Add a mechanism to allow switching selected warnings
to be non fatal.
Use this mechanism in order to update this warning severity
to log only instead of failing the test.

[1]
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563

See kubevirt#5027

Signed-off-by: Or Shoval <oshoval@redhat.com>
Copy link
Contributor

@enp0s3 enp0s3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! looks good, please see my comments below.
A general concern is that we don't have unit tests for our test tooling and we are adding more
functionality, but let's leave this task for a separate PR.

tests/utils.go Show resolved Hide resolved
tests/utils.go Show resolved Hide resolved
tests/utils.go Show resolved Hide resolved
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2021
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
@oshoval
Copy link
Contributor Author

oshoval commented Mar 3, 2021

Rebased

@oshoval oshoval requested review from qinqon and enp0s3 March 3, 2021 14:09
Copy link
Contributor

@enp0s3 enp0s3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you
/approve

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2021
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021
@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments
@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants