Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote gRPC probe e2e test to Conformance #115856

Merged

Conversation

lanycrost
Copy link
Contributor

What type of PR is this?

/kind cleanup
/area tests
/area conformance

@kubernetes/sig-architecture-pr-reviews @kubernetes/sig-node-pr-reviews @kubernetes/cncf-conformance-wg

What this PR does / why we need it:

adds tests in test/e2e/common/node/container_probe.go:

  • should not be restarted with a GRPC liveness probe.
  • should be restarted with a GRPC liveness probe.

Which issue(s) this PR fixes:

Fixes #115780

Special notes for your reviewer:

I'm thinking about the changing Port type of GRPCAction struct to intstr.IntOrString.

What you think about it?

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 17, 2023
@k8s-ci-robot
Copy link
Contributor

@lanycrost: The label(s) area/tests cannot be applied, because the repository doesn't have them.

In response to this:

What type of PR is this?

/kind cleanup
/area tests
/area conformance

@kubernetes/sig-architecture-pr-reviews @kubernetes/sig-node-pr-reviews @kubernetes/cncf-conformance-wg

What this PR does / why we need it:

adds tests in test/e2e/common/node/container_probe.go:

  • should not be restarted with a GRPC liveness probe.
  • should be restarted with a GRPC liveness probe.

Which issue(s) this PR fixes:

Fixes #115780

Special notes for your reviewer:

I'm thinking about the changing Port type of GRPCAction struct to intstr.IntOrString.

What you think about it?

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/conformance Issues or PRs related to kubernetes conformance tests sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 17, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @lanycrost. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@lanycrost: Reiterating the mentions to trigger a notification:
@kubernetes/sig-architecture-pr-reviews, @kubernetes/sig-node-pr-reviews

In response to this:

What type of PR is this?

/kind cleanup
/area tests
/area conformance

@kubernetes/sig-architecture-pr-reviews @kubernetes/sig-node-pr-reviews @kubernetes/cncf-conformance-wg

What this PR does / why we need it:

adds tests in test/e2e/common/node/container_probe.go:

  • should not be restarted with a GRPC liveness probe.
  • should be restarted with a GRPC liveness probe.

Which issue(s) this PR fixes:

Fixes #115780

Special notes for your reviewer:

I'm thinking about the changing Port type of GRPCAction struct to intstr.IntOrString.

What you think about it?

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 17, 2023
@liggitt
Copy link
Member

liggitt commented Feb 17, 2023

I'm thinking about the changing Port type of GRPCAction struct to intstr.IntOrString.

i'm not sure why that was mentioned in this pull request, but that's not a compatible change to make, and would break existing clients, so we can't do that

@lanycrost
Copy link
Contributor Author

@liggitt I'm using GRPCAction here.

For example look HTTPGetAction and TCPSocketAction also have Port field which type is intstr.IntOrString .

@liggitt
Copy link
Member

liggitt commented Feb 17, 2023

That was an explicit design choice when adding GRPC actions. From https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2727-grpc-probe/README.md#design-details:

Note that GRPCAction.Port is an int32, which is inconsistent with the other existing probe definitions. This is on purpose -- we want to move users away from using the (portNum, portName) union type.

Even if it wasn't intentional and we did want to change it, all API clients built since 1.23.0 would break if the server ever started returning string values in port field of a grpc action, so that wouldn't be a change we could make.

@lanycrost
Copy link
Contributor Author

OK, thanks, got it

@dims
Copy link
Member

dims commented Feb 17, 2023

@lanycrost we don't mark tests as conformance from the get go. we add them as normal tests and check for flakiness over a couple of weeks and then mark them for conformance if appropriate:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md

@bart0sh
Copy link
Contributor

bart0sh commented Feb 19, 2023

/release-note-none
/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 19, 2023
@bart0sh bart0sh added this to Needs Reviewer in SIG Node PR Triage Feb 19, 2023
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Feb 26, 2023

@lanycrost: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-publishing-bot-validate b139ce9ea456c1e25a31dda64fdbd0fbb080fef4 link false /test pull-publishing-bot-validate
check-dependency-stats b139ce9ea456c1e25a31dda64fdbd0fbb080fef4 link false /test check-dependency-stats

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aojea
Copy link
Member

aojea commented Feb 26, 2023

the promoted test is flaking

Kubernetes e2e suite: [It] [sig-node] Probing container should be restarted with a GRPC liveness probe [NodeConformance] [Conformance] expand_less 4m34s
{ failed [FAILED] pod container-probe-7709/test-grpc-87ff2188-1d6f-4a44-8039-3e4e81e88bd3 - expected number of restarts: 1, found restarts: 0
In [It] at: test/e2e/common/node/container_probe.go:1000 @ 02/26/23 08:38:31.981

the condition for a Conformance test is to show stability, this failure has to be analyzed

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 26, 2023
@SergeyKanzhelev
Copy link
Member

@lanycrost are you interested to investigate failures?

Link: https://storage.googleapis.com/k8s-triage/index.html?test=GRPC%20liveness

Failures classified into "In [BeforeEach]" are not interesting. Looking at the mismatched number of restarts section.

In this section, click on occurrences of test failures and on stdout for the gRPC tests.

First example I see - liveness failed, but container doesn't seem to be restarted (or restartCount didn't increment):

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-containerd-gce-master/1630079839400628224

> Enter [BeforeEach] [sig-node] Probing container - set up framework | framework.go:188 @ 02/27/23 06:15:31.542
STEP: Creating a kubernetes client - test/e2e/framework/framework.go:208 @ 02/27/23 06:15:31.542
Feb 27 06:15:31.542: INFO: >>> kubeConfig: /workspace/.kube/config
STEP: Building a namespace api object, basename container-probe - test/e2e/framework/framework.go:247 @ 02/27/23 06:15:31.543
STEP: Waiting for a default service account to be provisioned in namespace - test/e2e/framework/framework.go:256 @ 02/27/23 06:15:31.671
STEP: Waiting for kube-root-ca.crt to be provisioned in namespace - test/e2e/framework/framework.go:259 @ 02/27/23 06:15:31.753
< Exit [BeforeEach] [sig-node] Probing container - set up framework | framework.go:188 @ 02/27/23 06:15:31.835 (293ms)
> Enter [BeforeEach] [sig-node] Probing container - test/e2e/framework/metrics/init/init.go:33 @ 02/27/23 06:15:31.835
< Exit [BeforeEach] [sig-node] Probing container - test/e2e/framework/metrics/init/init.go:33 @ 02/27/23 06:15:31.835 (0s)
> Enter [BeforeEach] [sig-node] Probing container - test/e2e/common/node/container_probe.go:62 @ 02/27/23 06:15:31.835
< Exit [BeforeEach] [sig-node] Probing container - test/e2e/common/node/container_probe.go:62 @ 02/27/23 06:15:31.835 (0s)
> Enter [It] should be restarted with a GRPC liveness probe [NodeConformance] - test/e2e/common/node/container_probe.go:546 @ 02/27/23 06:15:31.835
STEP: Creating pod test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b in namespace container-probe-4834 - test/e2e/common/node/container_probe.go:955 @ 02/27/23 06:15:31.835
Feb 27 06:15:46.235: INFO: Started pod test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b in namespace container-probe-4834
STEP: checking the pod's current state and verifying that restartCount is present - test/e2e/common/node/container_probe.go:966 @ 02/27/23 06:15:46.235
Feb 27 06:15:46.277: INFO: Initial restart count of pod test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b is 0
[FAILED] pod container-probe-4834/test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b - expected number of restarts: 1, found restarts: 0
In [It] at: test/e2e/common/node/container_probe.go:1000 @ 02/27/23 06:19:48.043
< Exit [It] should be restarted with a GRPC liveness probe [NodeConformance] - test/e2e/common/node/container_probe.go:546 @ 02/27/23 06:19:48.043 (4m16.208s)
> Enter [AfterEach] [sig-node] Probing container - test/e2e/framework/node/init/init.go:33 @ 02/27/23 06:19:48.043
Feb 27 06:19:48.043: INFO: Waiting up to 3m0s for all (but 3) nodes to be ready
< Exit [AfterEach] [sig-node] Probing container - test/e2e/framework/node/init/init.go:33 @ 02/27/23 06:19:48.095 (52ms)
> Enter [DeferCleanup (Each)] [sig-node] Probing container - test/e2e/common/node/container_probe.go:951 @ 02/27/23 06:19:48.095
STEP: deleting the pod - test/e2e/common/node/container_probe.go:952 @ 02/27/23 06:19:48.095
< Exit [DeferCleanup (Each)] [sig-node] Probing container - test/e2e/common/node/container_probe.go:951 @ 02/27/23 06:19:48.177 (82ms)
> Enter [DeferCleanup (Each)] [sig-node] Probing container - test/e2e/framework/metrics/init/init.go:35 @ 02/27/23 06:19:48.177
< Exit [DeferCleanup (Each)] [sig-node] Probing container - test/e2e/framework/metrics/init/init.go:35 @ 02/27/23 06:19:48.177 (0s)
> Enter [DeferCleanup (Each)] [sig-node] Probing container - dump namespaces | framework.go:206 @ 02/27/23 06:19:48.177
STEP: dump namespace information after failure - test/e2e/framework/framework.go:284 @ 02/27/23 06:19:48.177
STEP: Collecting events from namespace "container-probe-4834". - test/e2e/framework/debug/dump.go:42 @ 02/27/23 06:19:48.177
STEP: Found 6 events. - test/e2e/framework/debug/dump.go:46 @ 02/27/23 06:19:48.22
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:15:31 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {default-scheduler } Scheduled: Successfully assigned container-probe-4834/test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b to e2e-42e9cafe98-734c0-windows-node-group-xgw0
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:15:39 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {kubelet e2e-42e9cafe98-734c0-windows-node-group-xgw0} Pulled: Container image "registry.k8s.io/e2e-test-images/agnhost:2.43" already present on machine
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:15:40 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {kubelet e2e-42e9cafe98-734c0-windows-node-group-xgw0} Created: Created container agnhost
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:15:44 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {kubelet e2e-42e9cafe98-734c0-windows-node-group-xgw0} Started: Started container agnhost
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:16:57 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {kubelet e2e-42e9cafe98-734c0-windows-node-group-xgw0} Unhealthy: Liveness probe failed: timeout: failed to connect service "10.64.2.178:2333" within 5s: context deadline exceeded
Feb 27 06:19:48.220: INFO: At 2023-02-27 06:16:57 +0000 UTC - event for test-grpc-ee2a0179-f463-4fbc-80f6-3ffa03fbbb9b: {kubelet e2e-42e9cafe98-734c0-windows-node-group-xgw0} Killing: Container agnhost failed liveness probe, will be restarted

Looks like this is a windows Node. Looking at the kubelet.log (artifact link from the test failure). Kubelet log shows that the event was detected, but kubelet ignored the need for restart.

So it seems like gRPC worked fine, the issue is in Windows kubelet itself.

Looking at other types of probes: https://storage.googleapis.com/k8s-triage/index.html?test=liveness It is indeed the case

@SergeyKanzhelev
Copy link
Member

@lanycrost please find the linux test grid demonstrating the all green execution. I think all probe failures needs to be investigated separately. No need to block this KEP release on overall problems with Windows.

@SergeyKanzhelev
Copy link
Member

Filed: #116123

@lanycrost
Copy link
Contributor Author

@SergeyKanzhelev yes, sure. I will try to understand what's going on here to get it work correctly.

@SergeyKanzhelev
Copy link
Member

@aojea one example of non-flaking: https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master&include-filter-by-regex=GRPC

As I pointed out before - flakes are hapenning on all probe types and not related to gRPC. I'd suggest we unblock this PR.

@aojea
Copy link
Member

aojea commented Feb 28, 2023

@aojea one example of non-flaking: https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master&include-filter-by-regex=GRPC

As I pointed out before - flakes are hapenning on all probe types and not related to gRPC. I'd suggest we unblock this PR.

Is about the stability of the test on the monitored jobs, blocking and informing, https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness , the triage link picks a lot of errors and most of them are usually noise from jobs that are not stable.

These tests in presubmits and periodic release-blocking and release-informing are stable and monitored daily, do we know why this test failed https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/115856/pull-kubernetes-conformance-kind-ga-only-parallel/1629756997509320704

#115856 (comment)

That the test that is going to be promoted as conformance is just the one that fails in presubmit is at least something that deserve to be explained before merge, or we are being incoherent with ourselves allowing to merge with a flake

@SergeyKanzhelev
Copy link
Member

The failure in below is:

Feb 26 08:38:32.021: INFO: At 2023-02-26 08:35:10 +0000 UTC - event for test-grpc-87ff2188-1d6f-4a44-8039-3e4e81e88bd3: {kubelet kind-worker} Unhealthy: Liveness probe errored: missing probe handler for test-grpc-87ff2188-1d6f-4a44-8039-3e4e81e88bd3_container-probe-7709(a0f8d60d-50c8-485d-8db3-5482133f747e):agnhost

Seems like the ordering needs to change. So the GA should happen before we promoting to Conformance. I will submit a PR to GA the feature. I treated promotion to Conformance as a prerequisite

@SergeyKanzhelev SergeyKanzhelev moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Mar 1, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs - Needs Approver in SIG Node CI/Test Board Mar 1, 2023
@SergeyKanzhelev
Copy link
Member

this is done: #116233

I wonder if tests will just pick it up or you need to rebase. I'd suggest rebasing to be on a safe side

@aojea
Copy link
Member

aojea commented Mar 4, 2023

I wonder if tests will just pick it up or you need to rebase. I'd suggest rebasing to be on a safe side

prow rebases, github actions does not (at least last time I checked)

/test pull-kubernetes-conformance-kind-ga-only-parallel

@SergeyKanzhelev
Copy link
Member

All tests are passing now. Time to merge.

@dims
Copy link
Member

dims commented Mar 6, 2023

@SergeyKanzhelev thanks!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, lanycrost, saschagrunert, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lanycrost
Copy link
Contributor Author

lanycrost commented Mar 9, 2023

@SergeyKanzhelev seems we need to remove do-not-merge/hold label

@lanycrost
Copy link
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2023
@k8s-ci-robot k8s-ci-robot merged commit da87af6 into kubernetes:master Mar 9, 2023
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done Mar 9, 2023
SIG Node PR Triage automation moved this from Needs Approver to Done Mar 9, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Mar 9, 2023
@lanycrost lanycrost restored the e2e-115780-grpc-probe-tests branch April 18, 2023 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/kubectl area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

Promote gRPC probe tests to conformance
9 participants