OCPBUGS-17359: test/e2e: Don't use openshift/origin-node #970

Conversation

@Miciah (Contributor) commented Aug 4, 2023

test/e2e: Don't use "openshift/origin-node" image

Use the "openshift/tools" image from the cluster image registry instead of using the "openshift/origin-node" image pullspec in E2E tests.

Before this change, the E2E tests were inadvertently pulling the "openshift/origin-node" image from Docker Hub and getting rate-limited.

The choice to use "openshift/tools" is based on a similar change here: openshift/origin@4cbb844

Follow-up to #410 and #451.

  • test/e2e/util_test.go (buildEchoPod, buildSlowHTTPDPod): Replace the "openshift/origin-node" image pullspec with "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest".
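
As a concrete illustration of the change, a minimal sketch of the updated helper follows. Only the image pullspec is taken from this PR; the function shape, container name, and other fields are assumptions, not the repository's exact code.

package e2e

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// toolsImage is the in-cluster pullspec that replaces "openshift/origin-node",
// whose bare pullspec the container runtime resolved to Docker Hub, where
// pulls were being rate-limited.
const toolsImage = "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"

// buildEchoPod is sketched with an illustrative shape; only the Image value
// reflects this PR.
func buildEchoPod(name, namespace string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "echo",
				Image: toolsImage,
			}},
		},
	}
}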

TestHstsPolicyWorks: Dump events if test fails

  • test/e2e/hsts_policy_test.go (TestHstsPolicyWorks): Dump events in case of test failure, using the new dumpEventsInNamespace helper.
  • test/e2e/util_test.go (dumpEventsInNamespace): New helper function to log all events in a namespace.
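
For a sense of the helper's shape, here is a hedged sketch; the actual implementation in util_test.go may differ, for example by using the operator's controller-runtime client rather than the client-go interface assumed here.

package e2e

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpEventsInNamespace logs every event in the given namespace so that a
// failed run records, for example, why an image pull was denied.
func dumpEventsInNamespace(t *testing.T, kc kubernetes.Interface, namespace string) {
	t.Helper()
	events, err := kc.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Logf("failed to list events in namespace %q: %v", namespace, err)
		return
	}
	for _, e := range events.Items {
		t.Logf("%s %s %s/%s: %s", e.LastTimestamp, e.Reason, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
}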

TestHstsPolicyWorks: Wait for namespace to be provisioned

When creating a new namespace for the TestHstsPolicyWorks test, wait for the "default" ServiceAccount and the "system:image-pullers" RoleBinding to be provisioned in the newly created namespace before proceeding with the test. Make a similar change for the TestMTLSWithCRLsCerts test.

Before this change, TestHstsPolicyWorks sometimes failed because it tried to create a pod before the ServiceAccount had been provisioned and granted access to pull images. As a result, the test would randomly fail with the following error:

Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = reading manifest

This change should prevent such failures.

Because TestMTLSWithCRLsCerts also creates a namespace and then creates pods in this namespace, this PR makes the same change to this test as well. Some other tests create namespaces but do not create pods in those namespaces; those tests do not necessarily need to wait for the ServiceAccount and RoleBinding.

Inspired by openshift/origin@877c652.

  • test/e2e/client_tls_test.go (TestMTLSWithCRLs):
  • test/e2e/hsts_policy_test.go (TestHstsPolicyWorks): Use the new createNamespace helper.
  • test/e2e/util_test.go (createNamespace): New helper function. Create a namespace with the specified name, register a cleanup handler to delete the namespace when the test finishes, wait for the "default" ServiceAccount and "system:image-pullers" RoleBinding to be created, and return the namespace.
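
A sketch of what such a helper could look like follows; the client-go interface, polling interval, and one-minute timeout are illustrative assumptions, and the real helper's signature may differ.

package e2e

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// createNamespace creates the named namespace, registers a cleanup handler to
// delete it when the test finishes, and blocks until the "default"
// ServiceAccount and the "system:image-pullers" RoleBinding exist, so that
// pods created immediately afterward can pull from the internal registry.
func createNamespace(t *testing.T, kc kubernetes.Interface, name string) *corev1.Namespace {
	t.Helper()
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
	ns, err := kc.CoreV1().Namespaces().Create(context.TODO(), ns, metav1.CreateOptions{})
	if err != nil {
		t.Fatalf("failed to create namespace %q: %v", name, err)
	}
	t.Cleanup(func() {
		_ = kc.CoreV1().Namespaces().Delete(context.TODO(), name, metav1.DeleteOptions{})
	})
	err = wait.PollImmediate(time.Second, time.Minute, func() (bool, error) {
		if _, err := kc.CoreV1().ServiceAccounts(name).Get(context.TODO(), "default", metav1.GetOptions{}); err != nil {
			return false, nil // not provisioned yet; keep polling
		}
		if _, err := kc.RbacV1().RoleBindings(name).Get(context.TODO(), "system:image-pullers", metav1.GetOptions{}); err != nil {
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("timed out waiting for namespace %q to be provisioned: %v", name, err)
	}
	return ns
}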

@openshift-ci-robot added the jira/severity-important (referenced Jira bug's severity is important for the branch this PR is targeting), jira/valid-reference (this PR references a valid Jira ticket of any type), and jira/valid-bug (the referenced Jira bug is valid for the branch this PR is targeting) labels on Aug 4, 2023.
@openshift-ci-robot (Contributor)

@Miciah: This pull request references Jira Issue OCPBUGS-17359, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@candita (Contributor) commented Aug 4, 2023

/retest-required

@frobware (Contributor) commented Aug 4, 2023

/lgtm
/approve

@openshift-ci bot added the lgtm (PR is ready to be merged) and approved (PR has been approved by an approver from all required OWNERS files) labels on Aug 4, 2023.
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 4e7b2da and 2 for PR HEAD 01c2d8b in total

@candita (Contributor) commented Aug 4, 2023

hsts_policy_test.go:147: failed to find header [max-age=0;preload;includesubdomains]: timed out waiting for the condition

/test e2e-azure-operator

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 833bc28 and 1 for PR HEAD 01c2d8b in total

@Miciah (Contributor, Author) commented Aug 4, 2023

Using a Cluster Bot cluster to run TestHstsPolicyWorks, I see that the kubelet is failing to pull "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" for the "hsts-policy-echo" pod:

Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       48s                default-scheduler  Successfully assigned hsts-policy-namespace/hsts-policy-echo to ip-10-0-26-86.ec2.internal
  Normal   AddedInterface  48s                multus             Add eth0 [10.129.2.3/23] from ovn-kubernetes
  Normal   BackOff         19s (x3 over 47s)  kubelet            Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
  Warning  Failed          19s (x3 over 47s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling         5s (x3 over 48s)   kubelet            Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
  Warning  Failed          5s (x3 over 48s)   kubelet            Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = reading manifest
latest in image-registry.openshift-image-registry.svc:5000/openshift/tools: authentication required
  Warning  Failed          5s (x3 over 48s)   kubelet            Error: ErrImagePull

@Miciah (Contributor, Author) commented Aug 5, 2023

Hm, sometimes the test is able to pull the image, and sometimes it gets that auth error, during repeated tests on the same cluster.

@Miciah (Contributor, Author) commented Aug 5, 2023

Watching oc -n openshift-image-registry logs -ldocker-registry=default --tail=0, the image registry is sometimes denying the image pull during one test run:

time="2023-08-05T00:26:31.931724609Z" level=error msg="OpenShift access denied: no opinion" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=403cb81a-1e08-42bf-b3c3-c65f57d2b619 http.request.method=GET http.request.remoteaddr="100.64.0.7:39038" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" openshift.auth.user=anonymous vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:26:31.93178121Z" level=warning msg="error authorizing context: access denied" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=403cb81a-1e08-42bf-b3c3-c65f57d2b619 http.request.method=GET http.request.remoteaddr="100.64.0.7:39038" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:26:31.931848511Z" level=info msg=response go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=20c2a054-f06d-4f62-8a0c-b4e5185a6284 http.request.method=GET http.request.remoteaddr="100.64.0.7:39038" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=2.437939ms http.response.status=401 http.response.written=158
time="2023-08-05T00:26:48.31509818Z" level=error msg="OpenShift access denied: no opinion" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=c775a56a-3125-4211-9a5a-a8493e2ac309 http.request.method=GET http.request.remoteaddr="100.64.0.7:48090" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" openshift.auth.user=anonymous vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:26:48.315145541Z" level=warning msg="error authorizing context: access denied" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=c775a56a-3125-4211-9a5a-a8493e2ac309 http.request.method=GET http.request.remoteaddr="100.64.0.7:48090" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:26:48.315177572Z" level=info msg=response go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=1d4f1da9-9f01-40ea-9b21-571b7d5d691d http.request.method=GET http.request.remoteaddr="100.64.0.7:48090" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=2.396133ms http.response.status=401 http.response.written=158

And then the image registry is allowing the pull in the next run:

time="2023-08-05T00:27:05.807299045Z" level=info msg="authorized request" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=a2a449f4-ce12-4f30-80a9-b9f050782826 http.request.method=GET http.request.remoteaddr="100.64.0.7:58848" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" openshift.auth.user="system:serviceaccount:hsts-policy-namespace:default" vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:27:05.926022483Z" level=info msg="response completed" go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=a2a449f4-ce12-4f30-80a9-b9f050782826 http.request.method=GET http.request.remoteaddr="100.64.0.7:58848" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" http.response.contenttype=application/vnd.docker.distribution.manifest.v2+json http.response.duration=123.8248ms http.response.status=200 http.response.written=1252 openshift.auth.user="system:serviceaccount:hsts-policy-namespace:default" vars.name=openshift/tools vars.reference=latest
time="2023-08-05T00:27:05.926060873Z" level=info msg=response go.version="go1.20.5 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=8bfe698c-5476-48bc-b740-0c763cce1b30 http.request.method=GET http.request.remoteaddr="100.64.0.7:58848" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.27.1-4.rhaos4.14.gitab7845e.el9 go/go1.20.5 os/linux arch/amd64" http.response.contenttype=application/vnd.docker.distribution.manifest.v2+json http.response.duration=123.883201ms http.response.status=200 http.response.written=1252

@Miciah (Contributor, Author) commented Aug 5, 2023

The successful request has openshift.auth.user="system:serviceaccount:hsts-policy-namespace:default", and the failing request has openshift.auth.user=anonymous. 🤔...

@openshift-ci bot removed the lgtm label on Aug 5, 2023.
@openshift-ci-robot (Contributor)

@Miciah: This pull request references Jira Issue OCPBUGS-17359, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@Miciah (Contributor, Author) commented Aug 5, 2023

The two new commits seem to prevent the "authentication required" errors. In manual testing with these changes, I have seen the test pass over 20 times and fail 0 times.

@Miciah (Contributor, Author) commented Aug 5, 2023

e2e-aws-ovn-serial failed because the "Undiagnosed panic detected in pod" test failed:

{  pods/openshift-cloud-network-config-controller_cloud-network-config-controller-c5f776f49-j76xk_controller_previous.log.gz:E0805 03:33:08.727528       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}

I believe this is a known issue: OCPBUGS-17151.

Also, the "[sig-storage] PersistentVolumes-local Stress with local volumes [Serial] should be able to process many pods and reuse local volumes" test failed:

{  fail [test/e2e/storage/persistent_volumes-local.go:522]: persistentvolumes "local-pvkbdlx" not found
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

This is also a known issue: OCPBUGS-14930.
/test e2e-aws-ovn-serial

e2e-aws-operator failed because must-gather failed.
/test e2e-aws-operator

Use the "openshift/tools" image from the cluster image registry instead of
using the "openshift/origin-node" image pullspec in E2E tests.

Before this commit, the E2E tests were inadvertently pulling the
"openshift/origin-node" image from Docker Hub and getting rate-limited.

The choice to use "openshift/tools" is based on a similar change here:
openshift/origin@4cbb844

Follow-up to commit 167bcc2
and commit a635566.

This commit fixes OCPBUGS-17359.

https://issues.redhat.com/browse/OCPBUGS-17359

* test/e2e/util_test.go (buildEchoPod, buildSlowHTTPDPod): Replace the
"openshift/origin-node" image pullspec with
"image-registry.openshift-image-registry.svc:5000/openshift/tools:latest".
@Miciah force-pushed the OCPBUGS-17359-test-slash-e2e-don't-use-openshift-slash-origin-node branch from 839d1b2 to 4216df0 on August 9, 2023 15:15.
@Miciah (Contributor, Author) commented Aug 9, 2023

e2e-aws-ovn-serial failed with the same failure for the "[sig-storage] PersistentVolumes-local Stress with local volumes [Serial] should be able to process many pods and reuse local volumes" test:

{  fail [test/e2e/storage/persistent_volumes-local.go:522]: persistentvolumes "local-pvd2fz8" not found
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

@Miciah (Contributor, Author) commented Aug 10, 2023

e2e-aws-operator failed because must-gather failed.
/test e2e-aws-operator

e2e-gcp-operator failed because TestAllowedSourceRanges, TestAllowedSourceRangesStatus, TestInternalLoadBalancer, and TestUserDefinedIngressController failed. It appears that these tests failed because it took too long for the LB that each test creates to get created or updated. For example, the TestInternalLoadBalancer test creates an IngressController named "testinternalloadbalancer" that requests an LB, and the test then runs a polling loop with a 5-minute timeout waiting for the LB to be provisioned, but the LB is actually taking more than 5 minutes to become ready, counting from the time the pod is scheduled to the time the LB is reported ready:

% jq < events.json -c '.items|sort_by(.metadata.creationTimestamp)|.[]|select(.involvedObject.namespace=="openshift-ingress" and (.reason=="Scheduled" or .reason=="EnsuredLoadBalancer"))|[.metadata.creationTimestamp,.involvedObject.kind,.involvedObject.name,.reason]' | grep -w -e testinternalloadbalancer    
["2023-08-09T16:14:03Z","Pod","router-testinternalloadbalancer-84f74cb78b-qx9xq","Scheduled"]
["2023-08-09T16:19:29Z","Service","router-testinternalloadbalancer","EnsuredLoadBalancer"]

Similarly, TestUserDefinedIngressController creates an IngressController named "testuserdefinedingresscontroller" and waits 5 minutes for it to be ready, but it takes more than 5 minutes to become ready:

% jq < events.json -c '.items|sort_by(.metadata.creationTimestamp)|.[]|select(.involvedObject.namespace=="openshift-ingress" and (.reason=="Scheduled" or .reason=="EnsuredLoadBalancer"))|[.metadata.creationTimestamp,.involvedObject.kind,.involvedObject.name,.reason]' | grep -w -e testuserdefinedingresscontroller
["2023-08-09T16:16:50Z","Pod","router-testuserdefinedingresscontroller-78bb97f5b6-675cl","Scheduled"]
["2023-08-09T16:22:13Z","Service","router-testuserdefinedingresscontroller","EnsuredLoadBalancer"]

The TestAllowedSourceRangesStatus test likewise creates an IngressController named "sourcerangesstatus", waits 5 minutes, then modifies configuration related to the LB and waits for the update:

% jq < events.json -c '.items|sort_by(.metadata.creationTimestamp)|.[]|select(.involvedObject.namespace=="openshift-ingress" and (.reason=="Scheduled" or .reason=="EnsuredLoadBalancer"))|[.metadata.creationTimestamp,.involvedObject.kind,.involvedObject.name,.reason]' | grep -w -e sourcerangesstatus
["2023-08-09T16:11:00Z","Pod","router-sourcerangesstatus-9cd78d588-dqbsv","Scheduled"]
["2023-08-09T16:16:11Z","Service","router-sourcerangesstatus","EnsuredLoadBalancer"]
["2023-08-09T16:32:03Z","Pod","router-sourcerangesstatus-9cd78d588-z88pw","Scheduled"]
["2023-08-09T16:32:53Z","Service","router-sourcerangesstatus","EnsuredLoadBalancer"]

TestAllowedSourceRanges, in contrast, creates a deliberately misconfigured IngressController named "sourcerange", so the IngressController doesn't get provisioned at first; the test then updates the IngressController with valid configuration so that the LB can be provisioned and waits 1 minute for it, but the LB takes over a minute to become ready:

% jq < events.json -c '.items|sort_by(.metadata.creationTimestamp)|.[]|select(.involvedObject.namespace=="openshift-ingress" and (.reason=="Scheduled" or .reason=="EnsuredLoadBalancer"))|[.metadata.creationTimestamp,.involvedObject.kind,.involvedObject.name,.reason]' | grep -w -e sourcerange
["2023-08-09T16:11:01Z","Pod","router-sourcerange-7b886595c8-prkt6","Scheduled"]
["2023-08-09T16:32:04Z","Pod","router-sourcerange-7b886595c8-cqhhz","Scheduled"]
["2023-08-09T16:33:44Z","Service","router-sourcerange","EnsuredLoadBalancer"]

We might need to adjust these timeouts, but if LBs are consistently taking longer to provision than they used to, we should also look into that (a sketch of a more lenient polling loop follows the logs below). Looking at the kube-controller-manager logs, there is a rather conspicuous 3m39s gap from 16:14:07 to 16:17:46, between 'Error syncing endpoint slices for service "openshift-ingress/router-testinternalloadbalancer", retrying. Error: EndpointSlice informer cache is out of date' and 'Ensuring load balancer for service openshift-ingress/router-testinternalloadbalancer', and then a somewhat conspicuous 1m43s gap from 16:17:46 to 16:19:29 before 'Ensured load balancer':

% grep -h -e router-testinternalloadbalancer -- namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-*/kube-controller-manager/kube-controller-manager/logs/*.log 
2023-08-09T16:14:03.482683918Z I0809 16:14:03.482609       1 replica_set.go:571] "Too few replicas" replicaSet="openshift-ingress/router-testinternalloadbalancer-84f74cb78b" need=1 creating=1
2023-08-09T16:14:03.483178036Z I0809 16:14:03.483086       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer" fieldPath="" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set router-testinternalloadbalancer-84f74cb78b to 1"
2023-08-09T16:14:03.502208701Z I0809 16:14:03.502132       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer-84f74cb78b" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: router-testinternalloadbalancer-84f74cb78b-qx9xq"
2023-08-09T16:14:03.512974851Z I0809 16:14:03.512911       1 deployment_controller.go:503] "Error syncing deployment" deployment="openshift-ingress/router-testinternalloadbalancer" err="Operation cannot be fulfilled on deployments.apps \"router-testinternalloadbalancer\": the object has been modified; please apply your changes to the latest version and try again"
2023-08-09T16:14:07.057662221Z I0809 16:14:07.057585       1 replica_set.go:461] ReplicaSet "router-testinternalloadbalancer-84f74cb78b" will be enqueued after 30s for availability check
2023-08-09T16:14:07.082018038Z W0809 16:14:07.081951       1 endpointslice_controller.go:297] Error syncing endpoint slices for service "openshift-ingress/router-testinternalloadbalancer", retrying. Error: EndpointSlice informer cache is out of date
2023-08-09T16:17:46.872430243Z I0809 16:17:46.871676       1 controller.go:388] Ensuring load balancer for service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:17:46.872430243Z I0809 16:17:46.871722       1 controller.go:887] Adding finalizer to service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:17:46.872511683Z I0809 16:17:46.872423       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
2023-08-09T16:19:03.442314693Z I0809 16:19:03.442087       1 deployment_controller.go:597] "Deployment has been deleted" deployment="openshift-ingress/router-testinternalloadbalancer"
2023-08-09T16:19:03.442314693Z I0809 16:19:03.442161       1 garbagecollector.go:533] "Processing item" item="[monitoring.coreos.com/v1/ServiceMonitor, namespace: openshift-ingress, name: router-testinternalloadbalancer, uid: 56f210f1-9a74-46ba-8780-666834feae3b]" virtual=false
2023-08-09T16:19:03.442314693Z I0809 16:19:03.442249       1 garbagecollector.go:533] "Processing item" item="[apps/v1/ReplicaSet, namespace: openshift-ingress, name: router-testinternalloadbalancer-84f74cb78b, uid: 9ce1e076-3b17-4be1-a1d7-e1e40ba638dd]" virtual=false
2023-08-09T16:19:03.442314693Z I0809 16:19:03.442270       1 garbagecollector.go:533] "Processing item" item="[v1/Service, namespace: openshift-ingress, name: router-testinternalloadbalancer, uid: b4b97aca-b006-4ca4-a6aa-464896304086]" virtual=false
2023-08-09T16:19:03.465034758Z I0809 16:19:03.464471       1 garbagecollector.go:672] "Deleting item" item="[monitoring.coreos.com/v1/ServiceMonitor, namespace: openshift-ingress, name: router-testinternalloadbalancer, uid: 56f210f1-9a74-46ba-8780-666834feae3b]" propagationPolicy=Background
2023-08-09T16:19:03.465368730Z I0809 16:19:03.464575       1 garbagecollector.go:672] "Deleting item" item="[v1/Service, namespace: openshift-ingress, name: router-testinternalloadbalancer, uid: b4b97aca-b006-4ca4-a6aa-464896304086]" propagationPolicy=Background
2023-08-09T16:19:03.467089357Z I0809 16:19:03.466716       1 garbagecollector.go:672] "Deleting item" item="[apps/v1/ReplicaSet, namespace: openshift-ingress, name: router-testinternalloadbalancer-84f74cb78b, uid: 9ce1e076-3b17-4be1-a1d7-e1e40ba638dd]" propagationPolicy=Background
2023-08-09T16:19:03.486778335Z I0809 16:19:03.486567       1 garbagecollector.go:533] "Processing item" item="[v1/Pod, namespace: openshift-ingress, name: router-testinternalloadbalancer-84f74cb78b-qx9xq, uid: 2cdf06a1-aff2-45f4-ace4-9a07a829afa1]" virtual=false
2023-08-09T16:19:03.519219754Z I0809 16:19:03.508957       1 garbagecollector.go:672] "Deleting item" item="[v1/Pod, namespace: openshift-ingress, name: router-testinternalloadbalancer-84f74cb78b-qx9xq, uid: 2cdf06a1-aff2-45f4-ace4-9a07a829afa1]" propagationPolicy=Background
2023-08-09T16:19:04.660860221Z I0809 16:19:04.660794       1 garbagecollector.go:533] "Processing item" item="[apps/v1/Deployment, namespace: openshift-ingress, name: router-testinternalloadbalancer, uid: ef10ff41-d3b2-4e94-a64e-e3691694f65e]" virtual=true
2023-08-09T16:19:29.346434716Z I0809 16:19:29.346309       1 controller.go:928] Patching status for service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:19:29.346838911Z I0809 16:19:29.346792       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
2023-08-09T16:23:14.022928645Z I0809 16:23:14.022042       1 controller.go:369] Deleting existing load balancer for service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:23:14.022928645Z I0809 16:23:14.022684       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletingLoadBalancer" message="Deleting load balancer"
2023-08-09T16:24:04.000418556Z I0809 16:24:04.000364       1 deployment_controller.go:597] "Deployment has been deleted" deployment="openshift-ingress/router-testinternalloadbalancer"
2023-08-09T16:24:20.865370064Z I0809 16:24:20.865241       1 controller.go:902] Removing finalizer from service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:24:20.885177033Z I0809 16:24:20.885124       1 controller.go:928] Patching status for service openshift-ingress/router-testinternalloadbalancer
2023-08-09T16:24:20.886273090Z I0809 16:24:20.886221       1 event.go:307] "Event occurred" object="openshift-ingress/router-testinternalloadbalancer" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletedLoadBalancer" message="Deleted load balancer"
2023-08-09T16:24:20.886709000Z I0809 16:24:20.886664       1 garbagecollector.go:533] "Processing item" item="[discovery.k8s.io/v1/EndpointSlice, namespace: openshift-ingress, name: router-testinternalloadbalancer-klw88, uid: 6caae1a3-1b8a-4ac1-b596-d0f45c546cc7]" virtual=false
2023-08-09T16:24:20.936934069Z I0809 16:24:20.936882       1 garbagecollector.go:672] "Deleting item" item="[discovery.k8s.io/v1/EndpointSlice, namespace: openshift-ingress, name: router-testinternalloadbalancer-klw88, uid: 6caae1a3-1b8a-4ac1-b596-d0f45c546cc7]" propagationPolicy=Background
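
If raising the timeouts turns out to be the right mitigation, the waiting logic would look something like this sketch; the function name, 10-second interval, and client-go usage are assumptions for illustration, not the operator's actual code.

package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLoadBalancer polls until the LB service reports an ingress point,
// tolerating transient errors until the deadline. Callers would pass a
// timeout longer than the current 5 minutes, e.g. 10*time.Minute.
func waitForLoadBalancer(kc kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		svc, err := kc.CoreV1().Services(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling on transient errors
		}
		return len(svc.Status.LoadBalancer.Ingress) > 0, nil
	})
}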

/test e2e-gcp-operator

@@ -712,3 +712,19 @@ func getRouteHost(t *testing.T, route *routev1.Route, router string) string {
t.Fatalf("failed to find host name for default router in route: %#v", route)
return ""
}

// dumpEventsInNamespace gets the namespaces in the specified namespace and logs
Is this supposed to be "dumpEventsInNamespace gets the events"?

* test/e2e/hsts_policy_test.go (TestHstsPolicyWorks): Dump events in case
of test failure, using the new dumpEventsInNamespace helper.
* test/e2e/util_test.go (dumpEventsInNamespace): New helper function to
log all events in a namespace.
When creating a new namespace for the TestHstsPolicyWorks test, wait for
the "default" ServiceAccount and the "system:image-pullers" RoleBinding to
be provisioned in the newly created namespace before proceeding with the
test.  Make a similar change for the TestMTLSWithCRLsCerts test.

Before this commit, TestHstsPolicyWorks sometimes failed because it tried
to create a pod before the ServiceAccount had been provisioned and granted
access to pull images.  As a result, the test would randomly fail with the
following error:

    Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = reading manifest

This change should prevent such failures.

Because TestMTLSWithCRLsCerts also creates a namespace and then creates
pods in this namespace, this commit makes the same change to this test as
well.  Some other tests create namespaces but do not create pods in those
namespaces; those tests do not necessarily need to wait for the
ServiceAccount and RoleBinding.

Inspired by openshift/origin@877c652.

* test/e2e/client_tls_test.go (TestMTLSWithCRLs):
* test/e2e/hsts_policy_test.go (TestHstsPolicyWorks): Use the new
createNamespace helper.
* test/e2e/util_test.go (createNamespace): New helper function.  Create a
namespace with the specified name, register a cleanup handler to delete the
namespace when the test finishes, wait for the "default" ServiceAccount and
"system:image-pullers" RoleBinding to be created, and return the namespace.
@Miciah force-pushed the OCPBUGS-17359-test-slash-e2e-don't-use-openshift-slash-origin-node branch from 4216df0 to 42c4f82 on August 10, 2023 15:00.
@frobware (Contributor)

/lgtm
/approve

@openshift-ci bot added the lgtm label on Aug 10, 2023.
@openshift-ci bot commented Aug 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware


@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 8f2f035 and 2 for PR HEAD 42c4f82 in total

@Miciah (Contributor, Author) commented Aug 10, 2023

e2e-gcp-operator failed because TestUserDefinedIngressController and TestUnmanagedDNSToManagedDNSInternalIngressController failed. From the test output, this looks like the issue described in #970 (comment). (To do: File a bug report for this issue.)
/test e2e-gcp-operator

e2e-hypershift failed because TestNodePool/NodePool_Tests_Group/TestNodepoolMachineconfigGetsRolledout/EnsureNoCrashingPods and TestNodePool/NodePool_Tests_Group/TestNTOMachineConfigGetsRolledOut/EnsureNoCrashingPods failed.
/test e2e-hypershift

e2e-aws-ovn-single-node failed because 338 of 3821 tests failed. The first failure in the list was an "Undiagnosed panic detected in pod" failure:

{  pods/openshift-controller-manager_controller-manager-7f55874d69-twwz8_controller-manager.log.gz:E0810 16:47:11.628946       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x38e6da0), concrete:(*runtime._type)(0x3aa81e0), asserted:(*runtime._type)(0x3d79a20), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.BuildConfig)
pods/openshift-controller-manager_controller-manager-7f55874d69-twwz8_controller-manager.log.gz:E0810 16:47:12.630035       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x38e6da0), concrete:(*runtime._type)(0x3aa81e0), asserted:(*runtime._type)(0x3d79a20), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.BuildConfig)}

I filed OCPBUGS-17632 for this panic. Let's see what happens if we retry the job.
/test e2e-aws-ovn-single-node

e2e-gcp-ovn failed because "[sig-network] pods should successfully create sandboxes by adding pod to network" failed. Let's see if that happens again.
/test e2e-gcp-ovn

@Miciah (Contributor, Author) commented Aug 11, 2023

e2e-gcp-operator failed because TestUnmanagedDNSToManagedDNSInternalIngressController failed again. I see this test failing for #872 too.
/test e2e-gcp-operator

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 56a00a7 and 1 for PR HEAD 42c4f82 in total

@Miciah (Contributor, Author) commented Aug 11, 2023

e2e-gcp-operator failed because TestInternalLoadBalancer timed out while waiting for the LB to be provisioned.

@Miciah (Contributor, Author) commented Aug 11, 2023

I filed OCPBUGS-17670 for the issue that LBs can take over 5 minutes to provision on GCP.

@openshift-ci bot commented Aug 11, 2023

@Miciah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-single-node
Commit: 42c4f82
Required: false
Rerun command: /test e2e-aws-ovn-single-node


@Miciah (Contributor, Author) commented Aug 11, 2023

e2e-azure-operator failed because TestManagedDNSToUnmanagedDNSIngressController failed. I filed OCPBUGS-17671 for this issue.
/test e2e-azure-operator

@openshift-merge-robot merged commit be01a22 into openshift:master on Aug 12, 2023.
13 of 14 checks passed
@openshift-ci-robot (Contributor)

@Miciah: Jira Issue OCPBUGS-17359: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-17359 has been moved to the MODIFIED state.


@Miciah (Contributor, Author) commented Oct 25, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@Miciah: #970 failed to apply on top of branch "release-4.13":

Applying: test/e2e: Don't use openshift/origin-node
Applying: TestHstsPolicyWorks: Dump events if test fails
Using index info to reconstruct a base tree...
M	test/e2e/util_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/util_test.go
CONFLICT (content): Merge conflict in test/e2e/util_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 TestHstsPolicyWorks: Dump events if test fails
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

