
bug: flake: workspace(s) not initialized in time #2479

Closed
ncdc opened this issue Dec 12, 2022 · 11 comments
Labels: kind/bug, kind/flake

Comments

@ncdc (Member) commented Dec 12, 2022

Describe the bug

We've lately been seeing random test failures that look like this:

=== FAIL: test/e2e/apibinding TestProtectedAPI (30.83s)
    kcp.go:127: shared kcp server will target configuration "/go/src/github.com/kcp-dev/kcp/.kcp/admin.kubeconfig"
    assertions.go:1691: Waiting for condition, but got: workspace phase is Initializing, not Ready
    workspaces.go:182: 
        	Error Trace:	util.go:297
        	            				workspaces.go:182
        	            				apibinding_protected_test.go:50
        	Error:      	Condition never satisfied
        	Test:       	TestProtectedAPI
        	Messages:   	failed to wait for organization workspace e2e-org-xgk49 to become ready

Steps To Reproduce

  1. Run e2e
  2. Sometimes it fails like this

Expected Behaviour

No failure

Additional Context

No response

ncdc added the kind/bug label on Dec 12, 2022
@stevekuznetsov (Contributor) commented:

We need to amend that framework.Eventually to spit out the conditions/reasons, and/or add the ClusterWorkspace object to the artifacts; otherwise this is not something that can be debugged post hoc.
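
A minimal sketch of what that could look like, assuming a plain testify-style poll rather than kcp's actual framework.Eventually signature; getWorkspaceStatus and the trimmed-down workspaceStatus struct are hypothetical stand-ins for however the test fetches the object:

```go
package e2e

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

// workspaceStatus is a stand-in for the slice of the real status we care about.
type workspaceStatus struct {
	Phase      string
	Conditions []struct{ Type, Status, Reason, Message string }
}

// requireWorkspaceReady polls until the workspace reports Ready and, on timeout,
// fails with the last observed phase and conditions instead of a bare
// "Condition never satisfied".
func requireWorkspaceReady(ctx context.Context, t *testing.T, name string,
	getWorkspaceStatus func(ctx context.Context, name string) (workspaceStatus, error),
) {
	t.Helper()
	var last string
	ok := assert.Eventually(t, func() bool {
		ws, err := getWorkspaceStatus(ctx, name)
		if err != nil {
			last = fmt.Sprintf("error getting workspace %q: %v", name, err)
			return false
		}
		last = "phase=" + ws.Phase
		for _, c := range ws.Conditions {
			last += fmt.Sprintf(", %s=%s (%s: %s)", c.Type, c.Status, c.Reason, c.Message)
		}
		return ws.Phase == "Ready"
	}, 30*time.Second, 100*time.Millisecond)
	if !ok {
		t.Fatalf("workspace %s never became ready; last observed: %s", name, last)
	}
}
```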

@ncdc (Member, Author) commented Jan 4, 2023

One specific (new?) reason this happens: #2440 (comment)

@ncdc (Member, Author) commented Jan 24, 2023

Saw this happen again in https://github.com/kcp-dev/kcp/actions/runs/3968949742/jobs/6802786983. Did some digging into the logs, and it looks like the kcp-workspacetypes-bootstrap-root:universal and kcp-apibinder-initializer reconcilers fight with each other for a bit over patching the LogicalCluster. Eventually the bootstrapper bit completes (removing its initializer).

Then the apibinder initializer tries multiple times to remove its initializer. I see this over and over:

I0120 15:45:04.846704   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"initializers\":[\"root:universal\"]}}"
I0120 15:45:05.253906   38441 httplog.go:131] "HTTP" verb="PATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="404.302777ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="87f7af9a-3718-4ff1-90d2-130ed357acee" srcIP="10.1.0.36:56398" resp=409 addedInfo=<

It's trying to patch at resourceVersion 3861 to remove the system:apibindings initializer, leaving the root:universal initializer remaining. Except root:universal removed itself previously - that was what bumped the RV to 3861.

This is the timing of the attempts from the apibinder controller to remove its initializer, each resulting in a conflict:

  • 15:45:04.846704
  • 15:45:05.264883
  • 15:45:05.560427
  • 15:45:05.908089
  • 15:45:06.166070
  • 15:45:06.327768
  • 15:45:06.514522
  • 15:45:06.839948
  • 15:45:08.239655
  • 15:45:10.921759
  • 15:45:16.188496
  • 15:45:36.204823
  • 15:45:36.510246
  • 15:45:36.697661
  • 15:45:36.950332
  • 15:45:47.243815

It looks like it's gone into backoff. I am not clear why the reconciler continues to try to reconcile from RV 3861 instead of seeing a newer version in the cache.
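
For illustration, a minimal sketch of the shape that would break out of this loop (getLatestRV and patchWithRV are hypothetical hooks, not kcp APIs): the 409s come from the patch embedding metadata.resourceVersion as an optimistic-concurrency precondition, so any retry rebuilt from the stale cached copy at RV 3861 can never succeed, while re-reading the live object on each conflict would.

```go
package sketch

import (
	"context"

	"k8s.io/client-go/util/retry"
)

// removeInitializerWithRetry sketches the conflict-retry shape: on every 409,
// re-read the live object and rebuild the patch against its current
// resourceVersion rather than re-sending the one pinned to RV 3861.
func removeInitializerWithRetry(
	ctx context.Context,
	getLatestRV func(ctx context.Context) (string, error), // hypothetical live GET
	patchWithRV func(ctx context.Context, resourceVersion string) error, // hypothetical status patch
) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rv, err := getLatestRV(ctx) // live read, not the (possibly stale) informer cache
		if err != nil {
			return err
		}
		// The patch carries resourceVersion as a precondition; if someone else
		// updated the object first, this returns a conflict error and
		// RetryOnConflict runs the closure again with a newer RV.
		return patchWithRV(ctx, rv)
	})
}
```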

The last entry (and another one that followed it) came after the test had failed, cleanup had run, and the LogicalCluster no longer existed. Note that it's still trying RV 3861. I'm a little confused why it's still showing both the root:universal and system:apibindings initializers as existing.

I0120 15:45:47.243815   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:47Z\",\"message\":\"encountered errors: an error on the server (\\\"error getting logicalcluster ukwxjtfpgmf4w4bi|cluster: LogicalCluster.core.kcp.io \\\\\\\"cluster\\\\\\\" not found\\\") has prevented the request from succeeding (post apibindings.apis.kcp.io)\",\"reason\":\"APIBindingErrors\",\"severity\":\"Error\",\"status\":\"False\",\"type\":\"APIBindingsInitialized\"},{\"lastTransitionTime\":\"2023-01-20T15:45:03Z\",\"message\":\"Initializers still exist: [root:universal system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"

The kcp-logicalcluster-deletion reconciler is able to see the newer RV (it gets a conflict on 3925 and then quickly retries with 3927 and succeeds):

I0120 15:45:37.385918   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3925\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:37Z\",\"message\":\"Some resources are remaining: apibindings.apis.kcp.io has 5 resource instances, namespaces. has 1 resource instances; Some content in the logical cluster has finalizers remaining: apis.kcp.io/apibinding-finalizer in 5 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:37.407808   38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="19.772867ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="ff2e5460-2780-42ec-8bfd-1c4b9ad2c1e9" srcIP="[::1]:45884" resp=409
I0120 15:45:38.578864   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3927\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:38Z\",\"message\":\"Some resources are remaining: namespaces. has 1 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:38.604307   38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="16.559642ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="72c39d0b-2609-43a4-ad30-be7154b5cf1d" srcIP="[::1]:45884" resp=200

@ncdc (Member, Author) commented Jan 25, 2023

Is there any chance that the forwarding storage (pkg/virtual/framework/forwardingregistry) is causing this staleness?

@ncdc (Member, Author) commented Jan 25, 2023

Seeing stuff like this:

I0120 15:45:04.086465   38354 panic.go:884] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009056546s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688" srcIP="10.1.0.36:37458" resp=200
E0120 15:45:04.086556   38354 wrap.go:53] timeout or abort while handling: method=GET URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688"
2023/01/20 15:45:04 httputil: ReverseProxy read error during body copy: context deadline exceeded
E0120 15:45:04.087098   38354 timeout.go:141] post-timeout activity - time-elapsed: 2.3µs, GET "/services/initializingworkspaces/system:apibindings/clusters/*/apis/core.kcp.io/v1alpha1/logicalclusters" result: net/http: abort Handler

@ncdc (Member, Author) commented Jan 25, 2023

Here's a filtered set of logs showing reflector activity plus httplog entries for list/watch of logicalclusters during the time period in question:

I0120 15:44:56.984445   38457 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 8 items received
I0120 15:44:56.984169   38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=254&timeout=7m53s&timeoutSeconds=473&watch=true" latency="7m53.001359805s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="64c30d7b-3cd6-44f8-9e5f-7766ff3676c8" srcIP="[::1]:43984" resp=200
I0120 15:45:03.852115   38420 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:03.906024   38420 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=279" latency="610.324µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="c9dad497-7386-44fe-9770-5d3f3c9eae61" srcIP="[::1]:43984" resp=200
I0120 15:45:03.907749   38457 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=279" latency="51.544429ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="bc920e77-6217-438f-bd06-d8e204819a37" srcIP="10.1.0.36:50728" resp=200 addedInfo=<

        &{kcp-shard-1  [system:masters system:authenticated] map[]} is acting as &{system:apiserver f88b80c1-0901-4cdf-bff5-0dffa3fd694c [system:masters system:authenticated] map[]}
 >
I0120 15:45:04.086922   38441 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009058245s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="716a097c-9b87-4c8d-8e38-427b91d990b8" srcIP="10.1.0.36:56398" resp=200 addedInfo=<

        &{kcp-shard-0  [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
W0120 15:45:04.105390   38354 reflector.go:442] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: watch of *v1alpha1.LogicalCluster ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 147; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
I0120 15:45:04.117330   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=816&timeoutSeconds=488&watch=true" latency="1m0.026760041s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="7b2f45c6-2231-4bf3-94de-6adc3c0bdd66" srcIP="[::1]:57604" resp=200
I0120 15:45:27.691653   38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 6 items received
I0120 15:45:27.691300   38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=278&timeout=8m18s&timeoutSeconds=498&watch=true" latency="8m18.001507714s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="2bd6fd88-6b73-49eb-9cbd-c354550e2eea" srcIP="10.1.0.36:53304" resp=200
I0120 15:45:36.672804   38354 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:36.803375   38354 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=3861" latency="520.92µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="4bae6306-9f90-4a28-b9d3-2a7c70391de2" srcIP="[::1]:57604" resp=200
I0120 15:45:36.811418   38441 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=3861" latency="109.449805ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="ef7cd986-08ab-4b5e-8d0b-864e3b26364e" srcIP="10.1.0.36:56398" resp=200 addedInfo=<

        &{kcp-shard-0  [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
I0120 15:45:51.694521   38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 699 items received
I0120 15:45:51.694323   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=494&timeout=8m42s&timeoutSeconds=522&watch=true" latency="8m42.000832803s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="4831d362-1427-4954-acb7-de8ceebdd06a" srcIP="10.1.0.36:36384" resp=200
I0120 15:45:59.483599   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=412&timeoutSeconds=544&watch=true" latency="9m4.000518017s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-partial-metadata-informers" audit-ID="ef068ff6-9ed1-449f-b7b0-a120c9a5bb60" srcIP="[::1]:45884" resp=200
I0120 15:46:03.911728   38457 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=279&timeout=9m7s&timeoutSeconds=547&watch=true" latency="1m0.002806565s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="a32a850e-13e2-47f9-8707-de168d253fcb" srcIP="10.1.0.36:50728" resp=200 addedInfo=<

@ncdc (Member, Author) commented Jan 25, 2023

Can repro the 1-minute timeout easily locally with cmd/sharded-test-server. Looking into why some reverse proxy somewhere is terminating watches after 1 minute.

@ncdc (Member, Author) commented Jan 25, 2023

Fun fact: if you use http.DefaultTransport and the request's Context doesn't have a Done() channel assigned (it's not cancelable), the code assigns a 1-minute timeout to avoid leaking goroutines...

transport := http.DefaultTransport.(*http.Transport).Clone()
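
If that were the cause, the fix would amount to giving the proxied request a cancelable context; a minimal sketch, assuming the fallback behavior described above (proxyRoundTrip is purely illustrative, not kcp's proxy code):

```go
package sketch

import (
	"context"
	"net/http"
)

// proxyRoundTrip hands the cloned default transport a request whose context has
// a non-nil Done() channel, so the transport never has to fall back to a
// timeout of its own.
func proxyRoundTrip(req *http.Request) (*http.Response, context.CancelFunc, error) {
	transport := http.DefaultTransport.(*http.Transport).Clone()

	// Derive a cancelable context; ctx.Done() is now non-nil.
	ctx, cancel := context.WithCancel(req.Context())

	resp, err := transport.RoundTrip(req.Clone(ctx))
	if err != nil {
		cancel()
		return nil, nil, err
	}
	// The caller must call cancel once it is done reading resp.Body.
	return resp, cancel, nil
}
```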

@ncdc (Member, Author) commented Jan 25, 2023

Hmm, that's not it. Still digging

@ncdc (Member, Author) commented Jan 25, 2023

Aha, so:

  1. kcp receives a watch request for /services/initializingworkspaces/...
  2. The k8s RequestInfoFactory doesn't know how to parse this URL, so it returns a RequestInfo that only has these fields filled in:
  • IsResourceRequest: false
  • Path: /services/initializingworkspaces/...
  • Verb: "get" (the lowercased HTTP method, which is GET)

Because of this, the WithRequestDeadline portion of the handler chain doesn't see this as a long-running request, and installs a 60-second deadline.
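
An illustrative sketch of that interaction, with simplified stand-ins rather than the upstream apiserver code: because the fallback parse records only the lowercased HTTP method as the verb, the long-running check never sees "watch", and the deadline filter wraps the request in the default 60-second timeout.

```go
package sketch

import (
	"context"
	"net/http"
	"strings"
	"time"
)

// requestInfo mirrors the few fields the fallback parse fills in.
type requestInfo struct {
	IsResourceRequest bool
	Path              string
	Verb              string
}

// parse stands in for the RequestInfoFactory: for a path it cannot parse
// (like /services/initializingworkspaces/...), it only records the path and
// the lowercased HTTP method as the verb.
func parse(r *http.Request) requestInfo {
	return requestInfo{
		IsResourceRequest: false,
		Path:              r.URL.Path,
		Verb:              strings.ToLower(r.Method), // "get", never "watch"
	}
}

// isLongRunning approximates the long-running check: a properly parsed resource
// watch would carry Verb "watch" and be exempted from the deadline; this
// request is not.
func isLongRunning(info requestInfo) bool {
	return info.Verb == "watch" || info.Verb == "proxy"
}

// withRequestDeadline approximates the deadline filter: anything that does not
// look long-running gets the default timeout, which is what terminates the
// watch after 60 seconds.
func withRequestDeadline(h http.Handler, defaultTimeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isLongRunning(parse(r)) {
			h.ServeHTTP(w, r) // watches run without a deadline
			return
		}
		ctx, cancel := context.WithTimeout(r.Context(), defaultTimeout)
		defer cancel()
		h.ServeHTTP(w, r.WithContext(ctx))
	})
}
```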
