TestWebhookConverter storage version wait is flaky #78913

Closed
liggitt opened this issue Jun 11, 2019 · 9 comments · Fixed by #79114
Labels
area/custom-resources kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt
Member

liggitt commented Jun 11, 2019

Which jobs are failing:

pull-kubernetes-integration

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Which test(s) are failing:

TestWebhookConverterWithDefaulting

/assign @sttts
/priority important-soon
/area custom-resources
/sig api-machinery
/kind flake
/milestone v1.16

@liggitt liggitt added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jun 11, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jun 11, 2019
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/custom-resources sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/flake Categorizes issue or PR as related to a flaky test. labels Jun 11, 2019
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 12, 2019
@liggitt liggitt added this to Required for GA, not started in Custom Resource Definitions Jun 12, 2019
@fedebongio
Contributor

/cc @roycaihw

@liggitt
Member Author

liggitt commented Jun 14, 2019

Several of the webhook converter tests have similar flakes. It looks like the wait for the storage version to become effective occasionally times out.

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Rare flakes seen in all of these tests:

  • TestWebhookConverterWithWatchCache/nontrivial-converter
  • TestWebhookConverterWithWatchCache/metadata-mutating-converter
  • TestWebhookConverterWithDefaulting/metadata-mutating-converter
  • TestWebhookConverterWithWatchCache/noop-converter
  • TestWebhookConverterWithDefaulting/empty-response
  • TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-2
  • TestWebhookConverterWithoutWatchCache/noop-converter/check-0/v1beta2
  • TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-0/v1beta2

@liggitt liggitt changed the title TestWebhookConverterWithDefaulting is flaky TestWebhookConverter storage version wait is flaky Jun 14, 2019
@liggitt
Member Author

liggitt commented Jun 14, 2019

cc @jpbetz

@jpbetz
Contributor

jpbetz commented Jun 14, 2019

Yikes, I’ll sort it out.

@roycaihw
Member

IIUC, sending an empty patch gets short-circuited and doesn't bump the CR generation, even if the storage version has changed, which means we neither re-write to etcd nor go through the encoding path

if _, err := versionedClient.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{}); err != nil {

this would be racing and doing no-ops if the first pass of waitForStorageVersion didn't get the expected storage version

I will send a fix to make the patch do actual mutation

@liggitt
Member Author

liggitt commented Jun 15, 2019

That's correct. It needs to be something that will persist to storage like an incrementing annotation. Good catch

It doesn't bump generation, but the bytes to persist in etcd would still be different if the storage version had changed, so they should persist. You can verify that by changing the storage version on a CRD yourself, doing an empty patch, and checking whether the resourceVersion changes.
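
As an illustration of the "incrementing annotation" idea above, here is a minimal sketch of building a merge patch body whose content differs on every attempt. The annotation key `example.test/touch-count` and the helper name `mutatingMergePatch` are hypothetical and are not taken from the test or from #79114; only the Go standard library is used.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// touchAnnotation is a hypothetical annotation key used only in this sketch.
const touchAnnotation = "example.test/touch-count"

// mutatingMergePatch returns a JSON merge patch that sets touchAnnotation
// to the attempt number, so consecutive attempts never send identical content.
func mutatingMergePatch(attempt int) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				touchAnnotation: strconv.Itoa(attempt),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Print the patch bodies for three consecutive attempts.
	for attempt := 1; attempt <= 3; attempt++ {
		patch, err := mutatingMergePatch(attempt)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(patch))
	}
}
```

The returned bytes would stand in for the empty-object patch body in the Patch call quoted earlier. As noted in the same comment, an empty patch may already persist once the storage version has changed, so this only sketches the alternative that was discussed.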

@roycaihw
Member

you're right. Generation is bumped for non-metadata changes only, and the RV changes. I queried etcd locally and the apiVersion did change for an empty patch

(sidetracking: another observation is that after changing the storage version (say v1 -> v2; noop converter), empty-patching the v2 endpoint would bump generation, while empty-patching the v1 endpoint wouldn't)

@roycaihw
Member

roycaihw commented Jun 17, 2019

another theory:

our cached storage strategy could be stale if the CRD informer merges multiple events together (e.g. the informer does a re-list). Two scenarios:

  1. delete-create-update events merged into an update event:
  • strategy with UID:a was created on demand for CRD Foo
  • Foo was deleted and recreated
  • strategy with UID:b was created on demand for the new Foo
  • new Foo got updated (e.g. switching storage version)

the strategy with UID:b is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the delete-create-update events into a single update event, the strategy with UID:a will be deleted, but the strategy with UID:b won't.

  2. create-update events merged into a create event:
  • strategy with UID:a was created on demand for CRD Foo
  • Foo was updated (e.g. switching storage version)

the strategy with UID:a is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the create-update events into a single create event, the strategy with UID:a won't be deleted (as we don't react on create events).
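
The two scenarios above can be sketched as a small stand-alone simulation. Everything here is hypothetical: `crd`, `strategyCache`, and the handler names are stand-ins, and the invalidation rule (an update drops only the entry for the old object's UID, a create drops nothing) is an assumption based on the description above, not the actual apiextensions-apiserver code.

```go
package main

import "fmt"

// crd is a pared-down stand-in for a CustomResourceDefinition.
type crd struct {
	UID            string
	StorageVersion string
}

// strategyCache maps a CRD UID to the storage version captured when the
// strategy was built on demand.
type strategyCache map[string]string

// get returns the cached storage version for c, building the entry on first use.
func (s strategyCache) get(c crd) string {
	if _, ok := s[c.UID]; !ok {
		s[c.UID] = c.StorageVersion
	}
	return s[c.UID]
}

// onUpdate assumes the handler invalidates only the old object's entry.
func (s strategyCache) onUpdate(oldObj, newObj crd) { delete(s, oldObj.UID) }

// onAdd assumes create events do not invalidate anything.
func (s strategyCache) onAdd(c crd) {}

// separateUpdate is the baseline: the update arrives as its own event.
func separateUpdate() string {
	cache := strategyCache{}
	foo := crd{UID: "a", StorageVersion: "v1beta1"}
	cache.get(foo)
	updated := crd{UID: "a", StorageVersion: "v1beta2"}
	cache.onUpdate(foo, updated)
	return cache.get(updated)
}

// mergedDeleteCreateUpdate models scenario 1: delete, create, and update
// arrive as one update event from the old Foo (UID a) to the new Foo (UID b).
func mergedDeleteCreateUpdate() string {
	cache := strategyCache{}
	fooA := crd{UID: "a", StorageVersion: "v1beta1"}
	cache.get(fooA)
	fooB := crd{UID: "b", StorageVersion: "v1beta1"}
	cache.get(fooB) // built on demand before the merged event arrives
	fooBUpdated := crd{UID: "b", StorageVersion: "v1beta2"}
	cache.onUpdate(fooA, fooBUpdated) // drops entry a, leaves entry b
	return cache.get(fooBUpdated)
}

// mergedCreateUpdate models scenario 2: create and update arrive as one
// create event, which the cache ignores.
func mergedCreateUpdate() string {
	cache := strategyCache{}
	foo := crd{UID: "a", StorageVersion: "v1beta1"}
	cache.get(foo)
	updated := crd{UID: "a", StorageVersion: "v1beta2"}
	cache.onAdd(updated)
	return cache.get(updated)
}

func main() {
	fmt.Println("separate update:             ", separateUpdate())
	fmt.Println("merged delete-create-update: ", mergedDeleteCreateUpdate())
	fmt.Println("merged create-update:        ", mergedCreateUpdate())
}
```

In this model the baseline picks up v1beta2, while both merged cases keep serving the stale v1beta1 entry, which matches the staleness described in the two scenarios.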


another race found in #79114 (comment), which could happen in a single update.

@sftim
Contributor

sftim commented Jun 25, 2019

I think I just saw this:

I0625 12:46:48.924841 111985 serving.go:312] Generated self-signed cert (/tmp/apiextensions-apiserver571865376/apiserver.crt, /tmp/apiextensions-apiserver571865376/apiserver.key)
W0625 12:46:49.424865 111985 mutation_detector.go:48] Mutation detector is enabled, this will result in memory leakage.
I0625 12:46:49.429201 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.429362 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.429475 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.429829 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430378 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.430437 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.430513 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.430518 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430631 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430985 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0625 12:46:49.433088 111985 mutation_detector.go:48] Mutation detector is enabled, this will result in memory leakage.
I0625 12:46:49.435033 111985 secure_serving.go:116] Serving securely on 127.0.0.1:35555
I0625 12:46:49.436666 111985 crd_finalizer.go:255] Starting CRDFinalizer
E0625 12:46:49.436721 111985 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get http://127.1.2.3:12345/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.1.2.3:12345: connect: connection refused
I0625 12:46:49.438386 111985 establishing_controller.go:73] Starting EstablishingController
I0625 12:46:49.438465 111985 customresource_discovery_controller.go:208] Starting DiscoveryController
I0625 12:46:49.438505 111985 naming_controller.go:288] Starting NamingConditionController
I0625 12:46:49.438747 111985 nonstructuralschema_controller.go:191] Starting NonStructuralSchemaConditionController
I0625 12:46:49.925514 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.925546 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.925595 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.925659 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.926146 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.936379 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.936402 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.936561 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.936644 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.937636 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
E0625 12:46:50.437720 111985 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get http://127.1.2.3:12345/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.1.2.3:12345: connect: connection refused
I0625 12:46:50.454139 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.454167 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.454213 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.454335 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.454858 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.455454 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.455475 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.455509 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.455594 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.456995 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.457113 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.457128 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.457157 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.457269 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.457841 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
testserver.go:141: runtime-config=map[api/all:true]
testserver.go:142: Starting apiextensions-apiserver on port 35555...
testserver.go:160: Waiting for /healthz to be ok

@liggitt liggitt moved this from Required for GA, not started to Required for GA, in progress in Custom Resource Definitions Jun 26, 2019
@liggitt liggitt moved this from Required for GA, in progress to Complete in Custom Resource Definitions Jun 26, 2019
7 participants