TestWebhookConverter storage version wait is flaky #78913

liggitt · 2019-06-11T18:49:33Z

Which jobs are failing:

pull-kubernetes-integration

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Which test(s) are failing:

TestWebhookConverterWithDefaulting

/assign @sttts
/priority important-soon
/area custom-resources
/sig api-machinery
/kind flake
/milestone v1.16

fedebongio · 2019-06-13T20:20:13Z

/cc @roycaihw

liggitt · 2019-06-14T13:21:18Z

Several of the webhook converter tests have similar flakes. It looks like the wait for the storage version to become effective occasionally times out.

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Rare flakes seen in all of these tests:

TestWebhookConverterWithWatchCache/nontrivial-converter
TestWebhookConverterWithWatchCache/metadata-mutating-converter
TestWebhookConverterWithDefaulting/metadata-mutating-converter
TestWebhookConverterWithWatchCache/noop-converter
TestWebhookConverterWithDefaulting/empty-response
TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-2
TestWebhookConverterWithoutWatchCache/noop-converter/check-0/v1beta2
TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-0/v1beta2

liggitt · 2019-06-14T13:21:42Z

cc @jpbetz

jpbetz · 2019-06-14T16:39:18Z

Yikes, I’ll sort it out.

roycaihw · 2019-06-15T00:57:19Z

iiuc sending an empty patch gets short-circuited and doesn't bump the CR generation, even if the storage version has changed, which means we neither rewrite the object to etcd nor go through the encoding path

kubernetes/staging/src/k8s.io/apiextensions-apiserver/test/integration/conversion/conversion_test.go, line 958 in ec02afb:

    if _, err := versionedClient.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{}); err != nil {

this would be racing and doing no-ops if the first pass of waitForStorageVersion didn't get the expected storage version

I will send a fix to make the patch do actual mutation
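For illustration only, a minimal sketch of what a mutating probe patch body could look like, as opposed to the empty `{}` above. The annotation key and the helper name are hypothetical and are not taken from the test or from any fix PR; the sketch only builds the JSON merge patch bytes and does not call any Kubernetes client.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// probeAnnotation is a hypothetical annotation key used only for this sketch.
const probeAnnotation = "example.test/probe-count"

// annotationPatch builds a JSON merge patch that sets the probe annotation to
// the given attempt number, so every attempt carries a different body.
func annotationPatch(attempt int) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				probeAnnotation: strconv.Itoa(attempt),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Print the patch bodies for three consecutive probe attempts.
	for attempt := 1; attempt <= 3; attempt++ {
		body, err := annotationPatch(attempt)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```

Bytes built this way could be passed in place of `[]byte(`{}`)` in the Patch call quoted above, so that each probe attempt changes the object.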

liggitt · 2019-06-15T01:15:36Z

~~That's correct. It needs to be something that will persist to storage like an incrementing annotation. Good catch~~

It doesn't bump generation, but the bytes to persist in etcd would still be different if the storage version had changed, so they should persist. You can verify that by changing the storage version on a CRD yourself, doing an empty patch, and checking whether the resourceVersion changes.
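A toy model of that reasoning, as a sketch rather than the actual apiserver code. It assumes the storage layer skips a write only when the newly encoded bytes equal what is already stored; the `object` and `store` types, the `update` method, and the `example.test` group are all made up for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// object is a toy stand-in for a custom resource; it is not a Kubernetes type.
type object struct {
	APIVersion string `json:"apiVersion"`
	Name       string `json:"name"`
}

// store is a toy stand-in for etcd holding one encoded object.
type store struct {
	data            []byte
	resourceVersion int
}

// update re-encodes obj at storageVersion and writes only if the bytes differ,
// bumping resourceVersion on a real write. It reports whether a write happened.
func (s *store) update(obj object, storageVersion string) (bool, error) {
	obj.APIVersion = storageVersion
	encoded, err := json.Marshal(obj)
	if err != nil {
		return false, err
	}
	if bytes.Equal(encoded, s.data) {
		return false, nil // identical bytes: nothing new to persist
	}
	s.data = encoded
	s.resourceVersion++
	return true, nil
}

func main() {
	s := &store{}
	obj := object{Name: "probe"}

	wrote, _ := s.update(obj, "example.test/v1beta1") // initial write
	fmt.Println(wrote, s.resourceVersion)
	wrote, _ = s.update(obj, "example.test/v1beta1") // empty patch, same storage version
	fmt.Println(wrote, s.resourceVersion)
	wrote, _ = s.update(obj, "example.test/v1beta2") // empty patch after a storage version switch
	fmt.Println(wrote, s.resourceVersion)
}
```

In this model the second update is skipped, while the third one writes and bumps the resourceVersion even though the object itself was not modified, because only the encoded apiVersion changed.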

roycaihw · 2019-06-15T05:20:30Z

you're right. Generation is bumped for non-metadata changes only, and the RV does change. I queried etcd locally and the apiVersion did change after an empty patch

(sidetracking: another observation is that after changing the storage version (say v1 -> v2; noop converter), empty-patching the v2 endpoint would bump generation, while empty-patching the v1 endpoint wouldn't)

roycaihw · 2019-06-17T21:41:35Z

another theory:

our cached storage strategy could be stale if the CRD informer merges multiple events together (e.g. the informer does a re-list). Two scenarios:

delete-create-update events merged into an update event:

strategy with UID:a was created on demand for CRD Foo
Foo was deleted and recreated
strategy with UID:b was created on demand for the new Foo
new Foo got updated (e.g. switching storage version)

the strategy with UID:b is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the delete-create-update events into a single update event, the strategy with UID:a will be deleted, but the strategy with UID:b won't.

create-update events merged into a create event:

strategy with UID:a was created on demand for CRD Foo
Foo was updated (e.g. switching storage version)

the strategy with UID:a is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the create-update events into a single create event, the strategy with UID:a won't be deleted (as we don't react to create events).
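The two scenarios above can be sketched with a toy cache. This is a simplified model under stated assumptions, not the real crd-handler: the `crd` and `strategyCache` types and their methods are hypothetical, the cache is keyed by UID, strategies are built on demand, and the only invalidation is an update handler that drops the entry for the old object's UID.

```go
package main

import "fmt"

// crd is a toy stand-in for a CustomResourceDefinition.
type crd struct {
	UID            string
	StorageVersion string
}

// strategyCache maps a CRD UID to the storage version captured when the
// strategy for that UID was built.
type strategyCache struct {
	byUID map[string]string
}

// getOrCreate returns the cached storage version for d.UID, building the
// entry from d on first use.
func (c *strategyCache) getOrCreate(d crd) string {
	if v, ok := c.byUID[d.UID]; ok {
		return v
	}
	c.byUID[d.UID] = d.StorageVersion
	return d.StorageVersion
}

// onUpdate models the update handler: it only drops the old object's entry.
func (c *strategyCache) onUpdate(oldCRD, newCRD crd) {
	delete(c.byUID, oldCRD.UID)
}

// staleAfterMergedUpdate replays delete-create-update merged into one update.
func staleAfterMergedUpdate() bool {
	c := &strategyCache{byUID: map[string]string{}}
	c.getOrCreate(crd{UID: "a", StorageVersion: "v1beta1"}) // strategy for old Foo
	c.getOrCreate(crd{UID: "b", StorageVersion: "v1beta1"}) // strategy for new Foo
	latest := crd{UID: "b", StorageVersion: "v1beta2"}      // new Foo switched storage version
	c.onUpdate(crd{UID: "a", StorageVersion: "v1beta1"}, latest)
	return c.getOrCreate(latest) != latest.StorageVersion
}

// staleAfterMergedCreate replays create-update merged into one create.
func staleAfterMergedCreate() bool {
	c := &strategyCache{byUID: map[string]string{}}
	c.getOrCreate(crd{UID: "a", StorageVersion: "v1beta1"})
	latest := crd{UID: "a", StorageVersion: "v1beta2"}
	// The merged event is a create, so no handler runs and nothing is dropped.
	return c.getOrCreate(latest) != latest.StorageVersion
}

func main() {
	fmt.Println("stale after merged delete-create-update:", staleAfterMergedUpdate())
	fmt.Println("stale after merged create-update:", staleAfterMergedCreate())
}
```

In both replays the cache keeps serving the storage version captured before the switch, which matches the symptom of the storage version wait timing out.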

another race found in #79114 (comment), which could happen in a single update.

sftim · 2019-06-25T13:43:15Z

I think I just saw this:

I0625 12:46:48.924841 111985 serving.go:312] Generated self-signed cert (/tmp/apiextensions-apiserver571865376/apiserver.crt, /tmp/apiextensions-apiserver571865376/apiserver.key)
W0625 12:46:49.424865 111985 mutation_detector.go:48] Mutation detector is enabled, this will result in memory leakage.
I0625 12:46:49.429201 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.429362 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.429475 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.429829 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430378 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.430437 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.430513 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.430518 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430631 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.430985 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0625 12:46:49.433088 111985 mutation_detector.go:48] Mutation detector is enabled, this will result in memory leakage.
I0625 12:46:49.435033 111985 secure_serving.go:116] Serving securely on 127.0.0.1:35555
I0625 12:46:49.436666 111985 crd_finalizer.go:255] Starting CRDFinalizer
E0625 12:46:49.436721 111985 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get http://127.1.2.3:12345/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.1.2.3:12345: connect: connection refused
I0625 12:46:49.438386 111985 establishing_controller.go:73] Starting EstablishingController
I0625 12:46:49.438465 111985 customresource_discovery_controller.go:208] Starting DiscoveryController
I0625 12:46:49.438505 111985 naming_controller.go:288] Starting NamingConditionController
I0625 12:46:49.438747 111985 nonstructuralschema_controller.go:191] Starting NonStructuralSchemaConditionController
I0625 12:46:49.925514 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.925546 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.925595 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.925659 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.926146 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.936379 111985 client.go:354] parsed scheme: ""
I0625 12:46:49.936402 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:49.936561 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:49.936644 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:49.937636 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
E0625 12:46:50.437720 111985 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get http://127.1.2.3:12345/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.1.2.3:12345: connect: connection refused
I0625 12:46:50.454139 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.454167 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.454213 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.454335 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.454858 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.455454 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.455475 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.455509 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.455594 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.456995 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.457113 111985 client.go:354] parsed scheme: ""
I0625 12:46:50.457128 111985 client.go:354] scheme "" not registered, fallback to default scheme
I0625 12:46:50.457157 111985 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0625 12:46:50.457269 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
I0625 12:46:50.457841 111985 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
testserver.go:141: runtime-config=map[api/all:true]
testserver.go:142: Starting apiextensions-apiserver on port 35555...
testserver.go:160: Waiting for /healthz to be ok

liggitt added the kind/failing-test label Jun 11, 2019

k8s-ci-robot assigned sttts Jun 11, 2019

k8s-ci-robot added this to the v1.16 milestone Jun 11, 2019

liggitt mentioned this issue Jun 11, 2019

pr:pull-kubernetes-integration flaked 34 times in the past week #78909

Closed

liggitt added the lifecycle/frozen label Jun 12, 2019

liggitt added this to Required for GA, not started in Custom Resource Definitions Jun 12, 2019

liggitt changed the title ~~TestWebhookConverterWithDefaulting is flaky~~ TestWebhookConverter storage version wait is flaky Jun 14, 2019

roycaihw mentioned this issue Jun 15, 2019

Flake fix: patch probe CR with an incrementing annotation #79059

Closed

roycaihw mentioned this issue Jun 17, 2019

crd-handler: re-create stale CR storage on update #79114

Merged

This was referenced Jun 19, 2019

test images: Removes linux/ prefix from agnhost BASEIMAGE #79151

Merged

tests: Replaces images used with agnhost (part 2) #78396

Merged

liggitt mentioned this issue Jun 25, 2019

flaky TestWebhookConverter tests #79358

Closed

This was referenced Jun 25, 2019

pr:pull-kubernetes-integration flaked 41 times in the past week #79177

Closed

fix vendor scripts shellcheck failures #79316

Merged

liggitt moved this from Required for GA, not started to Required for GA, in progress in Custom Resource Definitions Jun 26, 2019

k8s-ci-robot closed this as completed in #79114 Jun 26, 2019

liggitt moved this from Required for GA, in progress to Complete in Custom Resource Definitions Jun 26, 2019
