
TestWebhookConverter storage version wait is flaky #78913

Open
liggitt opened this issue Jun 11, 2019 · 8 comments

Comments

@liggitt
Member

commented Jun 11, 2019

Which jobs are failing:

pull-kubernetes-integration

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Which test(s) are failing:

TestWebhookConverterWithDefaulting

/assign @sttts
/priority important-soon
/area custom-resources
/sig api-machinery
/kind flake
/milestone v1.16

@fedebongio

Contributor

commented Jun 13, 2019

/cc @roycaihw

@liggitt

Member Author

commented Jun 14, 2019

Several of the webhook converter tests have similar flakes. It looks like the wait for the storage version to become effective occasionally times out.

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=TestWebhookConverter

Rare flakes seen in all of these tests:

  • TestWebhookConverterWithWatchCache/nontrivial-converter
  • TestWebhookConverterWithWatchCache/metadata-mutating-converter
  • TestWebhookConverterWithDefaulting/metadata-mutating-converter
  • TestWebhookConverterWithWatchCache/noop-converter
  • TestWebhookConverterWithDefaulting/empty-response
  • TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-2
  • TestWebhookConverterWithoutWatchCache/noop-converter/check-0/v1beta2
  • TestWebhookConverterWithoutWatchCache/nontrivial-converter/check-0/v1beta2

@liggitt liggitt changed the title TestWebhookConverterWithDefaulting is flaky TestWebhookConverter storage version wait is flaky Jun 14, 2019

@liggitt

Member Author

commented Jun 14, 2019

cc @jpbetz

@jpbetz

Contributor

commented Jun 14, 2019

Yikes, I’ll sort it out.

@roycaihw

Member

commented Jun 15, 2019

IIUC, sending an empty patch gets short-circuited and doesn't bump the CR generation, even if the storage version has changed, which means we do not re-write to etcd or go through the encoding path:

if _, err := versionedClient.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{}); err != nil {

This would race and end up doing no-ops if the first pass of waitForStorageVersion didn't observe the expected storage version.

I will send a fix to make the patch perform an actual mutation.
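
A minimal sketch of what such a mutating patch could look like, assuming the same test-harness variables (versionedClient, obj, t) as in the quoted line; the annotation key and the counter i are placeholders for illustration, not the actual fix:

```go
// Hedged sketch: instead of an empty merge patch, bump an annotation so every
// attempt writes new bytes and round-trips through the encoding path before
// being re-persisted in the current storage version.
// "example.com/counter" and i are made-up placeholders.
patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"example.com/counter":"%d"}}}`, i))
if _, err := versionedClient.Patch(obj.GetName(), types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
	t.Fatal(err)
}
```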

@liggitt

Member Author

commented Jun 15, 2019

That's correct. It needs to be something that will persist to storage, like an incrementing annotation. Good catch.

It doesn't bump the generation, but the bytes persisted to etcd would still be different if the storage version had changed, so they should persist. You can verify that by changing the storage version on a CRD yourself, doing an empty patch, and seeing whether the resourceVersion changes.
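
A hedged sketch of that manual verification, carrying over the same variables (versionedClient, obj, t) and assuming the pre-context client calls used in the test line quoted earlier:

```go
// Sketch: after switching the CRD's storage version, an empty patch should
// still re-encode and re-persist the object, which shows up as a changed
// resourceVersion even though metadata.generation stays the same.
before, err := versionedClient.Get(obj.GetName(), metav1.GetOptions{})
if err != nil {
	t.Fatal(err)
}
if _, err := versionedClient.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{}); err != nil {
	t.Fatal(err)
}
after, err := versionedClient.Get(obj.GetName(), metav1.GetOptions{})
if err != nil {
	t.Fatal(err)
}
if before.GetResourceVersion() == after.GetResourceVersion() {
	t.Errorf("expected resourceVersion to change after re-encoding in the new storage version")
}
```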

@roycaihw

Member

commented Jun 15, 2019

You're right. Generation is only bumped for non-metadata changes, and the resourceVersion does change. I queried etcd locally and the apiVersion did change after an empty patch.

(Sidetracking: another observation is that after changing the storage version (say v1 -> v2, with a no-op converter), empty-patching the v2 endpoint bumps the generation, while empty-patching the v1 endpoint doesn't.)
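
A hedged fragment illustrating that sidenote, assuming separate per-version dynamic clients for the same custom resource (v1Client and v2Client are placeholders):

```go
// Illustrative only: with the storage version switched from v1 to v2 and a
// no-op converter, the observation above is that an empty patch via the v2
// endpoint bumps metadata.generation, while the same patch via the v1
// endpoint does not.
genBefore := obj.GetGeneration()
viaV2, err := v2Client.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{})
if err != nil {
	t.Fatal(err)
}
viaV1, err := v1Client.Patch(obj.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{})
if err != nil {
	t.Fatal(err)
}
t.Logf("generation: start=%d, after v2 patch=%d, after v1 patch=%d",
	genBefore, viaV2.GetGeneration(), viaV1.GetGeneration())
```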

@roycaihw

Member

commented Jun 17, 2019

another theory:

Our cached storage strategy could be stale if the CRD informer merges multiple events together (e.g. the informer does a re-list). Two scenarios:

  1. delete-create-update events merged into an update event:
  • strategy with UID:a was created on demand for CRD Foo
  • Foo was deleted and recreated
  • strategy with UID:b was created on demand for the new Foo
  • new Foo got updated (e.g. switching storage version)

The strategy with UID:b is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the delete-create-update events into a single update event, the strategy with UID:a will be deleted, but the strategy with UID:b won't.

  2. create-update events merged into a create event:
  • strategy with UID:a was created on demand for CRD Foo
  • Foo was updated (e.g. switching storage version)

The strategy with UID:a is supposed to be deleted and re-created on demand, to reflect the last update. But if the CRD informer merges the create-update events into a single create event, the strategy with UID:a won't be deleted (as we don't react to create events).
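
A purely illustrative sketch of the event-merging theory described in the two scenarios above (not the actual apiextensions-apiserver code); crdInformer, invalidateCachedStrategy, and the UID-keyed cache are placeholders:

```go
// If the cached storage strategy is only invalidated from update/delete
// handlers keyed by UID, a re-list that collapses events can leave a stale
// entry behind:
//   scenario 1: delete+create+update arrive as one update; invalidating by the
//     old object's UID removes the entry for UID:a but never touches the entry
//     created on demand for UID:b.
//   scenario 2: create+update arrive as one add; nothing is invalidated at all.
crdInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		// no invalidation on create: scenario 2 slips through here
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		oldCRD := oldObj.(*apiextensionsv1beta1.CustomResourceDefinition)
		// scenario 1: oldCRD.UID is the pre-delete UID (UID:a), so the strategy
		// cached for the recreated CRD (UID:b) survives with a stale storage version
		invalidateCachedStrategy(oldCRD.UID)
	},
	DeleteFunc: func(obj interface{}) {
		crd := obj.(*apiextensionsv1beta1.CustomResourceDefinition)
		invalidateCachedStrategy(crd.UID)
	},
})
```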


Another race was found in #79114 (comment), which could happen within a single update.
