informers: Don't treat relist same as sync #86015

Merged
merged 2 commits into kubernetes:master from squeed:informer-missing-updates on Jan 24, 2020


@squeed
Contributor

squeed commented Dec 6, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on Resync() and Replace(). Separately, the SharedInformer will only pass that event on to handlers that have a ResyncInterval and are due for Resync. This can cause updates to be lost if an object changes as part of the Replace(), because the event may be incorrectly discarded if the handler does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on Replace(). For backwards compatibility, the old behavior of always emitting Sync is preserved unless explicitly overridden.

As a result, if an object changes (or is added) on Replace(), now all SharedInformer handlers will get a correct Add() or Update() notification.

One additional side-effect is that handlers which do not ever want Resyncs will now see them for all objects that have not changed during the Replace.
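
A minimal sketch of the failure mode described above, as a toy model rather than the client-go code. The `handler` type, its `wantsResync` field, and the `distribute` function are hypothetical names introduced only for illustration; the point is that a changed object reported as Sync never reaches a handler that did not ask for resyncs, while the same change reported as Replaced reaches everyone.

```go
package main

import "fmt"

// DeltaType mirrors the idea of the queue's event kinds; this is a toy model,
// not the client-go definition.
type DeltaType string

const (
	Sync     DeltaType = "Sync"
	Replaced DeltaType = "Replaced"
)

// handler is a hypothetical stand-in for a registered event handler.
type handler struct {
	name        string
	wantsResync bool     // true if the handler has a resync interval and is due
	updates     []string // notifications received, recorded for inspection
}

// distribute delivers an update for an object that is already in the store.
// Sync deltas go only to handlers that are due for a resync;
// every other delta kind goes to every handler.
func distribute(t DeltaType, oldObj, newObj string, hs []*handler) {
	for _, h := range hs {
		if t == Sync && !h.wantsResync {
			continue // dropped: this is where a real change could get lost
		}
		h.updates = append(h.updates, oldObj+"->"+newObj)
	}
}

func main() {
	edgeOnly := &handler{name: "edge-only"}
	periodic := &handler{name: "periodic", wantsResync: true}
	hs := []*handler{edgeOnly, periodic}

	// Old behavior: a relist that changed A1 to A2 is reported as Sync.
	distribute(Sync, "A1", "A2", hs)
	fmt.Println(len(edgeOnly.updates), len(periodic.updates)) // 0 1

	// New behavior: the same relist is reported as Replaced.
	distribute(Replaced, "A1", "A2", hs)
	fmt.Println(len(edgeOnly.updates), len(periodic.updates)) // 1 2
}
```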

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Shared informers are now more reliable in the face of network disruption.

/cc @sttts
/cc @liggitt

@squeed
Contributor Author

squeed commented Dec 6, 2019

Many thanks to @sttts, who helped me understand exactly where to look in the SharedInformer when faced with an "impossible" sequence of log lines.

@liggitt
Member

liggitt commented Dec 6, 2019

Separately, there is a commit that elides re-List updates if RV and UID haven't changed.

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

@squeed
Contributor Author

squeed commented Dec 9, 2019

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary. There are two points for consideration:

  1. The larger bugfix is also changing existing behavior. Assume a handler without ResyncInterval doesn't want to get a Sync-style update (e.g. Update(A1, A1)) and only wants edges. However, if the client internally decides to do a re-List, then the handler will still see no-op updates. Now, I don't think there's any kind of contract that says you'll never get no-op updates, but it was coded into some of the tests.

  2. As to UIDs, I wasn't sure of the guarantees for ResourceVersion across delete-and-recreate. It's not explicitly mentioned in the documentation. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

Additionally, the elision only takes place for Replace events - so it's not a large change.

@squeed squeed force-pushed the squeed:informer-missing-updates branch from c3567d0 to c174e22 Dec 9, 2019
@liggitt
Member

liggitt commented Dec 9, 2019

2. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

@liggitt
Member

liggitt commented Dec 9, 2019

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary.

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

@squeed
Contributor Author

squeed commented Dec 9, 2019

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

Ack. Removed UID check; doesn't affect correctness.
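
A minimal sketch of the elision idea under discussion, assuming a relisted object counts as unchanged when its resourceVersion equals the cached one (no UID comparison). The `object` type and `isNoOpReplace` function are hypothetical and exist only for this illustration; they are not taken from the PR.

```go
package main

import "fmt"

// object is a hypothetical, minimal stand-in for an API object's metadata.
type object struct {
	Name            string
	ResourceVersion string
}

// isNoOpReplace reports whether a relisted object is identical to the cached
// one, judged only by ResourceVersion equality. Versions are treated as
// opaque strings: compared for equality, never ordered.
func isNoOpReplace(cached, listed object) bool {
	return cached.ResourceVersion == listed.ResourceVersion
}

func main() {
	cached := object{Name: "a", ResourceVersion: "100"}

	// Same version after the relist: the update could be elided.
	fmt.Println(isNoOpReplace(cached, object{Name: "a", ResourceVersion: "100"})) // true

	// Different version: the object changed (or was deleted and recreated).
	fmt.Println(isNoOpReplace(cached, object{Name: "a", ResourceVersion: "101"})) // false
}
```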

@liggitt
Member

liggitt commented Dec 9, 2019

@squeed
Contributor Author

squeed commented Dec 9, 2019

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

Understood. The open question to me is: is it OK to send resync updates to handlers that don't want them? If so, then we don't need the last commit. Otherwise we do.

@lavalamp
Member

lavalamp commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.
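
One way to read this suggestion, sketched under assumptions: keep the existing constructor's behavior and add a second, options-based constructor whose zero value preserves the old semantics. `Queue`, `Options`, `EmitReplaced`, `NewQueue`, and `NewQueueWithOptions` are hypothetical names for a toy type, not the API that was eventually merged.

```go
package main

import "fmt"

// Options is a hypothetical options struct for a toy queue.
type Options struct {
	// EmitReplaced opts in to the new delta kind. The zero value keeps the
	// old behavior, so existing callers need no changes.
	EmitReplaced bool
}

// Queue is a toy stand-in for the delta queue.
type Queue struct {
	emitReplaced bool
}

// NewQueue is the pre-existing constructor; it keeps the old behavior.
func NewQueue() *Queue { return NewQueueWithOptions(Options{}) }

// NewQueueWithOptions is the second constructor, for callers that opt in.
func NewQueueWithOptions(o Options) *Queue {
	return &Queue{emitReplaced: o.EmitReplaced}
}

// ReplaceDeltaType returns the delta kind that a Replace call would emit.
func (q *Queue) ReplaceDeltaType() string {
	if q.emitReplaced {
		return "Replaced"
	}
	return "Sync"
}

func main() {
	fmt.Println(NewQueue().ReplaceDeltaType())                                       // Sync
	fmt.Println(NewQueueWithOptions(Options{EmitReplaced: true}).ReplaceDeltaType()) // Replaced
}
```

Direct users of the queue keep calling the old constructor and see no change; only a caller that knows how to handle the new delta kind sets the flag.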

@squeed
Contributor Author

squeed commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@lavalamp I don't quite follow - what sort of option were you thinking of?

To be clear here, this PR doesn't disable resyncs or change the standard informer behavior at all. It just ensures we don't miss updates due to a re-List. AFAICT, the question is whether or not we should expose the re-List due to disruption to the handlers.

@lavalamp
Member

lavalamp commented Dec 9, 2019

doesn't disable resyncs or change the standard informer behavior at all

There can be direct clients of this.

The fact that we can make corresponding changes in the informer to preserve its behavior makes your life much easier though.

@squeed
Contributor Author

squeed commented Dec 9, 2019

As an alternative to adding a new Delta type, we could "fix" this in the SharedInformer by cheating a bit, and treating Sync as Updated if the object differs from the store.
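
Roughly, that alternative could look like the sketch below: on a Sync delta, compare the incoming object with the copy in the local store and report a real update when they differ. The `pod` type, the map-based store, and `classifySync` are hypothetical, and `reflect.DeepEqual` is used only to keep the example short.

```go
package main

import (
	"fmt"
	"reflect"
)

// pod is a hypothetical object type used only for this illustration.
type pod struct {
	Name  string
	Image string
}

// classifySync decides how to report a Sync delta by comparing the incoming
// object with what the local store holds under the same key.
func classifySync(store map[string]pod, key string, incoming pod) string {
	old, ok := store[key]
	switch {
	case !ok:
		return "add" // not in the store yet
	case !reflect.DeepEqual(old, incoming):
		return "update" // differs from the store: a real change
	default:
		return "resync" // identical: an ordinary periodic resync
	}
}

func main() {
	store := map[string]pod{"ns/a": {Name: "a", Image: "v1"}}

	fmt.Println(classifySync(store, "ns/a", pod{Name: "a", Image: "v1"})) // resync
	fmt.Println(classifySync(store, "ns/a", pod{Name: "a", Image: "v2"})) // update
	fmt.Println(classifySync(store, "ns/b", pod{Name: "b", Image: "v1"})) // add
}
```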

@squeed
Contributor Author

squeed commented Jan 20, 2020

@liggitt Updated and rebased, thanks for the feedback.

@liggitt
Member

liggitt commented Jan 20, 2020

/retest

@liggitt
Member

liggitt commented Jan 20, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 20, 2020
@liggitt
Member

liggitt commented Jan 20, 2020

unit test failure is legitimate:

$ go test ./vendor/k8s.io/client-go/tools/cache
# k8s.io/client-go/tools/cache [k8s.io/client-go/tools/cache.test]
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:294:3: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:410:17: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:440:17: undefined: keyLookupFunc
@liggitt
Member

liggitt commented Jan 20, 2020

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 20, 2020
@squeed
Contributor Author

squeed commented Jan 20, 2020

whoops, fixing

@squeed squeed force-pushed the squeed:informer-missing-updates branch from 0279ec0 to c889d9d Jan 20, 2020
@squeed
Contributor Author

squeed commented Jan 20, 2020

/retest

@liggitt
Member

liggitt commented Jan 21, 2020

/retest

@lavalamp
Member

lavalamp commented Jan 21, 2020

Overall, great change-- I hate to ask for something but the name is unchangeable once we cut a client-go release, so...

@squeed squeed force-pushed the squeed:informer-missing-updates branch from c889d9d to ca1eeb9 Jan 23, 2020
@squeed
Contributor Author

squeed commented Jan 23, 2020

OK, rename done.

@liggitt
Member

liggitt commented Jan 23, 2020

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 23, 2020
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jan 23, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

fejta-bot commented Jan 23, 2020

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 9f09913 into kubernetes:master Jan 24, 2020
14 of 16 checks passed

pull-kubernetes-e2e-kind: Job triggered.
tide: Not mergeable. Job pull-kubernetes-e2e-kind has not succeeded.
cla/linuxfoundation: squeed authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-kind-ipv6: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 24, 2020
9 participants