informers: Don't treat relist same as sync #86015

Merged
merged 2 commits into from Jan 24, 2020

@squeed
Contributor

@squeed squeed commented Dec 6, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on both Resync() and Replace(). Separately, the SharedInformer only passes that event on to handlers that have a ResyncInterval and are due for a Resync. This can cause updates to be lost: if an object changes as part of the Replace(), the event may be incorrectly discarded for a handler that does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on Replace(). For backwards compatibility, the old behavior of always emitting Sync is preserved unless explicitly overridden.

As a result, if an object changes (or is added) on Replace(), now all SharedInformer handlers will get a correct Add() or Update() notification.

One additional side-effect is that handlers which do not ever want Resyncs will now see them for all objects that have not changed during the Replace.
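
To illustrate the filtering described above, here is a minimal toy model in Go. It is a sketch, not the client-go code: `DeltaType`, `handler`, `resyncDue`, and `delivered` are hypothetical names chosen for this example, and the only assumption is the rule stated in the background (Sync events reach only handlers that are due for a resync, all other events reach every handler).

```go
package main

import "fmt"

// DeltaType is a toy stand-in for the DeltaFIFO delta type (hypothetical).
type DeltaType string

const (
	Sync     DeltaType = "Sync"
	Replaced DeltaType = "Replaced"
)

// handler is a hypothetical stand-in for a SharedInformer event handler.
type handler struct {
	name      string
	resyncDue bool // true if the handler has a resync interval and is due
}

// delivered reports whether handler h receives an event of type t.
// Sync events are filtered by resync state; everything else is delivered.
func delivered(t DeltaType, h handler) bool {
	if t == Sync {
		return h.resyncDue
	}
	return true
}

func main() {
	handlers := []handler{
		{name: "edge-only", resyncDue: false},
		{name: "periodic", resyncDue: true},
	}
	for _, t := range []DeltaType{Sync, Replaced} {
		for _, h := range handlers {
			fmt.Printf("%s -> %s: delivered=%v\n", t, h.name, delivered(t, h))
		}
	}
}
```

In this model the only combination that is dropped is a Sync event sent to the edge-only handler, which is exactly the case where a change carried by Replace() would be lost if Replace() emitted Sync.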

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Shared informers are now more reliable in the face of network disruption.

/cc @sttts
/cc @liggitt

@squeed
Contributor Author

@squeed squeed commented Dec 6, 2019

Many thanks to @sttts, who helped me understand exactly where to look in the SharedInformer when faced with an "impossible" sequence of log lines.

@liggitt
Member

@liggitt liggitt commented Dec 6, 2019

> Separately, there is a commit that elides re-List updates if RV and UID hasn't changed.

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

@squeed
Contributor Author

@squeed squeed commented Dec 9, 2019

> Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary. There are two points for consideration:

  1. The larger bugfix is also changing existing behavior. Assume a handler without ResyncInterval doesn't want to get a Sync-style update (e.g. Update(A1, A1)) and only wants edges. However, if the client internally decides to do a re-List, then the handler will still see no-op updates. Now, I don't think there's any kind of contract that says you'll never get no-op updates, but it was coded into some of the tests.

  2. As to UIDs, I wasn't sure of the guarantees for ResourceVersion across delete-and-recreate. It's not explicitly mentioned in the documentation. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

Additionally, the elision only takes place for Replace events - so it's not a large change.

@squeed squeed force-pushed the informer-missing-updates branch from c3567d0 to c174e22 Dec 9, 2019

@liggitt
Member

@liggitt liggitt commented Dec 9, 2019

> 2. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

@liggitt
Member

@liggitt liggitt commented Dec 9, 2019

> It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary.

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

@squeed
Contributor Author

@squeed squeed commented Dec 9, 2019

> resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

Ack. Removed UID check; doesn't affect correctness.
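
For reference, a check that relies on ResourceVersion alone could look roughly like the sketch below. The `object` type and the `unchanged` helper are hypothetical and only illustrate the idea; this is not the code in the PR. ResourceVersion is treated as an opaque string and compared for equality only.

```go
package main

import "fmt"

// object is a hypothetical, stripped-down stand-in for an API object.
type object struct {
	Name            string
	ResourceVersion string
}

// unchanged reports whether a re-listed object matches the cached copy,
// judged only by ResourceVersion (no UID comparison).
func unchanged(cached, listed object) bool {
	return cached.ResourceVersion == listed.ResourceVersion
}

func main() {
	cached := object{Name: "a", ResourceVersion: "100"}
	fmt.Println(unchanged(cached, object{Name: "a", ResourceVersion: "100"}))
	fmt.Println(unchanged(cached, object{Name: "a", ResourceVersion: "101"}))
}
```

Because a recreated object cannot reuse the old ResourceVersion, equality here implies the object is the same instance in the same state, which is why the UID adds nothing to the comparison.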

@liggitt
Member

@liggitt liggitt commented Dec 9, 2019

@squeed
Contributor Author

@squeed squeed commented Dec 9, 2019

> We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

Understood. The open question to me is: is it OK to send resync updates to handlers that don't want them? If so, then we don't need the last commit. Otherwise we do.

@lavalamp
Member

@lavalamp lavalamp commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.
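
As a rough illustration of the "option or 2nd constructor" idea, the sketch below keeps the existing constructor's behavior and adds an opt-in variant. All names here (`Options`, `EmitReplaced`, `FIFO`, `NewFIFO`, `NewFIFOWithOptions`, `ReplaceDeltaType`) are hypothetical and chosen for the example; they are not taken from this thread.

```go
package main

import "fmt"

// Options is a hypothetical option struct for the toy FIFO below.
type Options struct {
	// EmitReplaced opts in to the new delta type on Replace().
	// The zero value keeps the old behavior.
	EmitReplaced bool
}

// FIFO is a toy queue that only remembers which type Replace() emits.
type FIFO struct {
	emitReplaced bool
}

// NewFIFO is the pre-existing constructor; its behavior is unchanged.
func NewFIFO() *FIFO {
	return NewFIFOWithOptions(Options{})
}

// NewFIFOWithOptions is the second constructor that exposes the opt-in.
func NewFIFOWithOptions(o Options) *FIFO {
	return &FIFO{emitReplaced: o.EmitReplaced}
}

// ReplaceDeltaType returns the delta type a Replace() call would emit.
func (f *FIFO) ReplaceDeltaType() string {
	if f.emitReplaced {
		return "Replaced"
	}
	return "Sync"
}

func main() {
	fmt.Println(NewFIFO().ReplaceDeltaType())
	fmt.Println(NewFIFOWithOptions(Options{EmitReplaced: true}).ReplaceDeltaType())
}
```

Callers of the old constructor see no difference, while code that understands the new type can opt in explicitly.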

@squeed
Contributor Author

@squeed squeed commented Dec 9, 2019

> I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.
>
> Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@lavalamp I don't quite follow - what sort of option were you thinking of?

To be clear here, this PR doesn't disable resyncs or change the standard informer behavior at all. It just ensures we don't miss updates due to a re-List. AFAICT, the question is whether or not we should expose the re-List due to disruption to the handlers.

@lavalamp
Member

@lavalamp lavalamp commented Dec 9, 2019

> doesn't disable resyncs or change the standard informer behavior at all

There can be direct clients of this.

The fact that we can make corresponding changes in the informer to preserve its behavior makes your life much easier though.

@squeed
Contributor Author

@squeed squeed commented Dec 9, 2019

As an alternative to adding a new Delta type, we could "fix" this in the SharedInformer by cheating a bit, and treating Sync as Updated if the object differs from the store.
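
Sketched very roughly, that alternative amounts to comparing the incoming object with the stored copy before deciding how to report it. The `object`, `store`, and `classify` names below are hypothetical, and the store is a plain map rather than the informer's indexer.

```go
package main

import "fmt"

// object is a hypothetical, stripped-down stand-in for an API object.
type object struct {
	Name            string
	ResourceVersion string
}

// store is a toy local cache keyed by object name.
type store map[string]object

// classify decides how a Sync delta for obj would be reported to handlers:
// "add" if the object is unknown, "update" if it differs from the stored
// copy, and "resync" if it is identical.
func classify(s store, obj object) string {
	old, ok := s[obj.Name]
	switch {
	case !ok:
		return "add"
	case old != obj:
		return "update"
	default:
		return "resync"
	}
}

func main() {
	s := store{"a": {Name: "a", ResourceVersion: "100"}}
	fmt.Println(classify(s, object{Name: "a", ResourceVersion: "100"}))
	fmt.Println(classify(s, object{Name: "a", ResourceVersion: "101"}))
	fmt.Println(classify(s, object{Name: "b", ResourceVersion: "7"}))
}
```

Only the "resync" case would then be subject to per-handler resync filtering; the other two would go to every handler.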

@liggitt
Member

@liggitt liggitt commented Jan 20, 2020

/lgtm
/approve

@liggitt
Member

@liggitt liggitt commented Jan 20, 2020

unit test failure is legitimate:

$ go test ./vendor/k8s.io/client-go/tools/cache
# k8s.io/client-go/tools/cache [k8s.io/client-go/tools/cache.test]
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:294:3: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:410:17: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:440:17: undefined: keyLookupFunc

@liggitt
Member

@liggitt liggitt commented Jan 20, 2020

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 20, 2020

@squeed
Contributor Author

@squeed squeed commented Jan 20, 2020

whoops, fixing

@squeed squeed force-pushed the informer-missing-updates branch from 0279ec0 to c889d9d Jan 20, 2020

@squeed
Contributor Author

@squeed squeed commented Jan 20, 2020

/retest

@liggitt
Member

@liggitt liggitt commented Jan 21, 2020

/retest

@lavalamp
Member

@lavalamp lavalamp commented Jan 21, 2020

Overall, great change-- I hate to ask for something but the name is unchangeable once we cut a client-go release, so...

@squeed squeed force-pushed the informer-missing-updates branch from c889d9d to ca1eeb9 Jan 23, 2020
@squeed
Contributor Author

@squeed squeed commented Jan 23, 2020

OK, rename done.

@liggitt
Member

@liggitt liggitt commented Jan 23, 2020

/lgtm
/approve
/retest

@k8s-ci-robot
Contributor

@k8s-ci-robot k8s-ci-robot commented Jan 23, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

@fejta-bot fejta-bot commented Jan 23, 2020

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

