Skip to content

informers: Don't treat relist same as sync#86015

Merged
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
squeed:informer-missing-updates
Jan 24, 2020
Merged

informers: Don't treat relist same as sync#86015
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
squeed:informer-missing-updates

Conversation

@squeed
Copy link
Copy Markdown
Contributor

@squeed squeed commented Dec 6, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on Resync() and Replace(). Seperately, the SharedInformer will only pass that event on to handlers that have a ResyncInterval and are due for Resync. This can cause updates to be lost if an object changes as part of the Replace(), as it may be incorrectly discarded if the handler does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on Replace(). For backwards compatability concerns, the old behavior of always emitting Sync is preserved unless explicity overridden.

As a result, if an object changes (or is added) on Replace(), now all SharedInformer handlers will get a correct Add() or Update() notification.

One additional side-effect is that handlers which do not ever want Resyncs will now see them for all objects that have not changed during the Replace.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Shared informers are now more reliable in the face of network disruption.

/cc @sttts
/cc @liggitt

@k8s-ci-robot k8s-ci-robot requested review from liggitt and sttts December 6, 2019 21:44
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 6, 2019
@squeed squeed force-pushed the informer-missing-updates branch from e77bb56 to c3567d0 Compare December 6, 2019 21:47
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 6, 2019

Many thanks to @sttts, who helped me understand exactly where to look in the SharedInformer when faced with an "impossible" sequence of log lines.

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Dec 6, 2019

Separately, there is a commit that elides re-List updates if RV and UID hasn't changed.

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

Comment thread staging/src/k8s.io/client-go/tools/cache/testing/fake_controller_source.go Outdated
Comment thread staging/src/k8s.io/client-go/tools/cache/shared_informer.go Outdated
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 9, 2019

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary. There are two points for consideration:

  1. The larger bugfix is also changing existing behavior. Assume a handler without ResyncInterval doesn't want to get a Sync-style update (e.g. Update(A1, A1)) and only wants edges. However, if the client internally decides to do a re-List, then the handler will still see no-op updates. Now, I don't think there's any kind of contract that says you'll never get no-op updates, but it it was coded in to some of the tests.

  2. As to UIDs, I wasn't sure of the guarantees for ResourceVersion across delete-and-recreate. It's not explicitly mentioned in the documentation. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

Additionally, the elision only takes place for Replace events - so it's not a large change.

@squeed squeed force-pushed the informer-missing-updates branch from c3567d0 to c174e22 Compare December 9, 2019 10:51
@liggitt
Copy link
Copy Markdown
Member

liggitt commented Dec 9, 2019

2. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Dec 9, 2019

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary.

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 9, 2019

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

Ack. Removed UID check; doesn't affect correctness.

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Dec 9, 2019

fyi @smarterclayton @lavalamp

Comment thread staging/src/k8s.io/client-go/tools/cache/controller.go Outdated
Comment thread staging/src/k8s.io/client-go/tools/cache/delta_fifo.go Outdated
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 9, 2019

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

Understood. The open question to me is: is it OK to send resync updates to handlers that don't want them? If so, then we don't need the last commit. Otherwise we do.

@lavalamp
Copy link
Copy Markdown
Contributor

lavalamp commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@lavalamp I don't quite follow - what sort of option were you thinking of?

To be clear here, this PR doesn't disable resyncs or change the standard informer behavior at all. It just ensures we don't miss updates due to a re-List. AFAICT, the question is whether or not we should expose the re-List due to disruption to the handlers.

Comment thread staging/src/k8s.io/client-go/tools/cache/delta_fifo.go Outdated
Comment thread staging/src/k8s.io/client-go/tools/cache/delta_fifo.go Outdated
@lavalamp
Copy link
Copy Markdown
Contributor

lavalamp commented Dec 9, 2019

doesn't disable resyncs or change the standard informer behavior at all

There can be direct clients of this.

The fact that we can make corresponding changes in the informer to preserve its behavior makes your life much easier though.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 9, 2019

As an alternative to adding a new Delta type, we could "fix" this in the SharedInformer by cheating a bit, and treating Sync as Updated if the object differs from the store.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2020
@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jan 20, 2020

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2020
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Jan 20, 2020

whoops, fixing

@squeed squeed force-pushed the informer-missing-updates branch from 0279ec0 to c889d9d Compare January 20, 2020 17:29
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Jan 20, 2020

/retest

1 similar comment
@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jan 21, 2020

/retest

Comment thread staging/src/k8s.io/client-go/tools/cache/delta_fifo.go Outdated
@lavalamp
Copy link
Copy Markdown
Contributor

Overall, great change-- I hate to ask for something but the name is unchangeable once we cut a client-go release, so...

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on Resync() and
Replace(). Seperately, the SharedInformer will only pass that event
on to handlers that have a ResyncInterval and are due for Resync. This
can cause updates to be lost if an object changes as part of the Replace(),
as it may be incorrectly discarded if the handler does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on
Replace(). For backwards compatability concerns, the old behavior of
always emitting Sync is preserved unless explicity overridden.

As a result, if an object changes (or is added) on Replace(), now all
SharedInformer handlers will get a correct Add() or Update()
notification.

One additional side-effect is that handlers which do not ever want
Resyncs will now see them for all objects that have not changed during
the Replace.
@squeed squeed force-pushed the informer-missing-updates branch from c889d9d to ca1eeb9 Compare January 23, 2020 10:26
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Jan 23, 2020

OK, rename done.

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jan 23, 2020

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2020
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link
Copy Markdown

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments
@fejta-bot
Copy link
Copy Markdown

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link
Copy Markdown

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 9f09913 into kubernetes:master Jan 24, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 24, 2020
squeed added a commit to squeed/cilium that referenced this pull request Apr 7, 2025
The Replaced event is emitted when the informer's watch is disrupted and
requires a re-list. If this re-list results in a change, the change is
*swallowed* unless either Sync events or Replaced events are enabled.

Given that Sync is generally disabled, we should accept the Replaced
event type.

(see: kubernetes/kubernetes#86015 )

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
github-merge-queue Bot pushed a commit to cilium/cilium that referenced this pull request Apr 16, 2025
The Replaced event is emitted when the informer's watch is disrupted and
requires a re-list. If this re-list results in a change, the change is
*swallowed* unless either Sync events or Replaced events are enabled.

Given that Sync is generally disabled, we should accept the Replaced
event type.

(see: kubernetes/kubernetes#86015 )

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants