Skip to content

syncer: Use UID pre-condition for delete to downstream operations#628

Closed
astefanutti wants to merge 2 commits intokcp-dev:mainfrom
astefanutti:pr-10
Closed

syncer: Use UID pre-condition for delete to downstream operations#628
astefanutti wants to merge 2 commits intokcp-dev:mainfrom
astefanutti:pr-10

Conversation

@astefanutti
Copy link
Copy Markdown
Member

@astefanutti astefanutti commented Mar 4, 2022

This fixes possible races when upstream objects are deleted and re-created.

An example of such races I've encountered, is:

  1. The downstream object is deleted for some reasons,
  2. The upstream object is deleted, but the operation fails and is re-queued for retry with backoff
  3. The upstream object is re-created, that is with the same name
  4. The re-created upstream object is synced downstream
  5. The delete operation that failed in step 2. is processed
  6. The re-created downstream object is deleted

In that particular example, it could be possible to prevent the race by simply ignoring 404 errors, but it seems more robust to use a concurrency control mechanism, to avoid unanticipated races for delete operations, as for update ones.

The UID of the downstream objects, used in the delete pre-conditions for concurrency control, are stored in an annotation. That seemed the easiest way to transport the downstream UID, but there may be better options I'm not aware of.

@sttts sttts requested a review from marun March 4, 2022 19:18
Copy link
Copy Markdown
Contributor

@marun marun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inline comments assume what you're proposing makes sense, but after reviewing I'm not clear that is the case. I would appreciate it if you would update the PR description with a crisp explanation of the problem you are trying to solve and how exactly the proposed solution addresses that problem. Please emphasise the 'why' rather than the 'how'.

Assuming this PR evolves into something we want to merge, I would also expect test changes to cover the new behavior.

func (c *Controller) deleteFromDownstream(ctx context.Context, gvr schema.GroupVersionResource, upstreamNamespace, name string, downstreamUID string) error {
if err := c.toClient.Resource(gvr).Namespace(upstreamNamespace).Delete(ctx, name, metav1.DeleteOptions{
Preconditions: metav1.NewUIDPreconditions(downstreamUID),
}); err != nil && !k8serrors.IsNotFound(err) && !k8serrors.IsConflict(err) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would k8serrors.IsConflict being true indicate, and why is it desirable to silently ignore that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that particular case, that is when the UID set in the pre-condition differs from the UID of the downstream object that actually exists, a conflict implies the object that was meant to be deleted no longer exists, and has been replaced with a new object, hence the operation that is being performed has already been fulfilled, and the error can be ignored, similarly to the 404 case.

Patch(ctx, downstreamObj.GetName(), types.ApplyPatchType, data, metav1.PatchOptions{
FieldManager: syncerApplyManager,
Force: pointer.Bool(true),
}, "status"); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe add a blank line to make up for the lack of visual indication between the conditional and its block? Either that or maybe this form of error handling (if err...; err != nil) should only apply to single-line conditionals...?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'd be inclined to go with the "apply to single-line conditionals only". I'll do it if we decide to pursue the work on that PR.

resyncPeriod = 10 * time.Hour
SyncerNamespaceKey = "SYNCER_NAMESPACE"
syncerApplyManager = "syncer"
targetUIDAnnotation = "workload.kcp.dev/target.uid"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) Maybe dashes are used instead of periods in the non-domain part of the key (e.g. target-uid)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'll do it if we decide to pursue the work on that PR.

@astefanutti
Copy link
Copy Markdown
Member Author

My inline comments assume what you're proposing makes sense, but after reviewing I'm not clear that is the case. I would appreciate it if you would update the PR description with a crisp explanation of the problem you are trying to solve and how exactly the proposed solution addresses that problem. Please emphasise the 'why' rather than the 'how'.

Thanks for the review. I've updated the description with the use case that I faced, and that explains why I'm proposing this change.

Assuming this PR evolves into something we want to merge, I would also expect test changes to cover the new behavior.

Agreed.

@marun
Copy link
Copy Markdown
Contributor

marun commented Mar 6, 2022

That's helpful detail, thank you!

Unless I'm missing something, what you're seeing is a symptom of a syncer that is not responding to pcluster events (also reported by #604). I think it might be preferable to first ensure reconciliation is bidirectional and that there exists sufficient test coverage to understand whether additional coordination is required.

@astefanutti
Copy link
Copy Markdown
Member Author

Unless I'm missing something, what you're seeing is a symptom of a syncer that is not responding to pcluster events (also reported by #604). I think it might be preferable to first ensure reconciliation is bidirectional and that there exists sufficient test coverage to understand whether additional coordination is required.

Yes, I think that would greatly reduce the likelihood of the race in that case, but I don't think that would eliminate it nonetheless.

@ncdc
Copy link
Copy Markdown
Member

ncdc commented Mar 7, 2022

@astefanutti would you mind filing an issue for this?

@astefanutti
Copy link
Copy Markdown
Member Author

@ncdc sure, I've created kcp-dev/contrib-tmc#138. I'd be inclined to close that PR, and continue the discussion there, if that's alright with you.

@astefanutti astefanutti closed this Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants