Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kuma-cp) upsert with retry on conflict #1236

Merged
merged 3 commits into from
Dec 2, 2020

Conversation

jakubdyszkiewicz
Copy link
Contributor

Problem

The problem right now affects mostly Kubernetes. When we enabled Kubernetes Client cache, the KubernetesStore is no longer consistent. if in 1 thread we execute Get, Update, Get, Update quickly enough, the second Get may not be fresh with proper Version for optimistic locking.

The problem technically can be also visible outside of Kubernetes if we are executing Upsert from different parts of the code. Right now we are Upserting DataplaneInsights from status tracker and SDS to update certs times.

I noticed this problem with DataplaneInsights when there are a lot of changes and dataplane status sink essentially is in the loop. Then it happens once every ~100 flushes.

Solution

  1. Sleep between invocation of flush of dataplane status sink (and equivalent to zone insight, mesh insights). Does not really solves the problem, it's impossible to tell for how long we should sleep
  2. Introduce UpdateForce() to ResourceStore#Update. This one would ignore optimistic locking. Unfortunatelly I could not implement this on Kubernetes. Update cannot ignore this. Patch operation also cannot bypass it. The only option that potentially could bypass it is Patch of type Apply, but it is available since Kubernetes 1.18+
    Does not really solve the problem
  3. Retry on resource conflict. I noticed that 100ms backoff we are good to go with a second try

I picked the third option since it seems to be most reasonable. I brought it as a required argument to Upsert to force users of the API to think of this specific case.

This is a draft to confirm I should proceed with this implementation.

In addition to this change, I want to increase the sink timer for Dataplane Insight and Zone Insight (as a separate PR) so we can try to avoid situations where sink is in the loop. The default 1s is really excessive for this.

Documentation

  • Fix in the code.

@lobkovilya
Copy link
Contributor

Do we really want to retry upserting until the error is gone, taking into account that fresher Insights are more relatable? I'd rather ignore an error and let the next ticker event to do its job. Maybe we can add a rate limiter to guarantee a gap between 2 upsert requests

@jakubdyszkiewicz
Copy link
Contributor Author

Ok, it could work like this when you have a ticker, but what about the case of updating the resource from different parts of code, like DataplaneInsight cert and stats? I think retry in pkg/sds/server/reconciller.go is very relevant. So we could do retry there and skip on conflict in dataplane and zone sink.

Not sure about insights resyncer though.

@lobkovilya
Copy link
Contributor

Yes, that makes sense to skip reties for dataplane/zone sink (and for insight resyncer as well). Maybe we can introduce an option store.WtihRetry or something like that

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
@jakubdyszkiewicz jakubdyszkiewicz marked this pull request as ready for review December 1, 2020 17:11
@jakubdyszkiewicz jakubdyszkiewicz requested a review from a team as a code owner December 1, 2020 17:11
@jakubdyszkiewicz
Copy link
Contributor Author

Ok, changed so

  1. We are missing flushes with log on V1
  2. Introduced backoff which is / 10 of the interval, so if user cares about quick saves, the backoff will be shorter
  3. Introduced retry on upsert from sds
  4. Changed to functional params

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
@jakubdyszkiewicz jakubdyszkiewicz merged commit 5e6c524 into master Dec 2, 2020
@jakubdyszkiewicz jakubdyszkiewicz deleted the fix/retry-resource-conflict branch December 2, 2020 12:17
mergify bot pushed a commit that referenced this pull request Dec 2, 2020
Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
(cherry picked from commit 5e6c524)
jakubdyszkiewicz pushed a commit that referenced this pull request Dec 2, 2020
Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants