fix(kuma-cp) upsert with retry on conflict #1236

jakubdyszkiewicz · 2020-11-30T14:17:13Z

Problem

The problem right now mostly affects Kubernetes. Since we enabled the Kubernetes client cache, the KubernetesStore is no longer consistent. If one thread executes Get, Update, Get, Update quickly enough, the second Get may return a stale object without the proper Version for optimistic locking.
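A minimal sketch of the failure mode described above, using a toy in-memory store rather than the real KubernetesStore. The names `Store`, `Resource`, `Sync`, `ErrConflict` and `DemoStaleUpdate` are hypothetical and exist only for illustration; the point is that a read served from a lagging cache carries an old version, so the next update is rejected by optimistic locking.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict is a hypothetical stand-in for the store's resource-conflict error.
var ErrConflict = errors.New("resource conflict")

// Resource carries a version used for optimistic locking.
type Resource struct {
	Version int
	Value   string
}

// Store is a toy store whose reads go through a cache that lags behind writes.
type Store struct {
	current Resource // authoritative state
	cached  Resource // what Get returns until Sync is called
}

// Get returns the cached, possibly stale, copy.
func (s *Store) Get() Resource { return s.cached }

// Update succeeds only if the caller saw the latest version.
func (s *Store) Update(r Resource) error {
	if r.Version != s.current.Version {
		return ErrConflict
	}
	r.Version++
	s.current = r
	return nil
}

// Sync simulates the cache catching up with the authoritative state.
func (s *Store) Sync() { s.cached = s.current }

// DemoStaleUpdate runs Get, Update, Get, Update with no cache sync in between
// and returns the error from each Update.
func DemoStaleUpdate() (first, second error) {
	s := &Store{}

	r := s.Get()
	r.Value = "a"
	first = s.Update(r)

	r = s.Get() // the cache has not caught up, so this still has the old version
	r.Value = "b"
	second = s.Update(r)
	return first, second
}

func main() {
	first, second := DemoStaleUpdate()
	fmt.Println("first update:", first)
	fmt.Println("second update:", second)
}
```

In this toy model, calling `Sync` between the first Update and the second Get would make the second Update succeed, which is why the conflict only shows up when the calls happen quickly enough.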

Technically, the problem can also be visible outside of Kubernetes if we execute Upsert from different parts of the code. Right now we Upsert DataplaneInsights both from the status tracker and from SDS (to update cert times).

I noticed this problem with DataplaneInsights when there are a lot of changes and the dataplane status sink is essentially flushing in a loop. In that case it happens once every ~100 flushes.

Solution

1. Sleep between invocations of flush in the dataplane status sink (and the equivalents for zone insights and mesh insights). This does not really solve the problem, because it's impossible to tell how long we should sleep.
2. Introduce UpdateForce() next to ResourceStore#Update. This one would ignore optimistic locking. Unfortunately, I could not implement this on Kubernetes. Update cannot ignore it. The Patch operation also cannot bypass it. The only option that could potentially bypass it is a Patch of type Apply, but it is available since Kubernetes 1.18+. Does not really solve the problem.
3. Retry on resource conflict. I noticed that with a 100ms backoff the second try is enough.

I picked the third option since it seems to be the most reasonable. I made it a required argument to Upsert to force users of the API to think about this specific case.
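The third option can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `RetryOnConflict`, `ErrConflict` and the parameter names are hypothetical, and the real Upsert signature may differ. The helper retries only on a conflict error and sleeps for a fixed backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConflict is a hypothetical stand-in for the store's resource-conflict error.
var ErrConflict = errors.New("resource conflict")

// RetryOnConflict calls op until it returns something other than ErrConflict
// or maxAttempts is reached. It sleeps for backoff between attempts and
// returns the number of attempts made together with the last error.
func RetryOnConflict(maxAttempts int, backoff time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if !errors.Is(err, ErrConflict) {
			return attempt, err // success, or an error that a retry will not fix
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// Simulated upsert: the first call hits a stale version, the second succeeds.
	upsert := func() error {
		calls++
		if calls == 1 {
			return ErrConflict
		}
		return nil
	}

	attempts, err := RetryOnConflict(3, 100*time.Millisecond, upsert)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Errors other than a conflict are returned immediately, since retrying would not help with them.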

This is a draft to confirm I should proceed with this implementation.

In addition to this change, I want to increase the sink timer for Dataplane Insight and Zone Insight (as a separate PR) so we can try to avoid situations where the sink is flushing in a loop. The default of 1s is really excessive for this.

Documentation

Fix in the code.

lobkovilya · 2020-11-30T14:59:52Z

Do we really want to retry upserting until the error is gone, taking into account that fresher Insights are more relevant? I'd rather ignore the error and let the next ticker event do its job. Maybe we can add a rate limiter to guarantee a gap between two upsert requests.

jakubdyszkiewicz · 2020-11-30T15:40:09Z

Ok, it could work like this when you have a ticker, but what about the case of updating the resource from different parts of code, like DataplaneInsight cert and stats? I think retry in pkg/sds/server/reconciller.go is very relevant. So we could do retry there and skip on conflict in dataplane and zone sink.

Not sure about insights resyncer though.

lobkovilya · 2020-11-30T16:13:01Z

Yes, it makes sense to skip retries for the dataplane/zone sink (and for the insight resyncer as well). Maybe we can introduce an option store.WithRetry or something like that.
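One way such an option could look is sketched below, assuming a functional-options style. `Upsert`, `UpsertOption`, `WithRetry` and `conflictOnce` are hypothetical names made up for this illustration and are not taken from the Kuma code base. Callers that want the skip-on-conflict behaviour pass no option; callers such as SDS opt in to retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConflict is a hypothetical stand-in for the store's resource-conflict error.
var ErrConflict = errors.New("resource conflict")

// upsertOpts holds optional settings; the zero value means "no retry".
type upsertOpts struct {
	maxRetries int
	backoff    time.Duration
}

// UpsertOption mutates upsertOpts (functional-options pattern).
type UpsertOption func(*upsertOpts)

// WithRetry is a hypothetical option that enables retries on conflict.
func WithRetry(maxRetries int, backoff time.Duration) UpsertOption {
	return func(o *upsertOpts) {
		o.maxRetries = maxRetries
		o.backoff = backoff
	}
}

// Upsert runs write once, then again after each conflict while retries remain.
func Upsert(write func() error, options ...UpsertOption) error {
	opts := upsertOpts{}
	for _, apply := range options {
		apply(&opts)
	}
	err := write()
	for i := 0; i < opts.maxRetries && errors.Is(err, ErrConflict); i++ {
		time.Sleep(opts.backoff)
		err = write()
	}
	return err
}

// conflictOnce returns a write function that conflicts on its first call only.
func conflictOnce() func() error {
	calls := 0
	return func() error {
		calls++
		if calls == 1 {
			return ErrConflict
		}
		return nil
	}
}

func main() {
	// Sink-style caller: no retry, the conflict is surfaced and can be skipped.
	fmt.Println("without retry:", Upsert(conflictOnce()))
	// SDS-style caller: retry so the update is not lost.
	fmt.Println("with retry:", Upsert(conflictOnce(), WithRetry(3, time.Millisecond)))
}
```

Compared with a required argument, an optional parameter keeps existing call sites unchanged, at the cost of no longer forcing every caller to decide how to handle conflicts.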

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>

jakubdyszkiewicz · 2020-12-01T17:14:24Z

OK, changed so that:

Flushes that are skipped because of a conflict are logged at V1
Introduced a backoff equal to 1/10 of the flush interval, so if the user cares about quick saves, the backoff will be shorter
Introduced retry on upsert from SDS
Changed the retry argument to functional params
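A rough sketch of the sink side of these changes, under stated assumptions. The thread does not spell out where the backoff is applied, so the sketch assumes it is a pause after every flush; `BackoffFor`, `RunFlushes` and `ErrConflict` are hypothetical names, and the debug log is only indicated by a comment.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConflict is a hypothetical stand-in for the store's resource-conflict error.
var ErrConflict = errors.New("resource conflict")

// BackoffFor derives the backoff as one tenth of the flush interval,
// so a shorter interval automatically gets a shorter backoff.
func BackoffFor(flushInterval time.Duration) time.Duration {
	return flushInterval / 10
}

// RunFlushes performs n flushes. A conflict is skipped (the next flush will
// carry fresher data anyway); any other error stops the loop. It returns how
// many flushes were saved and how many were skipped.
func RunFlushes(n int, flushInterval time.Duration, upsert func() error) (int, int, error) {
	backoff := BackoffFor(flushInterval)
	saved, skipped := 0, 0
	for i := 0; i < n; i++ {
		err := upsert()
		switch {
		case err == nil:
			saved++
		case errors.Is(err, ErrConflict):
			skipped++ // real code would log this at debug (V1) level
		default:
			return saved, skipped, err
		}
		time.Sleep(backoff) // assumed: pause to keep a gap between two upserts
	}
	return saved, skipped, nil
}

func main() {
	calls := 0
	// Simulated upsert that conflicts on the second flush only.
	upsert := func() error {
		calls++
		if calls == 2 {
			return ErrConflict
		}
		return nil
	}

	saved, skipped, err := RunFlushes(3, 10*time.Millisecond, upsert)
	fmt.Println("backoff:", BackoffFor(10*time.Millisecond))
	fmt.Println("saved:", saved, "skipped:", skipped, "err:", err)
}
```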


jakubdyszkiewicz added 2 commits December 1, 2020 17:51

chore(kuma-cp) retry resource on conflict and log on DEBUG

562d06d

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>

refactor retry to opts

bd64fad

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>

jakubdyszkiewicz force-pushed the fix/retry-resource-conflict branch from 28bed54 to bd64fad December 1, 2020 17:11

jakubdyszkiewicz marked this pull request as ready for review December 1, 2020 17:11

jakubdyszkiewicz requested a review from a team as a code owner December 1, 2020 17:11

jakubdyszkiewicz requested a review from lobkovilya December 1, 2020 17:14

lobkovilya reviewed Dec 1, 2020


pkg/kds/server/status_sink.go

nickolaev added the backport-to-stable label Dec 2, 2020

fix missing log assignment

3363998

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>

lobkovilya approved these changes Dec 2, 2020


jakubdyszkiewicz merged commit 5e6c524 into master Dec 2, 2020

jakubdyszkiewicz deleted the fix/retry-resource-conflict branch December 2, 2020 12:17

mergify bot pushed a commit that referenced this pull request Dec 2, 2020

fix(kuma-cp) handle resource conflicts more gracefully (#1236)

cef8096

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com> (cherry picked from commit 5e6c524)

mergify bot mentioned this pull request Dec 2, 2020

fix(kuma-cp) upsert with retry on conflict (bp #1236) #1262

Merged

jakubdyszkiewicz pushed a commit that referenced this pull request Dec 2, 2020

fix(kuma-cp) handle resource conflicts more gracefully (#1236)

12be05f

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
