Race Condition in Event Queuing When MODIFIED Events Arrive After CREATE but Before last-handled-configuration Was Written #729
Hello, @paxbit. Yes, you might be right with this behaviour. Kopf has some protection measures against quickly arriving events — the "batch window" (0.1 seconds by default).

One sure problem that I see in your operator is using a blocking `time.sleep()` inside an `async` handler. Such a blocking freeze of the whole process is functionally equivalent to the operator process/pod being terminated/suspended for 2 seconds (or, say, 2 minutes), and then restarted/resumed — to find out that the pod has changed. A better and proper way would be to either do the sleep asynchronously, or to make the handler fully synchronous (so it runs in a thread and does not block the event loop):

```python
# EITHER:
@kopf.on.startup()
async def create_pod(**_):  # <<< async!
    await asyncio.sleep(2)  # <<< async!

# OR:
@kopf.on.startup()
def create_pod(**_):  # <<< not async!
    time.sleep(2)  # <<< not async!
```

This will sufficiently unblock the event loop to react to the pod's creation fast enough, and to put the last-handled configuration in place fast enough too. As a result, the update handler will be called properly in the majority of cases. Since pykube itself is synchronous, I recommend the sync way (the second variant).

However, there will be a short period when this issue can appear — maybe as long as a fraction of a second on slow connections (operator<->cluster), and near-instant (but non-zero) on fast connections. Specifically, if a pod is changed by external actors between its actual creation and the end of the above-mentioned "batch window". In that case, an already changed pod will arrive to Kopf and will be seen as a creation with an already modified state.

Almost nothing can be done in that case. Kopf implements the state-driven approach of Kubernetes, so the first time it "sees" the object's state is when it is already quick-modified. The initial non-modified state does not exist for Kopf, so there is nothing to compare against. You can try setting a smaller batch window, but that only narrows the race window rather than eliminating it.

Blindly assuming the goals of your operator, I would rather recommend one of these 3 approaches instead:

Approach 1: React to all kinds of resource changes for that field, including its creation; or, better said, its first appearance.

```python
@kopf.on.create('pods', field='status')
@kopf.on.update('pods', field='status')
#@kopf.on.resume('pods', field='status')  # might be a good idea
async def status_updated(name, status, **_):
    print(f"{name}'s status was updated: {status.get('phase')}")
```

Approach 2: React to the low-level events only. It still is the latest state within the batch window, but you can reduce the codebase for that. A downside: such handlers are fire-and-forget; they are not retried in case of errors.

```python
@kopf.on.event('pods', field='status')
async def status_updated(name, status, **_):
    print(f"{name}'s status was updated: {status.get('phase')}")
```

Approach 3: Use the on-field handlers, which are the equivalent of on-creation + on-update as above, with some subtle nuances.

```python
@kopf.on.field('pods', field='status')
async def status_updated(name, status, old, new, **_):
    print(f"{name}'s status was updated: {status.get('phase')}")
```

In all approaches, on the first appearance of the object, it will be a change of `old=None` to the first seen value of the field.
Hello, @nolar,
Errm, yes, I'm quite aware of that, and I actually thought about mentioning it but figured it'd be too obvious ;) But I guess I understand your maintainer perspective and why pointing out loop blocking comes to mind first. Maybe I could have been clearer about that.
Yes, this is what I see.
That's what I meant.
But it's still the case that kopf sees the CREATE event before the MODIFIED event. It's just that all of kopf's iterations necessary to patch a storage annotation into the resource have not happened yet. This can be detected, I think. The way I read kopf's code, a MODIFIED event without a storage annotation is undefined in terms of triggering any discrete handling of that fact. Couldn't the batching (debounce) implementation maybe track event sequences until the pending storage patching is done? I mean, as soon as kopf sees a resource event it could, while debouncing, decide whether the resource is managed by it. After doing that, it could remember the pending storage patch in-memory and implicitly apply it to every consecutive event for the resource until it sees the patch in the data of an incoming event. I think that would go a long way in putting the two racers on separate tracks.
So, then, I'd rather not ;)
About the 3 approaches you suggested:
I would say we have not yet seen or imagined all possible side-effects of this race. For instance, I think it should be possible not just to lose handler invocations but to trigger more handler invocations than desired. If an in-flight handler progress patch is preceded by one or more MODIFIED events in quick succession, kopf might not see that e.g. the handler was already started, or did already terminate, or has already exhausted all configured retries. Does that sound reasonable, or would that somehow be prevented?
Hi @nolar, I had a look at the batching and storage implementations. I did not implement anything yet, but what do you think about the following ideas?

```
resource_storage = storage.get_from_body(body)
in_memory_storage = memory.for_resource(body)

if not in_memory_storage and not resource_storage:
    # no storage at all
    return empty

if in_memory_storage and not resource_storage:
    # 1st patch, might still be underway, or lost. Doesn't matter.
    return in_memory_storage

if not in_memory_storage and resource_storage:
    # patch "arrived" at the resource and memory was already invalidated.
    return resource_storage

if in_memory_storage is_older_or_equals resource_storage:
    # patch "arrived" at the resource. Invalidate memory.
    purge in_memory_storage
    return resource_storage

if in_memory_storage is_newer_than resource_storage:
    # patch might still be underway, or lost. Doesn't matter.
    return in_memory_storage
```

What do you think?
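For concreteness, a rough Python rendering of the pseudo-code above. All names here are hypothetical and only mirror the pseudo-code; they are not existing Kopf APIs:

```python
from typing import Callable, Mapping, Optional

def effective_storage(
    resource_storage: Optional[Mapping],   # diffbase/progress as read from the body
    in_memory_storage: Optional[Mapping],  # the last patch kopf sent, kept in memory
    is_older_or_equal: Callable[[Mapping, Mapping], bool],  # hypothetical ordering
) -> Optional[Mapping]:
    if not in_memory_storage and not resource_storage:
        return None                   # no storage at all
    if in_memory_storage and not resource_storage:
        return in_memory_storage      # 1st patch, might still be underway, or lost
    if not in_memory_storage and resource_storage:
        return resource_storage       # patch arrived; memory was already invalidated
    if is_older_or_equal(in_memory_storage, resource_storage):
        # patch arrived at the resource; the in-memory copy would be purged here
        return resource_storage
    return in_memory_storage          # patch might still be underway, or lost
```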
Sorry, I didn't have any intention to question your skills. But this was an obvious problem, so indeed, I just reacted to that.
On the one hand, while individual objects are parallelised (via the multiplexing logic), each individual object is always handled sequentially. It means the "worker" for that individual object is not doing anything while its task is somewhere in the handler, so it will not start a new handler until the previous one is finished. The events are accumulated in the backlog queue for that worker. Once the handler is finished, the call stack for that individual resource goes back to the top, i.e. to the worker, where it depletes the whole backlog queue to get the latest seen state to be processed. The "batch window" was added to ensure that even changes introduced by that handler are taken into account.

On the other hand, you are right. If the operator<->cluster latency is high and exceeds the batching window, it might be so that we have a sequence of states "A B pause C D"; the handlers for the state B are triggered after the first pause; while they work, states C D arrive into the queue. If the handlers do patch the resource, they create a new state E. In normal cases, it should end up with the sequence "A B pause C D E pause", and the new E state will be processed as desired. However, if the latency is high, it might be so that the sequence looks like "A B pause C D latency E pause". As a result, the handlers will be invoked for state D when the resource is already in state E, and Kopf just does not know about that. And so, the double-execution is possible.

As a possible fix, patching the resource by Kopf should remember its resource version (r.v.) and return it back to the worker. The worker must deplete the backlog "at least" till that reported r.v. is seen — before switching to the batching mode. In other words, if it sees the backlog as depleted, but the r.v. is not yet seen, it should wait an extra time. Since resource versions are documented as "opaque", we cannot even compare them for ordering, only for equality.

However, as a safety measure against skipping the states (I don't trust the K8s watch-stream either), a secondary time-window should be added: "deplete the backlog until a specific r.v. is seen, but at most for 5.0 seconds, before considering the batch as over; then, switch to regular batching with the 0.1s window". Something like this.

This issue seems complicated and deep, but very interesting 😅 I will try to reproduce the latency issues locally (I remember there were some tools to simulate slow TCP) — and to see how these cases can be optimized. Sadly, I have to do my regular work in the daytime. So I can get to this problem only on the coming weekend at the earliest, maybe the next one. Luckily, there are no big major roadmap features to add now, so I can focus on stability & optimizations.
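A minimal sketch of the idea of depleting the backlog at least until the reported resource version is seen, capped by a safety window. This is not Kopf's actual worker code; the backlog queue, the raw-event shape, and the constants are assumptions for illustration:

```python
import asyncio
from typing import Any, Optional

BATCH_WINDOW = 0.1     # the regular batching window
MAX_EXTRA_WAIT = 5.0   # safety cap: never wait for the expected r.v. forever

async def deplete_backlog(backlog: asyncio.Queue, expected_rv: Optional[str]) -> Any:
    """Return the freshest raw watch-event, waiting (bounded) for the expected r.v."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + MAX_EXTRA_WAIT
    seen_expected = expected_rv is None
    latest = None
    while True:
        # Once the expected r.v. has been seen (or none is expected), fall back
        # to the short batching window; otherwise keep waiting up to the deadline.
        timeout = BATCH_WINDOW if seen_expected else max(0.0, deadline - loop.time())
        try:
            event = await asyncio.wait_for(backlog.get(), timeout=timeout)
        except asyncio.TimeoutError:
            return latest  # the batch is over (or the safety cap was reached)
        latest = event
        rv = event.get('object', {}).get('metadata', {}).get('resourceVersion')
        if expected_rv is not None and rv == expected_rv:
            seen_expected = True
```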
Hi @nolar, Thanks for the answer and also for taking the time to look into it!
As a side-note: the latency can be added by using toxiproxy by Shopify: a local proxy port (11223 in the snippet below) forwards to the Kubernetes API server with the added latency. When the proxy is injected into the kubeconfig, the communication with `kubectl` is then slowed down as well.

(At least 2 requests are made: one for the API resources available; one for the actual pod list; total 2*3s=6s; sometimes 18s for reasons unknown.)

Only the "downstream" (client->server) is slowed down, not the "upstream" (server->client). To do this to Kopf or the operator only, without affecting kubectl:

```python
import logging
import kopf
import dataclasses

@kopf.on.login()
def delayed_k3s(**_):
    conn = kopf.login_via_pykube(logger=logging.getLogger('xxx'))
    if conn:
        return dataclasses.replace(conn, server=conn.server.rsplit(':', 1)[0] + ':11223')

# from examples/01-minimal:
@kopf.on.create('kopfexamples')
def create_fn(spec, **kwargs):
    print(f"And here we are! Creating: {spec}")
    return {'message': 'hello world'}  # will be the new status
```

Then we get a 3-second delay on every "client->server" call (~0s on "server->client" streaming!).
In response to the comments:

To no. 1: I agree in principle. Though, I have some doubts about the implementation. Per-resource configuration looks questionable to me. The batching window is a low-level hack mostly for latency issues, not for operating logic.

Where I see the distinction line is event-watching vs. change-detecting handlers (let's keep daemons & timers & indices aside for now).

The change-detecting handlers (on-create/update/delete/resume) are state-driven — the same as Kubernetes is often described to be. If you have a state change of Ready→NotReady→Ready, and the handler fails on Ready→NotReady, it will be retried, but the state will be (2nd)Ready again. More than that, Kopf would have to remember the whole chain of changes and process/retry the old changes while the resource can be in a much newer state for a long time. So, by design, these handlers should not see all events, only the differences between the last-processed & current states. Worth mentioning that change-detecting handlers are not a Kubernetes feature; they are Kopf's own addon, so there is no baseline to align to.

However, the event-watching handlers (on-event) give no promises on which events will be seen or not seen. In fact, there might be an implicit promise and an expectation that ALL events are seen as sent by Kubernetes. And that implicit promise is obviously broken due to batching. That didn't cause any problems by now, but I would not claim that it is how it should be forever. It could be changed.

For those rare cases when every interim event counts, the operator developers can implement their own queues, put events into those queues in the on-event handlers, and process them elsewhere (in daemons?) — with the guarantee that the events are delivered precisely as K8s does that (except for downtimes).

A possible solution might be a redesign of the event-watching machinery so that every event is fed through for processing, while the change-detection still reacts only to actual differences.

Would such a logic solve the limitations of Kopf for the use-cases you mentioned?
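A hedged sketch of the "own queues" idea mentioned above: an on-event handler forwards every raw event into a per-object queue, and a daemon consumes it at its own pace. The queue registry is the operator's own bookkeeping, not a Kopf feature:

```python
import asyncio
import kopf

queues = {}  # (namespace, name) -> asyncio.Queue

def _queue_for(namespace, name) -> asyncio.Queue:
    return queues.setdefault((namespace, name), asyncio.Queue())

@kopf.on.event('pods')
async def enqueue_event(event, namespace, name, **_):
    # Every watch-event as sent by Kubernetes (except for downtimes) goes in.
    await _queue_for(namespace, name).put(event)

@kopf.on.daemon('pods')
async def consume_events(stopped, namespace, name, logger, **_):
    queue = _queue_for(namespace, name)
    while not stopped:
        try:
            event = await asyncio.wait_for(queue.get(), timeout=1.0)
        except asyncio.TimeoutError:
            continue  # check the stop-flag periodically
        logger.info(f"Seen event {event.get('type')} for {namespace}/{name}")
```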
To no. 2: A similar approach was once offered for some other unrelated problem. There are reasons why patches should not be handled any other way than by applying them to the K8s API.

Reason 1: No matter how much Kopf adheres to RFCs, there might be differences in how this patching is implemented in Kopf and how it is implemented in Kubernetes. I would not trust myself here in doing things correctly. Instead, I prefer to always send it to the server side and get back the patched body of the resource.

Reason 2: Even if we were able to implement proper patching in-memory the same way as Kubernetes does, or even the comparison of multiple/combined patches against the resulting body, there are server-side changes outside of Kopf's control: e.g., mutating admission webhooks. Even within one single patch request, the resulting object can be changed server-side beyond the patch itself.

The latter is basically a showstopper: patches cannot be interpreted client-side in any way. The whole patching logic must be done server-side. And so, to track the resource changes, we can only go for the "opaque" `resourceVersion`.
Notes for myself: It seems that ToxiProxy cannot help here. It slows down all requests, including patching. If patching is slowed down, the operator blocks for the whole duration of the request sent & response received. As a result, all changes arriving into Kubernetes during the patch end up in a single batch on the operator side.

It is impossible to introduce latency only for the watching streams via ToxiProxy. However, it is possible with strategic placement of sleeps:

```python
# kopf.clients.watching::watch_objs()
...
async with response:
    async for line in _iter_jsonlines(response.content):
        await asyncio.sleep(3.0)  # <<<<<<<<<<<<<<<<<<<<<<<<<<<
        raw_input = cast(bodies.RawInput, json.loads(line.decode("utf-8")))
        yield raw_input
...
```

With that single-line hack, the double-execution is reproducible with a regular operator (examples/01-minimal) with no authentication handlers, and a CLI script. This happens 100% of the time, without freezing the event loop artificially.

The timeline looks like this (with patch-blocking removed):
Yes, I think it would. This is how I was thinking about the topic too. Leave the change detection via old/new diff in place while modifying the stream multiplexing so that it feeds every resource event to the change detection, so that it can make correct decisions about actual changes. Or, put differently, I think it is OK, actually desirable, to drop events in well-known circumstances. Detecting that nothing of the essence changed would be one. This of course presumes good knowledge about what is of the essence, and that all assumptions kopf makes about this ideally overlap 100% with the assumptions of an operator implementor.
I completely agree in principle. However, I think this small change to my proposal should make it work: simply never patch in-memory when a resource is still missing a storage structure, but use the in-memory representation in change detection and handler progress calculation. At the end of the day, it cannot matter by definition whether kopf got its progress/diffbase storage from the actual resource data or from a cached in-memory representation. kopf today already expects the storage it patches onto the resource to come back verbatim. I think of it as a cache.

The idea to patch the resource in-memory came from my desire to fully encapsulate the proposal in the storage implementation.

About the resource-version thing: I never thought about using kube's `resourceVersion` for this.

What do you think? Does that fix the idea?
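A hedged sketch of this "cache, never patch in-memory" refinement, written as a wrapper around Kopf's annotations-based diffbase storage. The `fetch(body=...)`/`store(body=..., patch=..., essence=...)` signatures are assumed to match Kopf's storage interface and are not verified here; treat it purely as an illustration of the caching idea:

```python
import kopf

class CachingDiffBaseStorage(kopf.AnnotationsDiffBaseStorage):
    """Remember the last stored essence per object until it is seen on the body."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache = {}  # (namespace, name) -> essence

    @staticmethod
    def _key(body):
        meta = body.get('metadata', {})
        return meta.get('namespace'), meta.get('name')

    def store(self, *, body, patch, essence):
        # Keep the essence we are about to send; the patch still goes to the API.
        self._cache[self._key(body)] = essence
        super().store(body=body, patch=patch, essence=essence)

    def fetch(self, *, body):
        stored = super().fetch(body=body)
        if stored is not None:
            # The annotation has arrived back on the resource: drop the cache.
            self._cache.pop(self._key(body), None)
            return stored
        # The annotation is not there (yet): fall back to the cached essence.
        return self._cache.get(self._key(body))

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.persistence.diffbase_storage = CachingDiffBaseStorage()
```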
FYI, I just noticed a mistake I made. Of course it is not enough to just deal with
Long story short

MODIFIED events on just-CREATED resources might arrive before the last-handled-configuration was written. This leads to the MODIFIED event being treated as `Reason.CREATE` b/c its `old` version is still empty.

Loading the (empty) `old` manifest is tried here:
`kopf/kopf/reactor/processing.py`, line 153 (at 1d657e2)

Falsely setting the cause reason to `CREATE` as a result of the empty `old` manifest is done here:
`kopf/kopf/reactor/causation.py`, lines 197 to 200 (at d29ac2b)

The handler is not being called b/c its cause does not match the resource-changing cause:
`kopf/kopf/reactor/registries.py`, line 201 (at 8fb507c)

Description

If the handler creating a resource via 3rd-party means like pykube still spends a small amount of time after creating the resource before returning, a quick update-after-create to the resource will queue up MODIFIED events before kopf had a chance to write its last-handled-config.

The following code snippet reproduces this. We had a situation where a 2-container pod had one container immediately crashing after creation. When this happened quickly enough after the pod was created, the handler designated to deal with crashing containers was never called. Since I'm working from home via a DSL link to the data center where the cluster lives, the varying connection latency over the day through the VPN gateway is sometimes enough to trigger this. But only after today's lucky setting of a breakpoint (introducing a sufficient handler delay) right after the pod creation was I able to reliably reproduce it and find the root cause.

All the handler does, btw, after creating the pod is creating an event about the fact as well as setting kopf's `patch` dict.

I believe this one to be broken at the queuing design level and have no good idea how to fix it. After looking at this, I'm not sure the current implementation can be fixed for correctness without substantial rewrites (`memories`, maybe?). The assumptions currently made around last-handled-configuration can never be fully upheld as long as third parties other than kopf (a.k.a. pykube, kubernetes itself) modify resources too - which will of course always be true. However, I'd be very happy to be proven wrong. Maybe the already-queued MODIFIED events sans kopf storage annotations can be augmented in-memory with the missing data by remembering the CREATED event long enough. IDK.
The following script:
- creates the pod and then `time.sleep`s for 2s;
- `on_update(...)` is never called and "wonky's status was updated: ..." is missing from the output.

To make it work:
- Comment out the `time.sleep(2)` after pod creation. The `on_update(...)` handler will be called.

Note: Running this script the first time might actually trigger `on_update`. This would be b/c the `alpine` image might need to be pulled. If this takes longer than the 2s sleep, there will be MODIFIED events after that, and kopf might have had enough time to write a last-handled-config. If the image is already there, it should fail the first time - except maybe when run on very slow or loaded clusters, so the container takes longer to crash. Simply increase the sleep to 3-4s then to still trigger it.

event_race_bug.py
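The attached `event_race_bug.py` is not reproduced inline; below is only an assumed sketch of its shape, based on the description above (an async startup handler creating the pod via pykube and then blocking with `time.sleep(2)`, plus an update handler on the pod's status). The handler names and the pod spec are illustrative:

```python
import time
import kopf
import pykube

@kopf.on.startup()
async def create_pod(**_):
    api = pykube.HTTPClient(pykube.KubeConfig.from_env())
    pod = pykube.Pod(api, {
        'apiVersion': 'v1',
        'kind': 'Pod',
        'metadata': {'name': 'wonky', 'namespace': 'default'},
        'spec': {'containers': [
            {'name': 'crasher', 'image': 'alpine', 'command': ['false']},
        ]},
    })
    pod.create()
    time.sleep(2)  # blocks the event loop; the race window described above opens here

@kopf.on.update('pods', field='status')
async def on_update(name, status, **_):
    print(f"{name}'s status was updated: {status.get('phase')}")
```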
The exact command to reproduce the issue
I hope somebody proves me wrong with my analysis, I really do, because if I'm correct, it means that by definition I'll never be able to implement a correctly behaving operator using kopf, as I would have to expect subtle errors like this one without any way to detect them through kopf's API.
Environment