Watcher fails with 410 "GONE" #45
@obmarg any further ideas on how to solve this? My current preference is option (2) above.
@chazsconi do you have any code that can reproduce this situation? I'd like to investigate a little before deciding on a solution.
@obmarg I will try to reproduce this with a test, however the scenario in which I saw it happen was after over 48 hours of no events being received by the watcher, so it might not be simple. After seeing this ticket, I think the resource version getting too old is normal behaviour.
@chazsconi Yeah, I agree it seems like a normal thing to happen, and it's definitely something Kazan should handle. Since it seems to happen in response to other things changing in the database, I was hoping there'd be a simple way to induce the issue by just changing tons of un-watched resources. Though I guess it depends precisely what changes need to be made in order to induce the issue. I did come across this PR that updates kubectl to handle something similar. It seems like (though I've not yet tested) if you specify a resourceVersion of "0" you'll get sent the latest resource, and then any updates to that resource. I was wondering if we could utilize this for the issue you're experiencing, though I wanted to play around with the issue to see whether it made sense before suggesting...
@obmarg I tried to reproduce the problem by creating a watch for pod changes in a namespace and then making changes to a config map, but even after 75,000 changes to the config map the problem did not occur on the watch. However, as you say, specifying a resource version of "0" appears to send only new events. So a simple solution is to fix the code to revert to resource version "0" when the "too old resource version" problem occurs. However, there is a possibility of events being missed between the original watch failing and restarting it with resource version "0"; if this is critical, the consumer can be informed so that it can refetch everything again. Therefore maybe providing both options in the library would be best, allowing the consumer of the library to choose which they prefer:

1. Inform the consumer that the watch can no longer continue, and let it refetch everything and restart the watch.
2. Silently restart the watch with resource version "0", accepting that events may occasionally be missed.
@chazsconi Yeah, maybe we'll end up having to do that. Though I don't know if I'm too happy with an option that forces people to pick between handling a semi-rare event (that we can't explain enough about to say when it'll happen) and potentially losing events. Though since this seems to happen when the watched resource hasn't been updated for days, the chance of it changing in the second or so while we're restarting the watch seems low. I definitely want to do a bit more investigation around this before settling on a fix, though if you're blocked I'd be happy to release a temporary fix.
@obmarg I actually saw the problem occur again today. An event had been received on the watched resource around 1 hour before, so it's certainly not days before this happens. We are currently using K8S v1.8.8. We are planning to upgrade to v1.9.7 next week, and I'd be interested to see if the problem still occurs after the upgrade.
@obmarg I'm currently running against my fork in production with this fix: As I cannot reproduce this in tests, I'm waiting for the scenario to re-occur to check that this fixes the problem. After that I can create this as a PR.
Unfortunately this does not work. Setting the resource version to "0" causes some previous changes to be resent, as does not setting a resource version at all. After checking the K8S docs, I found that the correct way to handle this is to refetch the resources. Therefore, I believe that option (1) that I listed above is the solution, i.e. inform the consumer, and let the consumer decide how to deal with it.
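The list-then-watch recovery described here (on a 410, refetch the resources to obtain a fresh resource version, inform the consumer so it can rebuild its state, then resume watching) can be sketched with a small in-memory toy. This is a Python illustration only: `FakeApi`, `GoneError`, `watch_with_resync`, and the compaction behaviour are all assumed stand-ins, not Kazan's API or the real Kubernetes API.

```python
class GoneError(Exception):
    """Stands in for the HTTP 410 'Gone' (resourceVersion too old) response."""

class FakeApi:
    """In-memory stand-in for a Kubernetes API server."""

    def __init__(self):
        self.events = []    # (resource_version, event) pairs
        self.oldest_rv = 0  # history below this RV has been compacted away

    def add_event(self, rv, event):
        self.events.append((rv, event))

    def compact(self, up_to_rv):
        # Simulate the server discarding old history.
        self.oldest_rv = up_to_rv
        self.events = [(rv, e) for rv, e in self.events if rv >= up_to_rv]

    def list(self):
        # A list call returns current state plus the latest resource version.
        latest = max((rv for rv, _ in self.events), default=self.oldest_rv)
        return {"resourceVersion": latest}

    def watch(self, from_rv):
        if from_rv < self.oldest_rv:
            raise GoneError("410: resourceVersion too old")
        return [e for rv, e in self.events if rv > from_rv]

def watch_with_resync(api, rv, on_gone):
    """Watch from `rv`; on a 410, relist for a fresh RV and notify the consumer."""
    try:
        return rv, api.watch(rv)
    except GoneError:
        fresh = api.list()["resourceVersion"]
        on_gone()  # the consumer can now refetch everything it cares about
        return fresh, api.watch(fresh)
```

The key design point from the thread is that the library cannot recover the missed events itself, so `on_gone` pushes that decision to the consumer rather than silently resuming.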
Apologies for my complete silence on this. I've been in the process of finding a new job, which has taken up just about all of my time. It's almost sorted now though, so I'll be ready to give feedback soon.
No problem. We're currently running in production with a fork using this commit: This is a breaking change, as the messages now sent to the consumer are a different struct, so the
Ok, so I'd ideally like a way for Kazan to handle this automatically. However, it's not clear what the best way to do that is, and forcing users to handle it is at least a step in the right direction (since it's currently un-handleable). If you want to make a PR with that branch, I'd be happy to accept it after a bit of review.
@obmarg Sorry for the delay - PR now created.
No problem, thanks for the PR 👍. I'm without access to a laptop for the next couple of weeks. Will give it a look once I'm back.
Fixed in #47
If one particular event type is being watched (e.g. Namespace changes) and no new events of that type happen for a long time, but other, non-watched events do happen, those other events cause the resource version (RV) to increase.
Eventually the RV stored in the watcher is too old to be used, and the watcher fails with a 410 response code.
If an original RV was passed to the watcher, the supervisor restarts the watcher with the same init parameters, and it fails again.
Possible solutions:
1. Parse the message that comes back to extract a later RV.
2. Send a message to the caller to inform it that it can no longer watch and that it should start again.
3. Somehow obtain the latest RV from K8S without potentially missing events.
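To make the restart loop concrete, here is a toy Python sketch of the failure mode: the server has compacted its event history, so any watch started from a too-old RV fails, and a supervisor that reuses the original init parameters can never recover. All names, signatures, and the compaction model are hypothetical stand-ins, not the actual Kazan or Kubernetes code.

```python
class GoneError(Exception):
    """Stands in for the HTTP 410 'Gone' response."""

def watch(rv, oldest_available_rv):
    # History below `oldest_available_rv` has been compacted away by the server.
    if rv < oldest_available_rv:
        raise GoneError("410: resourceVersion %d is too old" % rv)
    return "watch established at rv %d" % rv

def supervise(initial_rv, oldest_available_rv, max_restarts=3):
    """Restart the watcher with the same init parameters each time, as described above."""
    failures = 0
    for _ in range(max_restarts):
        try:
            return watch(initial_rv, oldest_available_rv)
        except GoneError:
            failures += 1  # the same stale RV is reused, so every attempt fails identically
    return "gave up after %d identical failures" % failures
```

This is why the options below all involve getting a *newer* RV (or telling the caller to start over) rather than simply restarting: restarting with the same stale RV is guaranteed to hit the same 410.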
Logs below: