Watcher fails with 410 "GONE" #45

Closed
chazsconi opened this issue May 7, 2018 · 15 comments

@chazsconi
Collaborator

If one particular event type is being watched (e.g. Namespace changes) and no new events of that type occur for a long time, other events that are not being watched still cause the RV to increase.

Eventually the RV stored in the watcher becomes too old to be used, and the watcher fails with a 410 response code.

If an original RV was passed to the watcher, the supervisor restarts it with the same init parameters and it fails again.

Possible solutions:

  1. Parse the message that comes back to extract a later RV

  2. Send a message to the caller to inform it that it can no longer watch and it should start again

  3. Somehow obtain the latest RV from K8S without potentially missing events

Logs below:

** (FunctionClauseError) no function clause matching in Kazan.Watcher.extract_rv/1
    (kazan) lib/kazan/watcher.ex:235: Kazan.Watcher.extract_rv(%{"apiVersion" => "v1", "code" => 410, "kind" => "Status", "message" => "too old resource version: 20293058 (20295457)", "metadata" => %{}, "reason" => "Gone", "status" => "Failure"})
    (kazan) lib/kazan/watcher.ex:219: anonymous fn/4 in Kazan.Watcher.process_lines/2
    (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
    (kazan) lib/kazan/watcher.ex:158: Kazan.Watcher.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: %HTTPoison.AsyncChunk{chunk: "{\"type\":\"ERROR\",\"object\":{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"too old resource version: 20293058 (20295457)\",\"reason\":\"Gone\",\"code\":410}}\n", id: #Reference<0.4253789252.1463549953.187567>}
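
For reference, the crash happens because the decoded chunk is a Kubernetes Status object rather than a watched resource, so the resource-version extraction has nothing to match on. Below is a minimal sketch of a clause that would at least recognise the 410 payload; it is illustrative only and, apart from the extract_rv/1 name visible in the trace, does not reflect Kazan's actual internals.

```elixir
# Sketch only: distinguishing the 410 "Status" object from a normal
# watch event before the resource version is extracted.
defmodule WatchStatusSketch do
  # Normal watch events carry the object's metadata, which holds the RV.
  def extract_rv(%{"metadata" => %{"resourceVersion" => rv}}), do: rv

  # The failure case from the logs above: kind "Status", reason "Gone".
  # Returning a tagged tuple lets the caller decide whether to restart
  # the watch or notify the consumer instead of crashing.
  def extract_rv(%{"kind" => "Status", "reason" => "Gone", "message" => message}) do
    {:gone, message}
  end
end
```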
@chazsconi
Collaborator Author

@obmarg any further ideas on how to solve this? My current preference is option (2) above.

@obmarg obmarg added the bug label May 7, 2018
@obmarg
Owner

obmarg commented May 7, 2018

@chazsconi do you have any code that can reproduce this situation? I'd like to investigate a little before deciding on a solution.

@chazsconi
Collaborator Author

chazsconi commented May 7, 2018

@obmarg I will try to reproduce this with a test; however, the scenario in which I saw it happen involved over 48 hours of no events being received by the watcher, so it might not be simple to trigger.

After seeing this ticket, I think the resource version getting too old is normal behaviour:

kubernetes/kubernetes#55230

@obmarg
Owner

obmarg commented May 7, 2018

@chazsconi Yeah, I agree it seems like a normal thing to happen, and it's definitely something Kazan should handle. Since it seems to happen in response to other things changing in the database, I was hoping there'd be a simple way to induce the issue just by changing tons of un-watched resources, though I guess it depends on precisely what changes are needed to trigger it.

I did come across this PR that updates kubectl to handle something similar. It seems (though I've not yet tested it) that if you specify a resourceVersion of "0" you'll be sent the latest resource, and then any updates to that resource. I was wondering if we could use this for the issue you're experiencing, though I wanted to play around with it to see whether it made sense before suggesting anything...

@chazsconi
Collaborator Author

@obmarg I tried to reproduce the problem by creating a watch for pod changes in a namespace and then making changes to a config map, but even after 75000 changes to the config map the problem did not occur on the watch.

However, as you say, specifying a resource version of "0" appears to send only new events. So a simple solution is to fix the code to revert to resource version "0" when the "too old resource version" problem occurs.

However, there is a possibility of events being missed between the original watch failing and restarting it with resource version "0". If this is critical, the consumer can be informed so that it can refetch everything again.

Therefore, providing both options in the library and letting the consumer choose which they prefer might be the best approach (see the sketch after this list):

  1. Inform the consumer and terminate the watch
  2. Restart from resource version "0"
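
A rough sketch of what option (2) could look like, assuming the watcher accepts :send_to and :resource_version options (the option names are assumptions, and as a later comment shows, restarting from "0" turned out to resend old events):

```elixir
# Rough sketch of option (2): restart the watch from resource version "0"
# after a 410. The :send_to and :resource_version option names are
# assumptions about the watcher API, not checked against a release.
defmodule RestartFromZeroSketch do
  def restart_after_gone(request) do
    # Accepts that events occurring between the failure and the restart
    # may be missed; option (1) would notify the consumer here instead.
    Kazan.Watcher.start_link(request, send_to: self(), resource_version: "0")
  end
end
```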

@obmarg
Owner

obmarg commented May 9, 2018

@chazsconi Yeah, maybe we'll end up having to do that, though I don't know if I'm too happy with an option that forces people to pick between handling a semi-rare event (that we can't explain well enough to say when it'll happen) and potentially losing events. Since this seems to happen when the watched resource hasn't been updated for days, the chance of it changing in the second or so while we're restarting the watch seems low.

I definitely want to do a bit more investigation around this before settling on a fix, though if you're blocked I'd be happy to release a temporary fix.

@chazsconi
Collaborator Author

@obmarg I actually saw the problem occur again today. An event had been received on the watched resource around an hour beforehand, so it certainly isn't always days before this happens.

We are currently using K8S v1.8.8. We are planning to upgrade to v1.9.7 next week, and I'd be interested to see if the problem still occurs after the upgrade.

@chazsconi
Collaborator Author

@obmarg I'm currently running against my fork in production with this fix:
chazsconi@42b297d

As I cannot reproduce this in tests, I'm waiting for the scenario to recur to confirm that this fixes the problem. After that I can open a PR.

@chazsconi
Collaborator Author

chazsconi commented May 17, 2018

Unfortunately this does not work. Setting the resource version to "0" causes some previous changes to be resent, and the same happens if no resource version is set at all.

After checking the K8S docs (https://kubernetes.io/docs/reference/api-concepts/), I found that the correct way to handle this is to refetch the resources:

When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.

Therefore, I believe that option (1) that I listed above is the solution, i.e. inform the consumer and let it decide how to deal with the situation.
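
A consumer-side sketch of that recommendation is below: on a 410, clear the cache, perform a fresh list, and restart the watch from the list's resource version. The Kazan calls shown (list_namespace!/0, Kazan.run!/1 and the watcher options) are assumptions about the generated API shape rather than tested code.

```elixir
# Sketch of the list-then-watch resync described in the Kubernetes docs.
# The API names below (list_namespace!/0, Kazan.run!/1, watcher options)
# are assumptions about Kazan's generated API, not verified calls.
defmodule ResyncSketch do
  def resync do
    # 1. Clear the local cache and refetch everything with a list operation.
    list = Kazan.Apis.Core.V1.list_namespace!() |> Kazan.run!()
    rebuild_cache(list.items)

    # 2. Restart the watch from the resource version returned by that list,
    #    so no events between the list and the new watch are missed.
    rv = list.metadata.resource_version

    Kazan.Apis.Core.V1.list_namespace!()
    |> Kazan.Watcher.start_link(send_to: self(), resource_version: rv)
  end

  # Placeholder for whatever local cache the consumer keeps.
  defp rebuild_cache(_items), do: :ok
end
```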

@obmarg
Owner

obmarg commented May 31, 2018

Apologies for my complete silence on this. I’ve been in the process of finding a new job, which has taken up just about all of my time. It's almost sorted now, so I'll be ready to give feedback soon.

@chazsconi
Collaborator Author

No problem. We've been running a fork using this commit in production for the last couple of weeks: chazsconi@a201af7. It appears to solve the problem, although, of course, we need to handle the 410 in the consumer.

This is a breaking change, as the messages sent to the consumer are now a different struct, so that the from can be included.
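
For consumers of that fork, handling presumably boils down to a GenServer clause that matches the new event struct and triggers a resync when the 410 arrives. The struct name and fields below (%Kazan.Watcher.Event{} with :type, :from and :object) are assumptions based on the description above; check the linked commit for the real shape.

```elixir
# Hypothetical consumer-side handling of the watcher's 410 notification.
# The %Kazan.Watcher.Event{} struct and its :type/:from/:object fields
# are assumptions based on the description above, not the exact shape
# used in the linked commit.
defmodule WatchConsumerSketch do
  use GenServer
  require Logger

  def init(state), do: {:ok, state}

  # The 410 case: the watcher reports it can no longer watch, so the
  # consumer clears its cache and performs a list-then-watch resync.
  def handle_info(%Kazan.Watcher.Event{type: :gone, from: watcher}, state) do
    Logger.warn("Watcher #{inspect(watcher)} returned 410 Gone, resyncing")
    {:noreply, resync(state)}
  end

  # Normal watch events carry the changed object and the event type.
  def handle_info(%Kazan.Watcher.Event{type: type, object: object}, state) do
    {:noreply, apply_event(state, type, object)}
  end

  defp resync(state), do: state
  defp apply_event(state, _type, _object), do: state
end
```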

@obmarg
Owner

obmarg commented Jun 7, 2018

Ok, so I'd ideally like a way for Kazan to handle this automatically. However, it's not clear what the best way to do that is, and forcing users to handle it is at least a step in the right direction (since it's currently unhandleable). If you want to make a PR from that branch, I'd be happy to accept it after a bit of review.

@chazsconi
Collaborator Author

@obmarg Sorry for the delay - PR now created.

@obmarg
Owner

obmarg commented Jun 18, 2018 via email

@obmarg
Owner

obmarg commented Oct 16, 2018

Fixed in #47

@obmarg obmarg closed this as completed Oct 16, 2018