Watcher fails with 410 "GONE" #45

Closed
chazsconi opened this issue May 7, 2018 · 15 comments

@chazsconi
Collaborator

If one particular event type is being watched (e.g. Namespace changes) and no new events of that type occur for a long time, other events that are not being watched still cause the RV to increase.

Eventually the RV stored in the watcher becomes too old to be used, and the watcher fails with a 410 response code.

If an original RV was passed to the watcher, the supervisor restarts it with the same init parameters and it fails again.

Possible solutions:

  1. Parse the message that comes back to extract a later RV

  2. Send a message to the caller to inform it that it can no longer watch and it should start again

  3. Somehow obtain the latest RV from K8S without potentially missing events

Logs below:

** (FunctionClauseError) no function clause matching in Kazan.Watcher.extract_rv/1
    (kazan) lib/kazan/watcher.ex:235: Kazan.Watcher.extract_rv(%{"apiVersion" => "v1", "code" => 410, "kind" => "Status", "message" => "too old resource version: 20293058 (20295457)", "metadata" => %{}, "reason" => "Gone", "status" => "Failure"})
    (kazan) lib/kazan/watcher.ex:219: anonymous fn/4 in Kazan.Watcher.process_lines/2
    (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
    (kazan) lib/kazan/watcher.ex:158: Kazan.Watcher.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: %HTTPoison.AsyncChunk{chunk: "{\"type\":\"ERROR\",\"object\":{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"too old resource version: 20293058 (20295457)\",\"reason\":\"Gone\",\"code\":410}}\n", id: #Reference<0.4253789252.1463549953.187567>}
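
For reference, the crash happens because the decoded chunk is a Kubernetes Status object rather than a watched resource, so the resource-version extraction has nothing to match on. Below is a minimal sketch of a clause that would at least recognise the 410 payload; it is illustrative only and, apart from the extract_rv/1 name visible in the trace, does not reflect Kazan's actual internals.

```elixir
# Sketch only: distinguishing the 410 "Status" object from a normal
# watch event before the resource version is extracted.
defmodule WatchStatusSketch do
  # Normal watch events carry the object's metadata, which holds the RV.
  def extract_rv(%{"metadata" => %{"resourceVersion" => rv}}), do: rv

  # The failure case from the logs above: kind "Status", reason "Gone".
  # Returning a tagged tuple lets the caller decide whether to restart
  # the watch or notify the consumer instead of crashing.
  def extract_rv(%{"kind" => "Status", "reason" => "Gone", "message" => message}) do
    {:gone, message}
  end
end
```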
@chazsconi
Collaborator Author

@obmarg any further ideas on how to solve this? My current preference is option (2) above.

@obmarg obmarg added the bug label May 7, 2018
@obmarg
Owner

obmarg commented May 7, 2018

@chazsconi do you have any code that can reproduce this situation? I'd like to investigate a little before deciding on a solution.

@chazsconi
Collaborator Author

chazsconi commented May 7, 2018

@obmarg I will try to reproduce this with a test; however, the scenario in which I saw it happen involved over 48 hours of no events being received by the watcher, so it might not be simple to trigger.

After seeing this ticket, I think the resource version getting too old is normal behaviour:

kubernetes/kubernetes#55230

@obmarg
Owner

obmarg commented May 7, 2018

@chazsconi Yeah, I agree it seems like a normal thing to happen, and it's definitely something Kazan should handle. Since it seems to happen in response to other things changing in the database, I was hoping there'd be a simple way to induce the issue just by changing tons of un-watched resources, though I guess it depends on precisely what changes are needed to trigger it.

I did come across this PR that updates kubectl to handle something similar. It seems (though I've not yet tested it) that if you specify a resourceVersion of "0" you'll be sent the latest resource, and then any updates to that resource. I was wondering if we could use this for the issue you're experiencing, though I wanted to play around with it to see whether it made sense before suggesting anything...

@chazsconi
Collaborator Author

@obmarg I tried to reproduce the problem by creating a watch for pod changes in a namespace and then making changes to a config map, but even after 75000 changes to the config map the problem did not occur on the watch.

However, as you say, specifying a resource version of "0" appears to send only new events. So a simple solution is to fix the code to revert to resource version "0" when the "too old resource version" problem occurs.

However, there is a possibility of events being missed between the original watch failing and restarting it with resource version "0". If this is critical, the consumer can be informed so that it can refetch everything again.

Therefore, providing both options in the library and letting the consumer choose which they prefer might be the best approach (see the sketch after this list):

  1. Inform the consumer and terminate the watch
  2. Restart from resource version "0"
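
A rough sketch of what option (2) could look like, assuming the watcher accepts :send_to and :resource_version options (the option names are assumptions, and as a later comment shows, restarting from "0" turned out to resend old events):

```elixir
# Rough sketch of option (2): restart the watch from resource version "0"
# after a 410. The :send_to and :resource_version option names are
# assumptions about the watcher API, not checked against a release.
defmodule RestartFromZeroSketch do
  def restart_after_gone(request) do
    # Accepts that events occurring between the failure and the restart
    # may be missed; option (1) would notify the consumer here instead.
    Kazan.Watcher.start_link(request, send_to: self(), resource_version: "0")
  end
end
```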

@obmarg
Owner

obmarg commented May 9, 2018

@chazsconi Yeah, maybe we'll end up having to do that, though I don't know if I'm too happy with an option that forces people to pick between handling a semi-rare event (that we can't explain well enough to say when it'll happen) and potentially losing events. Since this seems to happen when the watched resource hasn't been updated for days, the chance of it changing in the second or so while we're restarting the watch seems low.

I definitely want to do a bit more investigation around this before settling on a fix, though if you're blocked I'd be happy to release a temporary fix.

@chazsconi
Collaborator Author

@obmarg I actually saw the problem occur again today. An event had been received on the watched resource around an hour beforehand, so it certainly isn't always days before this happens.

We are currently using K8S v1.8.8. We are planning to upgrade to v1.9.7 next week, and I'd be interested to see if the problem still occurs after the upgrade.

@chazsconi
Collaborator Author

@obmarg I'm currently running against my fork in production with this fix:
chazsconi@42b297d

As I cannot reproduce this in tests, I'm waiting for the scenario to recur to confirm that this fixes the problem. After that I can open a PR.

@chazsconi
Collaborator Author

chazsconi commented May 17, 2018

Unfortunately this does not work. Setting the resource version to "0" causes some previous changes to be resent, and the same happens if no resource version is set at all.

After checking the K8S docs (https://kubernetes.io/docs/reference/api-concepts/), I found that the correct way to handle this is to refetch the resources:

When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.

Therefore, I believe that option (1) that I listed above is the solution, i.e. inform the consumer and let it decide how to deal with the situation.
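
A consumer-side sketch of that recommendation is below: on a 410, clear the cache, perform a fresh list, and restart the watch from the list's resource version. The Kazan calls shown (list_namespace!/0, Kazan.run!/1 and the watcher options) are assumptions about the generated API shape rather than tested code.

```elixir
# Sketch of the list-then-watch resync described in the Kubernetes docs.
# The API names below (list_namespace!/0, Kazan.run!/1, watcher options)
# are assumptions about Kazan's generated API, not verified calls.
defmodule ResyncSketch do
  def resync do
    # 1. Clear the local cache and refetch everything with a list operation.
    list = Kazan.Apis.Core.V1.list_namespace!() |> Kazan.run!()
    rebuild_cache(list.items)

    # 2. Restart the watch from the resource version returned by that list,
    #    so no events between the list and the new watch are missed.
    rv = list.metadata.resource_version

    Kazan.Apis.Core.V1.list_namespace!()
    |> Kazan.Watcher.start_link(send_to: self(), resource_version: rv)
  end

  # Placeholder for whatever local cache the consumer keeps.
  defp rebuild_cache(_items), do: :ok
end
```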

@obmarg
Owner

obmarg commented May 31, 2018

Apologies for my complete silence on this. I’ve been in the process of finding a new job, which has taken up just about all of my time. It's almost sorted now, so I'll be ready to give feedback soon.

@chazsconi
Collaborator Author

No problem. We've been running a fork using this commit in production for the last couple of weeks: chazsconi@a201af7. It appears to solve the problem, although, of course, we need to handle the 410 in the consumer.

This is a breaking change, as the messages sent to the consumer are now a different struct, so that the from can be included.
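
For consumers of that fork, handling presumably boils down to a GenServer clause that matches the new event struct and triggers a resync when the 410 arrives. The struct name and fields below (%Kazan.Watcher.Event{} with :type, :from and :object) are assumptions based on the description above; check the linked commit for the real shape.

```elixir
# Hypothetical consumer-side handling of the watcher's 410 notification.
# The %Kazan.Watcher.Event{} struct and its :type/:from/:object fields
# are assumptions based on the description above, not the exact shape
# used in the linked commit.
defmodule WatchConsumerSketch do
  use GenServer
  require Logger

  def init(state), do: {:ok, state}

  # The 410 case: the watcher reports it can no longer watch, so the
  # consumer clears its cache and performs a list-then-watch resync.
  def handle_info(%Kazan.Watcher.Event{type: :gone, from: watcher}, state) do
    Logger.warn("Watcher #{inspect(watcher)} returned 410 Gone, resyncing")
    {:noreply, resync(state)}
  end

  # Normal watch events carry the changed object and the event type.
  def handle_info(%Kazan.Watcher.Event{type: type, object: object}, state) do
    {:noreply, apply_event(state, type, object)}
  end

  defp resync(state), do: state
  defp apply_event(state, _type, _object), do: state
end
```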

@obmarg
Owner

obmarg commented Jun 7, 2018

Ok, so I'd ideally like a way for Kazan to handle this automatically. However, it's not clear what the best way to do that is, and forcing users to handle it is at least a step in the right direction (since it's currently unhandleable). If you want to make a PR from that branch, I'd be happy to accept it after a bit of review.

@chazsconi
Collaborator Author

@obmarg Sorry for the delay - PR now created.

@obmarg
Owner

obmarg commented Jun 18, 2018 via email

@obmarg
Owner

obmarg commented Oct 16, 2018

Fixed in #47

@obmarg obmarg closed this as completed Oct 16, 2018