When listeners crash, Phobos fails to recover by itself #126
Comments
@Naren1997 can you post the full backtrace? These errors seem to be in the underlying ruby-kafka library rather than Phobos.
@dorner Here is the full backtrace:
@Naren1997 yeah, I can't see anything in Phobos that would cause the fetcher to suddenly lose its thread. Maybe try posting this issue in https://github.com/zendesk/ruby-kafka ?
@dorner Okay, I'll check in ruby-kafka. What would be an ideal time interval between listener restarts? Currently I have 10 seconds; could that be the cause?
It would totally depend on your broker and consumer settings, and would also depend on why you're crashing in the first place. :)
We've been having this problem sporadically as well when a Kafka broker restarts. I looked into the code of Phobos and ruby-kafka, and I have a theory about why it occurs. In short, I think the underlying problem is a race condition; I'm just not sure whether ruby-kafka supports the threading pattern that Phobos uses. Here's what I think occurs, in this order, across these threads:
I'm not 100% sure this is what is happening, but I think it is very likely. A "cheap" solution to this problem would be for the ruby-kafka fetcher to call
https://github.com/zendesk/ruby-kafka#thread-safety
I'm not sure the "cheap" fix from above is the "correct" fix, though. This is why I said I'm "not sure whether ruby-kafka supports the threading pattern that Phobos uses". Is it OK for Phobos to start the ruby-kafka fetcher in one thread and stop it in another? In other words, is it OK for the Phobos executor to stop the listeners explicitly, or would it be (thread-)safer to send a signal to the threads to stop, which would in turn stop the listener and therefore the ruby-kafka consumer and fetcher? To implement such a solution, I think the following would need to be done (a rough sketch follows):
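To make the idea concrete, here is a rough sketch of what that signal-based shutdown could look like. All names here (`start_consumer`, `poll_and_process`, `stop_consumer`, `ListenerThread`) are hypothetical illustrations, not Phobos or ruby-kafka API: the point is only that the consumer and fetcher are started and stopped by the same thread, and the only thing touched cross-thread is a flag.

```ruby
# Hypothetical sketch of signal-based shutdown (not actual Phobos code):
# each listener thread owns its consumer/fetcher lifecycle; the main
# thread only flips a stop flag instead of calling stop directly.
class ListenerThread
  def initialize(listener)
    @listener = listener
    @stop_requested = false
  end

  def start
    @thread = Thread.new do
      @listener.start_consumer       # fetcher is started inside this thread
      @listener.poll_and_process until @stop_requested
      @listener.stop_consumer        # fetcher is stopped by the same thread
    end
  end

  def request_stop
    @stop_requested = true           # the only cross-thread interaction
  end

  def join
    @thread&.join
  end
end
```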
As a side note, I think it's great that Phobos uses threads instead of a separate process per consumer. In our project we have lots of little consumers that run sporadically, and letting them run in a single process is much more memory-efficient in our case. A big thanks to the maintainers of this gem. We enjoy using it and can definitely recommend it.
Hey @austinmoore! I think you might be misreading the situation in terms of the quote you added from the ruby-kafka docs. That's talking about creating threads within your consumer, which we don't do (we have separate consumers per thread). It also says you shouldn't share a Kafka client between threads, which again we don't do. I think probably the best solution would be to move the
I do see your point about stopping the listeners from the main thread. I'm not sure what the ramifications of that are, but I can say we've been using Phobos in production for about three years now and have yet to see any real issues coming out of it.
Is this problem fixed?
I don't think so. I'm not sure if anyone has actually opened an issue on the ruby-kafka library though, as that's where the problem lies. |
I don't agree that this is purely a ruby-kafka issue. I think there is something we can do here in the retry logic: https://github.com/phobos/phobos/blob/master/lib/phobos/executor.rb#L71. It keeps using the original listener object for every retry, but if there is a fundamental problem in that listener object, retrying won't fix it, and the retry is useless: an endless loop repeating the previous error, just like what is reported here.
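For illustration, a minimal sketch of the kind of retry loop being criticized. This is a simplification, not the actual Phobos executor code:

```ruby
# Simplified sketch (not Phobos' real executor): the same listener
# object is reused on every retry attempt.
def run_with_retry(listener, backoff_seconds: 10)
  begin
    listener.start
  rescue StandardError => e
    warn "listener crashed: #{e.message}"
    sleep(backoff_seconds)
    # If the listener's internal state (e.g. its fetcher thread) is
    # already broken, restarting the same object just replays the same
    # error forever.
    retry
  end
end
```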
It's not easy to try and recreate the listener from scratch unfortunately. If you have any ideas or proposals, I'd be happy to review a PR for them. |
Endless retrying (118 times in this case) even though the Kafka server is OK. If we can fix the retry, then we won't see this problem on stop, I think.
@dorner can we add an upper limit on retries? Most people will run this in a container; if the container process stops, the container will be restarted. But if Phobos keeps these retry failures to itself, the container will never be rebooted.
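A hypothetical sketch of such a cap, building on the simplified retry loop above (the constant and names are illustrative, not a real Phobos setting): after a fixed number of consecutive failures the process exits non-zero, so a container orchestrator can restart it with fresh state.

```ruby
MAX_RETRIES = 5  # illustrative value, not a Phobos configuration option

def run_with_retry_limit(listener, backoff_seconds: 10)
  attempts = 0
  begin
    listener.start
  rescue StandardError => e
    attempts += 1
    warn "listener crashed (attempt #{attempts}/#{MAX_RETRIES}): #{e.message}"
    raise if attempts >= MAX_RETRIES  # let the process die; the container restarts it
    sleep(backoff_seconds)
    retry
  end
end
```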
Proposed PR: #144
Hmm... I'm not sure this is the right approach. If one of your listeners crashes, does that mean you have to shut down the entire container? In particular, it means that any listeners that are still live won't be able to shut down properly, so you're likely to get into a big rebalancing loop for a period of time until things stabilize again. I am still convinced that the right solution is to fix the underlying ruby-kafka library, and I'm not sure why, out of everyone commenting on this thread and suggesting fixes, as far as I can tell no one has attempted to do so. 🤔 From what I can tell, we should at least be able to stop the crash by adding a simple null check.
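Purely as an illustration of that suggestion, a guard along these lines in the fetcher's stop path would avoid raising when the thread has already died. The method body and instance variables here are assumptions, not ruby-kafka's actual internals:

```ruby
# Hypothetical nil guard (not ruby-kafka's real code): skip the shutdown
# handshake entirely when the fetcher thread has already been lost.
def stop
  return if @thread.nil? || !@thread.alive?
  @running = false
  @thread.join
end
```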
@dorner thanks. How about now? It recreates the client when retrying: https://github.com/phobos/phobos/pull/144/files
It's a better approach, but I think it'll take more than recreating the client. The consumer was created with the old client, and I'm not sure whether just changing the reference to a new client will fix anything or break things even worse.
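As a sketch of what's being pointed out here: if the retry recreates the client, the consumer derived from it likely needs to be rebuilt too, rather than swapping a new client under an existing consumer. The `Kafka.new` and `client.consumer` calls below are real ruby-kafka API; the surrounding `recreate_listener`/`Listener` shape is hypothetical:

```ruby
require "kafka"

# Hypothetical helper: rebuild the client and the consumer together so
# no object keeps a reference to the old, broken client.
def recreate_listener(config)
  client   = Kafka.new(config.seed_brokers, client_id: config.client_id)
  consumer = client.consumer(group_id: config.group_id)
  Listener.new(client: client, consumer: consumer)  # Listener is illustrative
end
```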
@reachlin do you know why your PR was closed? Also, were you able to test your fix to see if it was able to handle the retry better?
When the Kafka server is down, the listeners crash and Phobos is unable to recover by itself, giving me the following error: