Consumer in crash loop for persistence processor #212

Closed

cioboteacristian opened this issue Dec 9, 2019 · 2 comments

Comments

cioboteacristian commented Dec 9, 2019

I have a processor that uses persistence (a group table). It appears that it is not able to rebalance at all, which causes my pods to be in a crash loop.

My topic used for persistence (-table) has cleanup.policy=compact. Apparently, we have about 1.3 million messages in this topic.

I've even tried reducing the number of pods to 1 to rule out a concurrency problem: it stalled for about 5 minutes with the last log 2019/12/09 13:33:23 Processor: dispatcher started (memory usage rising to 5 GB) and then eventually started, with no extra logging.

I have 4 pods, and the topic has 4 partitions. I have 3 processors (two of them without persistence, and they work fine). It looks like the processor with persistence is not able to rebalance. The input rate is about 25 msg/s.

I believe the root issue is the same one already mentioned in other issues here: slow recovery of the huge -table topic. However, I am surprised by the constant rebalancing.
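
For reference, the setup is essentially a goka group with one input stream and a persisted group table. Here is a minimal sketch of that kind of processor (group name, topics, codecs, and the counting logic are illustrative assumptions, not the actual service code):

```go
package main

import (
	"context"
	"log"

	"github.com/lovoo/goka"
	"github.com/lovoo/goka/codec"
)

func main() {
	brokers := []string{"kafka:9092"} // assumed broker address

	// Group graph: one input stream plus a persisted group table.
	// Goka materializes the table in the compacted "<group>-table" topic,
	// i.e. the cleanup.policy=compact topic mentioned above.
	g := goka.DefineGroup("example-group",
		goka.Input("example-input", new(codec.String), func(ctx goka.Context, msg interface{}) {
			// Illustrative callback: count messages per key in the group table.
			var count int64
			if v := ctx.Value(); v != nil {
				count = v.(int64)
			}
			ctx.SetValue(count + 1)
		}),
		goka.Persist(new(codec.Int64)),
	)

	p, err := goka.NewProcessor(brokers, g)
	if err != nil {
		log.Fatalf("error creating processor: %v", err)
	}
	// On startup the processor first recovers the "-table" topic into local
	// storage before it starts consuming, which is where the long stall and
	// the memory growth described above show up.
	if err := p.Run(context.Background()); err != nil {
		log.Fatalf("error running processor: %v", err)
	}
}
```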

Logs before dying:

2019/12/09 12:23:38 Processor: starting
2019/12/09 12:23:38 Processor: starting
2019/12/09 12:23:38 view: starting
2019/12/09 12:23:38 Processor: starting
2019/12/09 12:23:38 Processor: creating consumer [consumer1]
2019/12/09 12:23:38 Processor: creating consumer [consumer2]
2019/12/09 12:23:38 Processor: creating consumer [consumer3]
2019/12/09 12:23:38 Processor: creating producer
2019/12/09 12:23:38 Processor: creating producer
2019/12/09 12:23:38 view: partition 3 started
2019/12/09 12:23:38 view: partition 1 started
2019/12/09 12:23:38 view: partition 0 started
2019/12/09 12:23:38 view: partition 2 started
2019/12/09 12:23:38 Processor: creating producer
2019/12/09 12:23:38 Processor: rebalancing: map[]
2019/12/09 12:23:38 Processor: dispatcher started
2019/12/09 12:23:38 Processor: rebalancing: map[]
2019/12/09 12:23:38 Processor: dispatcher started
2019/12/09 12:23:38 Processor: rebalancing: map[]
2019/12/09 12:23:38 Processor: dispatcher started
2019/12/09 12:23:41 Processor: dispatcher stopped
2019/12/09 12:23:41 Processor: rebalancing: map[1:-1 2:-1]
2019/12/09 12:23:41 Processor: dispatcher started
2019/12/09 12:23:42 Processor: dispatcher stopped
2019/12/09 12:23:42 Processor: rebalancing: map[3:-1]
2019/12/09 12:23:42 Processor: dispatcher started
2019/12/09 12:23:52 Processor: dispatcher stopped
2019/12/09 12:23:52 partition /3: exit
2019/12/09 12:23:52 Removing partition 3
2019/12/09 12:23:52 Processor: rebalancing: map[3:-1]
2019/12/09 12:23:52 Processor: dispatcher started
2019/12/09 12:23:54 Processor: dispatcher stopped
2019/12/09 12:23:54 partition /2: exit
2019/12/09 12:23:54 partition /1: exit
2019/12/09 12:23:54 Removing partition 1
2019/12/09 12:23:54 Removing partition 2
2019/12/09 12:23:54 Processor: rebalancing: map[1:-1]
2019/12/09 12:23:54 Processor: dispatcher started
2019/12/09 12:24:08 Processor: dispatcher stopped
2019/12/09 12:24:08 Processor: removing partitions
2019/12/09 12:24:08 Processor: closing producer
2019/12/09 12:24:08 Processor: closing consumer [consumer3]
2019/12/09 12:24:08 Processor: stopped
{"lvl":"info","host":"uservice-564dcbbbcb-9g8mv","msg":"component error received"}
{"lvl":"info","host":"uservice-564dcbbbcb-9g8mv","msg":"shutting down component"}
2019/12/09 12:24:08 Processor: dispatcher stopped
2019/12/09 12:24:08 partition /1: exit
2019/12/09 12:24:08 Processor: dispatcher stopped
2019/12/09 12:24:08 view: partition 2 stopped
2019/12/09 12:24:08 Removing partition 1
2019/12/09 12:24:08 Processor: removing partitions
2019/12/09 12:24:08 Processor: closing producer
2019/12/09 12:24:08 view: partition 3 stopped
2019/12/09 12:24:08 view: partition 1 stopped
2019/12/09 12:24:08 Processor: closing consumer [consumer2]
2019/12/09 12:24:08 view: partition 0 stopped
2019/12/09 12:24:08 view: closing consumer
2019/12/09 12:24:08 partition /3: exit
2019/12/09 12:24:08 Removing partition 3
2019/12/09 12:24:08 Processor: removing partitions
2019/12/09 12:24:08 Processor: closing producer
2019/12/09 12:24:08 Processor: closing consumer [consumer1]
2019/12/09 12:24:08 Processor: stopped
2019/12/09 12:24:08 Processor: stopped

frairon commented Mar 22, 2020

Hi @cioboteacristian, sorry for the late response. Your issue looks like the component is being terminated by the container; the high memory usage could be the reason, or the long startup time before it actually starts consuming.
The huge memory consumption is an issue we have to improve at some point, but the recovery speed is effectively limited by network bandwidth and disk speed.
Anyway, as announced in #239, we are working on a refactored and improved version of goka. Although the recovery mechanism has stayed more or less the same, it would be interesting to know whether you are still facing these issues. Just use the consumer-group branch or vendor the tag v0.9.0-beta2 to try it out (see the sketch below).
Let me know if there are any results or you have any problems.
Cheers!
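
One way to pin the beta tag mentioned above, assuming the consuming service is built with Go modules (module name below is illustrative, not from the issue):

```go
// go.mod of the consuming service (illustrative module name).
// To try the consumer-group branch instead, run:
//   go get github.com/lovoo/goka@consumer-group
module example.com/uservice

go 1.14

require github.com/lovoo/goka v0.9.0-beta2
```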


frairon commented Jul 10, 2020

Since there hasn't been any activity on this, I'll close it. If you're still having the same issue, feel free to reopen or create a new issue.

frairon closed this as completed on Jul 10, 2020