Liftbridge runs into unrecoverable issues after x days running in an Azure k8s cluster #243
Comments
Just out of curiosity, do you use any persistent volume mounts for the Liftbridge pods?
Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint files for the affected streams?
Judging from your logs, it looks like the errors involve partition 1 of your streams.
Hmm, there should be no partition 1 on any of the streams; all data is written to partition 0.
I found the problematic lines by searching for the number 5585 in the checkpoint file for the stream stream_meters15:
The files are attached. I renamed them before attaching to <streamname>_<podnumber>_leader-epoch-checkpoint.txt: mqtt_0_leader-epoch-checkpoint.txt
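For illustration, a duplicate entry in such a file looks roughly like this. This is a reconstruction, assuming a Kafka-style checkpoint layout (a version line, an entry count, then epoch/start-offset pairs); the epoch values are invented around the offset 5585 mentioned above, and the trailing comments are annotations, not part of the file:

```
0        # checkpoint format version (assumed)
2        # number of entries (assumed)
5 5585   # leader epoch 5 starts at offset 5585
5 5585   # the same entry repeated — the duplicate that breaks recovery
```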
@nfoerster Thanks for providing the epoch checkpoint contents. Admittedly, this is a strange problem. The issue is due to the duplicate entry for the leader epoch. Since you have debug logs enabled, there should be logs indicating these epoch entries. Do you see these logs on the crashing nodes leading up to the crash?
FYI, I did make a small fix after reviewing the leader epoch caching code (#245). I'm not 100% certain this will fix the issue you're seeing without more information, but if you're able, it would be worth a try. To get the cluster into a working state, you'll need to delete the leader-epoch-checkpoint files.
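For anyone hitting the same state, a minimal sketch of the cleanup, assuming the default layout where each partition directory under the data dir contains a leader-epoch-checkpoint file (the data dir path below is a placeholder, not necessarily yours):

```sh
# Stop the affected Liftbridge pod first, then run this against its persistent volume.
# /data/liftbridge is an assumed data.dir; adjust to your configuration.
find /data/liftbridge/streams -type f -name leader-epoch-checkpoint -print -delete
```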
Unfortunately not; that's a big issue. Thank you very much for the error description and the supplied patch. We will integrate the patch and also store all logs at debug level, so if the issue occurs again, we can investigate further. If that is the case, we will reopen the issue.
Liftbridge Version: 1.2.0
Hello,
this is the second time we have run into unrecoverable issues with the Liftbridge deployment in our k8s cluster. We have 3 NATS pods and 3 Liftbridge pods running:
As shown above, only one pod is still running after the issue occurred, and the other two pods in the cluster also fail to come up. There are different errors in the logs, but the root cause seems to be an epoch selection mismatch:
This issue has been unrecoverable so far. Each pod has its own persistent volume claim, storing its Raft and stream data persistently. If you restart a broken pod (instance 0 or 1), the other currently running pod crashes. The third pod (instance 2), however, crashes directly with a different error. The deployment is sketched below.
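The pods follow the usual StatefulSet pattern with one volume claim per pod; this is a trimmed, illustrative sketch, and the names, image tag, mount path, and storage size are placeholders rather than our exact manifest:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: liftbridge
spec:
  serviceName: liftbridge
  replicas: 3
  selector:
    matchLabels:
      app: liftbridge
  template:
    metadata:
      labels:
        app: liftbridge
    spec:
      containers:
        - name: liftbridge
          image: liftbridge/liftbridge:v1.2.0
          ports:
            - containerPort: 9292
          volumeMounts:
            - name: data
              mountPath: /data/liftbridge   # holds both Raft and stream data
  volumeClaimTemplates:                     # one PVC per pod (liftbridge-0/1/2)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```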
This is our liftbridge configuration:
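In outline, it is a standard clustered setup; the following is a minimal sketch with placeholder values for the listen address, data directory, and NATS service URL:

```yaml
listen: 0.0.0.0:9292            # client-facing address; placeholder
data.dir: /data/liftbridge      # persisted on the pod's PVC; placeholder path

logging:
  level: debug                  # debug level, so epoch cache activity is logged

nats.servers:
  - nats://nats:4222            # k8s service for the 3-node NATS cluster; placeholder

clustering:
  raft.bootstrap.seed: true     # set only when bootstrapping the first node
  replica.max.lag.time: 20s
```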
The logs are attached.
logs-from-liftbridge-in-liftbridge-2 (1).txt
logs-from-liftbridge-in-liftbridge-2.txt
logs-from-liftbridge-in-liftbridge-1 (1).txt
logs-from-liftbridge-in-liftbridge-1.txt
logs-from-liftbridge-in-liftbridge-0 (1).txt
logs-from-liftbridge-in-liftbridge-0.txt
Do you have any clue about the error or how to recover the deployment?
Thank you in advance.