
Liftbridge runs into unrecoverable issues after x days running in an Azure k8s cluster #243

Closed
nfoerster opened this issue Aug 12, 2020 · 10 comments


@nfoerster

Liftbridge Version: 1.2.0

Hello,

For the second time, we have run into unrecoverable issues with the Liftbridge deployment in our k8s cluster. We have 3 NATS pods and 3 Liftbridge pods running:

NAME                             READY   STATUS             RESTARTS   AGE
nats-default-1               1/1     Running            0          28d
nats-default-2               1/1     Running            0          28d
nats-default-3               1/1     Running            0          28d
liftbridge-0                     0/1     CrashLoopBackOff   20         84m
liftbridge-1                     1/1     Running            20         78m
liftbridge-2                     0/1     CrashLoopBackOff   23         93m
nats-operator-669766fdfd-6jfq2   1/1     Running            1          28d

As shown above, only one pod keeps running after the issues occurred, but without the other two pods it fails to work as well. There are different errors in the logs, but the root cause seems to be an epoch selection mismatch:

time="2020-08-12 07:50:03" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99925 tally=1\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:05" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99926\n"
time="2020-08-12 07:50:05" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:05" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99926 tally=1\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:06" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99927\n"
time="2020-08-12 07:50:06" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-0 term=99927 tally=1\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99927 tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: election won: tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering leader state: leader=\"Node at cluster-liftbridge-1 [Leader]\"\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-0\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-2\n"
time="2020-08-12 07:50:06" level=info msg="Server became metadata leader, performing leader promotion actions"
time="2020-08-12 07:50:06" level=info msg=" raft: pipelining replication: peer=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\"\n"
time="2020-08-12 07:50:06" level=debug msg="fsm: Replaying Raft log..."
time="2020-08-12 07:50:06" level=debug msg="fsm: Created stream [name=mqtt, subject=mqtt, partitions=1]"
panic: failed to add stream to metadata store: failed to create commit log: failed to read leader epoch offsets file: duplicate leader epoch 5599

goroutine 72 [running]:
github.com/liftbridge-io/liftbridge/server.(*Server).Apply(0xc0002940e0, 0xc000314f00, 0x0, 0x0)
	/workspace/server/fsm.go:111 +0x407
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000fdb4d0)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2aa
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0006f6000, 0x40, 0x40)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc000490000)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x3a9
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000490000, 0xc00031a8c0)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x5d
created by github.com/hashicorp/raft.(*raftState).goFunc
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66

So far, the issue has been unrecoverable. Each pod has its own persistent volume claim, storing its Raft and stream data persistently. If you restart one of the broken pods (instance 0 or 1), the other currently running pod will crash. The third pod (instance 2), however, crashes immediately with another error.

This is our liftbridge configuration:

liftbridge.yaml: |
    listen: 0.0.0.0:9292
    host: liftbridge-clients.nats
    port: 9292

    logging:
      level: debug
      recovery: true
      raft: true

    nats.servers:
      - "nats://nats-default.nats:4222"
      
    streams:
      retention.max:
        bytes: 20937965568
    clustering:
      min.insync.replicas: 2

The logs are attached.
logs-from-liftbridge-in-liftbridge-2 (1).txt
logs-from-liftbridge-in-liftbridge-2.txt
logs-from-liftbridge-in-liftbridge-1 (1).txt
logs-from-liftbridge-in-liftbridge-1.txt
logs-from-liftbridge-in-liftbridge-0 (1).txt
logs-from-liftbridge-in-liftbridge-0.txt

Do you have any clue about the error or how to recover the deployment?
Thank you in advance.

@LaPetiteSouris
Contributor

LaPetiteSouris commented Aug 12, 2020

Just out of curiosity, do you use any persistent volume mounts for the Liftbridge pods?

@nfoerster
Author

Yes we do, every pod has a persistent volume claim in Azure:
(screenshot attached)

@LaPetiteSouris
Contributor

> Yes we do, every pod has a persistent volume claim in Azure:

Is there any chance we can take a look inside the data-dir folder? In your configuration I do not see that you specified it for the mount, so I presume it is by default at "/tmp/liftbridge/<namespace>".

I guess it may help to check out the contents inside.

@nfoerster
Author

Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?
(screenshot attached)

@tylertreat
Member

Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

@LaPetiteSouris
Contributor

> Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?

Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

@nfoerster
Author

nfoerster commented Aug 13, 2020

> Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?
>
> Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

Hmm, there should be no partition 1 in any of the streams; all data is written to partition 0.

> Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

I found the problematic lines by searching for the epoch 5585 in the leader-epoch-checkpoint file of the stream stream_meters15:

0
3
5585 3965
120 3085
5585 4425
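
For reference, the layout of this file appears to be a version line, an entry-count line, and then one "epoch offset" pair per line. Below is a minimal, hypothetical Go sketch (not Liftbridge's actual code; the format is only inferred from the contents above) that reads such a file and reports duplicate epochs, which is what the panic above complains about:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Hypothetical path; on our pods the file lives in the partition
	// data directory, e.g. <data-dir>/streams/<stream>/<partition>/.
	f, err := os.Open("leader-epoch-checkpoint")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Scan() // first line: format version ("0")
	sc.Scan() // second line: number of entries ("3")

	seen := map[int64]int64{} // leader epoch -> start offset
	for sc.Scan() {
		var epoch, offset int64
		if _, err := fmt.Sscanf(sc.Text(), "%d %d", &epoch, &offset); err != nil {
			continue // skip blank or malformed lines
		}
		if prev, ok := seen[epoch]; ok {
			fmt.Printf("duplicate leader epoch %d (offsets %d and %d)\n", epoch, prev, offset)
		}
		seen[epoch] = offset
	}
}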

Files are attached.

I changed the names of the files before attaching them here to <streamname>_<podnumber>_leader-epoch-checkpoint.txt and <streamname>_<podnumber>_replication-offset-checkpoint.txt.

mqtt_0_leader-epoch-checkpoint.txt
mqtt_0_replication-offset-checkpoint.txt
stream_meters15_0_leader-epoch-checkpoint.txt
stream_meters15_0_replication-offset-checkpoint.txt
mqtt_1_leader-epoch-checkpoint.txt
mqtt_1_replication-offset-checkpoint.txt
stream_meters15_1_leader-epoch-checkpoint.txt
stream_meters15_1_replication-offset-checkpoint.txt

@tylertreat
Member

@nfoerster Thanks for providing the epoch checkpoint contents. Admittedly, this is a strange problem. The issue is due to the duplicate entry for the leader epoch 5585. The bigger issue though is the 120 3085 entry in between. That generally should not be possible since we only allow adding entries with greater epoch and offset, but it's possible there is a bug that is allowing this entry to be made.
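
To illustrate the invariant, here is a simplified, hypothetical sketch (not the actual Liftbridge implementation) of a cache that only accepts entries whose epoch and offset are both strictly greater than the last cached entry; such a cache would reject both the 120 3085 entry and the second 5585 entry:

package main

import "fmt"

// epochOffset mirrors one "epoch offset" line in the checkpoint file.
type epochOffset struct{ epoch, offset int64 }

// leaderEpochCache is a stand-in for the server's leader epoch cache.
type leaderEpochCache struct{ entries []epochOffset }

// assign appends a new entry only if both its epoch and its start offset
// are strictly greater than those of the last cached entry.
func (c *leaderEpochCache) assign(epoch, offset int64) bool {
	if n := len(c.entries); n > 0 {
		last := c.entries[n-1]
		if epoch <= last.epoch || offset <= last.offset {
			return false
		}
	}
	c.entries = append(c.entries, epochOffset{epoch, offset})
	return true
}

func main() {
	var c leaderEpochCache
	fmt.Println(c.assign(5585, 3965)) // true
	fmt.Println(c.assign(120, 3085))  // false: smaller epoch and offset
	fmt.Println(c.assign(5585, 4425)) // false: duplicate epoch
}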

Since you have debug logs enabled, there should be logs indicating these epoch entries. They start with Updated log leader epoch. For example:

DEBU[2020-08-13 10:56:37] Updated log leader epoch. New: {epoch:5, offset:-1}, Previous: {epoch:0, offset:-1} for log [subject=foo, stream=foo-stream, partition=0]. Cache now contains 1 entry.

Do you see these logs on the nodes that crash leading up to the crash?

tylertreat added a commit that referenced this issue Aug 13, 2020
@tylertreat
Member

FYI, I did make a small fix after reviewing the leader epoch caching code (#245). I'm not 100% certain this will fix the issue you're seeing without more information, but if you're able, it would be worth a try. To get the cluster into a working state, you'll need to delete the leader-epoch-checkpoint files on the failed nodes.

@nfoerster
Author

> Do you see these logs on the nodes that crash leading up to the crash?

Unfortunately not, that's a big issue.

Thank you very much for the error description and the supplied patch. We will integrate the patch and also store all logs at debug level, so if the issue occurs again we can investigate further. If that is the case, we will reopen the issue.
