
Liftbridge runs into unrecoverable issues after x days running in an Azure k8s cluster #243

Closed
nfoerster opened this issue Aug 12, 2020 · 10 comments


@nfoerster

Liftbridge Version: 1.2.0

Hello,

For the second time, we have run into unrecoverable issues with the Liftbridge deployment in our k8s cluster. We have 3 NATS pods and 3 Liftbridge pods running:

NAME                             READY   STATUS             RESTARTS   AGE
nats-default-1               1/1     Running            0          28d
nats-default-2               1/1     Running            0          28d
nats-default-3               1/1     Running            0          28d
liftbridge-0                     0/1     CrashLoopBackOff   20         84m
liftbridge-1                     1/1     Running            20         78m
liftbridge-2                     0/1     CrashLoopBackOff   23         93m
nats-operator-669766fdfd-6jfq2   1/1     Running            1          28d

As shown above, only one pod keeps running after the issues occurred, but without the other two pods it fails to work as well. There are different errors in the logs, but the root cause seems to be an epoch selection mismatch:

time="2020-08-12 07:50:03" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99925 tally=1\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:04" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:05" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99926\n"
time="2020-08-12 07:50:05" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:05" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99926 tally=1\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:05" level=error msg="] raft: failed to make requestVote RPC: target=\"{Voter cluster-liftbridge-2 cluster-liftbridge-2}\" error=\"nats: timeout\"\n"
time="2020-08-12 07:50:06" level=warning msg=" raft: Election timeout reached, restarting election\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering candidate state: node=\"Node at cluster-liftbridge-1 [Candidate]\" term=99927\n"
time="2020-08-12 07:50:06" level=debug msg="raft: votes: needed=2\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-0 term=99927 tally=1\n"
time="2020-08-12 07:50:06" level=debug msg="raft: vote granted: from=cluster-liftbridge-1 term=99927 tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: election won: tally=2\n"
time="2020-08-12 07:50:06" level=info msg=" raft: entering leader state: leader=\"Node at cluster-liftbridge-1 [Leader]\"\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-0\n"
time="2020-08-12 07:50:06" level=info msg=" raft: added peer, starting replication: peer=cluster-liftbridge-2\n"
time="2020-08-12 07:50:06" level=info msg="Server became metadata leader, performing leader promotion actions"
time="2020-08-12 07:50:06" level=info msg=" raft: pipelining replication: peer=\"{Voter cluster-liftbridge-0 cluster-liftbridge-0}\"\n"
time="2020-08-12 07:50:06" level=debug msg="fsm: Replaying Raft log..."
time="2020-08-12 07:50:06" level=debug msg="fsm: Created stream [name=mqtt, subject=mqtt, partitions=1]"
panic: failed to add stream to metadata store: failed to create commit log: failed to read leader epoch offsets file: duplicate leader epoch 5599

goroutine 72 [running]:
github.com/liftbridge-io/liftbridge/server.(*Server).Apply(0xc0002940e0, 0xc000314f00, 0x0, 0x0)
	/workspace/server/fsm.go:111 +0x407
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000fdb4d0)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2aa
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0006f6000, 0x40, 0x40)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc000490000)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x3a9
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000490000, 0xc00031a8c0)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x5d
created by github.com/hashicorp/raft.(*raftState).goFunc
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66

So far, the issue has been unrecoverable. Each pod has its own persistent volume claim, storing its Raft and stream data persistently. If you restart one of the broken pods (instance 0 or 1), the other currently running pod will crash. The third pod (instance 2), however, crashes immediately with another error.

This is our liftbridge configuration:

liftbridge.yaml: |
    listen: 0.0.0.0:9292
    host: liftbridge-clients.nats
    port: 9292

    logging:
      level: debug
      recovery: true
      raft: true

    nats.servers:
      - "nats://nats-default.nats:4222"
      
    streams:
      retention.max:
        bytes: 20937965568
    clustering:
      min.insync.replicas: 2

The logs are attached.
logs-from-liftbridge-in-liftbridge-2 (1).txt
logs-from-liftbridge-in-liftbridge-2.txt
logs-from-liftbridge-in-liftbridge-1 (1).txt
logs-from-liftbridge-in-liftbridge-1.txt
logs-from-liftbridge-in-liftbridge-0 (1).txt
logs-from-liftbridge-in-liftbridge-0.txt

Do you have any clue about the error or how to recover the deployment?
Thank you in advance.

@LaPetiteSouris
Contributor

LaPetiteSouris commented Aug 12, 2020

Just out of curiosity, do you use any persistent volume mounts for the Liftbridge pods?

@nfoerster
Author

Yes we do, every pod has a persistent volume claim in Azure:
(screenshot attached)

@LaPetiteSouris
Contributor

> Yes we do, every pod has a persistent volume claim in Azure:

Is there any chance we can take a look inside the data-dir folder? In your configuration I do not see that you specified it for the mount, so I presume it is by default at "/tmp/liftbridge/<namespace>".

I guess it may help to check out the contents inside.

@nfoerster
Author

Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?
(screenshot attached)

@tylertreat
Member

Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

@LaPetiteSouris
Contributor

> Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?

Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

@nfoerster
Author

nfoerster commented Aug 13, 2020

> Yes, we can look inside. In /data we see the folder raft with raft.db and the folder streams containing the stream data. Which file are you interested in?
>
> Judging from your logs, it looks like it is tmp/liftbridge/<namespace>/streams/mqtt/1, as your stream is mqtt and on partition 1.

Hmm, there should be no partition 1 in any of the streams; all data is written to partition 0.

> Sorry that you're experiencing this. Can you provide the contents of the leader-epoch-checkpoint file for the particular stream partition on the node that is crashing? It will be in the partition data directory.

I found the problematic lines by searching for the epoch 5585 in the leader-epoch-checkpoint file of the stream stream_meters15:

0
3
5585 3965
120 3085
5585 4425
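
For reference, the layout of this file appears to be a version line, an entry-count line, and then one "epoch offset" pair per line. Below is a minimal, hypothetical Go sketch (not Liftbridge's actual code; the format is only inferred from the contents above) that reads such a file and reports duplicate epochs, which is what the panic above complains about:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Hypothetical path; on our pods the file lives in the partition
	// data directory, e.g. <data-dir>/streams/<stream>/<partition>/.
	f, err := os.Open("leader-epoch-checkpoint")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Scan() // first line: format version ("0")
	sc.Scan() // second line: number of entries ("3")

	seen := map[int64]int64{} // leader epoch -> start offset
	for sc.Scan() {
		var epoch, offset int64
		if _, err := fmt.Sscanf(sc.Text(), "%d %d", &epoch, &offset); err != nil {
			continue // skip blank or malformed lines
		}
		if prev, ok := seen[epoch]; ok {
			fmt.Printf("duplicate leader epoch %d (offsets %d and %d)\n", epoch, prev, offset)
		}
		seen[epoch] = offset
	}
}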

Files are attached.

I changed the names of the files before attaching them here to <streamname>_<podnumber>_leader-epoch-checkpoint.txt and <streamname>_<podnumber>_replication-offset-checkpoint.txt.

mqtt_0_leader-epoch-checkpoint.txt
mqtt_0_replication-offset-checkpoint.txt
stream_meters15_0_leader-epoch-checkpoint.txt
stream_meters15_0_replication-offset-checkpoint.txt
mqtt_1_leader-epoch-checkpoint.txt
mqtt_1_replication-offset-checkpoint.txt
stream_meters15_1_leader-epoch-checkpoint.txt
stream_meters15_1_replication-offset-checkpoint.txt

@tylertreat
Member

@nfoerster Thanks for providing the epoch checkpoint contents. Admittedly, this is a strange problem. The issue is due to the duplicate entry for the leader epoch 5585. The bigger issue though is the 120 3085 entry in between. That generally should not be possible since we only allow adding entries with greater epoch and offset, but it's possible there is a bug that is allowing this entry to be made.
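
To illustrate the invariant, here is a simplified, hypothetical sketch (not the actual Liftbridge implementation) of a cache that only accepts entries whose epoch and offset are both strictly greater than the last cached entry; such a cache would reject both the 120 3085 entry and the second 5585 entry:

package main

import "fmt"

// epochOffset mirrors one "epoch offset" line in the checkpoint file.
type epochOffset struct{ epoch, offset int64 }

// leaderEpochCache is a stand-in for the server's leader epoch cache.
type leaderEpochCache struct{ entries []epochOffset }

// assign appends a new entry only if both its epoch and its start offset
// are strictly greater than those of the last cached entry.
func (c *leaderEpochCache) assign(epoch, offset int64) bool {
	if n := len(c.entries); n > 0 {
		last := c.entries[n-1]
		if epoch <= last.epoch || offset <= last.offset {
			return false
		}
	}
	c.entries = append(c.entries, epochOffset{epoch, offset})
	return true
}

func main() {
	var c leaderEpochCache
	fmt.Println(c.assign(5585, 3965)) // true
	fmt.Println(c.assign(120, 3085))  // false: smaller epoch and offset
	fmt.Println(c.assign(5585, 4425)) // false: duplicate epoch
}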

Since you have debug logs enabled, there should be logs indicating these epoch entries. They start with Updated log leader epoch. For example:

DEBU[2020-08-13 10:56:37] Updated log leader epoch. New: {epoch:5, offset:-1}, Previous: {epoch:0, offset:-1} for log [subject=foo, stream=foo-stream, partition=0]. Cache now contains 1 entry.

Do you see these logs on the nodes that crash leading up to the crash?

tylertreat added a commit that referenced this issue Aug 13, 2020
@tylertreat
Member

FYI, I did make a small fix after reviewing the leader epoch caching code (#245). I'm not 100% certain this will fix the issue you're seeing without more information, but if you're able, it would be worth a try. To get the cluster into a working state, you'll need to delete the leader-epoch-checkpoint files on the failed nodes.

@nfoerster
Author

> Do you see these logs on the nodes that crash leading up to the crash?

Unfortunately not, that's a big issue.

Thank you very much for the error description and the supplied patch. We will integrate the patch and also store all logs at debug level, so if the issue occurs again we can investigate further. If that is the case, we will reopen the issue.
