
failed to create commit log: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax #333

Closed
ekbfh opened this issue Mar 16, 2021 · 16 comments


@ekbfh

ekbfh commented Mar 16, 2021

Hello!

I saw this issue before on 1.3.*, but it also happens on 1.5.1

/bin/liftbridge --config=/etc/liftbridge/liftbridge.yml
INFO[2021-03-16 19:36:48] Liftbridge Version:        v1.5.1            
INFO[2021-03-16 19:36:48] Server ID:                 srvnc-nc          
INFO[2021-03-16 19:36:48] Namespace:                 liftbridge-default 
INFO[2021-03-16 19:36:48] NATS Servers:              [nats://172.24.20.9:4222] 
INFO[2021-03-16 19:36:48] Default Retention Policy:  [Age: 3 hours, Compact: false] 
INFO[2021-03-16 19:36:48] Default Partition Pausing: disabled          
INFO[2021-03-16 19:36:48] Starting Liftbridge server on 172.24.20.9:9292... 
INFO[2021-03-16 19:36:48]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:srvnc-nc Address:srvnc-nc}]" 
DEBU[2021-03-16 19:36:48] Loaded existing state for metadata Raft group 
DEBU[2021-03-16 19:36:48] Raft node initialized                        
INFO[2021-03-16 19:36:48]  raft: entering follower state: follower="Node at srvnc-nc [Follower]" leader= 
WARN[2021-03-16 19:36:50]  raft: heartbeat timeout reached, starting election: last-leader= 
INFO[2021-03-16 19:36:50]  raft: entering candidate state: node="Node at srvnc-nc [Candidate]" term=74 
DEBU[2021-03-16 19:36:50] raft: votes: needed=1                        
DEBU[2021-03-16 19:36:50] raft: vote granted: from=srvnc-nc term=74 tally=1 
INFO[2021-03-16 19:36:50]  raft: election won: tally=1                 
INFO[2021-03-16 19:36:50]  raft: entering leader state: leader="Node at srvnc-nc [Leader]" 
INFO[2021-03-16 19:36:50] Server became metadata leader, performing leader promotion actions 
DEBU[2021-03-16 19:36:50] fsm: Replaying Raft log...                   
panic: failed to add stream to metadata store: failed to create commit log: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax

goroutine 55 [running]:
github.com/liftbridge-io/liftbridge/server.(*Server).Apply(0xc000198300, 0xc0004c4230, 0x0, 0x0)
	/home/circleci/project/server/fsm.go:111 +0x3e3
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000389320)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2c1
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0003ca600, 0x40, 0x40)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc0001ea300)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x42f
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc0001ea300, 0xc000388e80)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66
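
For context, the strconv error comes from parsing the high watermark checkpoint file as an integer. A minimal sketch of that failure mode (not Liftbridge's actual code; the server's parse path may differ) which reproduces the same message from an empty replication-offset-checkpoint file:

// Sketch only: read a high watermark checkpoint and parse it the way the
// panic message suggests. A zero-byte file makes strconv.ParseInt fail
// with exactly: strconv.ParseInt: parsing "": invalid syntax
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readHW(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	hw, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse high watermark file failed: %w", err)
	}
	return hw, nil
}

func main() {
	// Create an empty file, like the checkpoints in the broken data dir.
	tmp, _ := os.CreateTemp("", "replication-offset-checkpoint")
	tmp.Close()

	if _, err := readHW(tmp.Name()); err != nil {
		// Prints: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax
		fmt.Println(err)
	}
}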

config file:

cat /etc/liftbridge/liftbridge.yml
clustering:
    raft.bootstrap.seed: true
    server.id: srvnc-nc
cursors:
    stream.auto.pause.time: 0
    stream.partitions: 1
data:
    dir: /var/lib/liftbridge
host: 172.24.20.9
logging:
    level: debug
    raft: true
nats.servers:
- nats://172.24.20.9:4222
streams:
    compact.enabled: false
    retention.max:
        age: 3h
    segment.max:
        age: 20m

Workaround: drop the contents of the Liftbridge data folder.

P.S. I have an archive of the data that demonstrates this case. I can upload it somewhere.

@ekbfh
Author

ekbfh commented May 21, 2021

@tylertreat Can you say what might be going wrong there?

@tylertreat
Member

Sorry, somehow I missed this issue report. Do you by chance have the contents of the partition data directory for the partition that is causing this error?

@ekbfh
Author

ekbfh commented May 24, 2021

lift.1.5.1_parse.tar.gz
Please take a look at these files. I can't tell which partition is broken, so I attached all of the data files.

@tylertreat
Member

Somehow all of the HW checkpoint files (files named replication-offset-checkpoint in each partition) are empty, which is what causes the panic. I'm not sure how this could happen, but I'm investigating.

Is there any other information about the circumstances in which it occurred? It looks like this happened when restarting a server that already had partition data?
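
To find which partitions are affected, a small diagnostic sketch like this should work (it assumes the on-disk layout shown later in this thread, <data dir>/streams/<stream>/<partition>/replication-offset-checkpoint, and the root path below is just an example):

// Walk the streams directory and report zero-byte HW checkpoint files.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Adjust to your data.dir setting; this matches the config above.
	root := "/var/lib/liftbridge/streams"
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() != "replication-offset-checkpoint" {
			return nil
		}
		if info, err := d.Info(); err == nil && info.Size() == 0 {
			fmt.Println("empty HW checkpoint:", path)
		}
		return nil
	})
}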

@ekbfh
Author

ekbfh commented Jun 15, 2021

That's a possible scenario, but I don't know the real cause of the problem.

@ekbfh
Author

ekbfh commented Jul 20, 2021

Are there any clues yet?

@tylertreat
Member

I'm not sure how to reproduce this issue in order to debug it.

@aversant

aversant commented Nov 2, 2021

To reproduce this behaviour I set up Liftbridge on a VMware VM. Powering it off causes this.

The problem is around this code:

func (l *commitLog) checkpointHW() error {
	var (
		hw   = l.hw
		r    = strings.NewReader(strconv.FormatInt(hw, 10))
		file = filepath.Join(l.Path, hwFileName)
	)
	return atomic_file.WriteFile(file, r)
}

I bumped the atomic_file library to version 1.0.1, and after a shutdown I saw the following:

[root@noc-mt ~]# ls -la /var/lib/liftbridge/streams/ch.alarms/0/
total 80
drwxr-xr-x  2 nobody  wheel       512 Nov  2 19:01 .
drwxr-xr-x  3 nobody  wheel       512 Nov  2 18:25 ..
-rw-r--r--  1 nobody  wheel  10485760 Nov  2 18:25 00000000000000000000.index
-rw-r--r--  1 nobody  wheel         0 Nov  2 18:25 00000000000000000000.log
-rw-------  1 nobody  wheel         4 Nov  2 18:54 leader-epoch-checkpoint
-rw-------  1 nobody  wheel         0 Nov  2 19:01 replication-offset-checkpoint
-rw-------  1 nobody  wheel         2 Nov  2 19:01 replication-offset-checkpoint602025427

The commit may have been affected as well. An example from the __cursors stream:

[root@noc-mt ~]# ls -la /var/lib/liftbridge/streams/__cursors/0/
total 160
drwxr-xr-x  2 nobody  wheel       512 Nov  2 19:01 .
drwxr-xr-x  3 nobody  wheel       512 Nov  2 18:24 ..
-rw-r--r--  1 nobody  wheel  10485760 Nov  2 18:27 00000000000000000000.index
-rw-r--r--  1 nobody  wheel     42660 Nov  2 18:27 00000000000000000000.log
-rw-------  1 nobody  wheel         8 Nov  2 18:29 leader-epoch-checkpoint
-rw-------  1 nobody  wheel         3 Nov  2 19:01 replication-offset-checkpoint
-rw-------  1 nobody  wheel         3 Nov  2 19:01 replication-offset-checkpoint201413289

Liftbridge log: liftbridge_striconv_trace1.txt. The checkpointHW logging was added by me.

@tylertreat
Member

@aversant do you want to create a PR containing the version bump for atomic_file?

@tylertreat
Member

And just to confirm, when you bump the version, you no longer see the empty replication-offset-checkpoint files?

@aversant

aversant commented Nov 3, 2021

And just to confirm, when you bump the version, you no longer see the empty replication-offset-checkpoint files?

No. After bumping the version I still see empty replication-offset-checkpoint files (the ch.alarms/0/ stream, for example). I don't understand why they appear.

@tylertreat
Member

I'm not sure. The atomic file library just does an os.Rename to replace the old checkpoint file with the new one, which should be an atomic operation on Linux.

WriteFile atomically writes the contents of r to the specified filepath. If an error occurs, the target file is guaranteed to be either fully written, or not written at all. WriteFile overwrites any file that exists at the location (but only if the write fully succeeds, otherwise the existing file is unmodified).

It's not clear to me how the contents would become empty.
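
One well-known subtlety with rename-based writes is that the rename is only atomic with respect to the directory entry; it says nothing about whether the new file's data has reached disk. If the temp file isn't fsynced before the rename, a power loss can leave the renamed file present but empty on some filesystems, which would be consistent with the symptoms here (an empty checkpoint next to a leftover temp file). A sketch of the usual crash-safe pattern, as an illustration only and not the atomic_file library's actual code:

// Write-temp, fsync, rename. Without the Sync() before the rename, a
// power failure can leave "path" pointing at a file whose data never
// made it to disk, i.e. an empty checkpoint.
package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

func writeFileAtomic(path string, r io.Reader) error {
	dir := filepath.Dir(path)
	// Temp names look like replication-offset-checkpoint<random digits>,
	// similar to the leftover file in the listings above.
	tmp, err := os.CreateTemp(dir, filepath.Base(path))
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file data before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}
	// Optionally fsync the directory so the rename itself is durable.
	if d, err := os.Open(dir); err == nil {
		d.Sync()
		d.Close()
	}
	return nil
}

func main() {
	_ = writeFileAtomic("/tmp/replication-offset-checkpoint", strings.NewReader("42"))
}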

@tylertreat
Member

@aversant Out of curiosity, are you using the embedded NATS server when you see this issue or a separate standalone NATS?

@aversant

aversant commented Nov 6, 2021

I tested with both the embedded NATS server and a separate standalone NATS server.

I checked again with atomic v1.0.1 on CentOS 7 and couldn't reproduce the problem; I tried cutting the power several dozen times. I'll try again on CentOS 8, and if it doesn't reproduce there either, then atomic v1.0.1 works.

@tylertreat
Member

The atomic dependency was updated in this PR, which will be going out in a release today.

@tylertreat
Member

I believe this is fixed in v1.7.0. Please re-open if it still occurs.
