
failed to create commit log: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax #333

Closed
ekbfh opened this issue Mar 16, 2021 · 16 comments


@ekbfh

ekbfh commented Mar 16, 2021

Hello!

I saw this issue before on 1.3.*, but it also happens on 1.5.1

/bin/liftbridge --config=/etc/liftbridge/liftbridge.yml
INFO[2021-03-16 19:36:48] Liftbridge Version:        v1.5.1            
INFO[2021-03-16 19:36:48] Server ID:                 srvnc-nc          
INFO[2021-03-16 19:36:48] Namespace:                 liftbridge-default 
INFO[2021-03-16 19:36:48] NATS Servers:              [nats://172.24.20.9:4222] 
INFO[2021-03-16 19:36:48] Default Retention Policy:  [Age: 3 hours, Compact: false] 
INFO[2021-03-16 19:36:48] Default Partition Pausing: disabled          
INFO[2021-03-16 19:36:48] Starting Liftbridge server on 172.24.20.9:9292... 
INFO[2021-03-16 19:36:48]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:srvnc-nc Address:srvnc-nc}]" 
DEBU[2021-03-16 19:36:48] Loaded existing state for metadata Raft group 
DEBU[2021-03-16 19:36:48] Raft node initialized                        
INFO[2021-03-16 19:36:48]  raft: entering follower state: follower="Node at srvnc-nc [Follower]" leader= 
WARN[2021-03-16 19:36:50]  raft: heartbeat timeout reached, starting election: last-leader= 
INFO[2021-03-16 19:36:50]  raft: entering candidate state: node="Node at srvnc-nc [Candidate]" term=74 
DEBU[2021-03-16 19:36:50] raft: votes: needed=1                        
DEBU[2021-03-16 19:36:50] raft: vote granted: from=srvnc-nc term=74 tally=1 
INFO[2021-03-16 19:36:50]  raft: election won: tally=1                 
INFO[2021-03-16 19:36:50]  raft: entering leader state: leader="Node at srvnc-nc [Leader]" 
INFO[2021-03-16 19:36:50] Server became metadata leader, performing leader promotion actions 
DEBU[2021-03-16 19:36:50] fsm: Replaying Raft log...                   
panic: failed to add stream to metadata store: failed to create commit log: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax

goroutine 55 [running]:
github.com/liftbridge-io/liftbridge/server.(*Server).Apply(0xc000198300, 0xc0004c4230, 0x0, 0x0)
	/home/circleci/project/server/fsm.go:111 +0x3e3
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000389320)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2c1
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0003ca600, 0x40, 0x40)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc0001ea300)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x42f
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc0001ea300, 0xc000388e80)
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	/go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66
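
For context, the strconv error comes from parsing the high watermark checkpoint file as an integer. A minimal sketch of that failure mode (not Liftbridge's actual code; the server's parse path may differ) which reproduces the same message from an empty replication-offset-checkpoint file:

// Sketch only: read a high watermark checkpoint and parse it the way the
// panic message suggests. A zero-byte file makes strconv.ParseInt fail
// with exactly: strconv.ParseInt: parsing "": invalid syntax
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readHW(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	hw, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse high watermark file failed: %w", err)
	}
	return hw, nil
}

func main() {
	// Create an empty file, like the checkpoints in the broken data dir.
	tmp, _ := os.CreateTemp("", "replication-offset-checkpoint")
	tmp.Close()

	if _, err := readHW(tmp.Name()); err != nil {
		// Prints: parse high watermark file failed: strconv.ParseInt: parsing "": invalid syntax
		fmt.Println(err)
	}
}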

config file:

cat /etc/liftbridge/liftbridge.yml
clustering:
    raft.bootstrap.seed: true
    server.id: srvnc-nc
cursors:
    stream.auto.pause.time: 0
    stream.partitions: 1
data:
    dir: /var/lib/liftbridge
host: 172.24.20.9
logging:
    level: debug
    raft: true
nats.servers:
- nats://172.24.20.9:4222
streams:
    compact.enabled: false
    retention.max:
        age: 3h
    segment.max:
        age: 20m

Workaround: drop the contents of the Liftbridge data folder.

P.S. I have an archive of the data that demonstrates this case. I can upload it somewhere.

@ekbfh
Author

ekbfh commented May 21, 2021

@tylertreat Can you say what might be going wrong there?

@tylertreat
Member

Sorry, somehow I missed this issue report. Do you by chance have the contents of the partition data directory for the partition that is causing this error?

@ekbfh
Author

ekbfh commented May 24, 2021

lift.1.5.1_parse.tar.gz
Please take a look at these files. I can't tell which partition is broken, so I attached all of the data files.

@tylertreat
Member

Somehow all of the HW checkpoint files (files named replication-offset-checkpoint in each partition) are empty, which is what causes the panic. I'm not sure how this could happen, but I'm investigating.

Is there any other information about the circumstances in which it occurred? It looks like this happened when restarting a server that already had partition data?
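
To find which partitions are affected, a small diagnostic sketch like this should work (it assumes the on-disk layout shown later in this thread, <data dir>/streams/<stream>/<partition>/replication-offset-checkpoint, and the root path below is just an example):

// Walk the streams directory and report zero-byte HW checkpoint files.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Adjust to your data.dir setting; this matches the config above.
	root := "/var/lib/liftbridge/streams"
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() != "replication-offset-checkpoint" {
			return nil
		}
		if info, err := d.Info(); err == nil && info.Size() == 0 {
			fmt.Println("empty HW checkpoint:", path)
		}
		return nil
	})
}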

@ekbfh
Author

ekbfh commented Jun 15, 2021

That's a possible scenario, but I don't know the real cause of the problem.

@ekbfh
Author

ekbfh commented Jul 20, 2021

Are there any clues yet?

@tylertreat
Member

I'm not sure how to reproduce this issue in order to debug it.

@aversant

aversant commented Nov 2, 2021

To reproduce this behaviour I set up Liftbridge on a VMware VM. Powering it off causes this.

The problem is around this code:

func (l *commitLog) checkpointHW() error {
	var (
		hw   = l.hw
		r    = strings.NewReader(strconv.FormatInt(hw, 10))
		file = filepath.Join(l.Path, hwFileName)
	)
	return atomic_file.WriteFile(file, r)
}

I bumped the atomic_file library to version 1.0.1, and after a shutdown I saw the following:

[root@noc-mt ~]# ls -la /var/lib/liftbridge/streams/ch.alarms/0/
total 80
drwxr-xr-x  2 nobody  wheel       512 Nov  2 19:01 .
drwxr-xr-x  3 nobody  wheel       512 Nov  2 18:25 ..
-rw-r--r--  1 nobody  wheel  10485760 Nov  2 18:25 00000000000000000000.index
-rw-r--r--  1 nobody  wheel         0 Nov  2 18:25 00000000000000000000.log
-rw-------  1 nobody  wheel         4 Nov  2 18:54 leader-epoch-checkpoint
-rw-------  1 nobody  wheel         0 Nov  2 19:01 replication-offset-checkpoint
-rw-------  1 nobody  wheel         2 Nov  2 19:01 replication-offset-checkpoint602025427

The commit may have been affected as well. An example from the __cursors stream:

[root@noc-mt ~]# ls -la /var/lib/liftbridge/streams/__cursors/0/
total 160
drwxr-xr-x  2 nobody  wheel       512 Nov  2 19:01 .
drwxr-xr-x  3 nobody  wheel       512 Nov  2 18:24 ..
-rw-r--r--  1 nobody  wheel  10485760 Nov  2 18:27 00000000000000000000.index
-rw-r--r--  1 nobody  wheel     42660 Nov  2 18:27 00000000000000000000.log
-rw-------  1 nobody  wheel         8 Nov  2 18:29 leader-epoch-checkpoint
-rw-------  1 nobody  wheel         3 Nov  2 19:01 replication-offset-checkpoint
-rw-------  1 nobody  wheel         3 Nov  2 19:01 replication-offset-checkpoint201413289

Liftbridge log: liftbridge_striconv_trace1.txt. The checkpointHW logging was added by me.

@tylertreat
Member

@aversant do you want to create a PR containing the version bump for atomic_file?

@tylertreat
Member

And just to confirm, when you bump the version, you no longer see the empty replication-offset-checkpoint files?

@aversant

aversant commented Nov 3, 2021

And just to confirm, when you bump the version, you no longer see the empty replication-offset-checkpoint files?

No. After bumping the version I still see empty replication-offset-checkpoint files (the ch.alarms/0/ stream, for example). I don't understand why they appear.

@tylertreat
Member

I'm not sure. The atomic file library just does an os.Rename to replace the old checkpoint file with the new one, which should be an atomic operation on Linux.

WriteFile atomically writes the contents of r to the specified filepath. If an error occurs, the target file is guaranteed to be either fully written, or not written at all. WriteFile overwrites any file that exists at the location (but only if the write fully succeeds, otherwise the existing file is unmodified).

It's not clear to me how the contents would become empty.
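
One well-known subtlety with rename-based writes is that the rename is only atomic with respect to the directory entry; it says nothing about whether the new file's data has reached disk. If the temp file isn't fsynced before the rename, a power loss can leave the renamed file present but empty on some filesystems, which would be consistent with the symptoms here (an empty checkpoint next to a leftover temp file). A sketch of the usual crash-safe pattern, as an illustration only and not the atomic_file library's actual code:

// Write-temp, fsync, rename. Without the Sync() before the rename, a
// power failure can leave "path" pointing at a file whose data never
// made it to disk, i.e. an empty checkpoint.
package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

func writeFileAtomic(path string, r io.Reader) error {
	dir := filepath.Dir(path)
	// Temp names look like replication-offset-checkpoint<random digits>,
	// similar to the leftover file in the listings above.
	tmp, err := os.CreateTemp(dir, filepath.Base(path))
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file data before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}
	// Optionally fsync the directory so the rename itself is durable.
	if d, err := os.Open(dir); err == nil {
		d.Sync()
		d.Close()
	}
	return nil
}

func main() {
	_ = writeFileAtomic("/tmp/replication-offset-checkpoint", strings.NewReader("42"))
}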

@tylertreat
Member

@aversant Out of curiosity, are you using the embedded NATS server when you see this issue or a separate standalone NATS?

@aversant

aversant commented Nov 6, 2021

I tested with both the embedded NATS server and a separate standalone NATS server.

I checked again with atomic v1.0.1 on CentOS 7 and couldn't reproduce the problem; I tried cutting the power several dozen times. I'll try again on CentOS 8, and if it doesn't reproduce there either, then atomic v1.0.1 works.

@tylertreat
Member

The atomic dependency was updated in this PR, which will be going out in a release today.

@tylertreat
Member

I believe this is fixed in v1.7.0. Please re-open if it still occurs.
