Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LevelDB/LevelQueue could get stuck on NFS when NFS locking doesn't work properly #24684

Closed
wxiaoguang opened this issue May 12, 2023 · 10 comments · Fixed by #25202
Closed

LevelDB/LevelQueue could get stuck on NFS when NFS locking doesn't work properly #24684

wxiaoguang opened this issue May 12, 2023 · 10 comments · Fixed by #25202
Labels
issue/workaround it is or has a workaround type/docs This PR mainly updates/creates documentation

Comments

@wxiaoguang
Copy link
Contributor

wxiaoguang commented May 12, 2023

Description

Not only one user (for 1.19 and maybe early version) reports that Gitea gets stuck when starting, the log doesn't help (outputted logs are unrelated).

It seems related to NFS: NFS lock system is broken.

The leveldb code (3rd) does syscall.Flock(int(f.Fd()), how|syscall.LOCK_NB), this call should never block, but it seems that NFS blocks it.

Workaround

Reset NFS locking service:

#24684 (comment)

The Gitea stacktrace when blocking

gitea-0:/# curl http://127.0.0.1:6060/debug/pprof/goroutine?debug=1
goroutine profile: total 16
4 @ 0x4447b6 0x454a3e 0x1897aef 0x478a81
#	0x1897aee	github.com/blevesearch/bleve_index_api.AnalysisWorker+0x8e	/go/pkg/mod/github.com/blevesearch/bleve_index_api@v1.0.5/analysis.go:46

1 @ 0x415234 0x474e8f 0xd9b3f9 0x478a81
# labels: {"graceful-lifecycle":"with-manager", "pid":"645e42ca-4", "ppid":"", "process-description":"Graceful: HandleSignals", "process-type":"system"}
#	0x474e8e	os/signal.signal_recv+0x2e	/usr/local/go/src/runtime/sigqueue.go:152
#	0xd9b3f8	os/signal.loop+0x18		/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x4394d6 0x472685 0x80e295 0x80e0ad 0x80ae2b 0x20f92a5 0x20f9df3 0x9b982f 0x9bb689 0x9bce36 0x9b8352 0x478a81
#	0x472684	runtime/pprof.runtime_goroutineProfileWithLabels+0x24	/usr/local/go/src/runtime/mprof.go:844
#	0x80e294	runtime/pprof.writeRuntimeProfile+0xb4			/usr/local/go/src/runtime/pprof/pprof.go:734
#	0x80e0ac	runtime/pprof.writeGoroutine+0x4c			/usr/local/go/src/runtime/pprof/pprof.go:694
#	0x80ae2a	runtime/pprof.(*Profile).WriteTo+0x14a			/usr/local/go/src/runtime/pprof/pprof.go:329
#	0x20f92a4	net/http/pprof.handler.ServeHTTP+0x4a4			/usr/local/go/src/net/http/pprof/pprof.go:259
#	0x20f9df2	net/http/pprof.Index+0xf2				/usr/local/go/src/net/http/pprof/pprof.go:376
#	0x9b982e	net/http.HandlerFunc.ServeHTTP+0x2e			/usr/local/go/src/net/http/server.go:2122
#	0x9bb688	net/http.(*ServeMux).ServeHTTP+0x148			/usr/local/go/src/net/http/server.go:2500
#	0x9bce35	net/http.serverHandler.ServeHTTP+0x315			/usr/local/go/src/net/http/server.go:2936
#	0x9b8351	net/http.(*conn).serve+0x611				/usr/local/go/src/net/http/server.go:1995

1 @ 0x4447b6 0x40f91d 0x40f418 0xee1c45 0x1c3dea5 0x1c3dc4b 0x1c3e7b6 0x1c396ce 0x1c44be9 0x1d24d08 0x1d24c79 0x24aad1d 0x2597cd9 0x1dbbdf0 0x1dbcc3b 0x1db98e7 0x25d782b 0x444387 0x478a81
# labels: {"graceful-lifecycle":"with-hammer"}
#	0xee1c44	code.gitea.io/gitea/modules/nosql.(*Manager).GetLevelDB+0x144		/go/src/code.gitea.io/gitea/modules/nosql/manager_leveldb.go:70
#	0x1c3dea4	code.gitea.io/gitea/modules/queue.NewLevelQueueByteFIFO+0x44		/go/src/code.gitea.io/gitea/modules/queue/queue_disk.go:70
#	0x1c3dc4a	code.gitea.io/gitea/modules/queue.NewLevelQueue+0x18a			/go/src/code.gitea.io/gitea/modules/queue/queue_disk.go:43
#	0x1c3e7b5	code.gitea.io/gitea/modules/queue.NewPersistableChannelQueue+0x475	/go/src/code.gitea.io/gitea/modules/queue/queue_disk_channel.go:101
#	0x1c396cd	code.gitea.io/gitea/modules/queue.NewQueue+0x8d				/go/src/code.gitea.io/gitea/modules/queue/queue.go:200
#	0x1c44be8	code.gitea.io/gitea/modules/queue.CreateQueue+0x268			/go/src/code.gitea.io/gitea/modules/queue/setting.go:50
#	0x1d24d07	code.gitea.io/gitea/modules/notification/ui.NewNotifier+0xa7		/go/src/code.gitea.io/gitea/modules/notification/ui/ui.go:40
#	0x1d24c78	code.gitea.io/gitea/modules/notification.NewContext+0x18		/go/src/code.gitea.io/gitea/modules/notification/notification.go:34
#	0x24aad1c	code.gitea.io/gitea/routers.GlobalInitInstalled+0x53c			/go/src/code.gitea.io/gitea/routers/init.go:135
#	0x2597cd8	code.gitea.io/gitea/cmd.runWeb+0x8b8					/go/src/code.gitea.io/gitea/cmd/web.go:163
#	0x1dbbdef	github.com/urfave/cli.HandleAction+0x4f					/go/pkg/mod/github.com/urfave/cli@v1.22.10/app.go:524
#	0x1dbcc3a	github.com/urfave/cli.Command.Run+0x67a					/go/pkg/mod/github.com/urfave/cli@v1.22.10/command.go:173
#	0x1db98e6	github.com/urfave/cli.(*App).Run+0xb86					/go/pkg/mod/github.com/urfave/cli@v1.22.10/app.go:277
#	0x25d782a	main.main+0xcea								/go/src/code.gitea.io/gitea/main.go:118
#	0x444386	runtime.main+0x206							/usr/local/go/src/runtime/proc.go:250

1 @ 0x4447b6 0x43cdf7 0x472b29 0x4ef092 0x4f499d 0x4f498b 0x6029f5 0x61c9a5 0x61b75d 0x9bd3a5 0x9bcfdd 0x259b1cd 0x259b1ce 0x478a81
# labels: {"pid":"645e42ca-5", "ppid":"", "process-description":"Web: PProf Server", "process-type":"system"}
#	0x472b28	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:306
#	0x4ef091	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0x4f499c	internal/poll.(*pollDesc).waitRead+0x2bc	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
#	0x4f498a	internal/poll.(*FD).Accept+0x2aa		/usr/local/go/src/internal/poll/fd_unix.go:614
#	0x6029f4	net.(*netFD).accept+0x34			/usr/local/go/src/net/fd_unix.go:172
#	0x61c9a4	net.(*TCPListener).accept+0x24			/usr/local/go/src/net/tcpsock_posix.go:148
#	0x61b75c	net.(*TCPListener).Accept+0x3c			/usr/local/go/src/net/tcpsock.go:297
#	0x9bd3a4	net/http.(*Server).Serve+0x384			/usr/local/go/src/net/http/server.go:3059
#	0x9bcfdc	net/http.(*Server).ListenAndServe+0x7c		/usr/local/go/src/net/http/server.go:2988
#	0x259b1cc	net/http.ListenAndServe+0xcc			/usr/local/go/src/net/http/server.go:3242
#	0x259b1cd	code.gitea.io/gitea/cmd.runWeb.func2+0xcd	/go/src/code.gitea.io/gitea/cmd/web.go:154

1 @ 0x4447b6 0x454a3e 0x8a2a36 0x478a81
# labels: {"pid":"645e42cb", "ppid":"645e42ca", "process-description":"Logger: console(console)", "process-type":"system"}
#	0x8a2a35	code.gitea.io/gitea/modules/log.(*ChannelledLog).Start+0xd5	/go/src/code.gitea.io/gitea/modules/log/event.go:79

1 @ 0x4447b6 0x454a3e 0x8a2a36 0x478a81
# labels: {"pid":"645e42cb-3", "ppid":"645e42cb-2", "process-description":"Logger: console(console)", "process-type":"system"}
#	0x8a2a35	code.gitea.io/gitea/modules/log.(*ChannelledLog).Start+0xd5	/go/src/code.gitea.io/gitea/modules/log/event.go:79

1 @ 0x4447b6 0x454a3e 0x8a42dc 0x478a81
# labels: {"pid":"645e42ca", "ppid":"", "process-description":"Logger: default", "process-type":"system"}
#	0x8a42db	code.gitea.io/gitea/modules/log.(*MultiChannelledLog).Start+0x2bb	/go/src/code.gitea.io/gitea/modules/log/event.go:340

1 @ 0x4447b6 0x454a3e 0x8a42dc 0x478a81
# labels: {"pid":"645e42cb-2", "ppid":"", "process-description":"Logger: router", "process-type":"system"}
#	0x8a42db	code.gitea.io/gitea/modules/log.(*MultiChannelledLog).Start+0x2bb	/go/src/code.gitea.io/gitea/modules/log/event.go:340

1 @ 0x4447b6 0x454a3e 0xda1310 0x478a81
# labels: {"graceful-lifecycle":"with-manager", "pid":"645e42ca-4", "ppid":"", "process-description":"Graceful: HandleSignals", "process-type":"system"}
#	0xda130f	code.gitea.io/gitea/modules/graceful.(*Manager).handleSignals+0x22f	/go/src/code.gitea.io/gitea/modules/graceful/manager_unix.go:131

1 @ 0x4447b6 0x455b2f 0x455b06 0x474667 0x4971ab 0xda0f85 0x478a81
# labels: {"graceful-lifecycle":"with-manager"}
#	0x474666	sync.runtime_Semacquire+0x26						/usr/local/go/src/runtime/sema.go:62
#	0x4971aa	sync.(*WaitGroup).Wait+0x4a						/usr/local/go/src/sync/waitgroup.go:116
#	0xda0f84	code.gitea.io/gitea/modules/graceful.(*Manager).start.func1+0x64	/go/src/code.gitea.io/gitea/modules/graceful/manager_unix.go:83

1 @ 0x4d2287 0x4cfc2a 0xabf10b 0xabefa5 0xab9932 0xae25f9 0xee37a8 0xee1d72 0x478a81
# labels: {"pid":"645e42ca-3", "ppid":"", "process-description":"Service: NoSQL", "process-type":"system"}
#	0x4d2286	syscall.Syscall+0x26							/usr/local/go/src/syscall/syscall_linux.go:69
#	0x4cfc29	syscall.Flock+0x29							/usr/local/go/src/syscall/zsyscall_linux_amd64.go:428
#	0xabf10a	github.com/syndtr/goleveldb/leveldb/storage.setFileLock+0x8a		/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.0/leveldb/storage/file_storage_unix.go:59
#	0xabefa4	github.com/syndtr/goleveldb/leveldb/storage.newFileLock+0xc4		/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.0/leveldb/storage/file_storage_unix.go:41
#	0xab9931	github.com/syndtr/goleveldb/leveldb/storage.OpenFile+0x1d1		/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.0/leveldb/storage/file_storage.go:107
#	0xae25f8	github.com/syndtr/goleveldb/leveldb.OpenFile+0x38			/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.0/leveldb/db.go:215
#	0xee37a7	code.gitea.io/gitea/modules/nosql.(*Manager).getLevelDB+0x1867		/go/src/code.gitea.io/gitea/modules/nosql/manager_leveldb.go:191
#	0xee1d71	code.gitea.io/gitea/modules/nosql.(*Manager).GetLevelDB.func1+0xb1	/go/src/code.gitea.io/gitea/modules/nosql/manager_leveldb.go:68

1 @ 0x4d2287 0x4d01c5 0x4f04ae 0x4f0496 0x4f0216 0x600869 0x6130c5 0x9b209f 0x478a81
# labels: {"pid":"645e42ca-5", "ppid":"", "process-description":"Web: PProf Server", "process-type":"system"}
#	0x4d2286	syscall.Syscall+0x26				/usr/local/go/src/syscall/syscall_linux.go:69
#	0x4d01c4	syscall.read+0x44				/usr/local/go/src/syscall/zsyscall_linux_amd64.go:711
#	0x4f04ad	syscall.Read+0x2cd				/usr/local/go/src/syscall/syscall_unix.go:178
#	0x4f0495	internal/poll.ignoringEINTRIO+0x2b5		/usr/local/go/src/internal/poll/fd_unix.go:794
#	0x4f0215	internal/poll.(*FD).Read+0x35			/usr/local/go/src/internal/poll/fd_unix.go:163
#	0x600868	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6130c4	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
@wxiaoguang wxiaoguang changed the title LevelDB/LevelQueue could get stuck for unknown reason LevelDB/LevelQueue could get stuck on NFS May 12, 2023
@lunny lunny added the type/upstream This is an issue in one of Gitea's dependencies and should be reported there label May 12, 2023
@lunny
Copy link
Member

lunny commented May 12, 2023

It should be related with levelqueue and leveldb. But I don't think these data should be stored into NFS. They should be stored into local disk.

@wxiaoguang
Copy link
Contributor Author

It's related to the filesystem. LOCK_NB means non-blocking. Such call should never block.

@wxiaoguang wxiaoguang removed type/upstream This is an issue in one of Gitea's dependencies and should be reported there type/bug labels May 12, 2023
@ccureau
Copy link

ccureau commented May 12, 2023

Some additional context around the issue as it happened to me...

I am using nfs-subdir-external-provisioner to provide PVs to my cluster, for which the backing store is a QNAP NAS using NFS v3 (no v4 on this blasted hardware...) After an update to 1.19.1, the issue begins to present.

I'm unsure of what actually is stored in the leveldb, but could it be better served in a emptydir volume?

@KN4CK3R
Copy link
Member

KN4CK3R commented May 12, 2023

The queue is used to store different informations like what webhooks must be delivered. This is stored on disk to survive a Gitea restart.

@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented May 12, 2023

I suspect the problem is not related to Gitea itself (or version), but it seems that NFS doesn't handle the lock when the last Gitea exits.

NFS locking is quite complicated.

If you have a chance, you can use command line tool flock to check whether the file locking works in NFS, eg:


# make sure the target file exists
ls -l {gitea-data}/queues/common/LOCK

# this is LOCK_NB, it should never block
flock -n {gitea-data}/queues/common/LOCK ...

# this might block if the lock is not acquired
flock {gitea-data}/queues/common/LOCK ...

https://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_02.htm

https://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch07_05.htm

@ccureau
Copy link

ccureau commented May 12, 2023

gitea-0:~# ls -l /data/gitea/queues/common/LOCK 
-rw-r--r--    1 git      git              0 Apr 21 20:49 /data/gitea/queues/common/LOCK
gitea-0:~# flock -n /data/gitea/queues/common/LOCK 
flock: invalid number '/data/gitea/queues/common/LOCK'

I checked the NFS options on the QNAP and see that oplocks are enabled. But, as you mentioned, NFS ain't easy.

@wxiaoguang
Copy link
Contributor Author

Hmm, flock need a sub command.

$ time flock -n ./test-file-lock echo "lock is acquired"
lock is acquired

real	0m0.002s
user	0m0.001s
sys	0m0.001s

@ccureau
Copy link

ccureau commented May 12, 2023

With the subcommand, we've reproduced the issue. The command hangs regardless of the state of oplocks.

@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented May 12, 2023

Yup, so the problem should be fixed from NFS side.

I can think about a workaround: remove your queues directory completely, then start Gitea, then a new LOCK file will be created automatically which have a new id, then it might work again. If it still doesn't work, the NFS locking must have been broken more seriously than we thought.

Another approach might be restarting your NFS server, or fix/reset some locking service?

@wxiaoguang wxiaoguang changed the title LevelDB/LevelQueue could get stuck on NFS LevelDB/LevelQueue could get stuck on NFS when NFS locking doesn't work properly May 12, 2023
@ccureau
Copy link

ccureau commented May 12, 2023

I just solved the issue...and it was completely to do with NFS.

Did a ssh into the QNAP and saw that there were issues in the kernel log:

[1124925.091886] lockd: cannot monitor node-3
[1124960.160037] statd: server rpc.statd not responding, timed out

I did a little surgery to correct the situation:

# /etc/init.d/nfs stop
# rm -rf /var/lib/nfs/sm/*
# /etc/init.d/nfs start

After that, locking seems to work fine:

gitea-0:/# time flock -n /data/gitea/queues/common/LOCK echo 'lock acquired'
lock acquired

real	0m0.005s
user	0m0.003s
sys	0m0.001s
gitea-0:/# time flock -u /data/gitea/queues/common/LOCK echo 'lock acquired'
lock acquired

real	0m0.004s
user	0m0.003s
sys	0m0.000s
gitea-0:/# 

Sorry for the noise, and thank you so much for all of your help. I'm really excited about gitea and would love to help contribute to actions!

@wxiaoguang wxiaoguang added issue/workaround it is or has a workaround type/docs This PR mainly updates/creates documentation labels May 12, 2023
lunny pushed a commit that referenced this issue Jun 12, 2023
Close #23654
Close #24684

@techknowlogick I still think we need to rename
https://dl.gitea.com/gitea/1.20/ to
https://dl.gitea.com/gitea/1.20-nightly/

`/gitea/1.20/` is quite confusing, it needs these words to explain why.
If we call it `1.20-nightly`, the FAQ can be simplified a lot.
GiteaBot pushed a commit to GiteaBot/gitea that referenced this issue Jun 12, 2023
…25202)

Close go-gitea#23654
Close go-gitea#24684

@techknowlogick I still think we need to rename
https://dl.gitea.com/gitea/1.20/ to
https://dl.gitea.com/gitea/1.20-nightly/

`/gitea/1.20/` is quite confusing, it needs these words to explain why.
If we call it `1.20-nightly`, the FAQ can be simplified a lot.
lunny pushed a commit that referenced this issue Jun 12, 2023
…25204)

Backport #25202 by @wxiaoguang

Close #23654

Close #24684


@techknowlogick I still think we need to rename
https://dl.gitea.com/gitea/1.20/ to
https://dl.gitea.com/gitea/1.20-nightly/

`/gitea/1.20/` is quite confusing, it needs these words to explain why.
If we call it `1.20-nightly`, the FAQ can be simplified a lot.

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/workaround it is or has a workaround type/docs This PR mainly updates/creates documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants