-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Hey @mattn -- since you saw my post on X about sqlite3 crashing on macOS 26 Tahoe, I've been doing some more digging and learning and I was hoping to get your thoughts.
My code is running on an M2 Macbook Air, with a sqlite DB on an ExFAT-formatted 4TB Samsung T7 SSD storage device. The DB is opened in WAL mode.
I can reliably reproduce the error, and have done so about a dozen times now. Unfortunately, I can't seem to minimize the reproducer, and the program I'm writing processes tons of personal data, so I can't easily/safely share a reproducer. Still, I can reliably get it to occur after about 10 minutes of running.
Here's the crash:
SIGBUS: bus error
PC=0x1021a0298 m=0 sigcode=1 addr=0x105b673ca
signal arrived during cgo execution
goroutine 10452650 gp=0x1400887c8c0 m=0 mp=0x10507b460 [syscall]:
runtime.cgocall(0x1021740a0, 0x140019b2e08)
/usr/local/go/src/runtime/cgocall.go:167 +0x44 fp=0x140019b2dd0 sp=0x140019b2d90 pc=0x100e9fa74
github.com/mattn/go-sqlite3._Cfunc__sqlite3_step_internal(0xa1882c640)
_cgo_gotypes.go:384 +0x30 fp=0x140019b2e00 sp=0x140019b2dd0 pc=0x101319300
github.com/mattn/go-sqlite3.(*SQLiteRows).nextSyncLocked.func1(...)
/Users/matt/go/pkg/mod/github.com/mattn/go-sqlite3@v1.14.32/sqlite3.go:2233
github.com/mattn/go-sqlite3.(*SQLiteRows).nextSyncLocked(0x1400285e060, {0x140012a9b00, 0x8, 0x100eab848?})
/Users/matt/go/pkg/mod/github.com/mattn/go-sqlite3@v1.14.32/sqlite3.go:2233 +0x64 fp=0x140019b2f80 sp=0x140019b2e00 pc=0x1013290e4
github.com/mattn/go-sqlite3.(*SQLiteRows).Next.func1()
/Users/matt/go/pkg/mod/github.com/mattn/go-sqlite3@v1.14.32/sqlite3.go:2214 +0x2c fp=0x140019b2fd0 sp=0x140019b2f80 pc=0x101328f4c
runtime.goexit({})
/usr/local/go/src/runtime/asm_arm64.s:1268 +0x4 fp=0x140019b2fd0 sp=0x140019b2fd0 pc=0x100eab844
created by github.com/mattn/go-sqlite3.(*SQLiteRows).Next in goroutine 10451559
/Users/matt/go/pkg/mod/github.com/mattn/go-sqlite3@v1.14.32/sqlite3.go:2213 +0x17c
The stack trace is the same on each reproduction.
It comes deep inside a syscall within cgo, but if I'm reading this correctly, the calling stack trace says "created by github.com/mattn/go-sqlite3.(*SQLiteRows).Next in goroutine 10451559" which, when I follow that number, takes me to this goroutine stack trace:
goroutine 10451559 gp=0x14001c56540 m=nil [select]:
runtime.gopark(0x140006ab258?, 0x2?, 0x28?, 0xb1?, 0x140006ab22c?)
/usr/local/go/src/runtime/proc.go:460 +0xc0 fp=0x140006ab0a0 sp=0x140006ab080 pc=0x100ea2fe0
runtime.selectgo(0x140006ab258, 0x140006ab228, 0x140006ab258?, 0x0, 0x5?, 0x1)
/usr/local/go/src/runtime/select.go:351 +0x6bc fp=0x140006ab1e0 sp=0x140006ab0a0 pc=0x100e813dc
github.com/mattn/go-sqlite3.(*SQLiteRows).Next(0x1400285e060, {0x140012a9b00, 0x8, 0x8})
/Users/matt/go/pkg/mod/github.com/mattn/go-sqlite3@v1.14.32/sqlite3.go:2216 +0x1c4 fp=0x140006ab2c0 sp=0x140006ab1e0 pc=0x101328d94
database/sql.(*Rows).nextLocked(0x140012e7860)
/usr/local/go/src/database/sql/sql.go:3066 +0x124 fp=0x140006ab320 sp=0x140006ab2c0 pc=0x1011802e4
database/sql.(*Rows).Next.func1()
/usr/local/go/src/database/sql/sql.go:3041 +0x2c fp=0x140006ab350 sp=0x140006ab320 pc=0x10118018c
database/sql.withLock({0x103690810, 0x140012e7898}, 0x140006ab3b8)
/usr/local/go/src/database/sql/sql.go:3572 +0x74 fp=0x140006ab390 sp=0x140006ab350 pc=0x101182874
database/sql.(*Rows).Next(0x140012e7860)
/usr/local/go/src/database/sql/sql.go:3040 +0x6c fp=0x140006ab3e0 sp=0x140006ab390 pc=0x1011800fc
github.com/timelinize/timelinize/timeline.thumbnailJob.iteratePagesOfTasksFromImportJob({{0x0, 0x0, 0x0}, 0x1, 0x0}, 0x14000d3ec60, 0x1)
/Users/matt/Dev/timelinize/timeline/thumbnails.go:203 +0x698 fp=0x140006ab9d0 sp=0x140006ab3e0 pc=0x10171e108
github.com/timelinize/timelinize/timeline.thumbnailJob.Run({{0x0, 0x0, 0x0}, 0x1, 0x0}, 0x14000d3ec60, {0x0?, 0x0?, 0x140006abb88?})
/Users/matt/Dev/timelinize/timeline/thumbnails.go:119 +0x124 fp=0x140006abb10 sp=0x140006ab9d0 pc=0x10171d5d4
github.com/timelinize/timelinize/timeline.(*thumbnailJob).Run(0x103628ae0?, 0x10229f090?, {0x0?, 0x14001170a00?, 0x2?})
<autogenerated>:1 +0x70 fp=0x140006abb90 sp=0x140006abb10 pc=0x1017307f0
github.com/timelinize/timelinize/timeline.(*Timeline).runJob.func1(_, _, _, {{0x1400f1bd170, 0x24}, 0x2, {0x140012dcb00, 0xa}, 0x0, {0x14003ea0580, ...}, ...}, ...)
/Users/matt/Dev/timelinize/timeline/jobs.go:463 +0x258 fp=0x140006abea0 sp=0x140006abb90 pc=0x1016ef558
github.com/timelinize/timelinize/timeline.(*Timeline).runJob.gowrap1()
/Users/matt/Dev/timelinize/timeline/jobs.go:561 +0x5c fp=0x140006abfd0 sp=0x140006abea0 pc=0x1016ef2cc
runtime.goexit({})
/usr/local/go/src/runtime/asm_arm64.s:1268 +0x4 fp=0x140006abfd0 sp=0x140006abfd0 pc=0x100eab844
created by github.com/timelinize/timelinize/timeline.(*Timeline).runJob in goroutine 336
/Users/matt/Dev/timelinize/timeline/jobs.go:424 +0x604
The last line of my code it refers to is this one:
(The line number varies slightly because I've since added debug logs, but it is the same line.)
As far as I can tell, my code is correct. It has run millions of times without issues until recently. While there are concurrent goroutines using the DB, each query is inside an appropriate sync.RWMutex lock, and I seem to be closing the rows and transactions. I can do multiple read queries simultaneously, but writes should be exclusive.
I find it hard to believe it's a bug in sqlite3 itself (I have never encountered a bug in sqlite before!?), and I would say it's the kernel or the file system, but the error happens SO reliably that that can't be either, right? If it was the file system doing some sort of sync or flush or power cycle, why would it happen so reliably? If it was the kernel, I'd expect a kernel panic, not a sigbus.
One of my candidate theories is a bug in this wrapper. Still, I doubt it, because the error only started happening after upgrading to Tahoe, and I have to process a specific set of data in my program. So it's still quite likely a bug in my program.
Still, I wanted to ask you... on following the stack trace deeper, I found it referred to this area of code:
Lines 2211 to 2227 in 8bf7a8a
| resultCh := make(chan error) | |
| defer close(resultCh) | |
| go func() { | |
| resultCh <- rc.nextSyncLocked(dest) | |
| }() | |
| select { | |
| case err := <-resultCh: | |
| return err | |
| case <-rc.ctx.Done(): | |
| select { | |
| case <-resultCh: // no need to interrupt | |
| default: | |
| // this is still racy and can be no-op if executed between sqlite3_* calls in nextSyncLocked. | |
| C.sqlite3_interrupt(rc.s.c.db) | |
| <-resultCh // ensure goroutine completed | |
| } | |
| return rc.ctx.Err() |
Specifically the select on L2216. There is a comment in there that says, // this is still racy and can be no-op if executed between sqlite3_* calls in nextSyncLocked.
Is that possibly related?
Do you have any other ideas what might be the cause of this?
PS. FWIW, I seem to remember writing code that struggled with this pattern:
go func() {
resultCh <- rc.nextSyncLocked(dest)
}()I seem to recall having to instead do:
go func() {
result := rc.nextSyncLocked(dest)
resultCh <- result
}()in order to get the expected results. (I don't remember details, but I think the channel would receive right away, rather than after the nextSyncLocked() function finished.)
I wonder if that could be something else to look into.
Thanks for any help!