
[0.9.5 nightly-ff997a7] Dropping database under a write load causes panics #4538

Closed
desa opened this issue Oct 21, 2015 · 7 comments · Fixed by #4580

desa commented Oct 21, 2015

While running a test, I dropped the database that the test was writing to. This resulted in the following panic.

[tsm1] 2015/10/21 16:09:15 rewriting index file /var/opt/influxdb/data/stress/default/14/0001170.tsm1 with /var/opt/influxdb/data/stress/default/14/0001173.tsm1
[query] 2015/10/21 16:09:17 DROP DATABASE stress
[tsm1wal] 2015/10/21 16:09:17 /var/opt/influxdb/data/stress/default/14 flush to index took 2.755019774s
[wal] 2015/10/21 16:09:18 Flush due to idle. Flushing 16 series with 16 points and 740 bytes from partition 1
[wal] 2015/10/21 16:09:18 write to index of partition 1 took 1.430646ms
[monitor] 2015/10/21 16:09:27 failed to store statistics: timeout
[monitor] 2015/10/21 16:09:37 failed to store statistics: timeout
[monitor] 2015/10/21 16:09:47 failed to store statistics: timeout
[query] 2015/10/21 16:09:48 DROP DATABASE stress
[monitor] 2015/10/21 16:09:57 failed to store statistics: timeout
[monitor] 2015/10/21 16:10:07 failed to store statistics: timeout
[tsm1] 2015/10/21 16:10:12 Compaction of /var/opt/influxdb/data/stress/default/14 took 4m13.079764908s
panic: error opening new segment file for wal: open /var/opt/influxdb/data/stress/default/14/_03549.wal: no such file or directory

goroutine 407990 [running]:
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).writeToLog(0xc823cd4700, 0x62101, 0xc84d0dc000, 0x10a15, 0x6218e, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:411 +0x94b
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).WritePoints(0xc823cd4700, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:211 +0x35f
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints(0xc823f3d560, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/tsm1.go:348 +0x8a
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc823f51b80, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/shard.go:263 +0x670
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc820086780, 0xe, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/store.go:396 +0x14e
github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard.func1(0xc82007a6c0, 0xc830dc56ef, 0x6, 0xc824689580, 0x7, 0xc831be8540, 0x1, 0xe, 0x1, 0xc842c66000, ...)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/cluster/points_writer.go:302 +0x10b
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/cluster/points_writer.go:334 +0x1c5

desa commented Oct 22, 2015

Repro steps:

  1. In one terminal session, start running influx_stress -test cmd/influx_stress/examples/100B_STD.toml
  2. Wait for a few minutes.
  3. In another session, run influx_stress -test cmd/influx_stress/examples/5M_STD.toml

The database will either panic or lock up for writes.


jwilder commented Oct 23, 2015

The panic is fixed by #4543, but the database is still deadlocked.

Stack trace of the deadlock: https://gist.github.com/jwilder/7a0672f46598afd77200


jwilder commented Oct 23, 2015

Actually, the panic is not fixed by #4543; I just hit it locally while looking into the deadlock.

The issue is that L435 intentionally panics because the write failed to create a new WAL segment file in the DB dir. When the database is dropped, the DB dirs are removed. Under write load, writes queue up waiting to acquire locks, and the DB dirs are removed out from beneath them. New writes fail since the DB no longer exists, but the server will crash if there were in-flight writes at that time. This should probably return an error instead of panicking.

The writes below that line can also fail for the same reason: L439-L450.
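
Roughly, the change would look something like this. This is a simplified, hypothetical sketch rather than the actual wal.go code; the wal type, its fields, and the paths are made up for illustration.

package main

import (
	"fmt"
	"os"
)

// Hypothetical, simplified stand-in for the tsm1 WAL; the type and field
// names are illustrative, not the actual wal.go code.
type wal struct {
	path    string
	segment *os.File
}

// writeToLog sketches the proposed change: return an error when the new
// segment file cannot be created (e.g. because the DB dir was just dropped)
// instead of panicking the whole server.
func (w *wal) writeToLog(data []byte) error {
	if w.segment == nil {
		f, err := os.OpenFile(w.path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0666)
		if err != nil {
			// Previously this case panicked; returning the error lets the
			// shard/points writer reject just the in-flight write.
			return fmt.Errorf("error opening new segment file for wal: %s", err)
		}
		w.segment = f
	}
	if _, err := w.segment.Write(data); err != nil {
		// The writes below segment creation (L439-L450) need the same
		// treatment or they will still panic.
		return fmt.Errorf("error writing to wal: %s", err)
	}
	return nil
}

func main() {
	w := &wal{path: "/nonexistent/stress/default/14/_03549.wal"}
	if err := w.writeToLog([]byte("cpu value=1")); err != nil {
		fmt.Println("write rejected:", err) // server keeps running
	}
}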

@pauldix Thoughts?

@jwilder jwilder added this to the 0.9.5 milestone Oct 23, 2015

otoolep commented Oct 23, 2015

Yeah, an error seems sensible here.


pauldix commented Oct 24, 2015

@jwilder how are the DB dirs removed from beneath writes that are waiting to acquire a lock? Doesn't the removal require a lock too?

Returning an error on the WAL segment sounds OK to me, but not for anything in the index. All of those system-level errors should panic the server. I'd rather fail hard than open up the possibility of data corruption, or have things failing while the user thinks the server is actually up and healthy.


jwilder commented Oct 26, 2015

@pauldix What happens is that in-flight writes validate that the database and shards exist in the PointsWriter here: https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L231 and then fire off a few goroutines to process the writes concurrently. They end up blocking here: https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L265, waiting for each write to acquire the tsdb.Store write lock: https://github.com/influxdb/influxdb/blob/master/tsdb/store.go#L382.

A drop database call can execute while some pending goroutines are blocked on L231. The drop DB call acquires the tsdb.Store write lock and deletes the DB dirs. When it returns and allows the goroutines to complete, they fail to create a new segment in the DB dirs and panic.
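
A stripped-down illustration of that interleaving is below; a plain Mutex and a temp dir stand in for the real tsdb.Store lock and shard directory, and none of this is InfluxDB code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// storeMu stands in for the tsdb.Store lock, dir for a shard's data directory.
func main() {
	var storeMu sync.Mutex
	dir, err := os.MkdirTemp("", "stress-shard")
	if err != nil {
		panic(err)
	}

	// DROP DATABASE holds the store write lock while it removes the DB dirs.
	storeMu.Lock()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// An in-flight write has already passed the "database exists" check
		// in the PointsWriter and is now queued up behind the write lock.
		storeMu.Lock()
		defer storeMu.Unlock()
		_, err := os.Create(filepath.Join(dir, "_00001.wal"))
		fmt.Println("new segment:", err) // no such file or directory -> panic today
	}()

	os.RemoveAll(dir) // the DB dirs are gone
	storeMu.Unlock()  // blocked writes now proceed against missing dirs

	<-done
}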

Even with returning an error for the new segment call, we could still get a panic in the later calls. What would the user actually do in this case? If they restart, does the WAL drop corrupted blocks or would it fail to start?


jwilder commented Oct 26, 2015

Verified that returning an error from creating the segments prevents the original panic, but now I'm getting a new panic on the writes below.

[tsm1wal] 2015/10/26 10:03:04 addToCache: failed to flush: error creating new wal file: open /Users/jason/.influxdb/data/stress/default/24/_00023.wal: no such file or directory
panic: error writing type to wal: write /Users/jason/.influxdb/data/stress/default/24/_00022.wal: bad file descriptor

goroutine 5747 [running]:
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).writeToLog(0xc82774c200, 0x62101, 0xc83f236000, 0x10a28, 0x62160, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:444 +0x2fb
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).WritePoints(0xc82774c200, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:215 +0x341
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints(0xc8240a6b40, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/tsm1.go:389 +0x8a
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc8235ce2c0, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/shard.go:263 +0x670
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc820154000, 0x18, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/store.go:396 +0x14e
github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard.func1(0xc82014a1b0, 0xc8448f8fcf, 0x6, 0xc835449520, 0x7, 0xc83743c420, 0x1, 0x18, 0x1, 0xc825f06000, ...)
    /Users/jason/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:302 +0x10b
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /Users/jason/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:334 +0x1c5

jwilder added a commit that referenced this issue Oct 26, 2015
If a drop database is executed while writes are in flight, a panic
could occur because the WAL would fail to write to the DB dirs that
had been removed.

Partial fix for #4538
@jwilder jwilder self-assigned this Oct 26, 2015