
[0.9.5 nightly-ff997a7] Dropping database under a write load causes panics #4538

Closed
desa opened this issue Oct 21, 2015 · 7 comments · Fixed by #4580

desa commented Oct 21, 2015

While running a test, I dropped the database that the test was writing to. This resulted in the following panic.

[tsm1] 2015/10/21 16:09:15 rewriting index file /var/opt/influxdb/data/stress/default/14/0001170.tsm1 with /var/opt/influxdb/data/stress/default/14/0001173.tsm1
[query] 2015/10/21 16:09:17 DROP DATABASE stress
[tsm1wal] 2015/10/21 16:09:17 /var/opt/influxdb/data/stress/default/14 flush to index took 2.755019774s
[wal] 2015/10/21 16:09:18 Flush due to idle. Flushing 16 series with 16 points and 740 bytes from partition 1
[wal] 2015/10/21 16:09:18 write to index of partition 1 took 1.430646ms
[monitor] 2015/10/21 16:09:27 failed to store statistics: timeout
[monitor] 2015/10/21 16:09:37 failed to store statistics: timeout
[monitor] 2015/10/21 16:09:47 failed to store statistics: timeout
[query] 2015/10/21 16:09:48 DROP DATABASE stress
[monitor] 2015/10/21 16:09:57 failed to store statistics: timeout
[monitor] 2015/10/21 16:10:07 failed to store statistics: timeout
[tsm1] 2015/10/21 16:10:12 Compaction of /var/opt/influxdb/data/stress/default/14 took 4m13.079764908s
panic: error opening new segment file for wal: open /var/opt/influxdb/data/stress/default/14/_03549.wal: no such file or directory

goroutine 407990 [running]:
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).writeToLog(0xc823cd4700, 0x62101, 0xc84d0dc000, 0x10a15, 0x6218e, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:411 +0x94b
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).WritePoints(0xc823cd4700, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:211 +0x35f
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints(0xc823f3d560, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/tsm1.go:348 +0x8a
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc823f51b80, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/shard.go:263 +0x670
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc820086780, 0xe, 0xc842c66000, 0x1387, 0x1800, 0x0, 0x0)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/tsdb/store.go:396 +0x14e
github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard.func1(0xc82007a6c0, 0xc830dc56ef, 0x6, 0xc824689580, 0x7, 0xc831be8540, 0x1, 0xe, 0x1, 0xc842c66000, ...)
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/cluster/points_writer.go:302 +0x10b
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
        /tmp/tmp.vlHqNIFOU1/src/github.com/influxdb/influxdb/cluster/points_writer.go:334 +0x1c5

desa commented Oct 22, 2015

Repro steps:

  1. In one terminal session, start running influx_stress -test cmd/influx_stress/examples/100B_STD.toml
  2. Wait for a few minutes.
  3. In another session, run influx_stress -test cmd/influx_stress/examples/5M_STD.toml

The database will either panic or lock up for writes.


jwilder commented Oct 23, 2015

The panic is fixed by #4543, but the database is still deadlocked.

Stack trace of the deadlock: https://gist.github.com/jwilder/7a0672f46598afd77200


jwilder commented Oct 23, 2015

Actually, the panic is not fixed by #4543; I just hit it locally while looking into the deadlock.

The issue is that L435 intentionally panics because the write failed to create a new WAL segment file in the DB dir. When the database is dropped, the DB dirs are removed. Under write load, writes queue up waiting to acquire locks, and the DB dirs are removed out from beneath them. New writes fail since the DB no longer exists, but the server will crash if there were in-flight writes at that time. This should probably return an error instead of panicking.

The writes below that line can also fail for the same reason: L439-L450.
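
Roughly, the change would look something like this. This is a simplified, hypothetical sketch rather than the actual wal.go code; the wal type, its fields, and the paths are made up for illustration.

package main

import (
	"fmt"
	"os"
)

// Hypothetical, simplified stand-in for the tsm1 WAL; the type and field
// names are illustrative, not the actual wal.go code.
type wal struct {
	path    string
	segment *os.File
}

// writeToLog sketches the proposed change: return an error when the new
// segment file cannot be created (e.g. because the DB dir was just dropped)
// instead of panicking the whole server.
func (w *wal) writeToLog(data []byte) error {
	if w.segment == nil {
		f, err := os.OpenFile(w.path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0666)
		if err != nil {
			// Previously this case panicked; returning the error lets the
			// shard/points writer reject just the in-flight write.
			return fmt.Errorf("error opening new segment file for wal: %s", err)
		}
		w.segment = f
	}
	if _, err := w.segment.Write(data); err != nil {
		// The writes below segment creation (L439-L450) need the same
		// treatment or they will still panic.
		return fmt.Errorf("error writing to wal: %s", err)
	}
	return nil
}

func main() {
	w := &wal{path: "/nonexistent/stress/default/14/_03549.wal"}
	if err := w.writeToLog([]byte("cpu value=1")); err != nil {
		fmt.Println("write rejected:", err) // server keeps running
	}
}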

@pauldix Thoughts?

@jwilder jwilder added this to the 0.9.5 milestone Oct 23, 2015

otoolep commented Oct 23, 2015

Yeah, an error seems sensible here.


pauldix commented Oct 24, 2015

@jwilder how are the DB dirs removed from beneath writes that are waiting to acquire a lock? Doesn't the removal require a lock too?

Returning an error on the WAL segment sounds OK to me, but not for anything in the index. All of those system-level errors should panic the server. I'd rather fail hard than open up the possibility of data corruption, or have things failing while the user thinks the server is actually up and healthy.


jwilder commented Oct 26, 2015

@pauldix What happens is that in-flight writes validate that the database and shards exist in the PointsWriter here: https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L231 and then fire off a few goroutines to process the writes concurrently. They end up blocking here: https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L265, waiting for each write to acquire the tsdb.Store write lock: https://github.com/influxdb/influxdb/blob/master/tsdb/store.go#L382.

A drop database call can execute while some pending goroutines are blocked on L231. The drop DB call acquires the tsdb.Store write lock and deletes the DB dirs. When it returns and allows the goroutines to complete, they fail to create a new segment in the DB dirs and panic.
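
A stripped-down illustration of that interleaving is below; a plain Mutex and a temp dir stand in for the real tsdb.Store lock and shard directory, and none of this is InfluxDB code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// storeMu stands in for the tsdb.Store lock, dir for a shard's data directory.
func main() {
	var storeMu sync.Mutex
	dir, err := os.MkdirTemp("", "stress-shard")
	if err != nil {
		panic(err)
	}

	// DROP DATABASE holds the store write lock while it removes the DB dirs.
	storeMu.Lock()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// An in-flight write has already passed the "database exists" check
		// in the PointsWriter and is now queued up behind the write lock.
		storeMu.Lock()
		defer storeMu.Unlock()
		_, err := os.Create(filepath.Join(dir, "_00001.wal"))
		fmt.Println("new segment:", err) // no such file or directory -> panic today
	}()

	os.RemoveAll(dir) // the DB dirs are gone
	storeMu.Unlock()  // blocked writes now proceed against missing dirs

	<-done
}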

Even with returning an error for the new segment call, we could still get a panic in the later calls. What would the user actually do in this case? If they restart, does the WAL drop corrupted blocks or would it fail to start?


jwilder commented Oct 26, 2015

Verified that returning an error from creating the segments prevents the original panic, but now I'm getting a new panic on the writes below.

[tsm1wal] 2015/10/26 10:03:04 addToCache: failed to flush: error creating new wal file: open /Users/jason/.influxdb/data/stress/default/24/_00023.wal: no such file or directory
panic: error writing type to wal: write /Users/jason/.influxdb/data/stress/default/24/_00022.wal: bad file descriptor

goroutine 5747 [running]:
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).writeToLog(0xc82774c200, 0x62101, 0xc83f236000, 0x10a28, 0x62160, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:444 +0x2fb
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).WritePoints(0xc82774c200, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:215 +0x341
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints(0xc8240a6b40, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/tsm1.go:389 +0x8a
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc8235ce2c0, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/shard.go:263 +0x670
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc820154000, 0x18, 0xc825f06000, 0x1387, 0x1800, 0x0, 0x0)
    /Users/jason/go/src/github.com/influxdb/influxdb/tsdb/store.go:396 +0x14e
github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard.func1(0xc82014a1b0, 0xc8448f8fcf, 0x6, 0xc835449520, 0x7, 0xc83743c420, 0x1, 0x18, 0x1, 0xc825f06000, ...)
    /Users/jason/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:302 +0x10b
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /Users/jason/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:334 +0x1c5

jwilder added a commit that referenced this issue Oct 26, 2015
If a drop database is executed while writes are in flight, a panic
could occur because the WAL would fail to write to the DB dirs that
had been removed.

Partial fix for #4538
@jwilder jwilder self-assigned this Oct 26, 2015