[0.9.5 nightly-ff997a7] Dropping database under a write load causes panics #4538
Repro steps:
The database will either panic or lock up for writes.
The panic is fixed by #4543, but the database is still deadlocked. Stack trace of the deadlock: https://gist.github.com/jwilder/7a0672f46598afd77200
Actually, the panic is not fixed by #4543; I just hit it locally while looking into the deadlock. The issue is that L435 intentionally panics because the write failed to create a new WAL segment in the DB dir. When the database is dropped, the DB dirs are removed. Under write load, writes are queued up waiting to acquire locks, and the DB dirs are then removed from beneath them. New writes fail since the DB does not exist, but the server will crash if there were in-flight writes at that time. This should probably return an error instead of panicking. The writes below that line can also fail for the same reason: L439-L450 @pauldix Thoughts?
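To make the proposal concrete, here is a minimal sketch of returning an error from segment creation instead of panicking. It is not the actual WAL code: `newSegmentFile`, the `%06d.wal` file naming, and the temp directory standing in for a DB dir are all hypothetical, chosen only to illustrate the idea.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// newSegmentFile is a hypothetical helper (not the real WAL function).
// It tries to create a segment file in dir and reports failure as an
// error, so the caller can reject the write instead of crashing.
func newSegmentFile(dir string, id int) (*os.File, error) {
	// The file naming scheme here is an assumption for illustration.
	name := filepath.Join(dir, fmt.Sprintf("%06d.wal", id))
	f, err := os.OpenFile(name, os.O_CREATE|os.O_RDWR, 0666)
	if err != nil {
		// If the DB dir was removed by a drop, we end up here.
		return nil, fmt.Errorf("create wal segment %s: %w", name, err)
	}
	return f, nil
}

func main() {
	// A temp dir stands in for the database directory.
	dir, err := os.MkdirTemp("", "db")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// Simulate DROP DATABASE removing the dir beneath an in-flight write.
	os.RemoveAll(dir)

	if _, err := newSegmentFile(dir, 1); err != nil {
		fmt.Println("write rejected:", err)
	}
}
```

With this shape, an in-flight write that loses its directory fails with an ordinary error that can be passed back to the client, while the server keeps running.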
Yeah, an error seems sensible here.
@jwilder how are the DB dirs removed beneath writes waiting to acquire a lock? Doesn't the removal require a lock too? Returning an error on the WAL segment sounds OK to me. Nothing in the index though. All those system-level errors should panic the server. I'd rather fail hard than open up the possibility of data corruption, or of things failing while the user thinks the server is actually up and OK.
@pauldix What happens is that in-flight writes validate that the database and shards exist in the A drop database call can execute while some pending goroutines are blocked on L231. The drop DB call acquires the Even with returning an error for the new segment call, we could still get a panic in the later calls. What would the user actually do in this case? If they restart, does the WAL drop corrupted blocks or would it fail to start?
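The race described above is a check-then-act problem: the existence check happens before the write blocks on a lock, so a drop can run in between. One common way to close that window is to re-validate after the lock is acquired. The sketch below shows that pattern only; the `store` type, its fields, and `errDatabaseDropped` are hypothetical and do not reflect the actual tsdb code or the fix adopted for this issue.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDatabaseDropped is a hypothetical sentinel error for this sketch.
var errDatabaseDropped = errors.New("database dropped")

// store is a toy stand-in for the real store: a map of database name
// to written points, guarded by one lock.
type store struct {
	mu  sync.RWMutex
	dbs map[string][]string
}

func newStore() *store { return &store{dbs: make(map[string][]string)} }

func (s *store) createDatabase(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.dbs[name]; !ok {
		s.dbs[name] = nil
	}
}

func (s *store) dropDatabase(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.dbs, name)
}

// write checks that the database exists while holding the lock, so a
// drop that ran while this call was blocked is seen as an error rather
// than as a missing directory later on.
func (s *store) write(db, point string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	pts, ok := s.dbs[db]
	if !ok {
		return errDatabaseDropped
	}
	s.dbs[db] = append(pts, point)
	return nil
}

func main() {
	s := newStore()
	s.createDatabase("db0")
	fmt.Println(s.write("db0", "cpu value=1")) // <nil>
	s.dropDatabase("db0")
	fmt.Println(s.write("db0", "cpu value=2")) // database dropped
}
```

Because the check and the write happen under the same lock as the drop, a queued write either completes before the drop or observes it and returns an error.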
Verified that returning an error from creating the segments prevents the original panic, but now I'm getting a new panic on the writes below.
If a drop database is executed while writes are in flight, a panic could occur because the WAL would fail to write to the DB dirs, which had been removed. Partial fix for #4538
While running a test, I dropped the database that the test was writing to. This resulted in the following panic.