Skip to content

Handle nil memtable gracefully on Close #1034

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

connorgorman
Copy link
Contributor

@connorgorman connorgorman commented Sep 13, 2019

This is related to crashes on Close. I think it's a fair expectation that if you have multiple goroutines reading/writing a call to Close (on a sigterm for example) will not cause a panic. This fixes an issue where the memtable is nil on Close. This will cause Get's to return errors and not panic

This is an example of the panic

goroutine 85845 [running]:
github.com/dgraph-io/badger/skl.(*Skiplist).IncrRef(...)
	skl/skl.go:86
github.com/dgraph-io/badger.(*DB).getMemTables(0xc000491c00, 0x0, 0x0, 0x0, 0x0)
	db.go:484 +0xd7
github.com/dgraph-io/badger.(*DB).get(0xc000491c00, 0xc015c7c3c0, 0x38, 0x38, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	db.go:516 +0x66
github.com/dgraph-io/badger.(*Txn).Get(0xc00dfc5500, 0xc013083da0, 0x30, 0x30, 0x24, 0x30, 0xc017628378)
	txn.go:407 +0xe7

This change is Reviewable

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ A review job has been created and sent to the PullRequest network.


@connorgorman you can click here to see the review status or cancel the code review job.

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Straightforward with a note on reliability of memory management.


Reviewed with ❤️ by PullRequest

@@ -492,6 +492,10 @@ func (db *DB) getMemTables() ([]*skl.Skiplist, func()) {

tables := make([]*skl.Skiplist, len(db.imm)+1)

if db.mt == nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good change and optimization to avoid this issue!

If you haven't already, this is a good opportunity to start tracking memory usage as part of reliability metrics to ensure your application doesn't start overflowing as the codebase becomes more complex and we get into the memory management nitty gritties.

@ashish-goswami
Copy link
Contributor

@connorgorman Thanks for the PR. Can you tell us, how did you end up with above crash? Also, if possible, please create issue for this.

@connorgorman
Copy link
Contributor Author

@ashish-goswami It's actually related to this one: hypermodeinc/dgraph#3873. Basically, that stacktrace shows two panics. It panics once on the getMemTables because it is getting a value, but then there is an iterator panic in a defer function because the DB has been closed

This scenario occurs when you Close the DB, but have a txn in progress (new txns get an error about the DB being closed)

Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @connorgorman, @manishrjain, and @pullrequest[bot])


db.go, line 495 at r1 (raw file):

Previously, pullrequest[bot] wrote…

Good change and optimization to avoid this issue!

If you haven't already, this is a good opportunity to start tracking memory usage as part of reliability metrics to ensure your application doesn't start overflowing as the codebase becomes more complex and we get into the memory management nitty gritties.

I am not very sure about this change. The db.mt shouldn't be nil under normal circumstances. I'm curious to know how would we end up in a situation where db.mt is nil.


db.go, line 535 at r1 (raw file):

	tables, decr := db.getMemTables() // Lock should be released.
	if len(tables) == 0 {
		return y.ValueStruct{}, errors.New("No tables found. DB may be closed")

This wouldn't work. The db.get(...) function looks in the memtables and then in all the levels to find a key. If we return from here, we would not find keys that are stored in any level below the memtables.

@connorgorman
Copy link
Contributor Author

@jarifibrahim If you are trying to gracefully shutdown by calling Close, but have other readers then I believe they are not synchronized and will panic. https://github.com/dgraph-io/badger/blob/master/db.go#L407 is where db.mt is set to nil.

In the case where len(tables) == 0, based on the code in getMemTables, the only time when it would have 0 tables would be when db.mt == nil based on the change I made. Otherwise, there is always at least 1

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to inactivity, PullRequest has cancelled this review job. You can reactivate the code review job from the PullRequest dashboard

@connorgorman
Copy link
Contributor Author

Circling back on this change, I think the correct Close behavior would be to wait for all txns including Views to complete while not allowing any new txns. Thoughts?

@jarifibrahim
Copy link
Contributor

Circling back on this change, I think the correct Close behavior would be to wait for all txns including Views to complete while not allowing any new txns. Thoughts?

I had the same thing in mind. Instead of crashing with a nil pointer exception, it would be nice to complain to the user about the open transactions while closing the DB.

Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: This patch seems to me the simplest fix for the memtable nil crash.

func TestMemtableNil(t *testing.T) {
	db, err := Open(DefaultOptions("/tmp/b"))
	if err != nil {
		panic(err)
	}
	txnSet(t, db, []byte("foo"), []byte("1"), 0)
	go func() {
		txn := db.NewTransaction(true)
		for {
			time.Sleep(time.Second)
			i, err := txn.Get([]byte("foo"))
			if err != nil && err != ErrKeyNotFound {
				log.Fatal(err)
			}
			fmt.Println(i)
		}
	}()
	db.Close()
	time.Sleep(10 * time.Second)
}

This test would crash on master but on this PR it gracefully returns an error.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @connorgorman, and @manishrjain)


db.go, line 495 at r1 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

I am not very sure about this change. The db.mt shouldn't be nil under normal circumstances. I'm curious to know how would we end up in a situation where db.mt is nil.

Thanks for explaining @connorgorman


db.go, line 535 at r1 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

This wouldn't work. The db.get(...) function looks in the memtables and then in all the levels to find a key. If we return from here, we would not find keys that are stored in any level below the memtables.

In case the DB is being closed, it should be okay to return from here.

Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @connorgorman, and @manishrjain)

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not assume that if db.mt is nil, db is being closed. That is just relying upon some behavior which happens to be the case right now, but could change in the future.

@jarifibrahim Please make the changes as discussed. Feel free to merge after.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @connorgorman, and @jarifibrahim)


db.go, line 495 at r1 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

Thanks for explaining @connorgorman

Do not return here.


db.go, line 500 at r1 (raw file):

	// Get mutable memtable.
	tables[0] = db.mt

if db.mt != nil { this logic here }


db.go, line 508 at r1 (raw file):

		tables[i+1] = db.imm[last-i]
		tables[i+1].IncrRef()
	}

Here, filter the tables array for nil tables. Loop over elements, if not nil, keep them.


db.go, line 535 at r1 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

In case the DB is being closed, it should be okay to return from here.

I wouldn't do this here.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @connorgorman, and @jarifibrahim)


db.go, line 508 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Here, filter the tables array for nil tables. Loop over elements, if not nil, keep them.

Also move all IncrRef here.

@connorgorman
Copy link
Contributor Author

@manishrjain Does it seem like it'd be better to just wait for View txns to close? The issue is that closing the DB does not wait for all txns to finish

@manishrjain
Copy link
Contributor

Hey @jarifibrahim , what's the status of this?

@connorgorman
Copy link
Contributor Author

@manishrjain @jarifibrahim After looking at this and testing, it seems like this will not be sufficient for fixing the issue completely. e.g. if you use a merge operator, then the Get will succeed, but then the iterator will panic. It seems like the only way to fix this issue would be to ensure that all txns complete (both view and update) and then closing the DB (while not allowing new txns to begin). I took a stab at this, but it's fairly problematic due to the NewTransaction function being relied upon to always succeed and have a non-nil return value, which makes synchronizing difficult

Do you guys have any thoughts about how we could synchronize txns?

@jarifibrahim
Copy link
Contributor

Hey @connorgorman, I don't think there's any way we can handle all the db closed cases. For instance, even if we managed to handle nil memtables, there will be nil tables, also the value log buffers will be nil. The crash could happen at so many places. There's no way we can handle all the cases easily.

The crash happened in hypermodeinc/dgraph#3873 because we were accessing badger after closing it. A user/system should never do that. I think it's a reasonable expectation from the user to not access the DB after they have closed the database.

@connorgorman I'd like to close this PR since the issue we're trying to fix here is not really an issue with badger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants