Conversation
f6e7481 to
d408c44
Compare
ff7da76 to
06c39b1
Compare
06c39b1 to
6d9b28b
Compare
6d9b28b to
e969700
Compare
pawanrawal
left a comment
There was a problem hiding this comment.
Not an expert of the posting package but have added some questions for you. How come adding hash index takes 4x time when temp data is written to disk whereas term index is faster now?
Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @mangalaman93, @manishrjain, and @martinmr)
posting/index.go, line 512 at r4 (raw file):
indexDir, err := ioutil.TempDir("", "") if err != nil { return err
Should we wrap the error so as to have more context as to where it is coming from when it is logged?
posting/index.go, line 523 at r4 (raw file):
indexDB, err := badger.Open(dbOpts) if err != nil { return err
Wrap error?
posting/index.go, line 566 at r4 (raw file):
// Write deltas into tmp index badger indexTxn := indexDB.NewTransaction(true)
Should we call defer indexTxn.Discard() here?
posting/index.go, line 575 at r4 (raw file):
} if err := indexTxn.SetEntry(entry); err != nil { return nil, err
Does any cleanup need to happen for the txn in case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.
posting/index.go, line 603 at r4 (raw file):
// flatten the LSM tree to optimize reads if err := indexDB.Flatten(5); err != nil {
Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.
posting/index.go, line 641 at r4 (raw file):
// which occurred before this schema mutation. Typically, we use // kv.Version as the timestamp. err := writer.SetAt([]byte(kv.Key), kv.Value, BitCompletePosting, r.startTs)
Isn't the Key already a byte slice?
mangalaman93
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
posting/index.go, line 569 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
tmpTxn.SetEntryis not checked (fromerrcheck)
Done.
posting/index.go, line 580 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
tmpTxn.Commitis not checked (fromerrcheck)
Done.
posting/index.go, line 601 at r2 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
indexDB.Flattenis not checked (fromerrcheck)
Done.
posting/index.go, line 512 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Should we wrap the error so as to have more context as to where it is coming from when it is logged?
Done.
posting/index.go, line 523 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Wrap error?
Done.
posting/index.go, line 566 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Should we call
defer indexTxn.Discard()here?
Done.
posting/index.go, line 575 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Does any cleanup need to happen for the
txnin case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.
Done.
posting/index.go, line 603 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.
Done.
posting/index.go, line 641 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Isn't the Key already a byte slice?
Done.
martinmr
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r2, 1 of 1 files at r5.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot, @mangalaman93, @manishrjain, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 408 at r5 (raw file):
Quoted 6 lines of code…
q1 := `{ q(func: eq(name@en, "Runtime")) { name@en } }`
minor: Maybe retrieve the UID as well.
dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):
res, _, err := queryWithTs(q1, "application/graphql+-", "", 0) require.NoError(t, err) require.JSONEq(t, `{"data":{"q":[{"name@en":"Runtime"},{"name@en":"Runtime"}]}}`, res)
I would also add another triplet, query again, and check that the changes are reflected.
posting/index.go, line 509 at r5 (raw file):
// We write the index in a temporary badger first and then, // merge entries before writing them to p directry.
minor: this probably fits in a single line.
posting/index.go, line 561 at r5 (raw file):
"error in reading posting list from disk")
here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").
posting/index.go, line 589 at r5 (raw file):
Quoted 4 lines of code…
// Reset deltas for k := range txn.cache.deltas { delete(txn.cache.deltas, k) }
If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.
posting/index.go, line 611 at r5 (raw file):
randomly.
minor: change this to "arbitrarily". "randomly" sounds a bit off.
posting/index.go, line 623 at r5 (raw file):
it
maybe rename this to indexIt or something similar so it's clear this iterator is traversing the index DB.
mangalaman93
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
I would also add another triplet, query again, and check that the changes are reflected.
Done.
posting/index.go, line 561 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
"error in reading posting list from disk")here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").
Done.
posting/index.go, line 589 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
// Reset deltas for k := range txn.cache.deltas { delete(txn.cache.deltas, k) }If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.
This is specifically optimized by go compiler to run faster than creating a new map. https://go-review.googlesource.com/c/go/+/110055/
posting/index.go, line 623 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
itmaybe rename this to
indexItor something similar so it's clear this iterator is traversing the index DB.
Done.
d9c411b to
46a317f
Compare
martinmr
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, and @pawanrawal)
pawanrawal
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @golangcibot and @manishrjain)
1a4ba01 to
4757141
Compare
manishrjain
left a comment
There was a problem hiding this comment.
Needs some work. Once done, let's review again.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @golangcibot and @mangalaman93)
dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):
package alpha
Add license
dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):
m1 := `{ set { <10000> <name> "Vidar Flataukan"@en .
Might want to use 1million dataset. Looks like a systest.
posting/index.go, line 512 at r7 (raw file):
// We write the index in a temporary badger first and then, // merge entries before writing them to p directory. indexDir, err := ioutil.TempDir("", "")
This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.
You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.
posting/index.go, line 522 at r7 (raw file):
WithSyncWrites(false). WithNumVersionsToKeep(math.MaxInt64). WithCompression(options.None).
Might want to consider some compression.
posting/index.go, line 538 at r7 (raw file):
// localized posting list cache, to avoid stressing or mixing up with the // global lcache (the LRU cache). txn := NewTxn(r.startTs)
Might not need to create this global txn.
posting/index.go, line 563 at r7 (raw file):
} if err := r.fn(pk.Uid, l, txn); err != nil {
Above this line, txn := NewTxn(r.startTs)?
posting/index.go, line 588 at r7 (raw file):
} stream.Send = func(kvList *bpb.KVList) error { for _, kv := range kvList.Kv {
Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.
Then you can use TxnWriter here. Then, you don't need to worry about overwrites.
posting/index.go, line 617 at r7 (raw file):
// Flatten the LSM tree to optimize reads, we chose to create 5 workers. // Flatten should not take too long compared to time taken in building the index. if err := indexDB.Flatten(5); err != nil {
This would be lot more expensive than just doing reads. Avoid this.
posting/index.go, line 623 at r7 (raw file):
// Now we write all the created posting lists to disk. glog.V(1).Infof("Rebuilding index for predicate %s: writing index to badger", r.attr) indexTxn := indexDB.NewTransaction(false)
Use stream framework.
posting/index.go, line 636 at r7 (raw file):
}() writer := NewTxnWriter(pstore) for indexIt.Rewind(); indexIt.Valid(); {
itr.
posting/index.go, line 658 at r7 (raw file):
// which occurred before this schema mutation. Typically, we use // kv.Version as the timestamp. err := writer.SetAt(kv.Key, kv.Value, BitCompletePosting, r.startTs)
if err := ; err != nil If you can.
032247a to
2a193cd
Compare
mangalaman93
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 2 files reviewed, 14 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add license
Done.
dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might want to use 1million dataset. Looks like a systest.
Done.
posting/index.go, line 512 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.
You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.
Done.
posting/index.go, line 522 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might want to consider some compression.
Done.
posting/index.go, line 538 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might not need to create this global txn.
Done.
posting/index.go, line 563 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Above this line, txn := NewTxn(r.startTs)?
Done.
posting/index.go, line 588 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.
Then you can use TxnWriter here. Then, you don't need to worry about overwrites.
Done.
posting/index.go, line 617 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This would be lot more expensive than just doing reads. Avoid this.
Done.
posting/index.go, line 623 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Use stream framework.
Done.
posting/index.go, line 636 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
itr.
Done.
posting/index.go, line 658 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if err := ; err != nilIf you can.
Done.
mangalaman93
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @golangcibot, @mangalaman93, and @manishrjain)
posting/index.go, line 534 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add a TODO here.
Done.
posting/index.go, line 620 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
TODO here to replace this with Batch Writer. Also, check if batch writer has the throttle mechanism as we have in txn writer.
Also, check where else we can use batch writer.
Done.
3130741 to
200f77c
Compare
Co-Authored-By: Bot from GolangCI <42910462+golangcibot@users.noreply.github.com>
While re-indexing a predicate, we now write the temp data on disk instead of keeping it in memory. This allows us to recreate index for large datasets.
On 21 million dataset, we get following results:
On 18 GB p directory, we get following results:
Todo
* Use Batch Writer for writing data to temp badger
* Use batching to write to temp badger
* Use managed badger and incremental version
* With and without compressions (without compression is faster)
* With and without Flatten (without flatten is faster)
* On 15 GB p directory
* On 21 million dataset
This change is