fix: ensure that during resumption of a scan, rows that have not been observed by the caller are re-requested #1444
Conversation
createReadStream() creates a pipeline of streams that converts a stream of row chunks into a stream of logical rows. It also has logic to handle stream resumption when a single attempt fails. The pipeline can be split into 2 parts: the persistent operation stream that the caller sees and the transient per-attempt segment. When a retry attempt occurs, the per-attempt segment is unpiped from the operation stream and is discarded. Currently this includes any buffered data that each stream might contain. Unfortunately, when constructing the retry request, createReadStream() will use the last row key from the last buffered row. This will cause the buffered rows to be omitted from the operation stream.

This PR fixes the missing rows by only referencing the row keys that were seen by the persistent operation stream when constructing a retry attempt. In other words, this ensures that we only update the lastSeenRow key once the row has been "committed" to the persistent portion of the pipeline.
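For illustration, here is a minimal sketch of that idea (assumed names such as userStream, lastSeenRowKey, and buildRetryRequest; this is not the library's actual implementation): the resume key is advanced only when a row reaches the persistent, caller-visible stream, so rows that were merely buffered in the discarded per-attempt segment are re-requested on resumption.

```ts
import {Transform} from 'stream';

// Resume point: only advanced for rows that the caller-visible stream has seen.
let lastSeenRowKey: string | undefined;

// Persistent stream the caller reads from; a row is "committed" once it passes
// through here, because it can no longer be lost with a discarded attempt.
const userStream = new Transform({
  objectMode: true,
  transform(row: {key: string}, _encoding, callback) {
    lastSeenRowKey = row.key;
    callback(null, row);
  },
});

// On a retryable error, the transient per-attempt segment would be unpiped from
// userStream and discarded, and the next attempt would be built like this:
function buildRetryRequest() {
  return lastSeenRowKey
    ? {rows: {rowRanges: [{startKeyOpen: lastSeenRowKey}]}}
    : {rows: {rowRanges: [{}]}};
}
```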
test/readrows.ts
@@ -317,6 +318,42 @@ describe('Bigtable/ReadRows', () => {
    });
  });

  it('should return row data in the right order', done => {
    // 1000 rows must be enough to reproduce issues with losing the data and to create backpressure
Is this comment about backpressure still true given that the highWaterMark is set to 0? If not, please remove it; otherwise, feel free to resolve.
It's still true, because we want to create backpressure in the chunk transformer and the other streams in order to reproduce the issue that occurred before the fix, when those buffered transforms were thrown away. Note that this fix only applies a highWaterMark of 0 to the user stream.
However, this comment still does need an adjustment from 1000 to 150 :)
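A rough sketch of that scenario (names like chunkTransformer and userStream are assumptions, and this is not the test's actual code): only the final caller-facing stream gets highWaterMark: 0, so a slow consumer still causes rows to accumulate in the intermediate transforms, which is exactly the buffered state the test needs to reproduce.

```ts
import {PassThrough, Readable} from 'stream';

// Intermediate stage stands in for the chunk transformer: default highWaterMark,
// so it buffers rows internally.
const chunkTransformer = new PassThrough({objectMode: true});

// Caller-facing stream gets highWaterMark: 0, so it holds no buffered rows itself.
const userStream = new PassThrough({objectMode: true, highWaterMark: 0});

// 150 rows is enough to fill the intermediate buffers once the consumer is slow.
const source = Readable.from(
  Array.from({length: 150}, (_, i) => ({key: `row-${String(i).padStart(3, '0')}`}))
);

source.pipe(chunkTransformer).pipe(userStream);

const received: string[] = [];
async function consumeSlowly() {
  for await (const row of userStream) {
    received.push((row as {key: string}).key);
    // Yield between rows to simulate a slow caller; rows back up in the
    // chunk transformer's buffer, which is the state a retry used to discard.
    await new Promise<void>(resolve => setImmediate(() => resolve()));
  }
  console.log(`received ${received.length} rows in order`); // expect all 150
}
void consumeSlowly();
```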
}
let keyToRequestClosed: any;
if (
  stream?.request?.rows?.rowRanges &&
This file is pretty difficult to read... I looked at the file and there are 33 if statements! All this branching makes it difficult to follow the logic.
I would suggest we refactor this code to take a page from OOP. Can we group if statements into classes, or at least their own separate pieces of functionality, so that instead we call functionality we know will occur? From a cursory glance (I'm not really sure what the code does in this library) it looks like we're concerned with startKey and endKey. Maybe we could group startKey functions and endKey functions into their own classes or (at least) pieces of logic, and then always call those pieces of logic. Those functions should appropriately handle edge/non-validated cases; that way we can easily read what happens to an input and what we get as an output.
TL;DR: if statements are hard to follow. If we can somehow group/parcel out functionality to reduce the overall number of if statements, that would be ideal.
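Purely as an illustration of that suggestion (hypothetical names such as RowRange, isAfterStart, and isBeforeEnd; not code from this PR): the open/closed/unbounded boundary branching can live in two small helpers that the caller always invokes, so the main flow reads straight through.

```ts
// Hypothetical shape of a single range from the request.
interface RowRange {
  startKeyOpen?: string;
  startKeyClosed?: string;
  endKeyOpen?: string;
  endKeyClosed?: string;
}

// Each helper absorbs the open/closed/unbounded cases in one place.
function isAfterStart(key: string, range: RowRange): boolean {
  if (range.startKeyClosed !== undefined) return key >= range.startKeyClosed;
  if (range.startKeyOpen !== undefined) return key > range.startKeyOpen;
  return true; // no lower bound
}

function isBeforeEnd(key: string, range: RowRange): boolean {
  if (range.endKeyClosed !== undefined) return key <= range.endKeyClosed;
  if (range.endKeyOpen !== undefined) return key < range.endKeyOpen;
  return true; // no upper bound
}

// Callers no longer branch on which variant of the boundary was supplied.
function rowInRange(key: string, range: RowRange): boolean {
  return isAfterStart(key, range) && isBeforeEnd(key, range);
}

// Example: rowInRange('b', {startKeyClosed: 'a', endKeyOpen: 'c'}) === true
```

The same pattern could be applied to the request-construction branches in the mock server, shrinking the overall if-statement count without changing behaviour.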
These are all good ideas. The server code was taken from https://github.com/googleapis/nodejs-bigtable/blob/main/test/utils/readRowsImpl.ts and adjusted in a hurry to mock the correct server behaviour. I added some TODOs because I think that cleanup will take me some time.
Can we defer cleaning up the test code until after we land the fix?
I think this entire generator can be simplified significantly, but I really don't want us to block a fix for a data-loss bug on test code hygiene. Daniel already added a TODO comment to simplify this.
Use it to replace any
Approval. Please create an issue for the cleanup concerns that @sofisl and I have.
Took #1440:
createReadStream() creates a pipeline of streams that converts a stream of row chunks into a stream of logical rows. It also has logic to handle stream resumption when a single attempt fails. The pipeline can be split into 2 parts: the persistent operation stream that the caller sees and the transient per attempt segment. When a retry attempt occurs, the per attempt segment is unpiped from the operation stream and is discarded. Currently this includes any buffered data that each stream might contain. Unfortunately, when constructing the retry request, createReadStream() will use the last row key from the last buffered row. This will cause the buffered rows to be omitted from the operation stream.
This PR fixes the missing rows part by only referencing the row keys that were seen by the persistent operation stream when constructing a retry attempt. In other words, this will ensure that we only update the lastSeenRow key once the row has been "committed" to the persistent portion of the pipeline
If my understanding is correct, this should be sufficient to fix the correctness issue. However, the performance issue of re-requesting the dropped buffered data remains; this should be addressed separately.
Plus:
Added a test that guarantees all 150 rows are sent back and in the right order.
Modified tests to make the mocks schedule the last emitted event later.
Fixed tests to work with Node v14.
Removed watermarks.