Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Remove custom readrows retry logic and rely on gax for retries #1422

Conversation

danieljbruce
Copy link
Contributor

@danieljbruce danieljbruce commented May 28, 2024

Summary:

For readrows calls, this PR removes the custom retry logic in createreadstream and sets this function up so that it relies on google-gax for retries instead

Changes:

In src/index.ts, the data client is told to opt into the new streaming retry behaviour and retryRequestOptions are not passed along in every streaming call.

In src/table.ts, the custom retry behaviour is removed for readrows calls and the retry logic controlling resumption and whether a retry should happen or not is moved to ReadRowsResumptionStrategy.ts. Changes are made to the other streaming call types like mutateRows to work with new changes in src/table.ts.

In src/utils/read-rows-resumption.ts, code previously from table.ts is contained that establishes retry behaviour. It encapsulates the retry behaviour in a more modular format.

In system-test/read-rows.ts, the tests a rewritten to still test the same retry behaviour but use the mock server to do so.

In system-test/testTypes.ts, interfaces are added to allow for stronger type enforcement in the test files.

In src/utils/retry-options.ts there are some constants that will be used for all streaming retry calls.

In system-test/data/read-rows-retry-test.json we add tests that used to previously live in test/table.ts.

In test/table.ts we move some of the tests over to system-test/data/read-rows-retry-test.json and add new tests that ensures the right data reaches the gapic layer.

In test/util/gapic-layer-tester.ts we define a new class that is useful for testing data that reaches read-rows.ts.

In test/utils/read-rows-resumption.ts we unit test the ReadRowsResumptionStrategy object

Future work:

This PR doesn't intend to remove custom retry logic or change retry behaviour for any of the streaming calls other than readrows. In future PRs, changes will be made so that mutateRows and sampleRowKeys calls rely on google-gax for retries.

- Transform the rowsLimit parameter into an integer
- Change the hook into a before hook so that we don’t attempt to create multiple mock servers
- Create a guard so that the stream only writes if there are row keys to write
define new interfaces too
…into move-retries-from-createreadstream

# Conflicts:
#	src/table.ts
#	src/utils/table.ts
readRowsReqOpts should have an ECMAscript prefix to completely hide it from the user. Also remove a useless  Filter.parse.
Copy link
Contributor

@leahecole leahecole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM when @daniel-sanche is good with it!

@danieljbruce danieljbruce added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 11, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 11, 2024
Copy link

@daniel-sanche daniel-sanche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a couple new thoughts, now that I understand the implementation a bit more.

My main concern at this point is how these retry options are exposed to the end-user, and how we test different user inputs

* {tableName: 'projects/my-project/instances/my-instance/tables/my-table'}
* )
* gaxOpts.retry = strategy.toRetryOptions(gaxOpts);
* ```
Copy link

@daniel-sanche daniel-sanche Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is meant to be internal, right? (How is that usually communicated in node?)

Either way, I don't think this example makes a lot of sense. Is options meant to be the same variable as gaxOpts? How is it usually created in the first place? It's not really clear to me how those are meant to interact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is internal so I just removed the example.

: arrify(RETRYABLE_STATUS_CODES);
if (
error.code &&
(retryCodesUsed.includes(error.code) || isRstStreamError(error))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that there is no way to disable the check for isRstStreamError? (All the other retryableErrors can be overridden)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what to do about this. Currently isRstStreamError checks if the error code is 13 and the message has RST stream in here because that is what the client library did before. If the user provides 13 in a custom set of codes then should it always retry? If they don't provide 13 in a custom set of codes then should it never retry on code 13 or should it only retry on RST stream errors? If we allow the user to override canResume then they will have control over this behaviour.

Longer term, I think this retry logic should be generated code that is the same across client library languages at some point.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not really sure what a good solution is either, just wanted to make sure you were aware of the situation, and maybe discussed it with stakeholders.

It still seems strange to me to retry based on the content of the error, instead of just the error code itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit specific. Please look at the YAQs question I sent you.

[],
gaxOpts?.retry?.backoffSettings || DEFAULT_BACKOFF_SETTINGS,
gaxOpts?.retry?.shouldRetryFn || canResume,
gaxOpts?.retry?.getResumptionRequestFn || getResumeRequest
Copy link

@daniel-sanche daniel-sanche Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we allow end users to completely override some internal callbacks (canResume, getResumeRequest). Is there a good reason for this? It seems to complicate things immensely, and I can't really think of a good reason to have customers provide custom resumption logic


If we are going to expose these, are these well covered by tests? Can we be sure that customizing these doesn't break things in unexpected ways? (It looks like the custom retryable errors weren't covered by tests, and replacing internal functions seems to open up even more room for issues)

Copy link
Contributor Author

@danieljbruce danieljbruce Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. If shouldRetryFn is provided then it will override canResume and if getResumptionRequestFn is provided then it will override getResumeRequest. The pros of this is that it gives the user complete control of how they want retries to work. The cons are the cons that you mentioned. should use a different set of retry codes is now a test case that covers users providing custom retry codes and overriding canResume and getResumeRequest aren't explicitly tested. Let's decide if the overrides are features we will support. Do you think we should not allow the user to override canResume and getResumeRequest? I wonder what @leahecole thinks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would recommend making them non-customizable (i.e. raising an exception if the user sets these themselves). It would be easy enough to make them customizable in the future if there's demand, but once it's part of the public surface, we have to maintain it going forward. And I can imagine a bunch of weird edge cases coming up, that I'm not confident are covered by tests

But I'll leave that up to you and the product team

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good point. We can always add customizability later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, customizability can always happen later. I think that client library teams should be able to customize them (which is what we are doing with this PR) but giving the user too much control here could lead to a lot of issues and definitely could be hard to maintain. Good point, @daniel-sanche

Copy link

@daniel-sanche daniel-sanche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

I still have a couple concerns about allowing custom callbacks for stream resumption logic you might want to re-consider. But I think that's mostly out of scope for my review

@danieljbruce danieljbruce added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 24, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 24, 2024
@danieljbruce danieljbruce merged commit 3e0a46e into googleapis:main Jun 24, 2024
15 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigtable Issues related to the googleapis/nodejs-bigtable API. size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants