bigtable: retry partially failed reads and writes #1595

Closed
arbesfeld opened this issue Sep 14, 2016 · 22 comments

@arbesfeld
Contributor

My calls to bigtable getRows() fail intermittently, so I have had to wrap all of these methods in retry blocks. I was wondering:

  1. Is this flakiness expected?
  2. Have you considered adding a retry mechanism to these RPCs?

Thanks!

@stephenplusplus
Contributor

stephenplusplus commented Sep 14, 2016

We do use an exponential backoff retry strategy before calling it a failure and returning an error. What errors are you getting? How often are you calling the API, and are you waiting for a response before calling again?

You can pass a "maxRetries" number in the Bigtable constructor. The default is 2.

bigtable({ projectId: '...', maxRetries: 5 })
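
For reference, that option in context looks roughly like this (a minimal sketch, assuming the @google-cloud/bigtable entry point; the projectId value is illustrative):

// Minimal sketch: pass maxRetries when constructing the client.
// The require path and projectId value are assumptions for illustration.
var bigtable = require('@google-cloud/bigtable');
var client = bigtable({
  projectId: 'my-project',
  maxRetries: 5 // default is 2
});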

@stephenplusplus stephenplusplus added the api: bigtable Issues related to the Bigtable API. label Sep 14, 2016
@arbesfeld
Contributor Author

Thanks, I'll try out maxRetries.

Here is the stacktrace:

Error: Secure read failed
  File "/app/packages/@apphub:logrocket-server-storage-bigtable/node_modules/grpc/src/node/src/client.js", line 189, in ClientReadableStream._emitStatusIfDone
    var error = new Error(status.details);
  File "/app/packages/@apphub:logrocket-server-storage-bigtable/node_modules/grpc/src/node/src/client.js", line 158, in ClientReadableStream._readsDone
    this._emitStatusIfDone();
  File "/app/packages/@apphub:logrocket-server-storage-bigtable/node_modules/grpc/src/node/src/client.js", line 229, in readCallback
    self._readsDone();

@stephenplusplus
Contributor

Interesting. I'm not sure what that error is. We only retry after certain error types, and I'm not sure this is one we would retry. @lesv @murgatroid99 have you heard of this one?

@stephenplusplus
Contributor

I found @murgatroid99's comment on this issue, which says that this is a 503, so we do in fact retry on this error. maxRetries should work in this case in place of writing your own retry logic.

@stephenplusplus stephenplusplus added the type: question Request for information or clarification. Not an issue. label Sep 15, 2016
@arbesfeld
Contributor Author

Hi, I'm still seeing this error message come up with both maxRetries: 6 and a custom retry wrapper around the getRows method.

Is there any more data that I could collect which would help identify the problem?

@stephenplusplus
Contributor

Can you either show code or estimate how many requests you're making at once? Since this is a 503, it's either an issue of too many requests at once (the server needs a break) or the upstream API actually being broken in some way.

@arbesfeld
Contributor Author

Absolutely, here is a snippet of the relevant code:

return new Promise((resolve, reject) => {
  this.table.getRows({
    decode: false,
    start: 'foo|',
    end: 'foo||',
    filter: [{
      column: {
        cellLimit: 1,
      },
    }],
  })
    .on('error', reject)
    .on('data', row => {
      processRow(row);
    })
    .on('end', () => {
      resolve();
    });
});

There could be potentially ~1000 rows in any given key range.

We only have one server node making these requests at any given moment of time.

Perhaps there is a better way to handle error here instead of immediately rejecting?
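
For reference, the retry blocks mentioned earlier look roughly like this (a sketch; the helper name, attempt limit, and delays are illustrative, and it re-reads the entire range on each failure):

// Sketch of an application-level retry with exponential backoff.
// readOnce is assumed to be a function returning the Promise-based read
// shown above; the attempt limit and delays are illustrative.
function readWithRetries(readOnce, maxAttempts) {
  return new Promise((resolve, reject) => {
    function attempt(n) {
      readOnce().then(resolve, err => {
        if (n + 1 >= maxAttempts) {
          return reject(err);
        }
        const delayMs = Math.pow(2, n) * 1000; // 1s, 2s, 4s, ...
        setTimeout(() => attempt(n + 1), delayMs);
      });
    }
    attempt(0);
  });
}

// Usage (hypothetical): readWithRetries(() => readRows(), 5).then(onDone);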

@stephenplusplus
Contributor

@callmehiphop when you have a chance, would you mind trying to recreate this scenario?

@mbrukman
Contributor

FWIW, this behavior (partial failure in batch operations) is expected from the Bigtable perspective. A bulk read or write operation can affect many rows, and some of the reads or writes may succeed while others fail, because different parts of the bulk request may go to different backing Bigtable servers, some of which may be busy, unavailable, or may simply time out. Bigtable does not provide atomicity guarantees across multiple rows, so any single operation within the batch can succeed or fail independently of the others.

However, these are typically not permanent errors, so they should be retried. As an optimization, rather than retrying the entire batch request, the client library needs to iterate over the response statuses and retry only the entries that were marked as failed or timed out. This is precisely what we do in the other Bigtable client libraries.

The upside is that even with the occasional retries, the overall performance is much higher than with a single read or write operation per API call.
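
To illustrate the idea (purely a conceptual sketch; bulkWrite and the per-entry status shape are assumptions mirroring the MutateRows response, not this library's actual API):

// Conceptual sketch of entry-level retry for a bulk write.
// bulkWrite(entries) is a hypothetical function that resolves to an array of
// per-entry statuses ({ code: 0 } meaning OK), mirroring the MutateRows
// response; it is not this library's actual API.
function writeWithEntryRetries(bulkWrite, entries, maxAttempts) {
  function attempt(pending, n) {
    return bulkWrite(pending).then(statuses => {
      // Keep only the entries whose status reported a failure.
      const failed = pending.filter((entry, i) => statuses[i].code !== 0);
      if (failed.length === 0) {
        return; // every entry eventually succeeded
      }
      if (n + 1 >= maxAttempts) {
        throw new Error(failed.length + ' entries still failing after retries');
      }
      return attempt(failed, n + 1); // retry only the failed entries
    });
  }
  return attempt(entries, 0);
}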

/cc: @garye, @sduskis

@arbesfeld
Contributor Author

@mbrukman @stephenplusplus given this, what is the recommended approach here? Is the user responsible for handling this retry logic?

@sduskis
Contributor

sduskis commented Dec 11, 2016

The Java and Go clients both have automated retries. Retries are nuanced for long-running scans and bulk writes.

@arbesfeld
Contributor Author

arbesfeld commented Dec 11, 2016

@sduskis I see, so for now we might have to include retry logic in our calls with this Node.js library? Is this expected for both bulk read calls and streaming reads?

@stephenplusplus
Contributor

You are free to implement this in your application, but it's something we will eventually support in this library.

@stephenplusplus stephenplusplus added enhancement and removed type: question Request for information or clarification. Not an issue. labels Dec 12, 2016
@stephenplusplus stephenplusplus changed the title bigtable retry on getRows bigtable: retry partially failed reads and writes Dec 12, 2016
@arbesfeld
Contributor Author

How does this work for a streaming application? Should we restart the stream at the failed point?

@garye

garye commented Dec 13, 2016

@arbesfeld @stephenplusplus Yes, for streaming reads it's best to restart the stream after the last successfully received row. For multi-row mutations that call mutate_rows under the hood, only mutations that received an error should be retried.

As @stephenplusplus said, "smart" retries should definitely be handled in the library (should I create a separate issue to track that?). To make that effort a bit easier for node and other languages I'm putting together a little server that can be used to validate client retry behavior. I still need to push that out to a public place but, in the meantime, you can look at the test script to get some idea of what it will be testing:

https://gist.github.com/garye/e7f4fa9694dd5b04580aa7cdd6adf16f

You can also consult the java or go client retry logic, such as:
https://github.com/GoogleCloudPlatform/google-cloud-go/blob/master/bigtable/bigtable.go#L149
https://github.com/GoogleCloudPlatform/google-cloud-go/blob/master/bigtable/bigtable.go#L556

@arbesfeld
Contributor Author

We are having a bit of difficulty implementing this at the application level, since it seems like we are just getting thrown a generic error, so we end up having to retry the entire read.

@stephenplusplus happy to make a contribution here if it makes sense, though I could use a bit of direction as to where to start looking.

@arbesfeld
Contributor Author

Alternatively, some recommendation for how to handle this at the application level would also be greatly appreciated.

We are currently doing something like this:

return new Promise((resolve, reject) => {
  eventsTable
    .createReadStream({
      decode: true,
      start: 'foo',
      end: 'bar',
    })
    .on('data', function handleRow(row) {
    })
    .on('error', reject)
    .on('end', () => {
      resolve();
    });
});

Would it work to just wrap this in a try/catch and then restart from the last-seen row? It's hard to reproduce the Bigtable failure, so we have no idea if our approach is working.
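
One way that could look (a sketch, assuming createReadStream's start option is inclusive and row.id holds the row key, so processRow must tolerate seeing the boundary row twice; the helper name and attempt limit are illustrative):

// Sketch of resuming a scan from the last row received.
// Assumes `start` is inclusive and row.id is the row key, so the row at the
// resume boundary may be delivered twice; processRow must handle that.
function scanWithResume(table, options, processRow, maxAttempts) {
  return new Promise((resolve, reject) => {
    let lastKey = null;
    let attempts = 0;

    function start() {
      const opts = Object.assign({}, options);
      if (lastKey) {
        opts.start = lastKey; // resume at (and possibly re-read) the last row
      }
      table.createReadStream(opts)
        .on('data', row => {
          lastKey = row.id;
          processRow(row);
        })
        .on('error', err => {
          if (++attempts > maxAttempts) {
            return reject(err);
          }
          start(); // retry from where we left off
        })
        .on('end', resolve);
    }

    start();
  });
}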

@arbesfeld
Contributor Author

Hi @callmehiphop any updates on this issue? I would be happy to submit a PR if you wouldn't mind pointing me to where I should address the issue.

@arbesfeld
Contributor Author

At the very least, we'd like to be able to handle this at the application level.

@callmehiphop
Contributor

@arbesfeld sorry, we've been pretty busy with other items, but I'm going to try and get on this within the next week or so.

@arbesfeld
Contributor Author

@callmehiphop sorry to keep bugging you. I'd be happy to take a look if you could give me a bit of direction on the implementation :-)

@jmuk jmuk added priority: p2 Moderately-important priority. Fix may not be included in next release. Status: Acknowledged labels Mar 7, 2017
@stephenplusplus stephenplusplus added this to Not Started in State Jun 12, 2017
@lukesneeringer
Contributor

This issue was moved to googleapis/nodejs-bigtable#7
