
UNAVAILABLE (503) error was not retried on MutateRows #115

Closed
kevinsi4508 opened this issue Aug 27, 2020 · 5 comments · Fixed by #123
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments


kevinsi4508 commented Aug 27, 2020

Environment details

SDK Version: Apache Beam Python 3.7 SDK 2.22.0

Steps to reproduce

  1. Run a Dataflow job.
  2. After an UNAVAILABLE (503) error, the throughput of the pipeline completely stopped.
  3. The UNAVAILABLE (503) error was not retried in the client code.

Stack trace

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 554, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigtableio.py", line 106, in process
    self.batcher.mutate(row)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigtable/batcher.py", line 104, in mutate
    self.flush()
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigtable/batcher.py", line 140, in flush
    self.table.mutate_rows(self.rows)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigtable/table.py", line 583, in mutate_rows
    return retryable_mutate_rows(retry=retry)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigtable/table.py", line 760, in __call__
    mutate_rows()
  File "/usr/local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigtable/table.py", line 820, in _do_mutate_retryable_rows
    mutate_rows_request, retry=None
  File "/usr/local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 146, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 Connection reset by peer

As @igorbernstein2 mentioned, MutateRows & ReadRows should ignore service_config and hardcode the retriable errors in the manual layer.
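A minimal sketch of what hardcoding the retriable errors in the manual layer could look like, using illustrative stand-ins (`GrpcError` and the numeric status codes below) rather than the client's actual exception types: the predicate is keyed on a fixed set of gRPC codes and ignores whatever the service config supplies.

```python
# Sketch: retry predicate hardcoded in the manual layer, ignoring service
# config. GrpcError is an illustrative stand-in for the real
# google.api_core exception types.

# Numeric gRPC status codes (per the gRPC specification).
DEADLINE_EXCEEDED = 4
ABORTED = 10
UNAVAILABLE = 14

# Hardcoded set of codes the manual layer should always retry,
# regardless of what the service config JSON says.
RETRYABLE_CODES = frozenset({DEADLINE_EXCEEDED, ABORTED, UNAVAILABLE})


class GrpcError(Exception):
    """Stand-in for a gRPC-mapped exception carrying a status code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def is_retryable(exc):
    """Predicate for the retry wrapper: True only for transient gRPC errors."""
    return isinstance(exc, GrpcError) and exc.code in RETRYABLE_CODES
```

Such a predicate would then be wired into the retry wrapper unconditionally, so a missing or changed retry section in the service config JSON cannot silently disable retries for MutateRows.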

b/165938205

@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/python-bigtable API. label Aug 27, 2020
@kevinsi4508 kevinsi4508 changed the title UNAVAILABLE (503) error was no retried on MutateRows UNAVAILABLE (503) error was not retried on MutateRows Aug 27, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Aug 27, 2020
@kolea2 kolea2 added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 27, 2020
@kolea2 kolea2 removed the triage me I really want to be triaged. label Aug 27, 2020
Contributor

crwilcox commented Sep 2, 2020

Is this a reliable failure?

@igorbernstein2
Contributor

I think this should be a retriable error. I think the problem is that when we migrated to the service config JSON files, we removed the retry configs. The reasoning was that service configs will eventually get pulled into gRPC, and gRPC will never be able to handle smart retries for bulk mutations and read-row stream resumption. Unless I'm misunderstanding the code, it seems like this client still relies on the service config file for retry settings:

    if "mutate_rows" not in inner_api_calls:
        default_retry = (data_client._method_configs["MutateRows"].retry,)
        if self.timeout is None:
            default_timeout = data_client._method_configs["MutateRows"].timeout

Contributor

crwilcox commented Sep 2, 2020

I have started looking into this, and it seems we do set retries to None on gRPC, instead managing retries in the client code. That said, I am digging into an area where I think we may not know how to retry if we experience a failure at the highest level of the mutate retry (before we start streaming).

If this line fails, it won't retry and we aren't going to record any retryable rows.

responses = data_client._inner_api_calls["mutate_rows"](

I think, from the stack trace, that this is where it is failing (the line numbers don't quite line up with current versions).
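To illustrate the gap, here is a minimal sketch with stand-in names (`retry_target` mimics the shape of `google.api_core.retry.retry_target` but is a simplified substitute): a failure raised before any responses stream back is only recovered if the whole `mutate_rows` target is invoked inside the retry loop, since at that point there are no per-row statuses to mark as retryable.

```python
# Sketch: retrying a failure that happens before the response stream starts.
# All names here are simplified stand-ins, not the client's actual internals.
import time

UNAVAILABLE = 14  # gRPC status code for UNAVAILABLE


class TransientError(Exception):
    def __init__(self, code):
        super().__init__(f"status code {code}")
        self.code = code


def retry_target(target, is_retryable, max_attempts=3, delay=0.0):
    """Minimal retry loop: re-invokes target while is_retryable(exc) holds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return target()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            time.sleep(delay)


# Simulate a MutateRows call that fails with UNAVAILABLE before any
# response rows stream back, then succeeds on the second attempt.
attempts = {"n": 0}


def mutate_rows():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TransientError(UNAVAILABLE)  # fails before streaming starts
    return [0] * 3  # per-row OK statuses


statuses = retry_target(
    mutate_rows,
    lambda e: isinstance(e, TransientError) and e.code == UNAVAILABLE,
)
```

If instead the retry wrapper only inspects per-row statuses after a successful call, the first attempt's exception escapes unretried, which matches the stack trace above.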

Contributor

crwilcox commented Sep 3, 2020

I have made a fix locally and am attempting to reproduce this by mutating 1000 rows in a loop. If you have thoughts on how to trigger this failure, do let me know; if not, I can just let this run for a while :)

    import datetime

    ROW_COUNT = 1000

    # Create rows
    print(f"Seeding Rows. Count:{ROW_COUNT} T:{datetime.datetime.utcnow()}")
    for i in range(ROW_COUNT):
        row_key = f"row_key_{i}".encode()
        row1 = self._table.row(row_key)
        row1.set_cell(COLUMN_FAMILY_ID1, COL_NAME1, CELL_VAL1)
        row1.commit()

    loop_count = 0

    while True:
        loop_count += 1
        print(f"Mutate Loop {loop_count} T:{datetime.datetime.utcnow()}")

        cell_val = CELL_VAL1 + f"{loop_count}".encode()

        # Change the contents
        rows = []
        for i in range(ROW_COUNT):
            row_key = f"row_key_{i}".encode()
            row = self._table.row(row_key)
            row.set_cell(COLUMN_FAMILY_ID1, COL_NAME1, cell_val)
            rows.append(row)

        statuses = self._table.mutate_rows(rows)
        result = [status.code for status in statuses]

        expected_result = [0] * ROW_COUNT
        self.assertEqual(result, expected_result)


Contributor

crwilcox commented Sep 3, 2020

Well, it took ~40k iterations, but I think it failed :)

So this returns the same retryable status that a bad response row would.
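A sketch of that behavior, under assumed stand-in names (`do_mutate_retryable_rows`, a local `ServiceUnavailable`): the top-level transient failure is converted into per-row retryable statuses, the same shape a bad response row would produce, so the outer row-retry loop re-sends the whole batch instead of crashing.

```python
# Sketch of the fix described above. Names are illustrative stand-ins,
# not the client's actual internals.

OK = 0
UNAVAILABLE = 14  # gRPC status code


class ServiceUnavailable(Exception):
    """Stand-in for google.api_core.exceptions.ServiceUnavailable."""


def do_mutate_retryable_rows(rows, rpc):
    try:
        return rpc(rows)  # per-row status codes on success
    except ServiceUnavailable:
        # Before the fix this exception escaped unretried. Now every
        # pending row gets the same retryable status a bad response
        # row would, so the outer retry loop picks all of them up.
        return [UNAVAILABLE] * len(rows)


# Demo: the first call fails at the top level, the second succeeds.
calls = {"n": 0}


def rpc(rows):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ServiceUnavailable()
    return [OK] * len(rows)


rows = ["r1", "r2"]
first = do_mutate_retryable_rows(rows, rpc)   # every row marked retryable
second = do_mutate_retryable_rows(rows, rpc)  # every row succeeded
```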
