spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

thiagotnunes · 2020-07-29T02:24:03Z

This bug is related to the Spanner client library.

For long lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with an error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction either with the received resume token obtained on reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.

We have already implemented such change in the Java client library, for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

ava12 · 2020-08-04T23:11:04Z

Not sure, but looks like this case is already handled and transaction is retried. Is there an easy way to check it? Maybe some server emulator that can be instructed to close connection in a few seconds?

ava12 · 2020-08-05T21:59:49Z

I've tried PDML updates on a table with one milloin rows using custom port forwarder that breaks connections before update operation is complete. In this case gRPC library throws Google\Cloud\Core\Exception\ServiceException with message "Socket closed", code 14, status "UNAVAILABLE". This exception is caught by Google\Cloud\Spanner\Result, and operation is retried using obtained resume token. To me it looks like everything works as needed. But I cannot test it in real conditions.

thiagotnunes · 2020-08-05T23:23:04Z

@ava12 thanks for testing this. I can help you testing this next week with a larger database if you share with me some of the code. This will make sure that the problem does not occur in long running PDMLs.

ava12 · 2020-08-06T13:55:30Z

Sample code is like this:

<?php

require_once __DIR__ . '/vendor/autoload.php';
use Google\Cloud\Spanner\SpannerClient;

const INSTANCE_NAME = 'my-instance';
const DATABASE_NAME = 'my-database';
const QUERY = "UPDATE my_table SET my_col = 'updated' WHERE true";
$config = [
//    'keyFilePath' => '/path/to/keyfile.json', // default: GOOGLE_APPLICATION_CREDENTIALS env var
//    'projectId' => 'my-project', // default: obtain from key file
];

$client = new SpannerClient($config);
$database = $client->instance(INSTANCE_NAME)->database(DATABASE_NAME);
echo 'updating' . PHP_EOL;
$rowCount = $db->executePartitionedUpdate(QUERY);
echo "done, $rowCount rows updated" . PHP_EOL;

thiagotnunes · 2020-08-10T23:58:55Z

@ava12 thanks for the sample code!

I am not being able to successfully run the operation, I keep getting the following error after ~10 minutes of the PDML running:

Fatal error: Uncaught Google\Cloud\Core\Exception\ServiceException: {
    "message": "The operation was cancelled.",
    "code": 1,
    "status": "CANCELLED",
    "details": [
        {
            "@type": "grpc-server-stats-bin",
            "data": "<Unknown Binary Data>"
        }
    ]
} in /home/spanner/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257

Could you try and see if you get the same error for a long running PDML?

ava12 · 2020-08-11T20:35:41Z

I cannot get this error because I cannot initiate a request that is long enough.
I guess connection is broken by client, but don't know where and why.
Could you please run

composer show -i | grep grpc

in the sample code directory and show its output?

thiagotnunes · 2020-08-11T22:24:45Z

The output of the command is the following:

google/grpc-gcp         0.1.5   gRPC GCP library for channel management
grpc/grpc               1.30.0  gRPC library for PHP

We will try to investigate the issue. It seems this is happening in the python library as well.

thiagotnunes · 2020-08-14T06:23:42Z

@ava12 sorry, we will probably not have the resources to investigate this further, could you continue with the investigation?
In order to run a long running PDML you can use the "thiagotnunes-test-instance" and the "thiagotnunes-pdml-testing" (assuming you have credentials to the staging project we use).

The PDML that takes a long time to execute is the following:

<?php
require './vendor/autoload.php';

use Google\Cloud\Spanner\SpannerClient;

$spanner = new SpannerClient(array(
  'projectId' => '' // staging project
));
$instance = $spanner->instance('thagotnunes-test-instance');
$database = $instance->database('thiagotnunes-pdml-testing');

printf("Running PDML started at " . date("Y-m-d H:i:s") . "\n");
$rowCount = $database->executePartitionedUpdate(
  "UPDATE ChildTable c SET c.name=CONCAT('315',c.name) WHERE c.ChildId < 1000000 AND NOT c.name LIKE '315%'"
);
printf("Running PDML finished at " . date("Y-m-d H:i:s") . "\n");

?>

ava12 · 2020-08-14T14:58:48Z

@thiagotnunes, I don't know the name of staging project and I don't think I have credentials to access it.

Your sample query is quite simple and updates only one million rows. I've tried similar query on dev project and my table with one million rows, and it took less than 30 seconds to complete. How come your query takes more than 10 minutes? Complex schema? Huge table rows?

thiagotnunes · 2020-08-15T00:14:45Z

@ava12 oh I thought you had access to it. That would be the app-soda staging project. Let me talk to my peers to see how to get you access to it.

In terms of the time to update, the schema has a parent / child relationship with interleaving. There are about 500 parents with 1M children, so 500M updates.

ava12 · 2020-08-15T00:19:51Z

Ah, the project is appdev-soda-spanner-staging. I can access it and I see thiagotnunes-test-instance in instance list.

ava12 · 2020-08-19T16:00:26Z

Cannot reproduce. I never see neither "Socket closed", nor "The operation was cancelled". I always get "Timeout Exceeded". Even when I set timeout to 3 hours, I get "Timeout Exceeded" in 3 hours, no other errors, no retries.

thiagotnunes · 2020-08-19T22:58:06Z

@ava12 thanks for trying it out. I will see if this still happens on my end and get back to you.

thiagotnunes · 2020-08-20T02:23:59Z

@ava12 you are correct, this problem does not seem to occur any longer. I was running my tests in a different environment, which might had been the culprit of the problem.

Thanks for verifying this. I will close the issue.

thiagotnunes added the api: spanner Issues related to the Spanner API. label Jul 29, 2020

yoshi-automation added the triage me I really want to be triaged. label Jul 29, 2020

skuruppu added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. labels Jul 30, 2020

ava12 self-assigned this Aug 5, 2020

thiagotnunes closed this as completed Aug 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

thiagotnunes commented Jul 29, 2020

ava12 commented Aug 4, 2020

ava12 commented Aug 5, 2020

thiagotnunes commented Aug 5, 2020

ava12 commented Aug 6, 2020

thiagotnunes commented Aug 10, 2020

ava12 commented Aug 11, 2020

thiagotnunes commented Aug 11, 2020

thiagotnunes commented Aug 14, 2020

ava12 commented Aug 14, 2020

thiagotnunes commented Aug 15, 2020

ava12 commented Aug 15, 2020

ava12 commented Aug 19, 2020

thiagotnunes commented Aug 19, 2020

thiagotnunes commented Aug 20, 2020

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

Comments

thiagotnunes commented Jul 29, 2020

ava12 commented Aug 4, 2020

ava12 commented Aug 5, 2020

thiagotnunes commented Aug 5, 2020

ava12 commented Aug 6, 2020

thiagotnunes commented Aug 10, 2020

ava12 commented Aug 11, 2020

thiagotnunes commented Aug 11, 2020

thiagotnunes commented Aug 14, 2020

ava12 commented Aug 14, 2020

thiagotnunes commented Aug 15, 2020

ava12 commented Aug 15, 2020

ava12 commented Aug 19, 2020

thiagotnunes commented Aug 19, 2020

thiagotnunes commented Aug 20, 2020