Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258

Closed
thiagotnunes opened this issue Jul 29, 2020 · 14 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@thiagotnunes
Copy link

This bug is related to the Spanner client library.

For long lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with an error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction either with the received resume token obtained on reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.

We have already implemented such change in the Java client library, for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

@thiagotnunes thiagotnunes added the api: spanner Issues related to the Spanner API. label Jul 29, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jul 29, 2020
@skuruppu skuruppu added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. labels Jul 30, 2020
@ava12
Copy link
Contributor

ava12 commented Aug 4, 2020

Not sure, but looks like this case is already handled and transaction is retried. Is there an easy way to check it? Maybe some server emulator that can be instructed to close connection in a few seconds?

@ava12 ava12 self-assigned this Aug 5, 2020
@ava12
Copy link
Contributor

ava12 commented Aug 5, 2020

I've tried PDML updates on a table with one milloin rows using custom port forwarder that breaks connections before update operation is complete. In this case gRPC library throws Google\Cloud\Core\Exception\ServiceException with message "Socket closed", code 14, status "UNAVAILABLE". This exception is caught by Google\Cloud\Spanner\Result, and operation is retried using obtained resume token. To me it looks like everything works as needed. But I cannot test it in real conditions.

@thiagotnunes
Copy link
Author

@ava12 thanks for testing this. I can help you testing this next week with a larger database if you share with me some of the code. This will make sure that the problem does not occur in long running PDMLs.

@ava12
Copy link
Contributor

ava12 commented Aug 6, 2020

Sample code is like this:

<?php

require_once __DIR__ . '/vendor/autoload.php';
use Google\Cloud\Spanner\SpannerClient;

const INSTANCE_NAME = 'my-instance';
const DATABASE_NAME = 'my-database';
const QUERY = "UPDATE my_table SET my_col = 'updated' WHERE true";
$config = [
//    'keyFilePath' => '/path/to/keyfile.json', // default: GOOGLE_APPLICATION_CREDENTIALS env var
//    'projectId' => 'my-project', // default: obtain from key file
];

$client = new SpannerClient($config);
$database = $client->instance(INSTANCE_NAME)->database(DATABASE_NAME);
echo 'updating' . PHP_EOL;
$rowCount = $db->executePartitionedUpdate(QUERY);
echo "done, $rowCount rows updated" . PHP_EOL;

@thiagotnunes
Copy link
Author

@ava12 thanks for the sample code!

I am not being able to successfully run the operation, I keep getting the following error after ~10 minutes of the PDML running:

Fatal error: Uncaught Google\Cloud\Core\Exception\ServiceException: {
    "message": "The operation was cancelled.",
    "code": 1,
    "status": "CANCELLED",
    "details": [
        {
            "@type": "grpc-server-stats-bin",
            "data": "<Unknown Binary Data>"
        }
    ]
} in /home/spanner/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257

Could you try and see if you get the same error for a long running PDML?

@ava12
Copy link
Contributor

ava12 commented Aug 11, 2020

I cannot get this error because I cannot initiate a request that is long enough.
I guess connection is broken by client, but don't know where and why.
Could you please run

composer show -i | grep grpc

in the sample code directory and show its output?

@thiagotnunes
Copy link
Author

The output of the command is the following:

google/grpc-gcp         0.1.5   gRPC GCP library for channel management
grpc/grpc               1.30.0  gRPC library for PHP

We will try to investigate the issue. It seems this is happening in the python library as well.

@thiagotnunes
Copy link
Author

@ava12 sorry, we will probably not have the resources to investigate this further, could you continue with the investigation?
In order to run a long running PDML you can use the "thiagotnunes-test-instance" and the "thiagotnunes-pdml-testing" (assuming you have credentials to the staging project we use).

The PDML that takes a long time to execute is the following:

<?php
require './vendor/autoload.php';

use Google\Cloud\Spanner\SpannerClient;

$spanner = new SpannerClient(array(
  'projectId' => '' // staging project
));
$instance = $spanner->instance('thagotnunes-test-instance');
$database = $instance->database('thiagotnunes-pdml-testing');

printf("Running PDML started at " . date("Y-m-d H:i:s") . "\n");
$rowCount = $database->executePartitionedUpdate(
  "UPDATE ChildTable c SET c.name=CONCAT('315',c.name) WHERE c.ChildId < 1000000 AND NOT c.name LIKE '315%'"
);
printf("Running PDML finished at " . date("Y-m-d H:i:s") . "\n");

?>

@ava12
Copy link
Contributor

ava12 commented Aug 14, 2020

@thiagotnunes, I don't know the name of staging project and I don't think I have credentials to access it.

Your sample query is quite simple and updates only one million rows. I've tried similar query on dev project and my table with one million rows, and it took less than 30 seconds to complete. How come your query takes more than 10 minutes? Complex schema? Huge table rows?

@thiagotnunes
Copy link
Author

@ava12 oh I thought you had access to it. That would be the app-soda staging project. Let me talk to my peers to see how to get you access to it.

In terms of the time to update, the schema has a parent / child relationship with interleaving. There are about 500 parents with 1M children, so 500M updates.

@ava12
Copy link
Contributor

ava12 commented Aug 15, 2020

Ah, the project is appdev-soda-spanner-staging. I can access it and I see thiagotnunes-test-instance in instance list.

@ava12
Copy link
Contributor

ava12 commented Aug 19, 2020

Cannot reproduce. I never see neither "Socket closed", nor "The operation was cancelled". I always get "Timeout Exceeded". Even when I set timeout to 3 hours, I get "Timeout Exceeded" in 3 hours, no other errors, no retries.

@thiagotnunes
Copy link
Author

@ava12 thanks for trying it out. I will see if this still happens on my end and get back to you.

@thiagotnunes
Copy link
Author

@ava12 you are correct, this problem does not seem to occur any longer. I was running my tests in a different environment, which might had been the culprit of the problem.

Thanks for verifying this. I will close the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants