Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experienced '504 Deadline Exceeded' error when tried to delete large number rows by partitioned dml #199

Closed
FengnaLiu opened this issue May 13, 2020 · 17 comments · Fixed by #203 or #232
Closed
Assignees
Labels
api: spanner priority: p1 type: bug

Comments

@FengnaLiu
Copy link

@FengnaLiu FengnaLiu commented May 13, 2020

There are 1 billion old records need to be cleaned. According the documents, It seems Partitioned DML should be the good choice. I choose java client because we can set timeout by using setPartitionedDmlTimeout(Duration.ofHours(24L)). However, I still experienced the '504 Deadline Exceeded' after a few hours that shorter than 24 hours I set. Why setPartitionedDmlTimeout did not work? How should I solve the problem? Is it not possible to delete so large number of data by using Partitioned DML.

Following are the details.

Environment details

  1. Spanner
    3 nodes, 2 billion records in a table and 1 billion old records need to be cleaned
  2. OS type and version: MacOs version 10.15.3
  3. Java version: java 8
  4. spanner version(s): the latest version

Steps to reproduce

  1. Write 2 billion records with timestamp in to the spanner table with dataflow
  2. Run a clean job to delete the old 1 billion records with Partitioned DML by java client
  3. Set the timeout to be 24 hours by using setPartitionedDmlTimeout(Duration.ofHours(24L))
  4. Run the java code

Code example

SpannerOptions options = SpannerOptions.newBuilder().setPartitionedDmlTimeout(Duration.ofHours(24L)).build();

        Spanner spanner = options.getService();
        try {

            DatabaseId db = DatabaseId.of(options.getProjectId(), args[0], args[1]);
            // [END init_client]
            //This will return the default project id based on the environment.
            String clientProject = spanner.getOptions().getProjectId();
            if (!db.getInstanceId().getProject().equals(clientProject)) {
                System.err.println(
                        "Invalid project specified. Project in the database id should match the"
                                + "project name set in the environment variable GOOGLE_CLOUD_PROJECT. Expected: "
                                + clientProject);
            }

            // [START init_client]
            DatabaseClient dbClient = spanner.getDatabaseClient(db);

            SimpleDateFormat formatter = new SimpleDateFormat("dd/MM/yyyy hh:mm:ss.SSS");
            String sql="DELETE FROM MyTable WHERE updateTimestamp<'2020-05-02T03:00:00Z'";
            Date currentDate = new Date();
            System.out.println(String.format("Delete Start:%s",formatter.format(currentDate)));

            long lStartTime = Instant.now().toEpochMilli();
            deleteUsingPartitionedDml(dbClient,sql);
            long lEndTime = Instant.now().toEpochMilli();
            long elaspedTime = lEndTime - lStartTime;
            System.out.println("Elapsed time in : " + elaspedTime/1000);

        } finally {
            spanner.close();
        }
        // [END init_client]
        System.out.println("Closed client");
    }
    private static void deleteUsingPartitionedDml(DatabaseClient dbClient,String sql) {
        long rowCount = dbClient.executePartitionedUpdate(Statement.of(sql));

        System.out.printf("%d records deleted.\n", rowCount);
    }
@product-auto-label product-auto-label bot added the api: spanner label May 13, 2020
@olavloite olavloite self-assigned this May 13, 2020
@olavloite
Copy link
Contributor

@olavloite olavloite commented May 13, 2020

@FengnaLiu Thanks for the detailed report. I'll look into it.

Could you also share the stacktrace of the exception that you get? The reason I'm asking is that I noticed that you wrote that you get a 504 Deadline Exceeded error, which indicates that it is an HTTP error and not a gRPC error. That could indicate that the timeout is occurring on the server and not in the client. The client would normally throw a DEADLINE_EXCEEDED exception with gRPC status code 4.

@olavloite
Copy link
Contributor

@olavloite olavloite commented May 13, 2020

@FengnaLiu I have a couple of additional questions in addition to the one above.

  1. Could it be that the Partitioned DML statement is at least partly executed? In other words: If you execute a SELECT COUNT(*) FROM MyTable before and after running the above sample, does it return the same value?
  2. Does your network infrastructure include a HTTP proxy?

olavloite added a commit that referenced this issue May 13, 2020
Setting a timeout value for Partitioned DML would not be respected if the timeout
value was higher than the timeout value set for the ExecuteSql RPC on the SpannerStub.
Lower timeout values would be respected.

Fixes #199
@yoshi-automation yoshi-automation added the triage me label May 13, 2020
@skuruppu skuruppu added type: bug and removed triage me labels May 14, 2020
@yoshi-automation yoshi-automation added the triage me label May 14, 2020
olavloite added a commit that referenced this issue May 14, 2020
Setting a timeout value for Partitioned DML would not be respected if the timeout
value was higher than the timeout value set for the ExecuteSql RPC on the SpannerStub.
Lower timeout values would be respected.

Fixes #199
olavloite added a commit that referenced this issue May 14, 2020
* fix: Partitioned DML timeout was not always respected

Setting a timeout value for Partitioned DML would not be respected if the timeout
value was higher than the timeout value set for the ExecuteSql RPC on the SpannerStub.
Lower timeout values would be respected.

Fixes #199

* fix: add ignored changes + InternalApi

* tests: add test for retry on UNAVAILABLE
@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 14, 2020

@olavloite I tried to run the delete job again. I got the 'DEADLINE_EXCEEDED' instead of '504 DEADLINE_EXCEEDED' about four hours later after started the delete job.

Following are the details.

  1. Half of the data was deleted.

  2. The stacktrace of the exception i got
    com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 3599999079514ns. [buffered_nanos=186035342, remote_addr=spanner.googleapis.com/172.217.175.106:443] at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted (SpannerExceptionFactory.java:182) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:45) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:139) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:691) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:543) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:87) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:69) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:47) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:91) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:103) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:647) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:199) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:196) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:213) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:194) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:93) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:81) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 3599999079514ns. [buffered_nanos=186035342, remote_addr=spanner.googleapis.com/172.217.175.106:443] at com.google.common.util.concurrent.AbstractFuture.getDoneValue (AbstractFuture.java:552) at com.google.common.util.concurrent.AbstractFuture.get (AbstractFuture.java:533) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:685) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:543) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:87) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:69) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:47) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:91) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:103) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:647) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:199) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:196) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:213) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:194) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:93) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:81) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 3599999079514ns. [buffered_nanos=186035342, remote_addr=spanner.googleapis.com/172.217.175.106:443] at com.google.api.gax.rpc.ApiExceptionFactory.createException (ApiExceptionFactory.java:51) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:72) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:60) at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure (GrpcExceptionCallable.java:97) at com.google.api.core.ApiFutures$1.onFailure (ApiFutures.java:68) at com.google.common.util.concurrent.Futures$CallbackListener.run (Futures.java:982) at com.google.common.util.concurrent.DirectExecutor.execute (DirectExecutor.java:30) at com.google.common.util.concurrent.AbstractFuture.executeListener (AbstractFuture.java:1138) at com.google.common.util.concurrent.AbstractFuture.complete (AbstractFuture.java:957) at com.google.common.util.concurrent.AbstractFuture.setException (AbstractFuture.java:748) at io.grpc.stub.ClientCalls$GrpcFuture.setException (ClientCalls.java:520) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:495) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose (CensusStatsModule.java:700) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose (CensusTracingModule.java:399) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:510) at io.grpc.internal.ClientCallImpl.access$300 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:630) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700 (ClientCallImpl.java:518) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:692) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:681) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748) Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 3599999079514ns. [buffered_nanos=186035342, remote_addr=spanner.googleapis.com/172.217.175.106:443] at io.grpc.Status.asRuntimeException (Status.java:533) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:495) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose (CensusStatsModule.java:700) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose (CensusTracingModule.java:399) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:510) at io.grpc.internal.ClientCallImpl.access$300 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:630) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700 (ClientCallImpl.java:518) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:692) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:681) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748)

  3. My network infrastructure does not include a HTTP proxy

@olavloite
Copy link
Contributor

@olavloite olavloite commented May 14, 2020

@FengnaLiu Thanks for the additional information. I have been able to reproduce this, and a fix has been merged. You could try it with that by checking out the latest version of the master branch of this repository, or wait for the next release.

@iamacleish
Copy link

@iamacleish iamacleish commented May 14, 2020

@olavloite Any idea when the next release will be?

@olavloite olavloite added priority: p1 and removed triage me labels May 15, 2020
gcf-merge-on-green bot pushed a commit that referenced this issue May 19, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.55.0](https://www.github.com/googleapis/java-spanner/compare/v1.54.0...v1.55.0) (2020-05-19)


### Features

* mark when a Spanner client is closed ([#198](https://www.github.com/googleapis/java-spanner/issues/198)) ([50cb174](https://www.github.com/googleapis/java-spanner/commit/50cb1744e7ede611758d3ff63b3df77a1d3682eb))


### Bug Fixes

* make it possible to override backups methods ([#195](https://www.github.com/googleapis/java-spanner/issues/195)) ([2d19c25](https://www.github.com/googleapis/java-spanner/commit/2d19c25ba32847d116194565e67e1b1276fcb9f8))
* Partitioned DML timeout was not always respected ([#203](https://www.github.com/googleapis/java-spanner/issues/203)) ([13cb37e](https://www.github.com/googleapis/java-spanner/commit/13cb37e55ddfd1ff4ec22b1dcdc20c4832eee444)), closes [#199](https://www.github.com/googleapis/java-spanner/issues/199)
* partitionedDml stub was not closed ([#213](https://www.github.com/googleapis/java-spanner/issues/213)) ([a2d9a33](https://www.github.com/googleapis/java-spanner/commit/a2d9a33fa31f7467fc2bfbef5a29c4b3f5aea7c8))
* reuse clientId for invalidated databases ([#206](https://www.github.com/googleapis/java-spanner/issues/206)) ([7b4490d](https://www.github.com/googleapis/java-spanner/commit/7b4490dfb61fbc81b5bd6be6c9a663b36b5ce402))
* use nanos to prevent truncation errors ([#204](https://www.github.com/googleapis/java-spanner/issues/204)) ([a608460](https://www.github.com/googleapis/java-spanner/commit/a60846043dc0ca47e1970d8ab99380b6d725c7a9)), closes [#200](https://www.github.com/googleapis/java-spanner/issues/200)


### Dependencies

* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.3.1 ([#190](https://www.github.com/googleapis/java-spanner/issues/190)) ([ad41a0d](https://www.github.com/googleapis/java-spanner/commit/ad41a0d4b0cc6a2c0ae0611c767652f64cfb2fb7))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
@olavloite
Copy link
Contributor

@olavloite olavloite commented May 19, 2020

@iamacleish A new release has now been published.

@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 20, 2020

@olavloite Thanks for your information.
I tried the new version. But I encountered the new error [com.google.cloud.spanner.SpannerException: INTERNAL: ] after 1 hour running.

Details information as following.

  1. The stacktrace of the exception i got
    com.google.cloud.spanner.SpannerException: INTERNAL: com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: Internal error encountered. at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted (SpannerExceptionFactory.java:210) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:56) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:150) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:1214) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:1060) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:86) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:68) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:57) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:90) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:108) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:688) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:205) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:202) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:219) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:200) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:56) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:44) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: Internal error encountered. at com.google.common.util.concurrent.AbstractFuture.getDoneValue (AbstractFuture.java:564) at com.google.common.util.concurrent.AbstractFuture.get (AbstractFuture.java:545) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:1208) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:1060) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:86) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:68) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:57) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:90) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:108) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:688) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:205) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:202) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:219) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:200) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:56) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:44) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: Internal error encountered. at com.google.api.gax.rpc.ApiExceptionFactory.createException (ApiExceptionFactory.java:67) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:72) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:60) at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure (GrpcExceptionCallable.java:97) at com.google.api.core.ApiFutures$1.onFailure (ApiFutures.java:68) at com.google.common.util.concurrent.Futures$CallbackListener.run (Futures.java:1050) at com.google.common.util.concurrent.DirectExecutor.execute (DirectExecutor.java:30) at com.google.common.util.concurrent.AbstractFuture.executeListener (AbstractFuture.java:1176) at com.google.common.util.concurrent.AbstractFuture.complete (AbstractFuture.java:969) at com.google.common.util.concurrent.AbstractFuture.setException (AbstractFuture.java:760) at io.grpc.stub.ClientCalls$GrpcFuture.setException (ClientCalls.java:526) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:501) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:426) at io.grpc.internal.ClientCallImpl.access$500 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:689) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900 (ClientCallImpl.java:577) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:751) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:740) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748) Caused by: io.grpc.StatusRuntimeException: INTERNAL: Internal error encountered. at io.grpc.Status.asRuntimeException (Status.java:533) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:501) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:426) at io.grpc.internal.ClientCallImpl.access$500 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:689) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900 (ClientCallImpl.java:577) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:751) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:740) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748)

  2. pom.xml

`



com.google.cloud
libraries-bom
5.4.0
pom
import


UTF-8 UTF-8
<!--testing dependencies-->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.13</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-library</artifactId>
  <version>2.2</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.11.0</version>
</dependency>

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-spanner</artifactId>
  <version>1.55.0</version>
</dependency>

`

@olavloite
Copy link
Contributor

@olavloite olavloite commented May 20, 2020

@FengnaLiu That's surprising. I'll look into it right away. Did you set any specific timeout value?

@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 20, 2020

@olavloite I did set the timeout to be 24 hours.
SpannerOptions options = SpannerOptions.newBuilder().setPartitionedDmlTimeout(Duration.ofHours(24L)).build();

@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 20, 2020

@olavloite I ran the job again and then the error seems different.

com.google.cloud.spanner.AbortedException: ABORTED: com.google.api.gax.rpc.AbortedException: io.grpc.StatusRuntimeException: ABORTED: Aborted due to transient fault at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted (SpannerExceptionFactory.java:196) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:56) at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException (SpannerExceptionFactory.java:150) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:1214) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:1060) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:86) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:68) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:57) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:90) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:108) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:688) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:205) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:202) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:219) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:200) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:56) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:44) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.AbortedException: io.grpc.StatusRuntimeException: ABORTED: Aborted due to transient fault at com.google.common.util.concurrent.AbstractFuture.getDoneValue (AbstractFuture.java:564) at com.google.common.util.concurrent.AbstractFuture.get (AbstractFuture.java:545) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get (GapicSpannerRpc.java:1208) at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executePartitionedDml (GapicSpannerRpc.java:1060) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:86) at com.google.cloud.spanner.PartitionedDMLTransaction$1.call (PartitionedDMLTransaction.java:68) at com.google.api.gax.retrying.DirectRetryingExecutor.submit (DirectRetryingExecutor.java:105) at com.google.cloud.RetryHelper.run (RetryHelper.java:76) at com.google.cloud.RetryHelper.runWithRetries (RetryHelper.java:50) at com.google.cloud.spanner.SpannerRetryHelper.runTxWithRetriesOnAborted (SpannerRetryHelper.java:57) at com.google.cloud.spanner.PartitionedDMLTransaction.executePartitionedUpdate (PartitionedDMLTransaction.java:90) at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate (SessionImpl.java:108) at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate (SessionPool.java:688) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:205) at com.google.cloud.spanner.DatabaseClientImpl$3.apply (DatabaseClientImpl.java:202) at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry (DatabaseClientImpl.java:219) at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate (DatabaseClientImpl.java:200) at jp.co.sej.central.spanner.DeleteOldStoreRecords.deleteUsingPartitionedDml (DeleteOldStoreRecords.java:56) at jp.co.sej.central.spanner.DeleteOldStoreRecords.main (DeleteOldStoreRecords.java:44) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:748) Caused by: com.google.api.gax.rpc.AbortedException: io.grpc.StatusRuntimeException: ABORTED: Aborted due to transient fault at com.google.api.gax.rpc.ApiExceptionFactory.createException (ApiExceptionFactory.java:61) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:72) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create (GrpcApiExceptionFactory.java:60) at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure (GrpcExceptionCallable.java:97) at com.google.api.core.ApiFutures$1.onFailure (ApiFutures.java:68) at com.google.common.util.concurrent.Futures$CallbackListener.run (Futures.java:1050) at com.google.common.util.concurrent.DirectExecutor.execute (DirectExecutor.java:30) at com.google.common.util.concurrent.AbstractFuture.executeListener (AbstractFuture.java:1176) at com.google.common.util.concurrent.AbstractFuture.complete (AbstractFuture.java:969) at com.google.common.util.concurrent.AbstractFuture.setException (AbstractFuture.java:760) at io.grpc.stub.ClientCalls$GrpcFuture.setException (ClientCalls.java:526) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:501) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:426) at io.grpc.internal.ClientCallImpl.access$500 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:689) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900 (ClientCallImpl.java:577) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:751) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:740) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748) Caused by: io.grpc.StatusRuntimeException: ABORTED: Aborted due to transient fault at io.grpc.Status.asRuntimeException (Status.java:533) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose (ClientCalls.java:501) at io.grpc.PartialForwardingClientCallListener.onClose (PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose (ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose (ForwardingClientCallListener.java:40) at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor$1$1.onClose (SpannerErrorInterceptor.java:100) at io.grpc.internal.ClientCallImpl.closeObserver (ClientCallImpl.java:426) at io.grpc.internal.ClientCallImpl.access$500 (ClientCallImpl.java:66) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close (ClientCallImpl.java:689) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900 (ClientCallImpl.java:577) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal (ClientCallImpl.java:751) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext (ClientCallImpl.java:740) at io.grpc.internal.ContextRunnable.run (ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run (SerializingExecutor.java:123) at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624) at java.lang.Thread.run (Thread.java:748)

@olavloite
Copy link
Contributor

@olavloite olavloite commented May 20, 2020

@FengnaLiu Thanks for the additional information. That is indeed a different error. The Partitioned DML transaction has been aborted, and normally that should lead to a retry by the client, but it seems that that is not happening in this case. I'll try to figure out why that is.

olavloite added a commit that referenced this issue May 20, 2020
The PartitionedDML retry settings were only applied for the RPC, and not
for the generic retryer that would retry the PDML transaction if it was
aborted by Spanner. This could cause long-running PDML transactions to
fail with an Aborted exception.

Fixes #199
skuruppu pushed a commit that referenced this issue May 20, 2020
* fix: PDML retry settings were not applied for aborted tx

The PartitionedDML retry settings were only applied for the RPC, and not
for the generic retryer that would retry the PDML transaction if it was
aborted by Spanner. This could cause long-running PDML transactions to
fail with an Aborted exception.

Fixes #199

* fix: add ignored diff to clirr
@olavloite
Copy link
Contributor

@olavloite olavloite commented May 21, 2020

@FengnaLiu The last problem you ran into seems to be caused by the following scenario:

  1. The Partitioned DML transaction and statement is started.
  2. The transaction is aborted by Cloud Spanner.
  3. The client starts an automatic retry of the PDML transaction. The retry and timeout settings that were used by the automatic retryer for the retried transaction were the default retry settings, and not the specific PDML retry settings that have been introduced.

The last point above has now also been fixed in #232

@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 22, 2020

@olavloite I really appreciate your quick response.
There are still something that I am not clear.

  1. Does it mean I have to wait for next release to apply the changes?
  2. And how about the first problem [com.google.cloud.spanner.SpannerException: INTERNAL: ]?
    Did it also have been fixed?

@FengnaLiu
Copy link
Author

@FengnaLiu FengnaLiu commented May 29, 2020

@olavloite I really sorry to bother you again. I tried version 1.55.1 and 1.55.0 for several delete patterns. And I found out that with version 1.55.1 it did experience no timeout error but the job kept running even the records have been all deleted. Additionally, it appears to deleting much far more slower than the older version.

Details as followings:

  1. delete 1 billion records.
    Version 1.55.0 - The job experiences timeout error after a few hours running , and timeout I set is 24 hours
    Version 1.55.1 - the job keeps running even after 24 hours( the timeout I set). Additionally, I checked the left records during the job and it appears to stop deleting records after 5-6 hours.

  2. delete 100million records.
    Version 1.55.0 and Version 1.46.0 - Job Succeeded after 30 minutes.
    Version 1.55.1 - After 2 hours the job is still running and I checked the left record number through query is 0. I manually stopped the job after 3 hours.

  3. delete 200 million records
    Version 1.55.1 - Job succeeded after 6 hours. I was not able to check when the records were all deleted because I ran the job during the night.
    Version 1.55.0 - Not checked yet. I will find out today.

@olavloite
Copy link
Contributor

@olavloite olavloite commented May 29, 2020

Hi @FengnaLiu

I really sorry to bother you again.
Don't worry about that. This is valuable information for improving the client library.

Thanks for the detailed information and examples. I don't think the new version is literally deleting the records slower, but it might be that it is slower and less reliable in detecting when it has finished.

The big challenge here is that the Partitioned DML RPC is defined as an RPC that returns its response when the operation has finished. That works well for operations that normally takes seconds or maybe a couple of minutes, but not so well if the operation can take hours as any network hick-up during this time might cause unexpected errors. That is why for example RPCs to update the DDL of a database respond with a long-running operation that can be polled for its progress, instead of waiting to respond until the operation has finished on the backend.

I'll start a discussion about this with a couple of others to see if we can come up with a better solution for this. One solution might be to create a kind of long-running operation in the client library that handles all this for you. In the meantime, I'm afraid that your best option is to retry the delete operation with reasonable chunks of the table as you are already doing.

I'll get back to you on this as soon as I can.

@iamacleish
Copy link

@iamacleish iamacleish commented Jun 2, 2020

Hey @olavloite !
Any updates on this?

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Jun 4, 2020

@olavloite and I spoke yesterday about whether we can update the client library to support this case. I have a couple of issues with this:

  1. Does it make sense for us to retry these transactions on the client library? There may be some queries where this doesn't make sense?
  2. Under what errors should we retry?

I'm a bit concerned about making the client library logic more complicated than it already is. It also means that we'll have an inconsistency between how Java handles this vs client libraries for other languages.

Ideally the backend would return a LRO here instead of a standard response which would make it easier for us to consistently handle this type of issue across all the client libraries. But I don't see a change like this being made anytime soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner priority: p1 type: bug
Projects
None yet
5 participants