
Spanner: TransactionContext hangs thread indefinitely #799

Closed
msk-midas opened this issue Jan 14, 2021 · 6 comments · Fixed by #800 or #807
Assignees
Labels
api: spanner priority: p1 type: bug

Comments

@msk-midas msk-midas commented Jan 14, 2021

This has been communicated with Google Cloud Support (case 26471342), and they requested that I open an issue here.

Environment details

  1. API: Spanner
  2. OS type and version: Debian 8
  3. Java version: 1.8.0_162
  4. google-cloud-java version(s): 3.3.0 (also reproduced on 3.0.5)

Steps to reproduce

We have a loop running in one of our corporate test VMs that updates a "last check-in made by machine" row every ten seconds. Let it run and it will randomly encounter non-recoverable freezes (the longest we let run was over the holidays: a thread frozen for 20 days). We have reproduced this on two separate machines multiple times; the average number of executions before a freeze is around 3000 (sometimes 800, sometimes 6000). On a third machine we have not encountered it at all (14,000 successful attempts and counting). The data written is minimal, essentially just a name and a timestamp. Normal communication time is a matter of milliseconds, so ten seconds is ample time between calls.

Code example

Rough pseudocode (this is part of a very large framework, but this is the call it happens on). A regular scheduled executor service runs this every 10 seconds, at a fixed rate, not a fixed delay:

TransactionRunner transactionRunner = databaseClient.readWriteTransaction();
transactionRunner.run(transaction -> {
    transaction.readRow(tableName, Key.of(getHashForId(id), getTableId(), id, sid), Collections.singleton("json"));
    return null;
});
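
The fixed-rate scheduling described above can be sketched with the JDK alone; `checkIn` here is a hypothetical stand-in for the real Spanner transaction, and the intervals are shortened for illustration:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CheckInLoop {
    static final AtomicInteger ticks = new AtomicInteger();

    // Placeholder for the real Spanner read/write transaction.
    static void checkIn() {
        ticks.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // scheduleAtFixedRate starts each run on the period boundary;
        // scheduleWithFixedDelay would instead wait after each run finishes.
        scheduler.scheduleAtFixedRate(CheckInLoop::checkIn, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        scheduler.shutdownNow();
        System.out.println(ticks.get() >= 3);
    }
}
```

Note that with a single-threaded scheduler, one `checkIn` run that never returns blocks every subsequent tick, which is consistent with the indefinite freeze reported here.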

Thread dump

"com.king.spanner.Availability_monitor-1" googleapis/google-cloud-java#91 prio=5 os_prio=0 tid=0x00007f3a00d42800 nid=0x57d4 waiting on condition [0x00007f39984a1000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000f272eff0> (a com.google.api.core.AbstractApiFuture$InternalSettableFuture)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:537)
	at com.google.api.core.AbstractApiFuture.get(AbstractApiFuture.java:56)
	at com.google.cloud.spanner.TransactionRunnerImpl$TransactionContextImpl.getTransactionSelector(TransactionRunnerImpl.java:437)
	at com.google.cloud.spanner.AbstractReadContext$2.startStream(AbstractReadContext.java:739)
	at com.google.cloud.spanner.AbstractResultSet$ResumableStreamIterator.computeNext(AbstractResultSet.java:1020)
	at com.google.cloud.spanner.AbstractResultSet$ResumableStreamIterator.computeNext(AbstractResultSet.java:898)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.google.cloud.spanner.AbstractResultSet$GrpcValueIterator.ensureReady(AbstractResultSet.java:260)
	at com.google.cloud.spanner.AbstractResultSet$GrpcValueIterator.getMetadata(AbstractResultSet.java:236)
	at com.google.cloud.spanner.AbstractResultSet$GrpcResultSet.next(AbstractResultSet.java:118)
	at com.google.cloud.spanner.AbstractReadContext.consumeSingleRow(AbstractReadContext.java:755)
	at com.google.cloud.spanner.AbstractReadContext.readRow(AbstractReadContext.java:457)
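
The dump shows the thread parked inside an unbounded `AbstractApiFuture.get()` while waiting for the transaction selector. The failure mode can be illustrated with a plain `CompletableFuture` (a minimal JDK sketch, not the actual Spanner internals): an unbounded `get()` on a future that is never completed parks the caller forever, whereas a bounded `get(timeout)` at least surfaces the hang:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the transaction future that is never completed
        // because the error was swallowed by the result stream iterator.
        CompletableFuture<String> transactionFuture = new CompletableFuture<>();

        try {
            // An unbounded transactionFuture.get() is where the reported
            // thread parks; a bounded get() fails fast instead.
            transactionFuture.get(100, TimeUnit.MILLISECONDS);
            System.out.println("completed");
        } catch (TimeoutException e) {
            System.out.println("timed out waiting for transaction");
        }
    }
}
```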

Any additional information below

We have repeatedly reproduced this over the regular public APIs, but we have the impression that it happens more often (which helps reproduction) if you add the following configuration:

Builder builder = SpannerOptions.newBuilder();
builder.setHost("https://restricted.googleapis.com");
builder.setInterceptorProvider(() -> Collections.singletonList(new ClientInterceptor() {
    public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
            MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        return next.newCall(method, callOptions.withAuthority("spanner.googleapis.com"));
    }
}));
@skuruppu skuruppu transferred this issue from googleapis/google-cloud-java Jan 14, 2021
@product-auto-label product-auto-label bot added the api: spanner label Jan 14, 2021
@skuruppu skuruppu added type: bug priority: p1 labels Jan 14, 2021

@skuruppu skuruppu commented Jan 14, 2021

@msk-midas I transferred the issue to the Spanner-specific repo as we no longer use the other repo.

@olavloite if you could please take a look at this issue, that would be great.

FYI @thiagotnunes

@olavloite olavloite commented Jan 14, 2021

@msk-midas Thank you so much for the very detailed report.

Your pseudo example indicates that

transaction.readRow(tableName, Key.of(getHashForId(id), getTableId(), id, sid), Collections.singleton("json"));

is the statement that gets stuck. I have a couple of additional questions regarding this:

  1. Is this the first statement in the transaction? Or is there some other statement in the transaction that precedes it?
  2. If the given example is not the first statement in the transaction, what type of statement is the first statement (read / query / update)?
  3. Is there any try-catch block inside the transaction around the first statement of the transaction? Or will any error that is returned by that statement propagate?
  4. Do you apply any kind of parallelism inside the transaction? E.g. do you execute two queries in parallel, or do you pass any query results to a separate worker thread to let that thread process the results, or anything similar?

(Note: my questions above should not in any way be interpreted as an indication that any of the above is unsupported; I'm just trying to figure out a way to reproduce it.)

@msk-midas msk-midas commented Jan 15, 2021

I see that it's closed and you may well have found the solution! Just to answer your questions:
Short answer: the actual transaction interactions are as simple as the pseudocode.
Longer answer: after the transaction runner is called and we get a transaction callback, the first interaction with said transaction is the read where it hangs in the pseudocode example. There is nothing else going on with the transaction at that point and no other thread picking it up on our end. There is no catch block inside the transaction execution. I believe this is the only Spanner interaction we run at all on these machines at the moment, meaning it's a single thread doing a single call every ten seconds on the entire machine, so there should only be a single TransactionContext alive at any time.

@olavloite olavloite commented Jan 15, 2021

@msk-midas Thanks for your response; that is actually very interesting information. I would have expected the statement that gets stuck not to be the first statement in the transaction. I will do some additional investigation based on that.

@olavloite olavloite reopened this Jan 15, 2021
olavloite added a commit that referenced this issue Jan 16, 2021
…stuck

If the first query or read operation of a read/write transaction would return UNAVAILABLE for
the first element of the result stream, the transaction could get stuck. This was caused by the
internal retry mechanism that would wait for the initial attempt to return a transaction, which
was never returned as the UNAVAILABLE exception was internally handled by the result stream
iterator.

Fixes #799
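
The root cause described in the commit message can be reduced to a minimal JDK sketch (an illustration, not the actual library code): a thread blocks on a future that should deliver the transaction, and if the error path swallows the failure without completing that future, the waiter hangs forever. The essence of the fix is to propagate the error to the future so the waiter can observe it and retry:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class RetryFixSketch {
    public static void main(String[] args) {
        // Stand-in for the future the retry mechanism waits on.
        CompletableFuture<String> txnId = new CompletableFuture<>();

        // Buggy behavior (sketch): the stream iterator catches UNAVAILABLE and
        // retries internally, so txnId is never completed and any thread
        // blocked on txnId.get() parks forever.

        // Fixed behavior (sketch): the error is propagated to the future so
        // the waiter can observe it and restart the transaction.
        txnId.completeExceptionally(new RuntimeException("UNAVAILABLE"));

        try {
            txnId.get();
        } catch (ExecutionException e) {
            System.out.println("waiter unblocked: " + e.getCause().getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```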
thiagotnunes pushed a commit that referenced this issue Jan 17, 2021
…stuck (#807)

@msk-midas msk-midas commented Jan 19, 2021

I updated to 3.3.2 about 20 hours ago, and the error has not recurred. (Due to its random nature it's not guaranteed fixed, but previously I would have encountered it at least a couple of times in the same timespan.) I believe the UNAVAILABLE fix has solved our problems.

@olavloite olavloite commented Jan 19, 2021

@msk-midas Thank you for the update. It is very much appreciated.

mr-salty added a commit to mr-salty/google-cloud-cpp that referenced this issue Jan 20, 2021
If the initial read from a streaming operation fails when also
trying to implicitly start a transaction, ensure the library properly
recovers and completes the operation (using an explicit
`BeginTransaction` call).

This was definitely a gap in our test coverage; @thiagotnunes brought a
Java customer issue (googleapis/java-spanner#799)
to my attention, so I wanted to ensure C++ users were not susceptible.
mr-salty added a commit to googleapis/google-cloud-cpp that referenced this issue Jan 21, 2021
…sts (#5718)

If the initial read from a streaming operation fails when also trying to implicitly start a transaction, ensure the library properly recovers and completes the operation.

Test both permanent and transient failures. They behave slightly differently: a permanent failure causes an explicit `BeginTransaction`, whereas a transient one causes the RPC to be retried with the `begin` selector still set.

This was definitely a gap in our test coverage; @thiagotnunes brought a Java customer issue (googleapis/java-spanner#799) to my attention, so I wanted to ensure C++ users were not susceptible.
thiagotnunes pushed a commit that referenced this issue Jan 22, 2021
…stuck (#807)
thiagotnunes pushed a commit to thiagotnunes/java-spanner that referenced this issue Feb 8, 2021
…stuck (googleapis#807)
thiagotnunes added a commit that referenced this issue Feb 9, 2021
…to get stuck (#856)

* fix: UNAVAILABLE error on first query could cause transaction to get stuck (#807)

* chore: re-formats source files

To fix lint errors

* fix: removes unrelated changes

Co-authored-by: Knut Olav Løite <koloite@gmail.com>
thiagotnunes pushed a commit that referenced this issue May 6, 2021
…stuck (#807)