feat: support partitioned queries + data boost in Connection API #2540

olavloite · 2023-07-24T12:11:11Z

Adds support for Partitioned Queries and Data Boost in the Connection API. This enables the use of these features in the JDBC driver and PGAdapter.

This PR builds on #2556 which is a small refactoring that allows client-side statements to use statement parameters.


Refactor the internal interface of client-side statements so these receive the entire parsed statement, including any query parameters in the statement. This allows us to create client-side statements that actually use the query parameters that have been specified by the user.


See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

arpan14 · 2023-08-03T09:17:04Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/Statement.java

@@ -86,6 +86,12 @@ private Builder(Statement statement) {
          statement.queryOptions == null ? null : statement.queryOptions.toBuilder().build();
    }

+    /** Replaces the current SQL of this builder with the given string. */
+    public Builder withSql(String sql) {


Nit - Should we name this replace or replaceWithSql ?

I chose the withSql name to be consistent with the existing withQueryOptions method on this class, which also replaces any existing value. I'm happy to change it to something else if you feel strongly about this.

In this case the naming convention is more like existing append() method. So going by that we can name this replace().
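
To make the naming question concrete, here is a minimal sketch of the two behaviours side by side. `SqlBuilderSketch` is a hypothetical stand-in that only models the SQL text; it is not the real `Statement.Builder`, and the method bodies are assumptions based on the Javadoc in the diff ("Replaces the current SQL of this builder with the given string").

```java
/** Hypothetical stand-in for Statement.Builder; only the SQL text is modelled. */
final class SqlBuilderSketch {
  private final StringBuilder sql = new StringBuilder();

  /** Appends a fragment to the current SQL, like the existing append() method. */
  SqlBuilderSketch append(String fragment) {
    sql.append(fragment);
    return this;
  }

  /** Replaces the current SQL with the given string, discarding what was there. */
  SqlBuilderSketch withSql(String newSql) {
    sql.setLength(0);
    sql.append(newSql);
    return this;
  }

  String build() {
    return sql.toString();
  }

  public static void main(String[] args) {
    // "SELECT 1" is discarded by withSql; the later append still adds to the new SQL.
    System.out.println(
        new SqlBuilderSketch().append("SELECT 1").withSql("SELECT 2").append(" AS n").build());
  }
}
```

Whichever name is chosen, the point of the sketch is that the method replaces rather than accumulates, which is the behaviour both `withSql` and `replace` are trying to signal.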

arpan14 · 2023-08-03T09:18:55Z

...-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/AbstractBaseUnitOfWork.java

+        transaction.partitionQuery(partitionOptions, query.getStatement(), options);
+    return ResultSets.forRows(
+        com.google.cloud.spanner.Type.struct(
+            StructField.of("PARTITION", com.google.cloud.spanner.Type.string())),


Can we define a private static constant for "PARTITION"?

I've introduced a local final variable for it. It is not used anywhere else, and it is more readable to have it in the vicinity of where it is being used, than in a constant defined at the top of the file.

arpan14 · 2023-08-03T09:26:31Z

.../src/main/java/com/google/cloud/spanner/connection/ClientSideStatementPartitionExecutor.java

+    if (matcher.find() && matcher.groupCount() >= 2) {
+      String space = matcher.group(1);
+      String value = matcher.group(2);
+      return (space + value).trim();


Suggestion - StringBuilder would be a bit more optimal as compared to + operation.

The Java compiler automatically optimizes this internally, as it is not dynamic concatenation (e.g. in a loop or other hard-to-understand construct), so for these simple cases you should keep it as is. You will see that if you change this to using a StringBuilder, IntelliJ will even give you a warning.

arpan14 · 2023-08-03T09:39:53Z

...c/main/java/com/google/cloud/spanner/connection/ClientSideStatementRunPartitionExecutor.java

+
+  String getParameterValue(ParsedStatement parsedStatement) {
+    Matcher matcher = statement.getPattern().matcher(parsedStatement.getSqlWithoutComments());
+    if (matcher.find() && matcher.groupCount() >= 1) {


It could be just me, but a first-time reader will have some difficulty understanding what we are doing here. My understanding is that we are parsing the statement to obtain the partition ID. In what cases will a statement have a groupCount >= 1?

Should we beef up the documentation a bit by adding examples for future readers?

Added some commentary to explain what is going on with this regex.

arpan14 · 2023-08-03T09:43:34Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/Connection.java

+   * value to <code>0</code>> to use the number of available processors as returned by {@link
+   * Runtime#availableProcessors()}.
+   */
+  void setMaxPartitionedParallelism(int maxThreads);


I had this query earlier as well, but thought of clarifying on code review. Isn't it more direct to use "threads" instead of "parallelism" ? Or is this a convention used widely? I would think setMaxPartitionedThreads is more specific?

I chose parallelism over threads to indicate that it is not only the number of threads being used to iterate over the returned results, it is also the maximum number of queries that will be executed in parallel on Cloud Spanner. The latter matters, as it also indicates the amount of resources that this query could potentially consume on Cloud Spanner (and not only the amount of resources on this specific client).
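
A minimal sketch of what "parallelism" means on the client side, assuming the rule from the Javadoc above that `0` selects the number of available processors. The class and method names are hypothetical, the rejection of negative values is an assumption of the sketch, and the tasks stand in for per-partition queries; this is not the Connection API implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: at most N partition tasks run at the same time. */
final class PartitionedParallelismSketch {
  /** Resolves the configured value; 0 means "use the number of available processors". */
  static int resolveParallelism(int maxParallelism) {
    if (maxParallelism < 0) {
      throw new IllegalArgumentException("maxParallelism must be >= 0"); // assumption
    }
    return maxParallelism == 0 ? Runtime.getRuntime().availableProcessors() : maxParallelism;
  }

  /** Runs one task per partition and returns the results in task order. */
  static List<String> runAll(List<Callable<String>> partitionTasks, int maxParallelism) {
    ExecutorService executor = Executors.newFixedThreadPool(resolveParallelism(maxParallelism));
    try {
      List<String> results = new ArrayList<>();
      for (Future<String> future : executor.invokeAll(partitionTasks)) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("Resolved parallelism for 0: " + resolveParallelism(0));
  }
}
```

Because each in-flight task corresponds to one query running on the server, the pool size bounds both the client threads and the number of concurrent queries, which is the distinction the reply above draws.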

arpan14 · 2023-08-03T11:27:40Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/ConnectionImpl.java

@@ -1234,6 +1372,19 @@ public ApiFuture<long[]> executeBatchUpdateAsync(Iterable<Statement> updates) {
    return internalExecuteBatchUpdateAsync(CallType.ASYNC, parsedStatements);
  }

+  private QueryOption[] mergeDataBoost(QueryOption... options) {


Nit: Should we break out databoost as a separate PR, so that it's easy to use as a reference later? This PR can probably be about just adding support for partitioned reads?

I am ok to review this together, but breaking it just adds a future reference PR on how databoost support was added. This suggestion is only if it does not result in a lot of re-work for you.

Yeah, in hindsight we should maybe have done that. On the other hand, the only reason to add this to the Connection API is the databoost feature. Partitioned queries have been around for a long time (since 2018 I think), and there has never been a need for them so far in the Connection API. With the release of databoost there is a real use case for them in this API, as it means that users can send queries to a different type of server directly from the JDBC driver.

arpan14 · 2023-08-03T11:35:14Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/ConnectionImpl.java

@@ -1234,6 +1372,19 @@ public ApiFuture<long[]> executeBatchUpdateAsync(Iterable<Statement> updates) {
    return internalExecuteBatchUpdateAsync(CallType.ASYNC, parsedStatements);
  }

+  private QueryOption[] mergeDataBoost(QueryOption... options) {
+    if (this.dataBoostEnabled) {


Query - Just checking if it's safe to not generate a query option when dataBoostEnabled is false? Wouldn't we want to differentiate the case where the customer explicitly marked the option as false from the case where the customer did not pass the property in the connection string?

The reason that we are not explicitly including a DataBoostOption when dataBoostEnabled is false, is that it is a typical configuration option that is by default off, unless it has been enabled in one or the other way. In this case, it is possible that the user has passed a QueryOption that already contains a dataBoostEnabled option, and we don't want to override that, unless the user has explicitly turned it on for this connection.

(Unfortunately, the way that QueryOptions are implemented in the client library, we cannot check the actual values here, as all concrete implementations are package-private classes without any public interface, so we can't check what the user might have passed in.)
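
The merge rule described above can be sketched as follows. `QueryOption` and `DataBoostOption` here are hypothetical stand-ins declared inside the sketch, not the library's package-private classes, and appending the option at the end of the array is an assumption.

```java
import java.util.Arrays;

/** Hypothetical sketch of the "only add when enabled on the connection" rule. */
final class MergeDataBoostSketch {
  /** Stand-in for the library's QueryOption marker type. */
  interface QueryOption {}

  /** Stand-in for an option that carries the data boost flag. */
  static final class DataBoostOption implements QueryOption {
    final boolean enabled;

    DataBoostOption(boolean enabled) {
      this.enabled = enabled;
    }
  }

  private final boolean dataBoostEnabled;

  MergeDataBoostSketch(boolean dataBoostEnabled) {
    this.dataBoostEnabled = dataBoostEnabled;
  }

  /**
   * Returns the caller's options untouched when data boost is off for this connection, so that
   * an option the caller passed in is never overridden. Adds an option only when it is on.
   */
  QueryOption[] mergeDataBoost(QueryOption... options) {
    if (!dataBoostEnabled) {
      return options;
    }
    QueryOption[] merged = Arrays.copyOf(options, options.length + 1);
    merged[options.length] = new DataBoostOption(true);
    return merged;
  }

  public static void main(String[] args) {
    System.out.println(new MergeDataBoostSketch(true).mergeDataBoost().length);
  }
}
```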

arpan14 · 2023-08-03T13:35:59Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/PartitionId.java

+  public static PartitionId decodeFromString(String id) {
+    try (ObjectInputStream objectInputStream =
+        new ObjectInputStream(
+            new GZIPInputStream(new ByteArrayInputStream(Base64.getUrlDecoder().decode(id))))) {
+      return (PartitionId) objectInputStream.readObject();
+    } catch (Exception exception) {
+      throw SpannerExceptionFactory.newSpannerException(exception);
+    }
+  }
+
+  /**
+   * @return A string-encoded version of this {@link PartitionId}. This encoded version can be sent
+   *     to any other {@link Connection} to be executed there, including connections on different
+   *     hosts than the current host.
+   */
+  public static String encodeToString(BatchTransactionId transactionId, Partition partition) {
+    PartitionId id = new PartitionId(transactionId, partition);
+    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
+    try (ObjectOutputStream objectOutputStream =
+        new ObjectOutputStream(new GZIPOutputStream(byteArrayOutputStream))) {
+      objectOutputStream.writeObject(id);
+    } catch (Exception exception) {
+      throw SpannerExceptionFactory.newSpannerException(exception);
+    }
+    return Base64.getUrlEncoder().encodeToString(byteArrayOutputStream.toByteArray());
+  }


Can we have a general utility to decode and encode (using Gzip)? That could be re-used at more than one place? We can use some generics to model more than one Input/Output?

I'm not sure I completely understand what you mean in this case. There are plenty of generic Gzip utils around. The GZIPOutputStream and GZIPInputStream that are being used here, for example, build on the built-in Deflater and Inflater classes in Java (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/zip/Inflater.html), which already contain a lot of generic methods for compressing/decompressing. There are also no other places in this library where we are currently doing any gzipping. Pre-creating a library for gzipping any kind of object sounds like an example of premature optimization.

Not pre-creating a library. But basically breaking down this class into two classes - Partition and say GzipEncodingUtility. GzipEncodingUtility will use generics to take in dynamic input/output types. Partition class could internally use GzipEncodingUtility to do what it's doing currently.
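
For reference, a minimal sketch of what such a generic helper could look like, using only JDK classes and the same stream stacking as the diff above (object stream, then gzip, then URL-safe Base64). The class name `GzipEncodingUtility` comes from the suggestion; the signatures, the `Class<T>` parameter and the exception types are assumptions of the sketch, and nothing here claims this is what ended up in the library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical generic helper: Serializable -> gzip -> URL-safe Base64 and back. */
final class GzipEncodingUtility {
  private GzipEncodingUtility() {}

  static <T extends Serializable> String encodeToString(T value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // Closing the object stream also finishes the gzip stream, so the bytes are complete.
    try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
      out.writeObject(value);
    } catch (IOException e) {
      throw new IllegalStateException("Could not encode value", e);
    }
    return Base64.getUrlEncoder().encodeToString(bytes.toByteArray());
  }

  static <T extends Serializable> T decodeFromString(String encoded, Class<T> type) {
    try (ObjectInputStream in =
        new ObjectInputStream(
            new GZIPInputStream(
                new ByteArrayInputStream(Base64.getUrlDecoder().decode(encoded))))) {
      return type.cast(in.readObject());
    } catch (IOException | ClassNotFoundException | ClassCastException e) {
      throw new IllegalArgumentException("Could not decode value", e);
    }
  }

  public static void main(String[] args) {
    String encoded = encodeToString("partition-token");
    System.out.println(decodeFromString(encoded, String.class));
  }
}
```

With a helper like this, `PartitionId` would keep only its own fields and delegate both directions to the utility. Note that Java deserialization of strings from untrusted sources carries its own risks, independent of how the code is split.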

arpan14 · 2023-08-03T13:38:26Z

...oud-spanner/src/main/resources/com/google/cloud/spanner/connection/ClientSideStatements.json

+		"statementType": "PARTITION",
+		"regex": "(?is)\\A\\s*partition(\\s+|\\()(.*)\\z",
+		"method": "statementPartition",
+		"exampleStatements": []


Referring back to a previous comment of mine where I asked for a few examples, can we add a few examples here which can be used to understand the ClientSideStatementPartitionExecutor code better?

I've added some example statements here (and updated the test framework a bit, as it was not able to fully cope with these statements that can have anything at the end).
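
As an illustration of how this regex behaves, here is a small standalone sketch that applies the pattern from `ClientSideStatements.json` with the same group handling as the `ClientSideStatementPartitionExecutor` hunk quoted earlier. The class name, the example statements and the `null` return for non-matching input are assumptions of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical standalone demo of the PARTITION statement regex. */
final class PartitionStatementRegexSketch {
  // (?is): case-insensitive, and '.' also matches line breaks.
  // Group 1 is the separator after the keyword (whitespace or an opening parenthesis).
  // Group 2 is everything that follows, i.e. the query to partition.
  private static final Pattern PARTITION =
      Pattern.compile("(?is)\\A\\s*partition(\\s+|\\()(.*)\\z");

  /** Returns the query after the PARTITION keyword, or null if the SQL does not match. */
  static String extractQuery(String sql) {
    Matcher matcher = PARTITION.matcher(sql);
    // groupCount() is a property of the pattern (2 here), not of the input string.
    if (matcher.find() && matcher.groupCount() >= 2) {
      String space = matcher.group(1);
      String value = matcher.group(2);
      // Group 1 is kept so that a leading "(" is not lost; plain whitespace is trimmed away.
      return (space + value).trim();
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(extractQuery("PARTITION SELECT * FROM my_table"));
    System.out.println(extractQuery("partition(select 1)"));
  }
}
```

The second call shows why group 1 is concatenated back: for `partition(select 1)` the opening parenthesis belongs to the query and has to be preserved.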

arpan14 · 2023-08-03T13:46:49Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/MergedResultSet.java

+    private final Type type;
+    private final ResultSetMetadata metadata;
+
+    static PartitionExecutorResult data(Struct data) {


Wouldn't a builder pattern be better suited for such object construction? Otherwise for every member that we introduce it will be difficult to multiplex such method to partially build the object.

Otherwise for every member that we introduce it will be difficult to multiplex such method to partially build the object.

That is actually the reason that I think that a Builder pattern is not suitable here. This (internal) object is not intended to allow each possible permutation. By using static builder methods for the specific permutations we allow, we gain the advantage of:

Being able to check the specific arguments of that permutation, e.g. calling data(Struct data) with a null value should be disallowed (I've added a null-check for that).

The place where it is being called is a lot easier to read: PartitionExecutorResult.data(...) is a clear indication that this result contains data, and that it is logical that it does not need to for example also include metadata.
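
A minimal sketch of the static-factory approach described above. The class is a hypothetical, simplified stand-in: `String` replaces `Struct` and the metadata type, and the `metadata(...)` factory and accessor names are assumptions; only `data(...)` and its null-check are taken from the discussion.

```java
import java.util.Objects;

/** Hypothetical, simplified stand-in for the internal PartitionExecutorResult. */
final class PartitionExecutorResultSketch {
  private final String data; // stands in for a Struct row
  private final String metadata; // stands in for ResultSetMetadata

  private PartitionExecutorResultSketch(String data, String metadata) {
    this.data = data;
    this.metadata = metadata;
  }

  /** A result that carries a row. Null rows are rejected up front. */
  static PartitionExecutorResultSketch data(String data) {
    return new PartitionExecutorResultSketch(Objects.requireNonNull(data, "data"), null);
  }

  /** A result that carries only metadata (assumed permutation). */
  static PartitionExecutorResultSketch metadata(String metadata) {
    return new PartitionExecutorResultSketch(null, Objects.requireNonNull(metadata, "metadata"));
  }

  boolean hasData() {
    return data != null;
  }

  String getData() {
    return data;
  }

  String getMetadata() {
    return metadata;
  }

  public static void main(String[] args) {
    System.out.println(PartitionExecutorResultSketch.data("row-1").hasData());
  }
}
```

Because the constructor is private, only the permutations that have a named factory can exist, and each factory can validate exactly the arguments that make sense for it.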

arpan14 · 2023-08-03T13:55:35Z

google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/MergedResultSet.java

+ * multiple queries. Each query uses its own {@link RowProducer} that feeds rows into the {@link
+ * MergedResultSet}. The order of the records in the {@link MergedResultSet} is not guaranteed.
+ */
+class MergedResultSet extends ForwardingStructReader implements PartitionedQueryResultSet {


Have we explored re-using some elements present in AsyncResultSetImpl ? Or in another way are there ways in which we can make MergedResultSet and AsyncResultSetImpl have some common code?

Also, there is not much difference in the way we would like to implement batch read in client library and connection API, apart from the fact that connection API supports a bunch of other configurations in connection string that client library does not support.

Am I missing the other major differences between these two implementations of ForwardingStructReader ?

There are two big differences between AsyncResultSetImpl and MergedResultSet:

AsyncResultSetImpl uses a single-threaded row producer that reads from a single query result. MergedResultSet uses a multi-threaded row producer that reads from multiple query results in parallel.

AsyncResultSetImpl implements the additional AsyncResultSet interface methods that allow users to consume the results in an asynchronous way. That is not needed for MergedResultSet, as it is intended for APIs that are synchronous by definition (JDBC and PostgreSQL).

So sharing the code between them won't make either of them any simpler.
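
The multi-producer shape described for MergedResultSet can be sketched roughly as below. Everything in it is an assumption for illustration: the class name, the use of plain strings for rows, in-memory lists standing in for partition results, and the unbounded queue. It only shows the idea that several producers feed one queue and that `next()` returns false only after every producer has finished.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: rows from several partitions merged into one forward-only cursor. */
final class MergedRowsSketch implements AutoCloseable {
  // Marker that a producer enqueues once its partition is exhausted.
  private static final Object FINISHED = new Object();

  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
  private final ExecutorService executor;
  private int activeProducers;
  private String current;

  MergedRowsSketch(List<List<String>> partitions, int parallelism) {
    this.activeProducers = partitions.size();
    this.executor = Executors.newFixedThreadPool(Math.max(1, parallelism));
    for (List<String> partition : partitions) {
      executor.submit(
          () -> {
            try {
              for (String row : partition) {
                queue.put(row);
              }
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            } finally {
              queue.add(FINISHED); // always signal completion, even when interrupted
            }
          });
    }
    executor.shutdown(); // already submitted producers still run to completion
  }

  /** Moves to the next row in arrival order; false only when all producers are done. */
  boolean next() {
    while (activeProducers > 0) {
      Object item;
      try {
        item = queue.take();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      if (item == FINISHED) {
        activeProducers--;
      } else {
        current = (String) item;
        return true;
      }
    }
    return false;
  }

  String getCurrent() {
    return current;
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    List<List<String>> partitions = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
    try (MergedRowsSketch rows = new MergedRowsSketch(partitions, 2)) {
      while (rows.next()) {
        System.out.println(rows.getCurrent()); // order across partitions is not guaranteed
      }
    }
  }
}
```

A single-query reader has one producer and one end-of-stream signal, so none of the counting above is needed there, which is consistent with the point that sharing code would not simplify either class.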

The main difference between how we implement partitioned queries in the Java client and how we implement them in the Java Connection API is that the Connection API is aimed at supporting standard APIs (JDBC and PostgreSQL) that do not know anything about partitioned queries. That means that:

Both need to be able to access the feature through SQL statements, as anyone who is using these APIs through a generic tool will not know how to call any custom methods.

Both need all results to be returned as ResultSets.

In addition, we decided to add a convenience method for executing a query directly as a partitioned query to the Connection API (the executePartitionedQuery(...) method and RUN PARTITIONED QUERY SQL statement). This is also needed to use this feature in generic tools that have no idea how to send partition tokens to other hosts. This could also be of interest to the Java client, but for the time being I'm not pushing for that, as the direct request at the moment is to add this feature for JDBC. That means that:

The partitionQuery(..) method in the Connection API is a simple wrapper around the partitionQuery(..) method in the BatchClient, but it returns the results as a ResultSet with encoded strings instead of a List<Partition>.

The runPartition(..) method in the Connection API is a simple wrapper around the runPartition(..) method in the BatchClient, but it takes an encoded string instead of a Partition as an input argument, as a ResultSet cannot contain a Partition.

The MergedResultSet and corresponding runPartitionedQuery(..) methods could at a later moment easily be moved to the Java client and used there as well without causing a breaking change. But I would suggest that we wait with that until there is an actual demand for it.

* chore: add ClientSideStatementPartitionExecutor to SpannerFeature
* chore: wrap AbstractStatementParser static initialization in try/catch
* chore: add ClientSideStatementRunPartitionExecutor to SpannerFeature
* chore: add ClientSideStatementRunPartitionedQueryExecutor to SpannerFeature
* chore: lint formatting

feat: support partitioned queries + data boost in Connection API

052688a


product-auto-label bot added the size: xl and api: spanner labels Jul 24, 2023

olavloite added 8 commits July 24, 2023 15:51

fix: match the correct group in regex

9817586

Merge branch 'main' into batch-read-connection-api

dc91c9a

feat: add more SQL statements for partitioned queries

bca9b12

Merge branch 'parsed-statement-to-client-executors' into batch-read-connection-api

56dfb75

chore: simplify test

f0e9f5b

Merge branch 'parsed-statement-to-client-executors' into batch-read-connection-api

af211d3

chore: cleanup differences

090868c

olavloite changed the base branch from main to parsed-statement-to-client-executors July 28, 2023 14:25

chore: cleanup unrelated changes

31ba134

olavloite marked this pull request as ready for review July 28, 2023 14:53

olavloite requested review from a team as code owners July 28, 2023 14:53

olavloite mentioned this pull request Jul 28, 2023

chore: refactor client side statements to accept the entire parsed statement #2556

Merged

olavloite requested a review from arpan14 July 28, 2023 14:55

olavloite added 8 commits July 28, 2023 17:26

fix: update converter name

6960e11

test: add more tests

549e870

chore: add missing license header

0b5269e

fix: handle empty partitioned queries correctly

d5248cc

fix: do not use any random staleness for partitioned queries

a461c39

fix: only return false for next() if all have finished

13541b1

chore: rename to autoPartitionMode

3ae95e3

chore: rename sql statements + add tests for empty results

ed6a34e

Base automatically changed from parsed-statement-to-client-executors to main August 2, 2023 10:03

olavloite and others added 2 commits August 2, 2023 12:08

Merge branch 'main' into batch-read-connection-api

1b39e8c

🦉 Updates from OwlBot post-processor

9e6dc6e

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

arpan14 reviewed Aug 3, 2023


chore: address review comments

89889ee

arpan14 approved these changes Aug 4, 2023


olavloite added the kokoro:force-run label Aug 4, 2023

yoshi-kokoro removed the kokoro:force-run label Aug 4, 2023

(The same add/remove cycle was repeated two more times on Aug 4, 2023.)

olavloite merged commit 4e31d04 into main Aug 4, 2023
21 of 22 checks passed

olavloite deleted the batch-read-connection-api branch August 4, 2023 20:26

release-please bot mentioned this pull request Aug 4, 2023

chore(main): release 6.45.0 #2560

Merged
