feat(bigquery/storage/managedwriter): support variadic appends #5102
Conversation
LGTM from a Go perspective. I will let Tim sign off from the BQ perspective.
A few questions.
@@ -66,7 +67,9 @@ func (ar *AppendResult) GetResult(ctx context.Context) (int64, error) {
 // append request.
 type pendingWrite struct {
 	request *storagepb.AppendRowsRequest
 	result  *AppendResult
+	// for schema evolution cases, accept a new schema
+	newSchema *descriptorpb.DescriptorProto
Will this property need to be removed once 205756033 is resolved? It might be redundant with AppendRowsRequest.proto_rows.writer_schema.proto_descriptor.
Or am I understanding that email thread correctly, that we won't be exposing the underlying AppendRowsRequest to users?
In this veneer I'm wrapping the append request. My expectation is we'll end up with per-stream caches of schema and per-stream append queues for the multiplexing case. Currently we just retain a single schema and append queue (though in Go it's a channel rather than a queue).
Is this also the reason why you can optionally have a new schema updated as part of a return error? I was puzzled how one can automatically handle schema updates at runtime, though. Is there any realistic example of this?
Yeah, the returned schema is the notification of schema extension. It's a case where you do something like an ALTER TABLE ADD COLUMN or extend schema via tables.update while streaming data, and the change gets acknowledged by the streaming backend by setting the new schema. My expectation is we'll add a callback registration for this as well, but not in this change.
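A rough sketch of that flow, assuming the UpdateSchemaDescriptor() option this PR introduces, a hypothetical examplepb.RowV2 message compiled against the extended schema, and the ms/ctx variables from the tests below:

// After e.g. an ALTER TABLE ADD COLUMN, derive a descriptor from a message
// type that knows about the new column (examplepb.RowV2 is hypothetical).
m2 := &examplepb.RowV2{
	Name:  proto.String("a row"),
	Extra: proto.String("value for the newly added column"),
}
dp := protodesc.ToDescriptorProto(m2.ProtoReflect().Descriptor())

b, err := proto.Marshal(m2)
if err != nil {
	// handle marshal error
}
// Attach the updated schema to an append; the veneer retains it for
// subsequent appends on this stream.
results, err := ms.AppendRows(ctx, [][]byte{b}, managedwriter.UpdateSchemaDescriptor(dp))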
// setup a new stream.
ms, err := mwClient.NewManagedStream(ctx,
	WithDestinationTable(fmt.Sprintf("projects/%s/datasets/%s/tables/%s", testTable.ProjectID, testTable.DatasetID, testTable.TableID)),
Interesting. So the users don't see the stream ID, either? I guess that gives us enough flexibility for the current schema evolution workaround.
There are effectively two paths to getting a managed stream: allow NewManagedStream() to deal with the stream construction by specifying table/type/etc., or do it yourself and pass the stream in via the WithStreamName() option.
I expect users who explicitly do stream construction to be more likely to be doing dynamic proto schema stuff, as one of the things you get back from stream creation is table schema.
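A minimal sketch of the two paths, using the option names from this PR; the stream type constant and the schema-descriptor option usage are assumptions based on the surrounding discussion:

// Path 1: let NewManagedStream construct the stream from table/type options.
ms, err := mwClient.NewManagedStream(ctx,
	managedwriter.WithDestinationTable("projects/myproj/datasets/mydataset/tables/mytable"),
	managedwriter.WithType(managedwriter.PendingStream), // assumed constant name
	managedwriter.WithSchemaDescriptor(descriptorProto),
)

// Path 2: create the stream yourself (the create response also carries the
// table schema) and hand its resource name to the managed stream.
ms2, err := mwClient.NewManagedStream(ctx,
	managedwriter.WithStreamName(streamName),
	managedwriter.WithSchemaDescriptor(descriptorProto),
)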
The only thing I was wondering is why, in a typed language, this API has us pass the table as a formatted string. Wouldn't it make more sense to request three separate parameters, or some kind of struct if you want to make parts of it optional? Having to format it ourselves feels a bit weird. I mean, it's possible, and I've done so, but it feels odd.
Two reasons: this is a bit of a standard in the APIs (see aip.dev for more context), and we want to avoid the circular dependencies that would come from having the managedwriter depend on cloud.google.com/go/bigquery directly.
#5017 will make it easier to generate the string for the table resources if you end up using this option from a bigquery resource.
	return ms.arc, ms.pending, nil
}
if arc != ms.arc && forceReconnect && ms.arc != nil {
	// TODO: is closing send sufficient?
Was this TODO verified by an integration test?
Or do we need to create a new stream if it's not the default stream?
// The format of the row data is binary serialized protocol buffer bytes. The message must be compatible
// with the schema currently set for the stream.
//
// Use the sentinel value NoStreamOffset to omit sending of the offset value.
From my testing, users must send an offset when using PENDING mode, and they can't send one when using the default stream, right? I didn't test with BUFFERED, so maybe it's optional there? Maybe there's a better way to communicate when NoStreamOffset should be set?
Interestingly enough, this gets to the other breaking change I've been considering here: remove offset as a static argument from the AppendRows() function and add a WithOffset() AppendOption.
The origin of the NoStreamOffset sentinel was to simplify the AppendResult, as Go's lack of null vs. default values makes it more complex to deal with the optional offset. I didn't want to do *int64, but it's an option to consider as well.
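For reference, the sentinel approach amounts to something like this (the exact constant value is the library's choice; -1 here is an assumption):

// NoStreamOffset marks an append that carries no explicit offset, avoiding a
// *int64 in the public surface at the cost of reserving one int64 value.
const NoStreamOffset int64 = -1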
Given this is already a kind of managed writer, I wonder if this cannot be abstracted away completely? But I guess it needs to be exposed so as not to force a particular way of handling errors in a retryable manner? Either way, I don't find the NoStreamOffset sentinel value a big deal; it works fine.
Went ahead and made the offset part of the variadic options.
This changes the signatures for appending to:
- No offset set: <ManagedStream>.AppendRows(ctx, data)
- Offset set: <ManagedStream>.AppendRows(ctx, data, WithOffset(offset))
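Putting it together, an append plus result check on an explicitly created stream might look like this sketch; GetResult() appears in the diff above, while the [][]byte data shape and the slice of AppendResult are assumptions about this era of the API:

// Enqueue an append at an explicit offset, then block for the backend's ack.
results, err := ms.AppendRows(ctx, [][]byte{rowBytes}, managedwriter.WithOffset(curOffset))
if err != nil {
	// handle enqueue error
}
ackedOffset, err := results[0].GetResult(ctx)
if err != nil {
	// handle append failure (possibly retryable)
}
_ = ackedOffset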
	Value: proto.Int64(180),
	Other: proto.String("hello evolution"),
}
descriptorProto = protodesc.ToDescriptorProto(m2.ProtoReflect().Descriptor())
This I find one of the harder parts for a beginner starting to make use of proto models for streaming into the Storage API. It's quite a chain of calls, and not really something you would figure out by just looking at the API; the only way I learned how to do this was by checking these examples. I wonder if there isn't an easier way to just pass in the generated proto type somehow. Dunno. I haven't found an easy way myself for my bqwriter wrapper, otherwise I would have already done it.
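For what it's worth, the whole chain fits in one helper; a sketch that should cover simple, self-contained messages (nested or imported types may need extra normalization):

import (
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/types/descriptorpb"
)

// descriptorFor derives the DescriptorProto the write API expects from any
// protoc-generated message: generated type -> reflection descriptor -> proto form.
func descriptorFor(m proto.Message) *descriptorpb.DescriptorProto {
	return protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())
}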
	return ms.arc, ms.pending, nil
}
if arc != ms.arc && forceReconnect && ms.arc != nil {
	// TODO: is closing send sufficient?
	(*ms.arc).CloseSend()
This smells fishy, or is it just me?
It's a temporary issue until the backend allows schema change on an already open stream connection; I'm not enamored of it but this will get cleaned up.
AppendOption design LGTM, but one worry:
By hiding the fact that we're recreating the stream, does it make the use of offset harder to understand / make it a breaking change when the API no longer requires a stream reset? Though, perhaps there's some signal we send at the end of the stream so that folks don't need to know specifically that a schema change could cause that?
Management of the network stream connection is abstracted from the user. This PR uses a CloseSend() on the network stream to cleanly signal a new connection: existing appends in flight will still process on the recv side, and the next append (either new or due to a retry) will pick up a new connection. Schema change notification from the backend isn't currently in this veneer (but will come in a future PR).
If you change the schema in a compatible way, retrying an old proto message with a new schema that has additional fields/tags shouldn't be an issue; that's the power of proto extension in a nutshell. If the table is changed to an incompatible schema, the stream itself is invalid (for explicitly created streams). Default streams are a special case here, but essentially a similar metadata inconsistency to what the existing tabledata.insertall surfaces when schema changes arrive.
LGTM after clarifying offline that it's reconnecting but not creating a new (backend) stream. Hooray for ambiguous streams 🙄
BREAKING CHANGE: adds a variadic option to the AppendRows() method, removes offset argument
This updates the call signature to allow variadic appends, and introduces two new AppendOption options: one for setting the offset in an optional fashion (WithOffset()), and one for updating the schema (UpdateSchemaDescriptor()). Due to current API limitations, passing a schema update means we need to close/reconnect the open connection; this should eventually be resolved in the backend. Internal issue 205756033 tracks this.
In practice this means the following changes need attention by consumers of this library:
For the "don't set an offset" behavior:
mystream.AppendRows(ctx, data, managedwriter.NoStreamOffset)
mystream.AppendRows(ctx, data)
For the "set an an offset" behavior:
mystream.AppendRows(ctx, data, offset)
mystream.AppendRows(ctx, data, managedwriter.WithOffset(offset))
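Because the options are variadic they also compose; for example, an append that both sets an offset and signals a schema change (a sketch; names and the result shape are illustrative):

results, err := mystream.AppendRows(ctx, data,
	managedwriter.WithOffset(offset),
	managedwriter.UpdateSchemaDescriptor(newDescriptor),
)
if err != nil {
	// handle enqueue error
}
_, err = results[0].GetResult(ctx) // block for the append acknowledgement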