
feat(bigquery): expose Apache Arrow data through ArrowIterator #8506

Merged: 13 commits into googleapis:main, Oct 23, 2023

Conversation

alvarowolfx
Contributor

We have some planned work to support Arrow data fetching on other query APIs, so we need an interface that can support all of those query paths and also serve as a base for other Arrow projects like ADBC. This PR detaches the Storage API from the Arrow decoder and creates a new ArrowIterator interface. The new interface is implemented by the Storage iterator and can later be implemented for other query interfaces that support Arrow.

Resolves #8100
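For orientation, here is a minimal sketch of the interface shape this description implies; the method names below are an approximation drawn from this thread, not a verified listing of the merged API.

```go
// Sketch only: an iterator over Arrow data that is independent of the
// Storage Read API, so other query paths (and projects such as ADBC)
// can implement it. Method names are approximate.
type ArrowIterator interface {
	// Next returns the next serialized Arrow record batch, or
	// iterator.Done once all batches have been consumed.
	Next() (*ArrowRecordBatch, error)
	// Schema reports the BigQuery schema of the result set.
	Schema() Schema
	// SerializedArrowSchema returns the Arrow schema in IPC form,
	// suitable for constructing an ipc.Reader.
	SerializedArrowSchema() []byte
}
```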

@product-auto-label bot added labels on Aug 29, 2023: size: l (Pull request size is large), api: bigquery (Issues related to the BigQuery API).
t.Fatal("expected stream to be done")
}
}

func TestIntegration_StorageReadArrow(t *testing.T) {
Contributor Author


@k-anshul @zeroshade this integration test shows an example of how that interface would be used.

@alvarowolfx marked this pull request as ready for review on August 29, 2023 21:41
@alvarowolfx requested review from a team as code owners on August 29, 2023 21:41
@zeroshade

@alvarowolfx I'm gonna try to get a thorough review of this in the next day or two from the Arrow perspective. Thanks for doing this!

Comment on lines 457 to 459
r, err := ipc.NewReader(&arrowIteratorReader{
it: arrowIt,
})


can we have an interface that doesn't require consumers to implement their own arrowIteratorReader?

Comment on lines 204 to 206
wg := sync.WaitGroup{}
wg.Add(len(streams))
sem := semaphore.NewWeighted(int64(it.session.settings.maxWorkerCount))


would it be feasible to expose the streams themselves to a consumer to allow them to control the parallelization instead of forcing them to a specific route like this?

@zeroshade

The new ArrowIteratorReader looks good to me, so this seems pretty good from my perspective, though it would be nice to expose the raw streams to the consumer directly, allowing them to control parallelization and possibly partition downloading. But that's certainly something for a future enhancement, I think.
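For context, a sketch of how a consumer could drive the ArrowIteratorReader mentioned here to turn an ArrowIterator into arrow.Record values; the constructor name NewArrowIteratorReader and the arrow module version in the import path are assumptions from this thread, so check the released package docs.

```go
import (
	"io"

	"cloud.google.com/go/bigquery"
	"github.com/apache/arrow/go/v12/arrow" // version path is illustrative
	"github.com/apache/arrow/go/v12/arrow/ipc"
)

// collectRecords drains an ArrowIterator into in-memory records. The caller
// owns the returned records and must call Release on each of them.
func collectRecords(arrowIt bigquery.ArrowIterator) ([]arrow.Record, error) {
	r, err := ipc.NewReader(bigquery.NewArrowIteratorReader(arrowIt))
	if err != nil {
		return nil, err
	}
	defer r.Release()

	var recs []arrow.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // the ipc.Reader releases its current record on the next Next()
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil && err != io.EOF {
		return nil, err
	}
	return recs, nil
}
```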

@product-auto-label bot added the stale: old label (Pull request is old and needs attention) on Sep 29, 2023.
if err == io.EOF {
batch, err := r.it.Next()
if err == iterator.Done {
return -1, io.EOF
Contributor


Reading the io.Reader interface contract suggests we should return 0 rather than a negative value.

Contributor Author


good point, just sent a fix
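A minimal sketch of the corrected Read path under discussion; the type and field names (it, buf) and the shape of Next are illustrative, not copied from the PR.

```go
import (
	"bytes"
	"io"

	"google.golang.org/api/iterator"
)

// Sketch only: buf starts as an empty, non-nil bytes.Buffer, and it.Next()
// is assumed to yield the raw IPC bytes of one batch, or iterator.Done.
type arrowIteratorReader struct {
	it  interface{ Next() ([]byte, error) }
	buf *bytes.Buffer
}

// Read implements io.Reader over the iterator's serialized record batches.
// Per the io.Reader contract, end of data is reported as (0, io.EOF);
// a negative count is never valid.
func (r *arrowIteratorReader) Read(p []byte) (int, error) {
	n, err := r.buf.Read(p)
	if err == io.EOF {
		batch, err := r.it.Next()
		if err == iterator.Done {
			return 0, io.EOF // was: return -1, io.EOF
		}
		if err != nil {
			return 0, err
		}
		r.buf = bytes.NewBuffer(batch)
		return r.buf.Read(p)
	}
	return n, err
}
```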

numrec := 0
for r.Next() {
rec := r.Record()
rec.Retain()
Contributor


I'm confused by the releases/retains here, but I've not been spending much time with Arrow recently. If you retain individual records, do you need to release them?

Contributor Author


They are going to be released by the ipc.Reader on the r.Release() call right after.


not exactly. If you call Retain on an individual record, then you will need to call Release on that record.

The ipc.Reader keeps only the current Record, reusing that member. When you call Next() it will release the record it had before loading the next one. This is why you need to call Retain on the records that you put into the slice, so that they aren't deallocated by the ipc.Reader calling Release on them. However, you should also add a defer rec.Release() in the loop to ensure each record gets released; the ipc.Reader will not retain any references to those records and therefore will not call Release on them.
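A short sketch of that ownership pattern, assuming the records are collected and consumed inside a single function (much like the integration test):

```go
import (
	"io"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/ipc"
)

// countRows shows the pattern: each record is Retained so it survives the
// reader's next call to Next(), and the deferred Release pairs that Retain
// once this function has finished with the slice.
func countRows(r *ipc.Reader) (int64, error) {
	var (
		recs []arrow.Record
		n    int64
	)
	for r.Next() {
		rec := r.Record()
		rec.Retain()        // the reader releases its copy on the next Next()
		defer rec.Release() // frees the record when countRows returns
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil && err != io.EOF {
		return 0, err
	}
	for _, rec := range recs {
		n += rec.NumRows()
	}
	return n, nil
}
```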

Contributor Author


interesting, I didn't know that. I'll make the changes to call rec.Release on each record.


If you want to verify that everything is released/retained correctly, you could use memory.CheckedAllocator and defer mem.AssertSize(t, 0), then pass the checked allocator to everything (like ipc.NewReader) so that it is used for all the memory allocations.

Not absolutely necessary, but an optional way to add some assurances if desired.
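A sketch of the leak-checking setup suggested here, assuming a test that already has some source of Arrow IPC bytes (ipcStream below is a placeholder, not a real variable in the PR):

```go
import (
	"testing"

	"github.com/apache/arrow/go/v12/arrow/ipc"
	"github.com/apache/arrow/go/v12/arrow/memory"
)

func TestArrowReadNoLeak(t *testing.T) {
	mem := memory.NewCheckedAllocator(memory.DefaultAllocator)
	defer mem.AssertSize(t, 0) // fails the test if any allocation was never released

	// ipcStream is a placeholder io.Reader carrying Arrow IPC data.
	r, err := ipc.NewReader(ipcStream, ipc.WithAllocator(mem))
	if err != nil {
		t.Fatal(err)
	}
	defer r.Release()

	for r.Next() {
		_ = r.Record().NumRows() // consume without retaining; the reader releases it
	}
}
```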

Contributor Author


I'm gonna add support for changing the allocator just internally for now, for test purposes; I liked the idea of verifying that there are no memory leaks. Thanks for the tip.

var totalFromArrow int64
for tr.Next() {
rec := tr.Record()
vec := array.NewInt64Data(rec.Column(1).Data())


I'm confused here, why array.NewInt64Data(rec.Column(1).Data())? Why not just rec.Column(1).(*array.Int64)?

You're performing an additional allocation here when you don't need to be, and it's also a potential memory leak since this new array instance will not get released when the record is released.

Contributor Author


I didn't know that rec.Column here could be type-asserted to *array.Int64 directly. It's always hard to know when you can do those conversions directly, so I was trying to use the API methods instead. I'll make the change here.


Yea, rec.Column returns an arrow.Array interface value, so as long as you know the column is an int64 column, you can type-assert it to *array.Int64. Alternatively, you can do a type switch or check the data type before doing the assertion. The intent is to minimize copying and allocations when handling arrays and using them for records.
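A short sketch of the change being discussed, as it would appear inside the test loop shown above (the null check is an illustrative addition):

```go
// Before (extra allocation, and the new array is never released):
//   vec := array.NewInt64Data(rec.Column(1).Data())
// After: assert the column's concrete type; it remains owned by the record.
col, ok := rec.Column(1).(*array.Int64)
if !ok {
	t.Fatalf("column 1: want *array.Int64, got %T", rec.Column(1))
}
for i := 0; i < col.Len(); i++ {
	if col.IsValid(i) {
		totalFromArrow += col.Value(i)
	}
}
```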

}
totalFromSQL := sumValues[0].(int64)

tr := array.NewTableReader(arrowTable, arrowTable.NumRows())


add defer tr.Release() please.

Contributor Author


After adding support for changing the allocator, I found that a memory leak was happening without the tr.Release. Good catch, and awesome tips on how to catch those leaks 🎉

arrowSchema *arrow.Schema
}

func newArrowDecoder(arrowSerializedSchema []byte, schema Schema) (*arrowDecoder, error) {


It would probably be worthwhile (though it could certainly be done as a follow-up) to allow passing a memory.Allocator here, stored in the arrowDecoder, so a user can configure how memory gets allocated for the Arrow batches (it would be passed as ipc.WithAllocator(mem) to ipc.NewReader).

In most cases users would probably just use memory.DefaultAllocator, but depending on the constraints of the system they might want a custom allocator, such as a malloc-based allocator that uses C memory to avoid garbage collection passes, or any other custom allocation scheme for specialized situations.

The other benefit of this would be that you could use memory.CheckedAllocator in unit tests to verify that everything properly has Release called if necessary, etc.
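A hedged sketch of the follow-up shape being suggested: newArrowDecoder takes a memory.Allocator (nil meaning memory.DefaultAllocator) and threads it through to ipc.NewReader. The allocator parameter and field are hypothetical and do not exist in the merged PR.

```go
func newArrowDecoder(arrowSerializedSchema []byte, schema Schema, mem memory.Allocator) (*arrowDecoder, error) {
	if mem == nil {
		mem = memory.DefaultAllocator
	}
	// ... existing schema parsing elided ...
	return &arrowDecoder{
		arrowSchema: parsedSchema, // parsed from arrowSerializedSchema, as in the PR
		allocator:   mem,          // hypothetical field
	}, nil
}

func (ap *arrowDecoder) createIPCReaderForBatch(arrowRecordBatch *ArrowRecordBatch) (*ipc.Reader, error) {
	return ipc.NewReader(arrowRecordBatch,
		ipc.WithSchema(ap.arrowSchema),
		ipc.WithAllocator(ap.allocator), // the suggested change
	)
}
```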

buf.Write(serializedArrowRecordBatch)
return ipc.NewReader(buf, ipc.WithSchema(ap.arrowSchema))
func (ap *arrowDecoder) createIPCReaderForBatch(arrowRecordBatch *ArrowRecordBatch) (*ipc.Reader, error) {
return ipc.NewReader(arrowRecordBatch, ipc.WithSchema(ap.arrowSchema))


As above, it would be great (but could be a follow-up) if either newArrowDecoder accepted a memory.Allocator that would get used here as ipc.WithAllocator(mem), or if this method optionally took an allocator (defaulting to memory.DefaultAllocator if nil was passed).


@zeroshade zeroshade left a comment


Added a couple of comments, but overall this looks good to me, though I would like to point to #8506 (comment) for possible consideration.

ADBC allows a query to return a result set via partition identifiers (arbitrary byte blobs for a given driver) and then lets the consumer retrieve the streams of data from each partition in parallel. Since BigQuery can already return the data as multiple streams, there could be a benefit to allowing this Arrow iterator to return the raw underlying streams of IPC data, instead of only exposing an interface over the collected streams with the parallelization handled in here rather than by the consumer.

It's not something that I believe should block this PR, but should possibly be picked up in a follow-up as an enhancement.
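To make the suggestion concrete, one purely hypothetical shape such a follow-up could take (none of these names exist in the merged PR):

```go
// Hypothetical only: expose the session's underlying read streams so a
// consumer (for example an ADBC driver) can fan out downloads itself.
type ArrowStreamIterator interface {
	// Streams returns one reader per Storage Read API stream; each yields
	// Arrow IPC bytes and may be consumed concurrently with the others.
	Streams() []io.Reader
}
```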

@alvarowolfx
Contributor Author

I'll create another issue to keep track of this request as a future enhancement. Thanks for the deeper review on the PR; I'm gonna push some improvements based on the comments.


@alvarowolfx added the automerge label (Merge the pull request once unit tests and other checks pass) on Oct 23, 2023
@gcf-merge-on-green bot merged commit c8e7692 into googleapis:main on Oct 23, 2023 (9 checks passed)
@gcf-merge-on-green bot removed the automerge label on Oct 23, 2023
gcf-merge-on-green bot pushed a commit that referenced this pull request Oct 30, 2023
🤖 I have created a release *beep* *boop*
---


## [1.57.0](https://togithub.com/googleapis/google-cloud-go/compare/bigquery/v1.56.0...bigquery/v1.57.0) (2023-10-30)


### Features

* **bigquery/biglake:** Promote to GA ([e864fbc](https://togithub.com/googleapis/google-cloud-go/commit/e864fbcbc4f0a49dfdb04850b07451074c57edc8))
* **bigquery/storage/managedwriter:** Support default value controls ([#8686](https://togithub.com/googleapis/google-cloud-go/issues/8686)) ([dfa8e22](https://togithub.com/googleapis/google-cloud-go/commit/dfa8e22edf560211ae2a2ebf1f9a23b86887c7be))
* **bigquery:** Expose Apache Arrow data through ArrowIterator  ([#8506](https://togithub.com/googleapis/google-cloud-go/issues/8506)) ([c8e7692](https://togithub.com/googleapis/google-cloud-go/commit/c8e76923621b379fb7deb6dfb944011af1d980bd)), refs [#8100](https://togithub.com/googleapis/google-cloud-go/issues/8100)
* **bigquery:** Introduce query preview features ([#8653](https://togithub.com/googleapis/google-cloud-go/issues/8653)) ([f29683b](https://togithub.com/googleapis/google-cloud-go/commit/f29683bcd06567e4fc2d404f53bedbea5b5f0f90))


### Bug Fixes

* **bigquery:** Handle storage read api Recv call errors ([#8666](https://togithub.com/googleapis/google-cloud-go/issues/8666)) ([c73963f](https://togithub.com/googleapis/google-cloud-go/commit/c73963f64ef667daa8a33a5a4cc2156818fc6914))
* **bigquery:** Update golang.org/x/net to v0.17.0 ([174da47](https://togithub.com/googleapis/google-cloud-go/commit/174da47254fefb12921bbfc65b7829a453af6f5d))
* **bigquery:** Update grpc-go to v1.56.3 ([343cea8](https://togithub.com/googleapis/google-cloud-go/commit/343cea8c43b1e31ae21ad50ad31d3b0b60143f8c))
* **bigquery:** Update grpc-go to v1.59.0 ([81a97b0](https://togithub.com/googleapis/google-cloud-go/commit/81a97b06cb28b25432e4ece595c55a9857e960b7))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).
bhshkh pushed a commit that referenced this pull request Nov 3, 2023
bhshkh pushed a commit that referenced this pull request Nov 3, 2023
@Yifeng-Sigma

I'm wondering if there are plans to make arrowDecoder public so we can obtain arrow.Record. Currently, it seems there's no way to get arrow.Record.
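For reference, a sketch of how arrow.Record values can be obtained with what this PR exposes, without arrowDecoder being public; the RowIterator accessor name below is an assumption, so verify it against the released package docs.

```go
it, err := query.Read(ctx) // query is a *bigquery.Query
if err != nil {
	return err
}
arrowIt, err := it.ArrowIterator() // accessor name assumed; check the released API
if err != nil {
	return err
}
r, err := ipc.NewReader(bigquery.NewArrowIteratorReader(arrowIt))
if err != nil {
	return err
}
defer r.Release()
for r.Next() {
	rec := r.Record() // an arrow.Record; Retain it if kept beyond this iteration
	_ = rec
}
```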

Labels: api: bigquery (Issues related to the BigQuery API), size: l (Pull request size is large), stale: old (Pull request is old and needs attention)
Projects: none
Development: successfully merging this pull request may close the issue "bigquery: Expose Apache Arrow data"
4 participants