
Zero column projection not handled correctly #73

Closed

aray opened this issue Oct 24, 2019 · 7 comments

@aray

aray commented Oct 24, 2019

When a simple count(*) is done on a table, Spark pushes down a zero-column projection. This connector passes that empty list straight to TableReadOptions here:

https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/src/main/scala/com/google/cloud/spark/bigquery/direct/DirectBigQueryRelation.scala#L79

However, the API states "If empty, all fields will be read"; see https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1#google.cloud.bigquery.storage.v1beta1.TableReadOptions

The result is that all columns are read for a simple table count.
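For illustration, a minimal sketch of the problematic pattern, assuming the v1beta1 Java client's TableReadOptions builder (the surrounding code is illustrative, not the connector's exact source):

```scala
import scala.collection.JavaConverters._
import com.google.cloud.bigquery.storage.v1beta1.ReadOptions.TableReadOptions

// Illustrative sketch: when Spark pushes down a zero-column projection,
// requiredColumns is empty, and per the API docs an empty selected-fields
// list means "read all fields", the opposite of what the pushdown intends.
def buildReadOptions(requiredColumns: Seq[String]): TableReadOptions =
  TableReadOptions.newBuilder()
    .addAllSelectedFields(requiredColumns.asJava) // empty => all columns!
    .build()
```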

@superhadooper

Here is an example Scala snippet that reproduces the issue:
spark.read.bigquery("<large_table>").select("_col0").count()

Using the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar), a network traffic monitor clearly shows that all columns are pulled into Spark (vs. zero columns).
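An expanded, self-contained version of the repro for a spark-shell session (table name is a placeholder; the implicit bigquery reader comes from the connector's package object):

```scala
import com.google.cloud.spark.bigquery._

// Placeholder table; any sufficiently wide table makes the effect obvious.
val df = spark.read.bigquery("project.dataset.large_table")

// At most one column's data should cross the wire for this count,
// but with the affected versions every column is streamed.
df.select("_col0").count()
```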

@davidrabinowitz davidrabinowitz self-assigned this Oct 24, 2019
@davidrabinowitz
Member

While debugging this issue, it seems that Spark does not always provide those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed-down columns and filters (the WHERE conditions) are more transparent.

@aray
Author

aray commented Jan 18, 2020

Do you have a reproducible example where Spark does not provide the columns?

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jan 27, 2020
…s and filters received from spark and pushed down to bigquery
davidrabinowitz added a commit that referenced this issue Jan 29, 2020
* General cleanups, removing scala-logging due to dependency issues on Databricks runtime. (PR #27)
* Connector now verifies Scala version compatibility, some assembly bug fixes
* Issue #73: Adding additional logging of the columns and filters received from spark and pushed down to bigquery
* Updating to Scala 2.12.10, fixing integration test dependencies
* Extending integration test timeout due to running on cloudbuild
@davidrabinowitz
Member

Version 0.12.0-beta adds logging of the columns and filters the connector receives from the Spark DataSource API and pushes down to BigQuery.
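To surface that logging in a spark-shell session, one option is to raise the log level for the connector's package (assuming it logs under com.google.cloud.spark.bigquery; adjust to your logging setup):

```scala
import org.apache.log4j.{Level, Logger}

// Assumption: the connector logs under this package name.
Logger.getLogger("com.google.cloud.spark.bigquery").setLevel(Level.DEBUG)
```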

@superhadooper Can you please try again?
@aray as mentioned in the README and as you found out, the BigQuery Storage API does not allow a zero-column projection. The suggested workaround is to select the smallest field for the count, for minimal data transfer. The new logging should help clarify what is being pushed down.
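A sketch of that workaround (table and column names are placeholders):

```scala
// Select a single small column so only that column is streamed;
// DataFrame.count() still counts every row.
val n = spark.read.bigquery("project.dataset.table")
  .select("tiny_col") // placeholder: the smallest column in the table
  .count()
```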

@aray
Author

aray commented Feb 1, 2020

@davidrabinowitz Thanks for updating the README with a workaround. I would quickly note, though, that count(col) is only equivalent to count(*) if col is non-null.
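A quick illustration of that distinction, using standard Spark SQL semantics in a spark-shell session:

```scala
import spark.implicits._

val df = Seq(Some(1), None, Some(3)).toDF("col")
df.createOrReplaceTempView("t")

spark.sql("SELECT count(*) FROM t").show()   // 3: counts all rows
spark.sql("SELECT count(col) FROM t").show() // 2: NULLs are skipped
```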

I see two logical ways to solve this.

  1. Push the requirement upstream that the BigQuery Storage API needs to support zero-column projections. Since the API is still in beta, maybe there is still a chance to change it?
  2. Special-case zero-column projections in this connector: run select count(*) from $t where $f in BigQuery and then generate the given number of empty rows (see the sketch below).
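A rough sketch of what option 2 could look like inside the connector; runCountQuery and the SQL string are hypothetical, not the connector's actual API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

// Hypothetical: on a zero-column projection, ask BigQuery for the row
// count and emit that many empty rows instead of scanning any data.
def emptyProjectionRDD(sqlContext: SQLContext, countSql: String,
                       runCountQuery: String => Long): RDD[Row] = {
  val numRows = runCountQuery(countSql) // e.g. "select count(*) from $t where $f"
  sqlContext.sparkContext.range(0, numRows).map(_ => Row.empty)
}
```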

Do you see any other options?

To your prior comment:

it seems that not always spark provides those fields to the connector

I'm curious, because the column projection pushdown is used by many other sources, so if you can reproduce that, it's a bug in Spark that needs to be fixed.

For reference, the Spark ORC source had a similar issue with zero-column projections that I fixed a little over three years ago: apache/spark#15898

@davidrabinowitz
Member

@aray Thanks for your notes. Please notice that df.select(col).count() does not necessarily mean count(col), as Spark seems to count the rows regardless of their content.

I have tried to implement special treatment for count, and it would have worked if this were an RDD rather than a DataFrame. With a DataFrame the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level.

I can't seem to find the case where I had issues with column projections; perhaps caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Feb 12, 2020
davidrabinowitz added a commit that referenced this issue Feb 12, 2020
* Issue #73: Optimized empty projection read
* Changed the `parallelism` parameter in order to reflect the change in the underlying API
@davidrabinowitz
Member

davidrabinowitz commented Feb 12, 2020

Should work now.
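For anyone verifying, a minimal check after upgrading (table name is a placeholder); a plain count on a wide table should now complete without streaming every column:

```scala
// With 0.12.0-beta and later, this count should take the optimized
// empty-projection path rather than a full-table read.
val n = spark.read
  .format("bigquery")
  .option("table", "project.dataset.large_table")
  .load()
  .count()
```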
