
Zero column projection not handled correctly #73

Closed

aray opened this issue Oct 24, 2019 · 7 comments

@aray

aray commented Oct 24, 2019

When a simple count(*) is done on a table, Spark pushes down a zero-column projection. This connector passes that empty list straight to TableReadOptions here:

https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/src/main/scala/com/google/cloud/spark/bigquery/direct/DirectBigQueryRelation.scala#L79

However, the API states "If empty, all fields will be read"; see https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1#google.cloud.bigquery.storage.v1beta1.TableReadOptions

The result is that all columns are read for a simple table count.
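For illustration, a minimal sketch of the problematic pattern, assuming the v1beta1 Java client's TableReadOptions builder (the surrounding code is illustrative, not the connector's exact source):

```scala
import scala.collection.JavaConverters._
import com.google.cloud.bigquery.storage.v1beta1.ReadOptions.TableReadOptions

// Illustrative sketch: when Spark pushes down a zero-column projection,
// requiredColumns is empty, and per the API docs an empty selected-fields
// list means "read all fields", the opposite of what the pushdown intends.
def buildReadOptions(requiredColumns: Seq[String]): TableReadOptions =
  TableReadOptions.newBuilder()
    .addAllSelectedFields(requiredColumns.asJava) // empty => all columns!
    .build()
```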

@superhadooper

Here is an example Scala snippet that reproduces the issue:
spark.read.bigquery("<large_table>").select("_col0").count()

Using the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar), a network traffic monitor clearly shows that all columns are pulled into Spark (vs. zero columns).
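An expanded, self-contained version of the repro for a spark-shell session (table name is a placeholder; the implicit bigquery reader comes from the connector's package object):

```scala
import com.google.cloud.spark.bigquery._

// Placeholder table; any sufficiently wide table makes the effect obvious.
val df = spark.read.bigquery("project.dataset.large_table")

// At most one column's data should cross the wire for this count,
// but with the affected versions every column is streamed.
df.select("_col0").count()
```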

@davidrabinowitz davidrabinowitz self-assigned this Oct 24, 2019
@davidrabinowitz
Member

While debugging this issue, it seems that Spark does not always provide those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed-down columns and filters (the WHERE conditions) are more transparent.

@aray
Author

aray commented Jan 18, 2020

Do you have a reproducible example where Spark does not provide the columns?

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jan 27, 2020
…s and filters received from spark and pushed down to bigquery
davidrabinowitz added a commit that referenced this issue Jan 29, 2020
* General cleanups, removing scala-logging due to dependency issues on Databricks runtime. (PR #27)
* Connector now verifies Scala version compatibility, some assembly bug fixes
* Issue #73: Adding additional logging of the columns and filters received from spark and pushed down to bigquery
* Updating to Scala 2.12.10, fixing integration test dependencies
* Extending integration test timeout due to running on cloudbuild
@davidrabinowitz
Member

Version 0.12.0-beta adds logging of the columns and filters the connector receives from the Spark DataSource API and pushes down to BigQuery.
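To surface that logging in a spark-shell session, one option is to raise the log level for the connector's package (assuming it logs under com.google.cloud.spark.bigquery; adjust to your logging setup):

```scala
import org.apache.log4j.{Level, Logger}

// Assumption: the connector logs under this package name.
Logger.getLogger("com.google.cloud.spark.bigquery").setLevel(Level.DEBUG)
```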

@superhadooper Can you please try again?
@aray as mentioned in the README and as you found out, the BigQuery Storage API does not allow a zero-column projection. The suggested workaround is to select the smallest field for the count, for minimal data transfer. The new logging should help clarify what is being pushed down.
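A sketch of that workaround (table and column names are placeholders):

```scala
// Select a single small column so only that column is streamed;
// DataFrame.count() still counts every row.
val n = spark.read.bigquery("project.dataset.table")
  .select("tiny_col") // placeholder: the smallest column in the table
  .count()
```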

@aray
Author

aray commented Feb 1, 2020

@davidrabinowitz Thanks for updating the README with a workaround. I would quickly note, though, that count(col) is only equivalent to count(*) if col is non-null.
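A quick illustration of that distinction, using standard Spark SQL semantics in a spark-shell session:

```scala
import spark.implicits._

val df = Seq(Some(1), None, Some(3)).toDF("col")
df.createOrReplaceTempView("t")

spark.sql("SELECT count(*) FROM t").show()   // 3: counts all rows
spark.sql("SELECT count(col) FROM t").show() // 2: NULLs are skipped
```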

I see two logical ways to solve this.

  1. Push the requirement upstream that the BigQuery Storage API needs to support zero-column projections. Since the API is still in beta, maybe there is still a chance to change it?
  2. Special-case zero-column projections in this connector: run select count(*) from $t where $f in BigQuery and then generate the given number of empty rows (see the sketch below).
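A rough sketch of what option 2 could look like inside the connector; runCountQuery and the SQL string are hypothetical, not the connector's actual API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

// Hypothetical: on a zero-column projection, ask BigQuery for the row
// count and emit that many empty rows instead of scanning any data.
def emptyProjectionRDD(sqlContext: SQLContext, countSql: String,
                       runCountQuery: String => Long): RDD[Row] = {
  val numRows = runCountQuery(countSql) // e.g. "select count(*) from $t where $f"
  sqlContext.sparkContext.range(0, numRows).map(_ => Row.empty)
}
```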

Do you see any other options?

To your prior comment:

it seems that not always spark provides those fields to the connector

I'm curious, because the column projection pushdown is used by many other sources, so if you can reproduce that, it's a bug in Spark that needs to be fixed.

For reference, the Spark ORC source had a similar issue with zero-column projections that I fixed a little over three years ago: apache/spark#15898

@davidrabinowitz
Member

@aray Thanks for your notes. Please notice that df.select(col).count() does not necessarily mean count(col), as Spark seems to count the rows regardless of their content.

I have tried to implement special treatment for count, and it would have worked if this were an RDD rather than a DataFrame. With a DataFrame the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level.

I can't seem to find the case where I had issues with column projections; perhaps caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Feb 12, 2020
davidrabinowitz added a commit that referenced this issue Feb 12, 2020
* Issue #73: Optimized empty projection read
* Changed the `parallelism` parameter in order to reflect the change in the underlying API
@davidrabinowitz
Member

davidrabinowitz commented Feb 12, 2020

Should work now.
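For anyone verifying, a minimal check after upgrading (table name is a placeholder); a plain count on a wide table should now complete without streaming every column:

```scala
// With 0.12.0-beta and later, this count should take the optimized
// empty-projection path rather than a full-table read.
val n = spark.read
  .format("bigquery")
  .option("table", "project.dataset.large_table")
  .load()
  .count()
```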
