Zero column projection not handled correctly #73
Here is an example Scala snippet that reproduces the issue: using the current JAR (gs://spark-lib/bigquery/spark-bigquery-latest.jar), a network traffic monitor clearly shows that all columns are pulled into Spark (instead of 0 columns).
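The original snippet was not preserved in this thread; a minimal sketch of such a repro might look like the following (the table name is illustrative, and any BigQuery table works):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zero-column-repro").getOrCreate()

// Load a table through the connector; the table name is illustrative.
val df = spark.read
  .format("bigquery")
  .option("table", "bigquery-public-data.samples.shakespeare")
  .load()

// count() pushes down a zero-column projection, yet with the current
// connector a traffic monitor shows every column being streamed.
println(df.count())
```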
Debugging this issue, it seems that Spark does not always provide those fields to the connector. In the upcoming release (0.12.0-beta) I will add additional logging so that the pushed-down columns and filters (the WHERE conditions) will be more transparent.
Do you have a reproducible example where Spark does not provide the columns?
…s and filters received from spark and pushed down to bigquery
* General cleanups, removing scala-logging due to dependency issues on Databricks runtime (PR #27)
* Connector now verifies Scala version compatibility; some assembly bug fixes
* Issue #73: Adding additional logging of the columns and filters received from Spark and pushed down to BigQuery
* Updating to Scala 2.12.10, fixing integration test dependencies
* Extending integration test timeout due to running on Cloud Build
Version 0.12.0-beta adds logging for the columns and filters it receives from the Spark DataSource API and pushes down to BigQuery. @superhadooper Can you please try again?
@davidrabinowitz Thanks for updating the README with a workaround, though I'd quickly note that I see two logical ways to solve this.
Do you see any other options? To your prior comment:
I'm curious, because column projection pushdown is used by many other sources, so if you can reproduce that, it's a bug in Spark that needs to be fixed. For reference, the Spark ORC source had a similar issue with zero-column projections that I fixed a little over 3 years ago: apache/spark#15898
@aray Thanks for your notes. Please notice that df.select(col).count() does not necessarily mean count(col), as Spark seems to read the rows regardless of their content. I have tried to give count a special treatment, and it would have worked if this were an RDD rather than a DataFrame. With a DataFrame the connector is limited to the DataSource API, and unfortunately there is no hook for performing the count() action. We haven't given up on this, and we will definitely try other approaches, including at the API level. I can't seem to find the case where I had issues with column projections; perhaps caching was involved? I have added further logging in the latest release (0.12.0-beta) to help debug such cases.
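The distinction above can be sketched as follows (df stands for any DataFrame with a nullable column c; this is an illustration, not connector code):

```scala
import org.apache.spark.sql.functions.count

// df.select("c").count() counts ALL rows, nulls included -- Spark must
// still materialize the rows, so the connector cannot rewrite it as a
// cheap count over just column c.
val rowCount = df.select("c").count()

// count(c) skips nulls; only this aggregate actually depends on the
// content of c, and only Spark (not the DataSource API) knows which
// semantics the query asked for.
val nonNullCount = df.agg(count("c")).first().getLong(0)
```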
Should work now. |
When a simple count(*) is done on a table, Spark pushes down a zero-column projection. This connector passes that empty list down to TableReadOptions here:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/src/main/scala/com/google/cloud/spark/bigquery/direct/DirectBigQueryRelation.scala#L79
However, the API documentation for this field states "If empty, all fields will be read"; see https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1#google.cloud.bigquery.storage.v1beta1.TableReadOptions
The result is that all columns are read for a simple table count.
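One way to sidestep the "empty means all fields" semantics is to guard the projection before it reaches TableReadOptions. The sketch below is illustrative only (the method name and the choice of fallback field are assumptions, not the connector's actual fix):

```scala
import com.google.cloud.bigquery.Schema
import scala.collection.JavaConverters._

// Sketch: if Spark pushes down zero columns, request a single field
// instead of an empty list, since the Storage API reads ALL fields
// when selected_fields is empty. Picking the first field is an
// illustrative placeholder; a real fix might pick the cheapest one.
def effectiveColumns(required: Array[String], schema: Schema): Seq[String] =
  if (required.nonEmpty) required.toSeq
  else schema.getFields.asScala.take(1).map(_.getName)
```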