Implemented sql_join for Google BigQuery #1244
Closed
Hello everyone! This pull request features an implementation of `sql_join` for "bigquery". (This is a resubmission of #1220, which I had to close in order to move to a different branch.)
Motivation
The major motivation for this pull request is that Google BigQuery has powerful SQL join facilities but requires a particular dialect of SQL when doing joins, such that the regular `sql_join` fails to produce a runnable query.
- (The `sql_join.DBIConnection` solution when joining two tables A and B on the common column id is to say "A JOIN B USING id", while Google BigQuery requires "A JOIN B ON A.id = B.id".)
- [...] `sql_join`.
Implementation Details
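To make the dialect difference from the Motivation section concrete, here is a minimal sketch in base R of how the two join styles could be generated from a vector of common column names. The helpers `build_on_clause` and `build_join` are hypothetical names used only for illustration; they are not part of dplyr or bigrquery and are not the code in this pull request. Note that standard SQL writes the USING column list in parentheses, which the sketch follows.

```r
# Hypothetical helper (not part of dplyr or bigrquery): builds the explicit
# ON clause from a character vector of common column names.
build_on_clause <- function(by, left = "A", right = "B") {
  conditions <- paste0(left, ".", by, " = ", right, ".", by)
  paste(conditions, collapse = " AND ")
}

# Hypothetical helper: assembles a join in either style for comparison.
# style = "on" (the default) spells out each equality; style = "using"
# emits the shorter USING form.
build_join <- function(by, left = "A", right = "B",
                       style = c("on", "using")) {
  style <- match.arg(style)
  if (style == "using") {
    paste0(left, " JOIN ", right, " USING (", paste(by, collapse = ", "), ")")
  } else {
    paste0(left, " JOIN ", right, " ON ", build_on_clause(by, left, right))
  }
}

build_join("id", style = "using")  # "A JOIN B USING (id)"
build_join("id", style = "on")     # "A JOIN B ON A.id = B.id"
build_join(c("id", "year"))        # "A JOIN B ON A.id = B.id AND A.year = B.year"
```

The explicit form needs each column qualified by its table name, which is why the join code has to know the variables of both inputs.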
Some points that are probably worth clarifying for this pull request:
- The implementation is modeled on `sql_join.DBIConnection` and relies on pretty much the same non-exported dependencies: `auto_names`, `unique_names`, `common_by`, etc.
- `sql_subquery.bigquery` is modified. The problem with the current version of `sql_subquery.bigquery` in the `bigrquery` package is that `build_sql` erases the "vars" attribute when creating a subquery. However, "vars" is needed for correct subsequent processing of the join. I decided to go with the smallest and safest change possible and just added "vars" back to the SQL object.
- Only `collect()` works with this pull request, not `compute()`. (`compute()` fails due to the missing `db_save_query.bigquery`, which I submitted as a separate pull request here: hadley/bigrquery#52)
- [...] `cross_join`, which is planned to be implemented as in Cross join #197.
Please note that `sql_subquery.bigquery` is now also defined in `bigrquery`, but it probably belongs to `dplyr` if that solution is found preferable.
Tests
The code below demonstrates a test of a BigQuery JOIN via dplyr using publicly available data on Google Cloud. Please note that running the test requires a Google Cloud billing project to be set up already.
The test should run for several seconds and be billed for 9 GB of data ($0.05 USD).