Implemented sql_join for Google BigQuery #1244

Closed


realAkhmed

Hello everyone! This pull request features an implementation of sql_join for "bigquery".

(This is a resubmission of #1220 that I had to close in order to move to a different branch)

Motivation

The main motivation for this pull request is that Google BigQuery has powerful SQL join facilities but requires its own SQL dialect for joins, so the regular sql_join fails to produce a runnable query.

  • Example: The SQL keyword "USING" is not supported, and fully qualified column names are required. (When joining two tables A and B on the common column id, the default sql_join.DBIConnection generates "A JOIN B USING id", while Google BigQuery requires "A JOIN B ON A.id = B.id".)
  • Example: Consequently, Google BigQuery is very sensitive to ambiguous names inside the SELECT statement and requires SELECT A.id or SELECT B.id instead of SELECT id, which required some changes to the logic of sql_join.
  • Example: When joining large tables, the "JOIN EACH" keyword is required instead of "JOIN".
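To make the dialect difference concrete, here is a minimal base-R sketch of how such a join clause could be assembled. The helpers `join_on_clause` and `bigquery_join_sql` are hypothetical names used only for illustration; they are not part of dplyr or bigrquery and are not the code in this pull request.

```r
# Hypothetical helper: build "x.col = y.col" conditions from a dplyr-style
# `by` vector, which may be unnamed ("id") or named (c("word" = "title")).
join_on_clause <- function(x, y, by) {
  by_x <- if (is.null(names(by))) by else ifelse(names(by) == "", by, names(by))
  conds <- paste0(x, ".", by_x, " = ", y, ".", unname(by))
  paste(conds, collapse = " AND ")
}

# Hypothetical helper: emit an explicit ON clause instead of USING, and
# optionally the "JOIN EACH" keyword mentioned above for large tables.
bigquery_join_sql <- function(x, y, by, each = FALSE) {
  join <- if (each) "JOIN EACH" else "JOIN"
  paste0(x, " ", join, " ", y, " ON ", join_on_clause(x, y, by))
}

bigquery_join_sql("A", "B", "id")
bigquery_join_sql("A", "B", c("word" = "title"), each = TRUE)
```

Because every condition is qualified with its table alias, the same aliases can be reused in the SELECT list (SELECT A.id rather than SELECT id) to avoid the ambiguity described above.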

Implementation Details

Some points that are probably worth clarifying for this pull request:

  • The pull request is submitted to dplyr instead of bigrquery. I thought it better to submit it to dplyr because the new function closely mimics sql_join.DBIConnection and relies on largely the same non-exported dependencies: auto_names, unique_names, common_by, etc.
  • sql_subquery.bigquery is modified. The problem with the current version of sql_subquery.bigquery in the bigrquery package is that build_sql erases the "vars" attribute when creating a subquery. However, "vars" is needed for correct subsequent processing of the join. I decided to go with the smallest and safest change possible and simply added "vars" back to the SQL object.
  • Only collect() works with this pull request; compute() does not. (compute() fails because db_save_query.bigquery is missing, which I submitted as a separate pull request: hadley/bigrquery#52.)
  • The function is compatible (and tested) with the future cross_join that is planned to be implemented as in "Cross join" #197.

Please note that sql_subquery.bigquery is currently also defined in bigrquery, but it probably belongs in dplyr if this solution is preferred.
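The "vars" issue can be illustrated with a small base-R sketch. It uses paste0 as a stand-in for any string-building step that drops attributes; `subquery_with_vars` is a hypothetical name, and this is not the actual sql_subquery.bigquery code.

```r
# A query string carrying its column names in a "vars" attribute.
query <- structure("SELECT word, word_count FROM shakespeare",
                   vars = c("word", "word_count"))

# Building a new string from it drops the attribute.
wrapped <- paste0("(", query, ") AS t1")
is.null(attr(wrapped, "vars"))

# Hypothetical fix in the spirit of the change described above:
# remember "vars" before wrapping and attach it to the result.
subquery_with_vars <- function(sql, name) {
  vars <- attr(sql, "vars")
  out <- paste0("(", sql, ") AS ", name)
  attr(out, "vars") <- vars
  out
}

attr(subquery_with_vars(query, "t1"), "vars")
```

Re-attaching the attribute keeps the change local to the subquery step, so later join processing can still look up the column names.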

Tests

The code below demonstrates a test of a BigQuery JOIN via dplyr using publicly available data on Google Cloud. Please note that running the test requires a Google Cloud billing project to be set up already.

The test should run for several seconds and be billed for 9GB of data ($0.05 USD).

library(dplyr)
library(bigrquery)

shakespeare <- 
   src_bigquery("publicdata", "samples", billing="<BILLING_PROJECT>") %>% 
   tbl("shakespeare")
wikipedia <- 
   src_bigquery("publicdata", "samples", billing="<BILLING_PROJECT>") %>% 
   tbl("wikipedia")

result <-
  shakespeare %>% 
  select(word, word_count) %>%
  inner_join(wikipedia %>% select(title, num_characters),
             by=c("word"="title")) %>% 
  summarise( n = n(),
             word_count = sum(word_count),
             num_characters = max(num_characters)) %>%
  collect()

result

@realAkhmed
Author

This pull request has been moved to the bigrquery project: r-dbi/bigrquery#56

@realAkhmed realAkhmed closed this Jul 3, 2015
@realAkhmed realAkhmed deleted the sql_join_bigquery branch July 3, 2015 07:16

lock bot commented Jan 19, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 19, 2019