-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Enable cross project queries #1428
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
ibis/bigquery/client.py
Outdated
| (self.data_project_id, | ||
| self.run_project_id, | ||
| self.dataset_id) = parse_project_and_dataset(project_id, dataset_id) | ||
| self.client = bq.Client(self.data_project_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.
Also, shouldn't this be the run_project_id? I guess either way we'll have to pass an explicit project somewhere. I just tend to think of the one on the client as the one I'm getting charged to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.
Thanks for letting me know, I'll update that.
The reason I put self.data_project_id there that is because we're using self.client everywhere to get table names and schema information. Ultimatley I only care about the project getting billed just before execution, and everything else is related to the project I'm getting data from. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, yeah. Only one place where we run queries whereas there are many places where we want the default project for the dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a breaking change, but we could just require passing in the project to execute instead of requiring it to be part of the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think it would be a breaking change. Let me see what I can do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tinkered with this a bit, and I actually like the way it is now, so that users don't have to pass in a project keyword on every execution. We could have a default, but that brings additional complexity in the form of defaults, since we'd want to have a way to make it easy to execute if you know your billing project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per: #1429
I think rather than using the default project on bigquery.Client at all, Ibis should explicit about which project it is using (data project vs billing project). This extra explicitness is still hidden from the user, but it'll be a bit easier to reason about and also handle cross-project queries.
| @@ -305,24 +348,30 @@ def _build_ast(self, expr, context): | |||
| return result | |||
|
|
|||
| def _execute_query(self, dml, async=False): | |||
| if async: | |||
| raise NotImplementedError( | |||
| 'Asynchronous queries not implemented in the BigQuery backend' | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QueryJob class does implement the concurrent.futures interface, but yeah no async keyword since google-cloud-bigquery has to support Python 2.7 still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was really to solidify the expectations about the async parameter here so that a reasonable error message would be visible to users if they pass async=True.
ibis/bigquery/client.py
Outdated
| table_id, dataset_id = _ensure_split(name, database) | ||
| table_ref = self.client.dataset(dataset_id).table(table_id) | ||
| table_ref = self.client.dataset( | ||
| database or self.dataset_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like database should be optionally split into project_id and dataset_id, too. (Here and other methods.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that self.client's project is the one we're getting data from is correct, then we don't need to split here. We should enforce that assumption though, through verifying that the project name isn't included in the database or it's equivalent to self.project_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone wants to write a query that does a cross-project join, like joining a public dataset with their own dataset? Can they construct such a query by creating two connections in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was wondering if you'd ask that :) This is actually enforced generically in ibis, you can't have multiple clients underlying the same expression. I made an issue about this (#1429). It's marked as "Future", but if missing this functionality is a non-starter for you then we can definitely push up the priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, well the 1 client restriction does kind of correlate with my thoughts that the default on the client is usually the one you are charging to. You couldn't do a cross-client query that tries to charge the query to two different projects (or I guess you could, but it'd be ambiguous which one gets charged).
ibis/bigquery/client.py
Outdated
|
|
||
| def _get_table_schema(self, qualified_name): | ||
| return self.get_schema(qualified_name) | ||
| _, dataset, table = qualified_name.split('.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why discard the project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming (possible incorrectly) that self.client's project is the one we're getting data from.
| con = ibis.bigquery.connect( | ||
| project_id='ibis-gbq', | ||
| dataset_id='bigquery-public-data.stackoverflow') | ||
| table = con.table('posts_questions') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add another test cross-project query by setting the database parameter here and in other methods (like exists_database, exists_table, list_tables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tswast How do you feel about disabling cross project versions of these methods? You'd only be able to pass in a database name/table name, list tables/databases in the current data_project_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting the functionality of these methods is more consistent with the fact that there is one data project per client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the expectation that we'll follow-up to open them up later? Or are you still thinking we'd want multiple backends for that use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd caution against multiple clients if only because the auth flow can take some time, especially if it's pinging the metadata server for GCE-based credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: googleapis/python-bigquery-pandas#127 (comment)
In [4]: %timeit google.auth.default()
573 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The Client constructor calls google.auth.default() if no credentials are passed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can take even longer on some systems (if it is trying to get credentials from GCE metadata server IIRC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. What is the recommended way to query across projects that doesn't involve creating a new bq.Client object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't realize that some of the key methods like dataset take a project parameter. Let me tinker a bit
ibis/bigquery/client.py
Outdated
| ... ) | ||
| >>> data_project | ||
| 'foo-bar' | ||
| >>> run_project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this billing_project instead of run_project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
ibis/bigquery/client.py
Outdated
|
|
||
| def exists_database(self, name): | ||
| return name in self.list_databases() | ||
| # TODO(cpcloud): There doesn't appear to be a way to list datasets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tswast Is it possible to list the datasets in another project without creating a new client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no project parameter to list_datests() (Filed: googleapis/google-cloud-python#5216 I plan to tackle this today)
Workaround: You can override the project (and then reset it back to the old one) in the client before calling list_datasets()
In [1]: from google.cloud import bigquery
In [2]: client = bigquery.Client(project='swast-scratch')
In [3]: print([d.dataset_id for d in client.list_datasets()])
['datavis', 'delete_table_false_1521482245020', ..., 'workflow_test_eu', 'workflow_test_us']
In [4]: client.project = 'bigquery-public-data'
In [5]: print([d.dataset_id for d in client.list_datasets()])
['baseball', ..., 'stackoverflow', 'the_met', 'usa_names', 'utility_eu', 'utility_us', ...]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment on a workaround for list_databases #1428 (comment) (Though, I guess that function doesn't take a project parameter and doesn't necessarily have to)
For the exists_database function, I'd actually recommend calling get_dataset() and catching the NotFound exception. DatasetReference does have a project property.
from google.api_core.exceptions import NotFound
dataset_reference = client.dataset(dataset_id, project=data_project_id)
try:
client.get_dataset(dataset_reference)
return True
except NotFound:
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent.
|
@tswast You can now join across different projects :) Check out the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks great.
ibis/bigquery/client.py
Outdated
|
|
||
| def exists_database(self, name): | ||
| return name in self.list_databases() | ||
| # TODO(cpcloud): There doesn't appear to be a way to list datasets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment on a workaround for list_databases #1428 (comment) (Though, I guess that function doesn't take a project parameter and doesn't necessarily have to)
For the exists_database function, I'd actually recommend calling get_dataset() and catching the NotFound exception. DatasetReference does have a project property.
from google.api_core.exceptions import NotFound
dataset_reference = client.dataset(dataset_id, project=data_project_id)
try:
client.get_dataset(dataset_reference)
return True
except NotFound:
return False
ibis/bigquery/client.py
Outdated
| return table_id in self.list_tables( | ||
| database=dataset_id or self.dataset_id | ||
| ) | ||
| return name in self.list_tables(database=database or self.dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto that I'd recommend get_table and catch NotFound rather than listing for exists_table (some projects have a lot [thousands] of dataset / tables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks!
| assert result.get_result() == 1 | ||
|
|
||
|
|
||
| def test_multiple_project_queries(client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woohoo!
|
Merging on green. |
|
@tswast Thanks for the review! |
|
@cpcloud Happy to help! I'm really excited about this change. I think it really unlocks a lot of potential for BQ with Ibis. |
Closes #1427
cc @tswast