
ENH: Enable cross project queries #1428

Closed
wants to merge 3 commits

Conversation

@cpcloud cpcloud commented Apr 18, 2018

Closes #1427

cc @tswast

@cpcloud cpcloud self-assigned this Apr 18, 2018
@cpcloud cpcloud added the bug (Incorrect behavior inside of ibis), feature (Features or general enhancements), and bigquery labels Apr 18, 2018
@cpcloud cpcloud added this to the 0.14 milestone Apr 18, 2018

@tswast tswast left a comment

Thanks!

(self.data_project_id,
self.run_project_id,
self.dataset_id) = parse_project_and_dataset(project_id, dataset_id)
self.client = bq.Client(self.data_project_id)
tswast (Collaborator)

Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.

Also, shouldn't this be the run_project_id? I guess either way we'll have to pass an explicit project somewhere. I just tend to think of the one on the client as the one I'm getting charged to.
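
For illustration, the keyword form being suggested looks roughly like this (a minimal sketch; the project name is a placeholder):

from google.cloud import bigquery as bq

# Pass the project by keyword rather than positionally.
client = bq.Client(project='my-data-project')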

@cpcloud cpcloud Apr 18, 2018

Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.

Thanks for letting me know, I'll update that.

The reason I put self.data_project_id there is that we're using self.client everywhere to get table names and schema information. Ultimately I only care about the project getting billed just before execution, and everything else is related to the project I'm getting data from. Does that make sense?

tswast (Collaborator)

Makes sense, yeah. There's only one place where we run queries, whereas there are many places where we want the default project for the dataset.

cpcloud (Member Author)

It would be a breaking change, but we could just require passing in the project to execute instead of requiring it to be part of the constructor.

cpcloud (Member Author)

Actually, I don't think it would be a breaking change. Let me see what I can do here.

cpcloud (Member Author)

I tinkered with this a bit, and I actually like the way it is now, so that users don't have to pass in a project keyword on every execution. We could have a default, but that brings additional complexity, since we'd still want a way to make execution easy when you already know your billing project.

tswast (Collaborator)

Per: #1429

I think rather than using the default project on bigquery.Client at all, Ibis should be explicit about which project it is using (data project vs. billing project). This extra explicitness is still hidden from the user, but it'll be a bit easier to reason about and will also handle cross-project queries.
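
For example, something along these lines (a sketch only; project names are placeholders, and the table is the public Stack Overflow dataset used elsewhere in this thread):

from google.cloud import bigquery

billing_project = 'my-billing-project'   # project the query job is charged to
data_project = 'bigquery-public-data'    # project that owns the data

client = bigquery.Client(project=billing_project)

# The fully qualified table name pins the data project explicitly, while the
# job itself runs in (and is billed to) the client's project.
sql = 'SELECT COUNT(*) FROM `{}.stackoverflow.posts_questions`'.format(data_project)
rows = list(client.query(sql).result())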

@@ -305,24 +348,30 @@ def _build_ast(self, expr, context):
return result

def _execute_query(self, dml, async=False):
if async:
raise NotImplementedError(
'Asynchronous queries not implemented in the BigQuery backend'
tswast (Collaborator)

The QueryJob class does implement the concurrent.futures interface, but yeah, there's no async keyword since google-cloud-bigquery still has to support Python 2.7.
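
For reference, the futures-style usage looks roughly like this (a sketch; the query text is arbitrary):

from google.cloud import bigquery

client = bigquery.Client()
job = client.query('SELECT 1')                  # returns a QueryJob immediately
job.add_done_callback(lambda j: print('done'))  # concurrent.futures-style callback
rows = job.result()                             # blocks until the job finishes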

cpcloud (Member Author)

Yeah, this was really to solidify the expectations about the async parameter here so that a reasonable error message would be visible to users if they pass async=True.

table_id, dataset_id = _ensure_split(name, database)
table_ref = self.client.dataset(dataset_id).table(table_id)
table_ref = self.client.dataset(
database or self.dataset_id
tswast (Collaborator)

Seems like database should be optionally split into project_id and dataset_id, too. (Here and in other methods.)
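
A sketch of the optional split being suggested (split_database is a hypothetical helper, not part of this PR):

def split_database(database, default_project):
    """Split 'project.dataset' into its parts, defaulting the project."""
    project, _, dataset = database.rpartition('.')
    return (project or default_project), dataset

# split_database('bigquery-public-data.stackoverflow', 'ibis-gbq')
#   -> ('bigquery-public-data', 'stackoverflow')
# split_database('stackoverflow', 'ibis-gbq')
#   -> ('ibis-gbq', 'stackoverflow')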

cpcloud (Member Author)

Assuming self.client's project is the one we're getting data from, we don't need to split here. We should enforce that assumption, though, by verifying that either the project name isn't included in the database or it's equivalent to self.project_id.

@tswast tswast Apr 18, 2018

What if someone wants to write a query that does a cross-project join, like joining a public dataset with their own dataset? Can they construct such a query by creating two connections in that case?

cpcloud (Member Author)

Ah, I was wondering if you'd ask that :) This is actually enforced generically in ibis: you can't have multiple clients underlying the same expression. I made an issue about this (#1429). It's marked as "Future", but if missing this functionality is a non-starter for you then we can definitely push up the priority.

tswast (Collaborator)

Ah, well, the one-client restriction does kind of correlate with my thought that the default project on the client is usually the one you're charging to. You couldn't do a cross-client query that tries to charge the query to two different projects (or I guess you could, but it'd be ambiguous which one gets charged).


def _get_table_schema(self, qualified_name):
return self.get_schema(qualified_name)
_, dataset, table = qualified_name.split('.')
tswast (Collaborator)

Why discard the project?

cpcloud (Member Author)

I'm assuming (possibly incorrectly) that self.client's project is the one we're getting data from.

con = ibis.bigquery.connect(
project_id='ibis-gbq',
dataset_id='bigquery-public-data.stackoverflow')
table = con.table('posts_questions')
tswast (Collaborator)

Could we add another cross-project query test by setting the database parameter here and in other methods (like exists_database, exists_table, list_tables)?
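
For example, something like (a sketch of the call this PR enables):

table = con.table(
    'posts_questions',
    database='bigquery-public-data.stackoverflow',
)
# ...and similarly pass database='project.dataset' to exists_database,
# exists_table, and list_tables.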

@cpcloud cpcloud Apr 18, 2018

Absolutely, will do.

cpcloud (Member Author)

@tswast How do you feel about disabling cross-project versions of these methods? You'd only be able to pass in a database name/table name and list tables/databases in the current data_project_id.

cpcloud (Member Author)

Limiting the functionality of these methods is more consistent with the fact that there is one data project per client.

tswast (Collaborator)

With the expectation that we'll follow up to open them up later? Or are you still thinking we'd want multiple backends for that use?

tswast (Collaborator)

I'd caution against multiple clients if only because the auth flow can take some time, especially if it's pinging the metadata server for GCE-based credentials.

tswast (Collaborator)

See: googleapis/python-bigquery-pandas#127 (comment)

In [4]: %timeit google.auth.default()
573 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Client constructor calls google.auth.default() if no credentials are passed in.
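
One way to pay that cost only once is to resolve credentials up front and reuse them for any additional clients (a sketch; project names are placeholders):

import google.auth
from google.cloud import bigquery

credentials, _ = google.auth.default()
client_a = bigquery.Client(project='project-a', credentials=credentials)
client_b = bigquery.Client(project='project-b', credentials=credentials)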

tswast (Collaborator)

It can take even longer on some systems (if it is trying to get credentials from GCE metadata server IIRC)

cpcloud (Member Author)

Interesting. What is the recommended way to query across projects that doesn't involve creating a new bq.Client object?

cpcloud (Member Author)

Ah, I didn't realize that some of the key methods like dataset take a project parameter. Let me tinker a bit.
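
For example (a sketch; identifiers other than the public dataset are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project='my-billing-project')
# The reference can point at a different project than the client's own:
table_ref = client.dataset(
    'stackoverflow', project='bigquery-public-data'
).table('posts_questions')
table = client.get_table(table_ref)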

... )
>>> data_project
'foo-bar'
>>> run_project
tswast (Collaborator)

Can we call this billing_project instead of run_project?

cpcloud (Member Author)

Yep.


def exists_database(self, name):
return name in self.list_databases()
# TODO(cpcloud): There doesn't appear to be a way to list datasets
cpcloud (Member Author)

@tswast Is it possible to list the datasets in another project without creating a new client?

tswast (Collaborator)

There's no project parameter to list_datasets(). (Filed: googleapis/google-cloud-python#5216; I plan to tackle this today.)

Workaround: you can override the project on the client (and then reset it back to the old one) before calling list_datasets().

In [1]: from google.cloud import bigquery

In [2]: client = bigquery.Client(project='swast-scratch')

In [3]: print([d.dataset_id for d in client.list_datasets()])
['datavis', 'delete_table_false_1521482245020', ...,  'workflow_test_eu', 'workflow_test_us']

In [4]: client.project = 'bigquery-public-data'

In [5]: print([d.dataset_id for d in client.list_datasets()])
['baseball', ..., 'stackoverflow',  'the_met', 'usa_names', 'utility_eu', 'utility_us', ...]

tswast (Collaborator)

See my other comment on a workaround for list_databases: #1428 (comment). (Though I guess that function doesn't take a project parameter and doesn't necessarily have to.)

For the exists_database function, I'd actually recommend calling get_dataset() and catching the NotFound exception. DatasetReference does have a project property.

from google.api_core.exceptions import NotFound

dataset_reference = client.dataset(dataset_id, project=data_project_id)
try:
    client.get_dataset(dataset_reference)
    return True
except NotFound:
    return False

cpcloud (Member Author)

Excellent.

@cpcloud cpcloud commented Apr 20, 2018

@tswast You can now join across different projects :) Check out the test_multiple_project_queries tests for examples.
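
Roughly, a cross-project join can now be written like this (a sketch only; the table and column names here are illustrative, not copied from the actual tests):

import ibis

con = ibis.bigquery.connect(
    project_id='my-billing-project',
    dataset_id='my_dataset',
)
mine = con.table('my_table')
public = con.table(
    'posts_questions',
    database='bigquery-public-data.stackoverflow',
)
expr = mine.join(public, mine.user_id == public.owner_user_id)[mine, public.score]
df = expr.execute()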

@tswast tswast left a comment

Thanks! Looks great.



return table_id in self.list_tables(
database=dataset_id or self.dataset_id
)
return name in self.list_tables(database=database or self.dataset)
tswast (Collaborator)

Ditto that I'd recommend get_table and catching NotFound rather than listing for exists_table (some projects have a lot [thousands] of datasets / tables).
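
Along the same lines as the exists_database suggestion above, a sketch:

from google.api_core.exceptions import NotFound

table_ref = client.dataset(dataset_id, project=data_project_id).table(table_id)
try:
    client.get_table(table_ref)
    return True
except NotFound:
    return False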

cpcloud (Member Author)

Will do, thanks!

assert result.get_result() == 1


def test_multiple_project_queries(client):
tswast (Collaborator)

Woohoo!

@cpcloud cpcloud commented Apr 20, 2018

Merging on green.

@cpcloud cpcloud closed this in efe3587 Apr 20, 2018
@cpcloud cpcloud deleted the cross-project branch April 20, 2018 19:35
@cpcloud cpcloud commented Apr 20, 2018

@tswast Thanks for the review!

@tswast tswast commented Apr 20, 2018

@cpcloud Happy to help! I'm really excited about this change. I think it really unlocks a lot of potential for BQ with Ibis.
