ENH: Enable cross project queries #1428

Closed

cpcloud wants to merge 3 commits into master from cpcloud:cross-project

Conversation

@cpcloud
Member

cpcloud commented Apr 18, 2018

Closes #1427

cc @tswast

@cpcloud cpcloud self-assigned this Apr 18, 2018

@cpcloud cpcloud added this to the 0.14 milestone Apr 18, 2018

@cpcloud cpcloud added this to To do in BigQuery via automation Apr 18, 2018

        (self.data_project_id,
         self.run_project_id,
         self.dataset_id) = parse_project_and_dataset(project_id, dataset_id)
        self.client = bq.Client(self.data_project_id)

@tswast

tswast Apr 18, 2018

Contributor

Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.

Also, shouldn't this be the run_project_id? I guess either way we'll have to pass an explicit project somewhere. I just tend to think of the one on the client as the one I'm getting charged to.
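
For example, a minimal sketch of the keyword form (bq here is the google.cloud.bigquery alias already used in this diff):

# Pass the project by keyword so a future reordering of positional
# parameters can't silently change which project the client targets.
self.client = bq.Client(project=self.data_project_id)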

@cpcloud

cpcloud Apr 18, 2018

Member

Nit: The google-cloud-python team considers project a keyword parameter to Client and doesn't make guarantees on the positional order.

Thanks for letting me know, I'll update that.

The reason I put self.data_project_id there is because we're using self.client everywhere to get table names and schema information. Ultimately I only care about the project getting billed just before execution, and everything else is related to the project I'm getting data from. Does that make sense?
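
A rough editorial sketch of the split being described (the constructor line mirrors the diff above; how run_project_id gets applied at execution time is hypothetical here):

# Metadata lookups (table names, schemas) always go through self.client,
# which is bound to the project that owns the data.
self.client = bq.Client(project=self.data_project_id)

# The billing project only matters once a query actually runs, e.g.
# (hypothetically) by pointing the client at it just for that call:
def _run(self, sql):
    original = self.client.project
    self.client.project = self.run_project_id  # project that gets billed
    try:
        return self.client.query(sql).result()
    finally:
        self.client.project = original  # back to the data project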

@tswast

tswast Apr 18, 2018

Contributor

Makes sense, yeah. There's only one place where we run queries, whereas there are many places where we want the default project for the dataset.

@cpcloud

cpcloud Apr 18, 2018

Member

It would be a breaking change, but we could just require passing in the project to execute instead of requiring it to be part of the constructor.

@cpcloud

cpcloud Apr 18, 2018

Member

Actually, I don't think it would be a breaking change. Let me see what I can do here.

@cpcloud

cpcloud Apr 18, 2018

Member

I tinkered with this a bit, and I actually like the way it is now, so that users don't have to pass in a project keyword on every execution. We could have a default, but that brings additional complexity, since we'd still want a way to make it easy to execute when you know your billing project.

@tswast

tswast Apr 18, 2018

Contributor

Per: #1429

I think rather than using the default project on bigquery.Client at all, Ibis should be explicit about which project it is using (data project vs billing project). This extra explicitness is still hidden from the user, but it'll be a bit easier to reason about and will also make cross-project queries easier to handle.

@@ -305,24 +348,30 @@ def _build_ast(self, expr, context):
        return result

    def _execute_query(self, dml, async=False):
        if async:
            raise NotImplementedError(
                'Asynchronous queries not implemented in the BigQuery backend'

@tswast

tswast Apr 18, 2018

Contributor

The QueryJob class does implement the concurrent.futures interface, but yeah no async keyword since google-cloud-bigquery has to support Python 2.7 still.
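
To illustrate the futures-style surface on QueryJob (a sketch; result() blocks like Future.result(), and add_done_callback() comes from the underlying google.api_core polling future):

query_job = client.query('SELECT 1')

# Block until the job finishes and iterate the rows.
rows = list(query_job.result())

# Or register a callback instead of blocking.
query_job.add_done_callback(lambda job: print('finished job', job.job_id))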

@cpcloud

cpcloud Apr 18, 2018

Member

Yeah, this was really to solidify the expectations about the async parameter here so that a reasonable error message would be visible to users if they pass async=True.

table_id, dataset_id = _ensure_split(name, database)
table_ref = self.client.dataset(dataset_id).table(table_id)
table_ref = self.client.dataset(
database or self.dataset_id

@tswast

tswast Apr 18, 2018

Contributor

Seems like database should be optionally split into project_id and dataset_id, too. (Here and other methods.)

@cpcloud

cpcloud Apr 18, 2018

Member

If the assumption that self.client's project is the one we're getting data from is correct, then we don't need to split here. We should enforce that assumption though, by verifying that the project name either isn't included in the database or is equivalent to self.project_id.
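
Something along these lines, perhaps (a sketch only; the exception type is just illustrative):

# Reject an explicit project in the database string unless it matches
# the project this client already reads data from.
project, _, dataset = (database or self.dataset_id).rpartition('.')
if project and project != self.data_project_id:
    raise ValueError(
        'Cannot reference project {!r}; this client reads data '
        'from project {!r}'.format(project, self.data_project_id)
    )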

@tswast

tswast Apr 18, 2018

Contributor

What if someone wants to write a query that does a cross-project join, like joining a public dataset with their own dataset? Can they construct such a query by creating two connections in that case?

@cpcloud

cpcloud Apr 18, 2018

Member

Ah, I was wondering if you'd ask that :) This is actually enforced generically in ibis: you can't have multiple clients underlying the same expression. I made an issue about this (#1429). It's marked as "Future", but if missing this functionality is a non-starter for you then we can definitely push up the priority.

@tswast

tswast Apr 18, 2018

Contributor

Ah, well the one-client restriction does kind of align with my view that the default project on the client is usually the one you are charging to. You couldn't do a cross-client query that tries to charge the query to two different projects (or I guess you could, but it'd be ambiguous which one gets charged).

def _get_table_schema(self, qualified_name):
return self.get_schema(qualified_name)
_, dataset, table = qualified_name.split('.')

@tswast

tswast Apr 18, 2018

Contributor

Why discard the project?

@cpcloud

cpcloud Apr 18, 2018

Member

I'm assuming (possibly incorrectly) that self.client's project is the one we're getting data from.

con = ibis.bigquery.connect(
    project_id='ibis-gbq',
    dataset_id='bigquery-public-data.stackoverflow')
table = con.table('posts_questions')

@tswast

tswast Apr 18, 2018

Contributor

Could we add another cross-project query test by setting the database parameter here, and in other methods (like exists_database, exists_table, list_tables)?
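
For example, something like the following (hypothetical usage; whether database accepts a 'project.dataset' string here depends on how the parameter ends up being parsed):

con = ibis.bigquery.connect(project_id='ibis-gbq', dataset_id='my_dataset')

# Point at a dataset in a different project via the database argument.
t = con.table('posts_questions',
              database='bigquery-public-data.stackoverflow')
assert con.exists_table('posts_questions',
                        database='bigquery-public-data.stackoverflow')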

@cpcloud

cpcloud Apr 18, 2018

Member

Absolutely, will do.

@cpcloud

cpcloud Apr 19, 2018

Member

@tswast How do you feel about disabling the cross-project versions of these methods? You'd only be able to pass in a database name/table name, and list tables/databases in the current data_project_id.

@cpcloud

cpcloud Apr 19, 2018

Member

Limiting the functionality of these methods is more consistent with the fact that there is one data project per client.

@tswast

tswast Apr 19, 2018

Contributor

With the expectation that we'll follow up to open them up later? Or are you still thinking we'd want multiple backends for that use?

@tswast

tswast Apr 19, 2018

Contributor

I'd caution against multiple clients if only because the auth flow can take some time, especially if it's pinging the metadata server for GCE-based credentials.

@tswast

tswast Apr 19, 2018

Contributor

See: pydata/pandas-gbq#127 (comment)

In [4]: %timeit google.auth.default()
573 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Client constructor calls google.auth.default() if no credentials are passed in.
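
One way to avoid paying that cost more than once (a sketch, not part of this PR): resolve credentials a single time and pass them to each Client explicitly.

import google.auth
from google.cloud import bigquery

# google.auth.default() is the slow step; do it exactly once.
credentials, default_project = google.auth.default()

client_a = bigquery.Client(project='my-billing-project', credentials=credentials)
client_b = bigquery.Client(project='bigquery-public-data', credentials=credentials)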

@tswast

tswast Apr 19, 2018

Contributor

It can take even longer on some systems (if it is trying to get credentials from the GCE metadata server, IIRC).

@cpcloud

cpcloud Apr 20, 2018

Member

Interesting. What is the recommended way to query across projects that doesn't involve creating a new bq.Client object?

@cpcloud

cpcloud Apr 20, 2018

Member

Ah, I didn't realize that some of the key methods like dataset take a project parameter. Let me tinker a bit.
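
For reference, the pattern being alluded to (a sketch using only the one existing client):

# Build a reference into another project without constructing a new Client.
table_ref = client.dataset(
    'stackoverflow', project='bigquery-public-data'
).table('posts_questions')
table = client.get_table(table_ref)  # metadata fetched from the other project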

... )
>>> data_project
'foo-bar'
>>> run_project

@tswast

tswast Apr 18, 2018

Contributor

Can we call this billing_project instead of run_project?

@cpcloud cpcloud force-pushed the cpcloud:cross-project branch from 1dd44c4 to 4241158 Apr 20, 2018

def exists_database(self, name):
return name in self.list_databases()
# TODO(cpcloud): There doesn't appear to be a way to list datasets

@cpcloud

cpcloud Apr 20, 2018

Member

@tswast Is it possible to list the datasets in another project without creating a new client?

@tswast

tswast Apr 20, 2018

Contributor

There's no project parameter to list_datasets() (filed googleapis/google-cloud-python#5216; I plan to tackle this today).

Workaround: you can override the project on the client (and then reset it back to the old one) before calling list_datasets():

In [1]: from google.cloud import bigquery

In [2]: client = bigquery.Client(project='swast-scratch')

In [3]: print([d.dataset_id for d in client.list_datasets()])
['datavis', 'delete_table_false_1521482245020', ...,  'workflow_test_eu', 'workflow_test_us']

In [4]: client.project = 'bigquery-public-data'

In [5]: print([d.dataset_id for d in client.list_datasets()])
['baseball', ..., 'stackoverflow',  'the_met', 'usa_names', 'utility_eu', 'utility_us', ...]
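
Wrapped up, the override-and-restore dance might look like this (just a sketch of the workaround above):

def list_datasets_in(client, project):
    # Temporarily point the client at another project, then restore it.
    original_project = client.project
    client.project = project
    try:
        return [d.dataset_id for d in client.list_datasets()]
    finally:
        client.project = original_project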

@tswast

tswast Apr 20, 2018

Contributor

See my other comment on a workaround for list_databases: #1428 (comment). (Though I guess that function doesn't take a project parameter and doesn't necessarily have to.)

For the exists_database function, I'd actually recommend calling get_dataset() and catching the NotFound exception. DatasetReference does have a project property.

from google.api_core.exceptions import NotFound

dataset_reference = client.dataset(dataset_id, project=data_project_id)
try:
    client.get_dataset(dataset_reference)
    return True
except NotFound:
    return False

@cpcloud

cpcloud Apr 20, 2018

Member

Excellent.

@cpcloud

Member

cpcloud commented Apr 20, 2018

@tswast You can now join across different projects :) Check out the test_multiple_project_queries tests for examples.

@tswast

Thanks! Looks great.

def exists_database(self, name):
return name in self.list_databases()
# TODO(cpcloud): There doesn't appear to be a way to list datasets

return table_id in self.list_tables(
database=dataset_id or self.dataset_id
)
return name in self.list_tables(database=database or self.dataset)

@tswast

tswast Apr 20, 2018

Contributor

Ditto that I'd recommend get_table and catching NotFound rather than listing for exists_table (some projects have a lot [thousands] of datasets/tables).
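
Analogous to the exists_database suggestion above, roughly (a sketch):

from google.api_core.exceptions import NotFound

table_ref = client.dataset(dataset_id, project=data_project_id).table(table_id)
try:
    client.get_table(table_ref)
    return True
except NotFound:
    return False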

@cpcloud

cpcloud Apr 20, 2018

Member

Will do, thanks!

assert result.get_result() == 1
def test_multiple_project_queries(client):

@tswast

tswast Apr 20, 2018

Contributor

Woohoo!

@cpcloud cpcloud force-pushed the cpcloud:cross-project branch from 4241158 to ab3e9d8 Apr 20, 2018

@cpcloud cpcloud force-pushed the cpcloud:cross-project branch from 67456b1 to 5b693b3 Apr 20, 2018

@cpcloud

Member

cpcloud commented Apr 20, 2018

Merging on green.

@tswast

tswast approved these changes Apr 20, 2018

cpcloud added some commits Apr 20, 2018

@cpcloud cpcloud closed this in efe3587 Apr 20, 2018

BigQuery automation moved this from To do to Done Apr 20, 2018

@cpcloud cpcloud deleted the cpcloud:cross-project branch Apr 20, 2018

@cpcloud

Member

cpcloud commented Apr 20, 2018

@tswast Thanks for the review!

@tswast

Contributor

tswast commented Apr 20, 2018

@cpcloud Happy to help! I'm really excited about this change. I think it really unlocks a lot of potential for BQ with Ibis.
