
ENH: Upgrade google-cloud-bigquery dependency to 1.0.0 #1424

Closed · wants to merge 1 commit

Conversation

@cpcloud (Member) commented Apr 14, 2018:

Upgrade to google-cloud-bigquery 1.0.0

cc @tswast

@cpcloud added this to the 0.14 milestone (Apr 14, 2018)
@cpcloud self-assigned this (Apr 14, 2018)
@cpcloud added the feature, docs, and bigquery labels (Apr 14, 2018)
@tswast (Collaborator) left a comment:

Thanks! A few comments, but overall this looks good.

@@ -106,11 +100,13 @@ def __init__(self, query):
         self.query = query

     def fetchall(self):
-        return list(self.query.fetch_data())
+        result = self.query.result()
@tswast (Collaborator):

Note: the QueryJob is itself an iterator, and it will call result() under the covers as soon as you try to iterate over it. No need to update; it's fine to call result() to be explicit that the code will block while waiting for the query to complete.
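For illustration, a minimal sketch of the two equivalent styles described in this comment, assuming an authenticated client and an arbitrary query:

    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.query('SELECT 1 AS x')  # returns a QueryJob immediately

    # Implicit: iterating the job calls result() under the covers,
    # blocking until the query finishes.
    rows = list(job)

    # Explicit: calling result() first makes the blocking obvious.
    rows = list(job.result())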

@cpcloud (author):

Cool, good to know. Thanks!

@@ -189,28 +181,18 @@ def get_datasets(self):
         return list(self.client.list_datasets())

     def get_dataset(self, dataset_id):
-        return self.client.dataset(dataset_id)
+        dataset_ref = self.client.dataset(dataset_id)
+        return self.client.get_dataset(dataset_ref)
@tswast (Collaborator):

Excellent.

Sidenote: the previous version (based on the 0.27 client) didn't actually make a GET call; to do that, you would also have had to call .reload() on the Dataset object.

Update: looking at my subsequent comments, maybe we don't actually want to GET a dataset resource? I guess the function name makes it sound like we should. Maybe what we really need is a def dataset_ref(self, dataset_id) function?

@cpcloud (author):

I've gone ahead and removed the BigQueryAPIProxy in favor of calls into the client. The additional level of indirection was unnecessary.

@cpcloud (author), Apr 15, 2018:

This made it easier to use refs as much as possible, so a dataset_ref method shouldn't be necessary anymore.

-        table = self.client.dataset(dataset_id).table(table_id)
-        if reload:
-            table.reload()
+        table_ref = self.get_dataset(dataset_id).table(table_id)
@tswast (Collaborator):

You don't actually have to call get_dataset() here. All you need is a reference, so self.client.dataset(dataset_id).table(table_id) should work. Then you're only making one API call to GET the Table instead of two (one to GET the Dataset and another to GET the Table).
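A sketch of the difference; building references is purely local, and only get_table() issues a request (the dataset and table names are hypothetical placeholders):

    dataset_id, table_id = 'my_dataset', 'my_table'  # hypothetical names

    # One API call: build a reference locally, then GET the Table.
    table_ref = client.dataset(dataset_id).table(table_id)
    table = client.get_table(table_ref)

    # Two API calls: GET the Dataset first, then GET the Table.
    dataset = client.get_dataset(client.dataset(dataset_id))
    table = client.get_table(dataset.table(table_id))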

@cpcloud (author):

Excellent, I'll make this change.

-        query.run()
+        job_config = bq.job.QueryJobConfig()
+        job_config.query_parameters = query_parameters or []
+        job_config.use_legacy_sql = False
@tswast (Collaborator):

Since 0.28, the client has used standard SQL as the default, but it's okay to keep this line to be explicit about it.

@cpcloud (author):

Great. I will leave this here for now to be explicit.

    job_config = bq.job.QueryJobConfig()
    job_config.query_parameters = query_parameters or []
    job_config.use_legacy_sql = False
    query = self._proxy.client.query(stmt, job_config=job_config)

    # run_sync_query is not really synchronous: there's a timeout
@tswast (Collaborator):

You could call query.result() here to wait. It effectively does this, but with a little more retry logic. If you don't explicitly pass a timeout to result(), it will wait indefinitely, as you are doing here.
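A minimal sketch of the explicit alternative, assuming `query` is the QueryJob from the snippet above:

    # Waits indefinitely, with the library's retry logic, matching the
    # behavior of the hand-rolled wait this comment refers to.
    rows = query.result()

    # Or bound the wait; result() raises if the job isn't done in time.
    rows = query.result(timeout=60)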

@cpcloud (author):

This sounds like a good idea.

@@ -381,11 +363,13 @@ def exists_table(self, name, database=None):

    def list_tables(self, like=None, database=None):
        dataset = self._proxy.get_dataset(database or self.dataset_id)
@tswast (Collaborator):

You don't actually need to make a GET call to fetch a Dataset resource. All you need in order to list tables is a reference, so you could do dataset = client.dataset(database or self.dataset_id).
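A sketch of listing from a bare reference. Note the listing method's exact name is an assumption here (in the 1.0.0-era API it was list_dataset_tables(), later renamed list_tables()):

    dataset_id = 'my_dataset'  # hypothetical name
    dataset_ref = client.dataset(dataset_id)                # local only, no API call
    tables = list(client.list_dataset_tables(dataset_ref))  # a single LIST request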

@cpcloud (author):

Great, I'll make this change.

    assert fields
    names = [el.name for el in fields]
    ibis_types = [_discover_type(el) for el in fields]
    ibis_type = dt.Struct(names, ibis_types)
@tswast (Collaborator):

Note: to_dataframe() in the google-cloud-bigquery library just uses object as the dtype for RECORD and ARRAY columns (I think because pandas doesn't have these datatypes). Will that cause a problem for ibis? Is there actually a pandas struct or array type now that we should be using in that library?

@cpcloud (author):

There's no struct/array type in pandas. This won't affect ibis, since we already assume this to be true.
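Purely illustrative of that assumption (the query and column name are hypothetical):

    sql = 'SELECT record_col FROM `project.dataset.table`'  # hypothetical
    df = client.query(sql).to_dataframe()
    # RECORD values come back as Python dicts (and ARRAY values as lists)
    # inside an object-dtype column; pandas has no dedicated struct dtype.
    df['record_col'].dtype  # dtype('O')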

@tswast (Collaborator):

👍

Would it make sense for google-cloud-bigquery to add a to_arrow() method like to_dataframe() so that these types have better support here?

    except BadRequest:
        pass
    if table.partitioning_type is not None:
        pairs.append((NATIVE_PARTITION_COL, dt.timestamp))
@tswast (Collaborator):

Note: this won't be correct for tables that use the new column-based time partitioning.

Unfortunately, google-cloud-bigquery doesn't have the partitioning field property yet (googleapis/google-cloud-python#4658). Whenever we add it there, this logic should change to add NATIVE_PARTITION_COL only when the partitioning field is None.

@cpcloud (author):

Is there a surefire way (even if it's a workaround) to 1) tell whether a table is partitioned, and 2) determine which column(s) are used for partitioning?

@cpcloud (author):

Update: I see that there's some more info in the issue, I'll read through that :)

@cpcloud force-pushed the bigquery-1.0.0-upgrade branch 2 times, most recently from e40386d to 2a59ddf (April 15, 2018 18:51)
class BigQueryClient(SQLClient):

    sync_query = BigQueryQuery
    database_class = BigQueryDatabase
    proxy_class = BigQueryAPIProxy
    table_class = BigQueryTable
@cpcloud (author):

The functionality to make this work isn't yet implemented. I've got a PR in the works to do this.

@cpcloud (author) commented Apr 15, 2018:

@tswast This is ready for another review if you're up for it. In particular, see these two functions (https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR257, https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR91), which are mucking around in _properties to figure out partition fields.

I'll add some more tests to make sure this works with column-based partitions.

 def schema_from_pairs(lst):
     return Schema.from_tuples(lst)


-@schema.register((tuple, list), (tuple, list))
+@schema.register(collections.Iterable, collections.Iterable)
Another member commented:

Nice!

@tswast (Collaborator) left a comment:

BigQuery stuff looks great.

Yes, bq_table._properties.get('timePartitioning', None) is the correct workaround.
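For reference, a minimal sketch of that workaround; bq_table is assumed to be a fetched Table, and the 'field' key reflects the layout of the underlying REST resource:

    time_partitioning = bq_table._properties.get('timePartitioning', None)
    if time_partitioning is not None:
        # 'field' is set for column-based partitioning; for ingestion-time
        # partitioning it is absent, and _PARTITIONTIME applies instead.
        partition_field = time_partitioning.get('field')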


def test_parted_column(client, kind, option, expected_fn):
    table_name = '{}_column_parted'.format(kind)
    option_key = 'bigquery.partition_col'
    with ibis.config.option_context(option_key, option):
@tswast (Collaborator):

Why does Ibis have an option to rename the partition column? Or does this do something else?

@cpcloud (author):

This option is a convenience that allows users to refer to _PARTITIONTIME, since referring to it without renaming it generates an error in BigQuery. For column-based partitioning this isn't a problem, but for implicitly partitioned tables we want to do this.
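Illustrative usage of the option (the alias 'partition_time' is just an example name):

    import ibis

    # BigQuery rejects referring to _PARTITIONTIME without renaming it,
    # so ibis exposes the pseudo-column under a configurable name.
    with ibis.config.option_context('bigquery.partition_col', 'partition_time'):
        ...  # queries against ingestion-time partitioned tables use the alias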

@cpcloud (author) commented Apr 18, 2018:

@tswast Regarding to_arrow(), I think that's a great idea. I'll open an issue on the BigQuery Python tracker.

@cpcloud (author) commented Apr 18, 2018:

Merging on green.

@cpcloud closed this in 63c9274 (Apr 18, 2018)
@cpcloud deleted the bigquery-1.0.0-upgrade branch (April 18, 2018 15:06)
@cpcloud (author) commented Apr 18, 2018:

@tswast Created googleapis/google-cloud-python#5204
