ENH: Upgrade google-cloud-bigquery dependency to 1.0.0 #1424
Conversation
Thanks! A few comments, but overall looks good.
@@ -106,11 +100,13 @@ def __init__(self, query):
         self.query = query

     def fetchall(self):
-        return list(self.query.fetch_data())
+        result = self.query.result()
Note: the QueryJob is itself an iterator, and will call result() under the covers as soon as you try to iterate over it. No need to update, it's fine to call result() to be more explicit that the code will block to wait for the query to complete.
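A minimal sketch of the two equivalent spellings (the query text here is made up for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query('SELECT 1 AS x')  # returns a QueryJob

# Iterating the job calls result() under the covers and blocks
# until the query completes.
rows_implicit = list(job)

# Calling result() explicitly makes the blocking behavior obvious.
rows_explicit = list(job.result())
```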
Cool, good to know. Thanks!
ibis/bigquery/client.py
Outdated
@@ -189,28 +181,18 @@ def get_datasets(self):
         return list(self.client.list_datasets())

     def get_dataset(self, dataset_id):
-        return self.client.dataset(dataset_id)
+        dataset_ref = self.client.dataset(dataset_id)
+        return self.client.get_dataset(dataset_ref)
Excellent.
Sidenote: the previous version (based on 0.27 client) didn't actually make a GET call. To do that you would have had to also call .reload() on the Dataset object.
Update: Looking at my subsequent comments, maybe we don't actually want to GET a dataset resource? I guess the function name makes it sound like we should. Maybe what we really need is a def dataset_ref(self, dataset_id) function?
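To illustrate the reference-vs-resource distinction, a sketch against the 1.0.0 client (the dataset name is made up):

```python
# client.dataset() builds a DatasetReference locally; no API call.
dataset_ref = client.dataset('my_dataset')

# client.get_dataset() issues a GET and returns the full Dataset
# resource, with metadata such as description and labels.
dataset = client.get_dataset(dataset_ref)
```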
I've gone ahead and removed the BigQueryAPIProxy in favor of calls into the client. The additional level of indirection was unnecessary.
This made it easier to use refs as much as possible, so a dataset_ref method shouldn't be necessary anymore.
ibis/bigquery/client.py
Outdated
-        table = self.client.dataset(dataset_id).table(table_id)
-        if reload:
-            table.reload()
+        table_ref = self.get_dataset(dataset_id).table(table_id)
You don't actually have to call get_dataset() here. All you need is a reference, so self.client.dataset(dataset_id).table(table_id) should work. Then you're only doing 1 API call to GET the Table instead of 2 to GET the Dataset and the Table.
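A sketch of the one-call version, using the names from the surrounding code:

```python
# Both calls below build references locally; no network traffic yet.
table_ref = self.client.dataset(dataset_id).table(table_id)

# One GET fetches the Table resource (schema, partitioning, etc.).
table = self.client.get_table(table_ref)
```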
Excellent, I'll make this change.
ibis/bigquery/client.py
Outdated
-        query.run()
+        job_config = bq.job.QueryJobConfig()
+        job_config.query_parameters = query_parameters or []
+        job_config.use_legacy_sql = False
Since 0.28, the client has set standard SQL as the default, but okay to keep this line to be explicit about it.
Great. I will leave this here for now to be explicit.
ibis/bigquery/client.py
Outdated
         job_config = bq.job.QueryJobConfig()
         job_config.query_parameters = query_parameters or []
         job_config.use_legacy_sql = False
         query = self._proxy.client.query(stmt, job_config=job_config)

         # run_sync_query is not really synchronous: there's a timeout
You could call query.result() here to wait. It effectively does this, but with a little more retry logic. If you don't explicitly pass in a timeout to result() it will wait indefinitely as you are doing here.
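Something like this sketch, where `query` is the QueryJob created above:

```python
# result() blocks until the job finishes, retrying transient errors.
rows = query.result()

# Or bound the wait: raises concurrent.futures.TimeoutError if the
# job hasn't completed within 60 seconds.
rows = query.result(timeout=60)
```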
This sounds like a good idea.
ibis/bigquery/client.py
Outdated
@@ -381,11 +363,13 @@ def exists_table(self, name, database=None):

     def list_tables(self, like=None, database=None):
         dataset = self._proxy.get_dataset(database or self.dataset_id)
You don't actually need to make a GET call to fetch a Dataset resource. All you need to list is a reference, so you could do dataset = client.dataset(database or self.dataset_id)
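A sketch of the reference-only approach (note: if memory serves, `list_dataset_tables` is the 1.0.0 spelling; later releases rename it to `list_tables`):

```python
# A reference is enough; only the list call itself hits the API.
dataset_ref = client.dataset(database or self.dataset_id)

# In google-cloud-bigquery 1.0.0 the method is list_dataset_tables();
# later releases rename it to list_tables().
tables = list(client.list_dataset_tables(dataset_ref))
```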
Great, I'll make this change.
         assert fields
         names = [el.name for el in fields]
         ibis_types = [_discover_type(el) for el in fields]
         ibis_type = dt.Struct(names, ibis_types)
Note: to_dataframe() in the google-cloud-bigquery library just uses object as the datatype for record and array columns. (I think because Pandas doesn't have these datatypes) Will that cause a problem for Ibis? Is there actually a Pandas Struct and Array type now that we should be using in that library?
There's no struct/array type in pandas. This won't affect ibis, since we already assume this to be true.
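A quick illustration of what that looks like on the pandas side:

```python
import pandas as pd

# Nested records and arrays land in object-dtype columns; pandas
# has no native struct or array dtype to map them onto.
df = pd.DataFrame({'rec': [{'a': 1}, {'a': 2}], 'arr': [[1, 2], [3]]})
print(df.dtypes)
# rec    object
# arr    object
```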
👍
Would it make sense for google-cloud-bigquery to add a to_arrow() method like to_dataframe() so that these types have better support here?
ibis/bigquery/client.py
Outdated
         except BadRequest:
             pass
         if table.partitioning_type is not None:
             pairs.append((NATIVE_PARTITION_COL, dt.timestamp))
Note: this won't be correct for tables that use the new column-based time partitioning.
Unfortunately google-cloud-bigquery still doesn't have the partitioning field property (googleapis/google-cloud-python#4658). Whenever we add that there, this logic should change to only add the NATIVE_PARTITION_COL when the partitioning field is None.
Is there a surefire way (even if it's a workaround) to 1) tell if a table is partitioned, and 2) tell what column(s) are being used for partitioning?
Update: I see that there's some more info in the issue, I'll read through that :)
Force-pushed from e40386d to 2a59ddf
ibis/bigquery/client.py
Outdated
 class BigQueryClient(SQLClient):

     sync_query = BigQueryQuery
     database_class = BigQueryDatabase
     proxy_class = BigQueryAPIProxy
     table_class = BigQueryTable
The functionality to make this work isn't yet implemented. I've got a PR in the works to do this.
Force-pushed from 2a59ddf to d0bdbc7
@tswast This is ready for another review if you're up for it. In particular, take a look at these two functions (https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR257, https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR91), which are mucking around in `_properties`. I'll add some more tests to make sure this works with column-based partitions.
 def schema_from_pairs(lst):
     return Schema.from_tuples(lst)

-@schema.register((tuple, list), (tuple, list))
+@schema.register(collections.Iterable, collections.Iterable)
Nice!
Force-pushed from d0bdbc7 to 097abdb
BigQuery stuff looks great.
Yes, bq_table._properties.get('timePartitioning', None) is the correct workaround.
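Spelled out, the workaround reads the raw REST resource, since the property isn't exposed on the client objects yet:

```python
# _properties holds the raw Table resource returned by the REST API.
time_partitioning = bq_table._properties.get('timePartitioning', None)

if time_partitioning is not None:
    # Column-based partitioning sets 'field'; ingestion-time
    # partitioning leaves it unset, so _PARTITIONTIME applies.
    partition_field = time_partitioning.get('field')
```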
 def test_parted_column(client, kind, option, expected_fn):
     table_name = '{}_column_parted'.format(kind)
     option_key = 'bigquery.partition_col'
     with ibis.config.option_context(option_key, option):
Why does Ibis have an option to rename the partition column? Or does this do something else?
This option is a convenience to allow users to refer to _PARTITIONTIME, since referring to it without renaming generates an error in BigQuery. For column-based partitioning this isn't a problem, but for tables that are implicitly partitioned we want to do this.
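For illustration, a hedged sketch of how a user might exercise the option (the table and column names here are made up):

```python
import ibis

# Expose the implicit _PARTITIONTIME pseudo-column under a name
# that can be selected without an alias.
with ibis.config.option_context('bigquery.partition_col', 'part_time'):
    table = client.table('events')            # hypothetical table
    expr = table[table.part_time, table.value]
    df = expr.execute()
```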
@tswast regarding
Force-pushed from d91ae8d to fa7f0f8
Merging on green.
Upgrade to google-cloud-bigquery 1.0.0.

cc @tswast