
ENH: Upgrade google-cloud-bigquery dependency to 1.0.0 #1424

Closed. cpcloud wants to merge 1 commit into master from cpcloud:bigquery-1.0.0-upgrade.
cpcloud (Member) commented Apr 14, 2018

Upgrade to google-cloud-bigquery 1.0.0

cc @tswast

@cpcloud cpcloud added this to the 0.14 milestone Apr 14, 2018

@cpcloud cpcloud added this to To do in BigQuery via automation Apr 14, 2018

@cpcloud cpcloud self-assigned this Apr 14, 2018

@tswast

Thanks! A few comments, but overall looks good.

@@ -106,11 +100,13 @@ def __init__(self, query):
         self.query = query

     def fetchall(self):
-        return list(self.query.fetch_data())
+        result = self.query.result()

tswast (Contributor) commented Apr 15, 2018

Note: the QueryJob is itself an iterator, and will call result() under the covers as soon as you try to iterate over it. No need to update, it's fine to call result() to be more explicit that the code will block to wait for the query to complete.
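(For illustration, a minimal sketch of the two equivalent spellings; client is a hypothetical google-cloud-bigquery Client:)

    job = client.query('SELECT 1 AS x')
    rows = list(job)           # iterating the job calls result() under the covers
    rows = list(job.result())  # explicit: block until the query completes, then iterate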

cpcloud (Member) commented Apr 15, 2018

Cool, good to know. Thanks!

@@ -189,28 +181,18 @@ def get_datasets(self):
         return list(self.client.list_datasets())

     def get_dataset(self, dataset_id):
-        return self.client.dataset(dataset_id)
+        dataset_ref = self.client.dataset(dataset_id)
+        return self.client.get_dataset(dataset_ref)

tswast (Contributor) commented Apr 15, 2018

Excellent.

Sidenote: the previous version (based on 0.27 client) didn't actually make a GET call. To do that you would have had to also call .reload() on the Dataset object.

Update: Looking at my subsequent comments, maybe we don't actually want to GET a dataset resource? I guess the function name makes it sound like we should. Maybe what we really need is a def dataset_ref(self, dataset_id) function?
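(A minimal sketch of the reference-vs-GET distinction being discussed; names are hypothetical:)

    dataset_ref = client.dataset('my_dataset')  # builds a DatasetReference locally, no API call
    dataset = client.get_dataset(dataset_ref)   # issues a GET for the full Dataset resource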

cpcloud (Member) commented Apr 15, 2018

I've gone ahead and removed the BigQueryAPIProxy in favor of calls into the client. The additional level of indirection was unnecessary.

cpcloud (Member) commented Apr 15, 2018

This made it easier to use refs as much as possible, so a dataset_ref method shouldn't be necessary anymore.

-        table = self.client.dataset(dataset_id).table(table_id)
-        if reload:
-            table.reload()
+        table_ref = self.get_dataset(dataset_id).table(table_id)

tswast (Contributor) commented Apr 15, 2018

You don't actually have to call get_dataset() here. All you need is a reference, so self.client.dataset(dataset_id).table(table_id) should work. Then you're only doing 1 API call to GET the Table instead of 2 to GET the Dataset and the Table.
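(A minimal sketch of the suggested single-GET pattern; client and identifiers are hypothetical:)

    table_ref = client.dataset(dataset_id).table(table_id)  # reference only, no API call
    table = client.get_table(table_ref)                     # one GET for the Table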

cpcloud (Member) commented Apr 15, 2018

Excellent, I'll make this change.

-        query.run()
+        job_config = bq.job.QueryJobConfig()
+        job_config.query_parameters = query_parameters or []
+        job_config.use_legacy_sql = False

tswast (Contributor) commented Apr 15, 2018

Since 0.28, the client has set standard SQL as the default, but okay to keep this line to be explicit about it.

cpcloud (Member) commented Apr 15, 2018

Great. I will leave this here for now to be explicit.

+        job_config = bq.job.QueryJobConfig()
+        job_config.query_parameters = query_parameters or []
+        job_config.use_legacy_sql = False
+        query = self._proxy.client.query(stmt, job_config=job_config)
         # run_sync_query is not really synchronous: there's a timeout

tswast (Contributor) commented Apr 15, 2018

You could call query.result() here to wait. It effectively does this, but with a little more retry logic. If you don't explicitly pass in a timeout to result() it will wait indefinitely as you are doing here.
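(A minimal sketch of both variants; the 60-second timeout is a hypothetical value:)

    query = client.query(stmt, job_config=job_config)
    rows = query.result()            # waits indefinitely, with retry logic
    rows = query.result(timeout=60)  # raises concurrent.futures.TimeoutError if not done in 60s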

cpcloud (Member) commented Apr 15, 2018

This sounds like a good idea.

@@ -381,11 +363,13 @@ def exists_table(self, name, database=None):

     def list_tables(self, like=None, database=None):
+        dataset = self._proxy.get_dataset(database or self.dataset_id)

tswast (Contributor) commented Apr 15, 2018

You don't actually need to make a GET call to fetch a Dataset resource; all you need in order to list tables is a reference, so you could do dataset = client.dataset(database or self.dataset_id)

cpcloud (Member) commented Apr 15, 2018

Great, I'll make this change.

+    assert fields
+    names = [el.name for el in fields]
+    ibis_types = [_discover_type(el) for el in fields]
+    ibis_type = dt.Struct(names, ibis_types)
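(For illustration, a hypothetical nested field and roughly what the snippet above derives from it:)

    from google.cloud import bigquery

    field = bigquery.SchemaField(
        'point', 'RECORD',
        fields=(
            bigquery.SchemaField('x', 'FLOAT'),
            bigquery.SchemaField('y', 'FLOAT'),
        ),
    )
    # _discover_type(field) would yield roughly dt.Struct(['x', 'y'], [dt.double, dt.double])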

tswast (Contributor) commented Apr 15, 2018

Note: to_dataframe() in the google-cloud-bigquery library just uses object as the dtype for record and array columns (I think because pandas doesn't have these datatypes). Will that cause a problem for Ibis? Is there actually a pandas Struct or Array type now that we should be using in that library?

cpcloud (Member) commented Apr 15, 2018

There's no struct/array type in pandas. This won't affect ibis, since we already assume this to be true.

tswast (Contributor) commented Apr 16, 2018

👍

Would it make sense for google-cloud-bigquery to add a to_arrow() method like to_dataframe() so that these types have better support here?

         except BadRequest:
             pass
+        if table.partitioning_type is not None:
+            pairs.append((NATIVE_PARTITION_COL, dt.timestamp))

tswast (Contributor) commented Apr 15, 2018

Note: this won't be correct for tables that use the new column-based time partitioning.

Unfortunately, google-cloud-bigquery doesn't have the partitioning field property yet (googleapis/google-cloud-python#4658). Whenever we add it there, this logic should change to add NATIVE_PARTITION_COL only when the partitioning field is None.

cpcloud (Member) commented Apr 15, 2018

Is there a surefire way (even if it's a workaround) to 1) tell whether a table is partitioned, and 2) find out which column(s) are used for partitioning?

cpcloud (Member) commented Apr 15, 2018

Update: I see that there's some more info in the issue; I'll read through that :)

@cpcloud cpcloud force-pushed the cpcloud:bigquery-1.0.0-upgrade branch 2 times, most recently from e40386d to 2a59ddf Apr 15, 2018

class BigQueryClient(SQLClient):
    sync_query = BigQueryQuery
    database_class = BigQueryDatabase
    proxy_class = BigQueryAPIProxy
    table_class = BigQueryTable

cpcloud (Member) commented Apr 15, 2018

The functionality to make this work isn't yet implemented. I've got a PR in the works to do this.

@cpcloud cpcloud force-pushed the cpcloud:bigquery-1.0.0-upgrade branch from 2a59ddf to d0bdbc7 Apr 15, 2018

cpcloud (Member) commented Apr 15, 2018

@tswast This is ready for another review if you're up for it. In particular, see these two functions (https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR257, https://github.com/ibis-project/ibis/pull/1424/files#diff-8e85aac896fd8a71d7151fe44de24b5fR91), which muck around in _properties to figure out partition fields.

I'll add some more tests to make sure this works with column based partitions.

 def schema_from_pairs(lst):
     return Schema.from_tuples(lst)

-@schema.register((tuple, list), (tuple, list))
+@schema.register(collections.Iterable, collections.Iterable)

kszucs (Member) commented Apr 15, 2018

Nice!

@cpcloud cpcloud force-pushed the cpcloud:bigquery-1.0.0-upgrade branch from d0bdbc7 to 097abdb Apr 16, 2018

tswast approved these changes Apr 16, 2018

BigQuery stuff looks great.

Yes, bq_table._properties.get('timePartitioning', None) is the correct workaround.
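(A minimal sketch of that workaround; bq_table is a hypothetical google-cloud-bigquery Table:)

    # The 1.0.0 client doesn't expose the partitioning field yet, so read the
    # raw resource properties.
    time_partitioning = bq_table._properties.get('timePartitioning', None)
    if time_partitioning is not None:
        # 'field' is absent for ingestion-time (implicit) partitioning, where the
        # _PARTITIONTIME pseudo-column is used instead of a named column.
        partition_field = time_partitioning.get('field')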

+def test_parted_column(client, kind, option, expected_fn):
+    table_name = '{}_column_parted'.format(kind)
+    option_key = 'bigquery.partition_col'
+    with ibis.config.option_context(option_key, option):

tswast (Contributor) commented Apr 16, 2018

Why does Ibis have an option to rename the partition column? Or does this do something else?

cpcloud (Member) commented Apr 16, 2018

This option is a convenience that allows users to refer to _PARTITIONTIME, since referring to it without renaming generates an error in BigQuery. For column-based partitioning this isn't a problem, but for tables that are implicitly partitioned we want the rename.
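(For example, a hedged sketch of using the option; the replacement column name is hypothetical:)

    import ibis

    # Rename the _PARTITIONTIME pseudo-column so expressions can refer to it directly.
    with ibis.config.option_context('bigquery.partition_col', 'part_time'):
        ...  # tables accessed here expose the partition column as 'part_time'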

cpcloud (Member) commented Apr 18, 2018

@tswast regarding to_arrow, I think that's a great idea. I'll open an issue over on the BigQuery Python tracker.

@cpcloud cpcloud force-pushed the cpcloud:bigquery-1.0.0-upgrade branch from d91ae8d to fa7f0f8 Apr 18, 2018

cpcloud (Member) commented Apr 18, 2018

Merging on green.

@cpcloud cpcloud closed this in 63c9274 Apr 18, 2018

BigQuery automation moved this from To do to Done Apr 18, 2018

@cpcloud cpcloud deleted the cpcloud:bigquery-1.0.0-upgrade branch Apr 18, 2018
