
ENH: Convert read_gbq() function to use google-cloud-python #25

Merged: 43 commits, Dec 20, 2017

Conversation

@jasonqng (Contributor) commented Apr 8, 2017

Description

I've rewritten the current read_gbq() function using google-cloud-python, which handles the naming of structs and arrays out of the box. For more discussion about this, see: #23.

However, because google-cloud-python potentially uses different authentication flows and may break existing behavior, I've left the existing read_gbq() function in place and named this new function from_gbq(). If in the future we are able to reconcile the authentication flows and/or decide to deprecate flows that are not supported in google-cloud-python, we can rename this to read_gbq().

UPDATE: As requested by @jreback in this comment (https://github.com/pydata/pandas-gbq/pull/25/files/a763cf071813c836b7e00ae40ccf14e93e8fd72b#r110518161), I deleted the old read_gbq(), named my new function read_gbq(), and removed all legacy functions and code.

Added a few lines to the requirements file, but I'll leave it to you @jreback to deal with the conda dependency issues you mentioned in issue #23.

Let me know if you have any questions or if any tests need to be written. You can confirm that it works by running the following:

q = """
select ROW_NUMBER() over () row_num, struct(a,b) col, c, d, c*d c_times_d, e
from
(select * from
    (SELECT 1 a, 2 b, null c, 0 d, 100 e)
    UNION ALL
    (SELECT 5 a, 6 b, 0 c, null d, 200 e)
    UNION ALL
    (SELECT 8 a, 9 b, 10.0 c, 10 d, 300 e)
)
"""
df = gbq.read_gbq(q, dialect='standard')
df
row_num                 col     c     d  c_times_d    e
      2  {u'a': 5, u'b': 6}   0.0   NaN        NaN  200
      1  {u'a': 1, u'b': 2}   NaN   0.0        NaN  100
      3  {u'a': 8, u'b': 9}  10.0  10.0      100.0  300
q = """
select array_agg(a) mylist
from
(select "1" a UNION ALL select "2" a)
"""
df = gbq.read_gbq(q, dialect='standard')
df
mylist
[1, 2]
q = """
select array_agg(struct(a,b)) col, f
from
(select * from
    (SELECT 1 a, 2 b, null c, 0 d, 100 e, "hello" f)
    UNION ALL
    (SELECT 5 a, 6 b, 0 c, null d, 200 e, "ok" f)
    UNION ALL
    (SELECT 8 a, 9 b, 10.0 c, 10 d, 300 e, "ok" f)
)
group by f
"""
df = gbq.read_gbq(q, dialect='standard')
df
col f
[{u'a': 5, u'b': 6}, {u'a': 8, u'b': 9}] ok
[{u'a': 1, u'b': 2}] hello

Confirmed that col_order and index_col still work (feel free to pull that logic out into a separate function, since there is now redundant code with read_gbq()). I also removed the type conversion lines, which appear to be unnecessary: google-cloud-python and/or pandas seems to do the necessary type conversion automatically, even when there are nulls. You can confirm this by examining the dtypes of the resulting dataframes.
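For context, here is a minimal sketch of the query-to-DataFrame path this rewrite is built on, written against the current google-cloud-bigquery API (which is newer than the version discussed in this PR); the project id and query are placeholders, not code from the PR itself:

# Illustrative only: run a standard-SQL query and materialize it as a DataFrame.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project id
sql = "SELECT 1 AS a, STRUCT(2 AS b, 3 AS c) AS s"
query_job = client.query(sql)        # standard SQL is the default in this API
df = query_job.to_dataframe()        # STRUCT columns arrive as Python dicts
print(df.dtypes)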

@codecov-io commented Apr 8, 2017

Codecov Report

Merging #25 into master will decrease coverage by 43.58%.
The diff coverage is 7.84%.


@@             Coverage Diff             @@
##           master      #25       +/-   ##
===========================================
- Coverage   72.56%   28.97%   -43.59%     
===========================================
  Files           4        4               
  Lines        1578     1491       -87     
===========================================
- Hits         1145      432      -713     
- Misses        433     1059      +626
Impacted Files Coverage Δ
pandas_gbq/tests/test_gbq.py 27.98% <30%> (-54.36%) ⬇️
pandas_gbq/gbq.py 20.55% <6.29%> (-53.6%) ⬇️
pandas_gbq/_version.py 44.4% <0%> (+1.8%) ⬆️

Continue to review full report at Codecov.


@jasonqng force-pushed the read-gbq-google-cloud-library branch from 1381dd7 to a763cf0 on April 8, 2017 05:33
@@ -767,6 +770,116 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

return final_df

def from_gbq(query, project_id=None, index_col=None, col_order=None,
Contributor:

this should simply replace read_gbq, changing the top-level API is a non-starter.

Contributor Author:

Sure, I can rename mine to read_gbq and add in the query length/bytes processed info. Want me to delete the old read_gbq and related code or just rename it?

@jreback (Contributor) commented Apr 9, 2017

this needs to pass all of the original tests
it's simply an implementation change

@jasonqng force-pushed the read-gbq-google-cloud-library branch from 94065f3 to 2f93f7b on April 9, 2017 14:00
@jreback (Contributor) left a comment:

can you show the output of running the test suite.


Parameters
----------
query : str
SQL-Like Query to return data values
project_id : str
project_id : str (optional)
Contributor:

how is this optional?

Contributor Author:

project (str) – the project which the client acts on behalf of. Will be passed when creating a dataset / job. If not passed, falls back to the default inferred from the environment.

https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-client.html#google.cloud.bigquery.client.Client

Contributor Author:

Hm, doc still says that the project will be inferred from environment (https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery/client), but I don't think it does anymore in the latest version. Thus, project_id is now required again, which is incredibly annoying as most of the time, you're probably ok with the query being associated with the same project. Thoughts on whether we should allow the user to specify a default project env variable or other method? (~/.bigqueryrc looks like it can hold your default project_id, but I don't know if that's deprecated and/or a command-line only implementation: https://cloud.google.com/bigquery/bq-command-line-tool)

@MaximilianR @jreback
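One possible way to restore environment-based project inference (an illustrative sketch, not code from this PR) is to fall back on google-auth's default credentials, which also report a default project; the helper name here is made up:

import google.auth

def resolve_project_id(project_id=None):
    # Use the explicit argument when given, otherwise the environment default.
    if project_id is not None:
        return project_id
    # Raises DefaultCredentialsError if no credentials are configured.
    _credentials, default_project = google.auth.default()
    return default_project  # may be None if no default can be determined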

Google BigQuery Account project ID.
index_col : str (optional)
Name of result column to use for index in results DataFrame
col_order : list(str) (optional)
List of BigQuery column names in the desired order for results
DataFrame
reauth : boolean (default False)
Contributor:

is this now not needed? if so, then simply mark it as deprecated (and raise a warning if it's passed)

Contributor Author:

Yeah, not sure how to implement this, or whether such behavior can be replicated in the new API (or is even desired). If anyone has thoughts, let me know. Otherwise, if I can't, I'll do as you suggest and raise a warning if reauth is passed.
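A minimal sketch of the suggested deprecation path (hypothetical code, not what the PR ended up doing):

import warnings

def read_gbq(query, project_id=None, reauth=False, **kwargs):
    # Sketch: warn when a flag that no longer has an effect is passed.
    if reauth:
        warnings.warn(
            "reauth is deprecated and has no effect with the "
            "google-cloud-bigquery backend.",
            FutureWarning, stacklevel=2)
    ...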

dialect : {'legacy', 'standard'}, default 'legacy'
'legacy' : Use BigQuery's legacy SQL dialect.
'standard' : Use BigQuery's standard SQL (beta), which is
compliant with the SQL 2011 standard. For more information
see `BigQuery SQL Reference
<https://cloud.google.com/bigquery/sql-reference/>`__

Contributor:

why the change here?

Contributor Author:

From what I can gather from the discussion here (googleapis/google-cloud-python#2765), passing an arbitrary JSON of configuration settings isn't supported the way it was in the previous Python API. As such, we might as well make passing configuration settings a little easier with a dict like this, but I'm happy to consider alternatives.
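As a rough illustration of the dict idea, here is how a few configuration keys could be mapped onto the current google-cloud-bigquery QueryJobConfig (the keys shown are examples, not an agreed interface, and QueryJobConfig is newer than the 0.25.x API discussed in this thread):

from google.cloud import bigquery

def make_job_config(configuration=None):
    # Sketch: translate a plain dict of settings into a QueryJobConfig.
    config = bigquery.QueryJobConfig()
    configuration = configuration or {}
    if "useLegacySql" in configuration:
        config.use_legacy_sql = configuration["useLegacySql"]
    if "maximumBytesBilled" in configuration:
        config.maximum_bytes_billed = int(configuration["maximumBytesBilled"])
    return config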

Collaborator:

FYI: this is one reason I'd prefer to use google-auth library directly. #26

Contributor:

@tswast @jasonqng
You can get a Credentials object from the json with google.oauth2.service_account.Credentials.from_service_account_info(json.loads(key))

Forgive me for not PRing this sort of thing in directly - I'm totally jammed at the moment (forgive me @jreback too...). But happy to help with any questions on this stuff - we've run through a lot of it over here
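Wrapped up as a helper, the suggestion above looks roughly like this (the helper name is made up here, and key_json must be the contents of a real service-account key):

import json
from google.cloud import bigquery
from google.oauth2 import service_account

def client_from_key_json(key_json, project_id):
    # Build credentials from in-memory JSON key contents, then a BigQuery client.
    credentials = service_account.Credentials.from_service_account_info(
        json.loads(key_json))
    return bigquery.Client(project=project_id, credentials=credentials)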

Contributor Author:

@MaximilianR Solved the auth with a key in the latest commit. Is this other concern about the config settings still an open issue to be resolved in this PR?

Contributor:

I think this is solved now, thank you @jasonqng

@jreback (Contributor) commented Apr 9, 2017

@jasonqng we have to be careful about back-compat here.

@jreback (Contributor) commented Apr 9, 2017

you need to update the ci/requirements-*.pip files, removing the old requirements and adding the new

@jasonqng changed the title from "Add new from_gbq() function using google-cloud-python" to "Convert read_gbq() function to use google-cloud-python" on Apr 10, 2017
@jasonqng (Contributor Author):

@jreback Yeah, the back-compatibility issues with authentication are partly why I suggested writing it as a new function, but hopefully we can replicate a form of the pop-up authentication -> refresh token flow with the new API (https://googlecloudplatform.github.io/google-cloud-python/stable/google-cloud-auth.html#user-accounts-3-legged-oauth-2-0-with-a-refresh-token). I might need some help with that if others are more familiar with it. Almost everything else should carry over, so I'm not too concerned about compatibility otherwise.

if dialect not in ('legacy', 'standard'):
    raise ValueError("'{0}' is not valid for dialect".format(dialect))
if private_key:
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = private_key

Not sure if it is nice to set environ vars, can't you just include a custom reference to the private_key, if it exists, when creating the Client on row 540?

Contributor Author:

Good suggestion, will do.

Contributor Author:

@thejens Fixed and verified it works with my service account json key.
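For reference, a sketch of the alternative discussed in this thread: hand the key to the Client directly instead of mutating GOOGLE_APPLICATION_CREDENTIALS (the helper name and path handling are illustrative):

from google.cloud import bigquery
from google.oauth2 import service_account

def make_client(project_id, private_key_path=None):
    # When a key file is given, build explicit credentials; otherwise let the
    # client fall back to application default credentials.
    credentials = None
    if private_key_path:
        credentials = service_account.Credentials.from_service_account_file(
            private_key_path)
    return bigquery.Client(project=project_id, credentials=credentials)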

requirements.txt Outdated
@@ -2,3 +2,5 @@ pandas
httplib2
google-api-python-client
oauth2client
google-cloud

google-cloud is not the most stable package (I've noticed), I'd require a specific version

Contributor:

that's fine, you can pin a specific version with pip

``google-cloud==0.24.0`` (or whatever)

Contributor:

You can also be more specific and only require the bigquery library with google-cloud-bigquery: https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery

Collaborator:

+1 to using google-cloud-bigquery.

Also, the google-cloud-bigquery package is still beta, meaning there will likely be breaking changes, so we'd have to pin a very specific version number to maintain stability.

Contributor Author:

Fixed and pinned google-cloud-bigquery to 0.25.0, since 0.26.0 breaks things.

@@ -603,106 +458,56 @@ def delete_and_recreate_table(self, dataset_id, table_id, table_schema):
table.create(table_id, table_schema)
sleep(delay)


def _parse_data(schema, rows):
Contributor:

In test_gbq.py, there are several tests that have gbq._parse_data(...). Could you update test_gbq.py as well? See the tests with prefix test_should_return_bigquery*(...).

Contributor Author:

Yup, will tackle this weekend hopefully. Thanks!

@tswast (Collaborator) commented Apr 21, 2017

I recommend closing this PR.

The google-cloud-bigquery library is not yet 1.0. Breaking changes are very likely as we (Google) get the libraries in good shape for a 1.0 release (no specific timeline on that yet). There are also a few things yet-to-be-implemented in google-cloud-bigquery such as retry logic, which are in the "API client library".

We should revisit this after the google-cloud-bigquery library goes 1.0.

@jasonqng (Contributor Author):

@jreback In light of @tswast's comment, should we just close this, or should we go back to building this as a separate function (e.g. from_gbq()) as I had it before? We could clearly mark it as experimental. For my own selfish sake, I'd vote for the latter just so I can start using it in the meantime before the google-cloud API goes 1.0, but I obviously defer to others here.

@jreback (Contributor) commented Apr 21, 2017

we can easily just pin to a specific version if API stability is a concern, but in general I don't see this as a big deal

no reason to wait for a 1.0

Pandas itself is not even 1.0 and lots of people use / depend on it

@parthea added this to the 0.2.0 milestone on Apr 22, 2017
By default "application default credentials" are used.

If default application credentials are not found or are restrictive,
user account credentials are used. In this case, you will be asked to
Collaborator:

Are you sure you want to remove the user account credentials method? I believe it can be made to work with the google-cloud-bigquery library. I'm working on adding a sample to the Google docs that does the user-auth flow.

Contributor Author:

That would be awesome. Do feel free to point to the doc when it's added. Sorry, been a bit distracted with work but hope to address all comments next week (and agree with @jreback's comment to move forward with this). Thanks.

Collaborator:

GoogleCloudPlatform/python-docs-samples#925

That pull request adds a sample which uses https://pypi.python.org/pypi/google-auth-oauthlib to run a query with user account credentials.

I'll be writing some tests and docs around it, but the code should be easy enough to follow.
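The shape of that user-credentials flow, sketched here with google-auth-oauthlib (the client_config contents are placeholders, and run_local_server opens a browser for consent):

from google.cloud import bigquery
from google_auth_oauthlib.flow import InstalledAppFlow

def user_credentials_client(client_config, project_id):
    # Run the 3-legged OAuth flow locally and build a client from the result.
    flow = InstalledAppFlow.from_client_config(
        client_config,
        scopes=["https://www.googleapis.com/auth/bigquery"])
    credentials = flow.run_local_server()
    return bigquery.Client(project=project_id, credentials=credentials)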

Contributor Author:

@tswast The latest commit uses the existing GbqConnector to generate appropriate credentials. Let me know if there's anything inappropriate about the way I've implemented it.

@tswast (Collaborator) commented May 23, 2017

@jasonqng You'll probably want to rebase this change after #39 gets merged. The google-auth library is used by both google-api-python-client and google-cloud-python, so you should be able to reuse my changes for user and service account authentication.

@parthea changed the title from "Convert read_gbq() function to use google-cloud-python" to "ENH: Convert read_gbq() function to use google-cloud-python" on Jun 13, 2017
@jreback removed this from the 0.2.0 milestone on Jul 7, 2017
@jreback (Contributor) commented Jul 7, 2017

is this PR relevant after #39 ?

@tswast

@parthea (Contributor) commented Jul 7, 2017

We still have #23 as an open issue. It would be great to move this forward to address that. I recently added a conda recipe for google-cloud-bigquery.
https://github.com/conda-forge/google-cloud-bigquery-feedstock

@jreback (Contributor) commented Jul 7, 2017

@parthea ok is it worth waiting on this for 0.2.0 to avoid even more changes?

@parthea (Contributor) commented Jul 7, 2017

My initial thought is that the milestone for this PR should be 0.3.0 as thorough testing is required. I think ultimately it depends on how soon we can begin testing this PR and whether we are in a hurry to release 0.2.0.

@jasonqng Please could you rebase?

@tswast (Collaborator) commented Jul 8, 2017

Yeah, this PR is still relevant. #39 moves pandas-gbq to use google-auth but still uses the Google API client libraries. This PR moves it to the Google Cloud client libraries (which incidentally also use google-auth, so rebasing would be really helpful in completing the work required to get this working properly).

@jasonqng (Contributor Author):

@tswast @parthea @jreback Sorry, I've been swamped these past few months. I hope to scratch out some time this week to incorporate all comments (and also get it working with queries that have large results, which it currently fails on). Just checking, any particular reason to rebase vs. merge? Happy to do the former, I just haven't done a rebase on any collaborative project, so this would be a first. (Haha, worst case scenario, I mess up my branch and just rewrite and open a new PR branched off the new master.)

@parthea (Contributor) commented Jul 11, 2017

@jasonqng Thanks for taking care of this! Rebase is preferred because it will allow you to add commits on top of the latest master which is much nicer to look at during code review.

Also says which libraries are no longer required, for easier upgrades.
@tswast (Collaborator) commented Nov 30, 2017

@jreback Could you take a look at the docs changes I made? I've documented both the dependencies for 0.3.0 and what they were before (with notes on how they've changed)

@tswast (Collaborator) left a comment:

This change LGTM, but since I made some contributions to this one I'd like one of the other maintainers to also review before we merge it.

@@ -781,6 +656,14 @@ def verify_schema(self, dataset_id, table_id, schema):
key=lambda x: x['name'])
fields_local = sorted(schema['fields'], key=lambda x: x['name'])

# Ignore mode when comparing schemas.
for field in fields_local:
if 'mode' in field:
Contributor:

not worth changing, but this could be marginally simpler as field.pop('mode', None)
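The suggested simplification, shown on a toy schema for reference:

fields_local = [{'name': 'a', 'type': 'INTEGER', 'mode': 'NULLABLE'},
                {'name': 'b', 'type': 'STRING'}]
for field in fields_local:
    field.pop('mode', None)   # no-op when 'mode' is absent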

dtype_map.get(field['type'].upper(), object)
for field in fields
]
print(fields)
Contributor:

Do we want to print all the fields here?

If we do: elsewhere, this uses self._print, which can transition to logging at some point.

Collaborator:

Oops. Good catch. I added that to debug the "ignore mode when comparing schemas" logic. Removed.

tableId=table_id,
body=body).execute()
except HttpError as ex:
self.client.load_table_from_file(
Contributor:

This is so much better than the existing method

Collaborator:

Yeah, it's technically a change in behavior (kicks off a load job instead of using the streaming API), but I think the change is small enough to be worth it. Load jobs should be much more reliable for the use case of this library.
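A sketch of the load-job path being discussed, written against the current google-cloud-bigquery API rather than the exact code in this PR (table id and file path are placeholders):

from google.cloud import bigquery

def load_ndjson(client, destination_table, path):
    # destination_table: a "project.dataset.table" string (current API).
    # Kick off a load job from a newline-delimited JSON file and wait for it.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    with open(path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, destination_table, job_config=job_config)
    return job.result()  # blocks until the load job finishes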

rows.append(row_dict)
row_json = row.to_json(
force_ascii=False, date_unit='s', date_format='iso')
rows.append(row_json)
@max-sixty (Contributor) commented Dec 6, 2017:

This isn't worse than the last version, but this would be much faster if .to_json was called on the whole table rather than each row, iterating in python

CSV might be even faster given the reduced space (and pandas can't use nesting or structs anyway). But potentially wait until parquet is GA to make the jump
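For illustration, the whole-table serialization being suggested: a single to_json call producing newline-delimited JSON, instead of looping over rows in Python:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# One call for the whole frame; each line is a JSON record, ready for a load job.
ndjson = df.to_json(orient="records", lines=True,
                    force_ascii=False, date_format="iso", date_unit="s")
print(ndjson)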

Collaborator:

I'd prefer to keep the current behavior for now and do a subsequent PR for any changes like this for performance improvements. I've filed #96 to track the work for speeding up encoding for the to_gbq() method.

Contributor:

100% re doing that separately

field_type)
for row_num, entries in enumerate(rows):
for col_num in range(len(col_types)):
field_value = entries[col_num]
Contributor:

I don't think I'd realized we were looping over all the values in python before. This explains a lot of why exporting a query to a file on GCS and then reading from that file is an order of magnitude faster.

If we could pass rows directly into DataFrame that would be much faster, but I'm not sure if that's possible
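Roughly what "pass rows directly into DataFrame" could look like (column names and dtypes here are illustrative, not the library's actual schema handling):

import pandas as pd

rows = [(1, "a"), (2, "b")]                      # e.g. row tuples from the API
col_names = ["id", "label"]
col_dtypes = {"id": "int64", "label": "object"}

# Build the frame in one shot, then coerce dtypes per column.
df = pd.DataFrame.from_records(rows, columns=col_names).astype(col_dtypes)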

Collaborator:

I've filed #97 to track improving the performance in the read case.

field_type)
for row_num, entries in enumerate(rows):
for col_num in range(len(col_types)):
field_value = entries[col_num]
Contributor:

Because this is being called so many times, you may even get a small speed up from eliminating assigning to field_value

(but these are all things that are either the same or better than the existing version)

self._print('Standard price: ${:,.2f} USD\n'.format(
bytes_processed * self.query_price_for_TB))
bytes_billed * self.query_price_for_TB))

self._print('Retrieving results...')
@max-sixty (Contributor) commented Dec 6, 2017:

Presumably this is never going to be relevant because the prior part is blocking?

Collaborator:

Yes and no. I think it indeed is less relevant, but actually fetching the rows and constructing the dataframe will take non-zero time, especially for larger result sets.

@max-sixty (Contributor) commented Dec 6, 2017

I took a proper read through, though needs someone like @jreback to approve

I think this strictly dominates the existing version. There are a couple of extremely small tweaks that we can do in a follow-up if not now.

There are also some areas for huge speed-ups - IIUC the code is currently running through each value in python atm.

In line with that: we've built a function for exporting to a file to GCS and loading that in, which works much better for > 1-2m rows. We can do a PR for that if people are interested, in addition to speeding up the current path.

@max-sixty (Contributor):

@jreback I know there's lots going on in pandas, but would be super if you could take a glance at this. A few follow-ups are dependent on this merging.

Thanks v much

@jreback (Contributor) commented Dec 12, 2017

sure will look

@jreback (Contributor) left a comment:

lgtm. some small doc comments that I would do. better to be over explanatory in the whatsnew.

0.3.0 / 2017-??-??
------------------

- Use the `google-cloud-bigquery <https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html>`__ library for API calls instead of ``google-api-client`` and ``httplib2``. (:issue:`93`)
Contributor:

needs a more comprehensive note here. show what you used to import / depend on, and what it should be now.

Contributor:

you can link to install.rst as well

@@ -181,14 +158,6 @@ class QueryTimeout(ValueError):
pass


class StreamingInsertError(ValueError):
Contributor:

mention that this is eliminated in whatsnew

@max-sixty (Contributor):

Congrats @jasonqng & @tswast !

@tswast (Collaborator) commented Dec 20, 2017

Thanks! A couple things to clean up before we make a release. I'd like to add a couple tests for some of the other issues we think this PR might address.

Plus I'm not sure it makes sense to do a release right before the holidays.

@jasonqng (Contributor Author):

TY @tswast for carrying this across the finish line!!! I learned so much watching you and @jreback mold this. Can't wait to stop having to rely on my janky fork and install the new version!

Labels: type: feature request ('Nice-to-have' improvement, new feature or different behavior or design)

8 participants