Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103

Closed
jasonqng opened this issue Jan 2, 2018 · 17 comments

Comments

@jasonqng
Contributor

jasonqng commented Jan 2, 2018

One frustrating thing is having to pass the project_id (among other parameters) every time you write a query. For example, I usually use the same project_id, almost always query with standard SQL, and usually turn off verbose. I have to pass those three with every read_gbq call, and the typing adds up.

Potential options include setting environment variables and reading the defaults from them, but the settings can differ from one session to the next, and fiddling with environment variables feels unfriendly. My suggestion would be to add a class that wraps read_gbq() and to_gbq() in a client object. You could set project_id, dialect, and whatever else as attributes on the client object, then reuse the object every time you want a query with those settings.

A very naive implementation here in this branch:
https://github.com/pydata/pandas-gbq/compare/master...jasonqng:client-object-class?expand=1

Usage would be like:

>>> import gbq
>>> client = gbq.Client(project_id='project-name', dialect='standard', verbose=False)
>>> client.read("select 1")
   f0_
0    1
>>> client.read("select 2")
   f0_
0    2
>>> client.verbose = True
>>> client.read("select 3")
Requesting query... ok.
Job ID: c7d7e4c0-883a-4e14-b35f-61c9fae0c08b
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 1 rows.

Total time taken 1.66 s.
Finished at 2018-01-02 14:06:01.
   f0_
0    3
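
For readers who do not want to open the branch, here is a minimal sketch of what such a wrapper could look like. It is not the code from the linked branch: the class body, the `read_fn` injection hook, and the `**overrides` behaviour are assumptions made for illustration, and `verbose` is passed only because read_gbq accepted it at the time of this thread (later releases may differ).

```python
class Client:
    """Hypothetical wrapper that remembers default query settings."""

    def __init__(self, project_id, dialect="standard", verbose=False, read_fn=None):
        self.project_id = project_id
        self.dialect = dialect
        self.verbose = verbose
        # read_fn is an assumption of this sketch: it lets the class be
        # exercised without BigQuery access by injecting a stand-in function.
        self._read_fn = read_fn

    def read(self, query, **overrides):
        read_fn = self._read_fn
        if read_fn is None:
            # Imported lazily so the class can be defined without pandas-gbq.
            import pandas_gbq
            read_fn = pandas_gbq.read_gbq
        # Stored attributes act as defaults; per-call keywords win.
        settings = {
            "project_id": self.project_id,
            "dialect": self.dialect,
            "verbose": self.verbose,
        }
        settings.update(overrides)
        return read_fn(query, **settings)
```

Because the settings are read from the attributes at call time, assigning `client.verbose = True` between queries takes effect on the next `read`, which matches the session shown above.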

Does that seem like a reasonable solution to all this extra typing or is there another preferred way? If so, I can open up a PR with the above branch.

Thanks, my tired fingers thank you all!

@tswast @jreback @parthea @maxim-lian

@max-sixty
Contributor

Can you go into more detail on why the env variables aren't sufficient? The Google libraries generally support GOOGLE_CLOUD_PROJECT natively, so I'd vote to use that (or even to get out of the way and let the underlying Google libraries read it).

@jasonqng jasonqng changed the title Set project_id once for all subsequent queries so you don't have to pass every time Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time Jan 2, 2018
@tswast
Collaborator

tswast commented Jan 3, 2018

I agree that there are times when it is more convenient to set the project once in code versus using an environment variable. For example, users in a hosted notebook environment like https://colab.research.google.com wouldn't be able to set an environment variable.

@tswast
Collaborator

tswast commented Jan 3, 2018

Question for the Pandas folks: is there a standard way to set default / config options in Pandas? For example, Flask has a config object/dictionary.

@max-sixty
Contributor

I just ran this in colab
image

@max-sixty
Contributor

There are pandas configs with pd.set_option, but it's not used nearly as often as config is in an application framework (e.g. Flask / Airflow / Celery).

@jasonqng
Contributor Author

jasonqng commented Jan 3, 2018

I hear you that environment variables are possible, but they just don't feel very user friendly. And while GOOGLE_CLOUD_PROJECT might be used by other libraries or elsewhere, some of the possible environment variables (for instance the query dialect) would probably never be used outside of pandas-gbq, which makes them weird to set at the system level.

Yeah, the way pandas uses set_option might be a good comparable. We could have a separate function called set_options which saves these gbq-level settings. The downside, I guess, is having to go through the existing code and point it at these new settings. The upside of the new wrapper class above is that it doesn't touch any existing code, but then again, I'm not sure which would be more maintainable in the future.
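
To make the set_options idea concrete, here is a rough sketch of a module-level option store. Every name in it (`set_options`, `get_option`, `with_defaults`, and the option keys) is hypothetical, and the initial values are only placeholders for illustration; nothing here is an existing pandas-gbq API.

```python
# Hypothetical option store; the keys and initial values are placeholders.
_DEFAULTS = {"project_id": None, "dialect": "legacy", "verbose": True}
_options = dict(_DEFAULTS)


def set_options(**kwargs):
    """Save gbq-level settings, rejecting names that are not known options."""
    unknown = set(kwargs) - set(_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    _options.update(kwargs)


def get_option(name):
    """Return the currently stored value for one option."""
    return _options[name]


def with_defaults(**call_kwargs):
    """Merge call arguments over stored options; None means 'not passed'."""
    merged = dict(_options)
    merged.update({k: v for k, v in call_kwargs.items() if v is not None})
    return merged
```

The cost mentioned above shows up in `with_defaults`: each existing function would have to call something like it before using its arguments, which is the "go through the code" step a wrapper class avoids.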

Either way, we should probably come up with some sort of solution for these common settings, be it an environment variable (which we don't currently handle), a set_options function, or a wrapper class. Wasted code aside, even on a purely aesthetic level, scrolling through notebooks with all these project_id and dialect parameters repeated is ugly and fundamentally unnecessary.

@tswast
Collaborator

tswast commented Jan 3, 2018

An intermediate step we could take is to make the project_id parameter optional when we are able to get it from the google-auth library. I believe @jasonqng originally had this logic in #25, but unfortunately I dropped that code when I updated that PR to use the 0.28 version of the google-cloud-bigquery library.

@max-sixty
Contributor

max-sixty commented Jan 3, 2018

make the project_id parameter optional when we are able to get it from the google-auth library

+1

@jasonqng
Contributor Author

jasonqng commented Jan 3, 2018

+1 also to getting project_id from google-auth for now. Just make sure that setting GOOGLE_CLOUD_PROJECT (or whatever) via an environment variable in a script overrides whatever google-auth figures out on its own. This way notebooks are reproducible with minimal tweaks in case querying from or to a project needs specific project permissions.
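
The precedence being asked for (explicit argument, then environment variable, then whatever google-auth infers) can be sketched as below. The function name `resolve_project_id` and its `auth_default` and `environ` parameters are hypothetical; `auth_default` is a stand-in callable for the google-auth lookup so the sketch runs without credentials.

```python
import os


def resolve_project_id(explicit=None, auth_default=None, environ=None):
    """Pick a project ID: explicit argument > env variable > auth fallback."""
    environ = os.environ if environ is None else environ

    # 1. A project_id passed by the caller always wins.
    if explicit:
        return explicit

    # 2. Otherwise honour the environment variable set in the script.
    env_project = environ.get("GOOGLE_CLOUD_PROJECT")
    if env_project:
        return env_project

    # 3. Finally, fall back to the stand-in for the google-auth lookup.
    if auth_default is not None:
        return auth_default()
    return None
```

In a real integration the fallback would presumably wrap `google.auth.default()`, which returns a `(credentials, project_id)` pair; that wiring is left out here.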

Still need to figure out where dialect, verbose, and other settings live, but this would be a great first step!

@max-sixty
Contributor

If we want to reuse more than project_id, one option would be to house config options in a GBQConnector class, which would be an optional parameter to pass into gbq functions

@tswast
Collaborator

tswast commented Feb 23, 2018

#127 is the first step in this.

@tswast
Collaborator

tswast commented Apr 4, 2018

GBQConnector feels right, but I don't think we cache this at the module-level, do we?

Any solution we come up with for defaults should also work with the common way of accessing this library via pandas.read_gbq() and pandas.DataFrame.to_gbq().

Thinking aloud: what if we had a global Context object which has dictionaries for default options to read_gbq and to_gbq (and maybe some shared args like private_key)?
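
As a way of picturing that idea, here is a small sketch of a global context holding per-function default dictionaries plus shared arguments. The class layout and the `kwargs_for` helper are assumptions for illustration only and should not be read as the design that was eventually implemented.

```python
class Context:
    """Hypothetical holder for process-wide defaults (illustration only)."""

    def __init__(self):
        # Arguments common to both functions, e.g. private_key.
        self.shared = {}
        # Separate default dictionaries for each public function.
        self.defaults = {"read_gbq": {}, "to_gbq": {}}

    def kwargs_for(self, function_name, **explicit):
        """Layer shared args, per-function defaults, then explicit args."""
        merged = {**self.shared, **self.defaults[function_name]}
        merged.update(explicit)
        return merged


# One module-level instance, so pandas.read_gbq() callers could reach it too.
context = Context()
```

A module-level instance is what would let the defaults apply even when the library is reached through pandas.read_gbq() and pandas.DataFrame.to_gbq(), since no extra object has to be threaded through those calls.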

@tswast
Collaborator

tswast commented Apr 4, 2018

I do still think that making project_id (sometimes) optional would be a good first step.

@tswast tswast mentioned this issue Apr 7, 2018
9 tasks
@jasonqng
Contributor Author

jasonqng commented May 12, 2018

A temporary solution to this issue using functools.partial in case folks want a stopgap workaround:

>>> from functools import partial
>>> from pandas.io import gbq
>>> bq = partial(gbq.read_gbq, project_id='xxxxx', verbose=False, dialect='standard')
>>> bq("select 1")
   f0_
0    1
>>> bq("select sum(a) from (select 5 a)")
   f0_
0    5
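
One property of this workaround worth noting: keyword arguments bound with functools.partial are only defaults, so a single call can still override them. The sketch below uses a stand-in function, `read_gbq_stub` (hypothetical, so it runs without BigQuery), that simply echoes the arguments it receives.

```python
from functools import partial


def read_gbq_stub(query, project_id=None, dialect="legacy", verbose=True):
    """Stand-in for read_gbq that returns the arguments it was called with."""
    return {
        "query": query,
        "project_id": project_id,
        "dialect": dialect,
        "verbose": verbose,
    }


bq = partial(read_gbq_stub, project_id="xxxxx", verbose=False, dialect="standard")

# Uses the bound defaults.
default_call = bq("select 1")

# A keyword passed at call time replaces the one bound by partial.
override_call = bq("select 1", project_id="other-project")
```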

@tswast
Collaborator

tswast commented Aug 7, 2018

The project ID default logic was added in #127

I propose in #161 that other default settings be delegated to the google.cloud.bigquery.Client object.

@tswast tswast closed this as completed Aug 7, 2018
@tswast tswast reopened this Aug 31, 2018
@tswast
Collaborator

tswast commented Aug 31, 2018

Re-opening this because I think we'll want a credentials arg, not client in #161. I think the pandas_gbq.context object that I propose in #161 could be used for the purpose of setting default values.

@tswast
Collaborator

tswast commented Sep 5, 2018

The global pandas_gbq.context object (added in #208) fulfills the request in this issue. I think we can slowly add properties to that class over time. (My first targets are SQL dialect #195 and maybe location.)

@tswast tswast closed this as completed Sep 5, 2018