
pandas-gbq auth proposal #161

Closed · 9 tasks done
tswast opened this issue Apr 7, 2018 · 8 comments
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

tswast commented Apr 7, 2018

Overview

The current auth flows for pandas-gbq are a bit confusing and hard to customize.

Final desired state. The pandas_gbq module should have the following (changes in bold):

  • read_gbq(query, project_id [optional], index_col=None, col_order=None, reauth, verbose [deprecated], private_key [deprecated], auth_local_webserver, dialect='legacy', configuration [optional], credentials [new param, optional])
  • to_gbq(dataframe, destination_table, project_id [optional], chunksize=None, verbose [deprecated], reauth, if_exists='fail', private_key [deprecated], auth_local_webserver, table_schema=None, credentials [new param, optional])
  • CredentialsCache (and WriteOnlyCredentialsCache, NoopCredentialsCache) - new class (and subclasses) for configuring user credentials caching behavior
  • context - global singleton with "client" property for caching default client in-memory.
  • get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False) - Helper function to get user authentication credentials.

Tasks:

  • Add authentication documentation with examples.
  • Add optional credentials parameter to read_gbq, taking a google.cloud.bigquery.Client object.
  • Add optional credentials parameter to to_gbq, taking a google.cloud.bigquery.Client object.
  • Add pandas_gbq.get_user_credentials() helper for fetching user credentials with installed-app OAuth2 flow.
  • Add pandas_gbq.CredentialsCache and related subclasses for managing user credentials cache.
  • Add pandas_gbq.context global for caching a default Client in-memory. Add examples for manually setting pandas_gbq.context.client (so that default project and other values like location can be set).
  • Update minimum google-cloud-bigquery version to 0.32.0 so that the project ID in the client can be overridden when creating query & load jobs. (Done in ENH: Add location parameter to read_gbq and to_gbq #185)
  • Deprecate private_key argument. Show examples of how to do the same thing by passing Credentials to the Client constructor.
  • Deprecate PANDAS_GBQ_CREDENTIALS_FILE environment variable. Show example using pandas_gbq.get_user_credentials with credentials_cache argument.
  • Deprecate reauth argument. Show examples using pandas_gbq.get_user_credentials with the credentials_cache argument and WriteOnlyCredentialsCache or NoopCredentialsCache. Edit: No reason to deprecate reauth, since we don't need to complicate pandas-gbq's auth with pydata-google-auth's implementation details.
  • Deprecate auth_local_webserver argument. Show an example using pandas_gbq.get_user_credentials with the auth_local_webserver argument. Edit: No reason to deprecate auth_local_webserver, as that feature is still needed. We don't actually want to force people to use pydata-google-auth for the default credentials case.

Background

pandas-gbq has its own auth flows, which include but are distinct from "application default credentials".

See issue: #129

Current (0.4.0) state of pandas-gbq auth:

  1. Use service account key file passed in as private_key parameter. Parameter can be either as JSON bytes or a file path.
  2. Use application default credentials.
    1. Use service account key at GOOGLE_APPLICATION_CREDENTIALS environment variable.
    2. Use service account associated with Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
  3. Use user authentication.
    1. Attempt to load user credentials from cache stored at ~/.config/pandas_gbq/bigquery_credentials.dat or in path specified by PANDAS_GBQ_CREDENTIALS_FILE environment variable.
    2. Do 3-legged OAuth flow.
    3. Cache the user credentials to disk.
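The 0.4.0 precedence above can be sketched as a plain-Python resolver. This is an illustrative sketch, not the actual pandas-gbq implementation: the function name and the injected step callables (from_private_key, get_default, get_user_credentials) are hypothetical stand-ins so the ordering itself is visible without real Google auth libraries.

```python
def resolve_credentials_040(
    private_key=None,
    from_private_key=None,      # step 1: service-account key (JSON bytes or path)
    get_default=None,           # step 2: application default credentials
    get_user_credentials=None,  # step 3: cached user creds / 3-legged OAuth flow
):
    """Illustrative sketch of the pandas-gbq 0.4.0 credential precedence."""
    if private_key is not None:
        # 1. An explicit service-account key always wins.
        return from_private_key(private_key)
    credentials = get_default()
    if credentials is not None:
        # 2. Application default credentials (env var key file, or the
        #    service account attached to GCE/GKE/GAE/Cloud Functions).
        return credentials
    # 3. Fall back to user authentication (cache, then OAuth flow, then
    #    write the result back to the cache).
    return get_user_credentials()
```

Injecting each step as a callable makes the precedence testable in isolation, which is also roughly why the proposal below separates the user-auth flow into its own helper.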

Why does pandas-gbq do user auth at all? Aren't application default credentials enough?

  • It's difficult in some environments to set the right environment variables, so a way to explicitly provide credentials is desired.
  • BigQuery does resource-based billing, so it is possible to use user-based authentication.
    • User-based authentication eliminates the unnecessary step of creating a service account.
    • A user with the BigQuery User IAM role wouldn't be allowed to create a service account.
    • Often datasets are shared with a specific user. Querying with user account credentials will allow them to access those shared datasets / tables.
    • User-based authentication is more intuitive in shared notebook environments like Colab, where the compute credentials might be associated with a service account in a shadow project or not available at all.

Problems with the current flow

  • The credentials order isn't always ideal.
  • It's not possible to specify user credentials in environments where application default credentials are available.
  • If someone is familiar with the google-auth library, the current auth mechanisms do not allow passing in an arbitrary Credentials object.
  • It is verbose and error-prone to pass in explicit service account credentials every time. See Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103 for a feature request for more configurable defaults.
  • Error-prone? More than once, I and other pandas-gbq contributors have forgotten to add a private_key argument to a call in a test, resulting in surprising failures in CI builds.
  • It's not possible to override the scopes for the credentials. For example, it is useful to add Drive / Sheets scopes for querying external data sources.

Proposal

Document default auth behavior

Current behavior (not changing, except for deprecations).

  1. Use client if passed in.
  2. Deprecated. Use private_key to create a Client if passed in. Use google-auth and credentials argument instead.
  3. Attempt to create client using application default credentials. Intersphinx link to google.auth.default
  4. Attempt to construct client using user credentials (project_id parameter must be passed in). Link to pandas_gbq.get_user_credentials().

New default auth behavior.

  • 1b. If a client is not passed in, attempt to use the global client at pandas_gbq.context (similar to google.cloud.bigquery.magics.context). If there is no client in the global context, run steps 2-4 and store the resulting client in the global context.

Add client parameter to read_gbq and to_gbq

The new client parameter, if provided, would bypass all other credentials fetching mechanisms.

Why a Client and not an explicit Credentials object?

  • A Client contains a default project (See feature request for default projects at Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103) and will eventually handle other defaults, such as location, encryption configuration, and maximum bytes billed.
  • A Client object supports more BigQuery operations than will ever be exposed by pandas-gbq (creating datasets, modifying ACLs, other property updates). Passing this in as a parameter could hint to developers that they can use the Client directly for those things.
  • It makes it clearer that the BigQuery magic command is provided by google-cloud-bigquery, not pandas-gbq.

Helpers for user-based authentication

No helpers are needed for default credentials or service account credentials because these can easily be constructed with the google-auth library. Link to samples for constructing these from the docs.

pandas_gbq.get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False):

  1. If credentials_cache is None, construct a pandas_gbq.CredentialsCache with default arguments.
  2. Attempt to load credentials from the cache.
  3. If credentials can't be loaded, start the 3-legged OAuth 2.0 flow for installed applications. Use the provided client secrets if given; otherwise use the pandas-gbq client secrets. Use the command-line flow by default, or the localhost webserver flow if use_localhost_webserver is True.
  4. If no credentials could be fetched, raise an AccessDenied error (the existing behavior of GbqConnector.get_user_account_credentials()).
  5. Save the credentials to the cache.
  6. Return the credentials.

pandas_gbq.CredentialsCache

Constructor takes optional credentials_path.

If credentials_path is not provided, set self._credentials_path to:

  • PANDAS_GBQ_CREDENTIALS_FILE - show deprecation warning that this environment variable will be ignored at a later date.
  • Default user credentials path at ~/.config/pandas_gbq/bigquery_credentials.dat

Methods

  • load() - load credentials from self._credentials_path, refresh them, and return them; return None if no credentials are found.
  • save(credentials) - write credentials as JSON to self._credentials_path.

pandas_gbq.WriteOnlyCredentialsCache

Same as CredentialsCache, but load() is a no-op. Equivalent to "force reauth" in current versions.

pandas_gbq.NoopCredentialsCache

Satisfies the credentials cache interface, but does nothing. Useful for shared systems where you want credentials to stay in memory (e.g. Colab).
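A minimal sketch of the cache family described above. Two simplifications to note: real user credentials are google.auth Credentials objects rather than plain dicts, so this sketch stores raw JSON, and the refresh step mentioned for load() is omitted.

```python
import json
import os


class CredentialsCache:
    """Sketch of the proposed disk-backed user credentials cache."""

    def __init__(self, credentials_path=None):
        if credentials_path is None:
            credentials_path = os.environ.get("PANDAS_GBQ_CREDENTIALS_FILE") \
                or os.path.join(os.path.expanduser("~"), ".config",
                                "pandas_gbq", "bigquery_credentials.dat")
        self._credentials_path = credentials_path

    def load(self):
        """Return cached credentials, or None if missing or unreadable."""
        try:
            with open(self._credentials_path) as fp:
                return json.load(fp)
        except (OSError, ValueError):
            return None

    def save(self, credentials):
        """Write credentials as JSON to self._credentials_path."""
        os.makedirs(os.path.dirname(self._credentials_path), exist_ok=True)
        with open(self._credentials_path, "w") as fp:
            json.dump(credentials, fp)


class WriteOnlyCredentialsCache(CredentialsCache):
    """load() is a no-op, so the OAuth flow always re-runs ('force reauth')."""

    def load(self):
        return None


class NoopCredentialsCache(CredentialsCache):
    """Never touches disk; credentials stay in memory only (e.g. Colab)."""

    def load(self):
        return None

    def save(self, credentials):
        pass
```

The subclasses only override the methods whose behavior changes, so anything accepting a CredentialsCache can take any of the three interchangeably.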

Deprecations

Some time should be given (a 1-year deprecation period?) for folks to migrate to the new client argument. The deprecated arguments might be used in scripts and older notebooks, and private_key is also a parameter upstream in pandas.

Deprecate the PANDAS_GBQ_CREDENTIALS_FILE environment variable

Log a deprecation warning suggesting pandas_gbq.get_user_credentials with a pandas_gbq.CredentialsCache argument.

Deprecate private_key argument

Log a deprecation warning suggesting google.oauth2.service_account.Credentials.from_service_account_info instead of passing in bytes and google.oauth2.service_account.Credentials.from_service_account_file instead of passing in a path.

Add / link to service account examples in the docs.
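As an illustrative sketch of that migration, the helper below mirrors private_key's dual behavior (JSON bytes or a file path). The helper's name and its factory_* injection parameters are hypothetical; Credentials.from_service_account_info and Credentials.from_service_account_file are the real google.oauth2.service_account constructors the deprecation warning would point to.

```python
import json


def load_service_account_credentials(private_key,
                                     factory_info=None, factory_file=None):
    """Replacement sketch for the deprecated private_key argument.

    private_key may be service-account key JSON (bytes/str) or a path to a
    key file, mirroring the old parameter's dual behavior. The factories
    default to the real google-auth constructors; they are parameters only
    so the dispatch logic can be tested without a real key.
    """
    if factory_info is None or factory_file is None:
        # Deferred import: only needed when using the real constructors.
        from google.oauth2 import service_account
        factory_info = service_account.Credentials.from_service_account_info
        factory_file = service_account.Credentials.from_service_account_file
    if isinstance(private_key, bytes):
        private_key = private_key.decode("utf-8")
    try:
        key_info = json.loads(private_key)
    except ValueError:
        # Not JSON: treat it as a path to a key file.
        return factory_file(private_key)
    return factory_info(key_info)
```

The resulting Credentials object would then be passed to the new credentials parameter instead of private_key.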

Deprecate reauth argument

Log a deprecation warning suggesting creating a client using credentials from pandas_gbq.get_user_credentials and a pandas_gbq.WriteOnlyCredentialsCache.

Add user authentication examples in the docs.

Deprecate auth_local_webserver argument

Log a deprecation warning suggesting creating a client using credentials from pandas_gbq.get_user_credentials and setting the auth_local_webserver argument there.

Add user authentication examples in the docs.

/cc @craigcitro @maxim-lian


tswast commented Aug 31, 2018

#171 got me thinking. There are cases when we'll want client objects besides the google.cloud.bigquery client. In the case of #171, we'll need to construct a Storage client.

I propose that wherever I suggested a client argument in this proposal, we actually ask for credentials.

@max-sixty (Contributor) commented:

💯, and then the library can manage Clients (i.e., this doesn't mean we'd need to create a new Client for each request)

tswast added a commit to pydata/pydata-google-auth that referenced this issue Sep 7, 2018
Trim pydata-google-auth package and add tests

This is the initial version of the proposed pydata-google-auth package (to be used by pandas-gbq and ibis). It includes two methods:

* `pydata_google_auth.default()`
  * A function that does the same as pandas-gbq does auth currently. Tries `google.auth.default()` and then falls back to user credentials.
* `pydata_google_auth.get_user_credentials()`
  * A public `get_user_credentials()` function, as proposed in googleapis/python-bigquery-pandas#161. Missing in this implementation is a more configurable way to adjust credentials caching. I currently use the `reauth` logic from pandas-gbq.

I drop `try_credentials()`, as it makes less sense when this module might be used for other APIs besides BigQuery. Plus there were problems with `try_credentials()` even for pandas-gbq (googleapis/python-bigquery-pandas#202, googleapis/python-bigquery-pandas#198).

tswast commented Oct 26, 2018

  • Add pandas_gbq.get_user_credentials()

This was released as part of the pydata-google-auth package. Documented at https://pydata-google-auth.readthedocs.io/en/latest/api.html#pydata_google_auth.get_user_credentials

@christianramsey commented:

@tswast glad this was released but what does this mean for pandas-gbq? I'm still having an issue with drive scopes and was hoping this could possibly solve it. Does this solve the issue in some way?


tswast commented Oct 29, 2018

@christianramsey I'm glad you asked. Yes, the combination of #231 and https://pydata-google-auth.readthedocs.io/en/latest/api.html#pydata_google_auth.get_user_credentials allows you to use Drive scopes. I (or some helpful contributor 😃) still need to add examples of this to the docs.

A brief example of using the Drive scope:

Until pandas-gbq 0.8.0 is released, install the latest from GitHub:

pip install --upgrade git+https://github.com/pydata/pandas-gbq.git

Install pydata-google-auth

pip install --upgrade pydata-google-auth

auth_example.py:

import pandas_gbq
import pydata_google_auth
import pydata_google_auth.cache

# Instead of get_user_credentials(), you could do default(), but that may not
# be able to get the right scopes if running on GCE or using credentials from
# the gcloud command-line tool.
credentials = pydata_google_auth.get_user_credentials(
    scopes=[
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/cloud-platform',
    ],
    # Use reauth to get new credentials if you haven't used the drive scope
    # before. You only have to do this once.
    credentials_cache=pydata_google_auth.cache.REAUTH,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as with Google Colab.
    auth_local_webserver=True,
)

sql = """SELECT state_name
FROM `my_dataset.us_states_from_google_sheets`
WHERE post_abbr LIKE 'W%'
"""

df = pandas_gbq.read_gbq(
    sql,
    project_id='YOUR-PROJECT-ID',
    credentials=credentials,
    dialect='standard',
)

print(df)


tswast commented Oct 29, 2018

@christianramsey Actually, you can use pydata-google-auth with pandas-gbq 0.7.0 today by using the fact that we have an in-memory cache of credentials now.

import pandas
import pandas_gbq
import pydata_google_auth
import pydata_google_auth.cache

credentials = pydata_google_auth.get_user_credentials(
    scopes=[
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/cloud-platform',
    ],
)

# Update the in-memory credentials cache (added in pandas-gbq 0.7.0).
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = 'your-project-id'

sql = """SELECT state_name
FROM `my_dataset.us_states_from_google_sheets`
WHERE post_abbr LIKE 'W%'
"""

df = pandas_gbq.read_gbq(
    sql,
    dialect='standard',
)

print(df)


christianramsey commented Oct 30, 2018

The above code worked! Thank you, @tswast


tswast commented Dec 20, 2018

It appears PANDAS_GBQ_CREDENTIALS_FILE isn't actually used after #176

There is some logic that reads it, but then the value is never used.

https://github.com/pydata/pandas-gbq/blob/08590bdcb2476aa7712bcee7d13afb2dfb7ea0de/pandas_gbq/gbq.py#L316

I guess I don't have to mark it as deprecated since it was broken anyway. For users who do want similar functionality (choosing the cache location with an environment variable), pydata/pydata-google-auth#7 tracks that feature request in pydata-google-auth.
