A user recently told me that they struggled a bit with Snowflake authentication in particular. Looks like the `externalbrowser` method we shipped for the Snowpark dataset could be expanded for more normal ones.
How do credentials currently work in kedro?
The basic pattern is as follows: the catalog entry refers to a named credentials entry, which is looked up in a separate `credentials.yml`. To be concrete, here's an example for Azure Blob Storage:
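A minimal sketch of what such a pair of entries could look like (dataset and credentials names are hypothetical; the keys are those accepted by the `adlfs` filesystem):

```yaml
# conf/base/catalog.yml -- committed to source control
weather:
  type: pandas.CSVDataSet
  filepath: "abfs://container/weather.csv"
  credentials: abs_creds   # name only; the values live elsewhere

# conf/local/credentials.yml -- NOT committed to source control
abs_creds:
  account_name: my_storage_account
  account_key: my_storage_key
```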
The `credentials` key is injected into the call that instantiates `pandas.CSVDataSet` when kedro is run. Specifically, here: `kedro/kedro/io/data_catalog.py`, line 276 in a925fd5.
Note:
- `credentials` is a special reserved keyword. This doesn't work for any other key name
- this is one of the very few customisations that kedro's `ConfigLoader` makes to how yaml is parsed. In-file variable injection is (kind of) supported in yaml using anchors, but injecting a variable from another file is not. The mechanism that does the injection here is entirely defined by kedro
- the reason for enabling this custom behaviour is that credentials should not be committed to source control. Hence they need to be stored in a separate file outside the data catalog that lives in `local` and injected into the catalog at runtime
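For comparison, in-file reuse with yaml anchors looks roughly like this (illustrative names):

```yaml
_csv_defaults: &csv_defaults   # anchor: defined once in this file
  type: pandas.CSVDataSet

cars:
  <<: *csv_defaults            # alias + merge key: reuse within the same file only
  filepath: data/01_raw/cars.csv
```

Anchors cannot reach across files, which is why kedro needs its own mechanism for credentials.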
Does this work well?
In my experience and from talking to users: in the case that credentials can be stored in a file, yes. Very little confusion is caused by the custom behaviour of injecting credentials.
What are the problems with this?
The biggest problem is that credentials might not be stored in a file. Alternatives are:
- very common: storing credentials in environment variables rather than files. kedro can deal with this through `TemplatedConfigLoader`. This works ok but feels hacky and is so common it shouldn't really require a workaround
- Python objects, e.g. `APIDataSet` works with a `requests.auth.AuthBase` object for credentials; `pandas.GBQTableDataSet` works with `google.oauth2.credentials.Credentials`. This is handled by instantiating the corresponding credentials class in the dataset using the kwargs given in the credentials.yml file. This works ok but is awkward and not done consistently throughout kedro (e.g. Additional options for APIDataSet (e.g. proxies) #711; Adding BigQuery authentication to credentials.yml #1621)
- credentials held in a secrets manager. Presumably the same `TemplatedConfigLoader` trick as used for env vars would work here. See Cloud native credentials storage #1280 and Global credentials file for multiple pipelines #930 for more.

Another problem with credentials is that the way they are handled for `PartitionedDataSet` is pretty complicated. I'm not sure we'll be able to solve that here but would be nice if we could.

Possible solutions
Environment variables
At a bare minimum I think we need a way of directly injecting environment variables into credentials. Given how common this is outside credentials files also (using `TemplatedConfigLoader`), my opinion is that this mechanism should not be credentials-specific but instead common across all kedro configuration. e.g. with OmegaConf you'd do this as:
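Presumably something along the lines of OmegaConf's built-in `oc.env` resolver (a sketch; the credentials name and keys are hypothetical):

```yaml
# conf/local/credentials.yml
dev_s3:
  client_kwargs:
    aws_access_key_id: ${oc.env:AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${oc.env:AWS_SECRET_ACCESS_KEY}
```

OmegaConf resolves `${oc.env:VAR}` against the process environment at load time, so no separate templating step would be needed.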
Quotes from #770:

> @idanov: [credentials] is obviously environment specific, but what we should consider doing is adding environment variables support. Unfortunately this has been on the backlog for a while, but doesn't seem to be such an important issue that cannot be solved by DevOps, so we never got to implementing the environment variables for credentials.

> @Galileo-Galilei: I do not understand what you mean by [this]. My point is precisely that many CI/CD tools expect to communicate with the underlying application through environment variables (to my knowledge: I must confess that I am far from being a devops expert), and it is really weird to me that this is not "native" in kedro. I must switch to the `TemplatedConfigLoader` in deployment mode even if I use a credentials.yml file while developing, and it feels uncomfortable to have to change something for deployment (even if it is very easy to change).
Beyond environment variables
So far the best discussion of this is in #1280. From @Galileo-Galilei:

> I have no time for now, and it will likely take weeks before I come up with something intelligible, but this is a topic on which I plan to write a "Universal Kedro Deployment" issue. I think there is some adherence with #770, but credentials have a lot of specificities indeed. In short my idea is that:
>
> - kedro should have an abstract class (roughly similar to `AbstractDataSet`, say `CredentialsManager`) to implement the `_get_credentials()` function. It should be able to get credentials from anywhere (e.g. `VaultCredentialsManager`, `GithubCredentialsManager` and `FileCredentialsManager`, which would default to the current implementation) and return a dict of credentials.
> - This class should leverage the `ConfigLoader` when possible
> - This class should be parametrized in `settings.py`
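A rough Python sketch of what that abstraction might look like (`CredentialsManager` and `_get_credentials` are @Galileo-Galilei's suggested names; everything else here is illustrative, not a proposed API):

```python
import os
from abc import ABC, abstractmethod
from typing import Any


class CredentialsManager(ABC):
    """Hypothetical abstract base class, loosely analogous to AbstractDataSet."""

    @abstractmethod
    def _get_credentials(self, name: str) -> dict[str, Any]:
        """Return the credentials dict registered under `name`."""
        ...


class EnvCredentialsManager(CredentialsManager):
    """Illustrative implementation reading credentials from environment
    variables named NAME_KEY, e.g. DEV_S3_SECRET -> {"secret": ...}."""

    def _get_credentials(self, name: str) -> dict[str, Any]:
        prefix = name.upper() + "_"
        return {
            key[len(prefix):].lower(): value
            for key, value in os.environ.items()
            if key.startswith(prefix)
        }
```

A `VaultCredentialsManager` or `FileCredentialsManager` would then just be further subclasses, selected in `settings.py`, with the rest of kedro only depending on the abstract interface.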
Also worth noting the factory approach of @daBlesr discussed in #711 (comment) and following comments.
I don't yet have any particular ideas myself here so I'd love to hear what others think and hear @Galileo-Galilei's idea in more detail 🚀 It would be especially great to hear from people who use cloud-native credentials systems like AWS secrets. This is a bit of a blindspot for us at the moment I think.