[KED-773] Document the best practices for using credentials (#167)
Clear description of how to configure credentials
DmitriiDeriabinQB committed Aug 1, 2019
1 parent c46a3fa commit 5f4325f
Showing 3 changed files with 68 additions and 4 deletions.
40 changes: 40 additions & 0 deletions docs/source/04_user_guide/03_configuration.md
@@ -51,3 +51,43 @@ You can alternatively change the default environment by modifying the `env` vari
```python
env = "test"
```

## Credentials

> *Note:* For security reasons, we strongly recommend *not* committing any credentials or other secrets to the Version Control System. Hence, by default any file inside the `conf/` folder (and its subfolders) containing `credentials` in its name will be ignored via `.gitignore` and not committed to your git repository.

Credentials configuration can be loaded the same way as any other project configuration using the `ConfigLoader` class:

```python
from kedro.config import ConfigLoader

conf_paths = ["conf/base", "conf/local"]
conf_loader = ConfigLoader(conf_paths)
credentials = conf_loader.get("credentials*", "credentials*/**")
```

This will load all configuration files from `conf/base` and `conf/local` whose filename starts with `credentials`, or which are located inside a folder whose name starts with `credentials`.
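
To make the effect of the path order concrete, here is a minimal sketch that uses plain dictionaries rather than Kedro itself. The helper `merge_top_level` and the sample values are hypothetical and only illustrate the idea that a later source replaces overlapping top-level keys from an earlier one; this is not Kedro's actual implementation.

```python
def merge_top_level(*configs):
    """Merge dictionaries left to right; later ones replace earlier top-level keys."""
    merged = {}
    for config in configs:
        # dict.update replaces the whole value stored under a top-level key;
        # nested dictionaries are not merged recursively.
        merged.update(config)
    return merged


# Hypothetical parsed contents of conf/base and conf/local credentials files
base = {
    "dev_s3": {"aws_access_key_id": "base_token", "aws_secret_access_key": "base_key"},
    "shared_db": {"con": "sqlite:///base.db"},
}
local = {"dev_s3": {"aws_access_key_id": "local_token"}}

credentials = merge_top_level(base, local)
print(credentials["dev_s3"])  # {'aws_access_key_id': 'local_token'}
print(sorted(credentials))    # ['dev_s3', 'shared_db']
```

In this sketch the whole `dev_s3` entry from `base` is replaced, so a key defined only in the earlier source does not survive under an overlapping top-level key.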

> *Note:* Configuration path `conf/local` takes precedence in the example above since it's loaded last, therefore any overlapping top-level keys from `conf/base` will be overwritten by the ones from `conf/local`.

Calling `conf_loader.get()` in the example above will raise a `MissingConfigException` if there are no configuration files matching the given patterns in any of the specified paths. If this is a valid workflow for your application, you can handle it as follows:

```python
from kedro.config import ConfigLoader, MissingConfigException

conf_paths = ["conf/base", "conf/local"]
conf_loader = ConfigLoader(conf_paths)

try:
    credentials = conf_loader.get("credentials*", "credentials*/**")
except MissingConfigException:
    credentials = {}
```

> *Note:* The `kedro.context.KedroContext` class uses the approach above to load project credentials.

Credentials configuration can then be used on its own or fed into the `DataCatalog` as described in [this section](./04_data_catalog.md#feeding-in-credentials).

### AWS credentials

When working with AWS S3-backed datasets (e.g., `kedro.io.CSVS3DataSet`), you are not required to store AWS credentials in the project configuration files. Instead, you can specify them using environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and, optionally, `AWS_SESSION_TOKEN`. Please refer to the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html) for more details.
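
As a rough illustration, a project could check up front that the required variables are present before any S3-backed dataset is used. The helper `missing_aws_env_vars` below is hypothetical and not part of Kedro or any AWS library; it only inspects the variable names listed above.

```python
import os

# AWS_SESSION_TOKEN is optional, so only these two are treated as required.
REQUIRED_AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(environ=os.environ):
    """Return the required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_ENV_VARS if not environ.get(name)]


# A plain dictionary stands in for the real environment in this example.
example_environ = {"AWS_ACCESS_KEY_ID": "token"}
print(missing_aws_env_vars(example_environ))  # ['AWS_SECRET_ACCESS_KEY']
```

Calling `missing_aws_env_vars()` without arguments checks the real process environment.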
30 changes: 27 additions & 3 deletions docs/source/04_user_guide/04_data_catalog.md
@@ -109,17 +109,41 @@ scooters_query:
index_col: ['name']
```

The above `catalog.yml` gets `dev_s3` `scooters_credentials` from `conf/local/credentials.yml`:
## Feeding in credentials

Before instantiating the `DataCatalog`, Kedro will first attempt to read the credentials from the project configuration (see [this section](./03_configuration.md#aws-credentials) for more details). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument.

Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents:

```yaml
dev_s3:
    aws_access_key_id: token
    aws_secret_access_key: key

scooters_credentials:
    con: sqlite:///kedro.db
```

In the example above, `catalog.yml` references the credentials keys `dev_s3` and `scooters_credentials`. This means that when instantiating the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read the top-level key `dev_s3` from the received `credentials` dictionary and will then pass its values into the dataset's `__init__` as the `credentials` argument. This is essentially equivalent to calling:

```python
CSVS3DataSet(
    bucket_name="test_bucket",
    filepath="data/02_intermediate/company/motorbikes.csv",
    load_args=dict(
        sep=",",
        skiprows=5,
        skipfooter=1,
        na_values=["#NA", "NA"],
    ),
    credentials=dict(
        aws_access_key_id="token",
        aws_secret_access_key="key",
    ),
)
```
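
The lookup step described above can be sketched in plain Python. The function `resolve_credentials` is hypothetical and is not the `DataCatalog` implementation; the catalog entry is an assumed, simplified reconstruction of the `motorbikes` entry, and the credentials dictionary mirrors the `credentials.yml` example.

```python
def resolve_credentials(dataset_config, credentials):
    """Replace a credentials key name in a dataset config with the matching values."""
    config = dict(dataset_config)  # copy so the caller's config is left untouched
    key = config.pop("credentials", None)
    if key is not None:
        if key not in credentials:
            raise KeyError("No credentials found under top-level key '{}'".format(key))
        config["credentials"] = credentials[key]
    return config


# Assumed, simplified catalog entry for the `motorbikes` dataset
motorbikes_entry = {
    "bucket_name": "test_bucket",
    "filepath": "data/02_intermediate/company/motorbikes.csv",
    "credentials": "dev_s3",
}
# Parsed contents of conf/local/credentials.yml from the example above
all_credentials = {
    "dev_s3": {"aws_access_key_id": "token", "aws_secret_access_key": "key"},
    "scooters_credentials": {"con": "sqlite:///kedro.db"},
}

resolved = resolve_credentials(motorbikes_entry, all_credentials)
print(resolved["credentials"])
# {'aws_access_key_id': 'token', 'aws_secret_access_key': 'key'}
```

Raising an error for an unknown key is a design choice of this sketch; it surfaces a misspelled reference early instead of silently creating a dataset without credentials.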


## Loading multiple datasets that have similar configuration

You may encounter situations where your datasets use the same file format, load and save arguments, and are stored in the same folder. YAML has a [built-in syntax](https://yaml.org/spec/1.2/spec.html#id2765878) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets so that you do not have to spend time copying and pasting dataset configurations in `catalog.yml`.
2 changes: 1 addition & 1 deletion docs/source/06_resources/01_faq.md
@@ -88,7 +88,7 @@ Kedro is built for Python 3.5+.
* Avoid committing data to notebook output cells (data can easily sneak into notebooks when you don't delete output cells)
* Don't commit sensitive results or plots to version control (in notebooks or otherwise)
* Don't commit credentials in `conf/`. There are two default folders for adding configuration - `conf/base/` and `conf/local/`. Only the `conf/local/` folder should be used for sensitive information like access credentials. To add credentials, please refer to the `conf/base/credentials.yml` file in the project template.
* By default any file inside the `conf/` folder (and its subfolders) containing `credentials` in its name will be ignored via `.gitignore` and not commited to your git repository.
* By default any file inside the `conf/` folder (and its subfolders) containing `credentials` in its name will be ignored via `.gitignore` and not committed to your git repository.
* To describe where your colleagues can access the credentials, you may edit the `README.md` to provide instructions.

## What is the philosophy behind Kedro?
