The cpg-utils
library (available on PyPI) contains a streamlined config management tool. This config management is used by most production CPG workflows, but is useful in projects and scripts at any scale.
This allows you to run the same code across multiple datasets, namespaces, and even clouds without any change to your code. Layered configurations like this can make it tricky to work out exactly where parameters come from, so we recommend:
- Putting a parameter on the CLI if its value is unique to each run
- Putting a parameter in a config if the value is useful across many runs and changes predictably with the dataset
- Avoiding environment variables for passing information around
This configuration tool uses one or more TOML
files, and creates a dictionary of key-value attributes which can be accessed at any point, without explicitly passing a configuration object. If jobs are set up using analysis-runner
, config will be set up automatically within each job environment. Please see the end section of this document for extra details on how to set up config outside analysis-runner.
The analysis-runner is the entry point to analysis at the CPG, but its secondary role is to combine several configuration sources for your analysis.
This includes:
- Storage configuration generated by the cpg-infrastructure
- Selected configuration attributes (also from cpg-infrastructure)
- Images
- References
This combined configuration is constructed by the analysis-runner server.
You can generate an example config using the analysis-runner config
command:
analysis-runner config --help
# usage: config subparser [-h] --dataset DATASET -o OUTPUT_DIR [--access-level {test,standard,full}] [--image IMAGE] [--config CONFIG] [--config-output CONFIG_OUTPUT]
#
# options:
# -h, --help show this help message and exit
# --dataset DATASET The dataset name, which determines which analysis-runner server to send the request to.
# -o OUTPUT_DIR, --output-dir OUTPUT_DIR
# The output directory within the bucket. This should not contain a prefix like "gs://cpg-fewgenomes-main/".
# --access-level {test,standard,full}
# Which permissions to grant when running the job.
# --image IMAGE Image name, if using standard / full access levels, this must start with australia-southeast1-docker.pkg.dev/cpg-common/
# --config CONFIG Paths to a configurations in TOML format, which will be merged from left to right order (cloudpathlib.AnyPath-compatible paths are supported). The analysis-runner will add the default
# environment-related options to this dictionary and make it available to the batch.
# --config-output CONFIG_OUTPUT
# Output path to write the generated config to (in YAML)
Tom's Obvious, Minimal Language (TOML) is a config file format designed to be easily human-readable and writeable, with clear data structures. Sections are delineated using bracketed headings, and key-value pairs are defined using =
syntax, e.g.:
global_key = "value"
[heading_1]
name = "Luke Skywalker"
age = 53
[heading_1.subheading]
occupation = ["Jedi", "Hermit", "Force Ghost"]
will be digested into the dictionary:
{
    'global_key': 'value',
    'heading_1': {
        'name': 'Luke Skywalker',
        'age': 53,
        'subheading': {
            'occupation': ['Jedi', 'Hermit', 'Force Ghost']
        }
    }
}
Analysis-runner incorporates a simple interface for config setting. When launching a job, the flag --config
can be used, pointing to a config file (local, or within GCP and accessible with the current logged-in credentials).
The --config
flag can be used multiple times, which will cause the argument files to be aggregated in the order they are defined. When --config
is set in this way, the analysis-runner performs the following actions:
- Locally (where analysis-runner is invoked), the configuration files are merged into a single dictionary
- This dictionary is sent with the job definition to the execution server
- The merged data is saved in TOML format to a GCP path
- The environment variable CPG_CONFIG_PATH is set to this new TOML location
- Within the driver image, get_config() can be called safely with no further config setting
If batch jobs are run in containers, passing this environment variable to those containers allows the same configuration file to be used throughout the Hail Batch. The cpg_utils.hail_batch.copy_common_env
method facilitates this environment duplication; container authentication is required to make the file path in GCP accessible.
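As a rough illustration, the duplication amounts to forwarding the relevant variables from the driver's environment into each job's environment. This is a simplified stand-in, not the real copy_common_env (which operates on Hail Batch job objects), and the variable list here is a hypothetical subset:

```python
import os

# Variables every job needs to share with the driver; a hypothetical
# subset -- the real copy_common_env decides which variables to forward.
COMMON_ENV_VARS = ['CPG_CONFIG_PATH']

def copy_common_env_sketch(job_env: dict) -> dict:
    """Copy shared environment variables from the current process into a
    job's environment mapping, so get_config() inside the job reads the
    same merged TOML as the driver."""
    for name in COMMON_ENV_VARS:
        value = os.environ.get(name)
        if value is not None:
            job_env[name] = value
    return job_env

os.environ['CPG_CONFIG_PATH'] = 'gs://cpg-bucket/config.toml'  # example value
job_env = copy_common_env_sketch({})
```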
Even without additional configurations, analysis-runner will insert infrastructure and run-specific attributes, e.g.:
- get_config()['workflow']['access_level']
, e.g. test, or standard
- get_config()['workflow']['dataset']
, e.g. tob-wgs, or acute-care
When passing the Analysis-runner multiple configs, the configs defined earlier are used as a base that is updated with values from configs defined later. New content is added, and content with the exact same key is updated/replaced, e.g.
Base file:
[file]
name = "first.toml"
[content]
square = 4
Second file:
[file]
name = "second.toml"
[content]
triangle = 3
Result:
[file]
name = "second.toml"
[content]
square = 4
triangle = 3
It's important to note that the config files are loaded 'left-to-right': when multiple configuration files are loaded, only the right-most value for any overlapping key is retained.
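This merge behaviour can be sketched as a recursive dictionary update. The following is an illustration of the rule above, not the exact cpg-utils implementation:

```python
def merge_configs(base: dict, update: dict) -> dict:
    """Merge `update` into `base`, right side winning: nested tables are
    merged key by key, while overlapping scalar or list values are
    replaced by the later (right-most) file's value."""
    result = dict(base)
    for key, value in update.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = merge_configs(result[key], value)
        else:
            result[key] = value
    return result

# The two-file example above, expressed as already-parsed dictionaries
first = {'file': {'name': 'first.toml'}, 'content': {'square': 4}}
second = {'file': {'name': 'second.toml'}, 'content': {'triangle': 3}}
merged = merge_configs(first, second)
```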
To use the cpg_utils.config
functions, import get_config
into any code:
from cpg_utils.config import get_config
The first call to get_config
sets the global config dictionary and returns its content; subsequent calls just return the cached dictionary.
assert get_config()['file']['name'] == 'second.toml'
Because configuration is loaded lazily, start-up overhead is minimal, but can result in late failures if files with invalid content are specified.
The config utility can be used outside analysis-runner
and CPG infrastructure, requiring the user to manually set the config file(s) to be read. Configuration files can be set in two ways:
- Set the CPG_CONFIG_PATH
environment variable
- Use set_config_paths
to point to one or more config TOMLs:
from cpg_utils.config import set_config_paths
You can refer to the example configuration TOML in this repository and use it as a template.