# Configuration Files

Argument parsing solves a lot of problems: it avoids hardcoded values and tells the user what each argument does. But as our script becomes more complex and contains more variables, even argument parsing can become overwhelming. See, for example, https://github.com/ultralytics/yolov5/blob/master/train.py.

Say we want to train a machine learning model: model A uses one set of variables and model B uses another. We can't support this using `argparse` alone, because each argument can only have a single default.

Thankfully we can also use configuration files.

When we use configuration files we're typically making a trade-off: not every variable can be changed individually, but we instead have a few pre-configured variations that we can select between.

## 01. What is a configuration file?

Configuration files are just files, usually kept in a specific directory and usually with a `.json`, `.yaml`, or `.ini` extension, that store groups of variables for us.

In [11]:
# how is a json file structured?
"""
{
    "dataset": {
        "root": ...,
        "regex": ...
    }
}
"""

'\n{\n    "dataset": {\n        "root": ...,\n        "regex": ...\n    }\n}\n'

In [1]:
# how is a yaml file structured?
"""
dataset:
    root: ... # this variable does this
    regex: ... # TODO: change this variable to x when i do y
"""

'\ndataset:\n    root: ... # this variable does this\n    regex: ... # TODO: change this variable to x when i do y\n'
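To make the comparison between formats concrete, here is a minimal sketch that parses the same two variables from a JSON string and from an INI string, using only the standard library. The values `./data` and `*` are placeholders chosen for illustration, not values from a real project.

```python
import configparser
import json

# The same (hypothetical) dataset settings expressed in two formats.
JSON_TEXT = '{"dataset": {"root": "./data", "regex": "*"}}'
INI_TEXT = """
[dataset]
root = ./data
regex = *
"""

# JSON maps directly onto nested Python dictionaries.
json_cfg = json.loads(JSON_TEXT)

# INI files are grouped into sections; every value is read as a string.
ini_parser = configparser.ConfigParser()
ini_parser.read_string(INI_TEXT)
ini_cfg = {section: dict(ini_parser[section]) for section in ini_parser.sections()}

print(json_cfg)
print(ini_cfg)
```

Both end up as the same nested dictionary here. One practical difference is that standard JSON has no comment syntax, whereas YAML and INI both allow comments next to each variable.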

## 02. Creating a configuration file

We can create our own configuration file that holds a pre-configured set of variables for a specific dataset, so that we can switch between different configurations of that dataset.

In [2]:
# we know for our dataset class we need to provide the following arguments
"""
dataset = FashionMnist(root, regex)
"""

'\ndataset = FashionMnist(root, regex)\n'

In [3]:
# create a file in /configs directory
...
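As a rough sketch of this step, the snippet below writes a small YAML file containing the two arguments our dataset class needs. It writes into a temporary directory so that it can be run anywhere; in the project itself the file would live in the `configs` directory. The path `./data/fashion_mnist` is a made-up placeholder.

```python
import tempfile
from pathlib import Path

# Hypothetical values; a real config would point at your own data.
CONFIG_TEXT = """\
root: ./data/fashion_mnist  # where the dataset is stored (placeholder path)
regex: "*"                  # pattern used to select files
"""

# A temporary directory stands in for the project's configs directory.
config_dir = Path(tempfile.mkdtemp()) / "configs"
config_dir.mkdir(parents=True, exist_ok=True)

config_path = config_dir / "dataset_A.yaml"
config_path.write_text(CONFIG_TEXT, encoding="utf-8")

print(config_path.read_text(encoding="utf-8"))
```

Note that `*` is quoted: a bare `*` at the start of a YAML value is read as the beginning of an alias, so the quotes are needed for it to load as a plain string.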

## 03. Reading a configuration file

Once we've created a configuration file, we need to read the data stored in it in order to use it.

In [5]:
# using yaml to load configuration files
import yaml
from pathlib import Path
from typing import Any, Dict

path = Path("../configs").joinpath("dataset_A.yaml")

with open(path, "r") as file:
    cfg: Dict[str, Any] = yaml.safe_load(file)
    print(cfg)

{'root': 'some other value...', 'regex': '*'}


In [6]:
# to access items 
cfg["root"]

'some other value...'

Whilst we can use `yaml` directly, I prefer `omegaconf`, a library that builds on top of `yaml` and provides more functionality, which we will explore later.

In [7]:
# using omegaconf to load configuration files
from omegaconf import OmegaConf

path = Path("../configs").joinpath("dataset_A.yaml")

cfg = OmegaConf.load(path)

print(cfg)

{'root': 'some other value...', 'regex': '*'}


In [8]:
# it provides a slightly nicer way to interface with the data stored in the config file
cfg.root

'some other value...'

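`omegaconf` also supports variable resolution (interpolation) within a configuration file: a value can reference another key or an environment variable, as in `${oc.env:PROJECT_ROOT}`.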

## 04. Using a configuration file

So the next question is: how do we integrate these into our script? There is a lot of flexibility here, so I'll just demonstrate one way I might do it.

In [31]:
# let's compare our scripts 01-script.py and 02-script_with_config.py

In [32]:
# let's run script 1
!python ../scripts/01-script.py

usage: 01-script.py [-h] --root ROOT [--regex REGEX]
01-script.py: error: the following arguments are required: --root


In [38]:
# let's run script 2
!python ../scripts/02-script_with_config.py # --dataset ../configs/dataset_B.yaml

downloaded fashion_mnist into root
img (28, 28, 3) with label 9
img (28, 28, 3) with label 2
img (28, 28, 3) with label 1
img (28, 28, 3) with label 1
img (28, 28, 3) with label 6
img (28, 28, 3) with label 1
img (28, 28, 3) with label 4
img (28, 28, 3) with label 6
img (28, 28, 3) with label 5
img (28, 28, 3) with label 7


## 05. Advanced: Command Line Overrides

One issue we mentioned briefly is that a configuration file gives us less granular control, because we can no longer specify individual values on the command line. However, `omegaconf` also supports command line overrides, which is an extremely powerful tool.

In [19]:
from omegaconf import OmegaConf

# let's define some args which didn't get parsed
unknown_args = ["regex=0"]

# let's interpret these
overrides = OmegaConf.from_dotlist(unknown_args)

print(overrides)

# inspect the value and the type omegaconf inferred for it
# print(overrides.regex, type(overrides.regex))

{'regex': 0}


In [20]:
# we can also merge these using omegaconf
cfg = OmegaConf.load("../configs/dataset_B.yaml")

print(cfg)
print(cfg.root)

{'root': '${oc.env:PROJECT_ROOT}/data', 'regex': '*'}
C:\Users\samca\Documents\projects\nextgen2025-codingbootcamp-session07/data


In [21]:
# let's merge these
cfg = OmegaConf.merge(cfg, overrides)

print(cfg)

{'root': '${oc.env:PROJECT_ROOT}/data', 'regex': 0}


However, we have to be careful about types when doing this. Under the hood, `omegaconf` interprets each value into what it thinks is the most appropriate datatype, which might not always be suitable. In the example above, `regex=0` became the integer `0` rather than the string `"0"`.