# Pipeline and configuration files

**Author(s)**: Matteo Bunino (CERN)

# Using Configuration Files in itwinai

In the previous tutorial, we introduced how to create new components and assemble them into a 
**Pipeline** for a simplified workflow execution. The **Pipeline** executes components in the 
order they are defined, assuming that each component's outputs will fit as inputs to the next one.
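
As a reminder of what this sequential execution means, here is a minimal sketch in plain Python. It does not use itwinai's `Pipeline` class: the functions `get_data`, `split`, `train`, and `run_pipeline` are hypothetical stand-ins that only illustrate how each step's output becomes the next step's input.

```python
def get_data(size):
    # Stand-in for a data getter: produce `size` dummy samples.
    return list(range(size))


def split(data):
    # Stand-in for a dataset splitter: halve the data.
    half = len(data) // 2
    return data[:half], data[half:]


def train(splits):
    # Stand-in for a trainer: only report the sizes it received.
    train_set, test_set = splits
    return {"n_train": len(train_set), "n_test": len(test_set)}


def run_pipeline(steps, initial_input):
    # Run the steps in order, feeding each output to the next step.
    result = initial_input
    for step in steps:
        result = step(result)
    return result


summary = run_pipeline([get_data, split, train], 10)
print(summary)
```

If a step returned something the next step could not consume, the chain would fail at that step, which is why the order of the steps matters.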

Sometimes, you might want to define your pipeline in a configuration **YAML** file instead. 
This allows for:

- Easier modification and reuse of pipeline definitions
- Clear separation between code and configuration
- Dynamic overrides of parameters at runtime

---

## Example Configuration File

```yaml
training_pipeline:
    _target_: itwinai.pipeline.Pipeline
    steps:
        - _target_: basic_components.MyDataGetter
          data_size: 200
        - _target_: basic_components.MyDatasetSplitter
          train_proportion: 0.5
          validation_proportion: 0.25
          test_proportion: 0.25
        - _target_: basic_components.MyTrainer
        - _target_: basic_components.MySaver
```
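
Each `_target_` key names the class to build, and the sibling keys are passed to it as keyword arguments (this is Hydra's instantiation convention). The sketch below mimics that idea with the standard library only. It is not Hydra's implementation: the `instantiate` helper here is a simplified, hypothetical version, and `fractions.Fraction` and `datetime.timedelta` merely stand in for pipeline components.

```python
import importlib


def instantiate(node):
    """Recursively build objects from dicts that carry a `_target_` key."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if isinstance(node, dict):
        kwargs = {
            key: instantiate(value)
            for key, value in node.items()
            if key != "_target_"
        }
        if "_target_" not in node:
            return kwargs
        # Split "package.module.ClassName" into module path and attribute.
        module_name, _, attr_name = node["_target_"].rpartition(".")
        target = getattr(importlib.import_module(module_name), attr_name)
        return target(**kwargs)
    return node


# Stand-in configuration with the same shape as a `steps` list.
config = {
    "steps": [
        {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 4},
        {"_target_": "datetime.timedelta", "hours": 2},
    ]
}

built = instantiate(config)
print(built["steps"])
```

Because the target is resolved by importing its module, every class named in `_target_` must be importable from the process that parses the file.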

---

itwinai leverages [Hydra](https://hydra.cc) to parse configuration files and instantiate pipelines dynamically. 
There are two ways to run a pipeline from your configuration file: from the command-line 
interface (CLI) or from within your code.


## Parsing Pipelines from CLI

You can execute a pipeline from a configuration file with the CLI using:

```bash
itwinai exec-pipeline
```

This command loads the default configuration file and executes the defined 
pipeline. You can customize execution using the following options:

### 1. Setting a `--config-path` and `--config-name`
By default, the parser looks for a file called `config.yaml` in your current working directory. 
To change this, set the path to your configuration file with the `--config-path` option. 
The path can be either absolute or relative to your current working directory, and should point to
the directory that contains your configuration file. If your configuration file has a different name 
(not `config.yaml`), specify it with the `--config-name` flag:

```bash
itwinai exec-pipeline --config-path path/to/dir --config-name my-config-file
```

### 2. Setting a Pipeline Key (`pipe_key`)
A configuration file can contain multiple pipelines. 
The default key that the parser looks for is `training_pipeline`. Use the `pipe_key` argument 
to override this default and specify which pipeline to execute:

```bash
itwinai exec-pipeline +pipe_key=another_training_pipeline
```
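
Conceptually, the key picks one top-level entry of the parsed configuration. The following is a minimal sketch of that lookup on a plain dictionary. The `select_pipeline` helper and the step names are hypothetical and are not part of itwinai's API.

```python
DEFAULT_PIPE_KEY = "training_pipeline"


def select_pipeline(config, pipe_key=DEFAULT_PIPE_KEY):
    """Return the pipeline definition stored under `pipe_key`."""
    if pipe_key not in config:
        available = ", ".join(sorted(config))
        raise KeyError(
            f"Pipeline '{pipe_key}' not found; available keys: {available}"
        )
    return config[pipe_key]


# Stand-in for a parsed configuration file that defines two pipelines.
config = {
    "training_pipeline": {"steps": ["getter", "trainer"]},
    "another_training_pipeline": {"steps": ["getter", "splitter", "trainer"]},
}

default_pipeline = select_pipeline(config)
other_pipeline = select_pipeline(config, "another_training_pipeline")
print(len(default_pipeline["steps"]), len(other_pipeline["steps"]))
```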

### 3. Selecting Steps to Run (`pipe_steps`)

If you only want to run specific steps of the pipeline, use `pipe_steps`:

```bash
itwinai exec-pipeline +pipe_steps=[data_loader,trainer]
```

This will execute only the `data_loader` and `trainer` steps of the pipeline. You can also pass 
`pipe_steps` as a list of indices if your configuration file defines the steps as a list.
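
The two ways of selecting steps (by name or by index) can be sketched as follows. This is an illustration under assumptions: `select_steps` is a hypothetical helper, not itwinai's implementation, and the step values are placeholder strings.

```python
def select_steps(steps, pipe_steps):
    """Pick a subset of steps, keeping the order given in `pipe_steps`."""
    if isinstance(steps, dict):
        # Steps defined as a mapping: select by name.
        return [steps[name] for name in pipe_steps]
    # Steps defined as a list: select by position.
    return [steps[index] for index in pipe_steps]


# Steps defined with names.
named_steps = {
    "data_loader": "load data",
    "splitter": "split data",
    "trainer": "train model",
}
by_name = select_steps(named_steps, ["data_loader", "trainer"])

# The same steps defined as a list.
listed_steps = ["load data", "split data", "train model"]
by_index = select_steps(listed_steps, [0, 2])

print(by_name, by_index)
```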

### 4. Overriding Values

You can override any parameter in the configuration file directly from the command line:

```bash
itwinai exec-pipeline +trainer.batch_size=64
```

This modifies the `batch_size` parameter inside the pipeline configuration.
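
To make the effect of a dotted override concrete, the sketch below applies one to a nested dictionary. It is a simplified illustration, not Hydra's override parser (whose grammar is much richer). The helpers `parse_value` and `apply_override` and the `lr` field are hypothetical.

```python
import json


def parse_value(text):
    # Interpret numbers, booleans and lists; fall back to a plain string.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def apply_override(config, override):
    """Apply a single `a.b.c=value` override to a nested dict in place."""
    dotted_key, _, raw_value = override.lstrip("+").partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate levels if they do not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = parse_value(raw_value)
    return config


config = {"trainer": {"batch_size": 32, "lr": 0.001}}
apply_override(config, "+trainer.batch_size=64")
print(config["trainer"])
```

Only the addressed field changes; sibling fields such as `lr` keep the values from the configuration file.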

---

## Advanced Functionality with Hydra

Since this implementation is based on **Hydra**, you can use all of Hydra’s command-line arguments, including:

- **Multi-run execution**
- **Configuration composition and overrides**
- **Experiment tracking with different configurations**

For more details, refer to the [Hydra documentation](https://hydra.cc/docs/advanced/hydra-command-line-flags/).
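
As an illustration of multi-run execution: with Hydra's default sweeper, passing comma-separated values for several parameters launches one job per combination. The sketch below only reproduces that expansion in plain Python. The `expand_sweep` helper and the parameter names are hypothetical, and no job is actually launched.

```python
from itertools import product


def expand_sweep(sweep):
    """Expand {param: [values]} into one override dict per combination."""
    keys = list(sweep)
    return [
        dict(zip(keys, combination))
        for combination in product(*(sweep[key] for key in keys))
    ]


jobs = expand_sweep(
    {
        "trainer.batch_size": [32, 64],
        "trainer.lr": [0.01, 0.001],
    }
)
for job in jobs:
    print(job)
```

Two values for each of two parameters yield four jobs, so the number of runs grows multiplicatively with every swept parameter.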

---

## Debugging Tip

If your pipeline execution fails and you need detailed error messages, set the following environment variable before running the pipeline:

```bash
export HYDRA_FULL_ERROR=1
```

---


## Parsing Pipelines from Code

In some cases, you may want to parse and execute a pipeline defined in a configuration file from within your Python code. 
You can do this as follows:

```python
from hydra import compose, initialize
from itwinai import exec_pipeline_with_compose

# Here, we show how to run a pre-existing pipeline stored as
# a configuration file from within Python code, with the possibility of
# dynamically overriding some fields

# Load pipeline from saved YAML (dynamic deserialization)
with initialize(config_path="path/to/my/config"):
    cfg = compose(
        config_name="basic_pipeline_example.yaml",
        overrides=[
            "pipeline.init_args.steps.0.init_args.data_size=200",
        ],
    )
    exec_pipeline_with_compose(cfg)
```

To keep things simple, this implementation is taken directly from Hydra, so please refer to 
[their documentation](https://hydra.cc/docs/advanced/compose_api/) for more details on what parameters you can set here.

## Reproducibility

Each execution logs the pipeline configuration under the `outputs/` directory. 
This ensures reproducibility by recording the exact parameters used for execution. 
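
To inspect what was recorded, you can look up the most recent logged configuration. The sketch below assumes Hydra's default layout, `outputs/<date>/<time>/.hydra/config.yaml`. If your setup customizes the output directory, the glob pattern needs adjusting. The helper `latest_logged_config` is hypothetical, and the demo builds a throwaway directory instead of reading real runs.

```python
import tempfile
from pathlib import Path


def latest_logged_config(outputs_dir="outputs"):
    """Return the newest logged config.yaml under outputs_dir, or None."""
    # Assumed layout (Hydra's default): <date>/<time>/.hydra/config.yaml
    candidates = sorted(Path(outputs_dir).glob("*/*/.hydra/config.yaml"))
    # Date and time folder names sort chronologically as strings.
    return candidates[-1] if candidates else None


# Demo on a temporary directory that mimics two earlier runs.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("2024-01-01/10-00-00", "2024-01-02/09-30-00"):
        run_dir = Path(tmp) / stamp / ".hydra"
        run_dir.mkdir(parents=True)
        (run_dir / "config.yaml").write_text("training_pipeline: {}\n")
    found = latest_logged_config(tmp).relative_to(tmp).as_posix()

print(found)
```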

## Running the pipeline

Here you can find a graphical representation of the pipeline implemented below.

![pipeline](sample_pipeline_2.jpg)

### Important!

Pipeline components can be serialized only when they are imported from an external file!

In this case, `MyDataGetter`, `MyDatasetSplitter`, and `MyTrainer` are imported from `basic_components`.
Otherwise, another process cannot deserialize the serialized pipeline.
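
One plausible way to see why this matters: a serialized pipeline refers to each component by its import path, and that path is only meaningful if another process can import it. The sketch below illustrates the idea with the standard library. `import_path` and `load_from_path` are hypothetical helpers, and `fractions.Fraction` stands in for a component defined in an external file.

```python
import importlib
from fractions import Fraction


def import_path(cls):
    # The dotted path a configuration file would record for this class.
    return f"{cls.__module__}.{cls.__qualname__}"


def load_from_path(path):
    # What a separate process has to do to get the class back.
    module_name, _, attr_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr_name)


class NotebookOnlyComponent:
    """A class defined inline, e.g. in a notebook cell or a script."""


external = import_path(Fraction)
inline = import_path(NotebookOnlyComponent)

print(external)  # importable from any process
print(inline)    # tied to the module that is currently running
```

A class defined inline gets a path bound to the running module (typically `__main__` in a notebook or script), which a different process cannot resolve to the same class.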

```python
from hydra import compose, initialize
from itwinai import exec_pipeline_with_compose

# Here, we show how to run a pre-existing pipeline stored as
# a configuration file from within Python code, with the possibility of
# dynamically overriding some fields

# Load pipeline from saved YAML (dynamic deserialization)
with initialize():
    cfg = compose(
        "basic_pipeline_example.yaml",
        overrides=[
            "pipeline.init_args.steps.0.init_args.data_size=200",
        ],
    )
    exec_pipeline_with_compose(cfg)
```

MyDataGetter's data_size is now: 200

#######################################
# Starting execution of 'Pipeline'... #
#######################################
###########################################
# Starting execution of 'MyDataGetter'... #
###########################################
#####################################
# 'MyDataGetter' executed in 0.000s #
#####################################
################################################
# Starting execution of 'MyDatasetSplitter'... #
################################################
##########################################
# 'MyDatasetSplitter' executed in 0.000s #
##########################################
########################################
# Starting execution of 'MyTrainer'... #
########################################
##################################
# 'MyTrainer' executed in 0.000s #
##################################
######################################
# Starting execution of 'Adapter'... #
##############