
Enhanced parallel experimentation and required changes in kedro code #1606

Open
Vincent-Liagre-QB opened this issue Jun 9, 2022 · 25 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@Vincent-Liagre-QB

Vincent-Liagre-QB commented Jun 9, 2022

Description & context

When working outside kedro, I often have several parallel configs for the same script (in kedro terms, "pipeline"), e.g. different model configs for a regression model, or specific start/end dates and exclusion patterns for an analysis. The tree could look like:

├───conf
│   └───model_1
│           experiment_1.yaml
│           experiment_2.yaml
└───src
        model_1.py

And within model_1.py, I'd usually do something like:

from typer import Typer

app = Typer()

@app.command()
def main(conf: str = "experiment_1"):
    ...  # load conf/model_1/<conf>.yaml and run the experiment

if __name__ == "__main__":
    app()

So that I can then easily run different experiments independently with:
python src/model_1.py --conf=experiment_2 (for instance)

And I'd usually organize results like this (but that's personal; the point is to make it easily configurable):

├───data
│   └───08_reporting
│       └───model_1
│           ├───experiment_1
│           │       result_a.csv
│           │       result_b.png
│           │
│           └───experiment_2
│                   result_a.csv
│                   result_b.png

Note that:

  • I don't want separate branches for these different config files, as they could all be relevant (or at least worth showing in the main branch even if one of them is preferred)
  • I don't want to write code in model_1.py for each conf just to be able to run them independently, so that the workflow of adding a conf is seamless

Now I am wondering: how can I easily have a similar workflow in kedro? What I have thought about so far:

  1. Use the env arg when doing a kedro run; but:
  • Not really what it's meant for
  • Say I have several pipelines, each with several such options --> it quickly becomes unmanageable
  2. Use a modular pipeline and create one pipeline for each config; but:
  • It will pollute kedro-viz
  • It is cumbersome
  3. Use a modular pipeline, create one pipeline for each config, and return them together so that they appear as the same node in kedro-viz; but:
  • I won't be able to execute them independently
  • Still cumbersome
  4. Have a first level in my dict of params corresponding to the possible configs, plus a wrapper around my nodes that starts with something like chosen_params = params[config_name], and use --params=config_name:<my_config> when doing a kedro run; but it's also a bit cumbersome and confusing
  5. Custom CLI using most of Kedro's power --> the solution I'll go with here
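For context, option 4 can be sketched with plain dictionaries (all names here are illustrative, not from an actual project):

```python
# Illustrative sketch of option 4: a first level in the params dict selects
# the experiment, and a thin wrapper picks the chosen sub-dict before the node runs.
params = {
    "experiment_1": {"test_size": 0.2, "random_state": 3},
    "experiment_2": {"test_size": 0.3, "random_state": 7},
}

def with_experiment_params(node_func, config_name):
    """Wrap a node so it only sees the chosen experiment's params."""
    def wrapper(data, all_params):
        chosen_params = all_params[config_name]
        return node_func(data, chosen_params)
    return wrapper

def split_data(data, model_options):  # stand-in for a real node
    return {"n_rows": len(data), "options_used": model_options}

wrapped = with_experiment_params(split_data, "experiment_2")
result = wrapped([1, 2, 3], params)
```

The drawback the text mentions is visible here: every node has to go through the wrapper, and the params file nests one level deeper than usual.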

Before deep diving into 5:
Do you have any other ideas? Am I missing something (which might very well be the case since I am quite a beginner here)? Am I too biased by my outside-kedro workflow, which might not be that straightforward after all?

Possible Implementation

Using the example case of spaceflights' data_science pipeline

Simply run:
python src/kedro_tutorial/pipelines/data_science/experiment_run.py --experiment-name="test_experiment"

Where:
src/kedro_tutorial/pipelines/data_science/experiment_run.py is as below:

(Remarks and required changes below)

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
from kedro_tutorial.pipelines.data_science.nodes import evaluate_model, split_data, train_model
from kedro.io.core import AbstractDataSet

from pathlib import Path

from typer import Typer

app = Typer()


def change_savepath(initial_save_path, experiment_name: str):
    # TODO: make this configurable + make the path absolute so it can be run from anywhere
    return "./test_experiment_run"

def create_experiment_pipeline(experiment_name: str = "active_modelling_pipeline", **kwargs) -> Pipeline:
    # TODO: link with base pipeline defined in pipeline.py
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                name="evaluate_model_node",
                outputs="metrics",
            )
        ]
    )

    res = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace=experiment_name,
        parameters={"params:model_options": f"params:{experiment_name}.model_options"},
    )

    return res

@app.command()
def main(experiment_name: str = "test_experiment"):
    project_path = Path(__file__).absolute().parents[4]  # .absolute() required
    metadata = bootstrap_project(project_path)

    params = {}

    experiment_pipeline = create_experiment_pipeline(experiment_name=experiment_name)

    with KedroSession.create(metadata.package_name, env="base", extra_params=params) as session:

        catalog = session.load_context()._get_catalog()

        results = session.run(pipeline_name=experiment_pipeline)
        # TODO: works only if session.run is made to accept Pipeline objects
        # directly (rather than registered names)

        # TODO: find a way to access all MemoryDataSets (not only end results),
        # or run the nodes one after another
        # --> make it optional to return all "unregistered_ds" in AbstractRunner.run

        for data_name, data in results.items():

            # Strip the modular pipeline's namespace to find the base catalog entry
            base_data_name = data_name.replace(f"{experiment_name}.", "")

            dataset = catalog._get_dataset(base_data_name)
            ds_module = dataset.__module__
            ds_class = type(dataset).__name__
            ds_type = ds_module.replace("kedro.extras.datasets.", "") + "." + ds_class

            ds_config = {
                "filepath": dataset._get_save_path(),
                "type": ds_type,
                "versioned": dataset.versioned,
            }

            new_ds_config = ds_config.copy()
            new_ds_config["filepath"] = change_savepath(
                initial_save_path=ds_config["filepath"],
                experiment_name=experiment_name,
            )

            new_dataset = AbstractDataSet.from_config(
                name=data_name,
                config=new_ds_config,  # load_versions.get(ds_name), save_version
            )
            new_dataset.save(data)

if __name__ == "__main__":
    app()

Remarks:

  • This is more of a workflow than a big change in the code; however, it requires 2 changes in kedro, see below
  • The main value added is to let users focus on experimenting with configs by removing the need to manually define new pipelines and catalog entries, while still starting from a template and using Kedro's full power (pipeline runs and I/O, notably)

Required changes in kedro code:

  1. A way to run a pipeline specified by a kedro.pipeline.Pipeline object
  • Goal: so that pipelines can be created programmatically, notably to specify custom params & outputs, so that I can have outputs as free_outputs and save them where I want
  • It seems to be a relatively easy change: mainly this line:
    pipeline = pipelines[name]
  • and of course the related doc and test changes
  2. (Optional) Make it optional to return all unregistered_ds in AbstractRunner.run (vs. only free_outputs)
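To make change 1 concrete, the dispatch could look something like the sketch below. This shows the idea only, not Kedro's actual internals:

```python
# Sketch: accept either a registered pipeline name (current behaviour)
# or a Pipeline object directly (the proposed change).
def resolve_pipeline(pipelines, name_or_pipeline):
    if isinstance(name_or_pipeline, str):
        return pipelines[name_or_pipeline]  # today: pipeline = pipelines[name]
    return name_or_pipeline  # proposed: use the object as-is

# Dummy registry standing in for the project's registered pipelines
registered = {"__default__": object()}
```

With such a change, `session.run` could keep full backwards compatibility: strings behave exactly as before, while Pipeline objects skip the registry lookup.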

Possible Alternatives

See points 1/2/3/4 above

@Vincent-Liagre-QB Vincent-Liagre-QB added the Issue: Feature Request New feature or improvement to existing feature label Jun 9, 2022
@deepyaman
Member

@Vincent-Liagre-QB To what extent would this be covered by #1303?

Also, just to clarify, is your goal to be able to run all experiments with a single command, only run one experiment at a time, or do either? I think I understand your requirement as running one experiment at a time, but just wanted to make sure.

Finally, since you're from QB, you can also consider an internal project called Multi-Runner; but I 100% think these issues should be resolved in the open-source Kedro ecosystem in the long run!

@Vincent-Liagre-QB
Author

@deepyaman, to your questions:

  • The goal (and value added) would be: being able to run one experiment at a time, while avoiding having to write specific pipelines & catalog entries for each experiment
  • I have looked into Multi-Runner indeed; I'm trying to get in touch with the team as there are indeed synergies, but at the moment it doesn't allow running a single experiment at a time
  • Regarding Hydra ([Feature Request] Support for Hydra in Kedro #1303): I have looked into Hydra recently but am not super familiar with it. From what I understand, it could indeed cover the need; only: (1) the changes required here are probably easier to implement; (2) while going with Hydra would provide a standard approach, it would also create a dependency

@deepyaman
Member

@Vincent-Liagre-QB Was just taking a closer look at this, including the code. To confirm my understanding of the requirements:

  • Be able to store (overriding) config for each experiment, so that you can rerun any of your configured experiments.
  • Be able to modify filepath on the fly for each experiment, so that you can include experiment_name in the hierarchy.
  • Not sure on this one: be able to automatically persist all MemoryDataSets?

I think modifying filepath based on some param/other variable isn't too bad with Hooks. Storing config for each experiment requires something extra, if not using envs (and I get your reservation on using envs).

@Vincent-Liagre-QB
Author

@deepyaman to your points:

  1. Yes
  2. Yes
  3. More like being able to retrieve all resulting datasets (incl. intermediary results) from a run so as to be able to persist the ones I want in the way I want.

Regarding hooks: in my understanding the limitation is that once you have implemented them, you cannot easily choose whether to apply them or not; i.e. hooks are not programmatically manageable.

Also, I prefer to think in terms of (1) feature needs and (2) possible code implementations (which I called "requirements"), and to think about them separately; so to summarise:

Feature needs:

  1. Independent runs
  2. Manage paths when persisting

Requirements for a possible implementation (note that in this case there is a 1-to-1 matching with the feature needs, but that's not always the case):

  1. A way to run a pipeline specified by a kedro.pipeline.Pipeline object
  2. make it optional to return all unregistered_ds in AbstractRunner.run (vs. only free_outputs)

(See the first message for more details)

@Vincent-Liagre-QB
Author

Also for the sake of enriching the discussion, I was told to look into this: https://kedro-mlflow.readthedocs.io/en/stable/index.html ; not sure it covers the need but worth looking into ; will do

@deepyaman
Member

3. More like being able to retrieve all resulting datasets (incl. intermediary results) from a run so as to be able to persist the ones I want in the way I want.

My inclination is to recommend that you return them explicitly from a node. I think it lends itself well to the idea that pipelines have an interface of inputs and outputs.

Regarding hooks: in my understanding the limitations is that once you have implemented them, you cannot easily choose whether to apply them or not. I.e. hooks are not programatically manageable.

This is doable as long as you design the hooks accordingly (e.g. parse flags that determine when and where to apply the hook logic).
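That gating idea can be shown without Kedro itself. The stand-in below (no @hook_impl, and all names are hypothetical) only illustrates hook logic switched on or off by a runtime flag, e.g. one passed via --params:

```python
class GatedSavePathHook:
    """Stand-in for a hook whose effect is toggled by a params flag."""

    def rewrite_path(self, params, filepath):
        # Apply the experiment-specific path only when the flag is set,
        # e.g. kedro run --params=use_experiment_paths:true,experiment_name:experiment_2
        if not params.get("use_experiment_paths"):
            return filepath  # hook is registered, but its logic is off for this run
        experiment = params["experiment_name"]
        return filepath.replace("08_reporting/", f"08_reporting/{experiment}/")

hook = GatedSavePathHook()
untouched = hook.rewrite_path({}, "data/08_reporting/result_a.csv")
rewritten = hook.rewrite_path(
    {"use_experiment_paths": True, "experiment_name": "experiment_2"},
    "data/08_reporting/result_a.csv",
)
```

In a real project the same pattern would live inside an actual hook method that reads the flag from the run's parameters.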

@avan-sh
Contributor

avan-sh commented Jun 14, 2022

@Vincent-Liagre-QB I'll first try to summarize the requirements to confirm my understanding is right.

  1. You want to run different experiments by passing experiment_name at run time.
  2. Each experiment will be running the same pipeline
  3. Each experiment can be differentiated by different model/pipeline params or inputs.
  4. They are also separated by different output paths/folders

Assuming my understanding is correct, I feel like hooks, as suggested by @deepyaman, might be the right way to go. As the only difference between experiments is the inputs and outputs, and not the pipeline being run, you can choose which files to load at run time using some pattern recognition. This might need TemplatedConfig in the latest versions though.

On integration with MLFlow, it fits perfectly for running different experiments. Ideally, all of the parameters that differentiate the experiment should be logged, and your models can be registered in MLFlow. I think the kedro-mlflow plugin might have this capability.

Edit: A workflow could be this:

  1. The name of the experiment could be a global parameter experiment_name
  2. You could set paths for any common outputs using experiment_name, e.g. data/08_reporting/model_1/${experiment_name}
  3. Experiment-specific config can sit in a separate folder, e.g. conf/experiments/experiment_name.yaml
  4. Additional code in TemplatedConfig/register_catalog to add any files under the experiment-specific config
  5. Each experiment can be run as `kedro run --pipeline experiments --params experiment_name:experiment_1`
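Step 2 relies on ${...} substitution. In pure Python, the resolution that TemplatedConfigLoader performs looks roughly like this (the catalog entry and names are illustrative):

```python
from string import Template

# Hypothetical catalog entry whose filepath is templated on experiment_name
catalog_entry = {
    "type": "pandas.CSVDataSet",
    "filepath": "data/08_reporting/model_1/${experiment_name}/result_a.csv",
}

def resolve(entry, globals_dict):
    """Substitute ${...} placeholders, roughly as TemplatedConfigLoader would."""
    resolved = dict(entry)
    resolved["filepath"] = Template(entry["filepath"]).substitute(globals_dict)
    return resolved

resolved = resolve(catalog_entry, {"experiment_name": "experiment_2"})
```

Each run then only needs to change the experiment_name global to land its outputs in a different folder.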

@Vincent-Liagre-QB
Author

@deepyaman, on nodes, my frustration is that it would prevent using the full capabilities of pipelines;

@avan-sh --> yes that's exactly what I have in mind

@deepyaman @avan-sh on hooks: I'll try to look more into this, but I am a bit skeptical about the possibility of programmatically managing hooks; if you have examples, I am curious to look into them.

On integration w. MLFlow, I was just sharing this as it had been suggested it might cover my need; but that's not the main topic :)

@Vincent-Liagre-QB
Author

Vincent-Liagre-QB commented Aug 22, 2022

Re-opening this now that I have a bit of time to look into it again:

@avan-sh the workflow you shared looks promising; the only thing I have difficulty understanding is how to make sure the version of the params corresponding to the specified experiment_name is used. Could it be with a hook?

EDIT: my previous implementation of after_context_created was missing self

I can access the params with the after_context_created hook (see below) but can't seem to modify the dict; the hook is not supposed to return anything, and I was hoping to leverage the mutability of dictionaries, but this doesn't seem to work (see the test with VerificationHooks in the implementation below).
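A likely explanation, illustrated without Kedro (FakeContext is a stand-in; I believe context.params returns a fresh copy on each access, which would make in-place mutation a no-op):

```python
class FakeContext:
    """Stand-in for KedroContext: params is a property returning a copy."""

    def __init__(self):
        self._params = {"existing": 1}

    @property
    def params(self):
        return dict(self._params)  # a new dict every time it is accessed

ctx = FakeContext()
ctx.params["test_hook_param"] = 5  # mutates the throwaway copy only
```

If that is indeed the behaviour, the second hook would always see the original params, matching the observation above.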


Implementation:

In src/kedro_tutorial/hooks.py

from kedro.framework.hooks import hook_impl

class ExperimentRunHooks:

    @hook_impl
    def after_context_created(self, context) -> None:
        print("Inside ExperimentRunHooks")
        # Trying to modify the dict of params
        context.params["test_hook_param"] = 5

class VerificationHooks:

    @hook_impl
    def after_context_created(self, context) -> None:
        print("Inside hook: VerificationHook")
        print(context.params)

In src/kedro_tutorial/settings.py:

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path

from kedro_tutorial.hooks import ExperimentRunHooks, VerificationHooks

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}

HOOKS = (VerificationHooks(), ExperimentRunHooks())  # LIFO order


@Vincent-Liagre-QB
Author

Vincent-Liagre-QB commented Aug 23, 2022

Also, as pointed out by @avan-sh, we need a hook to inject the extra param experiment_name into the TemplatedConfigLoader; something like (credits to @avan-sh):

    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]
    ) -> ConfigLoader:

        globals_dict = {}
        if extra_params:
            globals_dict = {"experiment_name": extra_params["experiment_name"]}
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict=globals_dict,
        )

but I am not sure this register_config_loader hook template exists; when testing it, it doesn't appear to be called...

@cosasha

cosasha commented Sep 26, 2022

Hello! Has the suggestion of @Vincent-Liagre-QB been taken into account? It would greatly help me if so :)

@avan-sh
Contributor

avan-sh commented Sep 26, 2022

@cosasha, the register_config_loader hook was replaced in kedro 0.18. The issues here might be tackled in https://github.com/kedro-org/kedro/milestone/9; perhaps someone from the maintainer team could comment on this.
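For reference, in Kedro 0.18 the config loader is configured in settings.py rather than via a hook; a minimal sketch of the equivalent of the earlier snippet (the globals_dict values are illustrative, and a static value replaces the extra_params lookup):

```python
# src/<package>/settings.py -- sketch for kedro 0.18.x
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
    "globals_dict": {"experiment_name": "experiment_1"},
}
```

Injecting a runtime experiment_name (as the original hook did from extra_params) would still need some extra wiring, e.g. an after_context_created hook.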

@astrojuanlu
Member

Similar request from @ofir-insait from a month ago:

https://www.linen.dev/s/kedro/t/11183882/please-correct-me-if-i-m-wrong-but-it-looks-like-kedro-s-imp#15db92de-33cf-4d31-a3f4-3f08c9542b81

As stated by @Vincent-Liagre-QB in option (1) at the beginning of the thread, --environment=... only solves part of the problem, and having to write down the modular pipelines to achieve this reusability is indeed a bit cumbersome.

@astrojuanlu
Member

astrojuanlu commented Jun 27, 2023

Similar request from @andrko1 today:

https://www.linen.dev/s/kedro/t/12904947/hi-everyone-i-will-like-to-pass-a-date-parameter-from-the-co#49134a59-df98-4126-86bf-645f5678b515

let's say that we have a folder with a date (partition) and I want to access only the specified date, e.g. ${root_path}/${date}/cars.csv, but I want to change the ${date} variable every time
it doesn't work with --params as it seems to initialize the default parameters first and then replace the specified values
[for example: kedro run 20230627]

@merelcht merelcht added this to the Multi-runner type issues milestone Aug 29, 2023
@astrojuanlu
Member

To all people subscribed to this issue, notice that @marrrcin has published an interesting approach using

  • OmegaConfigLoader with custom resolvers
  • Dataset factories
  • Modular pipelines with namespaces
  • Centralised settings.py

Please give it a read https://getindata.com/blog/kedro-dynamic-pipelines/ and let us know what you think.

@astrojuanlu astrojuanlu mentioned this issue Oct 23, 2023
@astrojuanlu
Member

Today @datajoely recommended @marrrcin's approach as an alternative to Ray Tune for parameter sweep https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#e111a9d2-188c-4cb3-8a64-37f938ad21ff

Are we confident that the DX offered by this approach can compete with this?

from ray import tune

search_space = {
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}

tuner = tune.Tuner(objective, param_space=search_space)

Originally posted by @astrojuanlu in #2627 (comment)

No, but it does provide a budget version of it; this is what I'm saying about the lack of integration with dedicated "sweepers" in this comment

Originally posted by @datajoely in #2627 (comment)

Let's continue the conversation about "parameter sweeping"/experimentation here.

@Vincent-Liagre-QB
Author

Vincent-Liagre-QB commented Oct 26, 2023

To all people subscribed to this issue, notice that @marrrcin has published an interesting approach using

  • OmegaConfigLoader with custom resolvers
  • Dataset factories
  • Modular pipelines with namespaces
  • Centralised settings.py

Please give it a read https://getindata.com/blog/kedro-dynamic-pipelines/ and let us know what you think.

@astrojuanlu thanks for sharing this, and for the overall work connecting everything going on around this feature request. The solution you're sharing seems very promising, although also a bit complex. I'll try to take a deeper look at it asap.

@astrojuanlu
Member

Nice talk on how to do hyperparameter tuning and selection in Flyte
https://www.youtube.com/watch?v=UO1gsXuSTzg (key bit starts around 12 mins in)

Originally posted by @datajoely in #2627 (comment)

Optuna + W&B
https://colab.research.google.com/drive/1WxLKaJlltThgZyhc7dcZhDQ6cjVQDfil#scrollTo=sHcr30CKybN7

Originally posted by @datajoely in #2627 (comment)

@astrojuanlu
Member

A user that uses different environments https://linen-slack.kedro.org/t/16041288/question-on-environments-and-credentials-we-are-currently-us#49927057-9256-455d-9213-94b898fcb699

we have a lot of params that change depending on the pipeline input so we used the envs concept to parametrise through the cli - works well for us.

Essentially option (1) of @Vincent-Liagre-QB's original ticket. In my opinion this is an abuse of environments, but it's what users want: add a new config file, change a CLI flag, and done.

@nikos-kal

A user that uses different environments https://linen-slack.kedro.org/t/16041288/question-on-environments-and-credentials-we-are-currently-us#49927057-9256-455d-9213-94b898fcb699

I am that user! Indeed, we have repurposed envs to act as parameter groups. It works fairly well for us and it's been easy to train new team members on how we use them.

Would love a kedro-native solution though!

PS: For most functionality that is not out of the box in kedro, the community tends to recommend hooks. My experience is that large projects can end up with dozens of hooks, and each team uses different ones, making onboarding difficult. Also, logic applied there might appear as side effects to someone not familiar with them, so my preference is to use them sparingly. Just one person's opinion :)

@astrojuanlu
Member

@netphantom #3308

I need to run multiple pipelines with different inputs, so I have configured in my parameters.yml something like:

neural_network_heads: [100, 200, 300]

I would like Kedro to take each value into account one at a time, and run 3 pipelines.
Using Snakemake, I put the expand rule and it took care of it. Is it possible to do the same in Kedro?

@astrojuanlu
Member

"Live replay" of a user attempting the current approach #3308, useful for future iterations

@astrojuanlu
Member

When showing dataset factories to some users internally:

Can I pass the parameters directly on the CLI instead of creating new namespaces?
