
Nested parameters update via command line / config file #750

Closed
Tracked by #1963
noklam opened this issue Apr 19, 2021 · 9 comments
Labels: Issue: Feature Request (new feature or improvement to existing feature)

Comments

@noklam
Contributor

noklam commented Apr 19, 2021

Description

Currently, when passing a parameter to override configuration, e.g.
kedro run --params key:value, only top-level keys are supported, not nested dictionaries. This is not flexible for experiments.

Similarly, overriding via a config file with kedro run --config=config.yml suffers from the same top-level-keys-only problem.

Given this parameters.yml:

# parameters.yml
group1:
   key1: 1
   key2: 2
   key3: 3

Ideally, overriding the nested key (e.g. kedro run --params group1.key3:4) should produce:

# effective parameters
group1:
   key1: 1
   key2: 2
   key3: 4

Current behavior (the top-level key is replaced wholesale, so key1 and key2 are lost):

# effective parameters
group1:
   key3: 4

Related #605

Context

Many parameters are organized in groups rather than as top-level keys.

Possible Implementation

Maybe a simple nested dict update is enough; I am not sure.
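
For illustration, a minimal sketch in plain Python (not existing Kedro API) of what such a nested update could look like:

def update_nested(base: dict, override: dict) -> dict:
    # Recurse into dict values instead of replacing them wholesale.
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            update_nested(base[key], value)
        else:
            base[key] = value
    return base


params = {"group1": {"key1": 1, "key2": 2, "key3": 3}}
update_nested(params, {"group1": {"key3": 4}})
# -> {"group1": {"key1": 1, "key2": 2, "key3": 4}}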


noklam added the Issue: Feature Request label on Apr 19, 2021
@antonymilne
Contributor

Hello @noklam, I totally understand the need for this and agree that there should be a better mechanism for overwriting a nested parameter. But I'm not sure how easy it is to resolve just this without some more holistic solution that reworks how configuration is handled more generally, and as per @lorenabalan's comment in #605 there's a bit of a conversation ongoing about this (N.B. @datajoely).

If we allow kedro run --params key.subkey:value to overwrite a nested parameter then it immediately raises the question of why we don't allow this sort of nested overwrite in configuration environments as you wanted to do in #605. So while it sounds like a very useful feature to add, it leads to inconsistency between the CLI and the parameters.yml approaches to specifying configuration. In an ideal world you would be able to do this sort of thing from both CLI and parameters.yml, but that is part of the larger conversation I think (and in addition to overwriting keys should also consider things like deleting and adding keys).

This nested overwriting behaviour is already possible through use of TemplatedConfigLoader and globals. I know that's a bit hacky and not a nice solution, but at least it is possible to currently do this. So my opinion would tend to be: very nice idea, I 100% agree we need a better solution than we currently have, but I think it needs to come as part of the complete solution rather than just implementing part of it that only works using the CLI.
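
For reference, that workaround looks roughly like this; a sketch assuming the Kedro 0.17-era register_config_loader hook (file names and values are illustrative):

# conf/base/globals.yml
key3_value: 3

# conf/base/parameters.yml
group1:
   key1: 1
   key2: 2
   key3: ${key3_value}

# hooks.py
from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # Fill ${...} placeholders from any *globals.yml found in the conf paths.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

A conf/local/globals.yml (or another environment) can then override key3_value alone, leaving the rest of group1 intact.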

@noklam
Contributor Author

noklam commented Apr 19, 2021

@AntonyMilneQB Thank you for your prompt response. I agree with what you said, it leads to inconsistency, but I really hope the Kedro team may consider adding this. TemplatedConfigLoader only works in a very limited context.

Here is one thing that I want to do: a dynamic pipeline (which isn't easy to do in Kedro). Instead of generating pipelines in code, I want to drive a single pipeline from the CLI.

I have a machine learning model pipeline which takes weeks t + d to t + 4 + d as training data, then predicts the following week. Say I want to do a rolling prediction for every week:

Train: t + d to t + 4 + d, Predict: week t + 4 + d, where d = 1
Train: t + d to t + 4 + d, Predict: week t + 4 + d, where d = 2
Train: t + d to t + 4 + d, Predict: week t + 4 + d, where d = 3
Train: t + d to t + 4 + d, Predict: week t + 4 + d, where d = 4
Train: t + d to t + 4 + d, Predict: week t + 4 + d, where d = 5

As a workaround, to avoid re-writing the pipeline for dynamic pipeline generation, I can do this instead:

kedro run --params d:1
kedro run --params d:2
kedro run --params d:3
kedro run --params d:4
kedro run --params d:5

I can just paste this into a terminal and it will run the 5 pipelines for me. Currently this only works for top-level keys, so I have to promote my sub-key to a top-level key to make this hack work. This kind of "dynamic" pipeline generation is very common for machine learning tasks. At first, I tried to re-write the pipeline to generate pipelines on the fly; over time, I found this CLI style easier to understand and maintain.

I would really love to hear the Kedro team's opinion on the best/common way of handling these kinds of situations with Kedro. So far, every solution I have found online for dynamic pipelines is fragile, easily breaks compatibility between Kedro versions, and makes the code unreadable and less reusable.

@antonymilne
Contributor

The "official" kedro viewpoint is basically that pipelines should be static (see #650). At the moment I think there are a couple of options, but I 100% do sympathise with you here and think we should have a better solution for this sort of thing. Interested to hear what other people think, but my suggestions for now would be as follows.

Modular pipelines (docs)

Generate each pipeline with a different namespace. In pipeline_registry.py, something like this:

from typing import Dict

from kedro.pipeline import Pipeline, pipeline

from my_project.pipelines.model import create_pipeline  # wherever your base pipeline lives


def register_pipelines() -> Dict[str, Pipeline]:
    base_pipeline = create_pipeline()
    pipelines = {}
    for delta in range(1, 4):
        pipelines[f"d_{delta}"] = pipeline(
            base_pipeline,
            # Remap the parameter input so each namespaced copy reads its own value
            parameters={"params:key.subkey.subsubkey_0": f"params:key.subkey.subsubkey_{delta}"},
            namespace=f"d_{delta}",
        )
    return pipelines

And then in parameters.yml:

key:
    subkey:
        subsubkey_0: 0
        subsubkey_1: 1
        subsubkey_2: 2
        subsubkey_3: 3

And then do kedro run --pipeline=d_1, for example. This is kind of hacky, I know. But it is quite flexible, as you can add together pipelines in your pipeline_registry to define, e.g. a pipeline that runs both delta=1 and delta=2. It also means that you can overwrite an arbitrarily nested parameter, so long as you refer to it directly as a node input.
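
For instance (Pipeline objects support +, so combined pipelines can be registered too):

pipelines["d_1_and_2"] = pipelines["d_1"] + pipelines["d_2"]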

Custom config loader

Depending on exactly what you want to do, this is maybe the cleanest solution. It's kind of a more "proper" implementation of the TemplatedConfigLoader workaround. Write your own ConfigLoader that merges parameter dictionaries rather than overwriting top level keys, and then run using a different environment for each delta. Note this new ConfigLoader potentially creates its own problems though (e.g. you can't delete keys without introducing a new syntax for that). @SteveLerQB wrote something for this on a client project and it worked pretty well I think.
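
A rough sketch of such a merging ConfigLoader, assuming the Kedro 0.17-era API (ConfigLoader takes a list of conf paths and .get() takes glob patterns):

from typing import Any, Dict

from kedro.config import ConfigLoader, MissingConfigException


def _deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    # Merge nested dicts recursively; scalar values in `override` win.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


class MergingConfigLoader(ConfigLoader):
    """Deep-merges config across environments instead of replacing top-level keys."""

    def get(self, *patterns: str) -> Dict[str, Any]:
        merged: Dict[str, Any] = {}
        for conf_path in self.conf_paths:  # base first, then environment overrides
            try:
                config = ConfigLoader([conf_path]).get(*patterns)
            except MissingConfigException:
                continue  # this environment has no files matching the patterns
            merged = _deep_merge(merged, config)
        return merged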

To get this working from the CLI as well as parameters.yml across different environments, it looks like just making a new ConfigLoader wouldn't be enough; you'd additionally need to overwrite KedroContext.params as you did in your PR. Actually, it's worth noting that altering KedroContext is something you can easily do, since you can set your own context class using CONTEXT_CLASS in settings.py. So you could already achieve what you're doing in #751 that way, without needing a new ConfigLoader.
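
A sketch of that context override (hypothetical, assuming the 0.17-era KedroContext, where self._extra_params holds the --params overrides and the stock params property applies them as a top-level dict.update):

# settings.py would set: CONTEXT_CLASS = MergingContext

from kedro.framework.context import KedroContext


class MergingContext(KedroContext):
    @property
    def params(self):
        # Deep-merge the CLI --params overrides instead of replacing
        # top-level keys; reuses _deep_merge from the sketch above.
        params = self.config_loader.get("parameters*", "parameters*/**")
        return _deep_merge(params, self._extra_params or {})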

@noklam
Contributor Author

noklam commented Apr 20, 2021

@AntonyMilneQB Thanks, I like this solution better. Just one thing.

Is there a way to define the 1 and 4 in parameters.yml and pass them to the pipeline, instead of hard-coding them in the code?
for delta in range(1, 4):

@antonymilne
Contributor

I think the basic answer is no, sorry. To do this you would need to access the catalog, which is currently only possible through the context. Registration of pipelines (through configure_project) happens before this, so it won't have the current session or context available yet. In the future it should be possible to load the catalog without needing a context (as you can now do for pipelines with from kedro.framework.project import pipelines), but not yet. Even when it becomes possible, I don't think using the catalog when registering pipelines will be a generally recommended workflow.

For the time being I think you'd need to either manually load the parameters you want from parameters.yml inside register_pipelines or inject them some other way (e.g. os.getenv).
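
A sketch of that manual route (file path, key, and environment variable names are illustrative):

# pipeline_registry.py
import os
from pathlib import Path

import yaml


def _n_deltas() -> int:
    # Environment variable wins; otherwise fall back to reading the YAML
    # file directly, outside the Kedro config system.
    if "N_DELTAS" in os.environ:
        return int(os.environ["N_DELTAS"])
    params = yaml.safe_load(Path("conf/base/parameters.yml").read_text())
    return params.get("n_deltas", 4)

register_pipelines can then loop over range(1, _n_deltas()) instead of a hard-coded range.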

@noklam
Contributor Author

noklam commented Apr 20, 2021 via email

@antonymilne
Contributor

antonymilne commented Apr 21, 2021

> In older Kedro versions, def create_pipeline(**kwargs): actually takes a keyword argument. In later versions, it is removed.

As far as I can tell this still takes **kwargs. What might have changed is that the kedro new template no longer produces any pipelines, since with the addition of modular pipelines these are now made by kedro pipeline create.

> Can you elaborate on why it is not recommended? So you think these "meta-parameters" about the pipeline creation should be kept in separate files?

I think Ivan's post here gives quite a nice explanation of it. Basically it boils down to "pipelines are static and do not change after you deploy your code".

With that said, my personal opinion is actually that sometimes dynamic pipeline creation makes sense. In these cases I think passing arguments into create_pipeline is a good way to do it. Combined with modular pipelines, this is quite powerful and flexible, and you should be able to achieve whatever you need once you've got the "meta-parameters" in the right place.
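
For example (a sketch; n_deltas and make_delta_pipeline are hypothetical names):

from kedro.pipeline import Pipeline


def create_pipeline(**kwargs) -> Pipeline:
    n_deltas = kwargs.get("n_deltas", 5)
    pipe = Pipeline([])
    for d in range(1, n_deltas + 1):
        pipe += make_delta_pipeline(d)  # hypothetical helper building one delta's sub-pipeline
    return pipe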

As for where the meta-parameters should live: I do think that, even if it were possible to put them in parameters.yml, this isn't really the right place for them. Parameters from parameters.yml are treated as catalog entries and can be used as inputs to nodes. A parameter that describes pipeline creation is quite different: it's not an input to a node; it's an input to the pipeline creation mechanism.

So I would recommend using a different place for defining meta-parameters, either a meta_parameters.yml or an environment variable. If you still want to use the same conf system, so that a conf/local/meta_parameters.yml can overwrite conf/base/meta_parameters.yml, you should still be able to use ConfigLoader to achieve this; it will just be used outside a Kedro context. If you don't need that sort of inheritance, it's probably easiest to load the file yourself without worrying about the whole Kedro config system.
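
Loading it with that inheritance but outside a context could look like this (a sketch against the 0.17-era ConfigLoader; the file name follows the suggestion above):

from kedro.config import ConfigLoader

# conf/local/meta_parameters.yml overrides conf/base/meta_parameters.yml
meta_params = ConfigLoader(["conf/base", "conf/local"]).get("meta_parameters*")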

@lorenabalan
Contributor

I'll go ahead and close this as answered for now, but we made a note of it in the wider configuration workstream. Thank you for sharing!

@noklam
Copy link
Contributor Author

noklam commented Aug 30, 2022

#927 is an equivalent PR that got merged.
