
Add the ability to dynamically create modular pipelines #1993

Closed
jstammers opened this issue Oct 31, 2022 · 3 comments
Labels

- Community: Issue/PR opened by the open-source community
- Issue: Feature Request: New feature or improvement to existing feature
- Stage: Technical Design 🎨: Ticket needs to undergo technical design before implementation

Comments

@jstammers
Contributor

Description

I have a pipeline that is broadly composed of the following steps:

  1. Load and preprocess some data
  2. Apply a function to transform it
  3. Store the transformed data and some additional metadata

An example of the pipeline is presented below:

from typing import List

import pandas as pd
from kedro.pipeline import node, pipeline


def create_data():
    df = pd.DataFrame({"x": [1, 2, 3]})
    return df


def process_data(df: pd.DataFrame, x: int) -> pd.DataFrame:
    df["x"] *= x
    return df


def combine_data(dfs: List[pd.DataFrame]) -> pd.DataFrame:
    return pd.concat(dfs)


def create_pipeline(*args, **kwargs):
    input_pipeline = pipeline(
        [node(create_data, inputs=None, outputs="input_data")]
    )

    process_pipeline = pipeline(
        [
            node(
                process_data,
                inputs=["input_data", "param:x"],
                outputs="processed_data",
            )
        ]
    )

    combine_pipeline = pipeline(
        [node(combine_data, inputs=["processed_data"], outputs="combined_data")]
    )
    
    return input_pipeline + process_pipeline + combine_pipeline
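
For reference, with a single parameter value (say `x = 2`) the three nodes above reduce to the following plain pandas sequence; this is a sketch of the data flow only, not Kedro API:

```python
import pandas as pd

# create_data: produce the input frame
df = pd.DataFrame({"x": [1, 2, 3]})

# process_data with params:x = 2
processed = df.copy()
processed["x"] *= 2

# combine_data over the (single) processed frame
combined = pd.concat([processed])
```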

As an extension to the current functionality, I would like to iterate over multiple parameters for process_pipeline and combine the results together at the final stage. Additionally, these loop parameters would be determined as an output of the initial pipeline. Using modular pipelines, if it were possible to load the result of a node's output at pipeline-construction time, I would expect to be able to compose the pipeline as follows:

def create_data():
    loop = [1, 2, 3, 4, 5]
    df = pd.DataFrame({"x": [1, 2, 3]})
    return df, loop


def create_pipeline(*args, **kwargs):
    input_pipeline = pipeline(
        [node(create_data, inputs=None, outputs=["input_data", "loops"])]
    )

    process_pipeline = pipeline(
        [
            node(
                process_data,
                inputs=["input_data", "params:loop"],
                outputs="processed_data",
            )
        ]
    )

    p = input_pipeline
    process_outputs = []

    # `_load` is the hypothetical API being requested here: it would load the
    # "loops" output from the catalog at pipeline-construction time
    for loop in _load("loops"):
        processor = pipeline(
            process_pipeline,
            parameters={"params:loop": loop},
            namespace=str(loop),
        )
        # assumes namespacing adds '<loop>.processed_data' to the catalog
        process_outputs.append(f"{loop}.processed_data")
        p += processor

    combine_pipeline = pipeline(
        [node(combine_data, inputs=process_outputs, outputs="combined_data")]
    )
    p += combine_pipeline

    return p
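
Stripped of Kedro, the requested shape is an ordinary fan-out/fan-in where the fan width comes from upstream data; a minimal pure-Python sketch (list-based stand-ins for the DataFrames):

```python
def create_data():
    # The loop values are produced by the first step, not known up front
    loops = [1, 2, 3]
    data = [1, 2, 3]
    return data, loops


def process_data(data, x):
    return [v * x for v in data]


def combine_data(frames):
    combined = []
    for frame in frames:
        combined.extend(frame)
    return combined


data, loops = create_data()
processed = [process_data(data, x) for x in loops]  # fan-out over runtime values
combined = combine_data(processed)                  # fan-in
```

In plain Python the fan width can depend on `create_data`'s output, but a Kedro pipeline's node set is fixed before the run starts, which is the gap this issue asks to close.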

Context

This change would be useful because it would allow me to extend a current production pipeline without requiring additional modifications to the code used to execute it. To make use of this dynamic parameter, I currently have to create a separate runner script for this specific pipeline, which inevitably makes it less portable.

Possible Alternatives

To achieve the desired functionality, I have implemented something similar to the following:

from kedro.io import DataCatalog, MemoryDataSet
from kedro.runner import SequentialRunner

catalog = DataCatalog()

pipeline = create_pipeline()
runner = SequentialRunner()

init_pipeline = pipeline.to_outputs("input_data", "loops")

init_run = runner.run(init_pipeline, catalog)

loops = init_run["loops"]
processed_data = []

for l in loops:
    catalog = DataCatalog(
        {
            "input_data": MemoryDataSet(init_run["input_data"]),
            "params:loop": MemoryDataSet(l),
        }
    )
    loop_run = runner.run(
        pipeline.from_inputs("input_data", "params:loop").to_outputs("processed_data"),
        catalog,
    )
    processed_data.append(loop_run["processed_data"])

output_run = runner.run(
    pipeline.from_inputs("processed_data"),
    DataCatalog({"processed_data": MemoryDataSet(processed_data)}),
)

but this does not give me the full functionality of Kedro. For example, I can't use this to load catalogs from different environments. This script is also very tightly coupled to the pipeline, such that if I change the pipeline, I would need to change this script as well.
Another option I have come across is to use a custom Runner, as described in #1853, but I haven't yet tried to implement this for my pipeline.

@jstammers jstammers added the Issue: Feature Request New feature or improvement to existing feature label Oct 31, 2022
@datajoely
Contributor

Related discussion by @noklam here #1963

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jan 26, 2023
@AhdraMeraliQB AhdraMeraliQB added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 21, 2023
@noklam noklam mentioned this issue Jun 1, 2023
1 task
@astrojuanlu
Member

It's not clear to me what the ask is here. Defining modular pipelines dynamically is already possible with Python code. See for instance https://getindata.com/blog/kedro-dynamic-pipelines/

    pipes = []
    for namespace in settings.DYNAMIC_PIPELINES_MAPPING.keys():
        pipes.append(
            pipeline(
                data_processing,
                inputs={
                    "companies": "companies",
                    "shuttles": "shuttles",
                    "reviews": "reviews",
                },
                namespace=namespace,
                tags=settings.DYNAMIC_PIPELINES_MAPPING[namespace],
            )
        )
    return sum(pipes)

What am I missing?
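
The mechanics of that pattern, independent of Kedro, are just a dictionary-driven fan-out over configuration; a minimal stand-alone sketch (the mapping contents and the `make_pipeline` helper are placeholders, not Kedro API):

```python
# Placeholder for settings.DYNAMIC_PIPELINES_MAPPING from the linked blog post
DYNAMIC_PIPELINES_MAPPING = {
    "base": ["base"],
    "candidate_1": ["candidate"],
    "candidate_2": ["candidate"],
}


def make_pipeline(namespace, tags):
    # Stand-in for kedro.pipeline.pipeline(data_processing, namespace=..., tags=...)
    return {"namespace": namespace, "tags": list(tags)}


pipes = [
    make_pipeline(namespace, DYNAMIC_PIPELINES_MAPPING[namespace])
    for namespace in DYNAMIC_PIPELINES_MAPPING
]
```

Note that the loop runs at pipeline-registration time over static configuration, so the set of namespaces must be known before any node executes; the original request is for the loop values to come from a node's output instead.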

Also, this issue hasn't had any activity in a year, so I'm voting to close it unless we can clarify the problem.

@merelcht
Member

I agree with @astrojuanlu, closing this issue now due to inactivity and it not being entirely clear what the ask is. If you come across this issue and have a similar need, feel free to comment and we can consider re-opening it for further discussion.

@merelcht merelcht closed this as not planned Mar 12, 2024