Make config loading consistently happen before pipelines are registered to allow for dynamic pipelines with OmegaConf #3093

Open
Lasica opened this issue Sep 28, 2023 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Lasica commented Sep 28, 2023

Description

Currently, the order of pipeline loading and config loading varies depending on the kedro command. If pipelines are ever to be dynamic, depending on config/params, then the config should always be read before the pipelines are registered. Examples:

I added hello-world prints to the pipeline registry function and to the config resolver function to demonstrate the order of loading:

✅ Command `kedro catalog list`
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro catalog list
Hello config register resolver
Hello pipeline registry function
❌ Command `kedro run`
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro run --namespace price_predictor.base --nodes price_predictor.base.debug_node
[09/28/23 14:59:54] INFO     Kedro project spaceflights-multirun                                                                                                             session.py:364
Hello pipeline registry function
Hello config register resolver
[09/28/23 14:59:55] INFO     Loading data from 'params:price_predictor.base.model_options' (MemoryDataset)...                                                           data_catalog.py:492
                    INFO     Running node: debug_node: verbose_params([params:price_predictor.base.model_options]) -> None                                                      node.py:331
                    INFO     Verbose debug node reporting                                                                                                                       nodes.py:60
                    INFO     Argument number:0, Value:{'test_size': 0.2, 'random_state': 3, 'target': 'price', 'features': ['engines', 'passenger_capacity', 'crew',            nodes.py:62
                             'd_check_complete'], 'model': 'sklearn.linear_model.LinearRegression', 'model_params': {'gamma': 3}}                                                          
                    INFO     Completed 1 out of 1 tasks                                                                                                             sequential_runner.py:85
                    INFO     Pipeline execution completed successfully.  
❌ Command `kedro registry list`
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro registry list                                                               
Hello pipeline registry function
- __default__
- data_processing
- data_science
✅ Command `kedro ipython`
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro ipython
ipython --ext kedro.ipython
Python 3.10.4 (main, Apr  6 2023, 13:50:45) [GCC 12.2.1 20230111]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.
[09/28/23 15:01:51] INFO     Resolved project path as: /home/adobrogo/projects/kedro/multirunner-demo/spaceflights-multirun.                                                __init__.py:139
                             To set a different path, run '%reload_kedro <project_root>'                                                                                                   
Hello config register resolver
[09/28/23 15:01:52] INFO     Kedro project spaceflights-multirun                                                                                                            __init__.py:108
                    INFO     Defined global variable 'context', 'session', 'catalog' and 'pipelines'                                                                        __init__.py:109
[09/28/23 15:01:53] INFO     Registered line magic 'run_viz'                                                                                                                __init__.py:115

In [1]: 

I think that's all the places that read the pipelines from kedro.framework.project.
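
To make the consequence of the ordering concrete, here is a minimal standard-library sketch. It does not use Kedro; `load_config`, `register_pipelines` and `run_command` are hypothetical stand-ins for the config resolver, the pipeline registry function and a CLI command, and the family/variant names are only illustrative.

```python
from __future__ import annotations

# Hypothetical stand-ins, not Kedro's real functions.
MODEL_FAMILIES: dict[str, list[str]] = {}


def load_config() -> None:
    """Stands in for config loading, which triggers the custom resolver."""
    MODEL_FAMILIES["price_predictor"] = ["base", "candidate1"]


def register_pipelines() -> list[str]:
    """Stands in for the pipeline registry: one namespace per model variant."""
    return [
        f"{family}.{variant}"
        for family, variants in MODEL_FAMILIES.items()
        for variant in variants
    ]


def run_command(config_first: bool) -> list[str]:
    """Simulate a command that loads config and pipelines in a given order."""
    MODEL_FAMILIES.clear()
    if config_first:
        load_config()
        return register_pipelines()
    namespaces = register_pipelines()  # registry sees an empty MODEL_FAMILIES
    load_config()
    return namespaces


print(run_command(config_first=True))   # ['price_predictor.base', 'price_predictor.candidate1']
print(run_command(config_first=False))  # []
```

With the config-first order the registry sees the populated families; with the pipelines-first order it silently builds nothing, which is what would happen for the commands marked ❌ above.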

Context

I wrote an example based on the Spaceflights starter that uses the modular pipelines feature and an OmegaConf resolver to create pseudo-dynamic pipelines (I will publish it for review soon):

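    from kedro.pipeline import pipeline

    # MODEL_FAMILIES maps each model family name to its list of variant names;
    # data_science_pipeline is the base pipeline that gets reused.
    pipes = []
    for family, variants in MODEL_FAMILIES.items():
        for model_variant in variants:
            # One namespaced copy of the data science pipeline per variant.
            pipes.append(
                pipeline(
                    data_science_pipeline,
                    inputs={"model_input_table": "model_input_table"},
                    namespace=f"{family}.{model_variant}",
                    tags=[model_variant],
                )
            )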

Currently, MODEL_FAMILIES is a static variable that is validated against what is defined in the parameters, with configuration like:

Configuration sample
model_options:
  test_size: 0.2
  random_state: 3
  target: price
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating
  # unused, it's defined for demo purposes
  model: sklearn.linear_model.LinearRegression
  model_params: {}

# model family
price_predictor:
  _overrides:
    model_params:
      gamma: 3
    features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
  model_options: ${merge:${model_options},${._overrides}}
  
  # model variants
  base:
    model_options: ${..model_options}
  candidate1:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      features:
      - engines
      - passenger_capacity
      - crew
      - d_check_complete
      - company_rating
  candidate2:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      model_params:
        gamma: 2.5
  candidate3:
    model_options: ${..model_options}



model_families: ${register_model_families:}

It would be fully dynamic if the order of loading were consistent: a custom OmegaConf resolver could then populate MODEL_FAMILIES, which defines the namespaces of the modular pipelines shown before. I believe this is the minimum-effort change to achieve dynamic pipeline functionality.
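
As a rough illustration of what populating MODEL_FAMILIES from the parameters could look like, here is a standard-library sketch that works on a plain nested dict. The function name `find_model_families` and the detection rule (a family is a top-level mapping with at least one non-underscore child mapping that has a `model_options` key) are my assumptions for this example; wiring it into an actual OmegaConf resolver is left out.

```python
from __future__ import annotations

from typing import Any, Mapping


def find_model_families(params: Mapping[str, Any]) -> dict[str, list[str]]:
    """Return {family: [variants]} discovered in a nested parameters dict.

    Assumed rule: a variant is a child mapping whose name does not start
    with "_" and which contains a "model_options" key.
    """
    families: dict[str, list[str]] = {}
    for family, body in params.items():
        if not isinstance(body, Mapping):
            continue
        variants = [
            name
            for name, child in body.items()
            if not name.startswith("_")
            and isinstance(child, Mapping)
            and "model_options" in child
        ]
        if variants:
            families[family] = variants
    return families


# Trimmed-down version of the configuration sample, already resolved to dicts.
params = {
    "model_options": {"test_size": 0.2, "target": "price"},
    "price_predictor": {
        "_overrides": {"model_params": {"gamma": 3}},
        "model_options": {"test_size": 0.2},
        "base": {"model_options": {}},
        "candidate1": {"model_options": {}, "_overrides": {}},
    },
}

print(find_model_families(params))  # {'price_predictor': ['base', 'candidate1']}
```

The top-level `model_options` block is skipped because none of its children are variants, so only `price_predictor` is reported as a family.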

Possible Implementation

Because many Kedro resources are evaluated lazily, for KedroSession it's sufficient to load the catalog before reading the pipeline to run. The two don't depend on each other, so their order can simply be swapped.

For kedro registry, a config read could be introduced.

However, for better consistency across all places, the _ProjectPipelines class from kedro.framework.project should refer to the config in some way so that the config actually gets loaded first.
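
A minimal sketch of that last idea, assuming a lazy container that forces a config load right before it calls the registry function. `LazyPipelines` is a hypothetical class written for this example; it is not the real _ProjectPipelines and does not reflect its internals.

```python
from __future__ import annotations

from typing import Any, Callable


class LazyPipelines:
    """Hypothetical lazy pipeline container that loads config first."""

    def __init__(
        self,
        load_config: Callable[[], None],
        register_pipelines: Callable[[], dict[str, Any]],
    ) -> None:
        self._load_config = load_config
        self._register_pipelines = register_pipelines
        self._content: dict[str, Any] | None = None

    def _load(self) -> dict[str, Any]:
        # Nothing runs until first access; config is always loaded first.
        if self._content is None:
            self._load_config()
            self._content = self._register_pipelines()
        return self._content

    def __getitem__(self, name: str) -> Any:
        return self._load()[name]

    def keys(self) -> list[str]:
        return list(self._load())


events: list[str] = []


def load_config() -> None:
    events.append("config")


def register_pipelines() -> dict[str, Any]:
    events.append("pipelines")
    return {"__default__": "placeholder pipeline"}


pipelines = LazyPipelines(load_config, register_pipelines)
print(pipelines.keys())  # ['__default__']
print(events)            # ['config', 'pipelines']
```

Because every consumer goes through the same container, the order would be the same no matter which command triggers the first access.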

Related issues:

@Lasica Lasica added the Issue: Feature Request New feature or improvement to existing feature label Sep 28, 2023

noklam commented Sep 28, 2023

Minor GitHub hack: if you structure the related issues as bullet points, the titles render automatically, i.e.

- #3000
- #2663
- #2627
- #2626

datajoely commented

Great push - thanks for the well thought out issue @Lasica !
