Lazy Loading of Catalog Items #2829

Open
m-gris opened this issue Jul 22, 2023 · 25 comments · Fixed by kedro-org/kedro-plugins#281

Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@m-gris

m-gris commented Jul 22, 2023

Hi everyone,

Currently, running modular pipelines requires loading all catalog datasets before pipeline execution, which, to me, seems not really ideal...

Let's assume a simple project, with 2 modular pipelines, A and B.

If B requires datasets created via API calls or interactions with remote databases, then, when working offline, one cannot do a simple kedro run --pipeline A, since Kedro will fail at loading B's datasets... which are not needed to run A!
A bit of a pity...

A first, impulsive reaction might be to comment out those unneeded datasets in the local conf, but since commented-out entries are simply ignored rather than overriding base, this would 'do nothing'...

Granted, one could simply comment out those datasets in the base config...
But "messing around" with base to handle a temporary / local need seems, to me at least, to deviate from the "Software engineering best practices" that kedro aims at fostering.

To avoid this, one would therefore have to create some sort of dummy dataset in the local conf... Totally do-able, but not super convenient and definitely not frictionless...

I guess that such "troubles" could be avoided if datasets were loaded lazily, only when needed, or by building the DAG first and then filtering datasets that are not actually needed...

I hope that this suggestion will both sound relevant and not represent much of a technical challenge.

Best Regards
Marc

@m-gris m-gris added the Issue: Feature Request New feature or improvement to existing feature label Jul 22, 2023
@noklam
Contributor

noklam commented Jul 24, 2023

Aleksander Jaworski
1 hour ago
[Kedro-version : 0.18.6 currently]
Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines.
I have a question: would some sort of lazy-configuration-validation be a useful feature for kedro?
I have 2 reasons for asking:
It feels a bit cumbersome that even a simple hello_world.py takes several seconds to run when the configuration is large enough: first all the logs appear and all the setup is done for the data catalog etc., none of which actually ends up being used in a hello_world.py.
When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro, all of them currently need to be filled in, as everything is validated at once. In a lazy version, only the credentials required by the selected pipeline would need to be evaluated.
Are there any solutions or modifications I could use to improve my approaches here? Thanks in advance! :)

I asked the user to comment out all SQLDataSet entries to see how it impacts performance. It shows that they contribute a significant amount of the startup time.

~21 secs for kedro catalog list
~6 secs after commenting out the SQL and GBQ datasets

@noklam
Contributor

noklam commented Jul 24, 2023

To scope the discussion better.

@deepyaman deepyaman self-assigned this Jul 25, 2023
@deepyaman
Member

Hey @m-gris! Which dataset were you facing an issue with on #2829? I think I've "fixed" it in kedro-org/kedro-plugins#281 (just for pandas.SQLTableDataSet; I haven't applied it to other instances yet). If it's that one, maybe you can try it out and see if it fixes your issue. 🙂 In the meantime, I'll work on adding it across other datasets.

To test, I modified Spaceflights to make companies a pandas.SQLTableDataSet:

# conf/base/catalog.yml
companies:
   type: pandas.SQLTableDataSet
   credentials: db_credentials
   table_name: shuttles
   load_args:
     schema: dwschema
   save_args:
     schema: dwschema
     if_exists: replace

# conf/local/credentials.yml
db_credentials:
   con: postgresql://scott:tiger@localhost/test

This database doesn't actually exist, so kedro run fails, but kedro run --pipeline data_science still works (if you've previously generated model_input_table). (You still need to have the dependencies for the dataset installed, e.g. psycopg2.)
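In other words, the two runs being compared are roughly:

kedro run                            # fails: the default pipeline needs companies, which can't be loaded
kedro run --pipeline data_science    # works: companies is never referenced, so it's never loaded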

@deepyaman
Member

Hey @m-gris! Which dataset were you facing an issue with on #2829? I think I've "fixed" it in kedro-org/kedro-plugins#281 (just for pandas.SQLTableDataSet; I haven't applied it to other instances yet). If it's that one, maybe you can try it out and see if it fixes your issue. 🙂 In the meantime, I'll work on adding it across other datasets.

Confirmed offline with @m-gris that kedro-org/kedro-plugins#281 works for him, so I'll get it ready for review when I get a chance! :)

@m-gris
Author

m-gris commented Aug 12, 2023

Hi everyone,

Apologies for my silence / lack of responsiveness, and thanks to Deepyaman for reaching out and coming up with a fix for this issue.

I'm happy to confirm that it does work, in the sense that I can run pipeline A even if some SQL datasets needed for pipeline B can't be loaded because I'm offline / without access to the DB.

However... With further tests I noticed something that, to me at least, seems not ideal:
The con is still needed even if not used.

I know… I’m being a bit picky.
But here is the case I have in mind:
A team member working on a new modular pipeline C has created a new SQL dataset in the catalog.
However, since this is a pipeline I do not work on, I do not have a valid con to provide.
Granted, I could hack my way around it by passing a dummy URI... But this feels a bit odd / not ideal...

Instead of putting garbage in the conf, wouldn't it be better to 'simply' have the ability to leave con empty until I really do need it?

Thanks in advance,
Regards
M

@deepyaman
Member

However... With further tests I noticed something that, to me at least, seems not ideal: The con is still needed even if not used.

I know… I'm being a bit picky. But here is the case I have in mind: A team member working on a new modular pipeline C has created a new SQL dataset in the catalog. However, since this is a pipeline I do not work on, I do not have a valid con to provide. Granted, I could hack my way around it by passing a dummy URI... But this feels a bit odd / not ideal...

Hi again @m-gris! Based on just my evaluation, I think what you're requesting is feasible, and looks reasonably straightforward. When constructing DataCatalog.from_config(), the parsed dataset configuration is materialized (see https://github.com/kedro-org/kedro/blob/0.18.12/kedro/io/data_catalog.py#L299-L301). Instead of this, it should be possible to just store the dataset configuration, and construct the dataset instance the first time it's referenced (using a lazy-loading approach, like you suggest).
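To make the idea concrete, here's a minimal sketch of what I mean (purely illustrative; the class and method names are made up, credentials resolution is omitted, and this is not the actual DataCatalog code):

from kedro.io import AbstractDataSet

class LazyCatalogSketch:
    """Toy example: keep the parsed config and build datasets on first use."""

    def __init__(self, dataset_configs: dict):
        self._dataset_configs = dataset_configs  # nothing is instantiated yet
        self._datasets = {}  # cache of materialized dataset instances

    def _get_dataset(self, name: str):
        if name not in self._datasets:
            # Construct the dataset only the first time it is referenced.
            self._datasets[name] = AbstractDataSet.from_config(
                name, self._dataset_configs[name]
            )
        return self._datasets[name]

    def load(self, name: str):
        return self._get_dataset(name).load()

Datasets that the selected pipeline never touches would then never be constructed (and never open a connection).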

However... I do think the change could have a significant effect, potentially breaking things (e.g. plugins) that parse the data catalog. I'm almost certain this would be a breaking change that would have to go in 0.19 or a future minor release. Additionally, users may benefit from the eager error reporting, but perhaps the lazy loading could be enabled as an option (reminiscent of the discussion about find_pipelines() in #2910).

Let me request some others' views on this, especially given that a lot of people have been looking at the data catalog lately.

P.S. Re "I could hack my way around by passing a dummy uri...But this feels a bit odd / not ideal...", my first impression was, this was what we would do back when I was primarily a Kedro user myself. We'd usually define a dummy uri with the template in the base environment (e.g. postgresql://<username>:<password>@localhost/test), and team members would put the actual credentials following the template in their local environment (e.g. postgresql://deepyaman:5hhhhhh!@localhost/test). So, I don't 100% know if I find it that odd myself. That being said, that's also a specific example, and I don't think it more broadly addresses the lazy loading request, hence I'm just including it as an aside. :)

@m-gris
Author

m-gris commented Aug 15, 2023

Thanks for your answer. Good point regarding the potential for breaking changes.
Looking forward to what others will have to say :)

@noklam
Contributor

noklam commented Aug 15, 2023

As far as I understand, the whole problem boils down to the connection being constructed (for SQL or other DB-related datasets) when the DataCatalog is materialised.

This looks like a dataset issue to me, and we should fix the datasets instead of making the DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?

Would this be enough?
self._connection_str = credentials.get("con", None)
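i.e. something along these lines (a sketch only, not the real pandas.SQLTableDataSet code; the existing validation would effectively move from __init__ to the point where the connection is actually used):

import pandas as pd
from kedro.io import AbstractDataSet, DataSetError

class DeferredConSQLTableDataSet(AbstractDataSet):
    """Illustrative dataset that tolerates a missing 'con' until load time."""

    def __init__(self, table_name, credentials=None, load_args=None):
        credentials = credentials or {}
        # Accept a missing "con" at construction time instead of raising here.
        self._connection_str = credentials.get("con", None)
        self._table_name = table_name
        self._load_args = load_args or {}

    def _load(self):
        # Only complain once a connection is actually needed.
        if not self._connection_str:
            raise DataSetError("'con' is required to load this dataset.")
        return pd.read_sql_table(
            self._table_name, con=self._connection_str, **self._load_args
        )

    def _save(self, data):
        raise DataSetError("Saving is not implemented in this sketch.")

    def _describe(self):
        return {"table_name": self._table_name}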

@deepyaman
Member

As far as I understand, the whole problem boils down to the connection being constructed (for SQL or other DB-related datasets) when the DataCatalog is materialised.

This would be addressed by my open PR. I think there are definite benefits to doing this (e.g. not unnecessarily creating connections well before they need to be used, if at all).

This looks like a dataset issue to me, and we should fix the datasets instead of making the DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?

Would this be enough? self._connection_str = credentials.get("con", None)

Nit: There's also a block further up, to validate the con argument, that would need to be moved or turned off.

But, more broadly, I don't know how I feel about this. I feel like providing con is required (in the dataset credentials, but it's also a positional argument for pd.read_sql_table()). Why would it be a special case to delay validation that just con is provided? What if I don't know where my colleague put some local CSV file for a pandas.CSVDataSet; why does my catalog validation fail if I don't have a filepath to provide?

@rishabhmulani-mck

rishabhmulani-mck commented Aug 16, 2023

We are trying to isolate catalogs within Kedro based on pipelines. What this means is: pipeline A should use catalog A, pipeline B should use catalog B. I have used Kedro with multiple catalogs, but it seems like all catalog files (with the catalog prefix) get read in regardless of which pipeline is actually run.

More context/background: The DE pipeline is set up to read from an Oracle database and create data extracts. The only way to go ahead is with data extracts as we have READ-ONLY access on the database and the DE pipeline needs to do some transformations before the data is ready for DS.
Now once these transformations are done, these extracts are ready for use by DS. This means, that they don't really need to use the DE pipeline again, and they do run only their specific pipeline. However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.

Posting at the request of @astrojuanlu for feature prioritization (if actually determined to be a needed feature)

@astrojuanlu
Member

Thanks a lot @m-gris for opening the issue and @rishabhmulani-mck for sharing your use case.

One quick comment: is using environments a decent workaround? https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-specify-additional-configuration-environments One environment would contain the "problematic"/"heavy" datasets, and another would be isolated from them. Then one could do kedro run --env=desired-env and avoid accidentally loading the database connection entirely.
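For example (a sketch only; the environment and pipeline names are illustrative):

conf/base/catalog.yml        # only the datasets every pipeline needs
conf/with_db/catalog.yml     # the Oracle/SQL-backed entries live here instead of base

kedro run --pipeline data_engineering --env=with_db   # runs that actually need the database
kedro run --pipeline data_science                     # the database entries are never parsed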

@deepyaman
Member

However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.

@rishabhmulani-mck I think we all agree that it's a problem to connect to all of the databases upon parsing the catalog, which would be resolved by something like kedro-org/kedro-plugins#281. I can finish this PR.

There is a second question of con being a required field. Would your DS team care if a dummy con was in the base environment (thereby satisfying the constraint of having a valid config, though it won't actually be loaded if kedro-org/kedro-plugins#281 is implemented)?

@astrojuanlu @noklam my opinion is to split this up and finish fleshing out kedro-org/kedro-plugins#281 (I believe it's an improvement regardless and can be added for all the datasets with a DB connection), and decide in the meantime whether to support anything further. Unless it's just me, I think making con essentially optional is more contentious.

@noklam
Contributor

noklam commented Aug 21, 2023

That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.

@astrojuanlu
Member

astrojuanlu commented Aug 22, 2023

I don't think we have spent enough time thinking about how to solve this in a generic way for all datasets beyond database connections. The reason I'm saying this is because if we come up with a generic solution (like "let's not instantiate the dataset classes until they're needed"?) then maybe kedro-org/kedro-plugins#281 is not even needed.

I'd like to hear @merelcht's perspective when she's back in a few days.

@AlekJaworski

Since my slack message was quoted above, I'll add my 2 cents for what it's worth.

I have in the end also solved it using a templated URI string, but was left slightly unsatisfied, as there is just "no need" to construct everything (or there doesn't seem to be), and it impacts both performance and ease of use.

While having the whole configuration valid is perhaps desirable, it is kind of annoying not to be able to run an unrelated pipeline during development. "Why would X break Y???" was my personal reaction. So in this case I would also love to see lazy loading of only the necessary config components, but I am not that knowledgeable about Kedro and what sort of things it'd break.

@deepyaman deepyaman removed their assignment Aug 22, 2023
@deepyaman
Member

That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.

Great! I will unassign myself from this issue (and its larger scope), but finish the other PR when I get a chance. :)

@noklam noklam linked a pull request Aug 29, 2023 that will close this issue
@noklam
Contributor

noklam commented Aug 29, 2023

Linking kedro-org/kedro-plugins#281, which is a partial solution to this issue.

@idanov
Member

idanov commented Sep 14, 2023

Currently the catalog is lazily instantiated on a session run:

catalog = context._get_catalog(  # noqa: protected-access
    save_version=save_version,
    load_versions=load_versions,
)

(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)

It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that filters conf_catalog down to the list of datasets used by the pipeline before the DataCatalog object is created.

conf_catalog = self.config_loader["catalog"]

Something like this:

conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}

Parameter names and the exact point of insertion should be examined more carefully though, since this is just a sketch solution. The good news is that it would be a non-breaking change, if we implement it.
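Putting the pieces together, the shape of the change on KedroContext could look something like this (signature and names are purely indicative; the real _get_catalog also handles credentials resolution, dataset patterns and hooks):

def _get_catalog(self, save_version=None, load_versions=None, needed=None):
    conf_catalog = self.config_loader["catalog"]
    if needed is not None:
        # Drop catalog entries the selected pipeline never touches,
        # so they are never instantiated (and never validated).
        conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}
    return DataCatalog.from_config(
        conf_catalog,
        credentials=self.config_loader["credentials"],
        load_versions=load_versions,
        save_version=save_version,
    )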

@noklam
Contributor

noklam commented Sep 18, 2023

(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)

@idanov I think that's because save_version and load_version are only available in the session but not in the context? Sure, we could pass all this information down to KedroContext, and I don't see a big problem with that. The weird bit may be that save_version is also the session_id, so it makes more sense to have it in KedroSession.

Note that @ankatiyar added save_version and load_version to the DataCatalog for dataset factories (this is relatively new), so the catalog actually has sufficient information to infer the version now and doesn't need to rely on the context/session anymore.
#2635

@takikadiri

takikadiri commented Oct 2, 2023

This is really something that's needed when the codebase grows, or when we have a monorepo with many pipelines that need to / could be orchestrated independently with different scheduling contexts (for example: training, evaluation and inference pipelines).

Currently, as our codebase grows, we create multiple packages inside the same Kedro project (src/package1, src/package2, ...), which means multiple Kedro contexts, one context per "deployable" pipeline, so that we have a "space" to manage configs that belong to the same execution context. However, this increases our cognitive context switching and duplicates boilerplate code.

I wonder if the proposed solution solves the problem entirely, as it's not just about initializing datasets lazily but also about materializing the configuration lazily. The catalog will still be entirely materialized from YAML to Python objects by the config_loader, and Kedro will still ask for globals that are not filled in but not actually needed by the selected pipeline (from the user's perspective).

Maybe pushing the namespace concept a little further could solve this. But for now, the proposed solution below (lazily instantiating datasets) is a big step toward pipeline modularity and monorepos, and has the merit of being non-breaking. I'm looking forward to it.

It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that filters conf_catalog down to the list of datasets used by the pipeline before the DataCatalog object is created.

conf_catalog = self.config_loader["catalog"]

Something like this:

conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}

@astrojuanlu
Member

kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht

@merelcht merelcht reopened this Oct 19, 2023
@deepyaman
Member

kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht

Yeah, we should; I'm not really sure why this got closed in #281, as that was linked to #366 and not this...

@takikadiri

We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.

Currently we're giving fake credentials to the target orchestration platform so the pipeline can run, but this introduces some debt in our credentials management on the orchestration platform.

@astrojuanlu
Member

Adding this to the "Redesign the API for IO (catalog)" for visibility.

@astrojuanlu
Member

Noting that this was also reported as #3804.

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jun 14, 2024