Investigate performance of config loading for big projects #3893

Open
astrojuanlu opened this issue May 25, 2024 · 7 comments

@astrojuanlu
Member

Description

Earlier this week a user reached out to me in private saying that it was taking 3 minutes for Kedro to load their configuration (KedroContext._get_catalog).

Today another user mentioned that "Looking at the logs, it gets stuck at the kedro.config.module for more than 50% of the pipeline run duration, but we do have a lot of inputs and outputs"

I still don't have specific reproducers, but I'm noticing enough qualitative evidence to open an issue about it.

@datajoely
Contributor

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin
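
As a rough illustration of the idea (not an existing Kedro command), the sketch below wraps an arbitrary callable with the standard library's `cProfile`, runs it as normal, and also collects profiling data. The names `profile_call` and `load_fake_config` are hypothetical, and `load_fake_config` only stands in for whatever slow config-loading step is being investigated.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, stats_path=None, top=10, **kwargs):
    """Run func as normal and also collect cProfile data about the call."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Optionally persist the raw stats so an external viewer can render them.
    if stats_path is not None:
        profiler.dump_stats(stats_path)

    # Build a plain-text summary sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def load_fake_config(n_entries):
    # Hypothetical stand-in for a slow config-loading step.
    return {f"dataset_{i}": {"type": "MemoryDataset"} for i in range(n_entries)}


result, report = profile_call(load_fake_config, 1000)
print(report)
```

Passing `stats_path="run.prof"` would write a stats file that tools able to read cProfile output (snakeviz is one such tool) can turn into an interactive view, which is roughly the `.html / .bin` output suggested above.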

@yury-fedotov
Contributor

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin

@datajoely flamegraph for the entire pipeline run (how much time each node takes) or just the config resolution / pipeline initialization?

@datajoely
Contributor

In my mind, it would run the whole command as normal, but also generate the profiling data.

Perhaps if we were to take this seriously, a full-on memray integration would be incredible.

@astrojuanlu
Member Author

Continuing the discussion on creating custom commands here #3908

@astrojuanlu
Member Author

astrojuanlu commented Jul 2, 2024

Many users have been complaining about the slowness of Kedro with big projects, and that can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it presents as a significant slowdown.

Originally posted by @idanov in #3732 (comment)
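
To make the expansion concrete, here is a simplified illustration (not Kedro's actual implementation) of how a nested parameters dictionary can turn into one catalog-style entry per key, following the `params:<dotted.path>` naming convention. The function name `expand_params` and the sample values are hypothetical.

```python
def expand_params(params):
    """Flatten nested parameters into one catalog-style entry per key."""
    # The whole dictionary is exposed once under a single name...
    feed = {"parameters": params}

    def add(name, value):
        # ...and every key, at every nesting level, gets its own entry.
        feed[f"params:{name}"] = value
        if isinstance(value, dict):
            for key, child in value.items():
                add(f"{name}.{key}", child)

    for name, value in params.items():
        add(name, value)
    return feed


params = {
    "model": {"lr": 0.01, "layers": {"hidden": 64, "depth": 3}},
    "seed": 42,
}
entries = expand_params(params)
for name in sorted(entries):
    print(name)
```

Even this tiny example with four leaf values produces seven entries, so a parameters file of a couple of MB could plausibly expand into a very large number of them.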

The solution works, but couples the DataCatalog with OmegaConf, and is still under review.

From the discussion in the PR:

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

There were a few thumbs up to the idea, and it was brought up again in #3973 (@datajoely please do confirm that this is what you had in mind 😄)

@merelcht pointed out that there's a pending research item on how users use parameters and for what #2240

@ElenaKhaustova agreed that this is relevant in the context of the ongoing DataCatalog API redesign #3934.

Ideally, if there's a way we can tackle this issue without blocking it on #2240, the time to look at it would be now. But I have very little visibility on what the implications are, or whether we would actually solve the performance problem at all. So, leaving the decision to the team.

@merelcht
Member

merelcht commented Jul 2, 2024

The solution works, but couples the DataCatalog with OmegaConf

Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.

@astrojuanlu
Member Author

Sorry to keep moving the conversation, but I'd rather not discuss the specifics of a particular solution outside the corresponding PR. I addressed your question in context at #3732 (comment)

@merelcht merelcht added this to the Improve Developer Experience milestone Jul 9, 2024