Overview: Kedro's dependencies and what to do about Cookiecutter #3967

Closed
lrcouto opened this issue Jun 25, 2024 · 13 comments

@lrcouto
Contributor

lrcouto commented Jun 25, 2024

The original issue: Kedro has a lot of dependencies

  • We have an ongoing discussion on Kedro's number of dependencies and whether users perceive Kedro to be a "heavy" framework (Users cannot install specific components of Kedro separately #3659)
  • On top of that, some of our dependencies have a lot of dependencies themselves. The notorious example is Cookiecutter, which has a pretty hefty dependency tree:
cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0
  • Given this, we've decided to start figuring out ways to decouple Kedro from some of those dependencies.
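A tree like the one above can be approximated from the metadata of whatever is installed locally. The following is a minimal sketch, assuming Python 3.9+ and only the standard library; the helper names (`requirement_name`, `installed_requirements`, `dependency_tree`) are made up for illustration, and the output depends on the versions installed in the current environment.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the bare project name from a requirement such as 'six>=1.5'."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
    return match.group(0) if match else requirement.strip()


def installed_requirements(package: str) -> list[str]:
    """Return the non-optional requirements declared by an installed package."""
    try:
        declared = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    # Requirements guarded by an "extra == ..." marker are optional; skip them.
    return [req for req in declared if not re.search(r"extra\s*==", req)]


def dependency_tree(package, lookup=installed_requirements, seen=None, depth=0):
    """Return indented lines describing the dependency tree of `package`."""
    seen = set() if seen is None else seen
    lines = ["  " * depth + package]
    key = requirement_name(package).lower()
    if key in seen:
        # Already expanded elsewhere in the tree (also guards against cycles).
        return lines
    seen.add(key)
    for req in lookup(requirement_name(package)):
        # Drop environment markers (the part after ';') before recursing.
        child = req.split(";")[0].strip()
        lines.extend(dependency_tree(child, lookup, seen, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(dependency_tree("cookiecutter")))
```

The `lookup` parameter exists only so the traversal can be exercised without installing anything; by default it reads the local environment.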

Attempting to remove Rich

The Cookiecutter Issue

  • We are currently using Cookiecutter to handle our project and pipeline creation flows. That has the effect of making this whole process completely tied to Cookiecutter. (Spike: Make cookiecutter optional / not a core dependency of kedro #3884 (comment))
  • The way our project creation flow currently works is that everything from kedro new onwards is building up a data structure to be passed as a parameter to the cookiecutter() function, which handles the creation itself from the desired template.
graph TD
    A[kedro new]
    B[Initialize flag_inputs]
    C[Validate flag_inputs]
    D[Get starters_dict]
    E{starter_alias in starters_dict?}
    F[Set template_path and directory]
    G[Set selected_tools to lowercase]
    H[Create tmpdir]
    I[Get cookiecutter_dir]
    J[Get prompts_required]
    K{config_path provided?}
    L[Make cookiecutter_context]
    M[Cleanup tmpdir]
    N[Get extra_context]
    O[Make cookiecutter_args]
    P{telemetry_consent provided?}
    Q[Validate telemetry_consent]
    R[Call create_project]
    S[Call cookiecutter]

    A --> B --> C --> D --> E
    E -- Yes --> F
    E -- No --> F
    F --> G --> H --> I --> J --> K
    K -- No --> L
    K -- Yes --> M
    L --> M --> N --> O --> P
    P -- Yes --> Q --> R
    P -- No --> R
    R --> S
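The last steps of the flow above (building the arguments, then calling cookiecutter) could look roughly like the sketch below. This is not Kedro's actual code: `make_cookiecutter_args` and `create_project` are hypothetical names chosen to mirror the diagram, and the only third-party call is `cookiecutter.main.cookiecutter` with its `template`, `output_dir`, `no_input`, `extra_context`, `checkout` and `directory` parameters. The import is deferred so that the module loads even when cookiecutter is not installed.

```python
from pathlib import Path


def make_cookiecutter_args(config: dict, output_dir: str = ".",
                           checkout: str = "", directory: str = "") -> dict:
    """Collect keyword arguments for cookiecutter() from the answers gathered so far."""
    args = {
        "output_dir": str(Path(output_dir).expanduser()),
        "no_input": True,  # prompts were already answered earlier in the flow
        "extra_context": dict(config),
    }
    if checkout:
        args["checkout"] = checkout
    if directory:
        args["directory"] = directory
    return args


def create_project(template_path: str, cookiecutter_args: dict) -> str:
    """Render the template; cookiecutter is imported only when a project is created."""
    from cookiecutter.main import cookiecutter  # deferred import

    return cookiecutter(template=template_path, **cookiecutter_args)
```

Everything before `create_project` is plain dictionary handling, which is why the flow is so tightly coupled to the shape of cookiecutter's arguments.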



Current ideas for solutions

Further questions to discuss

  • Do we have concrete evidence that a significant amount of our userbase thinks Kedro is heavyweight/cumbersome? Enough to justify a refactor or splitting it in packages?
  • What defines a "heavyweight" framework? What are the criteria we are using for that?
  • What do we consider the core features of Kedro?
  • What could be used to replace Cookiecutter, in case we decide to do that?
  • How would a possible split in two packages, or having one install option with extra dependencies, affect our user experience?
@datajoely
Contributor

I wonder if we can invoke cookiecutter via pipx; it's literally only needed once.
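As a rough illustration of the pipx idea, here is a minimal sketch that shells out to `pipx run cookiecutter` instead of importing the library. It assumes pipx is on PATH and uses only cookiecutter CLI options that exist today (`--no-input`, `--output-dir`, and `key=value` extra context); the function names are hypothetical.

```python
import shutil
import subprocess


def build_pipx_command(template, output_dir=".", extra_context=None):
    """Build the argument list for running cookiecutter in an ephemeral pipx environment."""
    command = ["pipx", "run", "cookiecutter", "--no-input", "--output-dir", output_dir, template]
    # Extra context is passed on the command line as key=value pairs.
    for key, value in (extra_context or {}).items():
        command.append(f"{key}={value}")
    return command


def run_cookiecutter_via_pipx(template, output_dir=".", extra_context=None):
    """Run cookiecutter through pipx, failing early if pipx is unavailable."""
    if shutil.which("pipx") is None:
        raise RuntimeError("pipx was not found on PATH; install pipx or cookiecutter first.")
    subprocess.run(build_pipx_command(template, output_dir, extra_context), check=True)
```

The tradeoff is that this replaces one installed dependency with a runtime requirement on pipx and network access at project-creation time.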

@astrojuanlu
Member

Notice that both kedro new and kedro pipeline create use cookiecutter, but refactoring the former is much more difficult than refactoring the latter. So, on @lrcouto's ideas for solutions, we could first make kedro pipeline create independent of cookiecutter, and then focus on what to do with kedro new.

@noklam
Contributor

noklam commented Jun 26, 2024

I cannot join today's Tech Design session, so I will watch the recording. I'm leaving some comments on the issue to clarify:

The only way we can currently run Kedro without needing Rich is by downgrading Cookiecutter to a version before they themselves added Rich as one of their dependencies, which is hacky and not ideal.

cookiecutter is not needed as a "runtime" dependency; by runtime I mean kedro run. If users still need kedro new or kedro pipeline create, then cookiecutter is needed.

To me, the problem right now is that users cannot INSTALL kedro without installing cookiecutter. Any of the solutions that I propose can address this, with different tradeoffs (see the summary):

  1. kedro / kedro-core
  2. Move cookiecutter and rich to optional dependencies. The core dependencies would essentially be equivalent to kedro-core as a package; users who need more can install kedro[standard] (arbitrary name, following the FastAPI convention).
  3. There is a third option that I didn't mention before: we could vendor cookiecutter within kedro (this increases the size of the kedro library but reduces dependencies); see this thread for the full discussion. I feel like this is a heavyweight solution and not worth the effort, but I want to bring it up as an alternative.
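If cookiecutter became an optional dependency (option 2), commands such as kedro new would need to fail with a clear message when it is missing. A minimal sketch of such a guard follows; `require_optional` is a hypothetical helper and the extra name `standard` is the arbitrary name used above, not an existing Kedro extra.

```python
import importlib


def require_optional(module_name: str, extra: str, package: str = "kedro"):
    """Import an optional dependency, or raise an error naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Only rewrite the error when the requested package itself is missing,
        # not when one of its own imports fails.
        if exc.name != module_name.split(".")[0]:
            raise
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this command. "
            f'Install it with: pip install "{package}[{extra}]"'
        ) from exc
```

A command handler would then call, for example, `require_optional("cookiecutter.main", "standard")` at the point where the template is rendered, so that kedro run never touches the optional import.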

Replace cookiecutter

I will not consider this option unless we aim at expanding the feature. For example, there have been quite a lot of issues running kedro new in Databricks (network and permission issues). Do we have an alternative that can handle this better?

How would a possible split in two packages, or having one install option with extra dependencies, affect our user experience?

This is mostly explained in Spike: Make cookiecutter optional / not a core dependency of kedro

  1. Move cookiecutter/rich out from core to kedro[something]

Pro:

  • Easy to implement: no extra maintenance, CI setup, or new PyPI package
  • Does not affect existing kedro run users (it will still work)

Con:

  • Breaking change for kedro new and kedro pipeline create: existing CI that involves creating a project will fail
  • Longer install statement; we need to explain to beginners what pip install kedro[standard] is (optional dependencies are not something beginners are familiar with; this is the view of FastAPI)
  2. Two-package approach, i.e. kedro and kedro-core

Pro:

  • Explicit telemetry on kedro-core if kedro does not depend on kedro-core (we can check the PyPI stats easily)
  • Non-breaking for all existing users; can be introduced now instead of waiting for 0.20.0

Con:

  • Debugging is a bit more confusing: should I look at kedro or kedro-core?
  • The ecosystem needs to adopt kedro-core; otherwise, if plugins (e.g. kedro-mlflow) pin kedro as a dependency, projects cannot use kedro-core
  • (Haven't checked) What if both kedro and kedro-core are installed? Which one is actually being run?
  • Extra PyPI package and CI setup
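On the unchecked question of what happens when both packages are installed: one way to at least detect the situation is to ask which installed distributions provide the same top-level import name. This is a minimal sketch, assuming Python 3.10+ for `importlib.metadata.packages_distributions()`; the helper names are hypothetical, and it does not answer which copy of the files wins on disk.

```python
from importlib import metadata


def providers(import_name, mapping=None):
    """Return the sorted distribution names that provide a top-level import name."""
    if mapping is None:
        mapping = metadata.packages_distributions()  # available since Python 3.10
    return sorted(set(mapping.get(import_name, [])))


def describe_providers(import_name, mapping=None):
    """Summarise whether an import name is provided by zero, one, or several distributions."""
    found = providers(import_name, mapping)
    if not found:
        return f"'{import_name}' is not installed"
    if len(found) > 1:
        return f"'{import_name}' is provided by several distributions: {', '.join(found)}"
    return f"'{import_name}' is provided by {found[0]}"


if __name__ == "__main__":
    print(describe_providers("kedro"))
```

The optional `mapping` argument is there only so the logic can be checked without installing two conflicting packages.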

@datajoely
Contributor

One last idea - pip vendors certain tools (like rich) so there is no risk of conflicts. Maybe that's what we need to do here?
https://github.com/pypa/pip/tree/main/src/pip/_vendor

@lrcouto
Contributor Author

lrcouto commented Jun 28, 2024

Here's the summary of what we discussed on the Tech Design session on Jun 26th:

Some interesting remarks:

  • Some users would like to use just the DataCatalog separate from the rest of the framework, and do think Kedro has too many dependencies.
  • We should weigh out cost/benefit for choosing which solution to implement. Time spent on this issue could also be used to implement new features that would solve other user problems.
  • Specifically for the Rich issue, even if we manage to deal with it being a dependency of Cookiecutter, it might be re-introduced again by another dependency, as it is a very popular library.
  • Cookiecutter doesn't have much active support anymore.
  • When trying to define what a "core feature" of Kedro is, one proposal is to define "core" as anything that's required for kedro run.
  • We might have to refactor our project creation flow in the near future regardless, because it's becoming too large and hard to work with

Proposed solutions:

  • Patching Cookiecutter to resolve the Rich dependency issue, as it'd be a small change on the Cookiecutter code but we have no guarantee that they'd be willing to do it. Would be a quick solution, but could be somewhat fragile.
  • Vendoring or forking Cookiecutter and fixing the parts that we need. Would give us complete control of the code, but would be a pretty hefty thing to maintain for what is essentially one feature.
  • Calling Cookiecutter through pipx. Could be viable since we're essentially using it like it was a CLI call, but might need some work with scripting or significant changes in the code.
  • The "kedro[new]" approach, separating the non-core features of Kedro into an optional install. Relatively easy to implement and wouldn't affect current kedro run users, but would be a breaking change for kedro new and kedro pipeline create, and possibly confusing to explain to new users.
  • Separating Kedro into two packages, e.g. kedro and kedro-core. Would not break anything for existing users, but would complicate our ecosystem and be more complicated to debug as well.

@astrojuanlu
Member

astrojuanlu commented Jun 30, 2024

To clarify on the two packages solution, there are 2 approaches:

  1. Disjoint kedro and kedro-slim, aka the FastAPI approach as described by @noklam here

basically fastapi and fastapi-slim does not rely on each other. They are essentially duplicate but standalone packages as I understand.

Indeed, they're generated from the same codebase but they don't depend on each other, see fastapi/fastapi#11503. Compare https://pyoven.org/package/fastapi with https://pyoven.org/package/fastapi-slim .

  2. kedro depending on kedro-core, aka the Dask Conda approach:

https://github.com/conda-forge/dask-feedstock/blob/18eb09f9125074b37541f8c8fffd704e32837686/recipe/meta.yaml#L16-L19

There is https://anaconda.org/conda-forge/dask, depending on dask-core, distributed, pandas etc (hence equivalent to pip install dask[complete]) and https://anaconda.org/conda-forge/dask-core, with minimal dependencies.

Other packages doing the same:


The "kedro[new]" approach would then be similar to the Dask PyPI approach.

@merelcht
Member

merelcht commented Jul 1, 2024

Thank you so much for the great write-up of the problem and the discussion summary @lrcouto 👏 ⭐

I'd like to look at this with a short-term and long-term solution view.

  • IMO the best short-term solution for this problem is patching cookiecutter to resolve the Rich issue (I've been keeping an eye on the cookiecutter repo and the issue @noklam created, but not much seems to be happening there. I don't think we can count on them removing Rich as a dependency)
  • As a long-term solution I'm starting to like the multiple packages solution more. I would consider that a 1.0.0 redesign like @deepyaman proposed in the tech design meeting.

Aside from these two solutions, we might need to find an alternative for cookiecutter if it is indeed being maintained less and less. I don't think that necessarily solves any of our issues though, because it would just replace the cookiecutter dependency with e.g. copier and there's a chance that any replacement introduces Rich again at some point. So although this is related, I wouldn't consider replacing cookiecutter a solution for anything other than making sure we use up to date packages as dependencies.

@lrcouto
Contributor Author

lrcouto commented Jul 1, 2024

I am leaning towards separating Kedro into two packages as a solution as well. Of the two variants, having kedro depend on kedro-core is my favorite. It would be a big endeavor to implement, but I think it would prevent this kind of issue from happening in the future as well. We could keep kedro-core as lean as possible, having only what's strictly necessary for kedro run, and have other amenities and extra features in the larger kedro package.

@noklam
Contributor

noklam commented Jul 2, 2024

IMO the rich issue is not a big problem. Isn't the original focus making the cookiecutter install optional (or how to install part of Kedro in general)? I don't see how patching rich would solve this problem.

As a long-term solution I'm starting to like the multiple packages solution more. I would consider that a 1.0.0 redesign like @deepyaman proposed in the tech design meeting.

@merelcht What's the reason behind this? As I remember you support the "kedro[new]" approach more originally. (I'll catch up on the recording tomorrow).

@lrcouto
Contributor Author

lrcouto commented Jul 2, 2024

IMO the rich issue is not a big problem. Isn't the original focus making the cookiecutter install optional (or how to install part of Kedro in general)? I don't see how patching rich would solve this problem.

They are two separate problems but they are related to each other. The reason why we couldn't fully make Rich optional was because Cookiecutter uses it as a dependency.

@lrcouto
Contributor Author

lrcouto commented Jul 4, 2024

Here are our plans going forward, decided in today's meeting, July 4th.

Regarding the short-term, immediate Rich issue, we have decided that patching Cookiecutter might not be worth the "hackiness" of the solution. We've decided to rely for now on instructing users to downgrade Cookiecutter if they desire to uninstall Rich.
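Before uninstalling Rich, a user could check whether the Cookiecutter version they have installed actually declares it as a requirement. Below is a minimal sketch using only the standard library; `declares_dependency` and `cookiecutter_needs_rich` are hypothetical helper names, and the check only reads declared metadata, not actual imports.

```python
import re
from importlib import metadata


def _normalize(name: str) -> str:
    """Normalize a project name so 'Rich', 'rich' and 'r_i.ch'-style variants compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declares_dependency(requirements, target: str) -> bool:
    """Return True if any requirement string names `target` as its project."""
    wanted = _normalize(target)
    for requirement in requirements or []:
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
        if match and _normalize(match.group(1)) == wanted:
            return True
    return False


def cookiecutter_needs_rich() -> bool:
    """Check the installed cookiecutter's declared requirements for rich."""
    try:
        return declares_dependency(metadata.requires("cookiecutter"), "rich")
    except metadata.PackageNotFoundError:
        return False
```

If `cookiecutter_needs_rich()` returns True, removing Rich would leave the installed Cookiecutter with an unmet requirement, which is the situation the downgrade is meant to avoid.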

For a longer term solution, we are leaning towards the two-package approach. Besides the reasons already mentioned on previous comments, it would also offer the most seamless experience to users, as the effect of a Kedro installation would remain mostly the same.

I will be closing this issue for now, and we will start future steps for designing and implementing this more permanent solution.

@lrcouto
Contributor Author

lrcouto commented Jul 4, 2024

[Image: "To be continued" comic sticker]

@lrcouto closed this as not planned on Jul 4, 2024
@merelcht
Member

merelcht commented Jul 5, 2024

As a long-term solution I'm starting to like the multiple packages solution more. I would consider that a 1.0.0 redesign like @deepyaman proposed in the tech design meeting.

@merelcht What's the reason behind this? As I remember you support the "kedro[new]" approach more originally. (I'll catch up on the recording tomorrow).

Sorry for responding on a closed ticket, but I just wanted to clarify my position on the long-term solution. When I said "I'm starting to like the multiple packages solution" above, I actually meant the idea of having an approach where it's possible to not install everything that is part of Kedro, so either the split dependencies or the multiple packages. My main preference still goes to the split dependencies (pip install kedro[new]), but I understand the other approach better now than I did before.
