Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike: Make cookiecutter optional / not a core dependency of kedro #3884

Closed
astrojuanlu opened this issue May 22, 2024 · 16 comments
Closed

Spike: Make cookiecutter optional / not a core dependency of kedro #3884

astrojuanlu opened this issue May 22, 2024 · 16 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@astrojuanlu
Copy link
Member

Description

cookiecutter is a mandatory dependency of kedro, because it's what powers kedro new and kedro pipeline create.

However, it's a heavy dependency. From #3659:

cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0

Currently we are forcing Kedro users to install dependencies that are needed for development (kedro new, kedro pipeline create) also in their production environments.

One of those problematic transitive dependencies is rich. As of #3838, kedro can work even if rich is not installed, but the current method to do that is very brittle and difficult for users. As a result, we are effectively blocked from making progress with #2928.

For backwards compatibility reasons, pip install "kedro==0.19.*" will need to keep carrying cookiecutter as a dependency. The task then is to come up with a roadmap in which pip install "kedro>=0.20" does not.

@astrojuanlu astrojuanlu added the Issue: Feature Request New feature or improvement to existing feature label May 22, 2024
@astrojuanlu
Copy link
Member Author

#2928 was prioritised as "High" and this issue blocks progress on that one, so I am prioritising this one as "High" too.

@merelcht merelcht added this to the Improve Developer Experience milestone Jun 4, 2024
@merelcht merelcht changed the title Make cookiecutter optional / not a core dependency of kedro Spike: Make cookiecutter optional / not a core dependency of kedro Jun 4, 2024
@lrcouto
Copy link
Contributor

lrcouto commented Jun 11, 2024

A couple more details about the issue:

Our whole project creation flow currently relies on cookiecutter for creating new projects, and for creating pipelines as well. This bumps into the problems mentioned by @astrojuanlu, since cookiecutter is pretty intense when it comes to dependencies.

As of now, cookiecutter is performing three functions on Kedro:

Project templates

This is what advertises itself for. It's a templating tool. We use it to fetch directory structure and pre-created files from both the Kedro repo itself and the kedro-starters repo. Which leads to...

Git operations

To fetch those templates, cookiecutter uses functions that do operations like cloning a repo or checking out a branch. This can become a bit of an issue from the developer side, because we cede control to an external library to do these operations and it can be tricky to debug. For example, this bit on the _create_project() function:

result_path = cookiecutter(template=template_path, **cookiecutter_args)

From then on, cookiecutter takes over and will do the whole cloning, selecting the right branch or tag of the repo, and creating the project itself. It's convenient, and at this moment our whole project creation from kedro new on is pretty much preparing data structures that will subsequently be fed to this one function. It effectively acts as a chokepoint, and because it's not ours, debugging it can be a little bit of a pain, having to dive down multiple functions on the cookiecutter library to understand what's going on. They have their own clone/checkout function so we're forced to adapt to the way they want to do it.

Prompting

This was the one that caused the issue when we tried to remove Rich as a mandatory dependency. That one cookiecutter function is also handling our prompts, we just pass which ones we require for that specific project creation on that **cookiecutter_args structure. That function is essentially equivalent to a CLI call to cookiecutter with whichever arguments you want to use, so it makes sense.

The TL;DR is that our whole project creation flow is a funnel that ends on cookiecutter.

@merelcht
Copy link
Member

Thanks for the comprehensive write-up @lrcouto ! So am I correct in thinking that if a user already has a Kedro project (e.g. cloned from a repo) they can do everything they want in Kedro without needing cookiecutter?

@lrcouto
Copy link
Contributor

lrcouto commented Jun 12, 2024

Thanks for the comprehensive write-up @lrcouto ! So am I correct in thinking that if a user already has a Kedro project (e.g. cloned from a repo) they can do everything they want in Kedro without needing cookiecutter?

We're also using it to create new pipelines using the kedro pipeline create command. But as long as they have the project and all of the pipelines they're going to use, they would not need cookiecutter anymore.

@noklam
Copy link
Contributor

noklam commented Jun 18, 2024

In other words, if we consider kedro new or kedro pipeline create not a "core" functionalities, we can move it out from the core dependencies.

The tricky bit here is for 0.19, we need to maintain backward compatibility. In 0.20, we can definitely split it to kedro[all] and move cookiecutter out from core.

I have test it today, after pip uninstall cookiecutter, I can perform these common Kedro's action

  • kedro ipython
  • kedro run
  • kedro registry list
  • kedro catalog list

In fact, ipython and jupyter are not Kedro's core dependencies, we could handle cookiecutter in similar way. We try to intercept this error in CLI, since all prompts are CLI. i.e.

  • kedro new
  • kedro pipeline create

image

Code:

def _check_module_importable(module_name: str) -> None:
try:
import_module(module_name)
except ImportError as exc:
raise KedroCliError(
f"Module '{module_name}' not found. Make sure to install required project "
f"dependencies by running the 'pip install -r requirements.txt' command first."
) from exc

@noklam
Copy link
Contributor

noklam commented Jun 18, 2024

For the records, there are proposal for optional opt-out dependencies (from the Pydantic's author):
https://discuss.python.org/t/optional-dependency-groups-omitting-package-requirements/20833

This problem is also not unique to us, see this quote from the FastAPI's author in the same thread, he has issue with rich as well:

I just came to chime in and say that I agree this would be useful to me for FastAPI and Typer at least

For example, pip install typer should include Rich and Shellingham by default,
and if someone explicitly wants to strip that out they would add some additional syntax (e.g. pip install typer[minimal]).

Long story short, there are plenty of discussion but until today it's not supported (PEP doesn't exist yet). I thought it's a good idea to take inspiration from pydantic and FastAPI as they are arguably the most popular libraries in Python ecosystem. It's beneficial if we don't deviate too much from them.

Pydantic:
There were discussion about pydantic-core and pydantic-core-lite (a smaller binary), I cannot find pydantic-core-lite. I suspect it's because of the discussion and there is no satisfying mechanism to support the "opt-out" dependencies, as a result it's not worth the effort to maintain this separately.

FastAPI:

  • fastapi-cli-slim
  • fastapi-cli
  • fastapi
  • fastapi-slim

In fastapi[standard], it install fastapi-cli as a dependency:
https://github.com/tiangolo/fastapi/blob/653315c496304be928b46c34d35097d0ce847646/pyproject.toml#L56-L61

In FastAPI case, it makes sense since it's a web application and the CLI part isn't a runtime requirements when served. For Kedro, the CLI is a core part. If we only split the "cookiecutters" part, it's too small and IMO not worth the effort with little gain.

When one install pip install fastapi, it installs fastapi-slim[standard] (I couldn't find the code for fastapi-slim, it's probably using the same fastapi code base but packaged separately for convenience). It automatically load the pdm_build.py with https://backend.pdm-project.org/migration/#setup-script

It is something quite new as well (2months old)
image

@noklam
Copy link
Contributor

noklam commented Jun 18, 2024

This is my proposed solution:

1. Move cookiecutter out from core and to something like kedro[new] /kedro[init]

When user run kedro new:
If they have cookiecutter installed, it should run as is.
If they don't, it will show a message like:

 f"Module '{module_name}' not found. Make sure to install requirements with `pip install kedro[new]`

Docs: We need to update the docs for "Get started", we need to replace pip install kedro with pip install kedro[new]

Pro: pip install kedro will work as the same for most people, except those need to create new project. (less often)
Con:

  • the command become a bit longer, not many people understand the optional dependencies and may get confused. This should not be too bad as running the CLI will trigger the error message.
  • Slightly breaking changes to those who need to create pipeline/project in CI. (need to update the requirement to kedro[new])

2. Two-package approach, i.e. kedro and kedro-core

The FastAPI way of having fastapi and fastapi-slim

  • fastapi-slim generate from fastapi codebase with an extra package build.
  • fastapi-slim is basically fastapi[standard] but a separate pypi package

It uses a magic constant to have different build:
https://github.com/tiangolo/fastapi/blob/653315c496304be928b46c34d35097d0ce847646/pdm_build.py#L6

It uses pdm, though I am not sure is that necessary to support these kind of workflow.

Pro:

  • Telemetry on fastapi-slim is easier, pip install fastapi does not install fastapi-slim[standard] although they are equivalent

Con:

  • It confuse other packages as many pinned fastapi, it requires the entire ecosystem to support the /
  • TBD

@astrojuanlu
Copy link
Member Author

astrojuanlu commented Jun 18, 2024

Big fan of (2), pip install kedro including ["kedro-core", "cookiecutter", "rich"]. This retains backwards compatibility (we don't even have to wait until 0.20) and gives an exit path for users who want to install the "slim" or "lightweight" version.

@lrcouto
Copy link
Contributor

lrcouto commented Jun 18, 2024

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

@lrcouto
Copy link
Contributor

lrcouto commented Jun 18, 2024

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

On that idea, @ankatiyar suggested looking into Click's prompting functions to possibly replace Cookiecutter's.

@noklam opened an issue on the Cookiecuter repo regarding their own dependency on Rich.

@noklam
Copy link
Contributor

noklam commented Jun 18, 2024

Big fan of (2), pip install kedro including ["kedro-core", "cookiecutter", "rich"].
I will say this is (3). The most notable difference is that pip install kedro will automtically install kedro-core, the FastAPI approach will not, the benefit is that you can see exactly how many people are downloading the slim package on purpose.

(2) that FastAPI takes is a bit different, basically fastapi and fastapi-slim does not rely on each other. They are essentially duplicate but standalone packages as I understand. I am trying to ask why the author takes that approach and the lesson he learnt, one drawback that I found is that it will cause some issues to the plugins. As most likely they will pin kedro instead of kedro-core, this make it harder to actually install the lean dependencies.

I do like that both fastapi and fastapi-slim share the same codebase so we don't need to maintain a separate one.

@noklam
Copy link
Contributor

noklam commented Jun 18, 2024

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

On that idea, @ankatiyar suggested looking into Click's prompting functions to possibly replace Cookiecutter's.

@noklam opened an issue on the Cookiecuter repo regarding their own dependency on Rich.

Cookiecutter use click.prompt before they switch to rich, the implementation can be found in the cookiecutter/cookiecutter#1901 linked.

@noklam
Copy link
Contributor

noklam commented Jun 19, 2024

See this from the fastapi community. Some arguments I see there:

  • libraries shouldn't try to include extra dependencies as default, but use optional dependencies when needed

tiangolo/fastapi#11733

@astrojuanlu
Copy link
Member Author

astrojuanlu commented Jun 19, 2024

I get it: plugins depending on kedro would need to migrate to kedro-core if they want to drop the unwanted dependencies. And for projects, the moment there is 1 plugin depending on the full kedro, they're stuck.

And yet, as @/tiangolo explains here:

So, the idea with having the new setup with fastapi apart from fastapi-slim is to make the simplest developer experience for the majority of cases, in particular for newcomers, as it's much easier to install pip install fastapi and have the default recommended packages.

For a newbie, pip install fastapi[standard] means explaining what the [] means, what is this standard keyword, what is the difference between pip install fastapi and pip install fastapi[standard], and in many cases that command won't even work, because people might be using Zsh (e.g. in Mac by default) where it needs quotes around pip install "fastapi[standard]", so, what should be the simplest use case for the newcomers becomes the most complex one, more difficult to explain and use.

He got quite a bit of opposition, but I completely agree with his stance (and Sebastián's intuition for DX is what brought FastAPI this far anyway).

One insightful comment by the way:

in my humble opinion cli is not the thing that is needed for web server.

(back to #3659 and #143)

@merelcht
Copy link
Member

I actually prefer option 1 of @noklam's proposals. We already have the concept of pip install ...[...] in the Kedro ecosystem with kedro-datasets so I feel it's more consistent with other things we have done.

I can see why people like option 2, but with that one we need to think more about whether we would then break up Kedro in a modular way, where you can "add-on" parts by doing pip install kedro-... or if it's similar to fastapi where the packages are duplicates. I'm also still concerned if this is worth the engineering effort as well as the future maintenance.
What I'd like to know: are Kedro's core dependencies a reason for people not to adopt Kedro?

@lrcouto
Copy link
Contributor

lrcouto commented Jul 8, 2024

We will be closing this issue for now, based on this discussion on Kedro's dependencies. We have decided on a longer term approach to deal with Cookiecutter, and by extension Rich, that will take some time to design and implement. This issue will be addressed once we're ongoing with this process.

@lrcouto lrcouto closed this as completed Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Status: Done
Development

No branches or pull requests

4 participants