Users cannot install specific components of Kedro separately #3659

Closed
astrojuanlu opened this issue Feb 28, 2024 · 16 comments
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@astrojuanlu
Member

astrojuanlu commented Feb 28, 2024

Description

Hello, I'm currently building a python package that derives some of the source code of Kedro, primarily the catalog and config classes. I might be open sourcing the package, how do I properly attribute Kedro? Any advice?
...
My team is reluctant to add kedro pipelines but liked catalogs and the config management. Also I'm working on integrating lakefs and dvc for version control and dbt.

https://linen-slack.kedro.org/t/16626961/hello-i-m-currently-building-a-python-package-that-derives-s#3e2cc457-1184-4b2d-b7b3-7cab59fa1fa0

Over the years, some Kedro power users and also folks evaluating the project have made it clear that they really like specific parts of Kedro, such as the catalog and the configuration loading, and they don't care much about the rest. See also #2741 (2021), #2898 (comment), https://linen-slack.kedro.org/t/16593946/is-there-a-way-of-installing-only-the-data-catalog-part-of-k#b6d532c4-2d7f-4add-b0ee-b0bfcffbdd5e, #2409 (comment) ("kedro is scary") and many more.

What has been done so far

We have already done a lot to make Kedro leaner and simpler. For example, Kedro 0.19 dropped datasets, which now have to be installed separately as kedro-datasets (#2126). This has been unanimously celebrated.

In addition, we dropped some rarely used CLI commands (#1616). Some users raised concerns at the time, but we haven't heard any more since.

We also improved our packaging infrastructure to avoid having explicit dependencies on pip and setuptools (#2350). Again, this has been received with silent approval.

Why is this still a problem?

And yet, this is not enough.

In principle nothing prevents these users from doing pip install kedro and using only the parts they need. This was the argument @idanov and myself defended in #2409 (comment)

However, this is not how most folks vet open source dependencies. In this absolutely fantastic survey from @/simonw, of Django fame, on Mastodon, people generally favour lean packages over bloated ones, and minimal dependencies over a large number of them. Some quotes:

The only other factor beyond this list I care about is size. Is it a large or "bloated" dependency?

Check how many/what dependencies are required.

one really important step I haven't seen many people mention: look at dependencies. Basically determine at what level the thing is solving its use case.

Users that would like to use only our Data Catalog and/or only our Configuration Loader would be forced to install all of the following:

cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0
gitpython
└── gitdb<5,>=4.0.1
    └── smmap<6,>=3.0.1
pre-commit-hooks
├── ruamel.yaml>=0.15
│   └── ruamel.yaml.clib>=0.2.7
└── tomli>=1.1.0
build
├── importlib-metadata>=4.6
│   └── zipp>=0.5
├── packaging>=19.0
├── pyproject_hooks
│   └── tomli>=1.1.0
└── tomli>=1.1.0
rope
└── pytoolconfig[global]>=1.2.2
    ├── packaging>=23.2
    ├── platformdirs>=3.11.0
    └── tomli>=2.0.1
  • And of course, install the dependencies for the "core" part of Kedro. This is more blurry, but we could include a few small dependencies like importlib_resources, toposort (which could be going away after we drop 3.8 support thanks to graphlib), possibly pluggy.
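
As a rough illustration of the graphlib point: the standard library module (available from Python 3.9) can produce the same kind of level-by-level grouping that toposort yields. This is only a sketch; the helper name and the node names are hypothetical and are not taken from Kedro's code.

```python
from graphlib import TopologicalSorter


def toposort_groups(dependencies):
    """Return a list of sets; each set holds nodes whose dependencies are all satisfied.

    `dependencies` maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    groups = []
    while sorter.is_active():
        ready = sorter.get_ready()  # every node that can run right now
        groups.append(set(ready))
        sorter.done(*ready)  # mark them finished to unlock the next level
    return groups


# Hypothetical node names, for illustration only
example = {
    "preprocess": {"load"},
    "features": {"load"},
    "train": {"preprocess", "features"},
}
groups = toposort_groups(example)
```

With this input the groups come out as the load step first, then the two steps that depend on it, then the training step.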

Proposed solution

The solution I propose is making kedro a meta-package, that would depend on smaller packages.

Specifically, we could start with 2: kedro-catalog and kedro-new. Why?

The case for splitting kedro-catalog

From the links above:

Would it make sense to make mini-kedro installable? My use case for projects like that are users doing EDA and just want easy access to the data with no fuss.

#2741

Some months back I experimented with using the kedro datacatalog as a dataloader for streamlit dashboards.

#2898 (comment)

For our use case, only the DataCatalog part is relevant - nothing about the nodes, pipelines or orchestration. So I'm thinking this might help us avoid additional dependencies.

For some reason, some clients also don't like and/or accept Kedro as well 🤔

https://linen-slack.kedro.org/t/16593946/is-there-a-way-of-installing-only-the-data-catalog-part-of-k#b6d532c4-2d7f-4add-b0ee-b0bfcffbdd5e

The case for splitting kedro-new

kedro new is a weird command for at least 2 reasons:

  • It's only needed once per project. It doesn't make sense that we are forcing users to install all the dependencies of kedro new in every deployment target, production setting, cloud platform etc for projects that are already created.

From the kedro-boot docs:

from typing import List
from fastapi import FastAPI, Query
from fastapi.responses import RedirectResponse
from pydantic import BaseModel

from kedro_boot.app.fastapi.session import KedroFastApi

app = FastAPI(title="Spaceflights shuttle price prediction")

...

@app.post("/predict", tags=["Inference"], operation_id="inference")
def predictions(
    features_store: ShuttleFeature, kedro_run: KedroFastApi
) -> ShuttlePrediction:
    return {"shuttles_prices": kedro_run}

Why should an app like this carry all the weight of unneeded commands?

So I have done venv first, then install kedro, then kedro new, then pip install requirements. Which is a bit confusing for python newbies

Future developments

If we are happy with the initial iterations we could take this idea further and make more smaller packages. Something like:

kedro
├── kedro-pipelines  # Pipelines and nodes
├── kedro-cli  # All things CLI
├── kedro-new
│   └── kedro-cli  # Obviously `kedro new` would depend on `kedro-cli`, plus other things
├── kedro-io  # The catalog
├── kedro-config  # The config loader
├── kedro-framework  # Session, context, hooks, Kedro "classic way of working"
│   ├── kedro-config
│   ├── kedro-pipelines
│   └── kedro-io
└── kedro-micropackaging
   └── kedro-cli  # Everything that extends the CLI depends on `kedro-cli`

Advantages

  • We track the adoption of individual components. No need to have tailored telemetry (cc @yetudada)
  • We can iterate faster by releasing specific components more quickly

Considerations

  • Backwards compatibility: This can be done in a fully backwards compatible way, and pip install kedro would give regular users exactly what they get today.

Implementation idea

We could even retain imports by leveraging PEP 420 implicit namespace packages https://packaging.python.org/en/latest/guides/packaging-namespace-packages/
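
To make the namespace-package idea concrete, here is a minimal, self-contained sketch. It simulates two separately installed distributions that both contribute subpackages to one top-level package that has no __init__.py. The name demo_ns and the helper functions are hypothetical stand-ins; this is not how Kedro is laid out today.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_distribution(root: Path, subpackage: str, body: str) -> None:
    """Create <root>/demo_ns/<subpackage>/__init__.py, deliberately without demo_ns/__init__.py."""
    pkg_dir = root / "demo_ns" / subpackage
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(body)


def load_split_namespace():
    """Import two subpackages of the same namespace from two separate directories."""
    with tempfile.TemporaryDirectory() as tmp:
        dist_a = Path(tmp) / "dist_a"  # stands in for one installed distribution
        dist_b = Path(tmp) / "dist_b"  # stands in for another
        make_distribution(dist_a, "io", "NAME = 'catalog'\n")
        make_distribution(dist_b, "config", "NAME = 'config loader'\n")
        sys.path[:0] = [str(dist_a), str(dist_b)]
        try:
            importlib.invalidate_caches()
            io_mod = importlib.import_module("demo_ns.io")
            config_mod = importlib.import_module("demo_ns.config")
            return io_mod.NAME, config_mod.NAME
        finally:
            sys.path.remove(str(dist_a))
            sys.path.remove(str(dist_b))


names = load_split_namespace()
```

Both imports resolve under the same top-level name even though the code lives in two different directories, which is the property that would let existing import paths survive a split.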

  • Regular Kedro users shipping their Kedro project as a Docker container or FastAPI server: pip install kedro-framework (does not carry the CLI at all!)
  • Power Kedro users: pip install kedro-catalog kedro-config. They only use the part they like and forget about the rest.
  • We would need to choose whether to keep a monorepo or multirepo approach. Both solutions have their pros and cons.
  • More coordination needed between releases, backward and forward compatibility considerations etc to avoid exact pinnings between individual components, guarantee seamless upgrades, and so on. On the other hand, the kedro metapackage might as well do more strict pinning (and if folks don't like it, they can install the individual components themselves!)

Alternative solutions

One alternative solution is to reject the metapackage approach, tell users to keep installing the full kedro, and continue gradually reducing the number of dependencies. The disadvantages are that

  1. There's only so much one can do to trim dependencies. Ultimately Kedro engineers want to, well, leverage open source to be more productive.
  2. Ultimately pip install kedro will keep containing parts some users will not want, hence not addressing the core issue.

Footnotes

  1. Although quite honestly, Packaged kedro pipeline does not work well on Databricks #1807 and https://github.com/pallets/click/issues/2249 made me question some Click design choices, and https://github.com/kedro-org/kedro-plugins/pull/552 made me think that definitely I'd love to find a modern alternative.

@sbrugman
Contributor

sbrugman commented Mar 4, 2024

Thanks for the writeup! Addressing this was also on my list.
Not being able to prevent Kedro from installing unnecessary dependencies can be quite painful (e.g. when a feature is not used locally, or for deployment).

Would be interesting to be pragmatic and compare this proposal (elegant) to feature-gating functionality as is being done with jupyter. The biggest con in my opinion is the lack of backwards compatibility in pip install kedro. But is that really such a big problem if the fix is to change to pip install kedro[quickstart]? (or any other extra name) In the technical dimension I agree that the above approach is superior, but what about other dimensions (time and effort required to deliver this feature, cost of maintaining these packages etc.)

@astrojuanlu
Member Author

astrojuanlu commented Mar 5, 2024

Thanks for the validation @sbrugman !

The biggest con in my opinion is the lack of backwards compatibility in pip install kedro.

On the contrary, my idea is that this is completely backwards compatible:

Regular Kedro users: pip install kedro. Absolutely nothing changes. We could even retain imports by leveraging PEP 420 implicit namespace packages

Does that address your concerns?

In the technical dimension I agree that the above approach is superior, but what about other dimensions (time and effort required to deliver this feature, cost of maintaining these packages etc.)

Yep that's something to be discussed and evaluated.

@sbrugman
Copy link
Contributor

sbrugman commented Mar 5, 2024

@astrojuanlu Sorry for the misunderstanding. "the biggest con ..." was referring to the alternative solution of using pyproject.toml extras for feature gating.

@astrojuanlu
Member Author

Oh sorry I didn't read your sentence:

Would be interesting to be pragmatic and compare this proposal (elegant) to feature-gating functionality as is being done with jupyter.

@datajoely
Contributor

The other point here is that as much as we'd like everyone to upgrade to 0.19.x ASAP, that doesn't work in practice and in many cases people like to keep production deployments static.

The situation we find ourselves in today is that lots of people still pull 0.18.x and that now has outdated dependencies which (1) crop up on enterprise scanning tools like Sonarqube (2) We don't retroactively patch.

The great unbundling of Kedro will allow us to be much more dynamic and potentially support individual components better and for longer.

@idanov
Member

idanov commented Apr 8, 2024

Kedro is a rather small framework by all standards, e.g. running cloc . shows that there are only 23k lines of Python code (including tests). Breaking it down further into smaller micro-packages (because if we go with the proposed split into 7 subpackages, each of them will be in the order of 3k lines with tests) will result in too much overhead for too little functionality in each package. I would certainly be in favour of trying to reduce the dependencies further or completely removing features like micropackaging (or factoring them out as a separate plugin), but breaking it down into a constellation of subpackages will only make things way more difficult to coordinate in terms of versions.

Most issues with dependencies as always have been coming from the datasets and now that is largely solved by splitting them out of Kedro as a separate package. And just to illustrate that, before the move, our Snyk score was in the low 80s, check our Snyk score now: https://snyk.io/advisor/python/kedro (on par with Prefect and MLFlow and better than ZenML).

@merelcht
Member

merelcht commented Apr 8, 2024

I wanted to link this conversation #1758 where we considered turning Kedro into a meta-package when separating out kedro-datasets. Long story short: we decided against it because of the engineering overhead of managing the separate packages and dependencies/release flows etc.

I personally still feel that the meta package structure introduces more burden than it offers benefits to our users. I think the points brought up are very valid, but they all seem to be coming from a very specific user group: those that have developed Kedro plugins and/or are very advanced users that use Kedro as a library. From my understanding, this is a vocal but also a minority group of our users and I would like to hear other perspectives before betting on splitting Kedro into sub components.

I also hesitate to commit to an overhaul like this, because it would be a significant effort that doesn't introduce any new functionality to Kedro. It would just be a restructure of what we have and I question whether that's enough to gain new users and remain relevant as a tool.

@astrojuanlu
Member Author

I wanted to link this conversation #1758 where we considered turning Kedro into a meta-package when separating out kedro-datasets.

Thanks a lot for this, will have a look.

From my understanding, this is a vocal but also a minority group of our users and I would like to hear other perspectives before betting on splitting Kedro into sub components.

I'm not so sure about that. There might be some survivorship bias at play here - IOW, people who see Kedro and are discouraged by its approach to dependencies but never complain to us. By its own nature, this would be very difficult to detect. We need a leap of imagination.

Note that this is not an all-or-nothing approach. We can start by spinning off kedro-catalog, the one that gets requested the most.

Splitting this into 7 packages is a proposed solution, but it's not necessarily the only one. I feel the discussion is too centered on that proposed solution rather than the user problem, which I argue has plenty of supporting evidence.

@astrojuanlu
Member Author

@datajoely gave some feedback and I reworked the top comment a bit. Changes:

  • Suggest starting with just 2 packages
  • More clearly signal the user value and supporting evidence
  • Improved considerations

@astrojuanlu
Member Author

astrojuanlu commented Apr 11, 2024

We introduced this topic in yesterday's Tech Design session: I gave some views on the broader context, described the user problem, and collected some feedback.

Summary

Some ideas were proposed, broadly falling under 3 categories:

Some concerns were raised:

  • Unclear if Kedro is ready to be broken up, or if its components are ready to live independently
  • It may be harder for a beginner to understand which components they need
  • Kedro is not heavy
  • Users can ignore what they don't want
  • We have already done enough
  • This would be difficult to maintain

Key concerns

Kedro is not heavy

I presented abundant qualitative evidence in #3659 (comment) that our users perceive Kedro as heavyweight. Some of those comments are old, but have been unaddressed. Some other comments are recent.

📣 We don't get to decide how our users perceive our product 📣. We can only influence their perception.1

I also presented quantitative evidence that makes it crystal clear that Kedro is too heavyweight in terms of dependencies. I purposely did not discuss SLOC (Source Lines Of Code) or bytesize because those are uninteresting metrics.

Re-stating the quantitative evidence here:

❯ pipdeptree -p django
Django==5.0.4
├── asgiref [required: >=3.7.0,<4, installed: 3.8.1]
└── sqlparse [required: >=0.3.1, installed: 0.4.4]
❯ pipdeptree -p fastapi
fastapi==0.110.1
├── pydantic [required: >=1.7.4,<3.0.0,!=2.1.0,!=2.0.1,!=2.0.0,!=1.8.1,!=1.8, installed: 2.6.4]
│   ├── annotated-types [required: >=0.4.0, installed: 0.6.0]
│   ├── pydantic_core [required: ==2.16.3, installed: 2.16.3]
│   │   └── typing_extensions [required: >=4.6.0,!=4.7.0, installed: 4.11.0]
│   └── typing_extensions [required: >=4.6.1, installed: 4.11.0]
├── starlette [required: >=0.37.2,<0.38.0, installed: 0.37.2]
│   └── anyio [required: >=3.4.0,<5, installed: 4.3.0]
│       ├── idna [required: >=2.8, installed: 3.6]
│       └── sniffio [required: >=1.1, installed: 1.3.1]
└── typing_extensions [required: >=4.8.0, installed: 4.11.0]
❯ pipdeptree -p kedro
kedro==0.19.3
├── attrs [required: >=21.3, installed: 23.2.0]
├── build [required: >=0.7.0, installed: 1.2.1]
│   ├── packaging [required: >=19.1, installed: 24.0]
│   └── pyproject_hooks [required: Any, installed: 1.0.0]
├── cachetools [required: >=4.1, installed: 5.3.3]
├── click [required: >=4.0, installed: 8.1.7]
├── cookiecutter [required: >=2.1.1,<3.0, installed: 2.6.0]
│   ├── arrow [required: Any, installed: 1.3.0]
│   │   ├── python-dateutil [required: >=2.7.0, installed: 2.9.0.post0]
│   │   │   └── six [required: >=1.5, installed: 1.16.0]
│   │   └── types-python-dateutil [required: >=2.8.10, installed: 2.9.0.20240316]
│   ├── binaryornot [required: >=0.4.4, installed: 0.4.4]
│   │   └── chardet [required: >=3.0.2, installed: 5.2.0]
│   ├── click [required: >=7.0,<9.0.0, installed: 8.1.7]
│   ├── Jinja2 [required: >=2.7,<4.0.0, installed: 3.1.3]
│   │   └── MarkupSafe [required: >=2.0, installed: 2.1.5]
│   ├── python-slugify [required: >=4.0.0, installed: 8.0.4]
│   │   └── text-unidecode [required: >=1.3, installed: 1.3]
│   ├── PyYAML [required: >=5.3.1, installed: 6.0.1]
│   ├── requests [required: >=2.23.0, installed: 2.31.0]
│   │   ├── certifi [required: >=2017.4.17, installed: 2024.2.2]
│   │   ├── charset-normalizer [required: >=2,<4, installed: 3.3.2]
│   │   ├── idna [required: >=2.5,<4, installed: 3.6]
│   │   └── urllib3 [required: >=1.21.1,<3, installed: 2.2.1]
│   └── rich [required: Any, installed: 13.7.1]
│       ├── markdown-it-py [required: >=2.2.0, installed: 3.0.0]
│       │   └── mdurl [required: ~=0.1, installed: 0.1.2]
│       └── Pygments [required: >=2.13.0,<3.0.0, installed: 2.17.2]
├── dynaconf [required: >=3.1.2,<4.0, installed: 3.2.5]
├── fsspec [required: >=2021.4, installed: 2024.3.1]
├── GitPython [required: >=3.0, installed: 3.1.43]
│   └── gitdb [required: >=4.0.1,<5, installed: 4.0.11]
│       └── smmap [required: >=3.0.1,<6, installed: 5.0.1]
├── importlib_metadata [required: >=3.6,<8.0, installed: 7.1.0]
│   └── zipp [required: >=0.5, installed: 3.18.1]
├── importlib_resources [required: >=1.3,<7.0, installed: 6.4.0]
├── jmespath [required: >=0.9.5, installed: 1.0.1]
├── more-itertools [required: >=8.14.0, installed: 10.2.0]
├── omegaconf [required: >=2.1.1, installed: 2.3.0]
│   ├── antlr4-python3-runtime [required: ==4.9.*, installed: 4.9.3]
│   └── PyYAML [required: >=5.1.0, installed: 6.0.1]
├── parse [required: >=1.19.0, installed: 1.20.1]
├── pluggy [required: >=1.0,<1.4.0, installed: 1.3.0]
├── pre-commit-hooks [required: Any, installed: 4.6.0]
│   └── ruamel.yaml [required: >=0.15, installed: 0.18.6]
│       └── ruamel.yaml.clib [required: >=0.2.7, installed: 0.2.8]
├── PyYAML [required: >=4.2,<7.0, installed: 6.0.1]
├── rich [required: >=12.0,<14.0, installed: 13.7.1]
│   ├── markdown-it-py [required: >=2.2.0, installed: 3.0.0]
│   │   └── mdurl [required: ~=0.1, installed: 0.1.2]
│   └── Pygments [required: >=2.13.0,<3.0.0, installed: 2.17.2]
├── rope [required: >=0.21,<2.0, installed: 1.13.0]
│   └── pytoolconfig [required: >=1.2.2, installed: 1.3.1]
│       └── packaging [required: >=23.2, installed: 24.0]
├── toml [required: >=0.10.0, installed: 0.10.2]
└── toposort [required: >=1.5, installed: 1.10]
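
For anyone who wants to reproduce the direct-dependency counts without pipdeptree, the following is a minimal sketch based on the standard library's importlib.metadata. The helper names are hypothetical, and the marker handling is deliberately crude: it only skips requirements that are gated behind an extra.

```python
import re
from importlib import metadata

# Matches the distribution name at the start of a requirement string
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def mandatory_requirement_names(requirements):
    """Return sorted, normalised names of requirements that are not gated behind an extra."""
    names = set()
    for requirement in requirements or []:
        spec, _, marker = requirement.partition(";")
        if "extra" in marker:  # optional dependency, only installed via package[extra]
            continue
        match = _NAME.match(spec)
        if match:
            names.add(re.sub(r"[-_.]+", "-", match.group(1)).lower())
    return sorted(names)


def direct_dependencies(distribution):
    """List the mandatory direct dependencies of an installed distribution."""
    return mandatory_requirement_names(metadata.requires(distribution))


# Requirement strings in the style found in package metadata (illustrative values)
sample = [
    "click>=4.0",
    "rich<14.0,>=12.0",
    "PyYAML<7.0,>=4.2",
    "pytest>=7.2; extra == 'test'",
]
sample_names = mandatory_requirement_names(sample)
```

Calling direct_dependencies("kedro") in an environment where Kedro is installed would give the list to count; it raises PackageNotFoundError otherwise.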

Users can ignore what they don't want

I have presented abundant qualitative evidence in #3659 (comment) that this is not how people vet their open source dependencies.

There is also quantitative evidence that our monolithic approach to dependency management causes extensive user pains. For example: #1807, #1752, #1733, #681, and large parts of #3094, just to name a few.

Kedro is not a monopoly, there are plenty of adjacent open source frameworks whose functionality intersects or directly competes with Kedro. If we force users to make decisions against their will, they might as well use something else.

We have already done enough

We have already done very important things. Spinning off kedro-datasets was crucial, a massive achievement, one that our users celebrate, and something we should be proud of.

But, ironically, kedro-datasets is its own monolith now, and we are struggling to cope with the maintenance cost, see discussions and linked issues in kedro-org/kedro-plugins#535.


This is difficult to maintain

It's easy to grasp the idea that more repositories or more subprojects in monorepos somewhat increases the complexity.

There's two sources of complexity:

  • Design. How do we spin off components while avoiding circular dependencies?
  • Maintenance. How do we ensure juggling more repos or more subprojects doesn't consume an excessive amount of engineering resources?

My proposal to move forward with this is:

  • If folks disagree with the premises of this proposal, we can continue discussing asynchronously on this thread.
  • Whenever we are more or less aligned that this is worth exploring, let's have a technical discussion on the different solutions, with a focus on the additive approach (breaking down Kedro into packages) and the subtractive approach (making key Kedro dependencies optional).
  • If we reach the point of considering the additive approach, let's try to break down what "difficult to maintain" means exactly in both cases (monorepo with subprojects vs multirepo) and what can be done to alleviate that extra complexity.

Footnotes

  1. This applies to "Kedro is not an orchestrator" too.

@datajoely
Contributor

For me there are a few really critical parts of this; I've ranked them by 'unbundling' effort/impact:

  1. Cookiecutter feels like the best candidate for unbundling. It is used in two ways:

    • kedro new / kedro starter workflows: this happens once per project and, by that measure, <1 time per user!
    • kedro pipeline create: this can happen many times per project and can even point to a custom template

    The question I have for engineering is as follows: could we use something like pipx to include a dependency like this when it has no impact on Kedro's wider API?

ChatGPT generated the following and I (read possible dum dum) think it looks neat:

import subprocess

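def create_project_from_template(template_url, project_name):
    subprocess.run(["pipx", "install", "cookiecutter"])  # Install Cookiecutter if not already installed
    # --output-dir/-o is the parent directory, so the project name is passed as extra context instead
    # (this assumes the template defines a project_name variable)
    subprocess.run(
        ["cookiecutter", template_url, "--output-dir", ".", "--no-input", "-f", f"project_name={project_name}"],
        check=True,  # Fail loudly if the template cannot be rendered
    )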

# Example usage
create_project_from_template("https://github.com/audreyr/cookiecutter-pypackage.git", "my_project")
  2. The data catalog would be useful on its own; this is user validated and I'm excited to see where @iamelijahko's research lands in terms of how it could improve as a standalone component.

  3. It would be great to remove click in favour of argparse, but it's used everywhere and we have "click>=4.0" set. My hunch is that the long term user value versus effort calculation doesn't make this a compelling option.

@idanov
Member

idanov commented Apr 14, 2024

@astrojuanlu Thanks for the writeup and the tech design session from last week. Let me comment on some of the arguments you brought forward in favour of splitting Kedro into smaller packages.


Kedro is (not) heavy

I presented abundant qualitative evidence in #3659 (comment) that our users perceive Kedro as heavyweight. Some of those comments are old, but have been unaddressed. Some other comments are recent.

I followed all links and I am still unconvinced by the evidence, a lot of it has subjective opinions on what we should do. What I'd rather see is what actual problem we are causing by not doing the suggested actions.

  • There was almost no follow up by the reporters (at least in the discussion threads) when someone asked what the real problem is and why they can't just use the pieces they need and not care about the rest.
  • Almost all of the examples involved just using the DataCatalog on its own and had no interest in other components.
  • Moreover the definition of heavyweight is lacking and this causes a lot of confusion, on my end at least.

I would really love to have a compilation of "whys" in order to have a conversation grounded in facts and not opinions.

📣 We don't get to decide how our users perceive our product 📣. We can only influence their perception.

Yes, and that's mostly a communication challenge, which we still haven't addressed. It will not be solved by splitting or not splitting Kedro, but with proper comms. Including the orchestrator confusion. If we self-assign the label "heavyweight" to Kedro, imo fully undeserved, then we only add to the problem rather than solving it.

I also presented quantitative evidence that makes it crystal clear that Kedro is too heavyweight in terms of dependencies. I purposely did not discuss SLOC (Source Lines Of Code) or bytesize because those are uninteresting metrics.

It is not crystal clear, to me at least. Why are those uninteresting? This is quite an important metric for determining if something is heavyweight. When saying that "kedro is a huge monolith" and "heavyweight", we need to define many words here, like huge, monolith and heavyweight. Because, I, for one, certainly measure the size of software projects in terms of how many lines of code they are, having worked on projects from a couple of hundred lines to a couple of hundred thousand lines and knowing the difference in complexity of either extremes. By that measure, Kedro is tiny, not huge. Here's a table to illustrate that:

Package    Python LOC   Python LOC w/o tests   Direct Dependencies   Size on Disk (+ deps)
kedro      24K          7K                     23                    36MB
django     351K         107K                   3                     37MB
fastapi    87K          10K                    3                     11MB
mlflow     190K         76K                    27                    522MB
prefect    212K         92K                    54                    170MB
zenml      144K         100K                   22                    285MB
metaflow   54K          48K                    2                     34MB
dbt-core   83K          32K                    23                    90MB

As you can see, Kedro is definitely not an outlier from comparable frameworks, and I'd argue it's the most lightweight one (bar FastAPI, which is quite impressively small btw).

The only valid discussion arising from this table (and your research) is whether it has too many direct dependencies, and the answer is probably maybe and certainly not a definitive yes. However, this issue can very easily be solved without breaking Kedro up into tiny packages, by carefully examining what we really need as a mandatory dependency and what can be optional. Here's just my list of dependencies that can either be removed or made optional easily:

  • build (can be made optional or removed)
  • cookiecutter (can be made optional)
  • dynaconf (can be replaced by existing attrs dependency)
  • gitpython (can be removed)
  • ipython (can be optional)
  • jmespath (already removed)
  • pre-commit-hooks (why is this even here?)
  • pyarrow (why is this even here?)
  • rope (can be made optional or removed as it serves a rarely used feature)
  • toml (is it really necessary? we can live with it too though, as it is a harmlessly simple dependency)
  • toposort (already removed)

That's already 11 removed, which will make us second only to metaflow from the ML ecosystem packages in this list. Nothing suggesting Kedro is huge or monolith, but rather the opposite in my opinion - in fact, Kedro seems to be more similar to FastAPI than Django by almost all measures of huge-ness.
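
One common way to make a dependency optional is to import it lazily at the point of use and point the user at the relevant extra if it is missing. The sketch below shows that pattern under assumed names: the helper and the extra name "new" are hypothetical and do not exist in Kedro.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str = "kedro") -> ModuleType:
    """Import an optional top-level module, or explain which extra would provide it."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "{package}[{extra}]"'
        )
    return importlib.import_module(module_name)


def new_project_command():
    # Hypothetical: cookiecutter would only be imported when this command actually runs
    cookiecutter = require_optional("cookiecutter", extra="new")
    return cookiecutter
```

With this pattern, importing the rest of the package never touches the optional dependency, so a deployment that only runs pipelines would not need it installed.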

Users can(not) ignore what they don't want

I have presented abundant qualitative evidence that this is not how people vet their open source dependencies.

As I've pointed out, the evidence you've shown only beats around the bush of the actual problem, not revealing the problem itself. We need a double-click on all of the evidence you have provided and this is what we are lacking, although we can all suppose that the problem people face is indeed the list of direct dependencies, but that'd be an educated guess, not a fact.

There is also quantitative evidence that our monolithic approach to dependency management causes extensive user pains. For example: #1807, #1752, #1733, #681, and large parts of #3094, just to name a few.

I don't see the connection between those issues to dependency management. There's a lot about CLI (rich concretely) and logging, some research on deployment, but nothing suggesting that the dependency management is the culprit for all of those. I would try to tackle one problem at a time and not conflate all issues at once with the silver bullet of splitting.

If we choose to force our users to ignore the parts that they don't want, we are going against well established software engineering principles and also decades of collectively learned behaviors.

I am unsure what you mean by that. How can you force someone to ignore something? A package or module in Python is unused if not imported. I am pulling a number out of thin air here, but I won't be surprised that most packages have no more than 20% of their API being utilised by any given application. Does that mean that each package forces the users to ignore the other 80% of their API footprint? People have agency over importing modules, we can't automagically force them to import the Pipeline class if they don't want to use it.

Kedro is not a monopoly, there are plenty of adjacent open source frameworks whose functionality intersects or directly competes with Kedro, and as such if we force users to make decisions against their will, they might as well use something else.

Sure it isn't, but this comment is not very constructive. By looking at the table above - comparable tools don't seem to be too bothered by being huge and monolith either...

📣 We don't get to force our users to do what they don't want to do 📣.

True, but we are not really forcing anything on them. Our users can make a good judgement of the trade-offs of different packages, including Kedro. We can only try to convince them that Kedro is suitable for their use case, but that happens through a good community with lots of examples and education materials. And, obviously, try to improve their experience where we can, but first we need to fully understand the failure modes.

We have (not) already done enough

But, ironically, kedro-datasets is its own monolith now! And discussions around that are becoming more dense and difficult as well, see kedro-org/kedro-plugins#560, kedro-org/kedro-plugins#535.

As far as I am aware, you can install whatever you want fairly independently: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml
Sure, you can scatter all of those datasets into one package per class, but what's the point in that and is the multi-package setup worth it?
What is the difference between pip install kedro-datasets[pandas] vs pip install kedro-pandas? How does that improve discoverability?
As for the issue where someone installed all kedro datasets at once, I think this is a development workflow issue rather than anything else. Maybe we should have a better contribution guide on how not to install everything when you are modifying only one group of kedro datasets.
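To make the extras question above concrete, here is a minimal sketch of how optional dependency groups appear in package metadata. For an installed distribution, the standard library's `importlib.metadata.requires(name)` returns requirement strings in which extras carry an `extra == "..."` marker. The helper `group_by_extra` and the sample requirement strings are hypothetical and are not copied from the real kedro-datasets metadata.

```python
import re
from collections import defaultdict

# Matches the marker that tags a requirement as belonging to an extra,
# e.g. `pandas>=1.3; extra == "pandas"` (single or double quotes).
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirements):
    """Group requirement strings by extra; core requirements go under None."""
    groups = defaultdict(list)
    for req in requirements:
        spec, _, marker = req.partition(";")
        match = _EXTRA_MARKER.search(marker)
        groups[match.group(1) if match else None].append(spec.strip())
    return dict(groups)


# Hypothetical requirement strings, for illustration only.
sample = [
    "fsspec>=2021.4",
    'pandas>=1.3; extra == "pandas"',
    'pyspark>=2.2; extra == "spark"',
]
grouped = group_by_extra(sample)
print(grouped[None])      # installed by a plain `pip install <package>`
print(grouped["pandas"])  # added only by `pip install <package>[pandas]`
```

On a machine with the package installed, passing the result of `importlib.metadata.requires("kedro-datasets")` to the helper would show which requirements are core and which sit behind an extra.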

Before, kedro was a huge monolith. Now we have two less huge monoliths: kedro and kedro-datasets.

Again, a "less huge monolith" of 7K lines of Python code doesn't sit right with me. I suspect we have different definitions of huge and monolith. A monolith in software engineering is not defined only by its dependencies, so the usage of those words is, from where I stand, incorrect and causes confusion and unnecessary disagreement, apart from leading us towards potentially unsuitable solutions.

Finally, I don't recall anyone claiming we've done enough. Absolutely, we can do more! We should always look out for small improvements that add up to a bigger change in the end. One such small improvement is removing unneeded dependencies.

This is difficult to maintain

It's easy to grasp the idea that more repositories or more subprojects in monorepos increase complexity.

There are two sources of complexity:

  • Design. How do we spin off components while avoiding circular dependencies?
  • Maintenance. How do we ensure juggling more repos or more subprojects doesn't consume an excessive amount of engineering resources?

Even with the best design and great automation, entropy increases over the years: team members rotate and forget, automated things become less automated, and so on. Coordinating the releases of 2 things is easier than coordinating the releases of 10 things. Currently the team experiences only a taste of that by releasing kedro, kedro-starters, kedro-datasets, kedro-telemetry, kedro-airflow and kedro-viz, and yet we quite often have issues with components not working well together because we forgot to bump a version, or because something else broke one of the components (the most recent one is release 0.19.1, but there was one with kedro-viz depending on toposort, which is to be dropped, occasionally some kedro starter not working, etc.).

And that's not surprising: assuming releases fail independently, the probability of something going wrong is $1 - (1 - P(\text{one release failing}))^{\text{number of packages}}$. It takes only 15 packages with a 5% error rate to get to the point that we need to release hotfixes on every other release ($1 - 0.95^{15} \approx 0.54$). A 5% error rate is 1 in 20 releases; you be the judge of what our release error rate currently is and whether 5% is representative. Bringing this error rate down is not trivial: it's a lot of work and many integration tests for n inter-dependent components ($tests = \frac{n(n - 1)}{2}$), which is a high burden no matter what. We can decrease this number a lot through the topology of the dependencies, but it will still be a big number of tests.
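A quick sketch of that arithmetic, assuming each package release fails independently with the same probability p, so that the chance of at least one failure across n packages is 1 - (1 - p)^n. The function names are illustrative, and the 5% figure is the guess from the comment above, not a measured rate.

```python
def p_any_failure(p_single: float, n_packages: int) -> float:
    """Probability that at least one of n independent releases fails."""
    return 1 - (1 - p_single) ** n_packages


def pairwise_tests(n_components: int) -> int:
    """Number of component pairs, i.e. n * (n - 1) / 2 integration tests."""
    return n_components * (n_components - 1) // 2


# Compare a few package counts at the assumed 5% per-release error rate.
for n in (2, 6, 10, 15):
    print(f"{n:>2} packages: P(any failure) = {p_any_failure(0.05, n):.2f}, "
          f"pairwise tests = {pairwise_tests(n)}")
```

The pairwise count is the worst case where every component is tested against every other one; a dependency topology with fewer edges needs fewer tests.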

Let's agree to include this in the equation when deciding if splitting is the right solution for the problem, when we find an agreement on what the actual problem is.


In conclusion, let's start with the problem and then define a solution, not the other way around. So in that spirit, what is preventing users from installing Kedro as is and using only the DataCatalog, for example?

@astrojuanlu
Member Author

TIL: MLFlow publishes a separate package mlflow-skinny with fewer dependencies https://github.com/mlflow/mlflow/blob/master/README_SKINNY.rst, https://github.com/mlflow/mlflow/blob/master/pyproject.skinny.toml#L24-L36

@astrojuanlu
Member Author

From a related discussion at FastAPI fastapi/fastapi#11525 (reply in thread)

in my humble opinion cli is not the thing that is needed for web server.

Top upvoted comment in the thread.

@astrojuanlu
Member Author

Sequence of events in FastAPI:

And yet, Kedro is even bigger than the new big FastAPI:

[Images, one per project: Django, fastapi-slim, FastAPI (full), Kedro]

About

in fact, Kedro seems to be more similar to FastAPI than Django by almost all measures of huge-ness.

@idanov I don't know how to say this in a way that doesn't sound bad or confrontational, but this was wrong on April 14th, and is still wrong today. Everyone is entitled to their opinion, but I'd rather not mix opinions with facts in this way.

I feel that the push to conduct a full-fledged research stream on why developers don't want fat dependencies is just delaying the inevitable.


In any case, #3884 is already exploring options to make cookiecutter optional, next is rich #2928, and probably click will be next before we can meaningfully tackle #143. Let's continue this conversation in the places in which we're taking action.

@astrojuanlu closed this as not planned on Jun 19, 2024
@datajoely
Contributor

With a purely 80/20 hat on:

Big wins:

  • cookiecutter is definitely the biggest ROI and will reduce the surface area for click, jinja2, pyyaml and rich (I still think we could do this with pipx)
  • click is always a big risk since it's so widely used.

Stuff we can easily kill:

  • GitPython is used for retrieving the tags available in starters; this can be replaced with a single terminal command called via subprocess: git ls-remote --tags https://github.com/kedro-org/kedro-starters.git | grep refs/tags/ | cut -f2 | sed 's|refs/tags/||'
  • Do we need pre-commit-hooks in the core package or just development requirements?
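As a rough sketch of the GitPython replacement suggested above, the same tag listing can be done with `subprocess` and plain string handling instead of the grep/cut/sed pipeline. The function names are hypothetical and this is not Kedro's actual implementation. One deliberate difference from the shell one-liner is that peeled entries for annotated tags (refs ending in `^{}`) are skipped so each tag appears once.

```python
import subprocess

STARTERS_REPO = "https://github.com/kedro-org/kedro-starters.git"


def parse_tags(ls_remote_output: str) -> list[str]:
    """Extract tag names from `git ls-remote --tags` output.

    Each line has the form "<sha>\trefs/tags/<name>".
    """
    prefix = "refs/tags/"
    tags = []
    for line in ls_remote_output.splitlines():
        _sha, _, ref = line.partition("\t")
        if ref.startswith(prefix) and not ref.endswith("^{}"):
            tags.append(ref[len(prefix):])
    return tags


def list_remote_tags(repo_url: str = STARTERS_REPO) -> list[str]:
    """Run git against the remote (needs git on PATH and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--tags", repo_url],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tags(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parsing can be unit-tested without network access, and passing the command as a list avoids invoking a shell.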

Already on the chopping block:
