Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluating CLI command usage #1293

Closed
antonymilne opened this issue Feb 25, 2022 · 18 comments
Closed

Evaluating CLI command usage #1293

antonymilne opened this issue Feb 25, 2022 · 18 comments
Assignees
Labels
Type: User Research Synthesis ✍️ Issues to document results from user research

Comments

@antonymilne
Copy link
Contributor

antonymilne commented Feb 25, 2022

As per discussions #1077 and #844 we want to separate dev requirements into their own file outside requirements.txt. This means any requirements that are not required for kedro run or running a packaged kedro project (which only exposes kedro run). Remember in v0.18 there is no kedro install command.

  • Change project template to move requirements from requirements.txt into the new dev requirements file. Also move docs requirements from setup.py extras_require to there
  • Modify kedro CLI commands that currently install requirements (like kedro build-docs) to do the right thing
  • Modify kedro CLI commands that need dev requirements but don't currently install them (like kedro lint) do the right thing
  • Change starters templates in same way
  • Update documentation

Things to check and think about:

  • Would all kedro lint, kedro build-docs etc. commands need to run a pip install to install dev requirements? Is this annoying and slow? Is Waylon's idea of pre-commit a better alternative?
  • Assuming we want a dev requirements file, what is the right name for it? Use whatever is convention these days
  • Do commands like kedro pipeline, kedro catalog have any dependencies outside kedro, and if so where should those dependencies go (requirements or dev requirements)?
  • What should kedro build-reqs do?

Also worth checking whether any kedro requirements should be moved to project requirements. e.g. pip-tools is there but after deprecation of kedro install can presumably be moved to a project requirement? Do we need jupyter_client in kedro requirements? etc.

@yetudada
Copy link
Contributor

yetudada commented Feb 28, 2022

I have been thinking about this ticket because I think we should base it on how useful these dependencies are for users. Here's how our users have used CLI-commands which have dependencies in the project's requirements.txt, since the 1st of September 2021:

CLI Command Unique Runs Dependencies
kedro lint 127 black==21.5b1, flake8>=3.7.9, <4.0 and isort~=5.0
kedro test 212 pytest-cov~=3.0, pytest-mock>=1.7.1, <2.0 and pytest~=6.2
kedro activate-nbstripout 10 nbstripout~=0.4
kedro ipython 374 ipython==7.0
kedro jupyter lab 124 ipython==7.0, jupyter~=1.0, jupyter_client>=5.1.0, <7.0 and jupyterlab~=3.0
kedro jupyter notebook 266 ipython==7.0, jupyter~=1.0 and jupyter_client>=5.1.0, <7.0

kedro run has been run 7,747 times and kedro viz has been run 933 times

If you look at this table, we can ask some questions like:

  • Can we consider a different workflow if you require kedro lint, kedro test, kedro ipython, kedro jupyter lab and kedro jupyter notebook?
  • Can we remove kedro activate-nbstripout and its dependency?
  • Why are there such few users of kedro ipython, kedro jupyter lab and kedro jupyter notebook series? And should we remove this workflow?

@yetudada
Copy link
Contributor

Additionally @AntonyMilneQB messaged with this comment:
Just for the record in case it gets discussed next week when I’m not here, because this is something I’ve been thinking about for ages also... When I asked @idanov whether these kedro lint, kedro activate-nbstripout, kedro test, etc. commands belonged in Kedro at all, his opinion was “no, not really, but people like them for convenience”. So if telemetry data shows that no one is using them then great 😄

With that said, part of Kedro’s current offering is that it simplifies setting up a data science project because it bundles all of the things you need (isort, black, etc.). Maybe this isn’t so important any more, but if we do remove them then we might need to change our “Kedro’s value” messaging slightly. One way of making these tools still available but not part of the core Kedro project would be to move them into custom starter a bit like this… but that does add a maintenance burden.

If it turns out that we do want to keep some of these commands then I think the really belong in a CLI group together, say kedro develop or something. My reasoning is thatkedro lint etc. are each just a single command rather than a group they horribly clutter up the CLI and do not belong at the top-level of the hierarchy along with core commands like run or components like pipeline, catalog

So personally I would be very happy if, as part of the CLI cleanup, we could either remove or at least hide away these commands 😄

@yetudada
Copy link
Contributor

And I'll respond with:

  • I don't think we should be creating more work for ourselves by maintaining a starter or a new CLI group - which data suggests that our users won't use
  • I'm not sure we need to change Kedro's offering that much because users still identify with it, even though they don't use these features

lorenabalan pushed a commit that referenced this issue Feb 28, 2022
[AUTO-MERGE] Merge master into develop via merge-master-to-develop
@antonymilne
Copy link
Contributor Author

antonymilne commented Mar 7, 2022

Just a quick note on telemetry data for kedro jupyter and kedro ipython - even though these might be executed relatively few times, the number of times the command is run isn't really comparable to the number of times a command like kedro lint is run. kedro jupyter starts a potentially long-running Jupyter session which the user can interact with over an extended period of time rather than just it just being a one-off process that exits when complete, like kedro lint. So you could be interacting heavily with Jupyter even though you run the command only once a week.

@antonymilne
Copy link
Contributor Author

Also @yetudada did you deliberately not look at use of kedro build-docs and kedro build-reqs? I suspect they're both very low. If we're thinking of removing kedro lint etc. then I think the future of these commands should also be considered.

@merelcht merelcht added this to the 0.18+ milestone Mar 7, 2022
@antonymilne
Copy link
Contributor Author

One more point for consideration. Keeping these requirements in requirements.txt adds a maintenance burden for keeping on top of dependencies, e.g. #1327 #1322. Moving dependencies to dev_requirements.txt would improve the situation, and obviously removing them altogether would remove the maintenance burden completely.

@antonymilne antonymilne changed the title [KED-2346] Separate out dev requirements in kedro project template Separate out dev requirements in kedro project template. Or remove commands that require them altogether? Mar 8, 2022
@antonymilne antonymilne added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 8, 2022
@merelcht
Copy link
Member

merelcht commented Jun 6, 2022

Outcome of this ticket would be to check with users whether we can remove the mentioned CLI commands or not. That would give us evidence about whether we need dev-requirements or not.

@antonymilne
Copy link
Contributor Author

antonymilne commented Jun 6, 2022

I think kedro jupyter notebook/lab and kedro ipython should be left out of this and saved for the ongoing work on improving the interactive workflow. kedro jupyter convert is a good candidate to kill though. In my opinion the following commands are the ones that we should consider killing/moving in this issue:

activate-nbstripout
build-docs
build-reqs
jupyter convert
lint
test

@merelcht merelcht self-assigned this Jun 7, 2022
@merelcht merelcht added the Stage: User Research 🔬 Ticket needs to undergo user research before implementation label Jun 7, 2022
@merelcht
Copy link
Member

merelcht commented Jun 10, 2022

Summary of survey results + telemetry data

For comparison: unique user calls to kedro run from 1 Jan - 10 June: 3068, and unique user calls to kedro viz: 726

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro activate-nbstripout 6
yes no
1 21
yes no
1 10

Comments from users about kedro activate-nbstripout:

  • [Yes] "As part of the commit trigger so that the notebook output is not carried to the git"
  • [No] "I forgot it existed."
  • [No] "I did not know about it"
  • [No] "I don't work with notebooks often"
CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro build-docs 88
yes no
4 18
yes no
2 11

Comments from users about kedro build-docs:

  • [Yes] "It's more like a bonus for us than a real need. I like to provide to my team both auto generated doc and dag (w/ kedro viz), it help us understend each pipe/node." Are the docs generated as you expect? "Yes, I currently never encountered any issue relating to the auto-generated documentation" Is it enough to fulfil my needs? "Yes and no, the doc + the dags is great but it would be amazing to have a in-doc view of the dags (like this, no need to share screenshots of the viz)"
  • [Yes] "I use it to document the pipelines I build and make the onboarding process smoother. I'm working for a university and apart from few others the majority of my colleagues are staying for 6 months to 3 years. The DAG and the docs makes it easier for them to understand the structure and quickly start working on the pipelines. [...] it would be cool to have the dag in the docs (maybe mermaid kind of dag that can be version controlled? 🙂 )".
  • [Yes] "Tried to create documentation but did not work and/or did not understand how to make it work"
  • [Yes] "I used to after creating a template for a new study"
  • [No] "Technically, no.... but I understand the value in being able to build docs for your package. Problem is, it's an afterthought for most projects (for handover or something), and the command would probably fail or raise so many issues that it might not be worth it. It would be good to somehow enforce working docs from the beginning."
  • [No] "we used the kedro cli build-docs command"
  • [No] "Still useful to have docs but limited need for a publish-ready artifact"
CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro build-reqs 263
yes no
7 15
yes no
7 5

Comments from users about kedro build-reqs:

  • [Yes] "CI/CD and use case installation."
  • [Yes] "I used to install the requirements for deploying an use case"
  • [Yes] "And it was not such a smooth experience"
  • [Yes] "resolve version dependencies in requirements.txt, though would prefer kedro build-reqs --no-annotate version instead by default"
  • [No] "Technically, no, because haven't been working on any large Kedro project outside a Vertical. That being said, do essentially do a pip-compile in CI to generate a "guaranteed working" set of requirements to ship to CSTs/for demos, as part of a downloadable project. Think there's value here, but also most QB/McK DE/DS don't know how to properly define loose requirements and pin stuff anyway, so probably limited value for internal users until they understand the purpose of this."
  • [No] "While I haven't used this myself, I had others use this on my project especially when the global pip wasn't compatible for them due to windows vs mac difference"
  • [No] "I use python -m piptools compile -q src/requirements.in"
  • [No] "Dependency versions specified in requirements.txt already"
CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro jupyter convert 19
yes no
0 22
yes no
3 9

Comments from users about kedro jupyter convert:

  • [Yes] "I used it some times while prototyping small pipelines to quickly export the functions I was using in the jupyter notebook."
  • [No] "Forgot what exactly it does even..."
  • [No] "It is easier to create the note directly on nodes.py (I even prefer to copy and paste the function)"
  • [No] "because tipically i've been working directly in VSC debug mode to develop/debug/try some specific snippets into nodes/pipelines - i.e. not developing large code or nodes in jupyter notebooks"
  • [No] "This feels dirty"
CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro lint 159
yes no
4 18
yes no
5 6

Comments from users about kedro lint:

  • [Yes] "Used locally and in a Github action to ensure code formatting"
  • [Yes] "As a part of the CI test"
  • [No] "However, when I was client-facing, I did use to override kedro lint command to provide a single entry-point for DE/DS on the project to make sure all CI checks passed. I think there's value. I suppose it could be replaced with preconfigured pre-commit."
  • [No] "linting was done using a Makefile which would call a number of commands"
  • [No] "I use vscode with those lint"
  • [No] "haven't needed to run kedro lint explicitly with pre-commit in place"
CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro test 459
yes no
4 18
yes no
5 5

Comments from users about kedro test:

  • [Yes] "Also as a CI test & as a precommit hook"
  • [Yes] "Used locally and in Github Action to execute unit tests. Could take an optional parameter to execute a specific test for debugging, i.e. kedro test src/tests/test_mydataset.py"
  • [No] "Eh... technically not, being on a Vertical, but 100% used this for Kedro projects in the past (as a shortcut, preconfigured pytest command)."
  • [No] "I use pytest"

@merelcht
Copy link
Member

merelcht commented Jun 13, 2022

Based on the above results my suggestions for next steps are as follows:

  1. Deprecate and remove the kedro jupyter convert command
  2. Deprecate and remove the kedro activate-nbstripout command
  3. I'm leaning towards deprecating the kedro build-docscommand, but we could do some additional research to re-evaluate its purpose.
    Some users shared that they really like this command, but would also like it to have more extensive functionality, such as
    automatically including the kedro viz diagram. Because of the nature of this command it isn't run that often, likely only at the end of a project (which is reflected by the telemetry data) and one user reported that it the command didn't work, with another elaborating "[building documentation] it's an afterthought for most projects (for handover or something), and the command would probably fail or raise so many issues that it might not be worth it"
  4. Re-evaluate the purpose of kedro lint and see if we can offer an alternative to users
  5. Re-evaluate the purpose of kedro test and see if we can offer an alternative to users.
    My guess is that kedro test is also run as part of CI builds and that's the reason why it seems to be used quite a lot in telemetry compared to some of the other commands. kedro test is really only a wrapper around pytest so we could re-educate our users to just start using pytest like we did with kedro install and pip install -r requirements.txt.
  6. kedro build-reqs: I would leave this command for now. This command (compared to the others) has the highest usage numbers combined with self-proclaimed usage by both OS and internal QB users. It's also one of the more complex commands and we'd have to investigate what alternative we could offer our users.

@antonymilne
Copy link
Contributor Author

Thanks for all the research and the most excellent write up of all this! Do you have the number of unique calls to kedro run and kedro viz in the same time period just for comparison?

Based on this research, personally I would remove all these commands in favour of my alternative proposal (to follow in next post). If we don't want to go down that route then:

  1. kedro activate-nbstripout: completely agree, kill it
  2. kedro jupyter convert: completely agree, kill it
  3. kedro build-docs: agree with you, but more strongly in favour of killing it. In one sense it seems slightly more valuable than the very thin wrapper ones like kedro lint and kedro test because it does something that is a bit tricky (i.e. sphinx, which we provide some config for). On the flip side if we killed it then we could also remove docs folder from the template which would be nice. If we keep this then completely agree we should re-think what it's really for and whether we want to continue using sphinx for it. There are probably easier to use alternatives available now. N.B. if we want to do mermaid diagrams as one user mentioned, we could use Mermaid Diagrams #1269
  4. kedro lint. if we keep it then I think we should consider again whether pre-commit might be useful here. I think the argument against it before was difficulty on windows, but I'm not sure.
  5. kedro test: agree with you, but more strongly in favour of killing it. Unlike kedro lint this just wraps a single CLI command. Agree with you that's it's like deprecation of kedro install in favour of pip install, which is a way more important command and AFAIK we haven't received any complaints about that at all. And I don't think there is a good alternative to pytest here really
  6. kedro build-reqs: disagree here, I think the case for keeping this isn't any stronger than for kedro test (both in terms of telemetry data and user feedback). It did used to be a complex command but is not so any more in 0.18. It's really just a thin wrapper for pip-tools compile but with non-standard file names (i.e. requirements.txt -> .lock rather than requirements.in -> .txt). Now that pip has ok dependency resolution built in I think bundling piptools is much less important than it used to be. Personally I do like the piptools workflow (just as I like pytest), but I don't think it belongs in core kedro. So IMO the same re-education journey as for deprecation of kedro test would apply here

If we do keep some of these commands in kedro core then I think we should consider hiding them away inside a new group (name tbc) so they clutter the CLI less:

kedro dev lint
kedro dev test
kedro dev build-reqs

@antonymilne
Copy link
Contributor Author

Proposal

We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.

We (or someone else) make a new plugin kedro-with-tools (name completely made up for now) that includes:

  • kedro tools group of commands (like the kedro dev group I said above) including lint, test, build-docs, whatever
  • dependencies to make all these work
  • a starter template (call it kedro with-tools-starter) that is heavier than the core one and includes any required configuration (e.g. pyproject.toml, .pre-commit-config.yaml). Along these lines: add simple starter kedro-starters#40

Important note: I say "we or someone else" because I think could be a very good candidate for an unofficial community-maintained plugin. Or we could get it started and then hand over, or we could maintain it but be much more relaxed in accepting updates than we are with the core kedro template.

User journey

Note that pip install immediately makes the new kedro-with-tools-starter alias available thanks to #1592.

Starting a new project:

pip install kedro-with-tools 
kedro new -s kedro-with-tools-starter  # Makes new project template with pre-commit etc. config
kedro tools lint etc. commands are now available

In existing project:

pip install kedro-with-tools 
kedro tools init  # Not sure if we'll actually need this command at all. But it would copy pre-commit etc. config to already existing project
kedro tools lint etc. commands are now available

Pros

  • The kedro CLI becomes very clean. We no longer have any little-used commands cluttering it up. The top-level groups like kedro pipeline, kedro catalog become much more discoverable because they're not hidden among lower-level commands all sitting at the same level
  • The core kedro template and starters become very clean and lighter weight
  • No need for any dev_requirements.txt. A project requirements.txt would include the bare minimum of what's actually needed. No need to ever touch these dependencies again (as with kedro-datasets, these project dev dependencies tend to move faster than than our core ones and cause a disproportionate number of conflicts and irritations, e.g. piptools has broken through conflicts with pip 3 times this year already)
  • We no longer need to worry so much about maintaining the kedro lint etc. commands. These are hard to keep up to date since best practice Python tooling changes (e.g. sphinx; this article. The current "recommendations" we make through the project template are kind of arbitrary (e.g. why do we do isort but not mypy in lint? Just because we haven't updated it for ages I think). We're not really best placed to judge what are the best tools, and it's always changing. If this is community maintained then we don't need to worry about staying on top of it at all!
  • It's consistent with the move to kedro-datasets, i.e. a lean core which is extendible
  • Very non-disruptive. Everyone who doesn't use these commands immediately benefits from a cleaner kedro; those who do want to use them can still do just by doing a single pip install. None of the current functionality is actually lost - it's just moved. In fact, user will have access to better, more up to date versions of these commands
  • Nicely showcases the power of plugins for providing custom CLI commands and starters

References that inspired this idea: #826; #844 (reply in thread); kedro-org/kedro-starters#40

Cons

  • Discoverability of these commands is harder than it currently is since wouldn't appear by default when you do kedro. On the other hand, discoverability of the "core" kedro commands is enhanced
  • We need to update our docs and messaging a bit. kedro is all about promoting good software engineer practice, and at the moment a (small) part of that is in our "recommendation" of the tools that we have in our project template and associated lint etc. commands. @yetudada above didn't see this is a problem. Given how few people use the commands, I don't think they can be a major part of the kedro value proposition

Overall I think both the above can be addressed by documentation and deprecating warnings.

@merelcht
Copy link
Member

@AntonyMilneQB Thanks for sharing and elaborating on your thoughts!

I like the idea of moving the commands we want to keep to a separate plugin if we're keeping a reasonable amount. If we decide we're removing most of them anyway, I'm not sure it's even worth creating this plugin. From your earlier comment it sounded like you'd only really want to keep kedro lint, so then a new plugin might be overkill. But 100% agree to create a plugin if we decide to keep several of the commands.

@antonymilne
Copy link
Contributor Author

antonymilne commented Jun 13, 2022

Just to clarify: if we (or someone else) made this plugin, I think it could still contain the build-docs, build-reqs and test functionality. If it's not part of the core kedro package I feel much more relaxed about putting in functionality that is deemed useful for software engineering best practice but not commonly used at the moment.

Also, slightly simpler than a whole plugin: my original thought was that this could be done with just a custom (ideally community-maintained) starter that would provide:

  • configuration files like pyproject.toml, dependencies in requirements.txt, etc.
  • cli.py that defined any kedro lint etc. commands that you want

I still think this is fine, just doing it as a plugin is slightly neater since that way you also get a nice kedro new --starter= alias even if's not in our official starters repo.

@NeroOkwa
Copy link
Contributor

Thanks @MerelTheisenQB and @AntonyMilneQB for the research, and excellent summary.
I agree with @AntonyMilneQB on removing these commands to declutter the CLI, or alternatively moving the additional commands we decide to keep to a separate plugin.

@yetudada
Copy link
Contributor

This stellar research @MerelTheisenQB 🎉 Thank you so much for this. I've read through your thoughts; thanks for the additions @MerelTheisenQB, @AntonyMilneQB and @NeroOkwa.

Here are my summary thoughts:

  • We're deleting all of these commands from our CLI, with particular emphasis on kedro activate-nbstripout and kedro jupyter convert
  • We should create a discovery ticket to explore how we enable our users to document their workflows for sharing; there seemed to be something interesting around how they imagined kedro build-docs could help them but didn't. However, this ticket is a low priority.
  • I would prioritise making additions to our documentation. So it would look like new sections on how to use pytest in a Kedro workflow and maybe a ticket to document some of the linting tools.
  • I like the plugin's workflow; it's slick. However, I would create a GitHub issue to track demand. Additionally, I would follow the consumption of the testing and linting documentation. My baseline assumption is that users wouldn't use it because these features have had these convenience wrappers for a while, and they still didn't use them.

So with these changes, we still maintain some of our value proposition. But it's clear our users like other parts of Kedro and not necessarily our support of tooling that helps them create better software.

@deepyaman
Copy link
Member

Proposal

We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.

We (or someone else) make a new plugin kedro-with-tools (name completely made up for now) that includes:

  • kedro tools group of commands (like the kedro dev group I said above) including lint, test, build-docs, whatever
  • dependencies to make all these work
  • a starter template (call it kedro with-tools-starter) that is heavier than the core one and includes any required configuration (e.g. pyproject.toml, .pre-commit-config.yaml). Along these lines: add simple starter kedro-starters#40

@AntonyMilneQB--and everybody else in favor of killing kedro lint and kedro test, but kedro lint especially--I don't entirely see how this workflow would work.

I do think a non-trivial part of the value proposition of Kedro is introducing users to and enforcing basic software engineering best practices. Anecdotally, I remember being one of the first people on the client-facing side in QuantumBlack (at least in North America?) to enforce coding standards on projects over 3 years ago. More recently, I think a much larger percentage of projects do so--yes, in parallel to greater adoption of tools like Black and isort by the broader community, but also I think at higher rates due to it being bundled with Kedro. In my humble opinion, Kedro has always been opinionated, and part of the value comes from this opinionated nature.

I think removing the kedro lint command and documenting how to replicate its current behavior by leveraging the underlying tools (or something like pre-commit) directly is fine. Maybe this will help some users understand that there's no "magic".

However, going so far as to remove linter configuration needs to be considered much more carefully:

  • If there's to be any hope that 90% of Kedro users currently linting their code will still lint their projects, whatever Kedro template is spit out should satisfy said linters from day one. However, how do you do this if you don't have any canonical linter configuration?
  • Are most Kedro users (e.g. people under pressure to get projects rolling quickly, data scientists just wanting to do some experimentation) going to spend time looking up Black-compatible isort and Flake8 configs? How many of them even know this is a thing?

When it comes to kedro test, I honestly think most users still aren't writing tests (hence the low command usage), but making the config optional isn't going to make that any better. Again, totally fine with just pointing people to the equivalent pytest command here.

With regards to making these commands part of an optional plugin, if anything, I would make them part of a plugin installed by default. This doesn't reduce the maintenance burden, and you can't outsource the development of such a core plugin to the broader community, but it does make it optional for those who may have good reason to not want the functionality.

Finally, FWIW, I think the set of core tools included as part of the project template is actually pretty good. While I personally would include things like mypy and Prettier, I think the black/isort/Flake8 suite is a good representation of "the bare minimum" to say you're following software engineering best practices on this front, without being too difficult for the majority of users to follow. Outsourced to the community, I'm sure somebody will throw in random things like importlinter that may be nice to have, but definitely not at the same level as Black or isort in terms of getting somebody on board with better software development practices.

@merelcht
Copy link
Member

@yetudada yetudada changed the title Separate out dev requirements in kedro project template. Or remove commands that require them altogether? User research to understand CLI command usage Jul 27, 2022
@yetudada yetudada changed the title User research to understand CLI command usage Evaluating CLI command usage Jul 27, 2022
@merelcht merelcht added Type: User Research Synthesis ✍️ Issues to document results from user research and removed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation Stage: User Research 🔬 Ticket needs to undergo user research before implementation labels Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: User Research Synthesis ✍️ Issues to document results from user research
Projects
Status: Done
Status: Shipped 🚀
Development

No branches or pull requests

5 participants