Introducing Utility Modules to Kedro #2388

amandakys · 2023-03-03T19:59:38Z

Introduction

The desire to simplify Kedro and make it less intimidating for new users conflicts with Kedro's goal of providing an opinionated, structured application of software engineering (SWE) best practices to data science code. This conflict frames the debate over whether users should have to opt in to ‘non-essential’ functionality (the simpler route) or opt out of it (the opinionated route).

Kedro’s user base can be broadly split into 2 groups:

Beginner User
Expert User

Both of these groups need to be targeted to drive adoption.

We do not want to degrade the experience of our expert users, who know and love Kedro, for the sake of driving beginner/new user adoption. However, the needs of the two groups are often contradictory. If we maintain the status quo, we do not move towards addressing our beginner user problem. Therefore, our goal should be to find solutions that improve the beginner user experience without significantly degrading the expert user experience.

Background

The discussion of whether users should opt in to or opt out of features grew from the discussion about removing support for project-side logging.yml [#2281] and continued in discussions of how to simplify the default project template [#2149] by removing ‘non-essential’ directories.

The status quo is opt-out. The key unknown for opt-in is how to streamline the opt-in journey. Our default approach is to put the information in the docs. This raises the question of feature discoverability. How will new users know these features exist if they aren’t indicated by the project template?

I have generalised the opt-in and opt-out flows as best I can; they will be a little different for each functionality.

Related tickets

Concept 1: Simplifying project template by removing ‘unused’ or ‘unnecessary’ features

Concept 2: Improve starter journey to increase accessibility of Kedro

Kedro incremental starter #2054
- a starter journey that incrementally adds in more components of kedro
Revise the starter selection journey #1970
- increase discoverability of Kedro starters
Create new advanced Kedro starter #1948
- create starters for more advanced use cases, providing example code rather than referring users to the docs

Design

Step 1: Revised Kedro initialisation journey

integrate starter selection into kedro new
- Choose a starter from this list
- by default none of the starters will contain the removed ‘utilities’
allow the configuration of utility options
- e.g. Do you want linting? Y/N, Do you want testing? Y/N
For those who want to skip ‘wizard creation’, a shortcut could be used:
- kedro new --starter=blank --add-lint --add-test --local-data --add-logs [project-name]
- kedro new --modules=all
- kedro new --modules=lint,test,data,logs

(my-virtual-environment) ➜  kedro new 

Project Template
=============
Choose a project template for your new project. 
- astro-airflow-iris: An Iris dataset example project with a minimal setup for deploying the pipeline on Airflow with Astronomer
- blank: A minimal project template
- pandas-iris: An Iris dataset example using Pandas
- pyspark: Configuration and initialisation for a PySpark pipeline
- pyspark-iris: An Iris dataset example using PySpark 
- spaceflights: Spaceflights tutorial example code
- standalone-datacatalog: A minimum setup to use Kedro's DataCatalog

 [Select your template]: blank

Project Utilities
===========
Here you can select which project utilities you'd like to include. 
Don't worry if you change your mind; you can always add/remove these modules later.
To read more about these utilities and what they do visit: kedro.org/ 

Would you like to include linting? [y/n]: 
Would you like to include testing? [y/n]:
Will you be storing data locally? [y/n]:
Would you like custom logging functionality? [y/n]: 

Project Name
============
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.

 [New Kedro Project]: My ML pipeline 

The project name 'My ML pipeline' has been applied to: 
- The project title in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline/README.md 
- The folder created for your project in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline 
- The project's python package in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline/src/my_ml_pipeline

A best-practice setup includes initialising git and creating a virtual environment before running 'pip install -r src/requirements.txt' to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

Change directory to the project generated in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline by entering 'cd /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline'

At this point in the flow, we have a lot of options for what the default behaviour will be: kedro new could default to all modules or no modules, the module selection journey could be compulsory, and we could add convenience options for expert users. (Since project creation is not a frequent action, I don’t think a few extra steps would be terrible.)
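The yes/no utility questions in the wizard, and the choice of default behaviour, could be sketched roughly as below. This is only an illustration: `UTILITY_QUESTIONS`, `ask_yes_no` and `run_utility_wizard` are hypothetical names, not existing Kedro code, and the `ask` parameter is an assumption added so the prompts can be driven without a terminal.

```python
from typing import Callable, Dict

# Hypothetical question list mirroring the prompts in the mock-up above.
UTILITY_QUESTIONS: Dict[str, str] = {
    "lint": "Would you like to include linting?",
    "test": "Would you like to include testing?",
    "data": "Will you be storing data locally?",
    "logs": "Would you like custom logging functionality?",
}


def ask_yes_no(question: str, default: bool, ask: Callable[[str], str] = input) -> bool:
    """Ask until the answer is y/n; an empty answer falls back to the default."""
    suffix = "[Y/n]" if default else "[y/N]"
    while True:
        answer = ask(f"{question} {suffix}: ").strip().lower()
        if not answer:
            return default
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Anything else is ignored and the question is asked again.


def run_utility_wizard(default: bool, ask: Callable[[str], str] = input) -> Dict[str, bool]:
    """Return {utility: included?} for every question in UTILITY_QUESTIONS."""
    return {
        name: ask_yes_no(question, default, ask)
        for name, question in UTILITY_QUESTIONS.items()
    }
```

Passing `default=True` models an "all modules unless declined" journey, while `default=False` models "no modules unless requested", so the open question above becomes a single parameter.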

Step 2: Kedro Utility Modules

Testing, linting, logging and data structure are the initial modules; the list can grow over time and offers a way to add in new features.

What is a module?
A module can be a file structure (e.g. the data folder) or a set of files plus file structure plus configuration settings (e.g. logging, testing, linting). The primary goal is for modules to be easy to insert into and remove from a repo. To achieve this, they should be self-sufficient and independent components.
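As a rough sketch of this definition, a module could be modelled as a named bundle of directories and files that can be inserted into and removed from a project root. `UtilityModule` and its fields are hypothetical, and the file contents are placeholders, not what Kedro actually ships.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Tuple


@dataclass(frozen=True)
class UtilityModule:
    """A self-contained bundle of directories and files (hypothetical structure)."""

    name: str
    directories: Tuple[str, ...] = ()
    files: Dict[str, str] = field(default_factory=dict)  # relative path -> contents

    def insert(self, project_root: Path) -> None:
        """Create the module's directories and write its files."""
        for directory in self.directories:
            (project_root / directory).mkdir(parents=True, exist_ok=True)
        for relative_path, contents in self.files.items():
            target = project_root / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(contents)

    def remove(self, project_root: Path) -> None:
        """Delete the module's files, then any of its directories left empty."""
        for relative_path in self.files:
            (project_root / relative_path).unlink(missing_ok=True)
        for directory in sorted(self.directories, key=len, reverse=True):
            path = project_root / directory
            if path.is_dir() and not any(path.iterdir()):
                path.rmdir()


# Illustrative definitions only; the contents are placeholders.
DATA = UtilityModule(name="data", directories=("data/01_raw", "data/02_intermediate"))
LOGGING = UtilityModule(
    name="logs",
    directories=("logs",),
    files={"conf/base/logging.yml": "version: 1\n"},
)
```

Because `remove` only deletes directories that are empty, user data placed in a module's folder would survive removal, which is one way to keep modules independent of user code.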

Step 3: Simplify Module insertion journey after initialisation

An initial implementation should focus on allowing users to select and ‘plug in’ modules on initialisation.
The later touchpoint, where users realise that they want a module and plug it into an ‘unclean’ project repo, will be more complex because we will have to deal with more unknowns and code clashes.
At this point, a try-catch approach could work: a utility command ‘tries’ to insert logging, given certain prerequisites that the command or the user is responsible for checking.
- You already have a logging.yml file. Are you sure you want to overwrite it?
- If it fails, users should be directed to a step-by-step walkthrough of the steps the utility command tries to perform, e.g. an article on how to add logging manually.
```
kedro add --modules=logs
You already have a logging.yml file. Are you sure you want to overwrite it? [y/n]:

.
.
.

We were unable to automatically add logging. For step by step instructions visit: xxx.com 
```
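The try-catch idea above might look something like the following sketch. `add_logging` and `confirm` are hypothetical names, the target path assumes logging.yml lives under conf/base, and the guide URL is the same placeholder used in the mock-up.

```python
from pathlib import Path
from typing import Callable

MANUAL_GUIDE = "xxx.com"  # placeholder URL copied from the mock-up above


def add_logging(project_root: Path, contents: str, confirm: Callable[[str], bool]) -> str:
    """Try to write conf/base/logging.yml and return a message for the user."""
    target = project_root / "conf" / "base" / "logging.yml"
    # Prerequisite check: never overwrite an existing file without consent.
    if target.exists() and not confirm(
        "You already have a logging.yml file. Are you sure you want to overwrite it?"
    ):
        return "Skipped: the existing logging.yml was left unchanged."
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents)
    except OSError:
        # Fall back to pointing the user at the manual walkthrough.
        return (
            "We were unable to automatically add logging. "
            f"For step by step instructions visit: {MANUAL_GUIDE}"
        )
    return "Logging was added to your project."
```

Returning a message rather than raising keeps the failure path user-facing, matching the proposal that a failed insertion should end in a pointer to the manual steps.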

Step 4: Simplify Module deletion journey after initialisation

If we know what we supply by default, we can tell whether users have modified those files.
With some regularity, our CLI can then prompt users to delete modules they aren’t using.

We've noticed you are still using our template testing files. 
To learn more about testing with Kedro visit: kedro.org/testing
To learn more about using alternative testing libraries with Kedro visit: xxx.com 
If you no longer need the testing module run `kedro remove-testing`
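One way to tell whether users have modified the supplied files is to compare each file against a digest of the shipped version. The sketch below assumes such a manifest of digests is available; `file_digest` and `unmodified_template_files` are hypothetical helpers, not part of Kedro.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def unmodified_template_files(project_root: Path, shipped: Dict[str, str]) -> List[str]:
    """Return relative paths whose contents still match the shipped digest.

    `shipped` maps a relative path to the digest of the file as the
    template originally supplied it (an assumed manifest).
    """
    unchanged = []
    for relative_path, digest in sorted(shipped.items()):
        candidate = project_root / relative_path
        if candidate.is_file() and file_digest(candidate) == digest:
            unchanged.append(relative_path)
    return unchanged
```

If every file belonging to a module is still unchanged after some time, the CLI could show a reminder like the one above; deleted or edited files are simply not reported.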

Evaluative Questions & Thoughts

What modules should be included in Kedro by default is an open discussion
- If telemetry could tell us which modules are most commonly enabled or disabled, this would be valuable information.
A simple starting journey could mean that users are comfortable starting again, perhaps creating a new starter project with the utilities included and then plugging in code they’d written in their simpler ‘test project’.
We need a detailed look at what we consider starters, when we should provide starters and how they differ from each other
- this feeds into Advanced Starters, but should be considered separately to the idea of modularity.

Alternative Options

Update or add README files in directories, e.g. notebooks/ or logs/, to direct users to the relevant part of the docs.

Rollout strategy

Phased rollout: initially, we should not support module insertion/deletion beyond the initialisation step
- If modules perform well, we can look into the post-initialisation insertion/deletion journey

Planned Research Activities

Team design session aligning Kedro’s priorities and goals
Technical design session to evaluate technical feasibility of potential solutions


yetudada · 2023-03-06T11:38:34Z

@amandakys you're killing it on this issue! 🎉 Let me provide supporting evidence and comments.

Things that we have evidence for

We must support beginner users better because this group's size is significantly larger than the expert users; internally, we have 19 MLEs vs 427 DS. This skew is likely representative of the external industry too.
Current data shows that users do not necessarily leverage best practices, even when they are visible in their project template and CLI (Evaluating CLI command usage #1293); while this data is imperfect, it is the only data that we have.
It appears our beginner users will not leverage tools because they don't know how to; look at the oversubscription of "Software Engineering for Data Scientists" - 60 marked for attendance, 69 on the waitlist.

Broadly, I agree with a journey that makes it easy and discoverable for our users to opt-in to additional functionality because it's in line with our principles of "growing beginners into experts".

Comments and questions on the prototype

I agree with the direction of the prototype, with a focus on steps 1 and 2 before building out steps 3 and 4
Telemetry at project creation will be challenging, but I would like to see this explored
We might also need an option for documentation
I think we might need to call these things "utilities" instead of "modules" because we already have concepts like "modular pipelines"
Question: Do we have data to suggest splitting the questions on linting, testing and documentation? Could they be grouped so that users have fewer questions?

Comments on Step 1

Can the CLI flags be made consistent? For example, could kedro new --starter=blank --add-lint --add-test --local-data --add-logs [project-name] just become kedro new --starter=blank --add-utilities=lint,test,data,docs,logs --name=project-name?
I also really like the idea of allowing users to supply a project name via the flags.

merelcht · 2023-03-06T12:49:42Z

This is fantastic! Thanks for scoping out this work @amandakys ⭐

I have one minor comment around testing: kedro test was always a tiny wrapper around pytest, so if we add any testing structure I would prefer that we just set it up as pure pytest. kedro lint is a different story because it combined several tools and could potentially still be a useful command.

noklam · 2023-03-06T13:22:58Z

From the backward compatibility perspective, do these utilities need to be registered in some settings file, or does it work just by recognizing certain file structures?

amandakys · 2023-03-14T17:49:18Z

I've done some additional work based on feedback.

I agree with @yetudada that Utilities is a less ambiguous name than Modules.

Step 1: Revised Initialisation Journey

The proposed new commands would include:
kedro new --template=blank which would be a direct rename of kedro new --starter (see #2422)
kedro new --project-name=my_ML_project allowing users to specify a project name inline

If the --template and --project-name options are not supplied, the CLI will go through the relevant creation wizard.

Project Name
============
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.

 [New Kedro Project]: My ML pipeline

Project Template
=============
Choose a project template for your new project. 
- astro-airflow-iris: An Iris dataset example project with a minimal setup for deploying the pipeline on Airflow with Astronomer
- blank: A minimal project template
- pandas-iris: An Iris dataset example using Pandas
- pyspark: Configuration and initialisation for a PySpark pipeline
- pyspark-iris: An Iris dataset example using PySpark 
- spaceflights: Spaceflights tutorial example code
- standalone-datacatalog: A minimum setup to use Kedro's DataCatalog

[Select your template]: blank

Step 2: Introduce Utilities

Proposed new commands include:
kedro new --utilities=test,logs,local_data,lint,docs
kedro new --utilities=all
kedro new --utilities=none

If --utilities is not supplied, the CLI will go through the Utility selection wizard. (This has been changed to reduce the number of questions users need to answer to select utilities.)

Project Utilities
===========
Here you can select which Kedro utilities you'd like to include. 
Don't worry if you change your mind; you can always add/remove these modules later.
To read more about these utilities and what they do visit: kedro.org/

Kedro Utilities
1) Linting : some description
2) Testing : xxx
3) Local Data Storage : xxx
4) Custom Logging : xxx
5) Documentation: xxx

Which utilities would you like to include in your project? [1-4/all/1,3]:
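A minimal sketch of how the answer to this prompt could be interpreted, assuming the three input styles shown in the brackets (a range, `all`, or a comma-separated list). `parse_selection` is a hypothetical helper and the error handling is an assumption.

```python
from typing import List

# Utility names in the order shown by the mock-up above.
UTILITIES = ["Linting", "Testing", "Local Data Storage", "Custom Logging", "Documentation"]


def parse_selection(answer: str, count: int) -> List[int]:
    """Turn 'all', '1-4' or '1,3' into a sorted list of 1-based indices."""
    answer = answer.strip().lower()
    if answer == "all":
        return list(range(1, count + 1))
    selected = set()
    for part in answer.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= count:
            raise ValueError(f"selection {part!r} is outside 1-{count}")
        selected.update(range(start, end + 1))
    return sorted(selected)


def selected_names(answer: str) -> List[str]:
    """Map a raw answer to utility names."""
    return [UTILITIES[index - 1] for index in parse_selection(answer, len(UTILITIES))]
```

Non-numeric input raises `ValueError` from `int`, so a caller could catch that single exception type and ask the question again.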

The --utilities option on kedro new would then be distinct from the --add-utilities command relevant to Step 3.
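The options proposed in Steps 1 and 2 could be parsed along these lines. This sketch uses the standard library's argparse purely as a stand-in, not Kedro's real CLI code; `parse_utilities` and `build_parser` are hypothetical, and treating a missing option as `None` (meaning "run the wizard") is an assumption.

```python
import argparse
from typing import List

# Utility names taken from the proposed --utilities example.
KNOWN_UTILITIES = ["test", "logs", "local_data", "lint", "docs"]


def parse_utilities(value: str) -> List[str]:
    """Parse 'all', 'none' or a comma-separated list of utility names."""
    value = value.strip().lower()
    if value == "all":
        return list(KNOWN_UTILITIES)
    if value == "none":
        return []
    chosen = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in chosen if item not in KNOWN_UTILITIES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown utilities: {', '.join(unknown)}")
    # De-duplicate and return in a stable, canonical order.
    return [name for name in KNOWN_UTILITIES if name in chosen]


def build_parser() -> argparse.ArgumentParser:
    """Parser for the proposed options; None means 'ask via the wizard'."""
    parser = argparse.ArgumentParser(prog="kedro new")
    parser.add_argument("--template", default=None)
    parser.add_argument("--project-name", default=None)
    parser.add_argument("--utilities", type=parse_utilities, default=None)
    return parser
```

Distinguishing `None` (option absent) from an empty list (`--utilities=none`) lets the CLI skip the wizard only when the user has made an explicit choice.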

Telemetry at Project Creation

Yetu mentions that telemetry consent may only be asked for and granted after the project creation step, so we might struggle to collect data about what parameters are supplied to the kedro new command. But if we know what files/folders/imports each utility adds, we can find something unique in each utility to look for after project creation is complete to verify which utilities users chose to include.
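The idea of looking for something unique to each utility after project creation could be sketched as a marker lookup. The marker paths below are illustrative guesses rather than a confirmed list of what each utility adds, and `detect_utilities` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical markers: one path per utility that only exists if it was included.
UTILITY_MARKERS: Dict[str, str] = {
    "test": "src/tests",
    "logs": "conf/base/logging.yml",
    "local_data": "data",
    "docs": "docs",
}


def detect_utilities(project_root: Path, markers: Dict[str, str] = UTILITY_MARKERS) -> List[str]:
    """Return the names of utilities whose marker path exists in the project."""
    return sorted(
        name for name, marker in markers.items() if (project_root / marker).exists()
    )
```

Because this inspects the project on disk rather than the kedro new arguments, it could run after telemetry consent has been granted.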

deepyaman · 2023-03-27T13:30:16Z

Copying a couple comments from Slack to have them in the same place...

Randomly reading the Getting Started docs for React, I like this goal/phrasing:
~~React~~Kedro has been designed from the start for gradual adoption, and you can use as little or as much ~~React~~Kedro as you need.

Fits in with/adds fuel to the whole opt-in/opt-out discussion, as well as recent requests to be able to use pieces of Kedro.

Originally posted in https://kedro-org.slack.com/archives/C03QP0NH2J2/p1678806542848929

I assume the "minimal" path to using Kedro right now is the standalone data catalog (https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_as_a_data_registry.html), but it's not the starting point for new users the way things are written--the starting point is very much the whole thing with pipelines and I/O with all the CLI.

To be honest, even this isn't very minimal, since it requires the CLI + starter + folder structure; I could imagine the more minimal case would be just pip install kedro, start a notebook, and create a data catalog programmatically and use it in functions. But then we get back into the loop of, is this showing Kedro's value proposition? 😂

Originally posted in https://kedro-org.slack.com/archives/C03QP0NH2J2/p1678817179219289?thread_ts=1678806542.848929&cid=C03QP0NH2J2

amandakys · 2023-04-12T10:34:14Z

Relevant Sub-Issues:

amandakys · 2023-04-17T17:36:05Z

Following our discussion in tech design the next steps for this work are (in order):

investigate how Kedro Init and Utility Modules can fit together (Document workflow to incrementally create a minimal Kedro project from scratch #2512)
address starter strategy work so we can create an appropriate prototype of the new project creation flow (Investigate options for Starter restructure #2505)

yetudada · 2023-09-04T09:56:34Z

I'm going to close this ticket in favour of #2506

yetudada added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 6, 2023

astrojuanlu mentioned this issue Mar 8, 2023

Messaging - One Sentence Summary for Kedro kedro-org/kedro-devrel#2

Closed

2 tasks

amandakys mentioned this issue Mar 14, 2023

Standardise terminology for starters #2422

Closed

yetudada mentioned this issue Mar 23, 2023

Create user personas to understand value proposition #2462

Closed

merelcht added this to the Segment template depending on user persona milestone Mar 27, 2023

This was referenced Apr 12, 2023

Investigate options for Starter restructure #2505

Closed

Design full Project Creation CLI Flow #2506

Closed

amandakys added the Type: Parent Issue label Apr 12, 2023

amandakys mentioned this issue Apr 12, 2023

Project Creation Wizard GUI #2509

Closed

amandakys mentioned this issue Apr 12, 2023

Revise the starter selection journey #1970

Closed

antonymilne mentioned this issue Apr 17, 2023

Consider removing support for project-side logging.yml #2281

Closed

This was referenced May 8, 2023

Move default template to static pyproject.toml #2569

Closed

Create migration content for users updating their Kedro version #1887

Closed

astrojuanlu mentioned this issue Jun 19, 2023

Move AbstractDataSet to Kedro-Plugins #2409

Closed

This was referenced Jul 3, 2023

Strip project template #2756

Closed

Spike: Investigate options to pull a pipeline and other files into a project #2758

Closed

astrojuanlu mentioned this issue Jul 5, 2023

Document workflow to incrementally create a minimal Kedro project from scratch #2512

Open

This was referenced Jul 25, 2023

Should we remove the docs/ folder from the template? #2381

Closed

Should we remove the notebooks/ folder from the template? #2380

Closed

Should we remove the data/ folder from the template? #2379

Closed

astrojuanlu mentioned this issue Jul 28, 2023

Move default template to static pyproject.toml, take 2 #2853

Merged

5 tasks

This was referenced Aug 7, 2023

Kedro incremental starter #2054

Closed

Insights and opportunities related to helping Kedro impact more users #2901

Closed

yetudada closed this as completed Sep 4, 2023

merelcht mentioned this issue Sep 5, 2023

[Needs User Input!] Is there demand for a kedro-tools plugin? #1622

Closed

This was referenced Sep 6, 2023

Remove linting and test files + setup from the project template #1947

Closed

Remove linting and test files + setup from all starters kedro-org/kedro-starters#106

Closed

This was referenced Sep 6, 2023

Investigate removal of prompts.yml #1692

Closed

Rename spaceflights starter to spaceflights-pandas #3020

Closed

This was referenced Oct 3, 2023

Add a deprecation notice to signal the archiving of starters + new project creation flow #3114

Closed

Add new spaceflights starters to --starter flag options #3112

Closed
