Simplify pandas-iris starter example #79

SajidAlamQB · 2022-04-05T08:56:40Z

Motivation and Context

Currently, the pandas-iris starter and documentation contain four nodes split across two modular pipelines, and the hand-crafted linear regression data science algorithm is quite complicated. I think this example should be simpler than spaceflights, not more complicated.

~~Opted to convert the examples to linear regression to find the coefficient of determination between sepal length and other features.~~

Implemented a simple 1-nearest neighbour classifier and calculated its accuracy.
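For readers unfamiliar with the technique, a 1-nearest neighbour classifier can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the function names `predict_1nn` and `accuracy`, the use of NumPy with Euclidean distance, and the toy measurements are all assumptions made for the example.

```python
import numpy as np


def predict_1nn(X_train, y_train, X_test):
    """Label each test row with the class of its closest training row."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    squared_distances = (diffs ** 2).sum(axis=-1)
    # Index of the nearest training row for every test row.
    nearest = squared_distances.argmin(axis=1)
    return y_train[nearest]


def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)))


# Made-up measurements for illustration only (not the real Iris data).
X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])
y_train = np.array(["setosa", "setosa", "versicolor", "versicolor"])
X_test = np.array([[5.0, 3.4], [6.5, 3.0], [5.6, 2.9]])
y_test = np.array(["setosa", "versicolor", "versicolor"])

y_pred = predict_1nn(X_train, y_train, X_test)
score = accuracy(y_pred, y_test)
```

With this toy data the third test row sits closest to a setosa training row, so it is misclassified and the accuracy comes out at two out of three.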

related to: kedro-org/kedro#1393

Development Notes

Remade the pandas-iris starter through kedro new on Kedro 0.18.0

How has this been tested?

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Assigned myself to the PR
Added tests to cover my changes

antonymilne

At a glance this looks great, huge improvement on what we had before! Would be good to get @noklam to give another independent opinion on it I think.

Given that there's only one pipeline, I wonder whether we should make this even simpler by removing the data_science pipeline folder altogether. So it would look like this:

{{ cookiecutter.repo_name }}/src
├── {{ cookiecutter.python_package }}
│   ├── hooks.py
│   ├── __init__.py
│   ├── __main__.py
│   ├── nodes.py
│   ├── pipeline.py
│   ├── pipeline_registry.py
│   ├── README.md
│   └── settings.py

and similarly for un-nesting conf and tests.

This is actually what the Kedro template used to look like before modular pipelines. The advantage is that, since this is meant to be a very basic introductory example, a flat layout would make the project structure easier to understand. We already have the spaceflights tutorial, which goes to the next step with multiple pipelines. Curious what others think of this idea.

More detailed comments to follow, but just a few initial things that sprang out at an initial scan through:

Did you generate this with Kedro 0.18? I ask because I don't think hooks.py should be there any more.
I think you'll need to pull from main and resolve some conflicts: several of the changes in the diff are probably already in main after we released 0.18.
You'll need to make similar changes to the pyspark-iris starter. It is probably easiest to leave that until we've finalised pandas-iris, though, since it will be almost exactly the same.
Was deleting .gitignore intentional? See "Missing .gitignore when using starters" (#56). Whatever .gitignore file is generated by kedro new should be in the starters too.
y_train and y_test will be pd.Series rather than pd.DataFrame. A Series is what you get if you take a single column of a DataFrame.
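The last point, about Series versus DataFrame, can be checked with a short snippet. This is only a sketch: the column names and rows are invented for illustration and are not taken from the starter.

```python
import pandas as pd

# Made-up rows with Iris-like column names (illustrative only).
data = pd.DataFrame(
    {
        "sepal_length": [5.1, 6.7, 6.3],
        "petal_length": [1.4, 4.7, 6.0],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

# Selecting with a single label returns a pd.Series.
y_series = data["species"]

# Selecting with a list of labels returns a one-column pd.DataFrame.
y_frame = data[["species"]]

# Dropping the target leaves the features as a pd.DataFrame.
X = data.drop(columns="species")
```

So type hints for the target outputs of a split function would be `pd.Series` whenever the target is selected with a single column label.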

...cutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/pipeline.py

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/settings.py

...iecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/README.md

SajidAlamQB · 2022-04-07T10:11:46Z

I remade the starter with Kedro 0.18.0; hooks.py may have slipped in while I was copying some files over, and I have now removed it. I also added .gitignore back in. Once we are happy with pandas-iris I will repeat the changes for pyspark-iris (I may do this as a separate PR to prevent this one from getting too bloated). Also note that we have to update the Kedro docs for the Iris dataset example project.

merelcht

I really like how much simpler the starter looks already! 🤩 There are still some mentions of data_science and ds, so don't forget to update those.

pandas-iris/.gitignore

pandas-iris/{{ cookiecutter.repo_name }}/conf/base/catalog.yml

pandas-iris/{{ cookiecutter.repo_name }}/src/tests/test_pipeline.py

pandas-iris/{{ cookiecutter.repo_name }}/src/tests/test_run.py

...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py

antonymilne · 2022-04-07T15:25:47Z

As per @noklam's comments that there's not really any model training going on, I wonder if we should go for the following terminology for function names, etc.:

split_data
make_predictions
report_accuracy
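To make the proposed terminology concrete, here is one way the three functions could fit together. It is a hedged sketch rather than the starter's actual code: the parameter names (`train_fraction`, `random_state`, `target_column`), the return types, and the toy data are assumptions, and the prediction step uses a compact 1-nearest neighbour lookup as described in the PR summary.

```python
import logging
from typing import Tuple

import pandas as pd


def split_data(
    data: pd.DataFrame, train_fraction: float, random_state: int, target_column: str
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Split rows into train and test sets, then separate features from target."""
    train = data.sample(frac=train_fraction, random_state=random_state)
    test = data.drop(train.index)
    X_train = train.drop(columns=target_column)
    X_test = test.drop(columns=target_column)
    return X_train, X_test, train[target_column], test[target_column]


def make_predictions(
    X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.Series
) -> pd.Series:
    """Predict each test row's label from its nearest training row (no training step)."""
    train_values = X_train.to_numpy(dtype=float)
    test_values = X_test.to_numpy(dtype=float)
    # Squared Euclidean distance between every test row and every training row.
    diffs = test_values[:, None, :] - train_values[None, :, :]
    nearest = (diffs ** 2).sum(axis=-1).argmin(axis=1)
    return pd.Series(y_train.to_numpy()[nearest], index=X_test.index, name=y_train.name)


def report_accuracy(y_pred: pd.Series, y_test: pd.Series) -> float:
    """Log and return the share of correct predictions."""
    accuracy = float((y_pred == y_test).mean())
    logging.getLogger(__name__).info("Model accuracy on test data: %.3f", accuracy)
    return accuracy


# Made-up data for illustration only.
data = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.0, 6.7, 6.3, 6.5],
        "sepal_width": [3.5, 3.0, 3.4, 3.1, 2.5, 3.0],
        "species": ["setosa"] * 3 + ["versicolor"] * 3,
    }
)
X_train, X_test, y_train, y_test = split_data(data, 0.5, 3, "species")
y_pred = make_predictions(X_train, X_test, y_train)
acc = report_accuracy(y_pred, y_test)
```

The names describe what each step does without implying that a model is being fitted, which is the point raised in the comment above.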

pandas-iris/{{ cookiecutter.repo_name }}/src/tests/test_run.py

AhdraMeraliQB

Nicely done!

antonymilne

This is awesome, great work! Lots of small suggestions to do before merging but nothing big.

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/README.md

pandas-iris/{{ cookiecutter.repo_name }}/conf/base/parameters.yml

pandas-iris linear regression example

e9811e8

SajidAlamQB self-assigned this Apr 5, 2022

SajidAlamQB added 3 commits April 5, 2022 10:08

Update nodes.py

3db40a2

formatting

8ec802c

revamped pandas-iris example

25f3169

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

SajidAlamQB requested review from antonymilne and merelcht April 6, 2022 22:09

SajidAlamQB marked this pull request as ready for review April 6, 2022 22:34

SajidAlamQB added 3 commits April 6, 2022 23:57

Updated readme and images to reflect changes

b972ad2

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

lint

fbd342a

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

update pipeline readme

641a8c7

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

antonymilne requested a review from noklam April 7, 2022 08:14

antonymilne reviewed Apr 7, 2022

noklam reviewed Apr 7, 2022

...cutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/pipeline.py (outdated)

noklam reviewed Apr 7, 2022

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/settings.py

noklam reviewed Apr 7, 2022

...iecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/README.md (outdated)

Changes based on review

903f685

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Merge branch 'main' into feat/simplify-iris-dataset-starters-example-…

8bdc76e

…code

SajidAlamQB requested a review from antonymilne April 7, 2022 10:50

SajidAlamQB added 3 commits April 7, 2022 12:03

lint

7193bb1

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

lint

305004f

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

doc changes based on review

c165bcd

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

SajidAlamQB requested review from noklam and AhdraMeraliQB April 7, 2022 11:44

merelcht reviewed Apr 7, 2022

SajidAlamQB added 2 commits April 7, 2022 14:11

changes based on review

53553dc

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Update pipeline_registry.py

2978874

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

antonymilne reviewed Apr 7, 2022

...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py (outdated)

antonymilne reviewed Apr 7, 2022

...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py (outdated)

antonymilne reviewed Apr 7, 2022

...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py (outdated)

antonymilne reviewed Apr 7, 2022

pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py (outdated)

SajidAlamQB added 2 commits April 7, 2022 16:49

changes based on review

d5668a9

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

rename node functions

0a6aa52

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

SajidAlamQB changed the title ~~Simplify iris dataset starters example.~~ Simplify pandas-iris starter example Apr 7, 2022

lint

7ac2024

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

SajidAlamQB requested a review from antonymilne April 7, 2022 16:40

noklam reviewed Apr 7, 2022

pandas-iris/{{ cookiecutter.repo_name }}/src/tests/test_run.py (outdated)

noklam approved these changes Apr 7, 2022

changes based on review

a323759

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

AhdraMeraliQB approved these changes Apr 8, 2022

SajidAlamQB mentioned this pull request Apr 8, 2022

Simplify Iris dataset example project kedro-org/kedro#1426

Merged

5 tasks

antonymilne approved these changes Apr 11, 2022

SajidAlamQB added 2 commits April 11, 2022 13:06

changes based on review

346b4c9

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

lint

db35f88

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

SajidAlamQB merged commit 0bbf4be into main Apr 11, 2022

SajidAlamQB deleted the feat/simplify-iris-dataset-starters-example-code branch April 11, 2022 12:17

SajidAlamQB mentioned this pull request Apr 12, 2022

Simplify pyspark-iris starter #82

Merged

3 tasks

SajidAlamQB mentioned this pull request Apr 22, 2022

Apply same simplification from pandas-iris to pyspark-iris starter #85

Closed
