-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify pandas-iris starter example #79
Simplify pandas-iris starter example #79
Conversation
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a glance this looks great, huge improvement on what we had before! Would be good to get @noklam to give another independent opinion on it I think.
Given that there's only one pipeline, I wonder whether we should make this even simpler by removing the data_science pipeline folder altogether. So it would look like this:
{{ cookiecutter.repo_name }}/src
├── {{ cookiecutter.python_package }}
│ ├── hooks.py
│ ├── __init__.py
│ ├── __main__.py
│ ├── nodes.py
│ ├── pipeline.py
│ ├── pipeline_registry.py
│ ├── README.md
│ └── settings.py
and similarly for un-nesting conf
and tests
.
This is actually what the kedro template used to look like before modular pipelines. Advantage of it is that this is meant to be a very basic introductory example and it would make the project structure easier to understand. We already have the spaceflights tutorial which goes to the next step with multiple pipelines. Curious what others think of this idea.
More detailed comments to follow, but just a few initial things that sprang out at an initial scan through:
- did you generate this with kedro 0.18? Because I don't think hooks.py should be there any more
- I think you'll need to pull from
main
and resolve some conflicts - several of the changes in the diff I think are probably already in main after we released 0.18 - you'll need to make similar changes to the
pyspark-iris
starter. Probably easiest to leave that for when we've finalisedpandas-iris
though, since it will be almost exactly the same - was deleting .gitignore intentional? See Missing .gitignore when using starters #56. Whatever
.gitignore
file is generated bykedro new
should be in the starters too y_train
andy_test
will bepd.Series
rather thanpd.DataFrame
. A series is what you get if you take a single column of a dataframe
...cutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/pipeline.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/settings.py
Show resolved
Hide resolved
...iecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipelines/data_science/README.md
Outdated
Show resolved
Hide resolved
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
I remade the starter with kedro 0.18.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how much simpler the starter looks already! 🤩 There are still some mentions to data_science
and ds
, so don't forget to update those.
pandas-iris/{{ cookiecutter.repo_name }}/src/tests/test_pipeline.py
Outdated
Show resolved
Hide resolved
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py
Outdated
Show resolved
Hide resolved
...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py
Outdated
Show resolved
Hide resolved
...iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
As per @noklam's comments that there's not really any model training going on, I wonder if we should go for the following terminology for function names, etc.:
|
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, great work! Lots of small suggestions to do before merging but nothing big.
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/nodes.py
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/README.md
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/README.md
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/README.md
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/README.md
Outdated
Show resolved
Hide resolved
pandas-iris/{{ cookiecutter.repo_name }}/conf/base/parameters.yml
Outdated
Show resolved
Hide resolved
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Motivation and Context
Currently, the pandas-iris starter and documentation contain four nodes split across two modular pipelines, and the hand-crafted linear regression data science algorithm is quite complicated. I think this example should be simpler than spaceflights, not more complicated.
Opted to convert the examples to linear regression to find the coefficient of determination betweensepal length
and other features.Implemented a simple 1-nearest neighbour classifier and calculated its accuracy.
related to: kedro-org/kedro#1393
Development Notes
Remade the
pandas-iris
starter throughkedro new
on Kedro 0.18.0How has this been tested?
Checklist