Simplify pandas-iris starter example #79

Merged
merged 20 commits into main from feat/simplify-iris-dataset-starters-example-code
Apr 11, 2022

Conversation

SajidAlamQB
Contributor

@SajidAlamQB SajidAlamQB commented Apr 5, 2022

Motivation and Context

Currently, the pandas-iris starter and documentation contain four nodes split across two modular pipelines, and the hand-crafted linear regression data science algorithm is quite complicated. I think this example should be simpler than spaceflights, not more complicated.

Opted to convert the examples to linear regression to find the coefficient of determination between sepal length and other features.

Implemented a simple 1-nearest neighbour classifier and calculated its accuracy.
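
For readers unfamiliar with the technique, here is a minimal standard-library sketch of a 1-nearest-neighbour classifier and its accuracy. It is not the starter's code: the function names `predict_1nn` and `accuracy` and the toy measurements are made up for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def predict_1nn(train_X, train_y, test_X):
    """Label each test point with the label of its closest training point."""
    predictions = []
    for x in test_X:
        # Index of the training point with the smallest distance to x.
        nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
        predictions.append(train_y[nearest])
    return predictions


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy measurements, invented for this example (not the real iris data).
train_X = [(5.0, 3.4), (6.7, 3.1), (5.1, 3.5), (6.3, 2.9)]
train_y = ["setosa", "versicolor", "setosa", "versicolor"]
test_X = [(5.2, 3.6), (6.5, 3.0), (5.8, 3.2)]
test_y = ["setosa", "versicolor", "setosa"]

predicted = predict_1nn(train_X, train_y, test_X)
# 2 of the 3 toy test points are classified correctly.
print(accuracy(predicted, test_y))
```

There is no fitting step here: the "model" is just the training data, and prediction is a lookup of the closest row.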

related to: kedro-org/kedro#1393

Development Notes

Remade the pandas-iris starter through kedro new on Kedro 0.18.0

How has this been tested?

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Assigned myself to the PR
  • Added tests to cover my changes

@SajidAlamQB SajidAlamQB self-assigned this Apr 5, 2022
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB marked this pull request as ready for review April 6, 2022 22:34
@antonymilne antonymilne requested a review from noklam April 7, 2022 08:14
Contributor

@antonymilne antonymilne left a comment

At a glance this looks great, huge improvement on what we had before! Would be good to get @noklam to give another independent opinion on it I think.

Given that there's only one pipeline, I wonder whether we should make this even simpler by removing the data_science pipeline folder altogether. So it would look like this:

{{ cookiecutter.repo_name }}/src
├── {{ cookiecutter.python_package }}
│   ├── hooks.py
│   ├── __init__.py
│   ├── __main__.py
│   ├── nodes.py
│   ├── pipeline.py
│   ├── pipeline_registry.py
│   ├── README.md
│   └── settings.py

and similarly for un-nesting conf and tests.

This is actually what the kedro template used to look like before modular pipelines. The advantage is that, since this is meant to be a very basic introductory example, a flat layout would make the project structure easier to understand. We already have the spaceflights tutorial, which goes to the next step with multiple pipelines. Curious what others think of this idea.

More detailed comments to follow, but just a few initial things that sprang out at an initial scan through:

  • did you generate this with kedro 0.18? Because I don't think hooks.py should be there any more
  • I think you'll need to pull from main and resolve some conflicts - several of the changes in the diff I think are probably already in main after we released 0.18
  • you'll need to make similar changes to the pyspark-iris starter. Probably easiest to leave that for when we've finalised pandas-iris though, since it will be almost exactly the same
  • was deleting .gitignore intentional? See #56 (Missing .gitignore when using starters). Whatever .gitignore file is generated by kedro new should be in the starters too
  • y_train and y_test will be pd.Series rather than pd.DataFrame. A series is what you get if you take a single column of a dataframe
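
To illustrate the last point, a small pandas sketch of the difference between selecting one column by name and selecting a list of columns. The toy frame and the variable names `single` and `nested` are made up for this example.

```python
import pandas as pd

# Toy frame, invented for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 6.7, 5.8],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

single = df["species"]    # one column selected by name: a pd.Series
nested = df[["species"]]  # a list of column names: a one-column pd.DataFrame

print(type(single).__name__)  # Series
print(type(nested).__name__)  # DataFrame
```

So if the split takes the target as a single column, the type hints for y_train and y_test would be pd.Series.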

@SajidAlamQB
Contributor Author

SajidAlamQB commented Apr 7, 2022

I remade the starter with kedro 0.18.0; hooks.py may have slipped in while I was copying some files over, and I have now removed it. I have also added .gitignore back in. Once we are happy with pandas-iris I will repeat the changes for pyspark-iris (possibly as a separate PR to keep this one from getting too bloated). Note that we also have to update the Kedro docs for the Iris dataset example project.

Member

@merelcht merelcht left a comment

I really like how much simpler the starter looks already! 🤩 There are still some mentions of data_science and ds, so don't forget to update those.

@antonymilne
Contributor

As per @noklam's comments that there's not really any model training going on, I wonder if we should go for the following terminology for function names, etc.:

  • split_data
  • make_predictions
  • report_accuracy
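
A rough sketch of how those three names could map onto plain functions, to show the data flow rather than the final code. The signatures, the parameters (`target_column`, `train_fraction`, `random_state`) and the toy data are assumptions made for this illustration; the only things taken from the list above are the function names.

```python
import pandas as pd


def split_data(data, target_column, train_fraction=0.8, random_state=0):
    """Randomly split rows into train/test features and targets."""
    train = data.sample(frac=train_fraction, random_state=random_state)
    test = data.drop(train.index)
    X_train = train.drop(columns=target_column)
    X_test = test.drop(columns=target_column)
    # Selecting a single column yields pd.Series targets.
    return X_train, X_test, train[target_column], test[target_column]


def make_predictions(X_train, X_test, y_train):
    """1-nearest neighbour: copy the label of the closest training row."""
    train = X_train.to_numpy(dtype=float)
    test = X_test.to_numpy(dtype=float)
    # Squared Euclidean distances, shape (n_test, n_train), via broadcasting.
    squared = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    nearest = squared.argmin(axis=1)
    return pd.Series(
        y_train.to_numpy()[nearest], index=X_test.index, name=y_train.name
    )


def report_accuracy(y_pred, y_test):
    """Return the fraction of test rows predicted correctly."""
    return float((y_pred == y_test).mean())


# Toy data, invented for illustration (not the real iris dataset).
data = pd.DataFrame(
    {
        "sepal_length": [5.0, 5.1, 4.9, 6.7, 6.5, 6.6],
        "petal_length": [1.4, 1.5, 1.4, 4.7, 4.6, 4.4],
        "species": ["setosa"] * 3 + ["versicolor"] * 3,
    }
)
X_train, X_test, y_train, y_test = split_data(data, "species", train_fraction=0.5)
y_pred = make_predictions(X_train, X_test, y_train)
print(report_accuracy(y_pred, y_test))
```

Nothing is fitted in make_predictions, which is why a name like train_model would be misleading. In a Kedro project each function would presumably become one node, with the split parameters coming from configuration.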

@SajidAlamQB SajidAlamQB changed the title from "Simplify iris dataset starters example." to "Simplify pandas-iris starter example" Apr 7, 2022
Contributor

@AhdraMeraliQB AhdraMeraliQB left a comment

Nicely done!

Contributor

@antonymilne antonymilne left a comment

This is awesome, great work! Lots of small suggestions to do before merging but nothing big.

@SajidAlamQB SajidAlamQB merged commit 0bbf4be into main Apr 11, 2022
@SajidAlamQB SajidAlamQB deleted the feat/simplify-iris-dataset-starters-example-code branch April 11, 2022 12:17
@SajidAlamQB SajidAlamQB mentioned this pull request Apr 12, 2022
3 tasks
5 participants