
Spike: how to handle overlap in example spaceflights projects? #2874

Closed
merelcht opened this issue Jul 31, 2023 · 5 comments

@merelcht
Member

Description

Follow-up to #2758 and #2838

We should look at how to deal with overlap across the spaceflights projects. Can we somehow combine them to lessen the maintenance burden?

Context

In #2838 we'll add several new spaceflights-based projects, which will also serve as the examples a user can add to a project at creation time with the new utilities flow. These examples will all likely share similar files, so the question is: do we need to maintain each complete project, or can we somehow combine them and still give users distinct examples?

Possible Implementation

The aim of this spike is to come up with possible implementations for merged examples.

@merelcht
Member Author

merelcht commented Aug 17, 2023

The different starters we need are:

  • spaceflights based on pandas (the existing spaceflights starter)
  • spaceflights based on pyspark
  • spaceflights based on pandas with viz features enabled
  • spaceflights based on pyspark with viz features enabled
  • spaceflights with Databricks setup (maybe)
  • spaceflights with Airflow setup (maybe)

I've mapped out the differences between these various projects: a + prefix means a change is required in that file, and a ⭐️ marks a new file that needs to be added.

Spaceflights Pandas -> Spaceflights Pyspark

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
│   │   ├── settings.py
│   ├── tests
│   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Pandas Viz

Viz features added: experiment tracking, plotting with Plotly, and plotting with Matplotlib

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Pyspark Viz

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Based on the above, the only obvious candidates for merging are the Pyspark and Databricks examples. The other combinations require so many changes that any reduction in starter maintenance burden would be offset by added complexity in the logic that pulls the correct example into a user's project during the `kedro new` selection flow.
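To make the idea concrete, one way a merged Pyspark/Databricks starter could avoid duplication is a cookiecutter post-generation hook that strips the Databricks-only files when the user opts out. The sketch below is illustrative only: the `include_databricks` option and the helper function are hypothetical, not an existing kedro-starters feature, and the file list mirrors the trees above.

```python
# Hypothetical cookiecutter post-gen hook (e.g. hooks/post_gen_project.py):
# remove Databricks-only files from a merged pyspark/databricks starter
# when the user does not want the Databricks setup.
import os

# Files that exist only in the Databricks variant (see the trees above).
DATABRICKS_ONLY_FILES = [
    "src/spaceflights/databricks_run.py",
]


def remove_databricks_files(project_root: str) -> list:
    """Delete Databricks-only files under project_root.

    Returns the relative paths that were actually removed.
    """
    removed = []
    for rel_path in DATABRICKS_ONLY_FILES:
        full_path = os.path.join(project_root, rel_path)
        if os.path.exists(full_path):
            os.remove(full_path)
            removed.append(rel_path)
    return removed


# In a real cookiecutter hook this would run inside the freshly generated
# project directory, gated on a templated option such as:
# if "{{ cookiecutter.include_databricks }}" != "yes":
#     remove_databricks_files(".")
```

The trade-off is exactly the one described above: the starter repo shrinks, but the generation logic grows.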

@amandakys

Based on these findings, would you recommend merging the Pyspark and Databricks examples then? Which projects would we expose to users in the new starter repo?

When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. there's no need for the user to know that projects were constructed using cookiecutter, if that's what we choose.

How are we managing files that are the same across starters? Maybe we could have an internal starter template, so that in future all starters share the same core, avoiding the problem we had before.

@merelcht
Member Author

> Based on these findings - would you recommend merging the Pyspark and Databricks example then?

Yes, that one is easy to merge and also de-duplicate.

> What projects would we expose to the users in the new Starter repo?

  1. "Vanilla" spaceflights. This is the existing spaceflights project based on pandas.
  2. Spaceflights with Viz features
  3. Pyspark spaceflights, which also includes Databricks setup.
  4. Pyspark spaceflights with Viz features

> When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter if that's what we chose.
> How are we managing files that are the same across starters? Maybe we can have a starter template internally so in future we ensure starters all have the same core, avoiding the problem we had before.

I think that's basically the "vanilla" spaceflights. If we ever create a new starter it should just be based on that.

@merelcht
Member Author

Conclusion

My recommendation is to merge only the Pyspark and Databricks starters and keep the rest separate. This means we need to create:

  1. spaceflights based on pandas (the existing spaceflights starter)
  2. spaceflights based on pyspark + databricks
  3. spaceflights based on pandas with viz features enabled
  4. spaceflights based on pyspark with viz features enabled
  5. spaceflights with Airflow setup (maybe)
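The mapping from a user's choices in the `kedro new` flow onto the starters above could be a small lookup. The sketch below is illustrative only; the starter aliases (e.g. `spaceflights-pandas-viz`) are assumed names for this issue's recommendation, not confirmed published starters.

```python
# Hypothetical selection logic: pick one of the recommended starters
# from the user's framework and viz choices in the `kedro new` flow.
def pick_starter(framework: str, viz: bool) -> str:
    """Return an assumed starter alias for the chosen framework and viz option."""
    if framework not in ("pandas", "pyspark"):
        raise ValueError(f"unsupported framework: {framework!r}")
    alias = f"spaceflights-{framework}"
    if viz:
        alias += "-viz"
    return alias
```

For example, `pick_starter("pyspark", True)` returns `"spaceflights-pyspark-viz"`; under this recommendation the plain `"spaceflights-pyspark"` variant would also carry the Databricks setup.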
