
Spike: how to handle overlap in example spaceflights projects? #2874

Closed
merelcht opened this issue Jul 31, 2023 · 5 comments

@merelcht
Member

Description

Follow-up to #2758 and #2838

We should look at how to deal with overlap across the spaceflights projects. Can we somehow combine them to lessen the maintenance burden?

Context

In #2838 we'll add several new spaceflights-based projects, which will also serve as the examples a user can add to a project at creation time with the new utilities flow. These examples will all likely share similar files, so the question is: do we need to maintain each complete project, or can we somehow combine them and still give users distinct examples?

Possible Implementation

The aim of this spike is to come up with possible implementations for merged examples.

@merelcht
Member Author

merelcht commented Aug 17, 2023

The different starters we need are:

  • spaceflights based on pandas (the existing spaceflights starter)
  • spaceflights based on pyspark
  • spaceflights based on pandas with viz features enabled
  • spaceflights based on pyspark with viz features enabled
  • spaceflights with Databricks setup (maybe)
  • spaceflights with Airflow setup (maybe)

I've mapped out the differences between these various projects: a + prefix means a change is required in that file, and a ⭐️ marks a new file that needs to be added.

Spaceflights Pandas -> Spaceflights Pyspark

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
│   │   ├── settings.py
│   ├── tests
│   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Pandas Viz

Viz features added: experiment tracking, plotting with Plotly, and plotting with Matplotlib

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Pyspark Viz

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Based on the above, the only obvious candidates for merging are the Pyspark and Databricks examples. The other combinations require so many changes that any reduction in starter maintenance burden would be offset by added complexity in the logic that pulls the correct example into a user's project during the `kedro new` selection flow.
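To make the idea concrete, one way a merged Pyspark/Databricks starter could avoid duplication is a cookiecutter post-generation hook that strips the Databricks-only files when the user opts out. The sketch below is illustrative only: the `include_databricks` option and the helper function are hypothetical, not an existing kedro-starters feature, and the file list mirrors the trees above.

```python
# Hypothetical cookiecutter post-gen hook (e.g. hooks/post_gen_project.py):
# remove Databricks-only files from a merged pyspark/databricks starter
# when the user does not want the Databricks setup.
import os

# Files that exist only in the Databricks variant (see the trees above).
DATABRICKS_ONLY_FILES = [
    "src/spaceflights/databricks_run.py",
]


def remove_databricks_files(project_root: str) -> list:
    """Delete Databricks-only files under project_root.

    Returns the relative paths that were actually removed.
    """
    removed = []
    for rel_path in DATABRICKS_ONLY_FILES:
        full_path = os.path.join(project_root, rel_path)
        if os.path.exists(full_path):
            os.remove(full_path)
            removed.append(rel_path)
    return removed


# In a real cookiecutter hook this would run inside the freshly generated
# project directory, gated on a templated option such as:
# if "{{ cookiecutter.include_databricks }}" != "yes":
#     remove_databricks_files(".")
```

The trade-off is exactly the one described above: the starter repo shrinks, but the generation logic grows.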

@amandakys

Based on these findings, would you recommend merging the Pyspark and Databricks examples then? Which projects would we expose to users in the new starter repo?

When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. there's no need for the user to know that projects were constructed using cookiecutter, if that's what we choose.

How are we managing files that are the same across starters? Maybe we could have an internal starter template, so that in future all starters share the same core, avoiding the problem we had before.

@merelcht
Member Author

> Based on these findings - would you recommend merging the Pyspark and Databricks example then?

Yes, that one is easy to merge and also de-duplicate.

> What projects would we expose to the users in the new Starter repo?

  1. "Vanilla" spaceflights. This is the existing spaceflights project based on pandas.
  2. Spaceflights with Viz features
  3. Pyspark spaceflights, which also includes Databricks setup.
  4. Pyspark spaceflights with Viz features

> When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter if that's what we chose.
> How are we managing files that are the same across starters? Maybe we can have a starter template internally so in future we ensure starters all have the same core, avoiding the problem we had before.

I think that's basically the "vanilla" spaceflights. If we ever create a new starter it should just be based on that.

@merelcht
Member Author

Conclusion

My recommendation is to merge only the Pyspark and Databricks starters and keep the rest separate. This means we need to create:

  1. spaceflights based on pandas (the existing spaceflights starter)
  2. spaceflights based on pyspark + databricks
  3. spaceflights based on pandas with viz features enabled
  4. spaceflights based on pyspark with viz features enabled
  5. spaceflights with Airflow setup (maybe)
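The mapping from a user's choices in the `kedro new` flow onto the starters above could be a small lookup. The sketch below is illustrative only; the starter aliases (e.g. `spaceflights-pandas-viz`) are assumed names for this issue's recommendation, not confirmed published starters.

```python
# Hypothetical selection logic: pick one of the recommended starters
# from the user's framework and viz choices in the `kedro new` flow.
def pick_starter(framework: str, viz: bool) -> str:
    """Return an assumed starter alias for the chosen framework and viz option."""
    if framework not in ("pandas", "pyspark"):
        raise ValueError(f"unsupported framework: {framework!r}")
    alias = f"spaceflights-{framework}"
    if viz:
        alias += "-viz"
    return alias
```

For example, `pick_starter("pyspark", True)` returns `"spaceflights-pyspark-viz"`; under this recommendation the plain `"spaceflights-pyspark"` variant would also carry the Databricks setup.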
