Ab/update roadmap may2024 #56

Open
wants to merge 4 commits into
base: master
127 changes: 48 additions & 79 deletions README.md
@@ -1,122 +1,91 @@
<p align="center"><img src="pangeo-forge-logo-blue.png" /></p>

# Pangeo Forge public roadmap
# Pangeo Forge Public Roadmap

In this repository, you can find the Pangeo Forge project roadmap.
The roadmap is where you can learn about the Pangeo Forge project, its subprojects, how they fit together, and the road ahead.
Pangeo Forge is just getting started so please open [issues](https://github.com/pangeo-forge/roadmap/issues) to ask questions or to propose changes and/or additions to the roadmap itself.
🆕 Updated May 2024 🆕

In this repository, you can find a high-level overview of the Pangeo Forge project, its subprojects, how they fit together, and the road ahead.
Pangeo Forge is a community-driven project, so please open [issues](https://github.com/pangeo-forge/roadmap/issues) to ask questions or to propose changes and/or additions to the roadmap itself.
Pangeo Forge has grown out of the [Pangeo Project](http://pangeo.io/), an open-source community promoting open, reproducible, and scalable science.

## Inspiration

Pangeo Forge is inspired to copy the very successful pattern of [Conda Forge](https://conda-forge.org/).
Pangeo Forge is inspired by the very successful pattern of [Conda Forge](https://conda-forge.org/).
Conda Forge makes it easy for anyone to create a [conda package](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/packages.html), a binary software package that can be installed with the conda package manager.
In Conda Forge, a maintainer contributes [a recipe](https://conda-forge.org/#add_recipe) which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository.
By automating the difficult parts of package creation, Conda Forge has enabled the open-source community to collaboratively maintain a huge and dynamic library of software packages.

## Vision

Pangeo Forge aspires to be like Conda Forge, but for data--specifically, Analysis Ready, Cloud Optimized (ARCO) data.
(For a detailed working definiton of ARCO data, see our paper [Cloud Native Repositories for Big Scientific Data](https://ieeexplore.ieee.org/abstract/document/9354557).)
We envision a vibrant, dynamic library of open-access ARCO data stored in public clouds, shared among thousands of scientists and directly accessible to data-proximate computing.
However, manually populating such a library would be prohibitively difficult and tedious.
Instead, we are building Pangeo Forge to automate the production of ARCO data and enable the croudsourcing of such a data library.

In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, Pangeo Forge cloud-based automation downloads the original files from their source (e.g. FTP, HTTP, or OpenDAP), combines them into one coherent dataset (e.g. using xarray), and writes the data in a cloud optimized format (e.g. Zarr) to cloud storage in a streaming fashion.

## Technical Concepts and Architecture

:exclamation: **Warning!** Pangeo Forge doesn't actually "work" yet. The integration and development of these components is a work in progress.

### Recipes

A recipe defines how to transform data in one format / location into another format / location.
The primary way people contribute to Pangeo Forge is by writing / maintaining recipes.
Recipes are python objects generated by the [pangeo_forge](https://pangeo-forge.readthedocs.io/en/latest/) package.
These recipes can be used in a standalone fashion, without integration with the Pangeo Forge cloud automation infrastructure.
Or they can be turned into feedstocks and become part of the library.
Pangeo Forge aspires to be like Conda Forge, but for data — specifically, for Analysis Ready, Cloud Optimized (ARCO) data. For a detailed working definition of ARCO data, see our paper [Cloud Native Repositories for Big Scientific Data](https://ieeexplore.ieee.org/abstract/document/9354557).

### Feedstocks

Feedstocks are recipes that are managed and executed by Pangeo Forge cloud automation.
Feedstocks are stored in GitHub repositories in the [pangeo-forge GitHub organization](https://github.com/pangeo-forge/).
The community develops and maintains recipes through interaction with these repositories.

### Bakeries

Bakeries turn recipes into data.
They do the heavy lifting of actually executing the recipes: extracting data from its source, transforming it, and loading it into its target destination.
Bakeries are controlled by triggers from GitHub workflows.
Bakeries can run in cloud or on-premises compute nodes; they should be placed in close network proximity to data sources and / or targets.
We hope that eventually there will be Pangeo Forge bakeries running in most regions of major cloud providers.

![diagram](pangeo-forge-diagram.png)
We envision a vibrant, dynamic library of open-access ARCO data stored in public clouds, shared among thousands of scientists and directly accessible to data-proximate computing.

Review comment (PR author): I like keeping this as the vision, even if the current implementation is a bit more distributed in terms of ownership.


However, manually populating such a library would be prohibitively difficult and tedious. Instead, we are building Pangeo Forge to automate the production of ARCO data and enable the crowdsourcing of such a data library.

## Subprojects
Execution environments, managed by different institutions, automate and scale the pipeline: downloading the original files from their source (e.g. FTP, HTTP, or OpenDAP), combining them into one coherent dataset (e.g. using xarray), and writing the data in a cloud-optimized format (e.g. Zarr) to cloud storage in a streaming fashion. Object storage may also serve as the data source, in which case the download step may be skipped when the execution environment is deployed in the same cloud region as the data.

Pangeo Forge brings together a number of smaller subprojects to implement this vision.
The currently-active subprojects are
## Technical Concepts and Architecture

### pangeo-forge
Pangeo Forge’s high-level infrastructure includes:

<https://github.com/pangeo-forge/pangeo-forge>
1. Recipes
2. Execution environments

![CI](https://github.com/pangeo-forge/pangeo-forge/workflows/CI/badge.svg)
![Codecov](https://img.shields.io/codecov/c/github/pangeo-forge/pangeo-forge)
[![Documentation Status](https://readthedocs.org/projects/pangeo-forge/badge/?version=latest)](https://pangeo-forge.readthedocs.io/en/latest/?badge=latest)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
### Recipes

The `pangeo_forge` python package provides the core API for creating Recipes.
All of the "business logic" for how to extract, transform, and load data lives in this library; as such, it is the focal point of Pangeo Forge development.
In Pangeo Forge, recipes generate analysis-ready cloud-based copies of datasets in a cloud-optimized format like Zarr. A recipe defines how to transform data in one format + location into another format + location. Recipes reduce the technical burden on scientists, enabling them to contribute without needing in-depth knowledge of cloud infrastructure.

### staged-recipes
The standalone Python package [`pangeo-forge-recipes`](https://github.com/pangeo-forge/pangeo-forge-recipes) consists of configurable and chainable algorithms for ARCO data production. Recipes can be run standalone, without an execution environment, or deployed into an execution environment for offline execution and scaling.
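
To illustrate what "configurable and chainable" could mean in practice, here is a minimal, library-free sketch. Every name in it (`chain`, `open_sources`, `combine`, `make_writer`, and the example URLs) is hypothetical and is not the `pangeo-forge-recipes` API; a plain dict stands in for a chunked cloud store.

```python
from typing import Callable, Iterable

# Hypothetical names throughout: this is NOT the pangeo-forge-recipes API.
Stage = Callable[[Iterable], Iterable]


def chain(*stages: Stage) -> Stage:
    """Compose stages left to right into a single callable recipe."""
    def recipe(items: Iterable) -> Iterable:
        for stage in stages:
            items = stage(items)
        return items
    return recipe


def open_sources(urls):
    # Stand-in for fetching each source file; fakes one small record per URL.
    for index, url in enumerate(urls):
        yield {"url": url, "time": index, "values": [index, index + 1]}


def combine(records):
    # Stand-in for concatenating per-file data along a shared dimension.
    ordered = sorted(records, key=lambda record: record["time"])
    yield {
        "time": [record["time"] for record in ordered],
        "values": [value for record in ordered for value in record["values"]],
    }


def make_writer(store: dict, chunk_size: int) -> Stage:
    # Configurable stage: chunk_size controls how values are split into keys.
    def write(datasets):
        for dataset in datasets:
            values = dataset["values"]
            for start in range(0, len(values), chunk_size):
                store[f"values/{start // chunk_size}"] = values[start:start + chunk_size]
            yield store
    return write


store = {}
recipe = chain(open_sources, combine, make_writer(store, chunk_size=2))
# The stages are lazy generators, so the result must be consumed to run them.
list(recipe(["https://example.com/a.nc", "https://example.com/b.nc"]))
```

Because each stage only consumes and produces an iterable, the same chain could in principle be run locally or handed to a distributed runner, which is the separation between recipes and execution environments described in this section.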

<https://github.com/pangeo-forge/staged-recipes>
### Execution Environments (aka Bakeries)

Review comment (PR author): Should we still call them bakeries? Seems like we've moved on to execution environments.

Reply: I prefer Execution Environments as well.


Staged-recipes is a GitHub repository that manages the submission of new Pangeo Forge recipes.
You can think of this as a holding area for new feedstocks.
This repo contains the automation components of Pangeo Forge.
Execution environments (aka "bakeries") are automated systems that execute recipes on distributed infrastructure for scalability. They can be built and configured to run recipes on many frameworks, such as Flink, Spark, and Dask, among others.

### Bakeries
We hope that eventually there will be one or more Pangeo Forge bakeries for all major cloud providers.

Bakery deployments are being developed for specific cloud providers.
There are a number of execution environments in development at the time of writing. Reach out to the current Pangeo Forge development team (via community meetings, see the Governance section) for the latest details.

- <https://github.com/pangeo-forge/pangeo-forge-aws-bakery>
- A pyspark runner (https://github.com/moradology/beam-pyspark-runner), with an AWS EMR deployment in progress.
- A flink runner: https://github.com/NASA-IMPACT/veda-pforge-job-runner
- It is also possible to use Apache Beam's DirectRunner with its multi_processing option to scale without managed distributed infrastructure.
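
As a rough sketch of the DirectRunner option above, the following helper assembles Apache Beam pipeline flags for local multi-process execution. The helper name `direct_runner_flags` is hypothetical; the flags themselves (`--runner`, `--direct_running_mode`, `--direct_num_workers`) are Beam Python SDK options, but check them against the Beam version you have installed.

```python
import os
from typing import List, Optional


def direct_runner_flags(num_workers: Optional[int] = None) -> List[str]:
    """Build Beam flags for DirectRunner in multi_processing mode.

    Hypothetical helper for illustration; defaults to one worker per CPU.
    """
    workers = num_workers if num_workers is not None else (os.cpu_count() or 1)
    if workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        "--runner=DirectRunner",
        "--direct_running_mode=multi_processing",
        f"--direct_num_workers={workers}",
    ]


flags = direct_runner_flags(num_workers=4)
# With Apache Beam installed, these flags could be passed to
# apache_beam.options.pipeline_options.PipelineOptions(flags).
```

This trades fault tolerance and cluster-scale throughput for simplicity, so it is best suited to testing recipes or processing modest datasets on a single machine.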

### Pangeo Forge Website

<https://github.com/pangeo-forge/pangeo-forge-vue-website>

Review comment (PR author): Should we keep maintaining this site? It needs to be updated.


Once the system is operating, the data library catalog will be viewable on this vue.js website.
However, we aren't quite ready for this front-end work yet.
## Governance

## Milestones
Pangeo Forge is a community-driven project with lots of work to do and lots of room for contributors to engage. Please join one of our community meetings, which are listed on the calendar on the [Pangeo Meeting Schedule and Notes page](https://pangeo.io/meeting-notes.html).

Here we outline some rough milestones we hope to meet.
* Open Pangeo Forge coordination meetings will be held every 2 weeks. The agenda will be flexible to the needs of current Pangeo Forge developers to address open and context-heavy questions as a group. This is a meeting for Pangeo Forge developers to focus on getting pangeo-forge-recipes and the runner to a new stable release.
* Open Pangeo Forge jam sessions will be held once a week. The agenda will focus on troubleshooting and knowledge sharing among Pangeo Forge developers on active development tasks.
* Community recipe development: Previously, all recipes were submitted to staged-recipes. Soon we will have a new method for developing and contributing recipes:
  * There will be a template (GitHub repository template or cookiecutter) for users to get started with a skeleton recipe. Users are encouraged to create a recipe in their own organization. Once the recipe has been developed and tested, users can optionally request to transfer the recipe repo to the official pangeo-forge organization.
  * Questions about recipes and datasets can be asked on those individual repositories.
  * GitHub Discussions in pangeo-forge should be used for higher-level issues.

| Date | Features | KPIs |
|------|----------|------|
| May 1, 2021 | Launch cloud automation | - |
| Nov 1, 2021 | Launch catalog website | Functional recipes from partners GHRSST & iHESP; 20 active users |
| May 1, 2022 | Launch JupyterHub / Binder integration | Functional recipes from partners ECMWF (ERA5) and ESGF (CMIP6); 20 user-contributed recipes; 100 active users |
| May 1, 2023 | - | 100 user-contributed recipes; 500 active users |
## Roadmap May 2024 - October 2024

### Recipes

* We expect to test and document the following functionality:
  * parquet kerchunk reference generation and append (MUR SST)
  * OpenWithXarray supports Zarr
* Methods for validating Zarr stores and chunk manifests will be developed or recommended.

## Contributing
### Execution environments

Pangeo-forge is just getting started. There's lots of work to do and lots of room for contributors to engage.
Overall progress on the project can be tracked via two project boards:
- The [Recipe Implementation project board](https://github.com/pangeo-forge/staged-recipes/projects/1).
This tracks the progress of implementing the recipes outlined in staged-recipes.
- The [software development project board](https://github.com/orgs/pangeo-forge/projects/1) shows the progress of the `pangeo_forge` python package, defines what sort of recipes Pangeo Forge can support.
* Pressure-test and release the pyspark Beam runner by running existing and new recipes, such as MUR SST kerchunk, CMIP6 and GPM IMERG. Documentation will provide instructions on how to deploy and use the pyspark Beam runner.
* The pyspark Beam runner will be cloud-agnostic, but we anticipate maintaining a stable deployment on one cloud provider in the short term.

At this stage, there are a few ways you may consider getting involved.
### Documentation + Governance

1. Scientists and data managers can [document an example recipe](https://github.com/pangeo-forge/staged-recipes/issues/new?assignees=&labels=example&template=example-pipeline.md&title=Example+pipeline+for+%5BDataset+Name%5D). Gathering use cases is very helpful for defining the technical needs of pangeo-forge. You don't have to write any code to do this; you just have to understand the dataset you want to work with.
2. Python software developers can contribute to the code base. The [software development project board](https://github.com/orgs/pangeo-forge/projects/1) is a great place to start.
3. Anyone can comment on the project road map in this repository.
4. Eventually (but not yet), organizations can provide support for operating the bakeries (or run their own).
* [Documentation](https://pangeo-forge.readthedocs.io) will be updated to reflect changes, such as the migration from per-recipe feedstocks for deployment on a single centralized runner (formerly known as a bakery) to decentralized recipe actions and execution environments.
* Move recipes out of staged-recipes or forked repos into their own repos.
* Archive feedstocks in the pangeo-forge organization.
* Create a template for recipe development.
* Migrate pangeo-forge jam sessions to a shared calendar.

------
