This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Preparing Orchestrator for the After (NSF) Times #233

Closed
cisaacstern opened this issue Feb 17, 2023 · 6 comments
Labels
enhancement (New feature or request), priority

Comments

@cisaacstern
Member

cisaacstern commented Feb 17, 2023

The NSF Award supporting the current phase of Pangeo Forge development has a little over a year remaining. So this is an opportune moment to strategize about where we (and I, with my remaining full-time funded effort) want to get Pangeo Forge Orchestrator by the conclusion of this funding cycle. Here is a first pass at some priorities (with big 🙏 to @yuvipanda for so much thoughtful brainstorming on these points):

🔸 State-minimization

Minimizing state managed by Orchestrator reduces the operational and maintenance burden of Pangeo Forge Cloud. A basic assumption here is that the future operator of Pangeo Forge Cloud will value the tradeoff of reduced maintenance and greater ease of federated participation in exchange for some performance cost of decentralization.

Here are forms of state we currently manage and/or deploy from this repo, which we probably can drop:

  • A Postgres database (this can be replaced by relying more heavily on GitHub, including the Check Runs API, which can stand in for our own RecipeRun table).
  • Bakery config (this can exist external to this repository)
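As a rough sketch of how a check run could stand in for a RecipeRun row, here is a helper that builds the JSON body for GitHub's "create a check run" endpoint (`POST /repos/{owner}/{repo}/check-runs`). The function name and the way the run is named are assumptions for illustration; only the payload fields (`name`, `head_sha`, `status`, `conclusion`, `output`) come from the Checks API.

```python
from typing import Optional

# Statuses accepted by the GitHub Checks API for a check run.
VALID_STATUSES = {"queued", "in_progress", "completed"}


def check_run_payload(
    name: str,
    head_sha: str,
    status: str,
    conclusion: Optional[str] = None,
    summary: str = "",
) -> dict:
    """Build the JSON body for creating a check run (hypothetical helper).

    The check run plays the role the RecipeRun table row plays today:
    it records which commit was run and how far the run has progressed.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if status == "completed" and conclusion is None:
        # GitHub requires a conclusion once a check run is completed.
        raise ValueError("a completed check run needs a conclusion")
    payload = {"name": name, "head_sha": head_sha, "status": status}
    if conclusion is not None:
        payload["conclusion"] = conclusion
    if summary:
        # The `output` object requires both a title and a summary.
        payload["output"] = {"title": name, "summary": summary}
    return payload
```

Nothing here talks to the network; the payload would be sent with whatever authenticated client the App already uses.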

Dropping the database has implications for the frontend site https://pangeo-forge.org/, but these can be worked around by either re-imagining what that site is used for (perhaps it does not need a dashboard) and/or having that site populated with data directly from the GitHub API.

On factoring bakery config into separate repos: in earlier iterations, bakery repositories were defined for generic bakery types (AWS, GCP, etc.). As the definition of bakeries shifts to "just a Beam runner", so can the way in which bakery repos are formatted. In order for a bakery repo to be pluggable with Pangeo Forge Cloud, I think it needs to define some contractual set of config that Pangeo Forge Cloud can use to deploy a job to that bakery.
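To make "contractual set of config" a bit more concrete, here is one possible shape for it. Every field name (`agent_url`, `runner`, `public`) is a placeholder I made up for the sake of discussion; the actual contract is still to be decided.

```python
from dataclasses import dataclass

# Keys a bakery repo would have to provide (hypothetical contract).
REQUIRED_KEYS = ("agent_url", "runner")


@dataclass(frozen=True)
class BakeryContract:
    agent_url: str        # hypothetical: where deploy requests are sent
    runner: str           # hypothetical: which Beam runner the bakery wraps
    public: bool = False  # hypothetical: whether anyone may deploy here


def load_contract(raw: dict) -> BakeryContract:
    """Validate already-parsed bakery config against the sketched contract."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"bakery config missing keys: {missing}")
    if not str(raw["agent_url"]).startswith("https://"):
        raise ValueError("agent_url must be an https URL")
    return BakeryContract(
        agent_url=raw["agent_url"],
        runner=raw["runner"],
        public=bool(raw.get("public", False)),
    )
```

Failing loudly on a missing key is the point of the exercise: a bakery repo that does not satisfy the contract should be rejected before any job is deployed.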

🤔 Should Pangeo Forge Cloud make deployments itself? Or perhaps a bakery needs to run its own agent for that purpose? This would be something like https://github.com/pangeo-forge/cloudrun-recipe-handler, from which the core FastAPI service can be factored out to be reusable across various deployment targets?

This points to the possibility that bakeries as defined in meta.yaml would become simply the name of a GitHub repo.
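If that is where this lands, resolving a bakery could be as small as the sketch below. It assumes meta.yaml has already been parsed into a dict and that the bakery is given as a plain `owner/repo` string under a `bakery` key; both the key layout and the example repo name in the comment are hypothetical.

```python
import re

# Rough pattern for an "owner/repo" GitHub repository name.
_REPO_NAME = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def bakery_repo(meta: dict) -> str:
    """Return the bakery repo named in parsed meta.yaml (hypothetical layout).

    Example of the assumed form: {"bakery": "some-org/gcp-bakery"}
    """
    bakery = meta.get("bakery")
    if not isinstance(bakery, str) or not _REPO_NAME.match(bakery):
        raise ValueError(f"bakery must be an 'owner/repo' string, got {bakery!r}")
    return bakery
```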

🔹 What is orchestrator (i.e. Pangeo Forge Cloud)?

In this paradigm, orchestrator, or the core of Pangeo Forge Cloud, becomes "just" a GitHub App which understands a meta.yaml spec and can forward appropriately-formatted requests to the bakery "agent" service.

cc @rabernat

Edit: This is a very rough sketch, but I wanted to get something out in the open to start a conversation. I'll continue to follow up below as more thoughts develop.

@cisaacstern added the enhancement (New feature or request) and priority labels Feb 17, 2023
@yuvipanda

In order for a bakery repo to be pluggable with Pangeo Forge Cloud, I think it needs to define some contractual set of config that Pangeo Forge Cloud can use to deploy a job to that bakery.

Yep, this was part of the motivation for https://github.com/yuvipanda/pangeo-forge-cloud-federation/ (although I think there have been other bits like that too)

@cisaacstern
Member Author

cisaacstern commented Feb 17, 2023

@yuvipanda, as we discussed yesterday (and you suggested), I believe a natural way to prototype all of this is by refactoring pangeo-forge-orchestrator to move all of the NSF-funded bakery components into their own repo(s), leaving only the core GitHub App functionality here.

@rabernat
Contributor

Charles and I just discussed this, and I'm 100% on board with the idea that we should work to simplify, decouple, and make stateless the Pangeo Forge service.

@cisaacstern
Member Author

cisaacstern commented Feb 27, 2023

We had a great discussion of this issue at today's Coordination Meeting. Minutes are in the linked doc; copying here for reference:

  • Extra props to Yuvi here!!! (He has provided a huge amount of design guidance on this.)
  • Implications for frontend (no database)
  • Actions vs. App
    • Actions can run everything except production deployments (for situations where the recipe contributor does not also own the deployment creds), including pruned test with LocalDirectRunner, linting
    • Actions can easily deploy to prod if the recipe contributor also owns the deployment creds
    • The App is needed in situations where some org/institution/company wants to support a Beam runner for a group of users (lab members, employees, community, etc.), without granting them ownership of executor creds
  • What is a bakery? Where is the interface between Actions/App and bakery?
    • A bakery is a Beam runner
    • A new bakery should also run an Agent service to deploy the jobs to the runner. This allows bakeries to “own” their deployment creds, and not have to hand them over to the GitHub App.
    • GitHub App POSTs to Bakery Agent to deploy
  • Questions:
    • [Ryan] What becomes of the website? The main feature which is missing is a catalog. Catalogs are very stateful. Catalogs are the essence of state. Something else needs to provide a catalog.
    • [Yuvi] The state has not disappeared, it’s just stored in GitHub. The source of truth lives alongside the recipes, on GitHub. The frontend can cache data from GitHub, but not have the source of truth.
    • [Yuvi] Catalog for end users to find data is one type of state. Another type of state is operational logs. Separation of these concerns will be valuable.
    • [Yuvi] In the conda-forge analogy, Anaconda hosts catalog and build artifacts (i.e. data).
    • [Ryan] Target for data are public dataset programs (OSN, AWS, etc.)
    • Home for the catalog(s) remains an open question. Earthmover is building catalog tooling, which may be part of the answer.
    • [Charles/Yuvi] Moving towards a multi-instance world, like JupyterHub, etc. The pangeo-forge GitHub org becomes a public demonstration, but not the only instance.
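On Yuvi's point that the frontend can cache data from GitHub without being the source of truth: a minimal sketch of what a dashboard might derive from the Checks API. The input is the JSON returned by GitHub's "list check runs for a Git reference" endpoint, which wraps the runs in a `check_runs` list; the function name and the bucket labels are my own.

```python
from collections import Counter


def summarize_check_runs(response: dict) -> dict:
    """Tally check runs by state for a dashboard view (hypothetical helper).

    Completed runs are bucketed by their conclusion (e.g. "success"),
    all others by their status ("queued" or "in_progress").
    """
    counts = Counter()
    for run in response.get("check_runs", []):
        if run["status"] == "completed":
            counts[run.get("conclusion") or "unknown"] += 1
        else:
            counts[run["status"]] += 1
    return dict(counts)
```

The result is derived data only, so it can be thrown away and rebuilt from GitHub at any time.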

@cisaacstern
Member Author

Thinking aloud about what a minimal prototype of this new system would require:

  • pangeo-forge-orchestrator PR that:
    • Removes database and associated routes
    • Removes bakery infrastructure (i.e. dataflow status monitoring submodule + terraform)
    • We should probably eventually rename pangeo-forge-orchestrator to just pangeo-forge/deployment-service or something more descriptive and narrowly-scoped (after this refactor is complete).
  • GCP Bakery repository which deploys:

Note: Short term, the easiest way to link these two services is to deploy the GitHub App on either Heroku or GCP App Engine. (Because the GitHub App is stateless, there is nothing design-wise difficult about deploying it to a serverless platform, but Columbia IT limitations make that impossible using our GCP account.) The GCP Bakery Agent can run on GCP Cloud Run. To invoke the Bakery Agent, the GitHub App can be logged into a gcloud session using a service account key with GCP Cloud Run Invoker permissions for the Bakery Agent. (This account key would need to be handed off to the GitHub App deployment repo by the Bakery Agent developers.)
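For reference, an authenticated call to a Cloud Run service carries an identity token in an `Authorization: Bearer` header. The sketch below only builds the request and assumes the token has already been obtained for the service account (how that happens, gcloud session or otherwise, is left out). The function name and the request body layout are assumptions.

```python
import json
import urllib.request


def build_agent_request(
    agent_url: str, id_token: str, body: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST to the Bakery Agent on Cloud Run.

    `id_token` is assumed to be an identity token for a service account
    holding the Cloud Run Invoker role on the Agent service.
    """
    return urllib.request.Request(
        url=agent_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {id_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be a call to `urllib.request.urlopen(request)`, or the equivalent in whichever HTTP client the App ends up using.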

Here is a sequence diagram outlining the proposed new architecture. Note the main differences with the existing system are:

  • No database (state is captured as GitHub check runs only).
  • Previously, GitHub App deployed jobs directly to the Beam Runner. Now, the Bakery Agent does this.
sequenceDiagram
    Feedstock Repo->>GitHub App:event webhook
    GitHub App-->>Feedstock Repo:creates check run (queued)
    GitHub App->>Bakery Agent:notifies: event
    Bakery Agent->>Beam Runner:deploys job
    Bakery Agent-->>GitHub App:notifies: job deployed
    GitHub App-->>Feedstock Repo:updates check run (in progress)
    Beam Runner->>Beam Runner Status Monitor:notifies: job complete
    Beam Runner Status Monitor-->>Bakery Agent:notifies: job complete
    Bakery Agent-->>GitHub App:notifies: job complete
    GitHub App-->>Feedstock Repo:updates check run (completed)
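The check run side of the sequence above reduces to a small mapping from Agent notifications to check run updates. The notification strings mirror the diagram; the `succeeded` flag is an assumption, since the diagram does not distinguish successful from failed jobs.

```python
def check_run_update(notification: str, succeeded: bool = True) -> dict:
    """Map a Bakery Agent notification to a check run update (sketch).

    "job deployed" -> check run moves from queued to in_progress
    "job complete" -> check run is completed, with a conclusion
    """
    if notification == "job deployed":
        return {"status": "in_progress"}
    if notification == "job complete":
        conclusion = "success" if succeeded else "failure"
        return {"status": "completed", "conclusion": conclusion}
    raise ValueError(f"unexpected notification: {notification!r}")
```

Because the only state transitions are queued, in progress, and completed, the App itself has nothing to persist between notifications.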

@cisaacstern
Member Author

I am not 100% decided on the best interface between the GitHub App and the Bakery Agent. Following the example above, the GitHub App could simply pass along some parsed version of the event payload, leaving it up to the Agent to decide what action to take in response to the event:

sequenceDiagram
    Feedstock Repo->>GitHub App:event webhook
    GitHub App->>Bakery Agent:notifies: event
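A "parsed version of the event payload" might look something like the following. The payload keys read here (`action`, `repository.full_name`, `sender.login`, `pull_request.head.sha`) are standard in GitHub webhook payloads; the shape of the returned dict is just a guess at what the Agent would need.

```python
def parse_event(event_type: str, payload: dict) -> dict:
    """Reduce a GitHub webhook payload to the fields an Agent might need.

    `event_type` is the value of the X-GitHub-Event request header.
    The output layout is hypothetical.
    """
    parsed = {
        "event": event_type,
        "action": payload.get("action"),
        "repo": payload["repository"]["full_name"],
        "actor": payload["sender"]["login"],
    }
    if event_type == "pull_request":
        # The commit the Agent would run the recipe against.
        parsed["head_sha"] = payload["pull_request"]["head"]["sha"]
    return parsed
```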

Alternatively (and I think I'm a bit more partial to this approach), the GitHub App could generate the appropriate pangeo-forge-runner command for the given event type, and then pass it along to the Bakery Agent:

sequenceDiagram
    Feedstock Repo->>GitHub App:event webhook
    GitHub App-->>GitHub App:generates `pangeo-forge-runner` cmd for event
    GitHub App->>Bakery Agent:POSTs cmd

This latter approach allows the Bakery Agent to (mostly) just be an invoker of pangeo-forge-runner commands, and it can have much less knowledge of GitHub Events. This feels like the right separation of concerns to me: GitHub App handles GitHub-y things, and insulates Bakery Agent from that layer.
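A sketch of the command-generating side, under a few assumptions: the input is a small dict with `event`, `repo`, and `head_sha` keys (a made-up shape), pull requests map to a pruned test run, and the `bake` subcommand with `--repo`, `--ref`, `--json`, and `--prune` reflects my understanding of the pangeo-forge-runner CLI and should be checked against its docs before anyone relies on it.

```python
from typing import List


def runner_cmd(event: dict) -> List[str]:
    """Generate a pangeo-forge-runner command for an event (sketch).

    The event-to-command mapping is an assumption: pull requests get a
    pruned test run, anything else gets a full run.
    """
    cmd = [
        "pangeo-forge-runner",
        "bake",
        "--repo",
        f"https://github.com/{event['repo']}",
        "--ref",
        event["head_sha"],
        "--json",
    ]
    if event["event"] == "pull_request":
        cmd.append("--prune")
    return cmd
```

The resulting list is what the GitHub App would POST to the Bakery Agent, so the Agent never has to look at a GitHub event directly.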

In either approach, the GitHub App should pass the name of the GitHub actor that triggered the event along to the Bakery Agent, as I am assuming some sort of allow-list of actors will be how the Bakery Agent decides whether or not to actually complete a given deployment request. (That is, anyone could specify a particular bakery in their meta.yaml, but eventually, not all bakeries may be for public use.)
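The allow-list check on the Agent side could be as simple as this. The `public` flag and the function name are assumptions; the comparison is case-insensitive because GitHub usernames are.

```python
from typing import Iterable


def is_allowed(actor: str, allowed_actors: Iterable[str], public: bool = False) -> bool:
    """Decide whether the Agent should act on a request from `actor` (sketch).

    A public bakery accepts everyone; otherwise the actor must be on the
    bakery's allow-list.
    """
    if public:
        return True
    return actor.lower() in {name.lower() for name in allowed_actors}
```

This only works if the Agent can trust the actor name it receives, which is one more reason for the App-to-Agent call itself to be authenticated.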

3 participants