Capture rationale for the 'Archetype' resource #102

Closed
steren opened this issue Feb 8, 2018 · 13 comments
Assignees
Labels
area/API (API objects and controllers) · UI / UX

Comments

@steren
Contributor

steren commented Feb 8, 2018

By Archetype, I mean the resource that contains a base specification for new Revisions but that is not a Revision by itself.

This issue is independent from #63, where we discuss the N:M mapping between the various resources.

I suggest we capture here the rationale behind the decision to introduce the Archetype resource, along with the alternatives that were considered.

My underlying goal is to understand whether we could find a valid solution without introducing such a concept as a resource.

@steren
Contributor Author

steren commented Feb 8, 2018

I can start with a feature requirement that probably led us towards the introduction of Archetypes:

Developers should not have to pass the entire specification of the revision they deploy. As an example, developers should be able to create a new revision that differs from the previous one by just one environment variable, without having to re-deploy new source or a new image. Similarly, a developer should be able to deploy new source and carry over previously created environment variables.
Also, we do not want to require storing the entire revision spec in local files under version control. This should be possible if the developer desires it, but it should not be mandatory.
Thus, the requirement for a 'deploy' action is that users can re-use an existing spec.

@steren
Contributor Author

steren commented Feb 8, 2018

Now let me capture an API design that was suggested earlier in the design process:

  • There are 2 resources: ElaService and ElaRevision.
  • When creating a new ElaRevision (i.e. when 'deploying'), the API allows specifying a base revision and the changes to apply to it.

The UI or CLI would be in charge of picking the right base revision (for example the latest successfully deployed), but users could decide to base on another one.

Creating an ElaRevision would just mean sending a POST with an ElaRevision spec, this spec containing a baseRevision attribute.
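
For concreteness, a request could look something like this sketch (the baseRevision attribute and the overlay semantics shown here are hypothetical, not an agreed API):

apiVersion: elafros.dev/v1alpha
kind: ElaRevision
metadata:
  name: thumbnailer-00002
spec:
  baseRevision: thumbnailer-00001  # all unspecified fields are inherited from this Revision
  env:                             # only the diff from the base is supplied
    - name: FEATURE_FLAG
      value: "on"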

Once created, depending on its traffic configuration, the Service would either automatically point to it, or let the client set a new traffic configuration, potentially pointing 0% of the traffic to this newly created revision.
Getting a linear ElaRevision history would be achieved by sorting the ElaRevisions by creation date.

Note that this proposal does not prevent arbitrary traffic splits or gradual rollout.
Please detail why exactly this is not sufficient and why the concept of Archetype needed to be introduced.

@vaikas
Contributor

vaikas commented Feb 8, 2018

You don't have to pass the entire spec, you can always just patch:
https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch/
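
For example, changing a single environment variable could look something like this (the resource kind and names are placeholders; custom resources would need --type merge):

# patch.yaml
spec:
  container:
    env:
      - name: FEATURE_FLAG
        value: "on"

# applied with something like:
#   kubectl patch elarevision thumbnailer --type merge --patch "$(cat patch.yaml)"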

Does this not meet your requirement stated above?

@steren
Contributor Author

steren commented Feb 8, 2018

Yes; #102 (comment) was not about the availability of PATCH semantics, but about a developer workflow requirement.

@dewitt
Contributor

dewitt commented Feb 8, 2018

It's a smart question, and I'm not sure I'm the best person to answer it. But playing it back, what is the difference in developer experience between authoring a RevisionTemplate vs authoring a base Revision that is used as a template?

Since the user isn't expected to directly poke at Revisions, just RevisionTemplates, it's still just one resource for the user to wrap their head around.

And one advantage to a separate RevisionTemplate is that it can contain additional fields beyond what the Revision can. Another advantage is that it allows Revisions to be immutable and always system generated.

Personally, I found it much easier to process the mental model once I began calling it a Configuration instead. Now, if you edit either your Configuration or your source code, then the system will produce and run a new immutable Revision based on those changes. If you want to create a whole new branch of that code (for experiments, for prod vs qa, whatever), just add another Configuration, and a new Revision chain will magically appear. (Then you can Route traffic to those Revisions as you see fit.)

Something like: Code + Configuration == Revision. (Or maybe more literally, Δcode + Δconfiguration == new immutable revision.)

I think I'd be fairly comfortable dealing with that model on a daily basis, but we should run it through some UX testing, too.

@mattmoor
Member

mattmoor commented Feb 9, 2018

A few thoughts on this top-of-mind:

The UI or CLI would be in charge of picking the right base revision (for example the latest successfully deployed), but users could decide to base on another one.

This puts a significantly higher burden on the client, and eliminates an API resource that essentially summarizes exactly what you are after. Juggling Revisions in this way seems strictly (and significantly) harder (and slower) to me than patching the Archetype as the source of truth.

List, Sort, Pick vs. Get

I believe that by having the Archetype we are essentially encouraging agreement across clients on how to identify the "base" Revision for changes.

To anticipate your counter-argument: "What if I don't want the Archetype's Revision spec because it failed."

Ok, so two Gets instead:

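   // Fetch the Archetype, then the Revision its status reports as latest ready: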
   arch := Archetype.Get(name)
   rev := Revision.Get(arch.Status.LatestReadyRevision)

Once created, depending on its traffic configuration, the Service would either automatically point to it, or let the client set a new traffic configuration, potentially pointing 0% of the traffic to this newly created revision.
Getting a linear ElaRevision history would be achieved by sorting the ElaRevisions by creation date.

Currently Auto works by pointing ElaService at Archetype. In a world without this concept, we are back to selectors (sorted by creation), which I feel we universally agreed was inferior to an Archetype reference.

Did you have something else in mind? If so, concrete examples might help focus the discussion.

@steren
Contributor Author

steren commented Feb 9, 2018

Thanks,

I wish we could close this issue by capturing:

  • Why the API design group converged on this particular solution.
  • Which alternatives were considered and why they did not work.

@cooperneil

A few more details on why we converged on this solution (note that @evankanderson posted a detailed doc that summarizes the rationale for this decision as part of issue #63):

Initially we did start with just 2 resources: Service and Revision (though they had different names at the time, and probably will be renamed again per issue #65 ). However, per @mattmoor's comment above, we recognized early on that we wanted Revisions to be immutable snapshots of code + config, and therefore there should be another resource responsible for minting new Revisions. Benefits of that include:

  • enabling "push code, keep config" and "change just an env var" scenarios through PATCH to that one resource
  • that resource becomes the source of truth; multiple client implementations don't have to juggle multiple revisions to accomplish the above, which can also lead to read-modify-write issues
  • it enables automatic rollout by updating just one well-known, named resource

Initially that one resource was the Service... the initial API design had the Service include a RevisionSpec of the 'next' revision, and updating the Service (i.e. PATCHing the revisionSpec) would result in minting a new Revision that would be rolled out.

While simple, this had several drawbacks, including the inability to handle more complex traffic scenarios such as splitting traffic over a release track and a separately maintained experiment track[1], or arbitrary n-way splits between named Revisions. Additionally, the Service combined too many concerns - namely routing traffic and being the source of truth for a single train of Revisions.

Now consider the scenario below with the proposed model of a separate Archetype resource (from issue #63):

apiVersion: elafros.dev/v1alpha
kind: ElaService
metadata:
  name: thumbnailer
  ...
spec:
  traffic:
    # manually rolled out
    - revisionName: 123
      name: stable
      percent: 90
    - revisionName: 456
      name: canary
      percent: 10
    # automatically rolled out
    - archetypeName: thumbnailer
      name: head
      percent: 0
  ...

In this scenario, a manual rollout is incrementally shifting traffic from named revision 123 to 456. In addition, the Archetype is referenced to provide a floating, addressable 'head' revision (which serves 0% of traffic).

The act of creating a new Revision is done through a PATCH to the Archetype. The act of manually rolling it out is done through a PATCH to the Service. From an authorization perspective, these may be different user roles, which are reflected by the different resource types that embody these different concerns.

In this advanced Service configuration, a user with a role to update the Archetype can still create new revisions and test them through the 'head' subdomain, but to manually roll out that new revision to serve customer traffic, a user with the more restricted role to update the Service is necessary. This separation becomes challenging if these 2 concerns are mixed in the same object.
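
As a rough sketch, such a split could map onto standard Kubernetes RBAC along these lines (the role name, API group, and lowercase resource names are illustrative assumptions, not part of this proposal):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: code-releaser            # hypothetical role name
rules:
  # may mint new Revisions by patching the Archetype...
  - apiGroups: ["elafros.dev"]   # assumed API group
    resources: ["archetypes"]
    verbs: ["get", "list", "patch", "update"]
  # ...but may only read the Service that controls customer traffic
  - apiGroups: ["elafros.dev"]
    resources: ["elaservices"]
    verbs: ["get", "list"]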

[1] examples of the 1:N experiment use case (also from issue #63 ):

apiVersion: elafros.dev/v1alpha
kind: ElaService
metadata:
  name: thumbnailer
  ...
spec:
  traffic:
    - archetypeName: thumbnailer
      name: thumbnailer
      percent: 90
    - archetypeName: thumbnailer-experiment
      name: exp
      percent: 10
  ...

This simple configuration lets you have a separate release track and experiment track, automatically rolling out newly minted Revisions from each to a specified percentage. This would not be possible without separating Archetype from Service, which enables the 1:N relationship between a Service and its Archetypes.

@dewitt's diagram (from issue #63, using the names "Configuration" and "Route" rather than "Archetype" and "Service") illustrates this example:

[Image: "experiments" diagram]

@steren
Contributor Author

steren commented Feb 9, 2018

While simple, this had several drawbacks, including the inability to handle more complex traffic scenarios such as:
splitting traffic over a release track and a separately maintained experiment track

Today, Google App Engine does not have the concept of tracks, but only the concept of traffic split. This means that users are responsible for deploying revisions and then changing the traffic configuration to point to the new revisions. They are responsible for maintaining tracks by manipulating base resources (traffic and revisions).

or arbitrary n-way splits between named Revisions.

I am not sure this is correct. The following solution would provide arbitrary n-way traffic split:

spec:
  traffic:
    - revisionName: abc
      name: thumbnailer
      percent: 90
    - revisionName: def
      name: exp
      percent: 10

The act of creating a new Revision is done through a PATCH to the Archetype. The act of manually rolling it out is done through a PATCH to the Service. From an authorization perspective, these may be different user roles, which are reflected by the different resource types that embody these different concerns.

In #102 (comment), I propose that:

  • The act of creating a new Revision is done by POSTing a Revision (the spec would contain a baseRevision + spec "diff")
  • The act of manually rolling it out is done through a PATCH to the Service.

This is still compatible with having different roles.

@steren
Contributor Author

steren commented Feb 9, 2018

It seems to me that the existence of the Archetype is the consequence of two things:

  1. We assume that we need to provide a construct to help users manage release tracks within a given service.
  2. Kubernetes-style APIs favor using PATCH to express a "spec diff", but in that case the patch must be applied to an existing resource's spec, rather than pointing at a base resource to apply the patch to.

@evankanderson
Member

Status update: I'm working on a longer "why we did this" description, which will close out this issue. A rough outline of the content, which is based on experience building and running the GAE control plane API as well as discussions with the kubernetes team:

A common problem between App Engine and Kubernetes today is that the yaml artifacts are local details that users shouldn't need to manage and maintain. Configuration (aka Archetype, RevisionTemplate) is an attempt to move these yaml artifacts to the server in a way which doesn't require setting up additional git repos, triggers, and configuration a priori. (Customers should still be able to do this for their configuration if they wish.)

The next update will include alternative designs considered, along with problems which led them to be rejected.

@evankanderson
Member

evankanderson commented Feb 14, 2018

The current API configuration (Routes, Configuration, and Revisions) was built based on experience with 12 factor and with existing products, particularly Google App Engine and Google Cloud Functions, which have two different models for managing resources with different tradeoffs.

Google App Engine

In App Engine, most configuration is performed via an app.yaml file, which contains information about scaling, environment variables, language runtime, and other configuration information. (Historically, this also includes utilities for things like URL path dispatch, since App Engine arose before frameworks like Flask were standardized in Python.) Typically, this file would be checked in alongside the source code, encoding most of the application configuration in a file adjacent to the source.

Having configuration checked in to source is compelling from a mature operations point of view, but has several limitations:

  • If the source control system is the source of truth for the config, any changes made through another UI (for example, a web UI) need to either be flowed back into the source control system, or all the tools need to be able to perform a dynamic merge of state – this is a current problem which makes kubectl thick, for example.
  • Additionally, having configuration sourced from a file introduces a speed bump during the getting started process. This speed bump takes two forms:
    1. Breaking the user out of their code in order to paste some semi-boilerplate into a file.
    2. Breaking the user out of e.g. a web UI flow in order to save away the config file and possibly source code (see GCF, below).
  • Alternatively, the configuration can be divided in some way between tools such that the checked in file covers certain concerns while other tools cover the rest. My experience is that this line moves from project to project and over time, so what's reasonable in 2016 is unbearable by the end of 2017.
  • As multiple environments evolve, the config file ends up being copied/forked (ideally into a separate repo, but usually alongside the code), and managing the changes and flowing them through releases becomes a challenging problem. Hopefully, around this time a team will standardize on a single tool/process for managing this configuration, but that's not always true. (In particular, during the transition to a single tool, people will revert to old patterns during an emergency, which will tend to make things worse rather than better if old patterns and new tool can't interoperate.)
  • Because the revision history of app.yaml is stored in a separate system, it can be difficult to determine a semantic ordering of created versions, particularly if fanciful release naming is used.

Google Cloud Functions

In Cloud Functions (similar to AWS Lambda and other Function-as-a-Service frameworks), there is no .yaml file for configuration, and all configuration is managed on the server side, typically by full replacement of the existing resource (either by GET and PUT, or by POST of a new resource). This enables interaction between multiple clients (because the server-side state is authoritative), but introduces some additional problems:

  • Because the cloud resource information is canonical, clients must specify a base resource and do a round-trip (GET+PUT) to update the configuration, or collect the configuration information each time via command-line flags. In practice, most tools end up having a set of flags which are needed each time a code or configuration change is made, and these flags end up getting converted into shell scripts, effectively converting the problem back to the App Engine design.
  • Many implementations do not support gradual rollout/update of function definitions. Since Function-as-a-Service platforms typically scale from zero very rapidly, it's easy to simply scale up the new code on demand, rather than the gradual rollout style that e.g. Kubernetes uses.

Configuration (aka Archetype) as a solution

We designed the Configuration resource to act as a "head" of the sequential version history. Having a single mutable head of the revision history avoids embedding a lot of thick logic in the client to select the correct base Revision in cases of failed deployment or while a deployment is in progress.

Having a single head also allows us to present the history of Revisions as a (time-ordered) stream of deltas, rather than as a forking/branching history à la revision control. This simplifies both UI and user mental models over a general forking/history model. Without an explicit head to stamp out history from, it becomes easy for client tools to accidentally generate forks as users choose different base revisions:
[Image: forking revision trees]

In particular, it becomes difficult to determine what the correct (user expected) behavior is when R4 and R5 become Ready if a user is attempting to follow the GCF "immediate update" pattern. By squashing the history into a single stream, it becomes much easier to understand the order that things will become live. (For example, in a single-streamed history, if R4 becomes ready after R5, R4 would never be pointed at by latestReadyRevisionName.)
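
As an illustrative sketch of this single-streamed behavior (the field shapes below follow the discussion above but are assumptions, not a finalized schema):

apiVersion: elafros.dev/v1alpha
kind: Configuration
metadata:
  name: thumbnailer
spec:
  revisionTemplate:              # assumed name for the template new Revisions are stamped from
    spec:
      container:
        image: gcr.io/example/thumbnailer:latest
status:
  latestCreatedRevisionName: thumbnailer-00005  # R5: created, not yet Ready
  latestReadyRevisionName: thumbnailer-00004    # head of the Ready stream; advances only forward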

Failure and rollback

In the case that the most recent Revision has failed to become Ready, tools should be able to detect this and suggest that the user reset the Configuration to the state of the last Ready Revision. Users may choose to "roll forward" off the current configuration, or replace the Configuration with the last working config. Obviously, some tools may not do this or be able to do this (e.g. a CI system may have no way to prompt the user).

Separation of duties and flexibility

In early designs, we had a single Service resource which contained aspects of both Route and Configuration. This had four main problems:

  • The Service resource was large and conflated many concerns, which made it difficult to manage and understand.
  • Having many different concerns in one resource makes it difficult to divide roles via RBAC (for example, "Code Releaser" vs "Rollout Operator").
  • There was no good way to refer to "latest Ready" (it would have required a special singleton token, and singleton tokens usually end up being a point of regret).
  • A single Revision history for a Service made it difficult to express customer scenarios like "long-running experiment" or "gradual rewrite from language A to B".

See this document for a high-level overview of the other API options we considered.

@evankanderson
Member

With the above writeup, I'm closing this issue. I expect that any future desire to revisit the resource architecture will include a writeup which addresses both the conceptual model and a number of specific customer scenarios (aka "Critical User Journeys"). The document which contains the current mapping of user journeys to sequences of API events is in the process of being converted from an internal Google Doc to a public set of markdown resources, and should form a basis for comparison with future proposals.
