---
title: component-selection-during-install
authors:
- "@bparees"
reviewers:
- "@decarr"
- "@staebler"
approvers:
- "@decarr"
creation-date: 2021-05-04
last-updated: 2021-05-04
status: provisional
---

# User Selectable Install Solutions

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement proposes a mechanism that lets cluster installers exclude one or more optional components from
their installation, determining which payload components are and are not installed in their cluster.
Core components are defined as the set of Second Level Operators managed by the Cluster Version Operator,
which today cannot be disabled except by editing a CVO override after the install completes, or by editing
the CVO overrides as part of rendering and editing manifests.

The proposed UX is to make this a first-class part of the install config API, with the implementation
being arguments supplied to the CVO that filter manifests according to the user's selections.

## Motivation

There is an increasing desire to move away from "one size fits all" cluster installations, and
towards flexibility about what should/should not exist in a new cluster out of the box. This can
be seen in efforts such as hypershift, single node, and code-ready-containers. Each of these
efforts has done some amount of one-off work to enable their requirements. This EP proposes a
mechanism that allows components to be disabled in a first class way that the installer exposes.

### Goals

* Admins can easily explicitly exclude specific "optional" components from their cluster, at install time.
* Admins can enable a previously excluded optional component, at runtime.

### Non-Goals

* Making control-plane critical components optional (k8s apiserver, openshift apiserver, openshift controller,
networking, etc)
* Defining which components should be disable-able (this will be up to component teams to classify themselves
as `addons` or not)
* Providing a way to install OLM operators as part of the initial cluster install. This EP is about making
the install experience around the existing CVO-based components more flexible, not adding new components to the
install experience.
* Allowing components to be disabled post-install.


## Proposal

### User Stories

* As a user creating a new cluster that will be managed programmatically, I do not want the additional
security exposure and resource overhead of running the web console. I would like a way to install
a cluster that has no console out of the box, rather than having to disable it post-install or
modify rendered manifests in a way that requires deep understanding of the OCP components/resources.

* As a team scaffolding a managed service based on openshift, I want to minimize the footprint of my
clusters to the components I need for the service.

* As a user creating a cluster that will never run an image registry, I do not want the additional overhead
of running the image registry operator, or have to remove the default registry that is created.


### Implementation Details/Notes/Constraints [optional]

The CVO already has the ability to respect annotations on resources, as can be seen
[here](https://github.com/openshift/cluster-kube-apiserver-operator/blob/c03c9edf5fddf4e3fb1bc6d7afcd2a2284ca03d8/manifests/0000_20_kube-apiserver-operator_06_deployment.yaml#L10) and leveraged [here](https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/assets/cluster-version-operator/cluster-version-operator-deployment.yaml#L47-L48).
This proposal consists of the following parts:

1) Formalizing a concept of an "addon" annotation which allows a given resource to be excluded based
on installer input. For example, the console-related resources could be annotated as:

```yaml
annotations:
  addon.openshift.io/console: "true"
```
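
For illustration, here is a rough sketch of how such an annotation might sit on a CVO-managed manifest
alongside the existing cluster-profile annotations. The resource name, namespace, and the particular
profile annotation shown are examples rather than a definitive list:

```yaml
# Hypothetical CVO-managed manifest fragment (names are illustrative).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: console-operator
  namespace: openshift-console-operator
  annotations:
    # Existing cluster-profile annotation already honored by the CVO.
    include.release.openshift.io/self-managed-high-availability: "true"
    # Proposed addon annotation; the CVO would skip this manifest when
    # "console" appears in the excluded addons list.
    addon.openshift.io/console: "true"
```
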
2) Defining an install config mechanism whereby the user can opt out of specific addons, e.g. (a fuller
sketch follows below):

   InstallConfig.ExcludeAddons
   - console
   - samples

Which resources ultimately get installed for a given cluster would be the set of resources encompassed
by the CLUSTER_PROFILE (if any), minus any resources explicitly excluded by the excluded addons configuration.
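
A minimal sketch of what this could look like in `install-config.yaml`, assuming a top-level
`excludeAddons` field; the exact field name and placement would be subject to installer API review:

```yaml
# Illustrative install-config.yaml fragment; "excludeAddons" is a placeholder
# name for the proposed field, not an existing installer API.
apiVersion: v1
baseDomain: example.com
metadata:
  name: my-cluster
excludeAddons:
- console
- samples
```
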
Examples of candidate components to be treated as addons:
* console
* imageregistry
* samples
* cluster baremetal operator
* olm/marketplace
* kube-storage-version-migrator
* insights
* monitoring
* ???
3) Pass the list of filtered annotations to the CVO. This is distinct from overrides because overrides
put the cluster in an unsupported state. Filtered annotations are supported for upgrades. The filtered
components will be listed in the ClusterVersion object:
```yaml
spec:
  excludedAddons:
  - console
  - samples
```
The CVO will filter out (not apply/reconcile) resources that are annotated with `addon.openshift.io/$exclusion` for each excluded addon.

4) An admin can remove an item from the filtered annotations list, but cannot add an item to it. If an
item is removed, the CVO will apply the previously filtered resources to the cluster on the next reconciliation.
Adding an item to the filtered list is not supported because it would require the component to be removed from the
running cluster, which has more significant implications for how all traces of the component are removed.
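
As an illustration of the supported direction of change, assuming the `excludedAddons` field sketched above:

```yaml
# Before: console and samples were excluded at install time.
spec:
  excludedAddons:
  - console
  - samples
---
# After: the admin removes "console"; on the next reconciliation the CVO applies
# the console resources it previously filtered out. Adding a new entry to this
# list would be rejected, since disabling a component post-install is not supported.
spec:
  excludedAddons:
  - samples
```
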

The currently configured filter list for the CVO should be recorded in telemeter so we can understand
the configuration of a given cluster.


### Risks and Mitigations

The primary risk is whether teams will understand how to use these new annotations and apply them
correctly to the full set of resources that make up their addon. Inconsistent or
partial labeling will result in inconsistent or partially deployed resources.

Another risk is that this introduces more deployment configurations which might
have unforeseen consequences (e.g. not installing the imageregistry causes some
other component that assumes there is always an imageregistry or assumes the
presence of some CRD api that is installed with the imageregistry to break).

There was some discussion about the pros/cons of allowing each component to be enabled/disabled independent
of that component explicitly opting into a particular (presumably well tested) configuration/topology
[here](https://github.com/openshift/enhancements/pull/200#discussion_r375837903). The position of this EP is that
we should only recommend the exclusion of fully independent "addon" components that are not depended on by
other components. Further the assumption is that it will be reasonable to tell a customer who disabled
something and ended up with a non-functional cluster that their chosen exclusions are simply not supported
currently.

Since the only components/resources that can be filtered out of the installation are ones that are explicitly
annotated with `addon.openshift.io/$component`, end-users will not be able to use this mechanism to filter
components/resources that we did not intend for them to be able to filter out.

## Design Details

### Open Questions


1. Do we want to constrain this functionality to turning off individual components? We could
also use it to:
a) turn on/off groups of components as defined by "solutions" (e.g. a "headless" solution
which might turn off the console but also some other components). This is what CLUSTER_PROFILES
sort of enables, but there seems to be reluctance to expand the cluster profile use case to include
these sorts of things.
b) enable/disable specific configurations such as "HA", where components could contribute multiple
deployment definitions for different configurations and the installer/CVO would select the correct
one based on the chosen install configuration (HA vs single node) instead of having components read/reconcile
the infrastructure resource.

2. How does the admin enable a component post-install if they change their mind about what components
they want enabled? Do we need/want to allow this?

Turning on a component later is relatively easy (we expose a config resource for the CVO that defines
the filter, we allow the user to remove items from the filter, the CVO will apply the previously
filtered resources during the next reconciliation).

Turning off a component later is more problematic because
a) The CVO doesn't delete resources today, so that would be a new thing to teach it to do.
b) Just deleting the resources for the component isn't sufficient, as the component also needs to clean
itself up in case it created any additional resources on the cluster or contributed any configuration.


3. What are the implications for upgrades if a future upgrade would add a component or resource which would
have been filtered out at install time? The install-time choices need to be stored somewhere in
the cluster (CVO config resource) and used to filter applied resources during upgrades as well. My understanding
is that today this is handled with CLUSTER_PROFILES and EXCLUDE_MANIFESTS by setting the env vars on the CVO
pod, but if we want to allow the set to be changed (see (2)), we need a more first-class config
resource that is admin-editable.
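
For reference, a rough sketch of how those existing knobs are wired today, assuming the env var names
used in this document; the values and surrounding manifest structure are simplified examples:

```yaml
# Illustrative fragment of the CVO Deployment; values shown are examples.
spec:
  template:
    spec:
      containers:
      - name: cluster-version-operator
        env:
        # Selects which profile-annotated manifests the CVO reconciles.
        - name: CLUSTER_PROFILE
          value: self-managed-high-availability
        # Skips manifests carrying the named exclusion annotation (a single value today).
        - name: EXCLUDE_MANIFESTS
          value: exclude.release.openshift.io/internal-openshift-hosted
```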

4. How prescriptive do we want to be about what can/can't be turned off? Components need to opt into
this by annotating their resources, so it's not completely arbitrary.

5. What to do for components where disabling them has implications for other components or for the way certain
APIs behave. Example: disabling the internal registry changes the behavior of imagestreams
(you can no longer push to an imagestream in order to push content to the internal registry) as well as the assumptions
made by tools like new-app/new-build (which create imagestreams that push to the internal registry).

6. What to do (if anything) for components with interdependencies, to ensure a user doesn't break
enabled components by disabling a dependency?

### Test Plan

1) Install clusters with the various addon components included/excluded and confirm
that the cluster is functional but only running the expected addons.

2) Upgrade a cluster to a new version that includes new resources that belong to
an addon that was included in the original install. The new resources should be
created.

3) Upgrade a cluster to a new version that includes new resources that belong to
an addon that was excluded in the original install. The new resources should *not* be
created.

4) After installing a cluster, enable additional addons. The newly enabled addons should
be installed/reconciled by the CVO.

5) After installing a cluster, disable an addon. The configuration change should be
rejected by the CVO. Disabling a component post-install is not supported.



### Graduation Criteria

We would expect this to go directly to GA once a design is agreed upon and approved.

#### Dev Preview -> Tech Preview
N/A

#### Tech Preview -> GA
N/A

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

For upgrades, any new resources should have the same exclusion filters applied by the CVO.
For downgrades, if downgrading below the version of the CVO that supports this logic,
previously excluded components will get created on the cluster. This is likely
not a concern since you can't downgrade below the version you started at, and if
you're using this feature, you started at a version of the CVO that supports it.

If we allow enabling filters post-install, then we need to revisit the implications of
downgrades.

There is also some risk that if a particular resource has different annotations in different
versions, upgrading/downgrading could change whether that resource is excluded by
the CVO or not. Once created, the CVO never deletes resources, so some manual cleanup
might be needed to achieve the desired state. For downgrades this is probably acceptable;
for upgrades this could be a concern (resource A wasn't excluded in v1, but is excluded
in v2: clusters that upgrade from v1 to v2 will still have resource A, but clusters
installed at v2 will not have it). Technically this situation can already arise today
if a resource is deleted from the payload between versions.


### Version Skew Strategy

N/A

## Implementation History

N/A

## Drawbacks

The primary drawback is that this increases the matrix of cluster configurations/topologies and
the behavior that is expected from each permutation.

## Alternatives

* The CVO already supports a CLUSTER_PROFILE env variable. We could define specific profiles like "headless"
that disable the console. CLUSTER_PROFILE isn't a great fit because the idea there is to define a relatively
small set of profiles, each specifying a set of components to be included, not to allow a user to fully pick
and choose individual components. We would have to define a large set of profiles to encompass all the possible
combinations of enabled/disabled components.

* The CVO already supports an EXCLUDE_MANIFESTS env variable which is used to implement the ROKS deployment topology.
Unfortunately it only allows a single annotation to be specified, so even if we want to use it for this purpose
it would need to be extended to support multiple annotations, so that individual components can be excluded
independently rather than requiring all excluded components to share a single common annotation.

Regardless, we need a way to expose this configuration as a first-class part of the install config provided by the
user creating the cluster, so at a minimum we need a mechanism to wire an install config value into
the CVO arguments and allow the CVO to consume more than a single exclusion annotation.
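
As a purely hypothetical sketch of what such an extension might look like (this is not current CVO
behavior; the comma-separated form and the annotation keys are illustrative):

```yaml
# Hypothetical extension: EXCLUDE_MANIFESTS accepting multiple annotations,
# one per excluded addon, populated by the installer from the install config.
env:
- name: EXCLUDE_MANIFESTS
  value: addon.openshift.io/console,addon.openshift.io/samples
```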


* Allow the installer to specify additional resources to `include` in addition to ones to `exclude`. This has the challenge
of potentially conflicting with the specific set of resources that a cluster_profile defines. There are some
components that should never be deployed in a particular cluster_profile and so we do not want to allow the user
to add them. Examples would be resources that should only be created in standalone installs, not hypershift
managed ones, because hypershift has its own versions of those resources.

* Use CVO overrides to exclude content. The problem with this approach is that it puts the cluster into an unsupported
and non-upgradeable state.


## Infrastructure Needed

N/A
