First iteration of running the Assisted Installer in end-user clusters. #574
Conversation
* additionalNtpSource: string
* installConfigOverrides: string

Status:
How do we show API errors? Mostly talking about 4xx. Maybe have another field like `errors`?
Can't we check for that in the sync part and fail the request?
It's already in the reconcile part, so the user can take a look at the events.
But how will Hive see the error? I'm not sure if they are monitoring the events.
If the controller is unable to make progress due to errors like that, it should log those errors, and possibly also set a condition in the resource status such as "progressing = false" with a reason such as "error received from backend service" or similar.
+1, we should use conditions with specific type to provide the necessary status updates instead of these specialized fields in the status...
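To make the suggestion concrete, here is a hedged sketch of the kind of conditions entry being proposed. The condition type and reason names are invented for illustration, not taken from the proposal or any existing API:

```yaml
status:
  conditions:
  - type: Progressing              # invented condition type
    status: "False"
    reason: BackendServiceError    # invented reason
    message: "error received from backend service: 400 Bad Request"
    lastTransitionTime: "2021-01-12T10:00:00Z"
```

Consumers such as Hive could then watch this condition instead of scraping events or bespoke status fields.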
* logsCollectedAt: string
* installerVersion: string
* createdAt: string
* updatedAt: string
Should use creationTimestamp property of the object instead
for kubernetes-native is more natural for it to be separate.

Spec:
* clusterName: string
Use ObjectReference instead of string
Do we anticipate an image existing in a different namespace than the cluster it's related to?
Use of ObjectReference is discouraged upstream, I think because it's very unclear which of the many fields it exposes are in use. https://godoc.org/k8s.io/api/core/v1#ObjectReference
We use a mix of String name and LocalObjectRef in Hive if that's of any help.
@filanov I guess we move back to strings.
I guess things like secrets should still be proper references? For example I could imagine the pull-secret being used by multiple entities in multiple namespaces.
Pull secret is tough to use across multiple namespaces, as you can't mount a secret from another namespace into a pod. The pod would need very broad secret RBAC across the cluster, which is troublesome.
ACM fights with this as they keep a central index of cloud creds users can use for lots of clusters in different namespaces. @jnpacker was commenting on the challenges keeping these in sync in all namespaces and we floated some ideas, nothing concrete yet.
I'd suggest Secret be LocalReference as well, but if you need to go to secrets in another namespace, I believe best practice is to create your own type instead of using ObjectReference. (from our experience everyone is trying to stay away from it now)
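A minimal Go sketch of the two reference styles discussed in this thread. The type and field names are stand-ins: `LocalSecretReference` mirrors `corev1.LocalObjectReference`, and `SecretReference` is the kind of purpose-built cross-namespace type suggested above, not an actual Hive or assisted-service API.

```go
package main

import "fmt"

// LocalSecretReference names a Secret in the same namespace as the
// referencing resource (mirrors corev1.LocalObjectReference).
type LocalSecretReference struct {
	Name string
}

// SecretReference is a purpose-built cross-namespace reference. Unlike
// corev1.ObjectReference, it exposes only the fields actually in use,
// so callers know exactly what is honored.
type SecretReference struct {
	Name      string
	Namespace string
}

func main() {
	local := LocalSecretReference{Name: "pull-secret"}
	remote := SecretReference{Name: "pull-secret", Namespace: "shared-creds"}
	fmt.Println(local.Name, remote.Namespace) // prints: pull-secret shared-creds
}
```

The design point is the narrow field set: a custom two-field type documents intent, where `ObjectReference` leaves it unclear which of its many fields are consulted.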
Installer's REST API.

Spec:
* clusterName: string
Use ObjectReference instead of string
* sshPublicKey: string
* pullSecretName: string (reference to k8s secret)
* ignitionOverrides: object
* staticIpConfiguration: object
Are `ignitionOverrides` and `staticIpConfiguration` configMaps?
* installerArgs: string

Status:
* state: string enum
Should this be a string alias, same as in line 172?
Yes
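For clarity, a string alias in the sense used here is a defined string type with a fixed set of constant values. The state names below are illustrative, not the actual assisted-service states:

```go
package main

import "fmt"

// HostState is a defined string type whose values come from a fixed set,
// giving enum-like type safety while still serializing as a plain string.
type HostState string

const (
	HostStateDiscovering HostState = "Discovering"
	HostStateKnown       HostState = "Known"
	HostStateInstalling  HostState = "Installing"
	HostStateInstalled   HostState = "Installed"
)

func main() {
	s := HostStateInstalling
	fmt.Println(s) // prints: Installing
}
```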
* macAddress: string
* flags: array of strings
* speedMbps: integer
* disks: array
What about `InstallationEligibility` and `IoPerf`, as seen in models/disk.go?
These were recently added to the REST API, need to add them here
* bootable: boolean
* smart: string
* boot:
* current_boot_mode: string
this should probably be: currentBootMode
Yes, missed some instances of snake to camel case
* stageStartedAt: string
* stageUpdatedAt: string
* connectivity: array of
* host
which object is this?
@filanov proposed to reference the Host object, which makes sense to me.
Please approve/disapprove
* createdAt: string
* expiresAt: string

**AgentDrivenClusterInstallation**
One point of clarification that would be useful to make here: this resource will play no part in the standalone cluster use cases; it will only play a role in the multi-cluster use cases.
And then in the context of multi-cluster installs, we should clearly distinguish between the purpose of ClusterDeployment and AgentDrivenClusterInstallation - right now, they appear to overlap in difficult to justify ways
The "create cluster" workflow below highlights this when it says:
2. User creates AssistedInstallCluster resource describing their desired cluster.
3. User creates ClusterDeployment describing a new cluster.
What is the user describing about the cluster with AgentDrivenClusterInstallation that should not be described by ClusterDeployment?
Another example would be e.g. machineNetworkCidr - if you created a ClusterDeployment referencing an install-config with a particular machine network (or with `hiveutil create-cluster --machine-network`), and separately AssistedInstallCluster listed a different machine network, which would win?
We're looking at having at least 3 cluster CRs in play: ACM, Hive, Assisted (and in a parallel stream a Hypershift cluster). There will likely be substantial if not complete duplication between them, as a Hive caller should be able to fully specify their cluster in Hive. They should not need to create CRs for the Assisted operator directly. (as far as I am aware)
We talked about doing this so teams can develop independently and as quickly as possible, and components remain useful in isolation. I'm still a bit nervous about this and not sure if it's the ideal approach. (and we should always strive to reduce the load on etcd, so I don't love duplicating all of this everywhere)
> They should not need to create CRs for the Assisted operator directly. (as far as I am aware)
Discussions today have revealed this to be less of a mandate than we may have thought. Let's continue to discuss, and I think we can agree on something reasonable. Making duplicate pass-through APIs in hive to mirror the complete APIs of agent-based installation probably doesn't make sense.
After a chat with @eparis and @derekwaynecarr yesterday I think we should discuss eliminating the middle API and operator, if not now, then after prototyping. The resulting topology should be the same, and it feels like the only reason we're duplicating those APIs is to not have to figure out how to work together. From Hive's POV it doesn't feel like it makes sense that we would wrap a REST API with Kube APIs, and then wrap those Kube APIs with more Kube APIs. I don't think we have a need for a standalone Assisted operator, and I think that would be a misunderstanding of what exactly Hive is. It is quite small and lightweight, easy to get running, and its job is to provision configurations of OpenShift at scale.
I would pitch that once we've got a prototype up (or sooner if desired), we start pushing that code to hive. Review process will be difficult for both sides but there's a big benefit there in that we can help translate that imperative API to a declarative Kube API as my team has been at this for a few years.
In this scenario I would assume that the current assisted install code base remains separate, as this allows the really deep domain logic for performing these installs to stay in the teams with that knowledge. Presumably this would also still need an OperatorHub operator to get the REST API up on a hub cluster, just with a much more limited scope.
* httpsProxy: string
* noProxy: string
* userManagedNetworking: boolean
* additionalNtpSource: string
This NTP source example makes me super nervous about what this resource could turn into.
AIUI, there is only one recommended way to configure the NTP source in a cluster - using a MachineConfig that lays down an /etc/chrony.conf file.
If this isn't sufficiently user friendly (and I agree that it is not!) then that is a good argument for it being a top-level install-config option
A new resource like this should not accrete installation options that are "too hard" to add to the core installer
See also:
openshift/installer#3258
openshift/machine-config-operator#629
openshift/installer#1951
https://github.com/openshift/machine-config-operator#applying-configuration-changes-to-the-cluster
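For context on the recommended approach mentioned above, a MachineConfig that lays down /etc/chrony.conf looks roughly like this, following the pattern in the linked machine-config-operator README. The Ignition version and the elided file contents are assumptions here, not taken from the proposal:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/chrony.conf
        mode: 420  # decimal for 0644
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,<base64-encoded chrony.conf>
```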
This field is used to synchronize clocks of hosts that have not yet formed a cluster. If hosts boot the live ISO and run the agent, and they do not already have NTP sources configured through other means (DHCP, ...), and their clocks are divergent, then we see failed installations. AIUI the user can set this value if clock validation fails, then the agent will apply it, then the validation will eventually succeed after clocks synchronize, and finally installation can be initiated.
When this field is used to add an NTP source for the above use case, a machine config is additionally created so that the new cluster will continue using that source. But this is a secondary behavior driven by the need to converge clocks pre-installation. If an install-config option were available, that could be used instead.
Here's the relevant BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1884465
Does this alleviate your concerns? I know it's still creating an API for something that could or should reasonably also be in the install config, but putting it in the install config would not solve this problem.
Thanks, it's useful to know it serves this pre-installation purpose also
It makes the issue more complicated, but it doesn't alleviate the concern - there's nothing specific to agent-based installations about users needing to configure an NTP server. See openshift/hive#1247 (comment)
* sshPublicKey: string
* pullSecretName: string (reference to k8s secret)
* ignitionOverrides: object
* staticIpConfiguration: object
I'd like to see some discussion about this design versus likely future directions around supporting static IP configs e.g. #540
For example, consider the standalone cluster case - if the static IP config is stored in a MachineConfig, and you want to create an installation image for a given machine ... wouldn't it make sense for the AgentDrivenInstallationImage to reference that MachineConfig?
Note that #540 is mainly about how to configure vlans/bonding - this is potentially possible via MachineConfig because it's likely to be a common configuration for all Machines - but there's no way to define per-Machine MachineConfigs (ref openshift/machine-config-operator#1720).
Until/unless that's resolved I don't think MachineConfig provides a solution for static IPs (I guess in theory it'd work for single-node deployments, but we want a common interface for multi-node deployments too, right?)
This does raise an interesting question of the proliferation of network interface config interfaces we're exposing users to though, e.g:
1. NetworkManager keyfiles injected via `coreos-install --copy-network` (potentially per-node for UPI/AI but no IPI platforms use this)
2. Ignition customization (potentially per-node in the UPI/AI case)
3. Some future machine-specific MachineConfig interface
4. The declarative kubernetes NMState work being done for day-2 networking (and long term it'd be really nice if day1 and day2 interfaces were the same?)
In this case we need some way to configure the install-agent ISO (which IIUC will then pass-through the configuration via coreos-install like in 1 above), which perhaps justifies yet-another interface, but I agree with @markmc it may be worth considering all-the-interfaces that already exist, and justifying if we need a net-new format or some kind of passthrough to one of the existing interfaces.
One simple option might be to allow passing in encoded NM keyfiles via this new API - that provides flexibility (for example bond/vlan support etc without having to add explicit support to a new DSL), but it exposes users to the underlying format (which isn't super-well documented IME).
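For a sense of what "passing in encoded NM keyfiles" would expose users to, here is a minimal static-IP keyfile sketch. All addresses, interface names, and values are invented for illustration:

```ini
# /etc/NetworkManager/system-connections/eno1.nmconnection (hypothetical)
[connection]
id=eno1
type=ethernet
interface-name=eno1

[ipv4]
method=manual
addresses=192.0.2.10/24
gateway=192.0.2.1
dns=192.0.2.1

[ipv6]
method=disabled
```

This illustrates both sides of the trade-off above: full flexibility (bonds, vlans, etc. need no new DSL support), but the keyfile syntax itself becomes user-facing.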
BTW we made the API very similar to what the ignition config expects
> BTW we made the API very similar to what the ignition config expects
@avishayt can you clarify that - ignition doesn't have any explicit support for network configuration, only to write files or systemd units, can you point me to any examples of what the API looks like please?
See here: https://coreos.com/ignition/docs/latest/network-configuration.html
But either way, it's another API
#### CAPBM

[Cluster-API Provider Bare Metal](https://github.com/openshift/cluster-api-provider-baremetal/) will need to gain the ability to interact with the new assisted installer APIs. Specifically it will need to match a Machine with an available assisted Agent that is ready to be provisioned. This will be in addition to matching a Machine with a BareMetalHost, which it already does today.
I'd like to see some discussion here of the alternative - a new machine API provider specifically for agent-driven installation, such that agent-driven support isn't limited to bare metal
The basic idea is straightforward - a new actuator that simply satisfies a Machine request by choosing an AgentDrivenHostInstallation resource and initiating an install
The challenging part is how to then leverage the per-platform providers to make it easy to boot a new machine with the agent, in the case where there is no existing AgentDrivenHostInstallation available
Another reason why it's important to consider this alternative - we may struggle to find good justification why support for agent-driven installation is within the scope of upstream CAPBM, and maybe for this exact reason ... that there is nothing bare-metal specific about it.
Also, consider the boot-it-yourself day 2 add-worker case. There really is no need for a BareMetalHost resource to be involved with such a workflow
UPI workflows get Nodes without Machines today, right? Is there any reason this workflow always needs a Machine either?
@mhrivnak This thread was detached from the text because of changes, but I don't think it's resolved.
I think we just agreed to delete this whole section, and the proposal no longer proposes any particular Machine API integration. There's a new section, Future Use of MachinePools and Machine API, that elaborates.
Agents that are running on hardware and waiting for installation-related
instructions will continue to communicate with the assisted service using the
existing REST API. In the future that may transition to using a CRD.
Will this status be mirrored onto the CRDs?
The communication between the agent and service doesn't always yield something useful for the user. When it does, then it will be included in the AgentDrivenInstallationHost.
Spec:
* name: string
* approved: boolean (used for starting installation)
* openshiftVersion: string
What does this openshift version look like?
We work with ClusterImageSets, which ended up being just a string pointing to a release image pull spec.
It looks like "4.6". We have a ConfigMap that we maintain with information about each version.
To answer some possible followup questions:
- We don't currently have real support for arbitrary versions
- We know that rhcos_version can be derived but haven't gotten around to it yet
{
"4.6": {
"display_name": "4.6.8",
"release_image": "quay.io/openshift-release-dev/ocp-release:4.6.8-x86_64",
"rhcos_image": "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.6/4.6.8/rhcos-4.6.8-x86_64-live.x86_64.iso",
"support_level": "production"
},
"4.7": {
"display_name": "4.7-pre-release",
"release_image": "quay.io/openshift-release-dev/ocp-release@sha256:2419f9cd3ea9bd114764855653012e305ade2527210d332bfdd6dbdae538bd66",
"rhcos_image": "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.6/4.6.8/rhcos-4.6.8-x86_64-live.x86_64.iso",
"support_level": "beta"
}
}
* isApiVipConnected: [success, failure, pending, error]
* belongsToMajorityGroup: [success, failure, pending, error]
* isNtpSynced: [success, failure, pending, error]
* progress:
Another good fit for a condition.
agent-based install tooling needs to know about the cluster that it will be
creating. Some or all of that new resource may be folded into ClusterDeployment
in the future, but in the immediate term, keeping it separate will enable much
quicker development.
In general once an API goes out, it's near impossible to change it. So before this becomes a real thing, we'll want to transition to the permanent style. During initial devel it makes sense to keep it separate.
1. User creates AssistedInstallCluster resource describing their desired cluster.
1. User creates ClusterDeployment describing a new cluster. The Platform section indicates that metal3 should be used to boot hosts with a live ISO. A new field references the AssistedInstallCluster resource.
1. User associates some BareMetalAssets with the ClusterDeployment.
1. Hive creates BareMetalHost resources corresponding to each BareMetalAsset, and uses them to boot the live ISO.
Not sure why this would be Hive, we'll need to discuss.
BMA is an ACM API. Hive is not presently aware of it. A dependency from ACM to Hive and from Hive back to ACM feels like trouble, so I'm having a hard time visualizing who or what should be responsible for this automated booting of BMA inventory.
Should BMA CRD move to Hive?
Would the Assisted Install operator be a good place to scan BMA inventory to auto-boot?
> Should BMA CRD move to Hive?
That is a key question. I agree with what I think you're implying, which is that this move would make sense if we expect hive to act on BareMetalAssets.
> Would the Assisted Install operator be a good place to scan BMA inventory to auto-boot?
One of the advantages of the design in this proposal is that it keeps the assisted installer components out of the business of booting hosts or needing any platform-specific capabilities. It's reasonable to think that other platforms besides bare metal may benefit from this pattern, particularly on-prem virtualization platforms for example. I don't think we would want to oblige assisted installer itself to know how to boot hosts for each platform.
bare metal hardware. It is necessary to bring those capabilities on-premise in
users' clusters for two purposes:

1. A cluster created by the Assisted Installer needs the ability to add workers
how is adding workers related to running on premise?
I think running on-premise is more useful when users want a private setup, compared to the current public endpoint.
Currently we do not have a solution for adding workers to an on-premise cluster without depending on cloud.redhat.com. This proposal adds that capability.
### Goals

* Expose the Assisted Installer's capabilities as a kubernetes-native API.
There was no context provided above as to why this is one of the goals. What's wrong with keeping the current API but just running a local instance of the entire stack? Components like hive can communicate with REST APIs directly; no need for a k8s API, I think?
OpenShift's infrastructure management is k8s-native, so agent-based installation through assisted installer needs to be exposed as k8s-native APIs. That's a design principle of OpenShift 4 that is outside the scope of this enhancement, though it wouldn't hurt to re-state it here.
As for hive, hive is not necessarily the only user of these APIs, so it would not be a suitable point at which to bridge the APIs from non-native to native.
Related to comment I just added below, but I think it's important to remember what Hive is. Hive is not a big bloated management application, it's an API for installing OpenShift at scale. I am presently thinking we should work together to get this into Hive and eliminate this middle API + operator.
* approved: boolean (used for starting installation)
* openshiftVersion: string
* baseDnsDomain: string
* clusterNetworkCidr: string
There are multiple fields that describe networking for the cluster, so maybe we combine them into one struct? If this is just an indication of all the fields that would be required and not the structure of the API, we can skip this comment.
* noProxy: string
* userManagedNetworking: boolean
* additionalNtpSource: string
* installConfigOverrides: string
What might be more ergonomic is allowing users to provide an install-config YAML object, with the expectation that fields defined in this API will replace/override the user-provided values.
I'm not sure what you mean - can you please elaborate?
The purpose of this API is to allow advanced users in specific cases to override a few values in the install config.
1. Get an ISO URL from assisted operator.
1. Use metal3's BareMetalHost to boot the live ISO on some hardware.
From the perspective of hive, I still feel some info is missing for the workflow here... could we provide one example workflow to understand the interactions?
- a user creates the cluster deployment and agent driven cluster installation object, there’s a bunch of duplicate things in these objects like clustername etc..
- hive then uses these to figure out things like what version of open shift is required and then requests for an iso from the service
- Hive then uses the iso and some bare metal api to boot these hosts
- Hive then performs the matching of the hosts that report back to fulfill the cluster deployment requirements and then triggers the install??
- hive then keeps track of status of installation
An high level workflow will help understand the next sections a little bit
We reworked the document (and the design) a bit, hopefully it answers your questions. Let us know if not.
Some high-level workflow descriptions like this would make excellent introductory text for the proposal section around line 66, and give the overall context for the rest of these details.
For modest clusters that do not have persistent storage available but need
day-2 ability to add nodes, the database will need to be reconstructable in
case its current state is lost. Thus the CRs will be the source of truth for
the specs, and the controller will upon statup first ensure that the DB
Suggested change:
- the specs, and the controller will upon statup first ensure that the DB
+ the specs, and the controller will upon startup first ensure that the DB
storage

For modest clusters that do not have persistent storage available but need
day-2 ability to add nodes, the database will need to be reconstructable in
In other places where we've done this sort of syncing we've found it to be a source of complexity and bugs. I strongly encourage you to require persistent storage for the service. If running without persistent storage is a requirement, then it needs to be listed as a goal above.
The idea was to require persistent storage for the multi-cluster case, which is more complex and for which there is already a precedent (ACM). However, it would be difficult to tell users that they need persistent storage to add workers.
Spec:
* clusterName: string
* sshPublicKey: string
* pullSecretName: string (reference to k8s secret)
I would expect to be able to reuse a secret like this across clusters.
resource.
1. A host boots the live ISO, the assisted agent starts running, and the agent contacts the assisted service to register its existence. Communication utilizes the existing non-k8s REST API. The agent walks through the validation and inspection workflow as it exists today.
1. The wrapping controller creates a new AssistedInstallAgent resource to be the k8s-native API for the agent.
1. CAPBM creates a new BareMetalHost resource, setting the status annotation based on inspection information from the prior step.
I would expect the wrapping controller to do this step. What causes CAPBM to know that it needs to happen? Why does it need to happen here, instead of later when power management is added?
1. User creates BareMetalAssets containing BMC credentials.
1. User creates AssistedInstallCluster resource describing their desired cluster.
1. User creates ClusterDeployment describing a new cluster. The Platform section indicates that metal3 should be used to boot hosts with a live ISO. A new field references the AssistedInstallCluster resource.
@dgoodwin points it out earlier, but this linkage makes me think the new AssistedInstallCluster should just be a new platform within ClusterDeployment. Users shouldn't have to know about metal3.
1. User creates ClusterDeployment describing a new cluster. The Platform section indicates that metal3 should be used to boot hosts with a live ISO. A new field references the AssistedInstallCluster resource.
1. User associates some BareMetalAssets with the ClusterDeployment.
1. Hive creates BareMetalHost resources corresponding to each BareMetalAsset, and uses them to boot the live ISO.
1. baremetal-operator uses redfish virtualmedia to boot the live ISO.
What validation are we going to do, and where, to ensure that users only give us BMC settings that support virtualmedia?
I don't see an answer to this question in the latest draft.
The current draft has departed significantly from this. For example, hive no longer is making BareMetalHosts, and there are no BareMetalAssets involved. To the question though, the latest draft does not require virtualmedia. It now says "baremetal-operator boots the live ISO on the BareMetalHost."
Scenario B: user maintains an inventory of hosts as BareMetalHosts. They are
directly used to boot live ISOs as part of agent-based provisioning. When a
user selects BMHs for cluster creation, they may be moved into a new
cluster-specific namespace by deleting and re-creating them.
Deleting and recreating host resources is relatively expensive, because it requires updating ironic, which include ensuring the host is deprovisioned. It seems more flexible to not care which namespace a host is in, so that users can configure their inventory how they prefer.
Moves CRD detail discussions to dedicated PRs where they are being implemented.
* install_completed_at: string
* controller_logs_collected_at: string
A first draft of this CRD has been
[implemented](https://github.com/openshift/assisted-service/blob/master/internal/controller/api/v1alpha1/image_types.go)
It's a bit odd to have code approved before the design.
As a matter of practicality, reviewing the field-level details of the API seems to be most productive when looking at the go structs. And since the approach is currently being proved out through code, it doesn't seem productive to try duplicating those details here and keeping them in sync. Open to suggestions though.
I have noticed a tendency of the edge teams to rush to implementation before the broader organization is aligned with the direction, resulting in re-implementing some changes as many as 3 times (see #555 for example). We may all work more efficiently if we focus on agreement here first.
* sourceState: [synced, combined, not_combined, error, variable, unreachable]
The details of this API are being discussed in a [pull
request](https://github.com/openshift/assisted-service/pull/861) that
implements the CRD.
Again, the process here is backwards.
Other platforms may benefit from similar capability, such as on-premise virtualization
platforms. Ideally the ability to match a Machine with an Agent will be delivered
so that it can be integrated into multiple machine-api providers. Each platform would
perform this workflow:
So (potentially) every cluster-api-provider implementation would have 2 workflows? One for the IPI process it uses today, and another for this new agent-based installation? Or we would eventually replace the IPI process with the new one?
Each machine-api provider could either A) add this workflow, or B) replace its existing workflow with this workflow. I think it'll be a per-provider decision on whether either option makes sense.
The host CRD has a field called "Role" that defaults to unset, and it must be
set prior to provisioning to one of "master", "worker", or "auto-assign". The
act of assigning a role and a cluster to a host will implicitly approve it to
join that cluster with the specified role.
This still allows any arbitrary host to contact the service in the cluster that manages the agents, right? Are there any security implications from that? Can a user of that API learn anything about the cluster that we wouldn't want an unknown actor to know?
The API the agent reports to will require authentication, and that authentication will be baked into the live ISO. I don't think the agent can learn any information about anything else through that API today: @avishayt is that right?
Did we find the answer to the security question here?
implementation details that can be removed in the future. While continuing to
run a separate database and continuing to utilize parts of the existing REST
API are not ideal for an in-cluster service, this approach is the only
practical way to deliver a working solution in a short time frame.
^^ 👍
I have modified the "API Versions and Potential Change" section to express our intent to use "beta" and "v1" APIs. @avishayt @dgoodwin @markmc @romfreiman
Added some feedback regarding our use case
means.
* a list of SSH public keys that will be added to all agents for use in debugging.
* pull secret reference
* a label selector for Agents. This is how Agent resources are identified as
Label selectors can allow a resource to match multiple queries. Do we want to allow an agent to match multiple environments? Would we be better of making the link between the agent and the environment explicit by adding a reference to the agent API pointing to an InstallEnv?
Yes, I don't think there's any late binding or choose-from-pool functionality here; it makes sense that the Agent would reference its InstallEnv explicitly.
An explicit reference on the Agent could be useful, but I don't know of a direct use case for it yet. We could add it later potentially, or now if there is a use case.
While the label selector can potentially behave as you describe, the same would apply to a Deployment, MachineSet, or other similar resources.
Typically when you create a Deployment, you make a simple label selector and put the corresponding label in the Pod template a few lines below. The same approach would apply here and should make it simple to avoid conflicts.
It seems to me that a controller replicating a resource using a template has very different needs from this case. The InstallEnv controller isn't doing anything with agents, and we just want to establish the relationship? And the agent labels you set on InstallEnv are really for the ClusterDeployment controller to identify a set of agents?
The InstallEnv controller creates the ISO that is basically a template for Agents. I see it as very similar in that way. The one and only place to define an Agent is the InstallEnv spec. Based on that, utilizing an ISO and some boot process, n number of Agents get stamped out.
Generally if a user or client wants to do a list query with some selector, a label is the way to do that. InstallEnv is a well-defined grouping for Agents and part of their identity, so a label referencing the InstallEnv seems to be the right way to reflect that relationship: `kubectl get Agents -l 'installenv=foo'` or similar.
We haven't dug into reporting details in this proposal, but there is value in reporting on an InstallEnv's aggregate Agent status, including how many Agents were booted from an out-of-date ISO for example. We could imagine future actions that may be taken on an Agent based on changes to its InstallEnv, though that's not in-scope for this enhancement.
Another related detail not mentioned but that's probably implied is that an Agent should likely have an owner reference to its InstallEnv. Agent deletion in general may still have some open questions to work out, but I think it will make sense to garbage-collect them if their InstallEnv gets deleted. In any case, the owner reference may serve the role of the explicit reference you're looking for.
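The garbage-collection idea above hinges on the Agent carrying an ownerReference back to its InstallEnv. A minimal sketch of building such a reference follows; the field names mirror the standard Kubernetes ownerReference shape, but the group/version string and helper name are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OwnerReference mirrors the standard Kubernetes ownerReference fields.
type OwnerReference struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	UID        string `json:"uid"`
}

// agentOwnerRef builds the hypothetical reference an Agent would carry
// so the garbage collector deletes it when its InstallEnv is deleted.
func agentOwnerRef(envName, envUID string) OwnerReference {
	return OwnerReference{
		APIVersion: "example.openshift.io/v1alpha1", // illustrative group/version
		Kind:       "InstallEnv",
		Name:       envName,
		UID:        envUID,
	}
}

func main() {
	ref := agentOwnerRef("my-installenv", "1234-abcd")
	out, _ := json.Marshal(ref)
	fmt.Println(string(out))
}
```

With this in place, deleting the InstallEnv would cascade to its Agents through the normal Kubernetes garbage-collection machinery, which may serve as the explicit linkage discussed in this thread.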
The InstallEnv controller creates the ISO that is basically a template for Agents. I see it as very similar in that way. The one and only place to define an Agent is the InstallEnv spec. Based on that, utilizing an ISO and some boot process, n number of Agents get stamped out.
I think it's unlikely users will make that mental connection with the current design.
How would you feel about taking the next logical step with that analogy - putting the labels under an AgentTemplateSpec field, like `installEnv.spec.template.metadata.labels`? Could we imagine wanting to extend AgentTemplateSpec as we want to allow the user to specify additional stuff that would show up in the created Agent resources?
Generally if a user or client wants to do a list query with some selector, a label is the way to do that. InstallEnv is a well-defined grouping for Agents and part of their identity, so a label referencing the InstallEnv seems to be the right way to reflect that relationship.
kubectl get Agents -l 'installenv=foo'
or similar. We haven't dug into reporting details in this proposal, but there is value in reporting on an InstallEnv's aggregate Agent status, including how many Agents were booted from an out-of-date ISO for example. We could imagine future actions that may be taken on an Agent based on changes to its InstallEnv, though that's not in-scope for this enhancement.
Both of those seem compelling to me, thanks
I like that thinking, but I'm not able to come up with anything other than labels that we would want to templatize for the Agent. (Also note that labels are not part of the Agent's Spec, but the Metadata.) Most of what the InstallEnv is providing as input to Agent creation is data that's required to run the live ISO, which of course precedes the Agent resource. Maybe there's a use case for adding custom annotations that are meaningful to the user, in which case we could templatize the Agent's metadata. Imagine the ReplicaSet template without the `spec` part. But the annotation use case may or may not be terribly useful.
Right now, this is what the labels field looks like on the InstallEnv Spec:
// AgentLabels lists labels to apply to Agents that are associated with this InstallEnv upon
// the creation of the Agents.
AgentLabels map[string]string `json:"agentLabels,omitempty"`
Another option is we could try to name this field more clearly. WDYT?
I like the metadata template approach, if you think that can work well
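A rough sketch of what the metadata-template approach could look like as Go types, extrapolating from the `AgentLabels` field quoted earlier in this thread; all type and field names here are hypothetical, not the actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectMetaTemplate holds only the metadata fields we would want to
// stamp onto created Agents (labels, possibly annotations).
type ObjectMetaTemplate struct {
	Labels      map[string]string `json:"labels,omitempty"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// AgentTemplateSpec is the ReplicaSet-style template, minus the spec part,
// since the InstallEnv only templatizes Agent metadata.
type AgentTemplateSpec struct {
	Metadata ObjectMetaTemplate `json:"metadata,omitempty"`
}

// InstallEnvSpec shows where the template would sit, giving users the
// familiar installEnv.spec.template.metadata.labels path.
type InstallEnvSpec struct {
	Template AgentTemplateSpec `json:"template,omitempty"`
}

// buildSpec assembles an example spec labeling Agents with their InstallEnv.
func buildSpec() InstallEnvSpec {
	return InstallEnvSpec{
		Template: AgentTemplateSpec{
			Metadata: ObjectMetaTemplate{
				Labels: map[string]string{"installenv": "foo"},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(buildSpec(), "", "  ")
	fmt.Println(string(out))
}
```

This keeps the one-and-only definition of an Agent's identity in the InstallEnv spec while making the template relationship explicit to users familiar with Deployments and ReplicaSets.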
1. The Agent's Role field in its spec is assigned a value if a corresponding label and value were present on its BareMetalHost. (only "worker" is supported for now on day 2)
1. The Agent is marked as Approved via a field in its Spec based on being recognized as running on the known BareMetalHost.
1. The Agent runs through the validation and inspection phases. The results are shown on the Agent's Status, and eventually a condition marks the Agent as "ready".
1. The Baremetal Agent Controller adds inspection data found on the Agent's Status to the BareMetalHost.
I assume this is so we get the IP address, etc.? Do we need that in the multi-cluster case, though? In a regular cluster we need the host, machine, and node to have matching IP details so the CSR for the node will be approved. What approves the CSR for nodes in this case, where the deployed cluster platform is not "baremetal" and the host resources are outside of the cluster that was deployed?
This scenario assumes that the spoke cluster does have platform "baremetal". A BareMetalHost resource gets created on the spoke cluster in part to enable the same CSR approval workflow that is already working on that platform.
So the CSR for the host in the spoke cluster will be approved by something in the hub cluster?
The CSR for the host in the spoke cluster will be approved in the spoke cluster by the same component and mechanism that approves them today for the baremetal platform. Creating the BareMetalHost and a corresponding Machine with correct details in the spoke cluster should make that work. Unless there's a detail we're missing?
There isn't anything in this section talking explicitly about data ending up in the spoke cluster, so it wasn't clear that was the case. Maybe that's all part of "installation of that host begins" on line 420?
Yes, that's one of the things assisted does as part of its workflow. We could dig up the details if necessary, but I don't think those details are central to this proposal.
Lots of good discussion around the comments from @yrobla and @dhellmann - let's get the comment threads resolved with whatever changes are needed to incorporate that feedback.
Thanks for your patience capturing everyone's feedback, Michael. I think we're in a good position to merge this now. /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: markmc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/priority important-soon