
MCO-1010: Add node disruption policies to MachineConfiguration CRD #1764

Merged

Conversation

yuqi-zhang
Contributor

@yuqi-zhang yuqi-zhang commented Feb 9, 2024

Draft API of openshift/enhancements#1525

This extension of the MachineConfiguration object allows users to specify how their MachineConfig object changes affect node disruption, allowing for non-drain and non-reboot updates for some config files. The MachineConfigController and MachineConfigDaemon will ultimately implement and execute on this object.

Also currently based on #1672 by @djoshy, since that should probably merge first.

Major questions:

  1. Re: lists, we currently want users to specify policies as follows:
userPolicies:
  - type: file
    value: "/etc/my-file"
    actions:
      - type: reload
        reload: my-service
      - type: daemon-reload
  - type: file
    value: "/etc/my-other-file"
    actions:
      - type: none
  - type: sshkey
    actions:
      - type: none

such that the list doesn't have unique keys (it would be type + value, but value isn't required for all types, so it cannot be made into a listMapKey). Maybe it would be better to make it more MachineConfig-like and have it instead be:

userPolicies:
  files:
    - path: "/etc/my-file"
      actions:
        - type: reload
          reload: my-service
        - type: daemon-reload
    - path: "/etc/my-other-file"
      actions:
        - type: none
  sshkey:
    actions:
      - type: none

For the actions we have the same issue, as we'd like to allow users to specify multiple actions with the same key:

  - type: file
    value: "/etc/my-other-file"
    actions:
      - type: reload
        reload: service-1
      - type: reload
        reload: service-2
      - type: daemon-reload

So there is not a unique key here either.

Currently the proposed setup is: actions are union discriminators (so if you reload, you have to specify the services to reload), type/value pairs are also validated, and otherwise there is no verification on the lists. The MachineConfigController will do the rest of the validation, so there is a split of responsibilities.

Alternatively, we could either:

  1. have no validation in the API, and have the MCC do all the validation
  2. have no validation in the MCC and have the API logic do all the heavy lifting

Controller validation and status:

Currently we list out clusterDefaultPolicies in the spec, which will be populated by the MCController every time it updates. Users should not modify that sub-spec, but since the daemon will only apply what's in status (i.e. the validated version), the current approach has no additional gating and just has the controller overwrite any changes before applying them to status.

Is there a way we would be able to let just the MCController (and not, say, someone using the MCC service account or other admin privileges) write that sub-object? It should be mutable, just not by anything other than the MCO/MCC.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 9, 2024
@openshift-ci-robot

openshift-ci-robot commented Feb 9, 2024

@yuqi-zhang: This pull request references MCO-1010 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

This will likely require us to modify the above approach, maybe to something like:

userPolicies:
  - action: reload
    reload: service-1
    file:
      - /etc/my-file
      - /etc/my-file-2
  - action: daemon-reload
    file:
      - /etc/my-file
  - action: none
    sshkey: ""

i.e. a reverse mapping (although some of the other issues might still apply).


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2024
Contributor

openshift-ci bot commented Feb 9, 2024

Hello @yuqi-zhang! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

Contributor

openshift-ci bot commented Feb 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 9, 2024

type NodeDisruptionPolicyStatus struct {
// clusterPolicies is a merge of cluster default and user provided node disruption policies.
// +optional
ClusterPolicies []NodeDisruptionPolicyConfig `json:"clusterPolicies"`
Contributor

status should use a different type because you're likely to grow different fields

Contributor Author

done

Type NodeDisruptionPolicyActionType `json:"type"`
// reload specifies the service to reload, only valid if type is reload
// +optional
Reload *string `json:"reload,omitempty"`
Contributor

probably want a struct


should this be a list, so that a user can specify multiple service names that need to be reloaded for a NodeDisruptionPolicyType?

Contributor Author

done. I opted to keep them separate, so a user can theoretically order them to apply in sequence with other actions.

Value string `json:"value"`
// actions represents the series of commands to be executed on changes to the corresponding type and value
// +kubebuilder:validation:Required
Actions []NodeDisruptionPolicyAction `json:"actions"`
Contributor

this should probably be atomic

Contributor Author

added

Comment on lines 45 to 47
// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster.
Contributor

We need to make this very very clear that this doesn't apply to cluster version upgrades

Contributor Author

added a line

// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster.
// +openshift:enable:FeatureSets=TechPreviewNoUpgrade
Contributor

You'll need CustomNoUpgrade in here too

Contributor Author

done


// nodeDisruptionPolicyStatus status reflects what the latest cluster-validated policies are,
// and will be used by the Machine Config Daemon during future node updates.
// +openshift:enable:FeatureSets=TechPreviewNoUpgrade
Contributor

CustomNoUpgrade in here too please

Contributor Author

done

// Service represents a NodeDisruption policy that is in effect for changes to a service.
Service NodeDisruptionPolicyType = "service"

// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.


Suggested change
// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.
// kernelArgument represents a NodeDisruption policy that is in effect for changes to a kernel argument.

Contributor Author

removed kargs as per review on the enhancement

Service NodeDisruptionPolicyType = "service"

// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.
KernelArgument NodeDisruptionPolicyType = "kernelArgument"


The MCO doesn't have a way today to apply kargs without a reboot. Are we thinking of adding a way to apply these kargs live? If not, perhaps we shouldn't let users skip drain/reboot for kernel arguments in the initial implementation.

Contributor Author

removed

None NodeDisruptionPolicyActionType = "none"

// Special represents an action that is internal to the MCO, and is not allowed in user defined NodeDisruption policies.
Special NodeDisruptionPolicyActionType = "special"


Wondering if we have something in mind that MCO will utilize in the beginning?

Contributor Author

I'm open to discussion on this; I don't intend this to be user-specifiable.


Since removing API later is hard, maybe we can add this or a similar field later on, depending on the use case?

Contributor Author

The use case right now is that we would like to display the current cluster defaults, so the user can see (and optionally override) the image registry logic. The initial set we decided on is to have this keyword and a description (that you can see via oc describe, etc.) to explain what the default is.

Happy to change this if there's a better route though! Mostly just wanted to be able to show the current cluster settings.

This is also tech-preview-only right now, so we should be able to change it before GA if needed.


Thanks for the context. Let's keep it then, and we can update it if needed while the API is in TechPreview. Maybe MCOInternal, MCODefault, etc. would provide more clarity.

// +unionDiscriminator
// +kubebuilder:validation:Required
Type NodeDisruptionPolicyActionType `json:"type"`
// reload specifies the service to reload, only valid if type is reload


Admins may also benefit from a service restart option, as I believe sometimes just reloading a service is not enough to apply the config changes.

Contributor Author

added, thanks!

@yuqi-zhang yuqi-zhang marked this pull request as ready for review February 26, 2024 23:14
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2024

// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster. This is NOT intended
Contributor

I might just change the way this is worded; "intended" to me sounds like we don't want you to do that, but it might work, perhaps:

// This configuration has no effect on cluster upgrades which will still incur node disruption where required.

A node upgrade always reboots, right? So maybe we don't even need the "where required" on that.

Contributor Author

Will change. A cluster upgrade in practice almost always has an associated OS update, even if it's just a minor package bump. In all of OCP 4 I think I've seen one z-stream bump not come with an associated update, so the overlap is essentially zero.

Comment on lines 176 to 178
// clusterDefaultPolicies is managed by the Machine Config Operator, and reflects the latest cluster defaults
// +optional
ClusterDefaultPolicies NodeDisruptionPolicyConfig `json:"clusterDefaultPolicies"`
Contributor

You probably don't want this in spec if it's expected to be set by the operator rather than the user right?

Contributor Author

The current design has it such that there is:

  1. a user spec, for the user to set
  2. a cluster spec, for the MCO to set, which may change depending on version
  3. a status which is the validated merge of the two

The expectation being that it is easier for the user to see what the cluster defaults are.

If we remove this from the spec, then the status will hold the cluster defaults + anything overridden by the user, which... I guess still mostly achieves our goal? If that's more aligned with API conventions, I'm happy to do it that way instead.

// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
Contributor

By convention, this should be the first field in the status struct

Contributor Author

Related to the discussion we had on the call, I think this makes more sense living in the MachineConfigurationStatus object directly, which currently has the StaticPodOperatorStatus that we were thinking of deprecating. Could you point me to a resource on how that should be done?

}

// NodeDisruptionPolicyConfig is the overall spec definition for files/units/sshkeys
type NodeDisruptionPolicyConfig struct {
Contributor

For discoverability on this API, we should drop omitempty, that way the API will be written out by the installer

nodeDisruptionPolicy:
  userPolicy:
    files: []
    units: []
    sshKey: ...

Comment on lines 172 to 174
// userPolicies define user-provided node disruption policies
// +optional
UserPolicies NodeDisruptionPolicyConfig `json:"userPolicies"`
Contributor

If we do not have cluster scoped default policies in here, wondering if we need this extra indentation. What possible future options might live alongside these?

Contributor Author

Off the top of my head, any potential future extensions would be with the new image-based workflow. An example could be: I specify some additional non-MachineConfig content (image-based) and specify something here to indicate how it can be applied.

But maybe for now we can just collapse this to the higher level policy and add a new field when that comes into play?


// ReloadService allows the user to specify the services to be reloaded
type ReloadService struct {
// ServiceName is the full name (e.g. crio.service) of the service to be reloaded
Contributor

Should be lower case at the start of the godoc. Similar questions above about service and the godoc here


// RestartService allows the user to specify the services to be restarted
type RestartService struct {
// ServiceName is the full name (e.g. crio.service) of the service to be restarted
Contributor

Ditto for reload

}

// NodeDisruptionPolicyActionType is a string enum used in a NodeDisruptionPolicyAction object. They describe an action to be performed.
// +kubebuilder:validation:Enum:="reboot";"drain";"reload";"restart";"daemon-reload";"none";"special"
Contributor

PascalCase these values please
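Applied here, PascalCase enum values might look like the following sketch (constant names are illustrative; the final PR may differ):

```go
// NodeDisruptionPolicyActionType describes an action to be performed.
// +kubebuilder:validation:Enum=Reboot;Drain;Reload;Restart;DaemonReload;None;Special
type NodeDisruptionPolicyActionType string

const (
	// RebootAction requires a full node reboot.
	RebootAction NodeDisruptionPolicyActionType = "Reboot"
	// DrainAction requires draining the node.
	DrainAction NodeDisruptionPolicyActionType = "Drain"
	// ReloadAction reloads the named service.
	ReloadAction NodeDisruptionPolicyActionType = "Reload"
	// RestartAction restarts the named service.
	RestartAction NodeDisruptionPolicyActionType = "Restart"
	// DaemonReloadAction runs systemctl daemon-reload.
	DaemonReloadAction NodeDisruptionPolicyActionType = "DaemonReload"
	// NoneAction means no handling is required by the MCO.
	NoneAction NodeDisruptionPolicyActionType = "None"
	// SpecialAction is internal to the MCO and not user-settable.
	SpecialAction NodeDisruptionPolicyActionType = "Special"
)
```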

Restart NodeDisruptionPolicyActionType = "restart"

// DaemonReload represents an action that TBD
DaemonReload NodeDisruptionPolicyActionType = "daemon-reload"
Contributor

Should be DaemonReload

// None represents an action that no handling is required by the MCO.
None NodeDisruptionPolicyActionType = "none"

// Special represents an action that is internal to the MCO, and is not allowed in user defined NodeDisruption policies.
Contributor

Need to make sure the spec for user defined does not allow this

@yuqi-zhang
Contributor Author

Fixed based on comments and rebased on master

Path string `json:"path"`
// actions represents the series of commands to be executed on changes to the file at
// corresponding file path. This is an atomic list, which will be validated by
// the MachineConfigOperator, with any conflicts reflecting as an error in the
Contributor

I'd be interested to see what a conflict would actually look like in this list; could you provide an example?

Contributor Author

Sure, this would be more around: if the user sets reboot or none, they shouldn't be able to set other options. And they shouldn't be able to set special (but this point isn't really relevant, since we won't expose it in spec anymore after removing cluster defaults).

So I guess we could instead have a validation for list items: if reboot or none exists, it needs to be the only entry. The other consideration is that if we want to modify rules in the future and expand on the actions, we'll probably need to have some validation in the MCO? Although that's a bit vague.
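For illustration, a conflicting list in the sense described above might look like this (hypothetical values; the final casing of the enum was still under discussion):

```yaml
actions:
  - type: Reboot          # a reboot supersedes everything else...
  - type: Reload          # ...so a service reload in the same list is a conflict
    reload:
      serviceName: crio.service
```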

Contributor

Discussed in Slack: we can use CEL validation to enforce that reboot or none are singletons, something along the lines of the ternary below.

+kubebuilder:validation:XValidation:rule="self.exists(x, x == 'Reboot') ? size(self) == 1 : true"
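Extending that ternary to cover None as well might look like the following sketch (assumptions: the list items are structs with a `type` field, so the rule matches `x.type` rather than `x`; the exact marker syntax may differ in the final PR):

```go
// actions is an atomic list; Reboot and None must be the only entry when used.
// +kubebuilder:validation:XValidation:rule="self.exists(x, x.type == 'Reboot') ? size(self) == 1 : true",message="Reboot action can only be specified standalone"
// +kubebuilder:validation:XValidation:rule="self.exists(x, x.type == 'None') ? size(self) == 1 : true",message="None action can only be specified standalone"
// +listType=atomic
Actions []NodeDisruptionPolicyAction `json:"actions"`
```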

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from d583e3b to e59c247 Compare March 8, 2024 02:47
@yuqi-zhang
Contributor Author

yuqi-zhang commented Mar 8, 2024

Pushed some changes based on the discussion above; now only the user spec should exist in spec, and status will reflect cluster defaults.

Added some more validation, and also removed the original StaticPodOperatorStatus. Will open a separate PR to tombstone those fields, so this might not pass CI atm.

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from e59c247 to f455c3e Compare March 8, 2024 06:42
@djoshy djoshy force-pushed the add-node-disruption-policies branch from e726779 to 4751043 Compare March 17, 2024 12:45
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2024
@djoshy djoshy force-pushed the add-node-disruption-policies branch from 4751043 to 55178eb Compare March 17, 2024 12:53
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2024
yuqi-zhang (Contributor, Author) commented Mar 20, 2024

@JoelSpeed I believe @djoshy has covered the latest round of reviews (thanks!). There's just the open question about tombstoning the static operator status. Could this PR merge without that, or would that have to go in first?

FWIW, I did another brief check and nothing currently depends on this (the CRD itself was added to the MCO in 4.15, but no objects exist and no other repo refers to it outside of the API and the MCO).

(The latest update was to remove a test file that was accidentally added.)

hexfusion (Contributor) commented:

/cc @hexfusion

@openshift-ci openshift-ci bot requested a review from hexfusion March 21, 2024 16:41
Comment on lines 182 to 185
- lastTransitionTime
- message
- reason
- status
Contributor:

Change to required fields; has this CRD been included in the payload yet?

Contributor Author:

The only place this is currently referenced is the MCO code, where I think we technically bring in the CRD. (So I think it does exist in clusters 4.15 and up, but no objects exist.)

What is making this generated field required now, and is that an issue?

}

type MachineConfigurationStatus struct {
StaticPodOperatorStatus `json:",inline"`
Contributor:

Should we revert this for now and handle it separately? We can't merge this in its current hybrid state until the removal of the operator status is handled.

// NodeDisruptionPolicyFile is a file entry and corresponding actions to take
type NodeDisruptionPolicyFile struct {
// path is the file path to a file on disk managed through a MachineConfig.
// Actions specified will be applied when changes to the file at the path
Contributor:

Not resolved

Comment on lines 422 to 426
// +kubebuilder:validation:XValidation:rule=`self.matches('\\.(service|socket|device|mount|automount|swap|target|path|timer|snapshot|slice|scope)$')`, message="Invalid ${SERVICETYPE} in service name. Expected format is ${NAME}${SERVICETYPE}, where ${SERVICETYPE} must be one of \".service\", \".socket\", \".device\", \".mount\", \".automount\", \".swap\", \".target\", \".path\", \".timer\",\".snapshot\", \".slice\" or \".scope\"."
// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be at least 1 character long and can only consist of letters, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
ServiceName string `json:"serviceName"`
Contributor:

Since this is re-used several times, you could create a type alias for this and assign the validations to that instead; not blocking, though. Similar to how you have done the action types.

Contributor Author:

Ack, moved to a new type

actions:
- type: DaemonReload
- type: Reload
expectedError: "Reload is required when type is reload, and forbidden otherwise"
Contributor:

This is the wrong way around. Think about it from an end-user perspective: they see the field as lowercase reload and the type as uppercase Reload.

Suggested change
expectedError: "Reload is required when type is reload, and forbidden otherwise"
expectedError: "reload is required when type is Reload, and forbidden otherwise"

- type: Drain
- type: Restart
restart:
serviceName: a.b.c.d.e.snapshot
Contributor:

Nit: every file should end with a newline character.

// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be at least 1 character long and can only consist of letters, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
Name string `json:"name"`
Contributor:

NodeDisruptionPolicyServiceName?

// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be at least 1 character long and can only consist of letters, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
Name string `json:"name"`
Contributor:

NodeDisruptionPolicyServiceName?

}

// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Reload' ? has(self.reload) : !has(self.reload)",message="reload is required when type is Reload, and forbidden otherwise"
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="Restart is required when type is restart, and forbidden otherwise"
Contributor:

Suggested change
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="Restart is required when type is restart, and forbidden otherwise"
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="restart is required when type is Restart, and forbidden otherwise"

@JoelSpeed (Contributor) left a comment:

/lgtm

Will follow up separately on the static pod status removal; build out on this first and transition to metav1.Condition before we ship.

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 21, 2024
yuqi-zhang (Contributor, Author) commented:

/test verify

yuqi-zhang and others added 3 commits March 21, 2024 17:08
Add a new sub-spec/status to the MachineConfiguration operator object,
which will allow users to specify actions to take when small
MachineConfig updates happen to the cluster.

This will be behind a NodeDisruptionPolicy featuregate, and will be
managed and consumed by the Machine Config Operator in-cluster.
The new NodeDisruption status objects contain a "special" action
type that will only be used by the MCO's controller to indicate some
internal actions. They are not part of the NodeDisruptionPolicyConfig
object and cannot be set by the user.
 - don't remove staticpodoperatorstatus for now
 - update godocs to be more clear
 - add a type alias for serviceName
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2024
yuqi-zhang (Contributor, Author) commented:

Whoops, this was failing verify since it wasn't rebased on master.

@JoelSpeed (Contributor) left a comment:

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 22, 2024

openshift-ci bot commented Mar 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 23, 2024

@yuqi-zhang: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2252c7a into openshift:master Mar 23, 2024
18 checks passed
@openshift-bot commented:

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-config-api-container-v4.16.0-202403230015.p0.g2252c7a.assembly.stream.el9 for distgit ose-cluster-config-api.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects: none yet
Linked issues: none yet

9 participants