
WINC-637: Set cluster-wide proxy environment variables on Windows instances #1536

Merged: 6 commits into openshift:master on Jul 12, 2023

Conversation

@saifshaikh48 (Contributor) commented Mar 31, 2023

This PR expands the services ConfigMap data schema and the expected services ConfigMap to hold a new field: a list of name/value pairs representing environment variables to be set on Windows instances. If the cluster-wide proxy variables change at any point, the ConfigMap is re-created to hold the new values. It also introduces functionality in the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on the Windows nodes. If there is a discrepancy, WICD updates the values and signals WMCO to restart the instance so that all OpenShift-managed Windows services pick up the changes.
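
A rough sketch of what the new field could look like in the services ConfigMap data schema (the type and field names below are illustrative, not the exact schema that was merged):

// envVar is a name/value pair to be set as a system-level environment variable on the instance.
type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// data sketches the expanded services ConfigMap payload; the pre-existing services list
// is elided here and only the new field is shown.
type data struct {
	// Services []service `json:"services"` // existing field, unchanged
	EnvironmentVars []envVar `json:"environmentVars"`
}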

Summary of the proxy vars flow:

  • WICD controller is running
  • WICD reconciles the proxy vars. If it updates any of them, WICD applies a reboot-required annotation to the node (sketched after this list)
    • WICD then waits for the reboot to occur (polling Windows processes to see if they have inherited the correct env var values)
  • WMCO's node controller reacts to the reboot annotation being set by restarting the instance
    • the node controller then waits for WICD to remove the reboot annotation
    • after this, WICD can see the env vars are set properly at the Windows service level
  • WICD continues as it currently does, reconciling Windows services and applying the version annotation when the node is ready
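
A minimal sketch of the annotation half of this handshake, as referenced in the list above (the annotation key and helper name are assumptions of this sketch, not necessarily the identifiers used in the PR):

import (
	"context"
	"fmt"

	core "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rebootRequiredAnnotation is an illustrative key; the real one lives in the WMCO codebase.
const rebootRequiredAnnotation = "windowsmachineconfig.openshift.io/reboot-required"

// markRebootRequired annotates the node so that WMCO's node controller knows the underlying
// instance must be restarted for the new environment variables to take effect.
func markRebootRequired(ctx context.Context, c client.Client, node *core.Node) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"true"}}}`, rebootRequiredAnnotation))
	if err := c.Patch(ctx, node, client.RawPatch(types.MergePatchType, patch)); err != nil {
		return fmt.Errorf("unable to annotate node %s: %w", node.Name, err)
	}
	return nil
}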

@openshift-ci-robot commented Mar 31, 2023

@saifshaikh48: This pull request references WINC-637 which is a valid jira issue.

In response to this:

TODO: testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type) on Mar 31, 2023
openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Mar 31, 2023
openshift-ci bot commented Mar 31, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

saifshaikh48 force-pushed the set-proxy-vars branch 3 times, most recently from b3f5fb5 to 6201d6b on April 4, 2023 at 20:42
@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

1 similar comment
@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 7, 2023
openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jun 5, 2023
@mansikulkarni96 (Member):

/test-aws-e2e-operator

@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@mansikulkarni96 (Member):

/test aws-e2e-operator

mansikulkarni96 force-pushed the set-proxy-vars branch 4 times, most recently from 9bd4389 to 4642e92 on June 8, 2023 at 14:57
@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@mansikulkarni96 (Member) commented Jun 8, 2023

/retitle WINC-637: Set cluster-wide proxy environment variables on Windows instances

openshift-ci bot changed the title from "[WIP] WINC-637: Set cluster-wide proxy environment variables on Windows instances" to "WINC-637: Set cluster-wide proxy environment variables on Windows instances" on Jun 8, 2023
@openshift-ci-robot commented Jun 8, 2023

@saifshaikh48: This pull request references WINC-637 which is a valid jira issue.

In response to this:

This PR expands the services ConfigMap data schema and the expected services ConfigMap to hold a new field: a list of name/value pairs representing environment variables to be set on Windows instances. If the cluster-wide proxy variables change at any point, the ConfigMap is re-created to hold the new values. It also introduces functionality in the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on the Windows nodes. If there is a discrepancy, WICD updates the values and restarts the instance to allow all OpenShift-managed Windows services to pick up the changes.


@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@aravindhp (Contributor) left a comment:

Thanks for working on this, @saifshaikh48 @mansikulkarni96. Please address my comments.

@@ -34,14 +34,14 @@ const (
)

var (
// AcceptableProxyVars is a list of the supported proxy variables
AcceptableProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}
Contributor:

This should just be ProxyVars. The fact that it is here implies acceptance. Or if you are having naming issues, SupportedProxyVars.

// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
func getProxyVars() map[string]string {
proxyVarsMap := make(map[string]string, 3)
Contributor:

Any reason to move away from proxyVars? Either way please come up with a name without Map.

for _, envVar := range proxyEnvVars {
// read settings from the WMCO container's environment
// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
Contributor:

Suggested change
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
// any implies a proxy is enabled, as OLM would have injected them into the operator spec. Returns an empty map otherwise.
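
For context, a minimal sketch of a function matching the doc comment above, assuming the SupportedProxyVars rename suggested earlier (the merged code may differ):

import "os"

// SupportedProxyVars mirrors the proxy variable list shown in the diff above.
var SupportedProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}

// getProxyVars reads the proxy settings OLM injects into the WMCO container's environment.
// An empty map means no cluster-wide proxy is configured.
func getProxyVars() map[string]string {
	proxyVars := make(map[string]string, len(SupportedProxyVars))
	for _, name := range SupportedProxyVars {
		if value := os.Getenv(name); value != "" {
			proxyVars[name] = value
		}
	}
	return proxyVars
}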

pkg/servicescm/servicescm.go: two resolved review threads
if err != nil {
return fmt.Errorf("unable to open Windows system registry key %s: %w", systemEnvVarRegistryPath, err)
}
defer registryKey.Close()
Contributor:

I know we have been following this but what happens if the Close() fails? I suggest using a defer function, collecting the error and returning that so that we retry the operation.

restartRequired = true
}
}
if restartRequired {
Contributor:

How can we blindly restart the instance? What if there are workloads running?

Member:

@aravindhp, while I am working on resolving the other comments, I wanted to get a clear understanding of how to move forward with this. Would draining and cordoning the node, if this operation occurs after the node has been configured, be a good remediation for this issue?

Contributor:

Correct. We would need to cordon and drain the node before restarting, and then uncordon after the restart is complete.

Contributor:

This has the potential to take down all windows instances on the cluster at once. This needs to be addressed in some way.

Contributor Author:

Here's how we plan to handle this:

  • if a restart is required, check whether at least one Windows node is ready and schedulable
  • if yes, then cordon/drain/restart/uncordon
  • if no, retry the check until a timeout
    • if the timeout is reached, output a warning

Contributor:

Your plan just reduces the window where all instances could be down at once. The more I think about this, the less sure I am that WICD can make the decision to restart. You need an external entity to ensure only one node is restarted at a time.

Contributor:

To expand on this, say there are 3 Windows nodes in the cluster. WICD running on all of them hits this point. It checks if there are other Windows nodes in Ready state. The answer comes back as yes, and all of them get restarted.

Contributor Author:

I see, thanks for pointing that out. Here's how we propose solving this (a rough sketch follows the list):

  • introduce a "RebootRequired" node annotation
  • WICD continues to set the proxy vars on the instance. If it changes any values, it adds the RebootRequired annotation to the node object
  • In ensureInstanceIsUpToDate(), WMCO looks for this annotation
    • the benefit here is that this is the upgrade code path, which already executes one node at a time, preventing taking down all nodes at once
  • If the RebootRequired annotation exists on a node, WMCO will:
    • drain and cordon the node
    • restart the underlying instance
    • remove the annotation
    • uncordon the node
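
A rough sketch of the node controller reaction proposed in this list (the annotation key, struct fields, and safeReboot helper are assumptions of this sketch; the merged controller delegates to nodeconfig's SafeReboot):

import (
	"context"
	"fmt"

	core "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rebootRequiredAnnotation is an illustrative key, as in the earlier sketch.
const rebootRequiredAnnotation = "windowsmachineconfig.openshift.io/reboot-required"

type nodeReconciler struct {
	client client.Client
}

func (r *nodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &core.Node{}
	if err := r.client.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if _, present := node.Annotations[rebootRequiredAnnotation]; !present {
		// Nothing to do for nodes that are already up to date.
		return ctrl.Result{}, nil
	}
	// Drain and cordon the node, restart the underlying instance, remove the
	// annotation, then uncordon, per the proposal above.
	if err := r.safeReboot(ctx, node); err != nil {
		return ctrl.Result{}, fmt.Errorf("unable to safely reboot node %s: %w", node.Name, err)
	}
	return ctrl.Result{}, nil
}

// safeReboot stands in for the drain/cordon/restart/uncordon sequence described above.
func (r *nodeReconciler) safeReboot(ctx context.Context, node *core.Node) error {
	return nil // omitted in this sketch
}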

Contributor:

That approach in general is fine with me; however, it doesn't need to be shoehorned into ensureInstanceIsUpToDate. Instead, I feel you should introduce a node controller which reacts to changes in the node object.

return "", fmt.Errorf("error running SSH job: %w", err)
}
lines := strings.Split(out, "\n")
var valueLines []string
Contributor:

I see a common pattern here and in getServiceProxyEnvVars(). Can that be extracted into a helper function with more inline comments? It is unclear why you are performing some of the operations.
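
A minimal sketch of the sort of shared helper being requested here, assuming the common pattern is "run a command, split its output into lines, and keep the trimmed, non-empty ones" (the helper that actually landed may differ):

import "strings"

// outputLines splits raw command output into trimmed, non-empty lines so that callers
// do not repeat the same splitting and filtering logic.
func outputLines(out string) []string {
	var lines []string
	for _, line := range strings.Split(out, "\n") {
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			lines = append(lines, trimmed)
		}
	}
	return lines
}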

test/e2e/proxy_test.go: two resolved review threads (one outdated)
mansikulkarni96 force-pushed the set-proxy-vars branch 2 times, most recently from 4de291f to 00b00d9 on June 20, 2023 at 18:50
pkg/cluster/config.go: resolved review thread
}
if restartRequired {
// Cordon and drain the Node before we restart the instance
drainHelper := nodeconfig.NewDrainHelper(sc.k8sclientset, klog.NewKlogr())
Contributor Author:

Where does sc.k8sclientset get initialized? We likely have to edit the Options struct and the setDefaults() func.

@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

@aravindhp (Contributor):

/test unit

// Reconcile is part of the main kubernetes reconciliation loop which reads that state of the cluster for a
// CertificateSigningRequests object and aims to move the current state of the cluster closer to the desired state.
func (r *nodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, err error) {
_ = r.log.WithValues(NodeController, req.NamespacedName)
Contributor:

I assume you will revert this when squashing.

controllers/node_controller.go: two resolved review threads (outdated)
@saifshaikh48 (Contributor Author):

/unhold

openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jul 10, 2023
saifshaikh48 force-pushed the set-proxy-vars branch 2 times, most recently from ac765d7 to 384d0fc on July 10, 2023 at 17:44
return ctrl.Result{}, fmt.Errorf("failed to create new nodeconfig: %w", err)
}

if err := nc.SafeReboot(ctx, node); err != nil {
Contributor:

I'm realizing that passing in node here doesn't make much sense, considering Nodeconfig has node as part of its struct. Not asking for a change here; this isn't a pattern you're introducing, just something I'm realizing now.

Contributor:

@sebsoto thanks for catching this bad pattern. If we are doing this in other places, we should fix it sooner otherwise we will keep copying the same bad pattern.

Contributor:

To be clear. It doesn't need to be fixed as part of this PR.

Contributor Author:

Agreed, it's a bad pattern. The nodeconfig pkg is interesting in that the constructor does not initialize its node field, so in every exported function you'd need to check whether it is nil, make an API call to get the node, and set the nodeconfig object's node field. We can fix this by allowing an existing node to be passed into NewNodeConfig().
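
Illustrative only: one possible shape for passing an already-fetched node into the constructor so exported methods no longer need to nil-check and fetch it themselves (the parameter list and struct fields here are hypothetical, not the actual nodeconfig API):

import (
	"fmt"

	core "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeConfig struct {
	client client.Client
	node   *core.Node
	// other existing fields omitted
}

// NewNodeConfig requires callers to supply the node up front.
func NewNodeConfig(c client.Client, node *core.Node) (*nodeConfig, error) {
	if node == nil {
		return nil, fmt.Errorf("node must not be nil")
	}
	return &nodeConfig{client: c, node: node}, nil
}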

Comment on lines +298 to +307
for _, varName := range cluster.SupportedProxyVars {
cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", varName)
out, err := sc.psCmdRunner.Run(cmd)
if err != nil {
return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
}
if strings.TrimSpace(out) != envVars[varName] {
stillNeedsReboot = true
}
}
Contributor:

nit: this logic could be contained within its own function which would help readability of this larger function.

Contributor:

Good point @sebsoto. Ideally there would be a waitForEnvironmentVariablesToReconcile() function that calls the helper you are talking about.
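
A sketch of the check such a helper could wrap, with waitForEnvironmentVariablesToReconcile() polling it until it returns true (the commandRunner interface stands in for sc.psCmdRunner and is an assumption of this sketch):

import (
	"fmt"
	"strings"
)

// commandRunner abstracts the PowerShell command runner used against the instance.
type commandRunner interface {
	Run(cmd string) (string, error)
}

// processEnvVarsUpToDate reports whether newly spawned processes see the expected
// environment variable values, i.e. whether the reboot has taken effect.
func processEnvVarsUpToDate(runner commandRunner, expected map[string]string) (bool, error) {
	for name, want := range expected {
		cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", name)
		out, err := runner.Run(cmd)
		if err != nil {
			return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
		}
		if strings.TrimSpace(out) != want {
			return false, nil
		}
	}
	return true, nil
}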

Comment on lines 327 to 333
defer func() { // always close the registry key, without swallowing any error returned before the defer call
closeErr := registryKey.Close()
if closeErr != nil {
if err != nil {
err = fmt.Errorf("multiple errors: %w, %w", err, closeErr)
} else {
err = closeErr
}
}
}()
Contributor:

Why does this need to be done this way and not just logged?
In case of this error, the handle is being leaked no matter what; triggering another reconcile is not going to fix that.

Contributor Author:

another reconcile is not going to fix that

Do we know this for sure? It could be that the next time we reconcile, the handle closes successfully.

openshift-ci bot commented Jul 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, saifshaikh48

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This commit adds functionality to the WICD controller to ensure all the
ENV variables defined in the services ConfigMap are set as expected on
Windows nodes. If there is a discrepancy, WICD updates the values and
restarts the instance to allow all OpenShift managed Windows services
to pick up changes.

As part of this, we introduced a new node annotation, applied by WICD
if there is a change in environment variables. WMCO looks for this
annotation and handles the reboot, one node at a time, preventing all
instances from being taken down at once. The node controller was updated
to react to the setting of this annotation by rebooting the instance.

Co-authored by: Mansi Kulkarni <mankulka@redhat.com>
This commit adds tests to validate that proxy environment variables are
being properly set on each Windows node.

Co-authored by: Mansi Kulkarni <mankulka@redhat.com>
@sebsoto (Contributor) commented Jul 11, 2023

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Jul 11, 2023
saifshaikh48 marked this pull request as ready for review on July 12, 2023 at 13:55
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Jul 12, 2023
openshift-ci bot requested a review from sebsoto on July 12, 2023 at 13:56
@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 2410c2b and 2 for PR HEAD 76a512f in total

@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

openshift-ci bot commented Jul 12, 2023

@saifshaikh48: all tests passed!



openshift-merge-robot merged commit 3b1a8eb into openshift:master on Jul 12, 2023
16 checks passed
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.

6 participants