
WINC-637: Set cluster-wide proxy environment variables on Windows instances #1536

Merged: 6 commits into openshift:master on Jul 12, 2023

Conversation

@saifshaikh48 (Contributor) commented Mar 31, 2023

This PR expands the services ConfigMap data schema and the expected services ConfigMap to hold a new field: a list of name/value pairs representing environment variables to be set on Windows instances. If the cluster-wide proxy variables change at any point, the ConfigMap is re-created to hold the new values. It also introduces functionality in the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on the Windows nodes. If there is a discrepancy, WICD updates the values and signals WMCO to restart the instance so that all OpenShift-managed Windows services pick up the changes.
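
A rough sketch of what the new field could look like in the services ConfigMap data schema (the type and field names below are illustrative, not the exact schema that was merged):

// envVar is a name/value pair to be set as a system-level environment variable on the instance.
type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// data sketches the expanded services ConfigMap payload; the pre-existing services list
// is elided here and only the new field is shown.
type data struct {
	// Services []service `json:"services"` // existing field, unchanged
	EnvironmentVars []envVar `json:"environmentVars"`
}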

Summary of the proxy vars flow:

  • WICD controller is running
  • WICD reconciles the proxy vars. If it updates any of them, WICD applies a reboot-required annotation to the node (sketched after this list)
    • WICD then waits for the reboot to occur (polling Windows processes to see if they have inherited the correct env var values)
  • WMCO's node controller reacts to the reboot annotation being set by restarting the instance
    • the node controller then waits for WICD to remove the reboot annotation
    • after this, WICD can see the env vars are set properly at the Windows service level
  • WICD continues as it currently does, reconciling Windows services and applying the version annotation when the node is ready
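
A minimal sketch of the annotation half of this handshake, as referenced in the list above (the annotation key and helper name are assumptions of this sketch, not necessarily the identifiers used in the PR):

import (
	"context"
	"fmt"

	core "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rebootRequiredAnnotation is an illustrative key; the real one lives in the WMCO codebase.
const rebootRequiredAnnotation = "windowsmachineconfig.openshift.io/reboot-required"

// markRebootRequired annotates the node so that WMCO's node controller knows the underlying
// instance must be restarted for the new environment variables to take effect.
func markRebootRequired(ctx context.Context, c client.Client, node *core.Node) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"true"}}}`, rebootRequiredAnnotation))
	if err := c.Patch(ctx, node, client.RawPatch(types.MergePatchType, patch)); err != nil {
		return fmt.Errorf("unable to annotate node %s: %w", node.Name, err)
	}
	return nil
}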

@openshift-ci-robot commented Mar 31, 2023

@saifshaikh48: This pull request references WINC-637 which is a valid jira issue.

In response to this:

TODO: testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type) on Mar 31, 2023
openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Mar 31, 2023
openshift-ci bot commented Mar 31, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

saifshaikh48 force-pushed the set-proxy-vars branch 3 times, most recently from b3f5fb5 to 6201d6b on April 4, 2023 at 20:42
@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

1 similar comment
@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 7, 2023
openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jun 5, 2023
@mansikulkarni96 (Member):

/test-aws-e2e-operator

@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@mansikulkarni96 (Member):

/test aws-e2e-operator

mansikulkarni96 force-pushed the set-proxy-vars branch 4 times, most recently from 9bd4389 to 4642e92 on June 8, 2023 at 14:57
@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@mansikulkarni96 (Member) commented Jun 8, 2023

/retitle WINC-637: Set cluster-wide proxy environment variables on Windows instances

openshift-ci bot changed the title from "[WIP] WINC-637: Set cluster-wide proxy environment variables on Windows instances" to "WINC-637: Set cluster-wide proxy environment variables on Windows instances" on Jun 8, 2023
@openshift-ci-robot commented Jun 8, 2023

@saifshaikh48: This pull request references WINC-637 which is a valid jira issue.

In response to this:

This PR expands the services ConfigMap data schema and the expected services ConfigMap to hold a new field: a list of name/value pairs representing environment variables to be set on Windows instances. If the cluster-wide proxy variables change at any point, the ConfigMap is re-created to hold the new values. It also introduces functionality in the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on the Windows nodes. If there is a discrepancy, WICD updates the values and restarts the instance to allow all OpenShift-managed Windows services to pick up the changes.


@mansikulkarni96 (Member):

/test vsphere-proxy-e2e-operator

@aravindhp (Contributor) left a comment:

Thanks for working on this, @saifshaikh48 @mansikulkarni96. Please address my comments.

@@ -34,14 +34,14 @@ const (
)

var (
// AcceptableProxyVars is a list of the supported proxy variables
AcceptableProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}
Contributor:

This should just be ProxyVars. The fact that it is here implies acceptance. Or if you are having naming issues, SupportedProxyVars.

// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
func getProxyVars() map[string]string {
proxyVarsMap := make(map[string]string, 3)
Contributor:

Any reason to move away from proxyVars? Either way please come up with a name without Map.

for _, envVar := range proxyEnvVars {
// read settings from the WMCO container's environment
// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
Contributor:

Suggested change
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
// any implies a proxy is enabled, as OLM would have injected them into the operator spec. Returns an empty map otherwise.
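
For context, a minimal sketch of a function matching the doc comment above, assuming the SupportedProxyVars rename suggested earlier (the merged code may differ):

import "os"

// SupportedProxyVars mirrors the proxy variable list shown in the diff above.
var SupportedProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}

// getProxyVars reads the proxy settings OLM injects into the WMCO container's environment.
// An empty map means no cluster-wide proxy is configured.
func getProxyVars() map[string]string {
	proxyVars := make(map[string]string, len(SupportedProxyVars))
	for _, name := range SupportedProxyVars {
		if value := os.Getenv(name); value != "" {
			proxyVars[name] = value
		}
	}
	return proxyVars
}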

pkg/servicescm/servicescm.go: two resolved review threads
if err != nil {
return fmt.Errorf("unable to open Windows system registry key %s: %w", systemEnvVarRegistryPath, err)
}
defer registryKey.Close()
Contributor:

I know we have been following this but what happens if the Close() fails? I suggest using a defer function, collecting the error and returning that so that we retry the operation.

restartRequired = true
}
}
if restartRequired {
Contributor:

How can we blindly restart the instance? What if there are workloads running?

Member:

@aravindhp, while I am working on resolving the other comments, I wanted to get a clear understanding of how to move forward with this. Would draining and cordoning the node, if this operation occurs after the node has been configured, be a good remediation for this issue?

Contributor:

Correct. We would need to cordon and drain the node before restarting, and then uncordon after the restart is complete.

Contributor:

This has the potential to take down all windows instances on the cluster at once. This needs to be addressed in some way.

Contributor Author:

Here's how we plan to handle this:

  • if a restart is required, check whether at least one Windows node is ready and schedulable
  • if yes, then cordon/drain/restart/uncordon
  • if no, retry the check until a timeout
    • if the timeout is reached, output a warning

Contributor:

Your plan just reduces the window where all instances could be down at once. The more I think about this, the less sure I am that WICD can make the decision to restart. You need an external entity to ensure only one node is restarted at a time.

Contributor:

To expand on this, say there are 3 Windows nodes in the cluster. WICD running on all of them hits this point. It checks if there are other Windows nodes in Ready state. The answer comes back as yes, and all of them get restarted.

Contributor Author:

I see, thanks for pointing that out. Here's how we propose solving this (a rough sketch follows the list):

  • introduce a "RebootRequired" node annotation
  • WICD continues to set the proxy vars on the instance. If it changes any values, it adds the RebootRequired annotation to the node object
  • In ensureInstanceIsUpToDate(), WMCO looks for this annotation
    • the benefit here is that this is the upgrade code path, which already executes one node at a time, preventing taking down all nodes at once
  • If the RebootRequired annotation exists on a node, WMCO will:
    • drain and cordon the node
    • restart the underlying instance
    • remove the annotation
    • uncordon the node
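
A rough sketch of the node controller reaction proposed in this list (the annotation key, struct fields, and safeReboot helper are assumptions of this sketch; the merged controller delegates to nodeconfig's SafeReboot):

import (
	"context"
	"fmt"

	core "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rebootRequiredAnnotation is an illustrative key, as in the earlier sketch.
const rebootRequiredAnnotation = "windowsmachineconfig.openshift.io/reboot-required"

type nodeReconciler struct {
	client client.Client
}

func (r *nodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &core.Node{}
	if err := r.client.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if _, present := node.Annotations[rebootRequiredAnnotation]; !present {
		// Nothing to do for nodes that are already up to date.
		return ctrl.Result{}, nil
	}
	// Drain and cordon the node, restart the underlying instance, remove the
	// annotation, then uncordon, per the proposal above.
	if err := r.safeReboot(ctx, node); err != nil {
		return ctrl.Result{}, fmt.Errorf("unable to safely reboot node %s: %w", node.Name, err)
	}
	return ctrl.Result{}, nil
}

// safeReboot stands in for the drain/cordon/restart/uncordon sequence described above.
func (r *nodeReconciler) safeReboot(ctx context.Context, node *core.Node) error {
	return nil // omitted in this sketch
}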

Contributor:

That approach in general is fine with me; however, it doesn't need to be shoehorned into ensureInstanceIsUpToDate. Instead, I feel you should introduce a node controller which reacts to changes in the node object.

return "", fmt.Errorf("error running SSH job: %w", err)
}
lines := strings.Split(out, "\n")
var valueLines []string
Contributor:

I see a common pattern here and in getServiceProxyEnvVars(). Can that be extracted into a helper function with more inline comments? It is unclear why you are performing some of the operations.
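
A minimal sketch of the sort of shared helper being requested here, assuming the common pattern is "run a command, split its output into lines, and keep the trimmed, non-empty ones" (the helper that actually landed may differ):

import "strings"

// outputLines splits raw command output into trimmed, non-empty lines so that callers
// do not repeat the same splitting and filtering logic.
func outputLines(out string) []string {
	var lines []string
	for _, line := range strings.Split(out, "\n") {
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			lines = append(lines, trimmed)
		}
	}
	return lines
}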

test/e2e/proxy_test.go: two resolved review threads (one outdated)
mansikulkarni96 force-pushed the set-proxy-vars branch 2 times, most recently from 4de291f to 00b00d9 on June 20, 2023 at 18:50
pkg/cluster/config.go: resolved review thread
}
if restartRequired {
// Cordon and drain the Node before we restart the instance
drainHelper := nodeconfig.NewDrainHelper(sc.k8sclientset, klog.NewKlogr())
Contributor Author:

Where does sc.k8sclientset get initialized? We likely have to edit the Options struct and the setDefaults() func.

@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

@aravindhp (Contributor):

/test unit

// Reconcile is part of the main kubernetes reconciliation loop which reads that state of the cluster for a
// CertificateSigningRequests object and aims to move the current state of the cluster closer to the desired state.
func (r *nodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, err error) {
_ = r.log.WithValues(NodeController, req.NamespacedName)
Contributor:

I assume you will revert this when squashing.

controllers/node_controller.go: two resolved review threads (outdated)
@saifshaikh48 (Contributor Author):

/unhold

openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jul 10, 2023
saifshaikh48 force-pushed the set-proxy-vars branch 2 times, most recently from ac765d7 to 384d0fc on July 10, 2023 at 17:44
return ctrl.Result{}, fmt.Errorf("failed to create new nodeconfig: %w", err)
}

if err := nc.SafeReboot(ctx, node); err != nil {
Contributor:

I'm realizing that passing in node here doesn't make much sense, considering Nodeconfig has node as part of its struct. Not asking for a change here; this isn't a pattern you're introducing, just something I'm realizing now.

Contributor:

@sebsoto thanks for catching this bad pattern. If we are doing this in other places, we should fix it sooner otherwise we will keep copying the same bad pattern.

Contributor:

To be clear. It doesn't need to be fixed as part of this PR.

Contributor Author:

Agreed, it's a bad pattern. The nodeconfig pkg is interesting in that the constructor does not initialize its node field, so in every exported function you'd need to check whether it is nil, make an API call to get the node, and set the nodeconfig object's node field. We can fix this by allowing an existing node to be passed into NewNodeConfig().
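
Illustrative only: one possible shape for passing an already-fetched node into the constructor so exported methods no longer need to nil-check and fetch it themselves (the parameter list and struct fields here are hypothetical, not the actual nodeconfig API):

import (
	"fmt"

	core "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeConfig struct {
	client client.Client
	node   *core.Node
	// other existing fields omitted
}

// NewNodeConfig requires callers to supply the node up front.
func NewNodeConfig(c client.Client, node *core.Node) (*nodeConfig, error) {
	if node == nil {
		return nil, fmt.Errorf("node must not be nil")
	}
	return &nodeConfig{client: c, node: node}, nil
}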

Comment on lines +298 to +307
for _, varName := range cluster.SupportedProxyVars {
cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", varName)
out, err := sc.psCmdRunner.Run(cmd)
if err != nil {
return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
}
if strings.TrimSpace(out) != envVars[varName] {
stillNeedsReboot = true
}
}
Contributor:

nit: this logic could be contained within its own function which would help readability of this larger function.

Contributor:

Good point @sebsoto. Ideally there would be a waitForEnvironmentVariablesToReconcile() function that calls the helper you are talking about.
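
A sketch of the check such a helper could wrap, with waitForEnvironmentVariablesToReconcile() polling it until it returns true (the commandRunner interface stands in for sc.psCmdRunner and is an assumption of this sketch):

import (
	"fmt"
	"strings"
)

// commandRunner abstracts the PowerShell command runner used against the instance.
type commandRunner interface {
	Run(cmd string) (string, error)
}

// processEnvVarsUpToDate reports whether newly spawned processes see the expected
// environment variable values, i.e. whether the reboot has taken effect.
func processEnvVarsUpToDate(runner commandRunner, expected map[string]string) (bool, error) {
	for name, want := range expected {
		cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", name)
		out, err := runner.Run(cmd)
		if err != nil {
			return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
		}
		if strings.TrimSpace(out) != want {
			return false, nil
		}
	}
	return true, nil
}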

Comment on lines 327 to 333
defer func() { // always close the registry key, without swallowing any error returned before the defer call
closeErr := registryKey.Close()
if closeErr != nil {
if err != nil {
err = fmt.Errorf("multiple errors: %w, %w", err, closeErr)
} else {
err = closeErr
}
}
}()
Contributor:

Why does this need to be done this way and not just logged?
In case of this error, the handle is being leaked no matter what; triggering another reconcile is not going to fix that.

Contributor Author:

another reconcile is not going to fix that

Do we know this for sure? It could be that the next time we reconcile, the handle closes successfully.

openshift-ci bot commented Jul 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, saifshaikh48

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This commit adds functionality to the WICD controller to ensure all the
ENV variables defined in the services ConfigMap are set as expected on
Windows nodes. If there is a discrepancy, WICD updates the values and
restarts the instance to allow all OpenShift managed Windows services
to pick up changes.

As part of this, we introduced a new node annotation, applied by WICD
if there is a change in environment variables. WMCO looks for this
annotation and handles the reboot, one node at a time, preventing all
instances from being taken down at once. The node controller was updated
to react to the setting of this annotation by rebooting the instance.

Co-authored by: Mansi Kulkarni <mankulka@redhat.com>
This commit adds tests to validate that proxy environment variables are
being properly set on each Windows node.

Co-authored by: Mansi Kulkarni <mankulka@redhat.com>
@sebsoto (Contributor) commented Jul 11, 2023

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Jul 11, 2023
saifshaikh48 marked this pull request as ready for review on July 12, 2023 at 13:55
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Jul 12, 2023
openshift-ci bot requested a review from sebsoto on July 12, 2023 at 13:56
@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 2410c2b and 2 for PR HEAD 76a512f in total

@saifshaikh48 (Contributor Author):

/test vsphere-proxy-e2e-operator

openshift-ci bot commented Jul 12, 2023

@saifshaikh48: all tests passed!



openshift-merge-robot merged commit 3b1a8eb into openshift:master on Jul 12, 2023
16 checks passed
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.

6 participants