WINC-637: Set cluster-wide proxy environment variables on Windows instances #1536
Conversation
@saifshaikh48: This pull request references WINC-637 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Skipping CI for Draft Pull Request.
Force-pushed from b3f5fb5 to 6201d6b
/test vsphere-proxy-e2e-operator
1 similar comment
/test vsphere-proxy-e2e-operator
Force-pushed from 7e751eb to 953897e
/test-aws-e2e-operator
/test vsphere-proxy-e2e-operator
/test aws-e2e-operator
Force-pushed from 9bd4389 to 4642e92
/test vsphere-proxy-e2e-operator
/retitle WINC-637: Set cluster-wide proxy environment variables on Windows instances
@saifshaikh48: This pull request references WINC-637 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 4642e92 to 7871855
/test vsphere-proxy-e2e-operator
Thanks for working on this, @saifshaikh48 @mansikulkarni96. Please address my comments.
pkg/cluster/config.go
Outdated
@@ -34,14 +34,14 @@ const (
)

var (
	// AcceptableProxyVars is a list of the supported proxy variables
	AcceptableProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}
This should just be ProxyVars. The fact that it is here implies acceptance. Or if you are having naming issues, SupportedProxyVars.
pkg/cluster/config.go
Outdated
// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
func getProxyVars() map[string]string {
	proxyVarsMap := make(map[string]string, 3)
Any reason to move away from proxyVars? Either way please come up with a name without Map.
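For illustration, a minimal sketch of the helper after the suggested rename (assuming empty values are skipped; not necessarily the merged code):

```go
package cluster

import "os"

// SupportedProxyVars is the list of proxy variables the operator manages.
var SupportedProxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}

// getProxyVars returns the proxy variables set in the WMCO container's
// environment. OLM injects these into the operator deployment when a
// cluster-wide proxy is configured, so a non-empty result implies a proxy
// is enabled.
func getProxyVars() map[string]string {
	proxyVars := make(map[string]string, len(SupportedProxyVars))
	for _, name := range SupportedProxyVars {
		if value := os.Getenv(name); value != "" {
			proxyVars[name] = value
		}
	}
	return proxyVars
}
```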
pkg/cluster/config.go
Outdated
for _, envVar := range proxyEnvVars {
	// read settings from the WMCO container's environment
// getProxyVars returns a map of the proxy variables and values from the WMCO container's environment. The presence of
// any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
Suggested change:
- // any implies a proxy is enabled, since OLM would inject into the operator spec. Returns an empty map otherwise.
+ // any implies a proxy is enabled, as OLM would have injected them into the operator spec. Returns an empty map otherwise.
pkg/daemon/controller/controller.go
Outdated
if err != nil {
	return fmt.Errorf("unable to open Windows system registry key %s: %w", systemEnvVarRegistryPath, err)
}
defer registryKey.Close()
I know we have been following this but what happens if the Close() fails? I suggest using a defer function, collecting the error and returning that so that we retry the operation.
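A minimal sketch of the suggested pattern using a named return value (the function name here is illustrative; the version that landed, shown later in this review, combines both errors instead of dropping one):

```go
package controller

import (
	"fmt"

	"golang.org/x/sys/windows/registry"
)

// updateSystemEnvKey opens the system environment registry key and surfaces a
// failed Close() through the named return value so the caller can retry.
func updateSystemEnvKey(path string) (err error) {
	registryKey, err := registry.OpenKey(registry.LOCAL_MACHINE, path, registry.ALL_ACCESS)
	if err != nil {
		return fmt.Errorf("unable to open Windows system registry key %s: %w", path, err)
	}
	defer func() {
		// report a close failure only if no earlier error is already being returned
		if closeErr := registryKey.Close(); closeErr != nil && err == nil {
			err = closeErr
		}
	}()
	// ... set environment variable values under the key here ...
	return nil
}
```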
pkg/daemon/controller/controller.go
Outdated
		restartRequired = true
	}
}
if restartRequired {
How can we blindly restart the instance? What if there are workloads running?
@aravindhp while I am working on resolving the other comments, I wanted to get a clear understanding of how to move forward with this. Would draining and cordoning the node, if this operation occurs after a node has been configured, be a good remediation for this issue?
Correct. We would need to drain and cordon the node before restarting, and then uncordon after the restart is complete.
This has the potential to take down all Windows instances on the cluster at once. This needs to be addressed in some way.
Here's how we plan to handle this:
- if restart is required, check if at least 1 Windows node is ready+schedulable
  - if yes, then cordon/drain/restart/uncordon
  - if no, retry the check until a timeout
  - if the timeout is reached, output a warning
Your plan just reduces the window where all instances could be down at once. The more I think about this, the less sure I am that WICD can make the decision to restart. You need an external entity to ensure only one node is restarted at a time.
To expand on this, say there are 3 Windows nodes in the cluster. WICD running on all of them hits this point. It checks if there are other Windows nodes in Ready state. The answer comes back as yes, and all of them get restarted.
I see, thanks for pointing that out. Here's how we propose solving this (a sketch of the WMCO side follows the list):
- introduce a "RebootRequired" node annotation
- WICD continues to set the proxy vars on the instance. If it changes any values, it adds the RebootRequired annotation to the node object
- In ensureInstanceIsUpToDate(), WMCO looks for this annotation
  - the benefit here is that this is the upgrade code path, which executes one node at a time as is, preventing taking down all nodes at once
- If the RebootRequired annotation exists on a node, WMCO will:
  - drain and cordon the node
  - restart the underlying instance
  - remove the annotation
  - uncordon the node
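A rough sketch of that WMCO-side flow (the annotation key and helper method set are illustrative assumptions, not the merged implementation):

```go
package controllers

import (
	"context"

	core "k8s.io/api/core/v1"
)

// hypothetical key; the PR defines the real RebootRequired annotation
const rebootRequiredAnnotation = "windowsmachineconfig.openshift.io/reboot-required"

// nodeRebooter abstracts the cordon/drain/restart primitives WMCO already has.
type nodeRebooter interface {
	CordonAndDrain(ctx context.Context, node *core.Node) error
	RestartInstance(ctx context.Context, node *core.Node) error
	RemoveAnnotation(ctx context.Context, node *core.Node, key string) error
	Uncordon(ctx context.Context, node *core.Node) error
}

// ensureRebootIfRequired runs the annotation-driven flow: drain and cordon,
// restart the instance, clear the annotation, then uncordon the node.
func ensureRebootIfRequired(ctx context.Context, r nodeRebooter, node *core.Node) error {
	if _, present := node.Annotations[rebootRequiredAnnotation]; !present {
		return nil
	}
	if err := r.CordonAndDrain(ctx, node); err != nil {
		return err
	}
	if err := r.RestartInstance(ctx, node); err != nil {
		return err
	}
	if err := r.RemoveAnnotation(ctx, node, rebootRequiredAnnotation); err != nil {
		return err
	}
	return r.Uncordon(ctx, node)
}
```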
That approach in general is fine with me, however it doesn't need to be shoehorned into ensureInstanceIsUpToDate().
Instead, I feel you should introduce a node controller which reacts to changes in the node object.
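For instance, with controller-runtime the registration could look roughly like this (the predicate is an assumption, reusing the hypothetical annotation key from the sketch above; nodeReconciler is the reconciler type seen later in this review):

```go
package controllers

import (
	core "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// SetupWithManager registers the node controller so that node updates adding
// the reboot-required annotation trigger a reconcile.
func (r *nodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
	rebootRequired := predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			_, present := e.ObjectNew.GetAnnotations()[rebootRequiredAnnotation]
			return present
		},
	}
	return ctrl.NewControllerManagedBy(mgr).
		For(&core.Node{}).
		WithEventFilter(rebootRequired).
		Complete(r)
}
```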
test/e2e/proxy_test.go
Outdated
return "", fmt.Errorf("error running SSH job: %w", err) | ||
} | ||
lines := strings.Split(out, "\n") | ||
var valueLines []string |
I see a common pattern here and in getServiceProxyEnvVars(). Can that be extracted into a helper function with more inline comments? It is unclear why you are performing some of the operations.
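One possible shape for that helper (the name and exact filtering are assumptions based on the visible diff):

```go
package e2e

import "strings"

// parseCommandOutputLines splits raw SSH job output into trimmed, non-empty
// lines, since the output can include blank lines and surrounding whitespace
// around the values the test actually cares about.
func parseCommandOutputLines(out string) []string {
	var valueLines []string
	for _, line := range strings.Split(out, "\n") {
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			valueLines = append(valueLines, trimmed)
		}
	}
	return valueLines
}
```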
Force-pushed from 4de291f to 00b00d9
pkg/daemon/controller/controller.go
Outdated
}
if restartRequired {
	// Cordon and drain the Node before we restart the instance
	drainHelper := nodeconfig.NewDrainHelper(sc.k8sclientset, klog.NewKlogr())
where does sc.k8sclientset get initialized? we likely have to edit the Options struct and the setDefaults() func
Force-pushed from 00b00d9 to 2bc73e4
Force-pushed from 35c292e to 7d70ab5
/test vsphere-proxy-e2e-operator
/test unit
controllers/node_controller.go
Outdated
// Reconcile is part of the main kubernetes reconciliation loop which reads that state of the cluster for a
// CertificateSigningRequests object and aims to move the current state of the cluster closer to the desired state.
func (r *nodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, err error) {
	_ = r.log.WithValues(NodeController, req.NamespacedName)
I assume you will revert this when squashing.
Base implementation of a controller that watches Windows nodes.
Force-pushed from 7d70ab5 to f5bcfca
/unhold
Force-pushed from ac765d7 to 384d0fc
Force-pushed from 384d0fc to a4a37f0
	return ctrl.Result{}, fmt.Errorf("failed to create new nodeconfig: %w", err)
}

if err := nc.SafeReboot(ctx, node); err != nil {
I'm realizing that passing in node here doesn't make much sense, considering Nodeconfig has node as a part of its struct. Not asking for a change here, this isn't a pattern you're introducing, just something I'm realizing now.
@sebsoto thanks for catching this bad pattern. If we are doing this in other places, we should fix it sooner; otherwise we will keep copying the same bad pattern.
To be clear. It doesn't need to be fixed as part of this PR.
Agreed, it's a bad pattern. The nodeconfig pkg is interesting in that the constructor does not initialize its node field, so in every exported function you'd need a nil check and an API call to get the node, then set the nodeconfig object's node. We can fix this by allowing an existing node to be passed into NewNodeConfig().
for _, varName := range cluster.SupportedProxyVars {
	cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", varName)
	out, err := sc.psCmdRunner.Run(cmd)
	if err != nil {
		return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
	}
	if strings.TrimSpace(out) != envVars[varName] {
		stillNeedsReboot = true
	}
}
nit: this logic could be contained within its own function, which would help readability of this larger function.
Good point @sebsoto. Ideally there would be a waitForEnvironmentVariablesToReconcile() function that calls the helper you are talking about.
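A sketch of that shape, reusing identifiers from the diff above (the poll interval, timeout, receiver type, and helper name are assumptions):

```go
package controller

import (
	"fmt"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// processEnvVarsMatch reports whether each supported proxy variable in the
// current process environment matches its expected value.
func (sc *ServiceController) processEnvVarsMatch(expected map[string]string) (bool, error) {
	for _, varName := range cluster.SupportedProxyVars {
		cmd := fmt.Sprintf("[Environment]::GetEnvironmentVariable('%s', 'Process')", varName)
		out, err := sc.psCmdRunner.Run(cmd)
		if err != nil {
			return false, fmt.Errorf("error running PowerShell command %s with output %s: %w", cmd, out, err)
		}
		if strings.TrimSpace(out) != expected[varName] {
			return false, nil
		}
	}
	return true, nil
}

// waitForEnvironmentVariablesToReconcile polls until the process environment
// reflects the expected values or the timeout is reached.
func (sc *ServiceController) waitForEnvironmentVariablesToReconcile(expected map[string]string) error {
	return wait.PollImmediate(5*time.Second, time.Minute, func() (bool, error) {
		return sc.processEnvVarsMatch(expected)
	})
}
```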
pkg/daemon/controller/controller.go
Outdated
defer func() { // always close the registry key, without swallowing any error returned before the defer call
	closeErr := registryKey.Close()
	if closeErr != nil {
		if err != nil {
			err = fmt.Errorf("multiple errors: %w, %w", err, closeErr)
		} else {
			err = closeErr
		}
	}
}()
Why does this need to be done this way and not just logged? In the case of this error, the handle is being leaked no matter what; triggering another reconcile is not going to fix that.
> another reconcile is not going to fix that

Do we know this for sure? It could be that next time we reconcile, the handle can successfully close.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aravindhp, saifshaikh48 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This commit adds functionality to the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on Windows instances. If there is a discrepancy, WICD updates the values and restarts the instance to allow all OpenShift managed Windows services to pick up changes. As part of this, we introduced a new node annotation, applied by WICD when there is a change in environment variables. WMCO looks for this annotation and reboots the affected instances one at a time, preventing all instances from being taken down at once. The node controller was updated to react to the setting of this annotation by rebooting the instance. Co-authored-by: Mansi Kulkarni <mankulka@redhat.com>
This commit adds tests to validate that proxy environment variables are being properly set on each Windows node. Co-authored-by: Mansi Kulkarni <mankulka@redhat.com>
Force-pushed from a4a37f0 to 76a512f
/lgtm
/test vsphere-proxy-e2e-operator
@saifshaikh48: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This PR expands the services ConfigMap data schema and the expected services ConfigMap to hold a new field: a list of name/value pairs representing environment variables to be set on Windows instances. If the cluster-wide proxy variables change at any point, the ConfigMap is re-created to hold the new values. It also adds functionality to the WICD controller to ensure all the environment variables defined in the services ConfigMap are set as expected on Windows nodes. If there is a discrepancy, WICD updates the values and signals WMCO to restart the instance so that all OpenShift managed Windows services pick up the changes.
summary of proxy vars flow:
- OLM injects the cluster-wide proxy variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) into the WMCO operator spec
- WMCO reads them from its own container environment and publishes them in the services ConfigMap
- WICD sets the variables on each Windows instance; if any value changes, it annotates the node as requiring a reboot
- WMCO's node controller reacts to the annotation by cordoning and draining the node, restarting the instance, removing the annotation, and uncordoning, one node at a time