title: windows-node-egress-proxy
creation-date: 2023-02-16
last-updated: 2023-08-29

- Enhancement is implementable
- Design details are appropriately documented from clear requirements
- Test plan is defined
- Operational readiness criteria is defined
- Graduation criteria for dev preview, tech preview, GA
- User-facing documentation is created in openshift-docs
The goal of this enhancement proposal is to allow Windows nodes to consume and use the global egress proxy configuration when making external requests outside the cluster's internal network. OpenShift customers may require that external traffic pass through a proxy for security reasons, and Windows instances are no exception. A protocol already exists for publishing cluster-wide proxy settings and is consumed by various OpenShift components (Linux worker and infra nodes, the CVO, and OLM-managed operators), but Windows worker nodes do not currently consume or respect proxy settings. This effort closes that feature gap by making the Windows Machine Config Operator (WMCO) aware of cluster proxy settings at install time and reactive to changes during runtime.
The motivation here is to expand the Windows containers production use case, enabling users to add Windows nodes and run workloads easily and successfully in a proxy-enabled cluster. This is an extremely important ask for customer environments where Windows nodes need to pull images from registries secured behind the client's proxy server or make requests to off-cluster services, as well as for those that use a custom public key infrastructure.
- Create an automated mechanism for WMCO to consume global egress proxy config from existing platform resources, including:
- Proxy connection information
- Additional certificate authorities required to validate the proxy's certificate
- Configure the proxy settings in WMCO-managed components on Windows nodes (kubelet, containerd runtime)
- React to changes to the cluster-wide proxy settings during WMCO runtime
- Synchronize the `NO_PROXY`, `HTTP_PROXY`, and `HTTPS_PROXY` environment variables on Windows nodes with the cluster-wide proxy, in both the CMD and PowerShell shells
- Maintain normal functionality in non-proxied clusters
- First-class support/enablement of proxy utilization for user-provided applications
- Ingress and reverse proxy settings are out of scope
- Monitor cert expiration dates or automatically replace expired CAs in the cluster's trust bundle
- Windows workloads created on Windows nodes with the cluster-wide proxy enabled do not inherit proxy settings from the node. This matches the default behavior on the Linux side as well.
- PowerShell sessions do not inherit proxy settings by default on Windows nodes with the cluster-wide proxy enabled (see the Risks and Mitigations section for recommendations)
There are two major undertakings:
- Adding the proxy environment variables (`NO_PROXY`, `HTTP_PROXY`, and `HTTPS_PROXY`) to Windows nodes and WMCO-managed Windows services.
- Adding the proxy's trusted CA certificate bundle to each Windows instance's local trust store.
Since WMCO is a day 2 operator, it will pick up proxy settings during runtime, regardless of whether proxy settings were set at cluster install time or at some point during the cluster's lifetime. When the global proxy settings are updated, WMCO will react by:
- overriding proxy vars on the instance with the new values
- copying over the new trust bundle to Windows instances and updating each instance's local trust store (old certs should be removed)
All changes detailed in this enhancement proposal will be limited to the Windows Machine Config Operator and its sub-component, Windows Instance Config Daemon (WICD).
User stories can also be found within the node proxy epic: WINC-802
A cluster creator is a human user responsible for deploying a cluster. A cluster administrator is a human user responsible for managing cluster settings, including network egress policies.
There are 3 different workflows that affect the cluster-wide proxy use case.
- A cluster creator specifies global proxy settings at install time
- A cluster administrator introduces new global proxy settings during runtime in a proxy-less cluster
- A cluster administrator changes or removes existing global proxy settings during cluster runtime
The first scenario occurs through the cluster's `install-config.yaml`. The latter two scenarios occur by changing the `Proxy` object named `cluster` or by modifying certificates present in the referenced `trustedCA` ConfigMap.
In all cases, Windows nodes can be joined to the cluster after altering proxy settings, which would result in WMCO applying proxy settings during initial node configuration. In the latter 2 scenarios, Windows nodes may already exist in the cluster, in which case WMCO will react to the changes by updating the state of each instance.
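For reference, a minimal example of the `Proxy` resource a cluster administrator would edit in the latter two scenarios is shown below; the spec values are placeholders, not recommendations:

```yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  # Placeholder proxy endpoints and exclusions
  httpProxy: http://<username>:<pswd>@<ip>:<port>
  httpsProxy: https://<username>:<pswd>@<ip>:<port>
  noProxy: example.com
  # Name of a ConfigMap holding additional CA certificates
  trustedCA:
    name: user-ca-bundle
```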
The risks and mitigations are similar to those on the Linux side of the cluster-wide proxy. Although cluster infra resources already perform best-effort validation on the user-provided proxy URL scheme and CAs, a user could provide non-functional proxy settings/certs. These would be propagated to their Windows nodes and workloads, taking down existing application connectivity and preventing new Windows nodes from being bootstrapped.
If users use PowerShell as the default shell on their Windows nodes, there is a risk that outbound traffic from the PowerShell CLI will not go through the cluster-wide proxy by default. Although this does not affect WMCO/OpenShift's view of the node, it is different on-instance behavior than an admin would see using the CMD prompt. To mitigate this and enable the HTTP proxy by default for all PowerShell sessions on Windows nodes in a proxy-enabled cluster, the cluster administrator must create a default PowerShell profile script that reads the proxy environment variables maintained by WMCO and populates the `DefaultWebProxy` property.
The PowerShell profile file location for all users is `$PROFILE.AllUsersCurrentHost`, and the proxy settings can be updated with:

```powershell
[System.Net.WebRequest]::DefaultWebProxy = New-Object System.Net.WebProxy("<PROXY_URL>")
```

where `PROXY_URL` is the URI of the cluster-wide proxy.
Run the following commands on a Windows node with the cluster-wide proxy enabled to create a default profile for PowerShell sessions that reads the `HTTP_PROXY` environment variable maintained by WMCO and populates the `DefaultWebProxy` property:

```powershell
Set-Content -Path $PROFILE.AllUsersCurrentHost -Value '$proxyValue=[Environment]::GetEnvironmentVariable("HTTP_PROXY", "Process")' -Force
Add-Content -Path $PROFILE.AllUsersCurrentHost -Value '[System.Net.WebRequest]::DefaultWebProxy = New-Object System.Net.WebProxy("$proxyValue")' -Force
```
A similar approach can be used to set the proxy settings for HTTPS traffic and/or custom certificates; see the official Microsoft documentation. A sketch for HTTPS traffic follows.
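This is a minimal sketch, assuming `HTTPS_PROXY` is maintained by WMCO at the process level just like `HTTP_PROXY`, and that a single `DefaultWebProxy` endpoint is acceptable for HTTPS requests:

```powershell
# Hypothetical profile lines mirroring the HTTP_PROXY example above,
# pointing DefaultWebProxy at the HTTPS proxy endpoint instead.
Set-Content -Path $PROFILE.AllUsersCurrentHost -Value '$proxyValue=[Environment]::GetEnvironmentVariable("HTTPS_PROXY", "Process")' -Force
Add-Content -Path $PROFILE.AllUsersCurrentHost -Value '[System.Net.WebRequest]::DefaultWebProxy = New-Object System.Net.WebProxy("$proxyValue")' -Force
```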
The only drawbacks are the increased complexity of WMCO and the potential complexity of debugging customer cases that involve a proxy setup, since it would be extremely difficult to set up an accurate replication environment. This can be mitigated by proactively getting the development team, QE, and support folks familiar with the expected behavior of Windows nodes/workloads in proxied clusters, and comfortable spinning up their own proxied clusters.
In general, the support procedures for WMCO will remain the same. There are two underlying mechanisms we rely on: the publishing of proxy config to cluster resources, and the consumption of the published config. If either mechanism fails, Windows nodes will become proxy-unaware. This could involve an issue with the user-provided proxy settings, the cluster network operator, OLM, or WMCO. The result would be all future egress traffic circumventing the proxy, which could affect inter-pod communication, existing application availability, and security. Also, the pause image may fail to be fetched, preventing new Windows nodes from running workloads. This would require manual intervention from the cluster admin or a new release fixing whatever bug is causing the problem.
N/A, as no CRDs, admission or conversion webhooks, aggregated API servers, or finalizers will be added or modified. Only the WMCO will be extended, which is an optional operator with its own lifecycle and SLO/SLAs, a tier 3 OpenShift API.
N/A
N/A
As it stands today, the source of truth for cluster-wide proxy settings is the `Proxy` resource named `cluster`. The contents of the resource are both user-defined and adjusted by the cluster network operator (CNO). Some platforms require instances to access certain endpoints to retrieve metadata for bootstrapping, so CNO has logic to inject additional no-proxy entries such as `169.254.169.254` and `.${REGION}.compute.internal` into the `Proxy` resource.
OLM is a subscriber to these `Proxy` settings -- it forwards the settings to the CSV of managed operators, so the WMCO container will automatically get the required `NO_PROXY`, `HTTP_PROXY`, and `HTTPS_PROXY` environment variables on startup. In fact, OLM will update and restart the operator pod with the proper environment variables if the `Proxy` resource changes, as illustrated below.
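For illustration, the injected variables surface in the WMCO container spec roughly as follows; the values are placeholders and the exact rendering is up to OLM:

```yaml
# Illustrative env section of the WMCO container after OLM proxy injection
env:
- name: HTTP_PROXY
  value: http://<username>:<pswd>@<ip>:<port>
- name: HTTPS_PROXY
  value: https://<username>:<pswd>@<ip>:<port>
- name: NO_PROXY
  value: .cluster.local,.svc,169.254.169.254,localhost
```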
In proxy-enabled clusters, WMCO will read the values of the three proxy variables from its own environment and store them as name-value pairs within the `EnvironmentVars` key found in the windows-services ConfigMap spec. However, in clusters without a global proxy, these variables will not be present in the services ConfigMap.
WICD will periodically poll to check whether the proxy vars have changed by comparing each variable's value on the node to the expected value in the ConfigMap. If there is a discrepancy, the WICD controller will reconcile and update the proxy environment variables on the Windows instance.
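On the node side, the current value WICD compares against can be read manually as shown below; this is only an illustration of the check, assuming the variables are held at machine scope, and WICD itself performs the comparison programmatically:

```powershell
# Read the machine-scoped variable that WICD reconciles against the
# expected value from the windows-services ConfigMap.
[Environment]::GetEnvironmentVariable("HTTP_PROXY", "Machine")
```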
WMCO also specifies a list of environment variables monitored by WICD through the `WatchedEnvironmentVars` key in the services ConfigMap spec. This list will now include `NO_PROXY`, `HTTP_PROXY`, and `HTTPS_PROXY` as the names of proxy-specific environment variables watched by WICD, as sketched below.
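A minimal sketch of how these two keys might look in the services ConfigMap spec; the shape and values are hypothetical, as the concrete schema is internal to WMCO/WICD:

```yaml
# Hypothetical proxy-related entries in the windows-services ConfigMap spec
EnvironmentVars:
  HTTP_PROXY: http://<username>:<pswd>@<ip>:<port>
  HTTPS_PROXY: https://<username>:<pswd>@<ip>:<port>
  NO_PROXY: .cluster.local,.svc,localhost
WatchedEnvironmentVars:
- HTTP_PROXY
- HTTPS_PROXY
- NO_PROXY
```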
When a proxy variable is removed from the cluster-wide proxy settings in the `Proxy` resource, WICD will take corrective action to remove the proxy variable from the Windows OS registry.
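The manual equivalent of that corrective action looks roughly like the following; this is an illustration only, not necessarily the mechanism WICD uses:

```powershell
# Clearing a machine-scoped variable removes its entry from the registry hive
# HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Environment.
[Environment]::SetEnvironmentVariable("HTTP_PROXY", $null, "Machine")
```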
- In proxied clusters, WMCO will create a new ConfigMap on operator startup. This resource will contain a trusted CA injection request label, so it will be updated by CNO when the global `Proxy` resource changes. If the resource is deleted at any point during operator runtime, WMCO will re-create it to make sure CNO can provide up-to-date proxy settings.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
  name: trusted-ca
  namespace: openshift-windows-machine-config-operator
```
Note that we cannot add this ConfigMap into WMCO's bundle manifests because OLM treats bundle resources as static manifests and would actively kick back any changes, including the CA injections from CNO.
- For Windows instances that have not yet been configured, WMCO reads the trusted CA ConfigMap data during node configuration and uses it to update the local trust store of all Windows instances.
- For existing Windows nodes, WMCO reacts to changes in the custom CA bundle by reconciling Windows nodes. This will be done through a Kubernetes controller that watches the `trusted-ca` ConfigMap for create/update/delete events. On change, the controller copies the new trust bundle to Windows instances, deleting old certificates (i.e. those not present in the current trust bundle) off the instance and importing new ones, as sketched below.
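For a sense of the on-instance operations involved, the following is a manual sketch using the PowerShell PKI cmdlets; the paths and the `$staleThumbprint` variable are hypothetical, and WMCO/WICD may implement this differently:

```powershell
# Import each CA certificate from the copied bundle directory (hypothetical
# path) into the local machine root store.
Get-ChildItem -Path C:\Temp\trusted-ca\*.crt | ForEach-Object {
    Import-Certificate -FilePath $_.FullName -CertStoreLocation Cert:\LocalMachine\Root
}

# Remove a previously imported certificate that is no longer present in the
# current trust bundle, matched by its thumbprint.
Remove-Item -Path "Cert:\LocalMachine\Root\$staleThumbprint"
```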
In addition to unit testing individual WMCO packages and controllers, an e2e job will be added to the release repo for WMCO's master/release-4.14 branches. A new CI workflow will be created using existing step-registry steps, which creates a vSphere cluster with hybrid-overlay networking and an HTTPS proxy secured by additional certs. This workflow will be used to run the existing WMCO e2e test suite to ensure the addition of the egress proxy feature does not break existing functionality. We will add a few test cases to explicitly check the state of proxy settings on Windows nodes. When we release a community offering with this feature, we will add a similar CI job using a cluster-wide proxy on OKD. QE should cover all platforms when validating this feature.
The feature associated with this enhancement is targeted to land in the official Red Hat operator version of WMCO 9.0.0 within OpenShift 4.14 timeframe. The normal WMCO release process will be followed as the functionality described in this enhancement is integrated into the product.
A community version of WMCO 8 or 9 will be released with incremental additions to Windows proxy support functionality, giving users an opportunity to get an early preview of the feature using OKD/OCP 4.13 or 4.14. It will also allow us to collect feedback to troubleshoot common pain points and learn if there are any shortcomings.
An OpenShift docs update announcing Windows cluster-wide proxy support will be required as part of GA. The new docs should list any Windows-specific info, but linking to existing docs should be enough for the overarching proxy/PKI details.
N/A
See Release Plan above.
N/A, as this is a new feature that does not supersede an existing one.
The relevant upgrade path is from WMCO 8.y.z in OCP 4.13 to WMCO 9.y.z in OCP 4.14. There will be no changes to the current WMCO upgrade strategy. Once customers are on WMCO 9.0.0, they can configure a cluster-wide proxy and the Windows nodes will be automatically updated by the operator to use the `Proxy` settings for egress traffic.
When deconfiguring Windows instances, proxy settings will be cleared from the node. This involves undoing some node configuration steps, i.e. removing proxy variables and deleting additional certificates from the machine's local trust store. This scenario will occur when upgrading both BYOH and Machine-backed Windows nodes.
Downgrades are generally not supported by OLM, which manages WMCO. In case of breaking changes, please see the WMCO Upgrades enhancement document for guidance.
N/A. There will be no version skew since this work is all within 1 product sub-component (WMCO). The 8.y.z version of the official Red Hat operator will not have cluster-wide egress proxy support for Windows enabled. Then, when customers move to WMCO 9.0.0, the proxy support will be available.
The implementation history can be tracked by following the associated work items in Jira and source code improvements in the WMCO Github repo.
- Another possible way for WMCO to retrieve the proxy variables is to watch the `rendered-worker` `MachineConfig` for changes and parse the info from the `proxy.env` file. MCO re-renders this `MachineConfig` when CVO injects new proxy variables into its pod spec. The difficulty of this approach comes from figuring out when we need to update the node's env vars. Ideally such reconfiguring happens only when the values change in the pod spec, but how can we detect whether the proxy env vars changed or the `rendered-worker` `MachineConfig` was updated for some other reason? We want to avoid kicking off polling of all nodes every time the `rendered-worker` spec updates.
- There are a few other ways to set the required environment variables on the node:
  - Using PowerShell instead of Windows API calls, e.g.:

    ```powershell
    [Environment]::SetEnvironmentVariable('HTTP_PROXY', 'http://<username>:<pswd>@<ip>:<port>', 'Machine')
    [Environment]::SetEnvironmentVariable('NO_PROXY', '123.example.com;10.88.0.0/16', 'Machine')
    ```

    But since WICD runs directly on the node, syscalls are more direct and efficient.
  - In order to avoid a system reboot after setting node environment variables, we can reconcile services by setting their process-level environment variables and then restarting the individual services. This can be done by adding a PowerShell pre-script to the config for each service in the windows-services ConfigMap:

    ```powershell
    [string[]] $envVars = @("HTTP_PROXY=http://<username>:<pswd>@<ip>:<port>", "NO_PROXY=123.example.com,10.88.0.0/16")
    Set-ItemProperty HKLM:SYSTEM\CurrentControlSet\Services\<$SERVICE_NAME> -Name Environment -Value $envVars
    Restart-Service <$SERVICE_NAME>
    ```

    But since pre-scripts run each time WICD checks for changes in the service spec, a constant polling operation during operator runtime, this would run unnecessarily often and bloat each service's configuration.
- Also note that there is another way to get the required trusted CA data rather than accessing the ConfigMap directly, but it leaves open the same concern around unnecessary reconciliations -- how can we detect whether the operator restarted due to a trust bundle file change or the pod just restarted for another reason? For completeness, the approach is listed out here:
  - Update the operator's Deployment to support trusted CA injection by mounting the trusted CA ConfigMap:

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: windows-machine-config-operator
      namespace: openshift-windows-machine-config-operator
      annotations:
        config.openshift.io/inject-proxy: windows-machine-config-operator
    spec:
      ...
        containers:
        - name: windows-machine-config-operator
          volumeMounts:
          - name: trusted-ca
            mountPath: /etc/pki/ca-trust/extracted/pem
            readOnly: true
        volumes:
        - name: trusted-ca
          configMap:
            name: trusted-ca
            items:
            - key: ca-bundle.crt
              path: tls-ca-bundle.pem
      ...
    ```

  - Create a file watcher that watches changes to the mounted trust bundle and kills the main operator process, allowing the backing k8s Deployment to start a new Pod that mounts the updated trust bundle. Implementation example: cluster-ingress-operator
- A workaround that would deliver the same value proposed by this enhancement would be to validate and provide guidance, making cluster administrators responsible for manually propagating proxy settings to each of their Windows nodes and the underlying OpenShift-managed components. This is not a feasible alternative, as even manual node changes can be ephemeral; WMCO would reset config changes to OpenShift-managed Windows services in the event of a node reconciliation.
- Instead of adding a new vSphere job, we can leverage an existing proxy test workflow on AWS. However, this workflow does not test an HTTPS proxy requiring an additional trust bundle, so we would need to make improvements to the pre-install steps. Since vSphere is our most used platform, testing would be better suited there. The required proxy config steps already exist in the release repo for vSphere anyway.