@jcpowermac
Contributor

This commit fixes a critical issue where the machine-api-operator was creating and destroying vCenter REST API sessions on every machine reconciliation, causing excessive login/logout cycles that pollute vCenter audit logs and create unnecessary session churn.

Root Cause:
The WithRestClient() and WithCachingTagsManager() wrapper functions were creating new REST sessions, performing operations, and immediately logging out on every invocation. With hundreds of machines reconciling periodically, this created a constant stream of login/logout events.
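
To make the churn concrete, here is a minimal Go sketch of the login-per-call pattern described above. The fakeVCenter and restClient types and the lowercase withRestClient helper are hypothetical stand-ins, not the operator's actual code or the govmomi API; they only count authentication events.

```go
package main

import "fmt"

// fakeVCenter is a hypothetical stand-in for a vCenter REST endpoint.
// It only counts authentication events.
type fakeVCenter struct {
	logins  int
	logouts int
}

// restClient is a hypothetical handle to an authenticated REST session.
type restClient struct{ vc *fakeVCenter }

func (vc *fakeVCenter) login() *restClient { vc.logins++; return &restClient{vc: vc} }

func (c *restClient) logout() { c.vc.logouts++ }

// withRestClient mirrors the shape of the old wrapper as described in the
// root cause: log in, run the callback, log out immediately afterwards.
func withRestClient(vc *fakeVCenter, f func(c *restClient) error) error {
	c := vc.login()
	defer c.logout()
	return f(c)
}

// reconcileAll simulates one reconcile pass over n machines, each performing
// a single tag operation through the wrapper.
func reconcileAll(vc *fakeVCenter, n int) error {
	for i := 0; i < n; i++ {
		if err := withRestClient(vc, func(c *restClient) error { return nil }); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	vc := &fakeVCenter{}
	if err := reconcileAll(vc, 300); err != nil {
		panic(err)
	}
	// Every machine costs one login/logout pair per reconcile pass.
	fmt.Printf("logins=%d logouts=%d\n", vc.logins, vc.logouts)
}
```

In this model the number of login/logout pairs grows linearly with the number of reconciliations, which is the pattern that shows up in the vCenter audit log.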

Solution (inspired by cluster-api-provider-vsphere):

  • Add TagManager field to Session struct to cache REST client
  • Initialize and cache REST client during session creation (GetOrCreate)
  • Validate both SOAP and REST session health before reusing cached sessions
  • Add GetCachingTagsManager() helper for direct access to cached tag manager
  • Update reconcileRegionAndZoneLabels() to use cached tag manager
  • Update reconcileTags() to use cached tag manager
  • Deprecate WithRestClient() and WithCachingTagsManager(), keeping them in place for backward compatibility
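
The caching and dual-validation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in session.go: the backend, sessionCache, and TagManager types are hypothetical placeholders (TagManager is not govmomi's tags.Manager), and the health checks are reduced to boolean flags.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a hypothetical stand-in for vCenter. It counts logins and lets
// the caller expire either session by flipping a flag.
type backend struct {
	soapLogins, restLogins int
	soapAlive, restAlive   bool
}

// TagManager is a placeholder for the cached REST-backed tag manager.
type TagManager struct{ b *backend }

// Session bundles the SOAP side and the cached tag manager, loosely following
// the Session struct described in the PR.
type Session struct {
	b          *backend
	TagManager *TagManager
}

// soapActive and restActive model the two health checks done before reuse.
func (s *Session) soapActive() bool { return s.b.soapAlive }
func (s *Session) restActive() bool { return s.b.restAlive }

// sessionCache is a hypothetical in-memory cache keyed by server and user.
type sessionCache struct {
	mu       sync.Mutex
	sessions map[string]*Session
}

func newSessionCache() *sessionCache {
	return &sessionCache{sessions: map[string]*Session{}}
}

// GetOrCreate returns the cached session when both the SOAP and REST sessions
// are still valid. Otherwise it logs in again and replaces the cache entry.
func (c *sessionCache) GetOrCreate(key string, b *backend) *Session {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.sessions[key]; ok && s.soapActive() && s.restActive() {
		return s
	}
	b.soapLogins++
	b.restLogins++
	b.soapAlive, b.restAlive = true, true
	s := &Session{b: b, TagManager: &TagManager{b: b}}
	c.sessions[key] = s
	return s
}

func main() {
	b := &backend{}
	c := newSessionCache()
	for i := 0; i < 100; i++ {
		c.GetOrCreate("vcenter.example.com/user", b)
	}
	fmt.Println("REST logins after 100 calls:", b.restLogins)

	b.restAlive = false // simulate an expired REST session
	c.GetOrCreate("vcenter.example.com/user", b)
	fmt.Println("REST logins after expiry:", b.restLogins)
}
```

The sketch leaves out details a real implementation would need, such as logging out of a stale session before replacing it and handling login errors.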

Key Changes:

  1. pkg/controller/vsphere/session/session.go:

    • Added TagManager *tags.Manager field to Session struct
    • Modified GetOrCreate() to create and cache REST client once
    • Added dual session validation (SOAP + REST) before reusing sessions
    • Added GetCachingTagsManager() method for direct access
    • Deprecated old wrapper functions with migration guidance
  2. pkg/controller/vsphere/reconciler.go:

    • Updated reconcileRegionAndZoneLabels() to use GetCachingTagsManager()
    • Updated reconcileTags() to use GetCachingTagsManager()
    • Eliminated callback pattern in favor of direct access
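
As a rough picture of the reconciler change, the sketch below contrasts the callback style with direct access. Only the name GetCachingTagsManager() comes from the PR; its signature here, the session and tagManager types, and the attachedTags helper are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// tagManager is a hypothetical placeholder for the cached tag manager.
type tagManager struct{ tags map[string][]string }

// attachedTags is an illustrative lookup, not a real govmomi call.
func (m *tagManager) attachedTags(vmID string) []string { return m.tags[vmID] }

// session is a trimmed-down, hypothetical Session holding the cached manager.
type session struct{ tagManager *tagManager }

// GetCachingTagsManager returns the cached manager. The name is taken from
// the PR description; the (value, error) signature is an assumption.
func (s *session) GetCachingTagsManager() (*tagManager, error) {
	if s.tagManager == nil {
		return nil, errors.New("tag manager not initialized")
	}
	return s.tagManager, nil
}

// withCachingTagsManager has the callback shape of the old helper. Session
// setup and teardown are omitted here.
func withCachingTagsManager(s *session, f func(m *tagManager) error) error {
	m, err := s.GetCachingTagsManager()
	if err != nil {
		return err
	}
	return f(m)
}

// tagsViaCallback shows the "before" style: the result has to be carried out
// of the closure through a captured variable.
func tagsViaCallback(s *session, vmID string) ([]string, error) {
	var out []string
	err := withCachingTagsManager(s, func(m *tagManager) error {
		out = m.attachedTags(vmID)
		return nil
	})
	return out, err
}

// tagsDirect shows the "after" style: fetch the cached manager and call it.
func tagsDirect(s *session, vmID string) ([]string, error) {
	m, err := s.GetCachingTagsManager()
	if err != nil {
		return nil, err
	}
	return m.attachedTags(vmID), nil
}

func main() {
	s := &session{tagManager: &tagManager{
		tags: map[string][]string{"vm-1": {"region-a", "zone-1"}},
	}}
	before, _ := tagsViaCallback(s, "vm-1")
	after, _ := tagsDirect(s, "vm-1")
	fmt.Println(before, after)
}
```

Both functions return the same data; the direct form simply has less indirection and no captured variables.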

Impact:

  • Eliminates excessive vCenter login/logout cycles
  • Reduces vCenter session churn from O(reconciliations) to O(1) per MAPI instance
  • Improves performance by removing authentication overhead on every tag operation
  • REST session now lives as long as SOAP session (until invalidation)

Reference Implementation:
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/pkg/session/session.go

Backward Compatibility:
The deprecated wrapper functions are maintained with warning logs to support existing test code. All production code paths now use the new pattern.
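
One way such a shim could look is sketched below, using Go's conventional "Deprecated:" doc comment plus a warning log. The description does not say whether the real wrappers delegate to the cached manager or keep their old behavior, so the delegation shown here is an assumption, as are the session and tagManager types and the method signature.

```go
package main

import (
	"log"
	"os"
)

// tagManager is a hypothetical placeholder for the cached tag manager.
type tagManager struct{ calls int }

// session is a trimmed-down, hypothetical Session holding the cached manager.
type session struct{ tagManager *tagManager }

// GetCachingTagsManager returns the cached manager (signature assumed).
func (s *session) GetCachingTagsManager() *tagManager { return s.tagManager }

// warnLog is an illustrative logger; the operator's own logging differs.
var warnLog = log.New(os.Stderr, "WARNING: ", log.LstdFlags)

// WithCachingTagsManager runs f against the cached tag manager.
//
// Deprecated: call GetCachingTagsManager directly. This shim is kept only so
// that existing callers, such as tests, continue to compile.
func (s *session) WithCachingTagsManager(f func(m *tagManager) error) error {
	warnLog.Print("WithCachingTagsManager is deprecated; use GetCachingTagsManager")
	return f(s.GetCachingTagsManager())
}

func main() {
	s := &session{tagManager: &tagManager{}}
	err := s.WithCachingTagsManager(func(m *tagManager) error {
		m.calls++
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```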

Fixes: Excessive vCenter logout events reported by customers

Signed-off-by: Claude Code Assistant <noreply@anthropic.com>
@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Nov 13, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcpowermac
Contributor Author

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

@jcpowermac: The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-crds-sync
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-aws-operator-techpreview
/test e2e-azure-manual-oidc
/test e2e-azure-operator
/test e2e-azure-operator-techpreview
/test e2e-azure-ovn
/test e2e-gcp-operator
/test e2e-gcp-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-upgrade
/test e2e-nutanix
/test e2e-nutanix-operator-multi-subnet
/test e2e-openstack
/test e2e-vsphere-host-groups-ovn-techpreview
/test e2e-vsphere-operator
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-multi-vcenter
/test e2e-vsphere-ovn-serial
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-ovn-techpreview-serial
/test e2e-vsphere-ovn-upgrade
/test e2e-vsphere-static-ovn
/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-operator-main-e2e-aws-ovn
pull-ci-openshift-machine-api-operator-main-goimports
pull-ci-openshift-machine-api-operator-main-golint
pull-ci-openshift-machine-api-operator-main-govet
pull-ci-openshift-machine-api-operator-main-images
pull-ci-openshift-machine-api-operator-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-operator-main-okd-scos-images
pull-ci-openshift-machine-api-operator-main-unit
pull-ci-openshift-machine-api-operator-main-verify-crds-sync
pull-ci-openshift-machine-api-operator-main-verify-deps

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jcpowermac
Contributor Author

/test e2e-vsphere-ovn-serial

@jcpowermac
Contributor Author

/test e2e-vsphere-ovn

@openshift-ci
Contributor

openshift-ci bot commented Nov 15, 2025

@jcpowermac: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-serial | 2caad48 | link | false | /test e2e-vsphere-ovn-serial |
| ci/prow/e2e-vsphere-ovn | 2caad48 | link | false | /test e2e-vsphere-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
