
feat(autoscaler): add CPU-based control plane autoscaling#7783

Draft

csrwng wants to merge 1 commit into openshift:main from csrwng:cpu-autoscaling

Conversation

@csrwng
Contributor

@csrwng csrwng commented Feb 23, 2026

Summary

  • Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory VPA recommendations when sizing hosted clusters, preventing under-provisioning when CPU is the bottleneck (OCPSTRAT-2915)
  • Add global kubeAPIServerCPUFraction and per-size fraction overrides for both memory and CPU in ClusterSizingConfiguration
  • Final recommended size is max(memory-driven size, CPU-driven size), with full backward compatibility when CPU data is absent
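The max-of-two rule above can be sketched as follows. This is a hypothetical, self-contained helper using an ordered size ladder, not the PR's actual code; `maxSize` and its signature are illustrative.

```go
package main

import "fmt"

// maxSize returns the larger of the memory-driven and CPU-driven size
// recommendations, where "larger" means later in the ordered size ladder.
// An empty CPU recommendation falls back to memory-only sizing, mirroring
// the backward-compatibility behavior described above.
func maxSize(sizesInOrder []string, memSize, cpuSize string) string {
	idx := func(name string) int {
		for i, s := range sizesInOrder {
			if s == name {
				return i
			}
		}
		return -1
	}
	if cpuSize == "" { // no CPU data: memory-only sizing
		return memSize
	}
	if memSize == "" {
		return cpuSize
	}
	if idx(cpuSize) > idx(memSize) {
		return cpuSize
	}
	return memSize
}

func main() {
	order := []string{"small", "medium", "large"}
	fmt.Println(maxSize(order, "small", "large")) // CPU is the bottleneck: large
	fmt.Println(maxSize(order, "medium", ""))     // no CPU data: medium
}
```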

Details

API Changes (ClusterSizingConfiguration)

  • ResourceBasedAutoscalingConfiguration: added kubeAPIServerCPUFraction (global CPU fraction, default 0.65)
  • SizeCapacity: added kubeAPIServerMemoryFraction and kubeAPIServerCPUFraction per-size overrides

Controller Changes

  • recommendedClusterSize extracts both memory and CPU from VPA UncappedTarget
  • Routes to recommendedSizeByBoth, recommendedSize (memory-only), or recommendedSizeByCPU depending on available data
  • Enhanced logging includes CPU values and effective fractions

Cache Changes

  • Effective fraction resolution: per-size override > global fraction > default (0.65)
  • recommendedSizeByCPU: analogous to existing memory-based sizing
  • recommendedSizeByBoth: returns the larger of the two independent size recommendations
  • Per-size fraction validation (0 < fraction <= 1)
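The fraction-resolution precedence and the (0, 1] validation listed above can be sketched as a small standalone example. Function and variable names here are illustrative (the real cache works with resource.Quantity values and its own method names):

```go
package main

import "fmt"

const defaultFraction = 0.65

// effectiveFraction applies the precedence rule:
// per-size override > global fraction > default (0.65).
func effectiveFraction(perSize map[string]*float64, size string, global *float64) float64 {
	if f, ok := perSize[size]; ok && f != nil {
		return *f
	}
	if global != nil {
		return *global
	}
	return defaultFraction
}

// validateFraction mirrors the stated constraint 0 < fraction <= 1.
func validateFraction(f float64) error {
	if f <= 0 || f > 1 {
		return fmt.Errorf("fraction must be in (0, 1], got %v", f)
	}
	return nil
}

func main() {
	global := 0.7
	small := 0.5
	perSize := map[string]*float64{"small": &small}
	fmt.Println(effectiveFraction(perSize, "small", &global))  // per-size override wins: 0.5
	fmt.Println(effectiveFraction(perSize, "medium", &global)) // falls back to global: 0.7
	fmt.Println(effectiveFraction(perSize, "medium", nil))     // falls back to default: 0.65
}
```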

Backward Compatibility

  • No CPU configuration required — defaults to memory-only sizing
  • Existing ClusterSizingConfiguration resources work without changes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added CPU-based sizing support for Kubernetes API server autoscaling alongside existing memory-based sizing
    • Introduced configurable CPU and memory fraction overrides at both global and per-cluster-size levels
    • Enhanced cluster sizing recommendations to consider both CPU and memory requirements simultaneously for improved resource allocation

…size resource fractions

Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory
VPA recommendations when determining hosted cluster size. Previously, sizing was
based solely on kube-apiserver memory recommendations, which could lead to
under-provisioning when CPU was the bottleneck.

Changes:
- Add global kubeAPIServerCPUFraction to ResourceBasedAutoscalingConfiguration
- Add per-size fraction overrides (memory and CPU) to SizeCapacity
- Implement effective fraction resolution: per-size > global > default (0.65)
- Add recommendedSizeByCPU and recommendedSizeByBoth methods to sizing cache
- Update controller to extract both CPU and memory from VPA UncappedTarget
- Final size is max(memory-driven size, CPU-driven size)
- Maintain backward compatibility: memory-only sizing when CPU data is absent

OCPSTRAT-2915

Signed-off-by: Cesar Wong <cewong@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/needs-area and area/api (changes for the API) labels Feb 23, 2026
@csrwng csrwng marked this pull request as draft February 23, 2026 18:58
@openshift-ci openshift-ci bot added the area/cli (changes for the CLI), area/hypershift-operator (changes for the hypershift operator and API, outside an OCP release), and do-not-merge/work-in-progress (PR should not merge because it is a work in progress) labels, and removed the do-not-merge/needs-area label Feb 23, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

The changes introduce CPU-based sizing support for Kube API Server configuration alongside existing memory-based sizing. New fields for CPU and memory fractions are added to type definitions and CRD schemas, while controller and caching logic are extended to compute sizing recommendations based on both memory and CPU alongside per-size overrides.

Changes

  • Type Definitions and Schema — api/scheduling/v1alpha1/clustersizingconfiguration_types.go, cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml: added KubeAPIServerMemoryFraction and KubeAPIServerCPUFraction fields to the SizeCapacity and ResourceBasedAutoscalingConfiguration types, plus corresponding properties in the CRD OpenAPI schema with validation patterns and default values.
  • Machine Sizes Cache Logic — hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go: reworked size-recommendation logic to support both memory and CPU with per-size fraction overrides. Added per-size fraction tracking, global CPU fraction validation, and dual-resource recommendation functions (recommendedSizeByMemoryLocked, recommendedSizeByCPULocked, recommendedSizeByBoth) with precedence resolution (per-size > global > default).
  • Controller Logic — hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go: extended sizing decisions to consider both memory and CPU recommendations from VPA. Updated recommendation capture, logging, and size selection to conditionally use memory-only, CPU-only, or largest-requirement sizing based on available recommendations.
  • Test Coverage — hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go, hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go: added helpers (defaultSizeCacheWithCPU, vpaWithMemoryAndCPURecommendation), a new TestRecommendedClusterSize test function, and unit tests for CPU fraction validation, per-size overrides, dual-resource recommendation logic, and backward compatibility.

Sequence Diagram

sequenceDiagram
    participant VPA as VPA Recommendations
    participant Controller as CPU Autoscaler Controller
    participant Cache as Machine Sizes Cache
    participant ClusterSize as Cluster Sizing Decision

    VPA->>Controller: Provide memory & CPU recommendations<br/>for kube-apiserver
    
    Controller->>Cache: Query recommendedSizeByMemory()<br/>with memory recommendation
    Cache-->>Controller: Return size for memory requirement
    
    Controller->>Cache: Query recommendedSizeByCPU()<br/>with CPU recommendation
    Cache-->>Controller: Return size for CPU requirement
    
    alt Both memory and CPU recommendations exist
        Controller->>Cache: Call recommendedSizeByBoth()<br/>(memory size, CPU size)
        Cache-->>Controller: Return largest requirement size
        Cache->>Cache: Check per-size overrides<br/>using effectiveMemoryFraction()<br/>& effectiveCPUFraction()
    else Only memory exists
        Controller->>ClusterSize: Use memory-based sizing
    else Only CPU exists
        Controller->>ClusterSize: Use CPU-based sizing
    end
    
    Controller->>ClusterSize: Apply effective fractions<br/>(per-size > global > default)
    ClusterSize-->>Controller: Return recommended cluster size

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 15.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Structure And Quality — ⚠️ Warning: tests demonstrate good organization with proper single responsibility, setup patterns, and no timeout issues, but assertion messages are inconsistently applied throughout the test suite. Resolution: add descriptive failure messages to all assertions lacking them to improve maintainability and provide clearer diagnostics when tests fail.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the PR title clearly and specifically describes the primary change (adding CPU-based autoscaling to the control plane), which aligns with the main objective of extending the autoscaler to consider both CPU and memory recommendations.
  • Stable And Deterministic Test Names — ✅ Passed: all test names in the PR are stable and deterministic, with no dynamic values such as timestamps, UUIDs, or generated identifiers.


@openshift-ci
Contributor

openshift-ci bot commented Feb 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2026
@openshift-ci openshift-ci bot requested review from bryan-cox and jparrill February 23, 2026 18:58
@csrwng
Contributor Author

csrwng commented Feb 23, 2026

/test verify
/test unit

@csrwng
Contributor Author

csrwng commented Feb 23, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`:
- Around line 227-234: The log reads fractions from r.sizeCache via
effectiveMemoryFraction and effectiveCPUFraction without holding the cache
mutex, risking race with concurrent update() calls; fix by adding locking inside
sizeCache.effectiveMemoryFraction and sizeCache.effectiveCPUFraction (acquire
the same mutex used by sizeCache.update/recommendedSizeByBoth), or alternatively
add a new sizeCache method (e.g., recommendedWithFractions or
getSizeAndFractions) that returns the chosen size and the memory/CPU fractions
while holding the lock so the logged fractions are consistent with the sizing
decision.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-378: The code assumes sizesInOrderByMemory() produces a
memory-sorted list where increasing memory implies non-decreasing CPU, which can
be violated by real instance types; add validation in updateSizesFromConfig()
and updateSizesFromMachineSets() after sizes and sizesInOrderByMemory() are
populated: iterate the memory-ordered list and ensure each subsequent size's CPU
(resources.CPU.AsApproximateFloat64() * effectiveCPUFraction(size)) is >= the
previous CPU value, and if not return/log a clear error (or remove/skip invalid
sizes) so recommendedSizeByBoth() and recommendedSizeByCPULocked() are never
given inconsistent orderings; alternatively (if you prefer refactor) change
recommendedSizeByBoth() to compute recommendedSizeByCPULocked(cpu) and
recommendedSizeByMemoryLocked(mem) independently and return the larger of the
two results instead of relying on sizesInOrderByMemory().

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 79364b4 and 75c092d.

⛔ Files ignored due to path filters (2)
  • api/scheduling/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/scheduling/v1alpha1/clustersizingconfiguration_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (6)
  • api/scheduling/v1alpha1/clustersizingconfiguration_types.go
  • cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go

Comment on lines +227 to +234
if kubeAPIServerMemory != nil {
	logKVs = append(logKVs, "kubeAPIServerMemory", kubeAPIServerMemory.String())
	logKVs = append(logKVs, "effectiveMemoryFraction", r.sizeCache.effectiveMemoryFraction(recommendedSize))
}
if kubeAPIServerCPU != nil {
	logKVs = append(logKVs, "kubeAPIServerCPU", kubeAPIServerCPU.String())
	logKVs = append(logKVs, "effectiveCPUFraction", r.sizeCache.effectiveCPUFraction(recommendedSize))
}
Contributor


⚠️ Potential issue | 🟡 Minor

Potential data race when accessing cache without holding lock.

The effectiveMemoryFraction and effectiveCPUFraction methods are called on r.sizeCache without acquiring the mutex. Looking at the cache implementation, these methods read from perSizeFractions, kasMemorySizeFraction, and kasCPUSizeFraction which can be modified by concurrent update() calls.

While the sizing decision itself is protected by the lock inside recommendedSizeByBoth, the fraction values logged here could be inconsistent with the actual values used for sizing if an update occurs between these calls.

Consider either:

  1. Adding lock acquisition to effectiveMemoryFraction/effectiveCPUFraction methods, or
  2. Exposing a method that returns sizing result along with the fractions used, ensuring consistency
🔒 Proposed fix: Add locking to the effective fraction methods in machine_sizes_cache.go
 func (s *machineSizesCache) effectiveMemoryFraction(sizeName string) float64 {
+	s.m.Lock()
+	defer s.m.Unlock()
+	return s.effectiveMemoryFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveMemoryFractionLocked(sizeName string) float64 {
 	if s.perSizeFractions != nil {
 		if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.memoryFraction != nil {
 			return fractions.memoryFraction.AsApproximateFloat64()
 		}
 	}
 	return s.kasMemoryFraction()
 }

 func (s *machineSizesCache) effectiveCPUFraction(sizeName string) float64 {
+	s.m.Lock()
+	defer s.m.Unlock()
+	return s.effectiveCPUFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveCPUFractionLocked(sizeName string) float64 {
 	if s.perSizeFractions != nil {
 		if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.cpuFraction != nil {
 			return fractions.cpuFraction.AsApproximateFloat64()
 		}
 	}
 	return s.kasCPUFraction()
 }
🧰 Tools
🪛 golangci-lint (2.5.0)

[error] 228-228: : # github.com/openshift/hypershift/sync-global-pullsecret [github.com/openshift/hypershift/sync-global-pullsecret.test]
sync-global-pullsecret/sync-global-pullsecret_test.go:228:23: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:234:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:247:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:257:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:270:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:283:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:296:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:309:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:327:12: undefined: NewMockdbusConn

(typecheck)


Comment on lines +359 to +378
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
}
Contributor

@coderabbitai coderabbitai bot Feb 23, 2026


⚠️ Potential issue | 🟠 Major


Add validation to enforce the memory-CPU ordering assumption or refactor to handle inconsistent orderings.

The assumption stated at lines 384-385 that "a size with more memory also has more CPU" is critical to recommendedSizeByBoth(), which uses sizesInOrderByMemory() as the canonical ordering for both resources. However, no validation enforces this assumption when sizes are loaded.

Real-world AWS instance types can violate this: compute-optimized instances (e.g., c5.2xlarge: 8 vCPU, 16 GiB memory) have high CPU but low memory, while memory-optimized instances (e.g., r5.xlarge: 4 vCPU, 32 GiB memory) have low CPU but high memory. If such instances are used as cluster sizes, recommendedSizeByBoth() will return an incorrectly-sized cluster by assuming the memory ordering matches CPU ordering.

Either validate the assumption in updateSizesFromConfig() and updateSizesFromMachineSets() with an error if violated, or refactor the sizing logic to independently order by CPU and memory, then select the larger result by explicit comparison rather than relying on a shared ordering.
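The second alternative above — dropping the shared-ordering assumption and picking a size that actually satisfies both resources — could look like this sketch. The `capacity` struct and function signature are hypothetical simplifications of the real cache:

```go
package main

import "fmt"

type capacity struct{ memGiB, cpu float64 }

// recommendedSizeByBoth scans the memory-ordered ladder and returns the
// first size whose capacity satisfies BOTH the memory and CPU requirements,
// rather than assuming memory ordering implies CPU ordering. If nothing
// fits both, it falls back to the largest size by memory (best effort).
func recommendedSizeByBoth(order []string, sizes map[string]capacity, memGiB, cpu float64) string {
	for _, name := range order {
		c, ok := sizes[name]
		if !ok {
			continue
		}
		if c.memGiB >= memGiB && c.cpu >= cpu {
			return name
		}
	}
	if len(order) == 0 {
		return ""
	}
	return order[len(order)-1]
}

func main() {
	// A ladder where memory ordering does NOT imply CPU ordering
	// (compute-optimized vs memory-optimized shapes, as in the comment above).
	sizes := map[string]capacity{
		"c5-like": {memGiB: 16, cpu: 8},
		"r5-like": {memGiB: 32, cpu: 4},
	}
	order := []string{"c5-like", "r5-like"} // sorted by memory
	// 20 GiB + 6 vCPU: r5-like has the memory but not the CPU,
	// so the scan falls through to the best-effort largest size.
	fmt.Println(recommendedSizeByBoth(order, sizes, 20, 6))
	// 10 GiB + 6 vCPU: c5-like satisfies both.
	fmt.Println(recommendedSizeByBoth(order, sizes, 10, 6))
}
```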


Member


The HyperShift Agents caught this as well.

Contributor


@bryan-cox, thank you for confirming! I'm glad the issue was independently identified.

Since this is a known concern, would you like me to:

  1. Create a tracking issue for this validation/refactoring work if one doesn't exist yet?
  2. Help plan the code changes to address it (either adding validation or refactoring the sizing logic)?

Let me know how I can assist! Otherwise, I'll leave this for you and the team to handle as you see fit.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 23, 2026

@csrwng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/verify — commit 75c092d (required); rerun with /test verify

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go (1)

184-235: ⚠️ Potential issue | 🟡 Minor

Potential data race on fraction reads in logging.

effectiveMemoryFraction/effectiveCPUFraction are read without holding the cache mutex, so logging can race with concurrent updates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`
around lines 184 - 235, The log section reads
r.sizeCache.effectiveMemoryFraction(...) and
r.sizeCache.effectiveCPUFraction(...) without synchronizing the sizeCache, which
can race with concurrent updates; update the code around the logging (the block
that builds logKVs and calls log.Info) to call thread-safe accessors or acquire
the same mutex used by sizeCache before invoking
effectiveMemoryFraction/effectiveCPUFraction (or add and use exported getter
methods on sizeCache that lock internally) so the fraction reads are protected.
hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go (1)

380-409: ⚠️ Potential issue | 🟠 Major

Ordering assumption between memory and CPU still unchecked.

recommendedSizeByBoth (and the CPU path it depends on) assumes memory ordering implies non-decreasing CPU; real instance types can violate this and yield incorrect sizing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`
around lines 380 - 409, The current recommendedSizeByBoth assumes sizes sorted
by memory also have non-decreasing CPU which can be false; update
recommendedSizeByBoth to pick a size that actually satisfies both resources
rather than relying on index ordering: call recommendedSizeByMemoryLocked and
recommendedSizeByCPULocked to get initial candidates, then use
sizesInOrderByMemory() to start from the larger (later) index but verify the
chosen size actually meets both memory and cpu by consulting the size spec in
s.sizes (or the struct that holds memory/cpu for each size); if the selected
size doesn't satisfy both, scan forward in sizesInOrderByMemory until you find
the first size that meets both constraints (or return "" if none), keeping the
existing locking and function names (recommendedSizeByBoth,
recommendedSizeByMemoryLocked, recommendedSizeByCPULocked, sizesInOrderByMemory,
s.sizes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-377: The function recommendedSizeByCPULocked currently treats
zero CPU capacity as a valid value and will pick the largest size; change it so
that if the cache lacks CPU capacity data for all sizes it returns "" to trigger
the caller's memory-only fallback. Concretely, in recommendedSizeByCPULocked use
resources.CPU.AsApproximateFloat64() (and/or s.effectiveCPUFraction(size)) to
detect "CPU unset" (e.g. total CPU == 0 or effective fraction == 0) and skip
those sizes; if no size with non-zero CPU capacity is found return "" instead of
the largest size. Also ensure updateSizesFromConfig continues to set CPU=0 for
legacy configs and the new logic above treats that as "missing" rather than a
capacity of zero.
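The "zero CPU means missing" behavior described above can be sketched as follows (a hypothetical simplification of the cache; names and signatures are illustrative):

```go
package main

import "fmt"

// recommendedSizeByCPU skips sizes whose CPU capacity is zero (how legacy
// configs without CPU data are stored) and returns "" when no size has CPU
// data at all, so the caller can fall back to memory-only sizing instead of
// best-effort-returning the largest size.
func recommendedSizeByCPU(order []string, cpuBySize map[string]float64, cpu float64) string {
	sawCPUData := false
	for _, name := range order {
		capacity, ok := cpuBySize[name]
		if !ok || capacity == 0 { // zero CPU: treat as unset, not as a real capacity
			continue
		}
		sawCPUData = true
		if capacity >= cpu {
			return name
		}
	}
	if !sawCPUData {
		return "" // no CPU data anywhere: caller uses the memory-only path
	}
	return order[len(order)-1] // best effort: largest size
}

func main() {
	order := []string{"small", "large"}
	// Legacy config (all CPU capacities are 0): fall back to memory-only sizing.
	fmt.Println(recommendedSizeByCPU(order, map[string]float64{"small": 0, "large": 0}, 2) == "")
	// Real CPU data: 4 vCPU requirement lands on the large size.
	fmt.Println(recommendedSizeByCPU(order, map[string]float64{"small": 2, "large": 8}, 4))
}
```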


ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 79364b4 and 75c092d.

⛔ Files ignored due to path filters (2)
  • api/scheduling/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/scheduling/v1alpha1/clustersizingconfiguration_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (6)
  • api/scheduling/v1alpha1/clustersizingconfiguration_types.go
  • cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go

Comment on lines +359 to +377
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
Contributor

@coderabbitai coderabbitai bot Feb 23, 2026

⚠️ Potential issue | 🟠 Major

Backward-compat risk when CPU capacity is unset.

If a legacy ClusterSizingConfiguration only provides memory capacity, updateSizesFromConfig sets CPU to 0 for all sizes. When VPA includes CPU recommendations, recommendedSizeByCPULocked will return the largest size, even if memory would fit — which breaks the stated memory-only fallback.

Consider treating “no CPU capacity data” as “CPU not available” and returning "" so the caller falls back to memory-only sizing.

🛠️ Suggested fix to preserve memory-only behavior when CPU capacity is absent
 func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
 	sizesInOrder := s.sizesInOrderByMemory()
 	if len(sizesInOrder) == 0 {
 		return ""
 	}
+	hasCPUCapacity := false
 	for _, size := range sizesInOrder {
 		resources, hasSize := s.sizes[size]
 		if !hasSize {
 			continue
 		}
+		if resources.CPU.AsApproximateFloat64() > 0 {
+			hasCPUCapacity = true
+		}
 		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
 		if containerCPUCapacity >= cpu {
 			return size
 		}
 	}
+	if !hasCPUCapacity {
+		return ""
+	}
 	// Best effort: return the largest cluster size
 	return sizesInOrder[len(sizesInOrder)-1]
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`
around lines 359 - 377, The function recommendedSizeByCPULocked currently treats
zero CPU capacity as a valid value and will pick the largest size; change it so
that if the cache lacks CPU capacity data for all sizes it returns "" to trigger
the caller's memory-only fallback. Concretely, in recommendedSizeByCPULocked use
resources.CPU.AsApproximateFloat64() (and/or s.effectiveCPUFraction(size)) to
detect "CPU unset" (e.g. total CPU == 0 or effective fraction == 0) and skip
those sizes; if no size with non-zero CPU capacity is found return "" instead of
the largest size. Also ensure updateSizesFromConfig continues to set CPU=0 for
legacy configs and the new logic above treats that as "missing" rather than a
capacity of zero.

Member

HyperShift SME agents caught this too.

Contributor

@bryan-cox, thanks for confirming! It's good to see convergence across reviewers on this backward-compatibility concern.

If you'd like, I can help plan the code changes to address this issue—just let me know. Otherwise, feel free to proceed with your own implementation. The suggested fix in the review comment should preserve the memory-only fallback behavior when CPU capacity data is absent.


// kubeAPIServerMemoryFraction is a number between 0 and 1 that overrides the global
// kubeAPIServerMemoryFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
Member

Could we specify what the default is?

// kubeAPIServerCPUFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
// +optional
KubeAPIServerCPUFraction *resource.Quantity `json:"kubeAPIServerCPUFraction,omitempty"`
Member

Could we specify what the default is?

Comment on lines +359 to +378
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
}
Member

The HyperShift Agents caught this as well.

Member

Could you add kubebuilder validations for these two fields?

Comment on lines +359 to +377
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
Member

HyperShift SME agents caught this too.


// Log the size change with detailed information
log.Info("Updating cluster size recommendation",
logKVs := []interface{}{
Member

Should Prometheus metrics be emitted when sizes change?

controller := &ControlPlaneAutoscalerController{
	Client: mgr.GetClient(),
}
if err := ctrl.NewControllerManagedBy(mgr).
Member

Should ClusterSizingConfiguration changes also be watched?

Member

nit: Add t.Parallel() to tests

Member

nit: Add t.Parallel() to tests

})
}

func TestRecommendedSizeByCPU(t *testing.T) {
Member

Could we add tests that test the ordering from a different direction to make sure that validates expected behavior?

Like

  cache := machineSizesCache{
      sizes: map[string]machineResources{
          "compute": {Memory: resource.MustParse("4Gi"), CPU: resource.MustParse("8")},
          "memory":  {Memory: resource.MustParse("16Gi"), CPU: resource.MustParse("4")},
          "large":   {Memory: resource.MustParse("32Gi"), CPU: resource.MustParse("16")},
      },
  }


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
