
feat(autoscaler): add CPU-based control plane autoscaling#7783

Draft

csrwng wants to merge 1 commit into openshift:main from csrwng:cpu-autoscaling

Conversation

@csrwng
Contributor

@csrwng csrwng commented Feb 23, 2026

Summary

  • Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory VPA recommendations when sizing hosted clusters, preventing under-provisioning when CPU is the bottleneck (OCPSTRAT-2915)
  • Add global kubeAPIServerCPUFraction and per-size fraction overrides for both memory and CPU in ClusterSizingConfiguration
  • Final recommended size is max(memory-driven size, CPU-driven size), with full backward compatibility when CPU data is absent
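The max-of-two rule above can be sketched as follows. This is a hypothetical, self-contained helper using an ordered size ladder, not the PR's actual code; `maxSize` and its signature are illustrative.

```go
package main

import "fmt"

// maxSize returns the larger of the memory-driven and CPU-driven size
// recommendations, where "larger" means later in the ordered size ladder.
// An empty CPU recommendation falls back to memory-only sizing, mirroring
// the backward-compatibility behavior described above.
func maxSize(sizesInOrder []string, memSize, cpuSize string) string {
	idx := func(name string) int {
		for i, s := range sizesInOrder {
			if s == name {
				return i
			}
		}
		return -1
	}
	if cpuSize == "" { // no CPU data: memory-only sizing
		return memSize
	}
	if memSize == "" {
		return cpuSize
	}
	if idx(cpuSize) > idx(memSize) {
		return cpuSize
	}
	return memSize
}

func main() {
	order := []string{"small", "medium", "large"}
	fmt.Println(maxSize(order, "small", "large")) // CPU is the bottleneck: large
	fmt.Println(maxSize(order, "medium", ""))     // no CPU data: medium
}
```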

Details

API Changes (ClusterSizingConfiguration)

  • ResourceBasedAutoscalingConfiguration: added kubeAPIServerCPUFraction (global CPU fraction, default 0.65)
  • SizeCapacity: added kubeAPIServerMemoryFraction and kubeAPIServerCPUFraction per-size overrides

Controller Changes

  • recommendedClusterSize extracts both memory and CPU from VPA UncappedTarget
  • Routes to recommendedSizeByBoth, recommendedSize (memory-only), or recommendedSizeByCPU depending on available data
  • Enhanced logging includes CPU values and effective fractions

Cache Changes

  • Effective fraction resolution: per-size override > global fraction > default (0.65)
  • recommendedSizeByCPU: analogous to existing memory-based sizing
  • recommendedSizeByBoth: returns the larger of the two independent size recommendations
  • Per-size fraction validation (0 < fraction <= 1)
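The fraction-resolution precedence and the (0, 1] validation listed above can be sketched as a small standalone example. Function and variable names here are illustrative (the real cache works with resource.Quantity values and its own method names):

```go
package main

import "fmt"

const defaultFraction = 0.65

// effectiveFraction applies the precedence rule:
// per-size override > global fraction > default (0.65).
func effectiveFraction(perSize map[string]*float64, size string, global *float64) float64 {
	if f, ok := perSize[size]; ok && f != nil {
		return *f
	}
	if global != nil {
		return *global
	}
	return defaultFraction
}

// validateFraction mirrors the stated constraint 0 < fraction <= 1.
func validateFraction(f float64) error {
	if f <= 0 || f > 1 {
		return fmt.Errorf("fraction must be in (0, 1], got %v", f)
	}
	return nil
}

func main() {
	global := 0.7
	small := 0.5
	perSize := map[string]*float64{"small": &small}
	fmt.Println(effectiveFraction(perSize, "small", &global))  // per-size override wins: 0.5
	fmt.Println(effectiveFraction(perSize, "medium", &global)) // falls back to global: 0.7
	fmt.Println(effectiveFraction(perSize, "medium", nil))     // falls back to default: 0.65
}
```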

Backward Compatibility

  • No CPU configuration required — defaults to memory-only sizing
  • Existing ClusterSizingConfiguration resources work without changes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added CPU-based sizing support for Kubernetes API server autoscaling alongside existing memory-based sizing
    • Introduced configurable CPU and memory fraction overrides at both global and per-cluster-size levels
    • Enhanced cluster sizing recommendations to consider both CPU and memory requirements simultaneously for improved resource allocation

…size resource fractions

Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory
VPA recommendations when determining hosted cluster size. Previously, sizing was
based solely on kube-apiserver memory recommendations, which could lead to
under-provisioning when CPU was the bottleneck.

Changes:
- Add global kubeAPIServerCPUFraction to ResourceBasedAutoscalingConfiguration
- Add per-size fraction overrides (memory and CPU) to SizeCapacity
- Implement effective fraction resolution: per-size > global > default (0.65)
- Add recommendedSizeByCPU and recommendedSizeByBoth methods to sizing cache
- Update controller to extract both CPU and memory from VPA UncappedTarget
- Final size is max(memory-driven size, CPU-driven size)
- Maintain backward compatibility: memory-only sizing when CPU data is absent

OCPSTRAT-2915

Signed-off-by: Cesar Wong <cewong@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/needs-area and area/api (changes for the API) labels Feb 23, 2026
@csrwng csrwng marked this pull request as draft February 23, 2026 18:58
@openshift-ci openshift-ci bot added the area/cli (changes for the CLI), area/hypershift-operator (changes for the hypershift operator and API, outside an OCP release), and do-not-merge/work-in-progress (PR should not merge because it is a work in progress) labels, and removed the do-not-merge/needs-area label Feb 23, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

The changes introduce CPU-based sizing support for Kube API Server configuration alongside existing memory-based sizing. New fields for CPU and memory fractions are added to type definitions and CRD schemas, while controller and caching logic are extended to compute sizing recommendations based on both memory and CPU alongside per-size overrides.

Changes

  • Type Definitions and Schema — api/scheduling/v1alpha1/clustersizingconfiguration_types.go, cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml: added KubeAPIServerMemoryFraction and KubeAPIServerCPUFraction fields to the SizeCapacity and ResourceBasedAutoscalingConfiguration types, plus corresponding properties in the CRD OpenAPI schema with validation patterns and default values.
  • Machine Sizes Cache Logic — hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go: reworked size-recommendation logic to support both memory and CPU with per-size fraction overrides. Added per-size fraction tracking, global CPU fraction validation, and dual-resource recommendation functions (recommendedSizeByMemoryLocked, recommendedSizeByCPULocked, recommendedSizeByBoth) with precedence resolution (per-size > global > default).
  • Controller Logic — hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go: extended sizing decisions to consider both memory and CPU recommendations from VPA. Updated recommendation capture, logging, and size selection to conditionally use memory-only, CPU-only, or largest-requirement sizing based on available recommendations.
  • Test Coverage — hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go, hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go: added helpers (defaultSizeCacheWithCPU, vpaWithMemoryAndCPURecommendation), a new TestRecommendedClusterSize test function, and unit tests for CPU fraction validation, per-size overrides, dual-resource recommendation logic, and backward compatibility.

Sequence Diagram

sequenceDiagram
    participant VPA as VPA Recommendations
    participant Controller as CPU Autoscaler Controller
    participant Cache as Machine Sizes Cache
    participant ClusterSize as Cluster Sizing Decision

    VPA->>Controller: Provide memory & CPU recommendations<br/>for kube-apiserver
    
    Controller->>Cache: Query recommendedSizeByMemory()<br/>with memory recommendation
    Cache-->>Controller: Return size for memory requirement
    
    Controller->>Cache: Query recommendedSizeByCPU()<br/>with CPU recommendation
    Cache-->>Controller: Return size for CPU requirement
    
    alt Both memory and CPU recommendations exist
        Controller->>Cache: Call recommendedSizeByBoth()<br/>(memory size, CPU size)
        Cache-->>Controller: Return largest requirement size
        Cache->>Cache: Check per-size overrides<br/>using effectiveMemoryFraction()<br/>& effectiveCPUFraction()
    else Only memory exists
        Controller->>ClusterSize: Use memory-based sizing
    else Only CPU exists
        Controller->>ClusterSize: Use CPU-based sizing
    end
    
    Controller->>ClusterSize: Apply effective fractions<br/>(per-size > global > default)
    ClusterSize-->>Controller: Return recommended cluster size

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 15.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Structure And Quality — ⚠️ Warning: tests demonstrate good organization with proper single responsibility, setup patterns, and no timeout issues, but assertion messages are inconsistently applied throughout the test suite. Resolution: add descriptive failure messages to all assertions lacking them to improve maintainability and provide clearer diagnostics when tests fail.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the PR title clearly and specifically describes the primary change (adding CPU-based autoscaling to the control plane), which aligns with the main objective of extending the autoscaler to consider both CPU and memory recommendations.
  • Stable And Deterministic Test Names — ✅ Passed: all test names in the PR are stable and deterministic, with no dynamic values such as timestamps, UUIDs, or generated identifiers.


@openshift-ci
Contributor

openshift-ci bot commented Feb 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2026
@openshift-ci openshift-ci bot requested review from bryan-cox and jparrill February 23, 2026 18:58
@csrwng
Contributor Author

csrwng commented Feb 23, 2026

/test verify
/test unit

@csrwng
Contributor Author

csrwng commented Feb 23, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`:
- Around line 227-234: The log reads fractions from r.sizeCache via
effectiveMemoryFraction and effectiveCPUFraction without holding the cache
mutex, risking race with concurrent update() calls; fix by adding locking inside
sizeCache.effectiveMemoryFraction and sizeCache.effectiveCPUFraction (acquire
the same mutex used by sizeCache.update/recommendedSizeByBoth), or alternatively
add a new sizeCache method (e.g., recommendedWithFractions or
getSizeAndFractions) that returns the chosen size and the memory/CPU fractions
while holding the lock so the logged fractions are consistent with the sizing
decision.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-378: The code assumes sizesInOrderByMemory() produces a
memory-sorted list where increasing memory implies non-decreasing CPU, which can
be violated by real instance types; add validation in updateSizesFromConfig()
and updateSizesFromMachineSets() after sizes and sizesInOrderByMemory() are
populated: iterate the memory-ordered list and ensure each subsequent size's CPU
(resources.CPU.AsApproximateFloat64() * effectiveCPUFraction(size)) is >= the
previous CPU value, and if not return/log a clear error (or remove/skip invalid
sizes) so recommendedSizeByBoth() and recommendedSizeByCPULocked() are never
given inconsistent orderings; alternatively (if you prefer refactor) change
recommendedSizeByBoth() to compute recommendedSizeByCPULocked(cpu) and
recommendedSizeByMemoryLocked(mem) independently and return the larger of the
two results instead of relying on sizesInOrderByMemory().

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 79364b4 and 75c092d.

⛔ Files ignored due to path filters (2)
  • api/scheduling/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/scheduling/v1alpha1/clustersizingconfiguration_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (6)
  • api/scheduling/v1alpha1/clustersizingconfiguration_types.go
  • cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go

Comment on lines +227 to +234
if kubeAPIServerMemory != nil {
	logKVs = append(logKVs, "kubeAPIServerMemory", kubeAPIServerMemory.String())
	logKVs = append(logKVs, "effectiveMemoryFraction", r.sizeCache.effectiveMemoryFraction(recommendedSize))
}
if kubeAPIServerCPU != nil {
	logKVs = append(logKVs, "kubeAPIServerCPU", kubeAPIServerCPU.String())
	logKVs = append(logKVs, "effectiveCPUFraction", r.sizeCache.effectiveCPUFraction(recommendedSize))
}
Contributor


⚠️ Potential issue | 🟡 Minor

Potential data race when accessing cache without holding lock.

The effectiveMemoryFraction and effectiveCPUFraction methods are called on r.sizeCache without acquiring the mutex. Looking at the cache implementation, these methods read from perSizeFractions, kasMemorySizeFraction, and kasCPUSizeFraction which can be modified by concurrent update() calls.

While the sizing decision itself is protected by the lock inside recommendedSizeByBoth, the fraction values logged here could be inconsistent with the actual values used for sizing if an update occurs between these calls.

Consider either:

  1. Adding lock acquisition to effectiveMemoryFraction/effectiveCPUFraction methods, or
  2. Exposing a method that returns sizing result along with the fractions used, ensuring consistency
🔒 Proposed fix: Add locking to the effective fraction methods in machine_sizes_cache.go
 func (s *machineSizesCache) effectiveMemoryFraction(sizeName string) float64 {
+	s.m.Lock()
+	defer s.m.Unlock()
+	return s.effectiveMemoryFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveMemoryFractionLocked(sizeName string) float64 {
 	if s.perSizeFractions != nil {
 		if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.memoryFraction != nil {
 			return fractions.memoryFraction.AsApproximateFloat64()
 		}
 	}
 	return s.kasMemoryFraction()
 }

 func (s *machineSizesCache) effectiveCPUFraction(sizeName string) float64 {
+	s.m.Lock()
+	defer s.m.Unlock()
+	return s.effectiveCPUFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveCPUFractionLocked(sizeName string) float64 {
 	if s.perSizeFractions != nil {
 		if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.cpuFraction != nil {
 			return fractions.cpuFraction.AsApproximateFloat64()
 		}
 	}
 	return s.kasCPUFraction()
 }
🧰 Tools
🪛 golangci-lint (2.5.0)

[error] 228-228: : # github.com/openshift/hypershift/sync-global-pullsecret [github.com/openshift/hypershift/sync-global-pullsecret.test]
sync-global-pullsecret/sync-global-pullsecret_test.go:228:23: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:234:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:247:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:257:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:270:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:283:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:296:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:309:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:327:12: undefined: NewMockdbusConn

(typecheck)


Comment on lines +359 to +378
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
}
Contributor

@coderabbitai coderabbitai bot Feb 23, 2026


⚠️ Potential issue | 🟠 Major


Add validation to enforce the memory-CPU ordering assumption or refactor to handle inconsistent orderings.

The assumption stated at lines 384-385 that "a size with more memory also has more CPU" is critical to recommendedSizeByBoth(), which uses sizesInOrderByMemory() as the canonical ordering for both resources. However, no validation enforces this assumption when sizes are loaded.

Real-world AWS instance types can violate this: compute-optimized instances (e.g., c5.2xlarge: 8 vCPU, 16 GiB memory) have high CPU but low memory, while memory-optimized instances (e.g., r5.xlarge: 4 vCPU, 32 GiB memory) have low CPU but high memory. If such instances are used as cluster sizes, recommendedSizeByBoth() will return an incorrectly-sized cluster by assuming the memory ordering matches CPU ordering.

Either validate the assumption in updateSizesFromConfig() and updateSizesFromMachineSets() with an error if violated, or refactor the sizing logic to independently order by CPU and memory, then select the larger result by explicit comparison rather than relying on a shared ordering.
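The second alternative above — dropping the shared-ordering assumption and picking a size that actually satisfies both resources — could look like this sketch. The `capacity` struct and function signature are hypothetical simplifications of the real cache:

```go
package main

import "fmt"

type capacity struct{ memGiB, cpu float64 }

// recommendedSizeByBoth scans the memory-ordered ladder and returns the
// first size whose capacity satisfies BOTH the memory and CPU requirements,
// rather than assuming memory ordering implies CPU ordering. If nothing
// fits both, it falls back to the largest size by memory (best effort).
func recommendedSizeByBoth(order []string, sizes map[string]capacity, memGiB, cpu float64) string {
	for _, name := range order {
		c, ok := sizes[name]
		if !ok {
			continue
		}
		if c.memGiB >= memGiB && c.cpu >= cpu {
			return name
		}
	}
	if len(order) == 0 {
		return ""
	}
	return order[len(order)-1]
}

func main() {
	// A ladder where memory ordering does NOT imply CPU ordering
	// (compute-optimized vs memory-optimized shapes, as in the comment above).
	sizes := map[string]capacity{
		"c5-like": {memGiB: 16, cpu: 8},
		"r5-like": {memGiB: 32, cpu: 4},
	}
	order := []string{"c5-like", "r5-like"} // sorted by memory
	// 20 GiB + 6 vCPU: r5-like has the memory but not the CPU,
	// so the scan falls through to the best-effort largest size.
	fmt.Println(recommendedSizeByBoth(order, sizes, 20, 6))
	// 10 GiB + 6 vCPU: c5-like satisfies both.
	fmt.Println(recommendedSizeByBoth(order, sizes, 10, 6))
}
```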


Member


The HyperShift Agents caught this as well.

Contributor


@bryan-cox, thank you for confirming! I'm glad the issue was independently identified.

Since this is a known concern, would you like me to:

  1. Create a tracking issue for this validation/refactoring work if one doesn't exist yet?
  2. Help plan the code changes to address it (either adding validation or refactoring the sizing logic)?

Let me know how I can assist! Otherwise, I'll leave this for you and the team to handle as you see fit.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 23, 2026

@csrwng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/verify — commit 75c092d (required); rerun with /test verify

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go (1)

184-235: ⚠️ Potential issue | 🟡 Minor

Potential data race on fraction reads in logging.

effectiveMemoryFraction/effectiveCPUFraction are read without holding the cache mutex, so logging can race with concurrent updates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`
around lines 184 - 235, The log section reads
r.sizeCache.effectiveMemoryFraction(...) and
r.sizeCache.effectiveCPUFraction(...) without synchronizing the sizeCache, which
can race with concurrent updates; update the code around the logging (the block
that builds logKVs and calls log.Info) to call thread-safe accessors or acquire
the same mutex used by sizeCache before invoking
effectiveMemoryFraction/effectiveCPUFraction (or add and use exported getter
methods on sizeCache that lock internally) so the fraction reads are protected.
hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go (1)

380-409: ⚠️ Potential issue | 🟠 Major

Ordering assumption between memory and CPU still unchecked.

recommendedSizeByBoth (and the CPU path it depends on) assumes memory ordering implies non-decreasing CPU; real instance types can violate this and yield incorrect sizing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`
around lines 380 - 409, The current recommendedSizeByBoth assumes sizes sorted
by memory also have non-decreasing CPU which can be false; update
recommendedSizeByBoth to pick a size that actually satisfies both resources
rather than relying on index ordering: call recommendedSizeByMemoryLocked and
recommendedSizeByCPULocked to get initial candidates, then use
sizesInOrderByMemory() to start from the larger (later) index but verify the
chosen size actually meets both memory and cpu by consulting the size spec in
s.sizes (or the struct that holds memory/cpu for each size); if the selected
size doesn't satisfy both, scan forward in sizesInOrderByMemory until you find
the first size that meets both constraints (or return "" if none), keeping the
existing locking and function names (recommendedSizeByBoth,
recommendedSizeByMemoryLocked, recommendedSizeByCPULocked, sizesInOrderByMemory,
s.sizes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-377: The function recommendedSizeByCPULocked currently treats
zero CPU capacity as a valid value and will pick the largest size; change it so
that if the cache lacks CPU capacity data for all sizes it returns "" to trigger
the caller's memory-only fallback. Concretely, in recommendedSizeByCPULocked use
resources.CPU.AsApproximateFloat64() (and/or s.effectiveCPUFraction(size)) to
detect "CPU unset" (e.g. total CPU == 0 or effective fraction == 0) and skip
those sizes; if no size with non-zero CPU capacity is found return "" instead of
the largest size. Also ensure updateSizesFromConfig continues to set CPU=0 for
legacy configs and the new logic above treats that as "missing" rather than a
capacity of zero.
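The "zero CPU means missing" behavior described above can be sketched as follows (a hypothetical simplification of the cache; names and signatures are illustrative):

```go
package main

import "fmt"

// recommendedSizeByCPU skips sizes whose CPU capacity is zero (how legacy
// configs without CPU data are stored) and returns "" when no size has CPU
// data at all, so the caller can fall back to memory-only sizing instead of
// best-effort-returning the largest size.
func recommendedSizeByCPU(order []string, cpuBySize map[string]float64, cpu float64) string {
	sawCPUData := false
	for _, name := range order {
		capacity, ok := cpuBySize[name]
		if !ok || capacity == 0 { // zero CPU: treat as unset, not as a real capacity
			continue
		}
		sawCPUData = true
		if capacity >= cpu {
			return name
		}
	}
	if !sawCPUData {
		return "" // no CPU data anywhere: caller uses the memory-only path
	}
	return order[len(order)-1] // best effort: largest size
}

func main() {
	order := []string{"small", "large"}
	// Legacy config (all CPU capacities are 0): fall back to memory-only sizing.
	fmt.Println(recommendedSizeByCPU(order, map[string]float64{"small": 0, "large": 0}, 2) == "")
	// Real CPU data: 4 vCPU requirement lands on the large size.
	fmt.Println(recommendedSizeByCPU(order, map[string]float64{"small": 2, "large": 8}, 4))
}
```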


ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 79364b4 and 75c092d.

⛔ Files ignored due to path filters (2)
  • api/scheduling/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/scheduling/v1alpha1/clustersizingconfiguration_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (6)
  • api/scheduling/v1alpha1/clustersizingconfiguration_types.go
  • cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
  • hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go

Comment on lines +359 to +377
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
Contributor

@coderabbitai coderabbitai bot Feb 23, 2026

⚠️ Potential issue | 🟠 Major

Backward-compat risk when CPU capacity is unset.

If a legacy ClusterSizingConfiguration only provides memory capacity, updateSizesFromConfig sets CPU to 0 for all sizes. When VPA includes CPU recommendations, recommendedSizeByCPULocked will return the largest size, even if memory would fit — which breaks the stated memory-only fallback.

Consider treating “no CPU capacity data” as “CPU not available” and returning "" so the caller falls back to memory-only sizing.

🛠️ Suggested fix to preserve memory-only behavior when CPU capacity is absent
 func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
 	sizesInOrder := s.sizesInOrderByMemory()
 	if len(sizesInOrder) == 0 {
 		return ""
 	}
+	hasCPUCapacity := false
 	for _, size := range sizesInOrder {
 		resources, hasSize := s.sizes[size]
 		if !hasSize {
 			continue
 		}
+		if resources.CPU.AsApproximateFloat64() > 0 {
+			hasCPUCapacity = true
+		}
 		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
 		if containerCPUCapacity >= cpu {
 			return size
 		}
 	}
+	if !hasCPUCapacity {
+		return ""
+	}
 	// Best effort: return the largest cluster size
 	return sizesInOrder[len(sizesInOrder)-1]
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`
around lines 359 - 377, The function recommendedSizeByCPULocked currently treats
zero CPU capacity as a valid value and will pick the largest size; change it so
that if the cache lacks CPU capacity data for all sizes it returns "" to trigger
the caller's memory-only fallback. Concretely, in recommendedSizeByCPULocked use
resources.CPU.AsApproximateFloat64() (and/or s.effectiveCPUFraction(size)) to
detect "CPU unset" (e.g. total CPU == 0 or effective fraction == 0) and skip
those sizes; if no size with non-zero CPU capacity is found return "" instead of
the largest size. Also ensure updateSizesFromConfig continues to set CPU=0 for
legacy configs and the new logic above treats that as "missing" rather than a
capacity of zero.

Member

HyperShift SME agents caught this too.

Contributor

@bryan-cox, thanks for confirming! It's good to see convergence across reviewers on this backward-compatibility concern.

If you'd like, I can help plan the code changes to address this issue—just let me know. Otherwise, feel free to proceed with your own implementation. The suggested fix in the review comment should preserve the memory-only fallback behavior when CPU capacity data is absent.


// kubeAPIServerMemoryFraction is a number between 0 and 1 that overrides the global
// kubeAPIServerMemoryFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
Member

Could we specify what the default is?

// kubeAPIServerCPUFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
// +optional
KubeAPIServerCPUFraction *resource.Quantity `json:"kubeAPIServerCPUFraction,omitempty"`
Member

Could we specify what the default is?

Comment on lines +359 to +378
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
}
Member

The HyperShift Agents caught this as well.

Member

Could you add kubebuilder validations for these two fields?

Comment on lines +359 to +377
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
Member

HyperShift SME agents caught this too.


// Log the size change with detailed information
log.Info("Updating cluster size recommendation",
logKVs := []interface{}{
Member

Should Prometheus metrics be emitted when sizes change?

controller := &ControlPlaneAutoscalerController{
	Client: mgr.GetClient(),
}
if err := ctrl.NewControllerManagedBy(mgr).
Member

Should ClusterSizingConfiguration changes also be watched?

Member

nit: Add t.Parallel() to tests

Member

nit: Add t.Parallel() to tests

})
}

func TestRecommendedSizeByCPU(t *testing.T) {
Member

Could we add tests that test the ordering from a different direction to make sure that validates expected behavior?

Like

  cache := machineSizesCache{
      sizes: map[string]machineResources{
          "compute": {Memory: resource.MustParse("4Gi"), CPU: resource.MustParse("8")},
          "memory":  {Memory: resource.MustParse("16Gi"), CPU: resource.MustParse("4")},
          "large":   {Memory: resource.MustParse("32Gi"), CPU: resource.MustParse("16")},
      },
  }


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
