feat(autoscaler): add CPU-based control plane autoscaling #7783
csrwng wants to merge 1 commit into openshift:main
Conversation
…size resource fractions

Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory VPA recommendations when determining hosted cluster size. Previously, sizing was based solely on kube-apiserver memory recommendations, which could lead to under-provisioning when CPU was the bottleneck.

Changes:
- Add global kubeAPIServerCPUFraction to ResourceBasedAutoscalingConfiguration
- Add per-size fraction overrides (memory and CPU) to SizeCapacity
- Implement effective fraction resolution: per-size > global > default (0.65)
- Add recommendedSizeByCPU and recommendedSizeByBoth methods to the sizing cache
- Update controller to extract both CPU and memory from the VPA UncappedTarget
- Final size is max(memory-driven size, CPU-driven size)
- Maintain backward compatibility: memory-only sizing when CPU data is absent

OCPSTRAT-2915

Signed-off-by: Cesar Wong <cewong@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Walkthrough: The changes introduce CPU-based sizing support for the Kube API Server configuration alongside existing memory-based sizing. New fields for CPU and memory fractions are added to type definitions and CRD schemas, while controller and caching logic are extended to compute sizing recommendations based on both memory and CPU, along with per-size overrides.
Sequence Diagram
sequenceDiagram
participant VPA as VPA Recommendations
participant Controller as CPU Autoscaler Controller
participant Cache as Machine Sizes Cache
participant ClusterSize as Cluster Sizing Decision
VPA->>Controller: Provide memory & CPU recommendations<br/>for kube-apiserver
Controller->>Cache: Query recommendedSizeByMemory()<br/>with memory recommendation
Cache-->>Controller: Return size for memory requirement
Controller->>Cache: Query recommendedSizeByCPU()<br/>with CPU recommendation
Cache-->>Controller: Return size for CPU requirement
alt Both memory and CPU recommendations exist
Controller->>Cache: Call recommendedSizeByBoth()<br/>(memory size, CPU size)
Cache-->>Controller: Return largest requirement size
Cache->>Cache: Check per-size overrides<br/>using effectiveMemoryFraction()<br/>& effectiveCPUFraction()
else Only memory exists
Controller->>ClusterSize: Use memory-based sizing
else Only CPU exists
Controller->>ClusterSize: Use CPU-based sizing
end
Controller->>ClusterSize: Apply effective fractions<br/>(per-size > global > default)
ClusterSize-->>Controller: Return recommended cluster size
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (warnings)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: csrwng
/test verify
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`:
- Around line 227-234: The log reads fractions from r.sizeCache via
effectiveMemoryFraction and effectiveCPUFraction without holding the cache
mutex, risking race with concurrent update() calls; fix by adding locking inside
sizeCache.effectiveMemoryFraction and sizeCache.effectiveCPUFraction (acquire
the same mutex used by sizeCache.update/recommendedSizeByBoth), or alternatively
add a new sizeCache method (e.g., recommendedWithFractions or
getSizeAndFractions) that returns the chosen size and the memory/CPU fractions
while holding the lock so the logged fractions are consistent with the sizing
decision.
In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-378: The code assumes sizesInOrderByMemory() produces a
memory-sorted list where increasing memory implies non-decreasing CPU, which can
be violated by real instance types; add validation in updateSizesFromConfig()
and updateSizesFromMachineSets() after sizes and sizesInOrderByMemory() are
populated: iterate the memory-ordered list and ensure each subsequent size's CPU
(resources.CPU.AsApproximateFloat64() * effectiveCPUFraction(size)) is >= the
previous CPU value, and if not return/log a clear error (or remove/skip invalid
sizes) so recommendedSizeByBoth() and recommendedSizeByCPULocked() are never
given inconsistent orderings; alternatively (if you prefer refactor) change
recommendedSizeByBoth() to compute recommendedSizeByCPULocked(cpu) and
recommendedSizeByMemoryLocked(mem) independently and return the larger of the
two results instead of relying on sizesInOrderByMemory().
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (2)
- api/scheduling/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
- vendor/github.com/openshift/hypershift/api/scheduling/v1alpha1/clustersizingconfiguration_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (6)
- api/scheduling/v1alpha1/clustersizingconfiguration_types.go
- cmd/install/assets/hypershift-operator/scheduling.hypershift.openshift.io_clustersizingconfigurations.yaml
- hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go
- hypershift-operator/controllers/resourcebasedcpautoscaler/controller_test.go
- hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
- hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache_test.go
if kubeAPIServerMemory != nil {
	logKVs = append(logKVs, "kubeAPIServerMemory", kubeAPIServerMemory.String())
	logKVs = append(logKVs, "effectiveMemoryFraction", r.sizeCache.effectiveMemoryFraction(recommendedSize))
}
if kubeAPIServerCPU != nil {
	logKVs = append(logKVs, "kubeAPIServerCPU", kubeAPIServerCPU.String())
	logKVs = append(logKVs, "effectiveCPUFraction", r.sizeCache.effectiveCPUFraction(recommendedSize))
}
Potential data race when accessing cache without holding lock.
The effectiveMemoryFraction and effectiveCPUFraction methods are called on r.sizeCache without acquiring the mutex. Looking at the cache implementation, these methods read from perSizeFractions, kasMemorySizeFraction, and kasCPUSizeFraction which can be modified by concurrent update() calls.
While the sizing decision itself is protected by the lock inside recommendedSizeByBoth, the fraction values logged here could be inconsistent with the actual values used for sizing if an update occurs between these calls.
Consider either:
- Adding lock acquisition to the effectiveMemoryFraction/effectiveCPUFraction methods, or
- Exposing a method that returns the sizing result along with the fractions used, ensuring consistency
🔒 Proposed fix: Add locking to the effective fraction methods in machine_sizes_cache.go
func (s *machineSizesCache) effectiveMemoryFraction(sizeName string) float64 {
+ s.m.Lock()
+ defer s.m.Unlock()
+ return s.effectiveMemoryFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveMemoryFractionLocked(sizeName string) float64 {
if s.perSizeFractions != nil {
if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.memoryFraction != nil {
return fractions.memoryFraction.AsApproximateFloat64()
}
}
return s.kasMemoryFraction()
}
func (s *machineSizesCache) effectiveCPUFraction(sizeName string) float64 {
+ s.m.Lock()
+ defer s.m.Unlock()
+ return s.effectiveCPUFractionLocked(sizeName)
+}
+
+func (s *machineSizesCache) effectiveCPUFractionLocked(sizeName string) float64 {
if s.perSizeFractions != nil {
if fractions, ok := s.perSizeFractions[sizeName]; ok && fractions.cpuFraction != nil {
return fractions.cpuFraction.AsApproximateFloat64()
}
}
return s.kasCPUFraction()
}📝 Committable suggestion
🧰 Tools
🪛 golangci-lint (2.5.0)
[error] 228-228: : # github.com/openshift/hypershift/sync-global-pullsecret [github.com/openshift/hypershift/sync-global-pullsecret.test]
sync-global-pullsecret/sync-global-pullsecret_test.go:228:23: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:234:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:247:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:257:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:270:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:283:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:296:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:309:26: undefined: MockdbusConn
sync-global-pullsecret/sync-global-pullsecret_test.go:327:12: undefined: NewMockdbusConn
(typecheck)
// recommendedSizeByCPULocked returns the smallest cluster size that can accommodate
// the given CPU requirement. Must be called with the mutex held.
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
	sizesInOrder := s.sizesInOrderByMemory()
	if len(sizesInOrder) == 0 {
		return ""
	}
	for _, size := range sizesInOrder {
		resources, hasSize := s.sizes[size]
		if !hasSize {
			continue
		}
		containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
		if containerCPUCapacity >= cpu {
			return size
		}
	}
	// Best effort: return the largest cluster size
	return sizesInOrder[len(sizesInOrder)-1]
}
🧩 Analysis chain
🏁 Scripts executed (repository: openshift/hypershift):
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && wc -l machine_sizes_cache.go
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '359,390p' machine_sizes_cache.go | cat -n
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && grep -n "sizesInOrderBy" machine_sizes_cache.go
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && grep -n "more memory also has more CPU" machine_sizes_cache.go
- cd hypershift-operator && find . -name "*test*.go" -o -name "*_test.go" | head -20
- cd hypershift-operator && rg -n "sizes\[|SizeCapacity|Memory:|CPU:" --type go -A 2 | head -100
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '284,310p' machine_sizes_cache.go | cat -n
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '375,410p' machine_sizes_cache.go | cat -n
- cd hypershift-operator && find . -name "*.go" | xargs grep -l "compute-optimized\|memory-optimized\|instance.*type" | head -10
- cd hypershift-operator && rg -n "AddSizeCapacity|SetSizeCapacity" --type go | head -20
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '320,350p' machine_sizes_cache.go | cat -n
- cd hypershift-operator && rg -n "machineSizesCache|NewMachineSizesCache|machineSizesCache{" --type go | head -20
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && grep -n "func (s \*machineSizesCache)" machine_sizes_cache.go | head -20
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '1,100p' machine_sizes_cache.go | head -80
- cd hypershift-operator/controllers/nodepool/instancetype && cat -n aws/provider.go | head -150
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '143,180p' machine_sizes_cache.go | cat -n
- cd hypershift-operator/controllers/resourcebasedcpautoscaler && sed -n '181,244p' machine_sizes_cache.go | cat -n
- cd hypershift-operator && rg -n "compute.*optimized|memory.*optimized|t3\.|c5\.|r5\." --type go | head -20
- cd hypershift-operator && rg -n "validation|Validation|assert|Assert" controllers/resourcebasedcpautoscaler/machine_sizes_cache.go
Add validation to enforce the memory-CPU ordering assumption or refactor to handle inconsistent orderings.
The assumption stated at lines 384-385 that "a size with more memory also has more CPU" is critical to recommendedSizeByBoth(), which uses sizesInOrderByMemory() as the canonical ordering for both resources. However, no validation enforces this assumption when sizes are loaded.
Real-world AWS instance types can violate this: compute-optimized instances (e.g., c5.2xlarge: 8 vCPU, 16 GiB memory) have high CPU but low memory, while memory-optimized instances (e.g., r5.xlarge: 4 vCPU, 32 GiB memory) have low CPU but high memory. If such instances are used as cluster sizes, recommendedSizeByBoth() will return an incorrectly-sized cluster by assuming the memory ordering matches CPU ordering.
Either validate the assumption in updateSizesFromConfig() and updateSizesFromMachineSets() with an error if violated, or refactor the sizing logic to independently order by CPU and memory, then select the larger result by explicit comparison rather than relying on a shared ordering.
The HyperShift Agents caught this as well.
@bryan-cox, thank you for confirming! I'm glad the issue was independently identified.
Since this is a known concern, would you like me to:
- Create a tracking issue for this validation/refactoring work if one doesn't exist yet?
- Help plan the code changes to address it (either adding validation or refactoring the sizing logic)?
Let me know how I can assist! Otherwise, I'll leave this for you and the team to handle as you see fit.
@csrwng: The following test failed.
Full PR test history. Your PR dashboard.
Actionable comments posted: 1
♻️ Duplicate comments (2)
hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go (1)
- Lines 184-235: ⚠️ Potential issue | 🟡 Minor — Potential data race on fraction reads in logging. effectiveMemoryFraction and effectiveCPUFraction are read without holding the cache mutex, so logging can race with concurrent updates.
hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go (1)
- Lines 380-409: ⚠️ Potential issue | 🟠 Major — Ordering assumption between memory and CPU still unchecked. recommendedSizeByBoth (and the CPU path it depends on) assumes memory ordering implies non-decreasing CPU; real instance types can violate this and yield incorrect sizing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 359-377: The function recommendedSizeByCPULocked currently treats
zero CPU capacity as a valid value and will pick the largest size; change it so
that if the cache lacks CPU capacity data for all sizes it returns "" to trigger
the caller's memory-only fallback. Concretely, in recommendedSizeByCPULocked use
resources.CPU.AsApproximateFloat64() (and/or s.effectiveCPUFraction(size)) to
detect "CPU unset" (e.g. total CPU == 0 or effective fraction == 0) and skip
those sizes; if no size with non-zero CPU capacity is found return "" instead of
the largest size. Also ensure updateSizesFromConfig continues to set CPU=0 for
legacy configs and the new logic above treats that as "missing" rather than a
capacity of zero.
---
Duplicate comments:
In `@hypershift-operator/controllers/resourcebasedcpautoscaler/controller.go`:
- Around line 184-235: The log section reads
r.sizeCache.effectiveMemoryFraction(...) and
r.sizeCache.effectiveCPUFraction(...) without synchronizing the sizeCache, which
can race with concurrent updates; update the code around the logging (the block
that builds logKVs and calls log.Info) to call thread-safe accessors or acquire
the same mutex used by sizeCache before invoking
effectiveMemoryFraction/effectiveCPUFraction (or add and use exported getter
methods on sizeCache that lock internally) so the fraction reads are protected.
In
`@hypershift-operator/controllers/resourcebasedcpautoscaler/machine_sizes_cache.go`:
- Around line 380-409: The current recommendedSizeByBoth assumes sizes sorted by
memory also have non-decreasing CPU which can be false; update
recommendedSizeByBoth to pick a size that actually satisfies both resources
rather than relying on index ordering: call recommendedSizeByMemoryLocked and
recommendedSizeByCPULocked to get initial candidates, then use
sizesInOrderByMemory() to start from the larger (later) index but verify the
chosen size actually meets both memory and cpu by consulting the size spec in
s.sizes (or the struct that holds memory/cpu for each size); if the selected
size doesn't satisfy both, scan forward in sizesInOrderByMemory until you find
the first size that meets both constraints (or return "" if none), keeping the
existing locking and function names (recommendedSizeByBoth,
recommendedSizeByMemoryLocked, recommendedSizeByCPULocked, sizesInOrderByMemory,
s.sizes).
Backward-compat risk when CPU capacity is unset.
If a legacy ClusterSizingConfiguration only provides memory capacity, updateSizesFromConfig sets CPU to 0 for all sizes. When VPA includes CPU recommendations, recommendedSizeByCPULocked will return the largest size, even if memory would fit — which breaks the stated memory-only fallback.
Consider treating “no CPU capacity data” as “CPU not available” and returning "" so the caller falls back to memory-only sizing.
🛠️ Suggested fix to preserve memory-only behavior when CPU capacity is absent
func (s *machineSizesCache) recommendedSizeByCPULocked(cpu float64) string {
sizesInOrder := s.sizesInOrderByMemory()
if len(sizesInOrder) == 0 {
return ""
}
+ hasCPUCapacity := false
for _, size := range sizesInOrder {
resources, hasSize := s.sizes[size]
if !hasSize {
continue
}
+ if resources.CPU.AsApproximateFloat64() > 0 {
+ hasCPUCapacity = true
+ }
containerCPUCapacity := resources.CPU.AsApproximateFloat64() * s.effectiveCPUFraction(size)
if containerCPUCapacity >= cpu {
return size
}
}
+ if !hasCPUCapacity {
+ return ""
+ }
// Best effort: return the largest cluster size
return sizesInOrder[len(sizesInOrder)-1]
}
HyperShift SME agents caught this too.
@bryan-cox, thanks for confirming! It's good to see convergence across reviewers on this backward-compatibility concern.
If you'd like, I can help plan the code changes to address this issue—just let me know. Otherwise, feel free to proceed with your own implementation. The suggested fix in the review comment should preserve the memory-only fallback behavior when CPU capacity data is absent.
// kubeAPIServerMemoryFraction is a number between 0 and 1 that overrides the global
// kubeAPIServerMemoryFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
Could we specify what the default is?
// kubeAPIServerCPUFraction for this specific size. If not specified, the global
// fraction (or its default) is used.
// +optional
KubeAPIServerCPUFraction *resource.Quantity `json:"kubeAPIServerCPUFraction,omitempty"`
Could we specify what the default is?
The HyperShift Agents caught this as well.
Could you add kubebuilder validations for these two fields?
HyperShift SME agents caught this too.
// Log the size change with detailed information
log.Info("Updating cluster size recommendation",
logKVs := []interface{}{
Should Prometheus metrics be emitted when sizes change?
controller := &ControlPlaneAutoscalerController{
	Client: mgr.GetClient(),
}
if err := ctrl.NewControllerManagedBy(mgr).
Should ClusterSizingConfiguration changes also be watched?
nit: Add t.Parallel() to tests
nit: Add t.Parallel() to tests
func TestRecommendedSizeByCPU(t *testing.T)
Could we add tests that test the ordering from a different direction to make sure that validates expected behavior?
Like
cache := machineSizesCache{
sizes: map[string]machineResources{
"compute": {Memory: resource.MustParse("4Gi"), CPU: resource.MustParse("8")},
"memory": {Memory: resource.MustParse("16Gi"), CPU: resource.MustParse("4")},
"large": {Memory: resource.MustParse("32Gi"), CPU: resource.MustParse("16")},
},
}
Summary
- Extend the ResourceBasedControlPlaneAutoscaler to consider both CPU and memory VPA recommendations when sizing hosted clusters, preventing under-provisioning when CPU is the bottleneck (OCPSTRAT-2915)
- Add a global kubeAPIServerCPUFraction and per-size fraction overrides for both memory and CPU in ClusterSizingConfiguration
- Final size is max(memory-driven size, CPU-driven size), with full backward compatibility when CPU data is absent

Details

API Changes (ClusterSizingConfiguration)
- ResourceBasedAutoscalingConfiguration: added kubeAPIServerCPUFraction (global CPU fraction, default 0.65)
- SizeCapacity: added kubeAPIServerMemoryFraction and kubeAPIServerCPUFraction per-size overrides

Controller Changes
- recommendedClusterSize extracts both memory and CPU from the VPA UncappedTarget
- Calls recommendedSizeByBoth, recommendedSize (memory-only), or recommendedSizeByCPU depending on available data

Cache Changes
- recommendedSizeByCPU: analogous to the existing memory-based sizing
- recommendedSizeByBoth: returns the larger of the two independent size recommendations

Backward Compatibility
- Existing ClusterSizingConfiguration resources work without changes

🤖 Generated with Claude Code