Skip to content

OCPCRT-450: Improve ClusterBot's cloud platform dispursment#604

Open
thiagoalessio wants to merge 4 commits intoopenshift:mainfrom
thiagoalessio:OCPCRT-450
Open

OCPCRT-450: Improve ClusterBot's cloud platform dispursment#604
thiagoalessio wants to merge 4 commits intoopenshift:mainfrom
thiagoalessio:OCPCRT-450

Conversation

@thiagoalessio
Copy link
Member

Previously, selecting which cloud account to use for a platform (e.g. aws vs aws-2) required maintaining two nearly identical switch blocks: one in manager.go to query Boskos metrics and pick the best account, and another in prow.go to apply the corresponding cluster profile to the ProwJob. Adding or removing a cloud account meant editing both switches and adding/removing wrapper functions.

Now, all account information lives in a single platformQuotaSlices map. A generic selectCloudAccountProfile function loops over entries, queries Boskos, and picks the one with the most free resources. The result is passed directly to applyClusterProfile.

Why the quota-slice names are still listed explicitly

Ideally, I wish we could dynamically discover available quota-slice resource types from Boskos, but both Claude and I couldn't find something that would allow us to fetch a list of resources in https://github.com/kubernetes-sigs/boskos/blob/master/client/client.go.

This could have been refactored even further, extracting these values to a config file or something, but it is already better than before, so after these changes, adding a third AWS account for example can be done by simply by appending a new entry in platformQuotaSlices.

Suggestion on how to review

I'd recommend to check the individual commits, I tried to make each one small and self-contained, each advancing one refactoring step towards the end goal.

thiagoalessio and others added 4 commits March 17, 2026 14:04
Renaming this function to better reflect its purpose,
especially if at some point we have more than two accounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce the data structures that will replace the hardcoded
switch block for quota-slice selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract the quota-slice selection logic into a standalone function that
loops over platformQuotaSlices entries, queries Boskos metrics for each,
and returns the profile with the most free resources. The old switch
block still exists for now.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Got rid of the switch block in favor of the newly introduced `selectCloudAccountProfile`.
Also replaced the other switch block with those wrapper functions in `prow.go` with a single
nil check that calls applyClusterProfile directly from the profile struct fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 17, 2026

@thiagoalessio: This pull request references OCPCRT-450 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, selecting which cloud account to use for a platform (e.g. aws vs aws-2) required maintaining two nearly identical switch blocks: one in manager.go to query Boskos metrics and pick the best account, and another in prow.go to apply the corresponding cluster profile to the ProwJob. Adding or removing a cloud account meant editing both switches and adding/removing wrapper functions.

Now, all account information lives in a single platformQuotaSlices map. A generic selectCloudAccountProfile function loops over entries, queries Boskos, and picks the one with the most free resources. The result is passed directly to applyClusterProfile.

Why the quota-slice names are still listed explicitly

Ideally, I wish we could dynamically discover available quota-slice resource types from Boskos, but both Claude and I couldn't find something that would allow us to fetch a list of resources in https://github.com/kubernetes-sigs/boskos/blob/master/client/client.go.

This could have been refactored even further, extracting these values to a config file or something, but it is already better than before, so after these changes, adding a third AWS account for example can be done by simply by appending a new entry in platformQuotaSlices.

Suggestion on how to review

I'd recommend to check the individual commits, I tried to make each one small and self-contained, each advancing one refactoring step towards the end goal.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Walkthrough

This PR introduces a quota-slice mechanism for cloud platform account management. It adds a CloudAccountProfile structure and selection logic based on Boskos metrics, replaces per-platform boolean flags with structured account profiles, and consolidates platform-specific conversion logic into a generic profile application function.

Changes

Cohort / File(s) Summary
Core Quota-Slice Mechanism
pkg/manager/manager.go
Introduces platformQuotaSlices map and selectCloudAccountProfile function to query Boskos metrics and select cloud accounts with available quota; integrates selection into LaunchJobForUser for amd64 architecture.
Type Definition Update
pkg/manager/types.go
Adds public CloudAccountProfile struct with QuotaSlice, ProfileName, ProfileSecret, and AccountDomain fields; replaces Job.UseSecondaryAccount boolean with CloudAccountProfile pointer field.
Platform Profile Application
pkg/manager/prow.go
Consolidates per-platform conversion functions (convertAWSToAWS2, convertAzureToAzure2, convertGCPToGCP2) into single generic applyClusterProfile function; removes platform-specific branching and conversion helpers.
Test Coverage
pkg/manager/manager_test.go
Adds mockLeaseClient type and Test_selectCloudAccountProfile test suite validating profile selection logic, nil returns, and error handling across quota-slice scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can generate a title for your PR based on the changes with custom instructions.

Set the reviews.auto_title_instructions setting to generate a title for your PR based on the changes in the PR with custom instructions.

@openshift-ci openshift-ci bot requested review from AlexNPavel and jupierce March 17, 2026 13:24
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thiagoalessio
Once this PR has been reviewed and has the lgtm label, please assign hoxhaeris for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 17, 2026

@thiagoalessio: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/manager/prow.go (1)

1732-1734: ⚠️ Potential issue | 🟠 Major

Initialize Environment before setting BASE_DOMAIN.

Many launch configs have no environment: block. When that happens, Lines 1732-1734 silently drop the alternate-domain override, so aws-2 / azure-2 jobs still use the default base domain.

Suggested fix
-	if accountDomain != "" && matchedTarget.MultiStageTestConfiguration != nil && matchedTarget.MultiStageTestConfiguration.Environment != nil {
-		matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain
-	}
+	if accountDomain != "" {
+		if matchedTarget.MultiStageTestConfiguration.Environment == nil {
+			matchedTarget.MultiStageTestConfiguration.Environment = make(citools.TestEnvironment)
+		}
+		matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/manager/prow.go` around lines 1732 - 1734, The code sets
matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] without
ensuring Environment is initialized; modify the block that checks accountDomain
and matchedTarget.MultiStageTestConfiguration (the lines manipulating
matchedTarget.MultiStageTestConfiguration.Environment) to allocate a new map if
Environment is nil (e.g., set
matchedTarget.MultiStageTestConfiguration.Environment = make(map[string]string))
before assigning the "BASE_DOMAIN" key so alternate-domain overrides are applied
even when no environment: block exists in the launch config.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/manager/manager.go`:
- Around line 154-156: The current code returns an error when
lClient.Metrics(accounts[i].QuotaSlice) fails, which aborts scheduling; change
this to log the error and fall back to using the primary/default account instead
of returning nil. Specifically, in the block that calls lClient.Metrics for
accounts[i].QuotaSlice, catch the error, emit a warning (including the error)
and set metrics to the primary account's metrics (or mark this candidate as
using the primary quota) so the caller can continue; do this change around the
lClient.Metrics call and its error handling to avoid returning fmt.Errorf and
instead continue with a fallback metrics value.

---

Outside diff comments:
In `@pkg/manager/prow.go`:
- Around line 1732-1734: The code sets
matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] without
ensuring Environment is initialized; modify the block that checks accountDomain
and matchedTarget.MultiStageTestConfiguration (the lines manipulating
matchedTarget.MultiStageTestConfiguration.Environment) to allocate a new map if
Environment is nil (e.g., set
matchedTarget.MultiStageTestConfiguration.Environment = make(map[string]string))
before assigning the "BASE_DOMAIN" key so alternate-domain overrides are applied
even when no environment: block exists in the launch config.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0abbf168-63b8-4399-859d-1a757db3749a

📥 Commits

Reviewing files that changed from the base of the PR and between 9504e68 and 0be8b07.

📒 Files selected for processing (4)
  • pkg/manager/manager.go
  • pkg/manager/manager_test.go
  • pkg/manager/prow.go
  • pkg/manager/types.go

Comment on lines +154 to +156
metrics, err := lClient.Metrics(accounts[i].QuotaSlice)
if err != nil {
return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fall back to the primary account when Boskos metrics fail.

This is a best-effort balancing check. Returning an error here means a transient Boskos issue blocks every amd64 aws / azure / gcp launch instead of just using the default account.

Suggested fix
 		metrics, err := lClient.Metrics(accounts[i].QuotaSlice)
 		if err != nil {
-			return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err)
+			klog.Warningf("failed to get metrics for %q leases, using primary account: %v", accounts[i].QuotaSlice, err)
+			return nil, nil
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
metrics, err := lClient.Metrics(accounts[i].QuotaSlice)
if err != nil {
return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err)
metrics, err := lClient.Metrics(accounts[i].QuotaSlice)
if err != nil {
klog.Warningf("failed to get metrics for %q leases, using primary account: %v", accounts[i].QuotaSlice, err)
return nil, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/manager/manager.go` around lines 154 - 156, The current code returns an
error when lClient.Metrics(accounts[i].QuotaSlice) fails, which aborts
scheduling; change this to log the error and fall back to using the
primary/default account instead of returning nil. Specifically, in the block
that calls lClient.Metrics for accounts[i].QuotaSlice, catch the error, emit a
warning (including the error) and set metrics to the primary account's metrics
(or mark this candidate as using the primary quota) so the caller can continue;
do this change around the lClient.Metrics call and its error handling to avoid
returning fmt.Errorf and instead continue with a fallback metrics value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants