
Fix prometheus pods not scheduling to infra nodes after rebalance #79335

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from Sandeepyadav93:hcp_fix on May 18, 2026

Conversation


@Sandeepyadav93 Sandeepyadav93 commented May 15, 2026

Fix prometheus pods not scheduling to infra nodes after rebalance

The rebalanceInfra function restarted the prometheus-k8s StatefulSet
without configuring node placement, so the scheduler could place the pods
on worker nodes instead of infra nodes. Because the Prometheus workload is
resource-intensive, this led to OOM issues on the workers.

Root cause: nodeSelector and tolerations were not configured for
Prometheus before the pod restart. Previously, topologySpreadConstraints
helped ensure at least one Prometheus pod landed on infra nodes (as
described in RFE-5107), but topologySpreadConstraints is no longer
present in the current prometheus-k8s StatefulSet. Without an explicit
nodeSelector and tolerations, Prometheus pods can schedule to workers.

Changes:

  • Apply cluster-monitoring-config ConfigMap with nodeSelector and
    tolerations for prometheusK8s to explicitly target infra nodes
    (other monitoring components consume minimal resources and remain
    on workers)
  • Wait for cluster-monitoring-operator to reconcile the StatefulSet
    template spec before restarting pods (poll up to 5 minutes using jq
    to verify nodeSelector and tolerations are present in the spec)
  • Add inline verification after rollout to ensure prometheus pods
    actually land on infra nodes (12 retries over 2 minutes)
  • Fail fast with explicit error if StatefulSet reconciliation times out
    or pods don't schedule to infra nodes, preventing silent OOM failures
    on workers
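
The first change in the list above can be sketched as a small helper that prints the manifest. This is a minimal sketch under stated assumptions, not the script's actual code: the function name `render_monitoring_config` is hypothetical, and the `node-role.kubernetes.io/infra` selector and `NoSchedule` toleration are the values named in the review discussion on this PR.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print a cluster-monitoring-config manifest that pins
# only prometheusK8s to infra nodes. Other monitoring components are left
# untouched, as the PR description states.
render_monitoring_config() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        operator: Exists
        effect: NoSchedule
EOF
}

# Usage (assumes a logged-in `oc` session against the hosted cluster):
#   render_monitoring_config | oc apply -f -
```

Keeping the manifest in a function that only prints makes it possible to inspect or diff the YAML before anything is applied to a cluster.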

Related: https://redhat.atlassian.net/browse/RFE-5107


coderabbitai Bot commented May 15, 2026

Walkthrough

The script expands the Prometheus migration in rebalanceInfra: it logs current state, applies a cluster-monitoring-config, waits for operator reconciliation, restarts the prometheus-k8s StatefulSet, verifies pods land on infra nodes with retries, and updates the HCP flow to call checkInfra for prometheus-k8s.

Changes

Prometheus HyperShift Migration

File (all layers): ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh

| Layer | Summary |
| --- | --- |
| Pre-migration logging and state | Log migration start and print current prometheus-k8s pods and StatefulSet prior to changes. |
| Apply cluster-monitoring-config and wait for reconciliation | Apply a cluster-monitoring-config ConfigMap in openshift-monitoring to set prometheusK8s nodeSelector/tolerations and poll the prometheus-k8s StatefulSet JSON until the template reflects the expected nodeSelector/tolerations or timeout with failure logging. |
| Restart StatefulSet and wait for rollout | Restart the prometheus-k8s StatefulSet and block until oc rollout status completes. |
| Verify pods scheduled on infra nodes | Run a bounded retry loop verifying all running prometheus-k8s-* pods are scheduled on nodes labeled as infra; log warnings for mismatches and fail (listing pods) if not achieved within retries. |
| HCP cluster flow call site change | In the HCP hypershift flow, the post-“Re-balance infra components” step now calls checkInfra "prometheus-k8s" "openshift-monitoring" instead of rebalanceInfra "prometheus-k8s". |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: fixing prometheus pods scheduling to infra nodes after rebalance, which is the core issue addressed in the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The check is not applicable to this PR. The modified file is a shell script, not a Ginkgo test file. No Go test files with Ginkgo test definitions were modified. |
| Test Structure And Quality | ✅ Passed | The custom check reviews Ginkgo test code quality (Go tests). This PR modifies only bash shell scripts and contains no Ginkgo tests. Not applicable. |
| Microshift Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. The modified file is a bash shell script for CI infrastructure setup, not a Go test file. The custom check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests. It only modifies a bash script (openshift-qe-hypershift-infra-commands.sh) for CI infrastructure operations. The custom check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | CI test script for HyperShift. Checks topology at runtime and exits if not HyperShift. Scheduling constraints are HyperShift-specific. |
| Ote Binary Stdout Contract | ✅ Passed | Check not applicable. PR modifies bash scripts, config files, and documentation, not OTE binary source code. OTE stdout contract check applies only to OTE binaries with process-level stdout writes. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added. The PR modifies only a bash script in the CI step registry, not test code. This check applies only to new Ginkgo e2e tests. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@Sandeepyadav93
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 54-118: The current heredoc creates/replaces the entire ConfigMap
"cluster-monitoring-config" (data.config.yaml) which wipes unrelated settings;
instead, modify this step to merge only the node-placement stanza into the
existing ConfigMap: fetch the existing "cluster-monitoring-config" (namespace
openshift-monitoring), parse data.config.yaml, inject/merge the nodeSelector and
tolerations into each component (alertmanagerMain, prometheusK8s,
prometheusOperator, k8sPrometheusAdapter, kubeStateMetrics, telemeterClient,
openshiftStateMetrics, thanosQuerier) and then update the ConfigMap (e.g., via
oc get -> merge YAML -> oc apply/oc patch) rather than replacing
data.config.yaml via the heredoc used with "cat << 'EOF' | oc apply -f -".
- Around line 120-121: The current sleep 30 after "Wait for
cluster-monitoring-operator to reconcile the configuration" is insufficient;
replace the fixed sleep with a polling loop that queries the prometheus-k8s
StatefulSet pod template (using kubectl -n openshift-monitoring get statefulset
prometheus-k8s -o jsonpath=... or equivalent) and waits until the infra
nodeSelector and/or tolerations (the infra placement) are present in
spec.template.spec.nodeSelector and/or spec.template.spec.tolerations, then
proceed to perform the rollout restart of prometheus-k8s; ensure the loop has
a timeout and sleeps between polls to avoid tight looping.
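
A polling loop of the kind suggested in the second comment might look like the sketch below. It is a sketch under assumptions, not the merged implementation: `poll_until` and `sts_has_infra_placement` are hypothetical names, the jq filter assumes the placement uses the `node-role.kubernetes.io/infra` key, and the 10-second interval is illustrative (the PR description only states a 5-minute limit).

```bash
#!/usr/bin/env bash

# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds; give up once TIMEOUT_SECONDS have elapsed.
poll_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Succeeds once the StatefulSet pod template carries both the infra
# nodeSelector and a toleration with the infra key.
sts_has_infra_placement() {
  oc -n openshift-monitoring get statefulset prometheus-k8s -o json |
    jq -e '
      .spec.template.spec as $pod
      | ($pod.nodeSelector // {} | has("node-role.kubernetes.io/infra"))
        and any($pod.tolerations[]?; .key == "node-role.kubernetes.io/infra")
    ' >/dev/null
}

# Usage (requires `oc` and `jq`):
#   poll_until 300 10 sts_has_infra_placement || {
#     echo "StatefulSet was not reconciled within 5 minutes" >&2
#     exit 1
#   }
```

Separating the generic loop from the cluster-specific check keeps the timeout logic testable without a cluster.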

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e4b84612-9189-4f54-aed5-aeffcd5bc337

📥 Commits

Reviewing files that changed from the base of the PR and between 26eb294 and c4e854e.

📒 Files selected for processing (1)
  • ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2026
@Sandeepyadav93
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector

@openshift-merge-bot
Contributor

@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh (1)

154-180: 💤 Low value

Outer checkInfra retry is now a no-op for prometheus-k8s.

rebalanceInfra now performs its own 12-retry verification and exit 1 on failure (lines 176-180). Combined with set -o errexit, that means the wrapping checkInfra loop on line 259 cannot ever retry for prometheus-k8s, and its post-call verification at lines 202-210 just duplicates what rebalanceInfra already proved. If "fail fast on placement failure" is the intent (per the PR description), this is fine — but consider either dropping the redundant outer pass for prometheus-k8s, or returning a non-zero status from rebalanceInfra so checkInfra's TRY loop can actually exercise the retries it advertises. Right now the script presents two retry layers but only the inner one ever runs.

Also applies to: 259-259
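
The difference between `exit 1` and `return 1` that this nitpick relies on can be illustrated in isolation. The sketch below is not the script's code: `verify_on_infra` and `retry_with_limit` are hypothetical stand-ins for the inner verification and the outer retry loop.

```bash
#!/usr/bin/env bash

attempts=0

# Hypothetical stand-in for the placement check: fails twice, then succeeds.
verify_on_infra() {
  attempts=$((attempts + 1))
  if [ "$attempts" -lt 3 ]; then
    echo "attempt $attempts: pods not on infra nodes yet" >&2
    # `return 1` hands control back to the caller. `exit 1` here would end
    # the whole script, so an outer retry loop could never run again.
    return 1
  fi
  echo "attempt $attempts: all pods on infra nodes"
}

# retry_with_limit MAX_RETRIES COMMAND [ARGS...]
retry_with_limit() {
  local max_retries="$1" try=1
  shift
  while [ "$try" -le "$max_retries" ]; do
    # A command tested by `if` does not trigger `set -o errexit` on failure.
    if "$@"; then
      return 0
    fi
    try=$((try + 1))
  done
  return 1
}

retry_with_limit 5 verify_on_infra
```

If fail-fast behavior is the goal, keeping `exit 1` is a legitimate choice; the sketch only shows what the outer loop needs in order to retry.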

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`
around lines 154 - 180, The inner rebalanceInfra verification for prometheus-k8s
currently calls exit 1 on failure (and sets VERIFY_SUCCESS), which with set -o
errexit makes the outer checkInfra retry loop (RETRY/TRY/MAX_RETRIES) a no-op;
either remove the redundant prometheus-k8s verification from checkInfra or make
rebalanceInfra return a non-zero status instead of exiting so the outer loop can
actually retry: change the exit 1 in rebalanceInfra to return 1 (and ensure
VERIFY_SUCCESS is set appropriately), and update the caller (checkInfra) to test
the rebalanceInfra return code and continue its RETRY loop (or delete the outer
prometheus-k8s branch if you prefer fail-fast behavior).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 86-92: The cluster-monitoring-config uses the removed
k8sPrometheusAdapter key; update the manifest to use metricsServer instead:
replace the top-level k8sPrometheusAdapter mapping with metricsServer and keep
the nested nodeSelector and tolerations (the node-role.kubernetes.io/infra
selector and the NoSchedule toleration with key node-role.kubernetes.io/infra
and operator Exists) so the Cluster Monitoring Operator on OCP 4.22 will accept
and apply the configuration.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4d255d7c-6c5f-42e2-b51d-6172c5690787

📥 Commits

Reviewing files that changed from the base of the PR and between c4e854e and 61f4ed6.

📒 Files selected for processing (1)
  • ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh

@Sandeepyadav93 Sandeepyadav93 force-pushed the hcp_fix branch 2 times, most recently from de86cf0 to 1b7df99, on May 15, 2026 13:47. Commit message:

The rebalanceInfra function was restarting prometheus-k8s statefulset
without configuring node placement, causing pods to randomly land on
worker nodes instead of infra nodes. This led to OOM issues on workers
as prometheus workload is resource-intensive.

Root cause: Missing nodeSelector and tolerations configuration for
prometheus before pod restart. Previously, topologySpreadConstraints
helped ensure at least one prometheus pod landed on infra nodes (as
described in RFE-5107), but topologySpreadConstraints is no longer
present in the current prometheus-k8s StatefulSet. Without explicit
nodeSelector and tolerations, prometheus pods schedule to workers.

Changes:
- Apply cluster-monitoring-config ConfigMap with nodeSelector and
  tolerations for prometheusK8s to explicitly target infra nodes
  (other monitoring components consume minimal resources and remain
  on workers)
- Wait for cluster-monitoring-operator to reconcile the StatefulSet
  template spec before restarting pods (poll up to 5 minutes using jq
  to verify nodeSelector and tolerations are present in the spec)
- Add inline verification after rollout to ensure prometheus pods
  actually land on infra nodes (12 retries over 2 minutes)
- Fail fast with explicit error if StatefulSet reconciliation times out
  or pods don't schedule to infra nodes, preventing silent OOM failures
  on workers

Related: https://redhat.atlassian.net/browse/RFE-5107

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@Sandeepyadav93
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector

@openshift-merge-bot
Contributor

@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@Sandeepyadav93: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-120nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-249nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-498nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-498nodes-onperfsector openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-24nodes-onperfsector openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-node-density-heavy-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-node-density-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-24nodes-onperfsector openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-120nodes-onperfsector openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-nd openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-nd-cni openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-cdv2 openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-crd openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-nd openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-nd-cni openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-cdv2 openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-crd openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-node-density-heavy-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-node-density-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-249nodes-onperfsector-nd openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed

A total of 48 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@Sandeepyadav93
Contributor Author

/assign @mukrishn @mcornea

@Sandeepyadav93
Contributor Author

Looking good

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/79335/rehearse-79335-periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector/2055288859511492608/artifacts/control-plane-24nodes-onperfsector/openshift-qe-hypershift-infra/build-log.txt

15-05-2026T14:45:20  Fri May 15 14:45:20 UTC 2026 - Initiate migration of prometheus to infra nodepools
prometheus-k8s-0                                         6/6     Running   0          12m     10.131.10.11   ip-10-0-110-117.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-1                                         6/6     Running   0          12m     10.128.6.11    ip-10-0-113-94.us-east-2.compute.internal    <none>           <none>
NAME             READY   AGE
prometheus-k8s   2/2     12m
15-05-2026T14:45:20  Fri May 15 14:45:20 UTC 2026 - Apply cluster-monitoring-config to move prometheus to infra nodes
configmap/cluster-monitoring-config created
15-05-2026T14:45:21  Fri May 15 14:45:21 UTC 2026 - Wait for cluster-monitoring-operator to reconcile the configuration
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - StatefulSet reconciled with infra nodeSelector and tolerations
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - Restart stateful set pods
rollout restart -n openshift-monitoring statefulset/prometheus-k8s
statefulset.apps/prometheus-k8s restarted
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - Wait till they are completely restarted
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision prometheus-k8s-7d5c6ccc66...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 2 pods at revision prometheus-k8s-7d5c6ccc66...
15-05-2026T14:47:35  Fri May 15 14:47:35 UTC 2026 - Verify prometheus pods are running on infra nodes
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - prometheus pod on ip-10-0-64-208.us-east-2.compute.internal (infra node) ✓
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - prometheus pod on ip-10-0-119-27.us-east-2.compute.internal (infra node) ✓
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - All prometheus-k8s pods are on infra nodes ✓
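
The final verification step in the log above could be reproduced by hand with something like the following. This is a sketch, not the script's code: both function names are hypothetical, and it assumes infra nodes carry the `node-role.kubernetes.io/infra` label.

```bash
#!/usr/bin/env bash

# nodes_all_infra POD_NODES INFRA_NODES
# Both arguments are whitespace-separated node names. Succeeds only when at
# least one pod node is given and every pod node appears in INFRA_NODES.
nodes_all_infra() {
  local pod_nodes infra_nodes node
  # Collapse newlines and repeated blanks so the substring match is reliable.
  pod_nodes=$(echo $1)
  infra_nodes=" $(echo $2) "
  [ -n "$pod_nodes" ] || return 1
  for node in $pod_nodes; do
    case "$infra_nodes" in
      *" $node "*) ;;
      *)
        echo "prometheus pod on non-infra node: $node" >&2
        return 1
        ;;
    esac
  done
  return 0
}

# Collect the nodes of running prometheus-k8s-* pods and compare them with
# the nodes labeled as infra (label name is an assumption).
check_prometheus_on_infra() {
  local pod_nodes infra_nodes
  pod_nodes=$(oc -n openshift-monitoring get pods \
      --field-selector=status.phase=Running \
      -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
    awk '$1 ~ /^prometheus-k8s-/ { print $2 }')
  infra_nodes=$(oc get nodes -l node-role.kubernetes.io/infra \
      -o jsonpath='{.items[*].metadata.name}')
  nodes_all_infra "$pod_nodes" "$infra_nodes"
}

# Usage (requires a logged-in `oc` session):
#   check_prometheus_on_infra && echo "All prometheus-k8s pods are on infra nodes"
```

Treating an empty pod list as a failure avoids reporting success while the pods are still being recreated.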


mcornea commented May 18, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2026

mcornea commented May 18, 2026

/pj-rehearse ack

@openshift-merge-bot
Contributor

@mcornea: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026

openshift-ci Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mcornea, Sandeepyadav93



openshift-ci Bot commented May 18, 2026

@Sandeepyadav93: all tests passed!


@openshift-merge-bot openshift-merge-bot Bot merged commit 12c766e into openshift:main May 18, 2026
11 checks passed
@Sandeepyadav93 Sandeepyadav93 deleted the hcp_fix branch May 18, 2026 13:08
wgahnagl pushed a commit to wgahnagl/release that referenced this pull request May 20, 2026
…enshift#79335)
