
baremetal: recover from RAC0508 during iDRAC boot override #4421

Merged: LiliDeng merged 2 commits into main from vyadav_idrac_bug_april on Apr 21, 2026



@vyadavmsft (Collaborator) commented Apr 16, 2026

Detect IDRAC.2.8.RAC0508 and generic provider-not-ready failures as reset-worthy iDRAC service errors, retry the boot-order override once after resetting iDRAC, and wait for Dell Lifecycle Controller remote services to report ready before declaring the reset complete.

Manager API reachability alone is not sufficient for ImportSystemConfiguration. Without the LC readiness gate, immediate retries can still hit RAC0508 while the provider is initializing.
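A two-stage readiness gate of this kind could be sketched as follows. This is a minimal illustration, not the code from the PR: `wait_until_ready`, both probe callables, and the default timeout and interval are hypothetical names and values, and the PR text does not say which iDRAC call is used to query Lifecycle Controller status.

```python
import time
from typing import Callable


def wait_until_ready(
    manager_reachable: Callable[[], bool],
    lc_services_ready: Callable[[], bool],
    timeout: float = 600.0,   # assumed default, not taken from the PR
    interval: float = 10.0,   # assumed default, not taken from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once both probes pass, False if the timeout expires first."""
    deadline = clock() + timeout
    while True:
        # Manager reachability is checked first; the Lifecycle Controller
        # probe only runs when the manager answers (short-circuit `and`).
        if manager_reachable() and lc_services_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The clock and sleep functions are injectable so the gate can be exercised in tests without waiting. A caller would treat a `False` result as a failed reset rather than retrying the import immediately.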

This keeps the recovery localized to the shared iDRAC reset flow so other reset-based recovery paths benefit from the same stronger post-reset readiness check while preserving original behavior for non-retriable failures.


@vyadavmsft vyadavmsft requested review from Copilot and removed request for Copilot April 16, 2026 19:44
@github-actions

🤖 AI Test Selection

No test cases were selected for this PR.

Copilot AI (Contributor) left a comment

Pull request overview

This PR strengthens Dell iDRAC recovery during baremetal deployment by treating RAC0508 / provider-not-ready failures as reset-worthy, retrying the boot-order override once after an iDRAC reset, and adding a post-reset readiness gate for Lifecycle Controller remote services (not just Manager API reachability).

Changes:

  • Extend iDRAC error detection to include IDRAC.2.8.RAC0508 and “provider is not ready” text.
  • After iDRAC reset, wait for Lifecycle Controller remote services readiness before considering reset complete.
  • Add a one-time retry of ImportSystemConfiguration (boot-order override) after reset-based recovery.
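The one-time retry described in the list above can be pictured roughly like this. It is a sketch under assumptions: `IdracServiceError`, `run_with_one_reset_retry`, and the three callables are hypothetical stand-ins, not names from `idrac.py`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class IdracServiceError(Exception):
    """Hypothetical error type carrying the raw iDRAC failure text."""


def run_with_one_reset_retry(
    operation: Callable[[], T],
    is_reset_worthy: Callable[[str], bool],
    reset_and_wait: Callable[[], None],
) -> T:
    """Run `operation`; on a reset-worthy failure, reset iDRAC and retry once."""
    try:
        return operation()
    except IdracServiceError as error:
        if not is_reset_worthy(str(error)):
            # Non-retriable failures propagate unchanged.
            raise
        # Assumed to reset iDRAC and block until it reports ready.
        reset_and_wait()
    # Exactly one retry; a second failure is raised to the caller.
    return operation()
```

Keeping the retry count at one bounds the added deployment time to a single reset cycle, and a failure that survives the reset still surfaces to the caller.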

Comment thread lisa/sut_orchestrator/baremetal/cluster/idrac.py Outdated
Comment thread lisa/sut_orchestrator/baremetal/cluster/idrac.py Outdated
Comment thread lisa/sut_orchestrator/baremetal/cluster/idrac.py
@vyadavmsft (Collaborator, Author)

@copilot apply changes based on the comments in this thread

… message

Agent-Logs-Url: https://github.com/microsoft/lisa/sessions/eceaeac2-0c9b-4d2a-931d-baf7a471dad0

Co-authored-by: vyadavmsft <1424753+vyadavmsft@users.noreply.github.com>
Copilot AI (Contributor) commented Apr 16, 2026

> @copilot apply changes based on the comments in this thread

Applied both changes from the review in commit 4b72dc6:

  1. All message ID checks in _reset_if_idrac_error now match against the lowercase-normalized string (normalized_error) for consistent case-insensitive detection.
  2. The timeout_message in _reset_idrac now includes actionable guidance: checking the LC job queue and iDRAC UI logs, and a note about increasing IDRAC_REMOTE_SERVICES_TIMEOUT on slow platforms.
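The case-insensitive matching from the first item might look something like the sketch below. Only the name `normalized_error` and the two markers come from this thread; `RESET_WORTHY_MARKERS` and `is_reset_worthy` are hypothetical, and the real `_reset_if_idrac_error` may check additional message IDs.

```python
# Markers are stored in lowercase so they compare directly against the
# normalized error text. This list is an assumption, not the full set.
RESET_WORTHY_MARKERS = (
    "idrac.2.8.rac0508",
    "provider is not ready",
)


def is_reset_worthy(error_text: str) -> bool:
    """Return True if the error text contains any reset-worthy marker."""
    normalized_error = error_text.lower()
    return any(marker in normalized_error for marker in RESET_WORTHY_MARKERS)
```

Normalizing once and matching every marker against the same lowercase string avoids the inconsistency where some IDs are compared case-sensitively and others are not.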

@github-actions

🤖 AI Test Selection

No test cases were selected for this PR.

@LiliDeng LiliDeng requested a review from anirudhrb April 17, 2026 05:03
@anirudhrb (Collaborator) left a comment

Code-wise LGTM, but I haven't looked into the iDRAC details.

@LiliDeng LiliDeng merged commit bf695a1 into main Apr 21, 2026
61 checks passed
@LiliDeng LiliDeng deleted the vyadav_idrac_bug_april branch April 21, 2026 00:40


5 participants