
BMH is stuck in inspecting state #1706

Closed
ss2901 opened this issue Apr 30, 2024 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage: Indicates an issue lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@ss2901

ss2901 commented Apr 30, 2024

I am using a Dell server for deployment and it is getting stuck in the inspecting state. A SUSE image (SLES 15) is used for booting; I also tried Ubuntu 22.04. The deployment fails with an inspection error saying that the timeout was reached while inspecting the node.

Events:
Normal InspectionStarted 34m metal3-baremetal-controller Hardware inspection started
Normal InspectionError 4m35s metal3-baremetal-controller timeout reached while inspecting the node
Normal InspectionStarted 4m33s metal3-baremetal-controller Hardware inspection started
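
(These events can be viewed with kubectl describe on the host; the name and namespace below match the manifest further down.)

$ kubectl describe bmh my-rke2-capm3-my-server -n my-rke2-capm3
# the Events section at the end shows the InspectionStarted/InspectionError transitions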

On the iDRAC virtual console, the node is stuck at "unable to access console, root account is locked", even though I have checked the root credentials and they work fine.

Also, below is the YAML of the BMH:

$ kubectl get bmh -A -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    annotations:
      meta.helm.sh/release-name: cluster-bmh
      meta.helm.sh/release-namespace: my-rke2-capm3
      sylvaproject.org/baremetal-host-name: my-server
      sylvaproject.org/cluster-name: my-rke2-capm3
      sylvaproject.org/default-longhorn-disks-config: '[{ "path":"/var/longhorn/disks/sdb","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] },{ "path":"/var/longhorn/disks/sdc","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] } ]'
    creationTimestamp: "2024-04-29T20:54:34Z"
    finalizers:
    - baremetalhost.metal3.io
    generation: 2
    labels:
      app.kubernetes.io/instance: cluster-bmh
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: sylva-capi-cluster
      app.kubernetes.io/version: 0.0.0
      cluster-role: control-plane
      helm.sh/chart: sylva-capi-cluster-0.0.0_ab1e5edb7f30
      helm.toolkit.fluxcd.io/name: cluster-bmh
      helm.toolkit.fluxcd.io/namespace: my-rke2-capm3
      host-type: generic
    name: my-rke2-capm3-my-server
    namespace: my-rke2-capm3
    resourceVersion: "1180427"
    uid: 76e9268e-b807-42b6-ac07-0feb8013e18d
  spec:
    automatedCleaningMode: metadata
    bmc:
      address: redfish://<bmc-address>/redfish/v1/Systems/System.Embedded.1
      credentialsName: my-rke2-capm3-my-server-secret
      disableCertificateVerification: true
    bootMACAddress: <mac-address>
    bootMode: UEFI
    description: Dell M640 Blade Server
    online: true
    rootDeviceHints:
      hctl: "0:0:0:0"
  status:
    errorCount: 12
    errorMessage: ""
    goodCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
    hardwareProfile: unknown
    lastUpdated: "2024-04-30T06:07:58Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: "2024-04-29T20:54:53Z"
      provision:
        end: null
        start: null
      register:
        end: "2024-04-29T20:54:53Z"
        start: "2024-04-29T20:54:36Z"
    operationalStatus: OK
    poweredOn: false
    provisioning:
      ID: 647bb414-2678-4bfb-9782-66b99adcdd6f
      bootMode: UEFI
      image:
        url: ""
      rootDeviceHints:
        hctl: "0:0:0:0"
      state: inspecting
    triedCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
kind: List
metadata:
  resourceVersion: ""

/kind bug

@metal3-io-bot metal3-io-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 30, 2024
@metal3-io-bot
Contributor

This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Apr 30, 2024
@dtantsur
Member

Hi! The SSH public key for the root account on the inspection/deployment ramdisk can be passed to the ironic image (IRONIC_RAMDISK_SSH_KEY variable). Not sure if that's what you did; mentioning it for completeness.
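
For example, if Ironic is deployed from the standard metal3 manifests, the key can go into whatever ConfigMap feeds the ironic container's environment; the names below are illustrative and depend on your deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ironic-bmo-configmap        # illustrative; use the ConfigMap your ironic deployment actually reads
  namespace: baremetal-operator-system
data:
  IRONIC_RAMDISK_SSH_KEY: "ssh-ed25519 AAAA... user@example"

With that in place you can SSH into the running ramdisk as root and look at its journal and network state.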

Once you get in, the first thing to check is networking. In the vast majority of cases, what you observe is caused by the ramdisk's inability to reach back to Ironic on the provisioning network. The other end of it is the dnsmasq container in the metal3 pod; you can even start by checking its logs. If they are empty or do not mention the provided bootMACAddress, chances are high that the DHCP traffic is not reaching Metal3 on the provisioning network.
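
For instance, something along these lines will show the dnsmasq logs (namespace, pod, and container names depend on how Metal3/Ironic was deployed):

# list the containers of the ironic pod first, since names vary between deployments
$ kubectl -n <ironic-namespace> get pod <ironic-pod> -o jsonpath='{.spec.containers[*].name}'
# then check whether dnsmasq ever saw a DHCP request from the host's bootMACAddress
$ kubectl -n <ironic-namespace> logs <ironic-pod> -c <dnsmasq-container> | grep -i <mac-address>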

I hope these hints help.

@matthewei

Could you log in to the BMC and double-check the console log?

@Rozzii
Member

Rozzii commented May 15, 2024

I would also like to ask for the logs of the Ironic container in the Ironic pod, @ss2901, please.
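
Something like the following works, with the namespace and pod name adjusted to your deployment (the container is typically named ironic in metal3 deployments):

$ kubectl -n <ironic-namespace> logs <ironic-pod> -c ironic > ironic.log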

@Rozzii
Member

Rozzii commented May 15, 2024

/triage needs-information

@metal3-io-bot metal3-io-bot added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 15, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2024
@metal3-io-bot
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Contributor

@metal3-io-bot: Closing this issue.

In response to this:

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
