
BMH is stuck in inspecting state #1706

Closed
ss2901 opened this issue Apr 30, 2024 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage: Indicates an issue lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@ss2901

ss2901 commented Apr 30, 2024

I am using a Dell server for deployment and it is getting stuck in the inspecting state. A SUSE image (SLES 15) is used for booting; I also tried Ubuntu 22.04. The deployment fails with an inspection error saying that the timeout was reached while inspecting the node.

Events:
Normal InspectionStarted 34m metal3-baremetal-controller Hardware inspection started
Normal InspectionError 4m35s metal3-baremetal-controller timeout reached while inspecting the node
Normal InspectionStarted 4m33s metal3-baremetal-controller Hardware inspection started
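
(These events can be viewed with kubectl describe on the host; the name and namespace below match the manifest further down.)

$ kubectl describe bmh my-rke2-capm3-my-server -n my-rke2-capm3
# the Events section at the end shows the InspectionStarted/InspectionError transitions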

On the iDRAC virtual console, the node is stuck at "unable to access console, root account is locked", even though I have checked the root credentials and they work fine.

Also, below is the YAML of the BMH:

$ kubectl get bmh -A -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    annotations:
      meta.helm.sh/release-name: cluster-bmh
      meta.helm.sh/release-namespace: my-rke2-capm3
      sylvaproject.org/baremetal-host-name: my-server
      sylvaproject.org/cluster-name: my-rke2-capm3
      sylvaproject.org/default-longhorn-disks-config: '[{ "path":"/var/longhorn/disks/sdb","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] },{ "path":"/var/longhorn/disks/sdc","storageReserved":0,"allowScheduling":true,"tags":[
        "ssd", "fast" ] } ]'
    creationTimestamp: "2024-04-29T20:54:34Z"
    finalizers:
    - baremetalhost.metal3.io
    generation: 2
    labels:
      app.kubernetes.io/instance: cluster-bmh
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: sylva-capi-cluster
      app.kubernetes.io/version: 0.0.0
      cluster-role: control-plane
      helm.sh/chart: sylva-capi-cluster-0.0.0_ab1e5edb7f30
      helm.toolkit.fluxcd.io/name: cluster-bmh
      helm.toolkit.fluxcd.io/namespace: my-rke2-capm3
      host-type: generic
    name: my-rke2-capm3-my-server
    namespace: my-rke2-capm3
    resourceVersion: "1180427"
    uid: 76e9268e-b807-42b6-ac07-0feb8013e18d
  spec:
    automatedCleaningMode: metadata
    bmc:
      address: redfish://<bmc-address>/redfish/v1/Systems/System.Embedded.1
      credentialsName: my-rke2-capm3-my-server-secret
      disableCertificateVerification: true
    bootMACAddress: <mac-address>
    bootMode: UEFI
    description: Dell M640 Blade Server
    online: true
    rootDeviceHints:
      hctl: "0:0:0:0"
  status:
    errorCount: 12
    errorMessage: ""
    goodCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
    hardwareProfile: unknown
    lastUpdated: "2024-04-30T06:07:58Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: "2024-04-29T20:54:53Z"
      provision:
        end: null
        start: null
      register:
        end: "2024-04-29T20:54:53Z"
        start: "2024-04-29T20:54:36Z"
    operationalStatus: OK
    poweredOn: false
    provisioning:
      ID: 647bb414-2678-4bfb-9782-66b99adcdd6f
      bootMode: UEFI
      image:
        url: ""
      rootDeviceHints:
        hctl: "0:0:0:0"
      state: inspecting
    triedCredentials:
      credentials:
        name: my-rke2-capm3-my-server-secret
        namespace: my-rke2-capm3
      credentialsVersion: "238295"
kind: List
metadata:
  resourceVersion: ""

/kind bug

@metal3-io-bot metal3-io-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 30, 2024
@metal3-io-bot
Contributor

This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Apr 30, 2024
@dtantsur
Member

Hi! The SSH public key for the root account on the inspection/deployment ramdisk can be passed to the ironic image (IRONIC_RAMDISK_SSH_KEY variable). Not sure if that's what you did; mentioning it for completeness.
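
For example, if Ironic is deployed from the standard metal3 manifests, the key can go into whatever ConfigMap feeds the ironic container's environment; the names below are illustrative and depend on your deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ironic-bmo-configmap        # illustrative; use the ConfigMap your ironic deployment actually reads
  namespace: baremetal-operator-system
data:
  IRONIC_RAMDISK_SSH_KEY: "ssh-ed25519 AAAA... user@example"

With that in place you can SSH into the running ramdisk as root and look at its journal and network state.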

Once you get in, the first thing to check is networking. In the vast majority of cases, what you observe is caused by the ramdisk's inability to reach back to Ironic on the provisioning network. The other end of it is the dnsmasq container in the metal3 pod; you can even start by checking its logs. If they are empty or do not mention the provided bootMACAddress, chances are high that the DHCP traffic is not reaching Metal3 on the provisioning network.
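
For instance, something along these lines will show the dnsmasq logs (namespace, pod, and container names depend on how Metal3/Ironic was deployed):

# list the containers of the ironic pod first, since names vary between deployments
$ kubectl -n <ironic-namespace> get pod <ironic-pod> -o jsonpath='{.spec.containers[*].name}'
# then check whether dnsmasq ever saw a DHCP request from the host's bootMACAddress
$ kubectl -n <ironic-namespace> logs <ironic-pod> -c <dnsmasq-container> | grep -i <mac-address>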

I hope these hints help.

@matthewei

Could you log in to the BMC and double-check the console log?

@Rozzii
Member

Rozzii commented May 15, 2024

I would also like to ask for the logs of the Ironic container in the Ironic pod, @ss2901, please.
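
Something like the following works, with the namespace and pod name adjusted to your deployment (the container is typically named ironic in metal3 deployments):

$ kubectl -n <ironic-namespace> logs <ironic-pod> -c ironic > ironic.log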

@Rozzii
Member

Rozzii commented May 15, 2024

/triage needs-information

@metal3-io-bot metal3-io-bot added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 15, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2024
@metal3-io-bot
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Contributor

@metal3-io-bot: Closing this issue.

In response to this:

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
