Error: Failed to associate the BaremetalHost to the Metal3Machine #265

Closed
alosadagrande opened this issue Mar 20, 2020 · 3 comments

alosadagrande commented Mar 20, 2020

While setting up metal3-dev-env to create target clusters based on CentOS 8 as the base image, I ran into this error with the Machines:

NAME                           PROVIDERID                                      PHASE
centos8-controlplane-ghjx2     metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   Failed
centos8-md-0-9fbd54c6d-6xqfr   metal3://4a06806b-413d-453d-95d1-d24f306840ba   Failed
centos8-md-0-9fbd54c6d-nlwd7   metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   Failed
centos8-md-0-9fbd54c6d-t4r7w   metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   Failed

I see them Running, and after an indeterminate period of time they suddenly become Failed. However, it is not always all of them; sometimes only the control plane or some of the workers fail. This is what I see in the capi-controller-manager logs:

[alosadag@eko4 metal3-dev-env]$ kubectl logs -f capi-controller-manager-664c75c4df-2qhnt -n capi-system

I0320 09:42:03.053773       1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
E0320 09:42:03.053834       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"centos8-md-0-9fbd54c6d-zkh9x\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
I0320 09:42:03.066965       1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
E0320 09:42:03.069221       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"centos8-md-0-9fbd54c6d-zkh9x\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 

From my point of view everything looks like it is running fine, except for the Machines' error state:

============== cluster =================
NAME      PHASE
centos8   Provisioned
============ metal3cluster ==========
NAME      READY   ERROR   CLUSTER   ENDPOINT
centos8   true            centos8   map[host:192.168.111.249 port:6443]
================ bareMetalHost ================
NAME     STATUS   PROVISIONING STATUS   CONSUMER                     BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       ready                                              ipmi://192.168.111.1:6230   unknown            false
node-1   OK       provisioned           centos8-md-0-btx7x           ipmi://192.168.111.1:6231   unknown            true
node-2   OK       provisioned           centos8-controlplane-2gsvr   ipmi://192.168.111.1:6232   unknown            true
node-3   OK       provisioned           centos8-md-0-7fqjh           ipmi://192.168.111.1:6233   unknown            true
node-4   OK       provisioned           centos8-md-0-qd7fc           ipmi://192.168.111.1:6234   unknown            true
node-5   OK       ready                                              ipmi://192.168.111.1:6235   unknown            true
=============== Metal3Machine ===============
NAME                         PROVIDERID                                      READY   CLUSTER   PHASE
centos8-controlplane-2gsvr   metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   true    centos8
centos8-md-0-7fqjh           metal3://4a06806b-413d-453d-95d1-d24f306840ba   true    centos8
centos8-md-0-btx7x           metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   true    centos8
centos8-md-0-qd7fc           metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   true    centos8
=================== Machines ===================
NAME                           PROVIDERID                                      PHASE
centos8-controlplane-ghjx2     metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   Failed
centos8-md-0-9fbd54c6d-6xqfr   metal3://4a06806b-413d-453d-95d1-d24f306840ba   Failed
centos8-md-0-9fbd54c6d-nlwd7   metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   Failed
centos8-md-0-9fbd54c6d-t4r7w   metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   Failed
=============== Machinedeployment ===============
NAME           PHASE     REPLICAS   AVAILABLE   READY
centos8-md-0   Running   3          3           3
 =============== Metal3MachineTemplate ===============
NAME                   AGE
centos8-controlplane   15h
centos8-md-0           14h
============= kubeAdmConfigTemplate =============
NAME           AGE
centos8-md-0   14h
================= kubeAdmConfig =================
NAME                         AGE
centos8-controlplane-7wrcb   15h
centos8-md-0-9qqfh           10h
centos8-md-0-bhq65           11h
centos8-md-0-dg6fx           10h

Also, the target cluster looks OK to me:

[centos@node-2 ~]$ kubectl  get nodes -o wide
NAME     STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
node-1   Ready    <none>   10h   v1.17.4   192.168.111.21   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-2   Ready    master   15h   v1.17.4   192.168.111.22   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-3   Ready    <none>   10h   v1.17.4   192.168.111.23   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-4   Ready    <none>   10h   v1.17.4   192.168.111.24   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1

I do not know what implications it has for these Machine objects to be in a Failed state. I am also not sure whether it could be related to running the target cluster with CentOS 8 instead of CentOS 7.

maelk commented Mar 20, 2020

Did you see this behavior with CentOS 7?
Can you please add the YAML output of one of the Machines? There should be more info than just "Failed".
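For example, something like this (using the metal3 namespace from your output) should show the full status, including the failureReason and failureMessage fields:

kubectl get machine centos8-controlplane-ghjx2 -n metal3 -o yaml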

maelk commented Mar 22, 2020

This is independent of the OS. Here are the Machines from a failed CI run:

apiVersion: v1
items:
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:40Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      cluster.x-k8s.io/control-plane: ""
      kubeadm.controlplane.cluster.x-k8s.io/hash: "3288331389"
    name: test1-controlplane-lpnxq
    namespace: metal3
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: KubeadmControlPlane
      name: test1-controlplane
      uid: 834ec5bd-3cbf-4b87-9ccc-f8b5b07140d4
    resourceVersion: "6836"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-controlplane-lpnxq
    uid: 6be1adb7-c185-4779-be4c-d04b0e396f2e
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-controlplane-bzpmb
        namespace: metal3
        uid: 5c3bbc07-b8f3-4ccb-98f5-be299edf226e
      dataSecretName: test1-controlplane-bzpmb
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-controlplane-jfnpf
      namespace: metal3
      uid: 1cc69570-8c3d-4fbd-a8d6-c7d25118cb31
    providerID: metal3://d99b9c2b-5aee-45e6-9249-e7b42f7ccc76
    version: v1.17.0
  status:
    addresses:
    - address: 192.168.111.21
      type: InternalIP
    - address: 172.22.0.91
      type: InternalIP
    - address: node-1
      type: Hostname
    - address: node-1
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-controlplane-jfnpf": Failed to associate
      the BaremetalHost to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:34:41Z"
    nodeRef:
      name: node-1
      uid: 49aa2fb7-b645-4da3-9600-64e31074bf81
    phase: Failed
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:21Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generateName: test1-md-0-78d55dc456-
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      machine-template-hash: "3481187012"
      nodepool: nodepool-0
    name: test1-md-0-78d55dc456-5b86v
    namespace: metal3
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: test1-md-0-78d55dc456
      uid: 7b2e3dbc-8af2-4f43-8945-6968bc12bd76
    resourceVersion: "9982"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-md-0-78d55dc456-5b86v
    uid: 3826ad1f-b1ed-4504-b265-2dde934055e9
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-md-0-5hhsj
        namespace: metal3
        uid: 2d396de3-c3f4-45cc-baff-d915ebcfb20d
      dataSecretName: test1-md-0-5hhsj
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-md-0-hr5jw
      namespace: metal3
      uid: 3477a996-d531-4901-a20e-bd9c93b1a8e6
    providerID: metal3://6fc41e7b-cc2b-4a06-a9d4-883c563d0c47
    version: v1.17.0
  status:
    addresses:
    - address: 172.22.0.87
      type: InternalIP
    - address: 192.168.111.20
      type: InternalIP
    - address: node-0
      type: Hostname
    - address: node-0
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-md-0-hr5jw": Failed to associate the BaremetalHost
      to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:45:24Z"
    nodeRef:
      name: node-0
      uid: 8d146172-c593-455e-bf3c-c5c1dcb1bb73
    phase: Failed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The problem is that CAPM3 sets an error on the Metal3Machine when it fails to associate a BareMetalHost (i.e. when it requeues), and CAPI picks that error up but never picks up the change back to the normal state. CAPM3 needs to be modified so that it does not put the Metal3Machine into an error state for a transient error.
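To sketch the idea in code (a hypothetical illustration, not the actual CAPM3 implementation; only ctrl.Result comes from sigs.k8s.io/controller-runtime, the type and helpers below are placeholders):

// Hypothetical sketch of the proposed behaviour, not the actual CAPM3 code.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// placeholder for the Metal3Machine status fields that CAPI mirrors
type metal3Machine struct {
	failureReason  *string
	failureMessage *string
}

// setFailure marks the Metal3Machine as permanently failed. CAPI copies
// these fields onto the owning Machine and never clears them, which is
// why a transient association failure must not go through this path.
func setFailure(m *metal3Machine, reason, message string) {
	m.failureReason = &reason
	m.failureMessage = &message
}

// reconcileAssociate sketches the association step: when no BareMetalHost
// can be picked yet, treat it as transient and requeue instead of setting
// the failure fields, so the Machine never flips to the Failed phase.
func reconcileAssociate(m *metal3Machine, host *string) (ctrl.Result, error) {
	if host == nil {
		// transient condition: retry later, leave the status untouched
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// ... associate the host and set the providerID here ...
	return ctrl.Result{}, nil
}

The key point is that the failure fields are terminal from CAPI's point of view, so they should be reserved for unrecoverable errors while transient conditions are handled through requeues.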

This will be tackled in metal3-io/cluster-api-provider-metal3#30
/close

@metal3-io-bot

@maelk: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
