Error: Failed to associate the BaremetalHost to the Metal3Machine #265

Closed
alosadagrande opened this issue Mar 20, 2020 · 3 comments

alosadagrande commented Mar 20, 2020

While setting up metal3-dev-env to create target clusters based on CentOS 8 as the base image, I ran into this error with the Machines:

NAME                           PROVIDERID                                      PHASE
centos8-controlplane-ghjx2     metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   Failed
centos8-md-0-9fbd54c6d-6xqfr   metal3://4a06806b-413d-453d-95d1-d24f306840ba   Failed
centos8-md-0-9fbd54c6d-nlwd7   metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   Failed
centos8-md-0-9fbd54c6d-t4r7w   metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   Failed

I see them Running, and after an indeterminate period of time they suddenly become Failed. However, it is not always all of them; sometimes only the control plane or some of the workers fail. This is what I see in the capi-controller-manager logs:

[alosadag@eko4 metal3-dev-env]$ kubectl logs -f capi-controller-manager-664c75c4df-2qhnt -n capi-system

I0320 09:42:03.053773       1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
E0320 09:42:03.053834       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"centos8-md-0-9fbd54c6d-zkh9x\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
I0320 09:42:03.066965       1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 
E0320 09:42:03.069221       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"centos8-md-0-9fbd54c6d-zkh9x\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="centos8" "machine"="centos8-md-0-9fbd54c6d-zkh9x" "namespace"="metal3" 

From my point of view everything looks like it is running fine, except for the Machines' error state:

============== cluster =================
NAME      PHASE
centos8   Provisioned
============ metal3cluster ==========
NAME      READY   ERROR   CLUSTER   ENDPOINT
centos8   true            centos8   map[host:192.168.111.249 port:6443]
================ bareMetalHost ================
NAME     STATUS   PROVISIONING STATUS   CONSUMER                     BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       ready                                              ipmi://192.168.111.1:6230   unknown            false
node-1   OK       provisioned           centos8-md-0-btx7x           ipmi://192.168.111.1:6231   unknown            true
node-2   OK       provisioned           centos8-controlplane-2gsvr   ipmi://192.168.111.1:6232   unknown            true
node-3   OK       provisioned           centos8-md-0-7fqjh           ipmi://192.168.111.1:6233   unknown            true
node-4   OK       provisioned           centos8-md-0-qd7fc           ipmi://192.168.111.1:6234   unknown            true
node-5   OK       ready                                              ipmi://192.168.111.1:6235   unknown            true
=============== Metal3Machine ===============
NAME                         PROVIDERID                                      READY   CLUSTER   PHASE
centos8-controlplane-2gsvr   metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   true    centos8
centos8-md-0-7fqjh           metal3://4a06806b-413d-453d-95d1-d24f306840ba   true    centos8
centos8-md-0-btx7x           metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   true    centos8
centos8-md-0-qd7fc           metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   true    centos8
=================== Machines ===================
NAME                           PROVIDERID                                      PHASE
centos8-controlplane-ghjx2     metal3://60914f8e-2a41-4bce-82c2-e652083aac3d   Failed
centos8-md-0-9fbd54c6d-6xqfr   metal3://4a06806b-413d-453d-95d1-d24f306840ba   Failed
centos8-md-0-9fbd54c6d-nlwd7   metal3://bfca3312-1860-488d-9bdc-7a3b58e5940f   Failed
centos8-md-0-9fbd54c6d-t4r7w   metal3://1e2d61e7-d44b-4eb4-9d90-a7df245165dc   Failed
=============== Machinedeployment ===============
NAME           PHASE     REPLICAS   AVAILABLE   READY
centos8-md-0   Running   3          3           3
 =============== Metal3MachineTemplate ===============
NAME                   AGE
centos8-controlplane   15h
centos8-md-0           14h
============= kubeAdmConfigTemplate =============
NAME           AGE
centos8-md-0   14h
================= kubeAdmConfig =================
NAME                         AGE
centos8-controlplane-7wrcb   15h
centos8-md-0-9qqfh           10h
centos8-md-0-bhq65           11h
centos8-md-0-dg6fx           10h

Also, the target cluster looks OK to me:

[centos@node-2 ~]$ kubectl  get nodes -o wide
NAME     STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
node-1   Ready    <none>   10h   v1.17.4   192.168.111.21   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-2   Ready    master   15h   v1.17.4   192.168.111.22   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-3   Ready    <none>   10h   v1.17.4   192.168.111.23   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1
node-4   Ready    <none>   10h   v1.17.4   192.168.111.24   <none>        CentOS Linux 8 (Core)   4.18.0-147.3.1.el8_1.x86_64   docker://18.9.1

I do not know what implications it has for these Machine objects to be in a Failed state. I am also not sure whether it could be related to running the target cluster with CentOS 8 instead of CentOS 7.

maelk commented Mar 20, 2020

Did you see this behavior with CentOS 7?
Can you please add the YAML output of one of the Machines? There should be more info than just "Failed".
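For example, something like this (using the metal3 namespace from your output) should show the full status, including the failureReason and failureMessage fields:

kubectl get machine centos8-controlplane-ghjx2 -n metal3 -o yaml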

maelk commented Mar 22, 2020

This is independent of the OS. Here are the Machines from a failed CI run:

apiVersion: v1
items:
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:40Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      cluster.x-k8s.io/control-plane: ""
      kubeadm.controlplane.cluster.x-k8s.io/hash: "3288331389"
    name: test1-controlplane-lpnxq
    namespace: metal3
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: KubeadmControlPlane
      name: test1-controlplane
      uid: 834ec5bd-3cbf-4b87-9ccc-f8b5b07140d4
    resourceVersion: "6836"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-controlplane-lpnxq
    uid: 6be1adb7-c185-4779-be4c-d04b0e396f2e
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-controlplane-bzpmb
        namespace: metal3
        uid: 5c3bbc07-b8f3-4ccb-98f5-be299edf226e
      dataSecretName: test1-controlplane-bzpmb
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-controlplane-jfnpf
      namespace: metal3
      uid: 1cc69570-8c3d-4fbd-a8d6-c7d25118cb31
    providerID: metal3://d99b9c2b-5aee-45e6-9249-e7b42f7ccc76
    version: v1.17.0
  status:
    addresses:
    - address: 192.168.111.21
      type: InternalIP
    - address: 172.22.0.91
      type: InternalIP
    - address: node-1
      type: Hostname
    - address: node-1
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-controlplane-jfnpf": Failed to associate
      the BaremetalHost to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:34:41Z"
    nodeRef:
      name: node-1
      uid: 49aa2fb7-b645-4da3-9600-64e31074bf81
    phase: Failed
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:21Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generateName: test1-md-0-78d55dc456-
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      machine-template-hash: "3481187012"
      nodepool: nodepool-0
    name: test1-md-0-78d55dc456-5b86v
    namespace: metal3
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: test1-md-0-78d55dc456
      uid: 7b2e3dbc-8af2-4f43-8945-6968bc12bd76
    resourceVersion: "9982"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-md-0-78d55dc456-5b86v
    uid: 3826ad1f-b1ed-4504-b265-2dde934055e9
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-md-0-5hhsj
        namespace: metal3
        uid: 2d396de3-c3f4-45cc-baff-d915ebcfb20d
      dataSecretName: test1-md-0-5hhsj
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-md-0-hr5jw
      namespace: metal3
      uid: 3477a996-d531-4901-a20e-bd9c93b1a8e6
    providerID: metal3://6fc41e7b-cc2b-4a06-a9d4-883c563d0c47
    version: v1.17.0
  status:
    addresses:
    - address: 172.22.0.87
      type: InternalIP
    - address: 192.168.111.20
      type: InternalIP
    - address: node-0
      type: Hostname
    - address: node-0
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-md-0-hr5jw": Failed to associate the BaremetalHost
      to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:45:24Z"
    nodeRef:
      name: node-0
      uid: 8d146172-c593-455e-bf3c-c5c1dcb1bb73
    phase: Failed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The problem is that CAPM3 sets an error on the Metal3Machine when it fails to associate a BareMetalHost (i.e. when it requeues), and CAPI picks that error up but never picks up the change back to the normal state. CAPM3 needs to be modified so that it does not put the Metal3Machine into an error state for a transient error.
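To sketch the idea in code (a hypothetical illustration, not the actual CAPM3 implementation; only ctrl.Result comes from sigs.k8s.io/controller-runtime, the type and helpers below are placeholders):

// Hypothetical sketch of the proposed behaviour, not the actual CAPM3 code.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// placeholder for the Metal3Machine status fields that CAPI mirrors
type metal3Machine struct {
	failureReason  *string
	failureMessage *string
}

// setFailure marks the Metal3Machine as permanently failed. CAPI copies
// these fields onto the owning Machine and never clears them, which is
// why a transient association failure must not go through this path.
func setFailure(m *metal3Machine, reason, message string) {
	m.failureReason = &reason
	m.failureMessage = &message
}

// reconcileAssociate sketches the association step: when no BareMetalHost
// can be picked yet, treat it as transient and requeue instead of setting
// the failure fields, so the Machine never flips to the Failed phase.
func reconcileAssociate(m *metal3Machine, host *string) (ctrl.Result, error) {
	if host == nil {
		// transient condition: retry later, leave the status untouched
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// ... associate the host and set the providerID here ...
	return ctrl.Result{}, nil
}

The key point is that the failure fields are terminal from CAPI's point of view, so they should be reserved for unrecoverable errors while transient conditions are handled through requeues.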

This will be tackled in metal3-io/cluster-api-provider-metal3#30
/close

@metal3-io-bot

@maelk: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
