After completing the node autonomy test, the edge node status still stays Ready #106

Closed
HirazawaUi opened this issue Aug 18, 2020 · 22 comments · Fixed by #123

@HirazawaUi

HirazawaUi commented Aug 18, 2020

Situation description

  1. I installed a Kubernetes cluster (v1.16) using kubeadm. The cluster has one master and three nodes.

  2. After manually installing OpenYurt, I tried to verify that the installation was successful.

  3. I followed the "Test node autonomy" section of https://github.com/alibaba/openyurt/blob/master/docs/tutorial/yurtctl.md for the test.

  4. After completing the actions in the "Test node autonomy" section, the edge node status still stayed Ready.

Operation steps

  1. I created a sample pod
kubectl apply -f-<<EOF
apiVersion: v1
kind: Pod
metadata:
  name: bbox
spec:
  nodeName: node3       
  containers:
  - image: busybox
    command:
    - top
    name: bbox
EOF
  • node3 is the edge node. I chose the simplest way to schedule the sample pod to the edge node by setting spec.nodeName directly, although this method is not recommended in the Kubernetes documentation (see the nodeSelector sketch after these steps).
  2. I modified yurt-hub.yaml, setting the value of --server-addr= to a non-existent IP and port:
    - --server-addr=https://1.1.1.1:6448
    
  3. Then I ran curl -s http://127.0.0.1:10261 to verify whether the edge node can work normally in offline mode; the result is as expected:
    {
      "kind": "Status",
      "metadata": {
    
      },
      "status": "Failure",
      "message": "request( get : /) is not supported when cluster is unhealthy",
      "reason": "BadRequest",
      "code": 400
    }
    
  4. But node3's status still stays Ready, and the yurt-hub pod enters the Pending state:
    kubectl get nodes
    NAME     STATUS   ROLES    AGE   VERSION
    master   Ready    master   23h   v1.16.6
    node1    Ready    <none>   23h   v1.16.6
    node2    Ready    <none>   23h   v1.16.6
    node3    Ready    <none>   23h   v1.16.6
    
    # kubectl get pods -n kube-system | grep yurt
    yurt-controller-manager-59544577cc-t948z   1/1     Running   0          5h42m
    yurt-hub-node3                             0/1     Pending   0          5h32m
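
As a side note, a sketch of the scheduler-friendly alternative mentioned in the note above: instead of setting spec.nodeName directly, constrain the pod to the edge node with a nodeSelector on its kubernetes.io/hostname label (node3 carries this label, as shown in the node description later in this issue):

kubectl apply -f-<<EOF
apiVersion: v1
kind: Pod
metadata:
  name: bbox
spec:
  # let the scheduler place the pod, restricted to the edge node
  nodeSelector:
    kubernetes.io/hostname: node3
  containers:
  - image: busybox
    command:
    - top
    name: bbox
EOF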
    

Some configuration items and logs that may be useful for reference

  1. Label information of each node
    root@master:~# kubectl describe nodes master | grep Labels
    Labels:             alibabacloud.com/is-edge-worker=false
    root@master:~# kubectl describe nodes node1 | grep Labels
    Labels:             alibabacloud.com/is-edge-worker=false
    root@master:~# kubectl describe nodes node2 | grep Labels
    Labels:             alibabacloud.com/is-edge-worker=false
    root@master:~# kubectl describe nodes node3 | grep Labels
    Labels:             alibabacloud.com/is-edge-worker=true
    
  2. Configuration of kube-controller-manager
        - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
        - --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle
        - --kubeconfig=/etc/kubernetes/controller-manager.conf
    
  3. /etc/kubernetes/manifests/yurthub.yml
    # cat yurthub.yml
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        k8s-app: yurt-hub
      name: yurt-hub
      namespace: kube-system
    spec:
      volumes:
      - name: pki
        hostPath:
          path: /etc/kubernetes/pki
          type: Directory
      - name: kubernetes
        hostPath:
          path: /etc/kubernetes
          type: Directory
      - name: pem-dir
        hostPath:
          path: /var/lib/kubelet/pki
          type: Directory
      containers:
      - name: yurt-hub
        image: openyurt/yurthub:latest
        imagePullPolicy: Always
        volumeMounts:
        - name: kubernetes
          mountPath: /etc/kubernetes
        - name: pki
          mountPath: /etc/kubernetes/pki
        - name: pem-dir
          mountPath: /var/lib/kubelet/pki
        command:
        - yurthub
        - --v=2
        - --server-addr=https://1.1.1.1:6448
        - --node-name=$(NODE_NAME)
        livenessProbe:
          httpGet:
            host: 127.0.0.1
            path: /v1/healthz
            port: 10261
          initialDelaySeconds: 300
          periodSeconds: 5
          failureThreshold: 3
        resources:
          requests:
            cpu: 150m
            memory: 150Mi
          limits:
            memory: 300Mi
        securityContext:
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      hostNetwork: true
      priorityClassName: system-node-critical
      priority: 2000001000
    
  4. /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    # cat  /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    # Note: This dropin only works with kubeadm and kubelet v1.11+
    [Service]
    #Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/var/lib/openyurt/kubelet.conf"
    Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/var/lib/openyurt/kubelet.conf"
    Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
    # This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
    EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
    # This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
    # the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
    EnvironmentFile=-/etc/default/kubelet
    ExecStart=
    ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
    
  5. /var/lib/openyurt/kubelet.conf
    # cat /var/lib/openyurt/kubelet.conf
    apiVersion: v1
    clusters:
    - cluster:
        server: http://127.0.0.1:10261
      name: default-cluster
    contexts:
    - context:
        cluster: default-cluster
        namespace: default
      name: default-context
    current-context: default-context
    kind: Config
    preferences: {}
    users:
    - name: default-auth
    
  6. Use kubectl describe to view yurt-hub pod information
    # kubectl describe pods yurt-hub-node3 -n kube-system
    Name:                 yurt-hub-node3
    Namespace:            kube-system
    Priority:             2000001000
    Priority Class Name:  system-node-critical
    Node:                 node3/
    Labels:               k8s-app=yurt-hub
    Annotations:          kubernetes.io/config.hash: 7be1318d63088969eafcd2fa5887f2ef
                          kubernetes.io/config.mirror: 7be1318d63088969eafcd2fa5887f2ef
                          kubernetes.io/config.seen: 2020-08-18T08:41:27.431580091Z
                          kubernetes.io/config.source: file
    Status:               Pending
    IP:
    IPs:                  <none>
    Containers:
      yurt-hub:
        Image:      openyurt/yurthub:latest
        Port:       <none>
        Host Port:  <none>
        Command:
          yurthub
          --v=2
          --server-addr=https://10.10.13.82:6448
          --node-name=$(NODE_NAME)
        Limits:
          memory:  300Mi
        Requests:
          cpu:     150m
          memory:  150Mi
        Liveness:  http-get http://127.0.0.1:10261/v1/healthz delay=300s timeout=1s period=5s #success=1 #failure=3
        Environment:
          NODE_NAME:   (v1:spec.nodeName)
        Mounts:
          /etc/kubernetes from kubernetes (rw)
          /etc/kubernetes/pki from pki (rw)
          /var/lib/kubelet/pki from pem-dir (rw)
    Volumes:
      pki:
        Type:          HostPath (bare host directory volume)
        Path:          /etc/kubernetes/pki
        HostPathType:  Directory
      kubernetes:
        Type:          HostPath (bare host directory volume)
        Path:          /etc/kubernetes
        HostPathType:  Directory
      pem-dir:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/kubelet/pki
        HostPathType:  Directory
    QoS Class:         Burstable
    Node-Selectors:    <none>
    Tolerations:       :NoExecute
    Events:            <none>
    
  7. Use docker ps on the edge node to find the yurt-hub container, then view its logs (last 20 lines):
    # docker logs 0c89efbe949b --tail 20
    I0818 13:54:13.293068       1 health_checker.go:151] ping cluster healthz with result, Get https://1.1.1.1:6448/healthz: dial tcp 1.1.1.1:6448: connect: connection refused
    I0818 13:54:13.561262       1 util.go:177] kubelet get nodes: /api/v1/nodes/node3?resourceVersion=0&timeout=10s with status code 200, spent 331.836µs, left 10 requests in flight
    I0818 13:54:15.746576       1 util.go:177] kubelet update leases: /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node3?timeout=10s with status code 200, spent 83.127µs, left 10 requests in flight
    I0818 13:54:15.828560       1 util.go:177] kubelet get pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3 with status code 200, spent 436.489µs, left 10 requests in flight
    I0818 13:54:15.829628       1 util.go:177] kubelet patch pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3/status with status code 200, spent 307.187µs, left 10 requests in flight
    I0818 13:54:17.831366       1 util.go:177] kubelet delete pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3 with status code 200, spent 147.492µs, left 10 requests in flight
    I0818 13:54:17.833762       1 util.go:177] kubelet create pods: /api/v1/namespaces/kube-system/pods with status code 201, spent 111.762µs, left 10 requests in flight
    I0818 13:54:22.273899       1 health_checker.go:151] ping cluster healthz with result, Get https://1.1.1.1:6448/healthz: dial tcp 1.1.1.1:6448: connect: connection refused
    I0818 13:54:23.486523       1 util.go:177] kubelet watch configmaps: /api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dkube-flannel-cfg&resourceVersion=2161&timeout=7m54s&timeoutSeconds=474&watch=true with status code 200, spent 7m54.000780359s, left 9 requests in flight
    I0818 13:54:23.648871       1 util.go:177] kubelet get nodes: /api/v1/nodes/node3?resourceVersion=0&timeout=10s with status code 200, spent 266.182µs, left 10 requests in flight
    I0818 13:54:25.748497       1 util.go:177] kubelet update leases: /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node3?timeout=10s with status code 200, spent 189.694µs, left 10 requests in flight
    I0818 13:54:25.830919       1 util.go:177] kubelet get pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3 with status code 200, spent 1.375535ms, left 10 requests in flight
    I0818 13:54:25.835015       1 util.go:177] kubelet patch pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3/status with status code 200, spent 1.363765ms, left 10 requests in flight
    I0818 13:54:33.733913       1 util.go:177] kubelet get nodes: /api/v1/nodes/node3?resourceVersion=0&timeout=10s with status code 200, spent 303.499µs, left 10 requests in flight
    I0818 13:54:34.261504       1 health_checker.go:151] ping cluster healthz with result, Get https://1.1.1.1:6448/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    I0818 13:54:35.751002       1 util.go:177] kubelet update leases: /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node3?timeout=10s with status code 200, spent 144.723µs, left 10 requests in flight
    I0818 13:54:35.830895       1 util.go:177] kubelet get pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3 with status code 200, spent 1.146812ms, left 10 requests in flight
    I0818 13:54:35.834366       1 util.go:177] kubelet patch pods: /api/v1/namespaces/kube-system/pods/yurt-hub-node3/status with status code 200, spent 744.857µs, left 10 requests in flight
    I0818 13:54:42.274049       1 health_checker.go:151] ping cluster healthz with result, Get https://1.1.1.1:6448/healthz: dial tcp 1.1.1.1:6448: connect: connection refused
    I0818 13:54:43.818381       1 util.go:177] kubelet get nodes: /api/v1/nodes/node3?resourceVersion=0&timeout=10s with status code 200, spent 248.672µs, left 10 requests in flight
    
  8. Use kubectl logs to view the logs of yurt-controller-manager (last 20 lines):
    # kubectl logs yurt-controller-manager-59544577cc-t948z -n kube-system --tail 20
    E0818 13:56:07.239721       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:10.560864       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:13.288544       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:16.726605       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:19.623694       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:23.572803       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:26.809117       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:29.021205       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:31.271086       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:34.083918       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:37.493386       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:40.222869       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:44.149011       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:47.699211       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:50.177053       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:52.553163       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:55.573328       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:56:58.677034       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:57:02.844152       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    E0818 13:57:05.044990       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
    

Finally

I very much hope you can help me solve this problem or point out my mistakes. If any other information is needed, please let me know.

@xujunjie-cover
Contributor

Hello, @HirazawaUi
In your steps, in the configuration of kube-controller-manager, please remove the appended -nodelifecycle from --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle.

In my opinion, a worker node sends its health status to the master; if the heartbeats time out, the master will change the node status. After installing yurt on a worker node, the edge node sends its health status through yurthub, and the master should still change the node status as described in yurtctl.md.

But if you add -nodelifecycle to the kube-controller-manager configuration, the master will not change the node status when the heartbeats time out.

I am not sure whether my opinion is right; I will look for more details in the Kubernetes documentation.

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

Isn't this expected behavior? With the yurt-controller-manager, which re-implements the node lifecycle controller, if an edge node stops sending heartbeats, the yurt-controller-manager will not mark the node NotReady.

@HirazawaUi
Author

Hello, @HirazawaUi
In your steps, in the configuration of kube-controller-manager, please remove the appended -nodelifecycle from --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle.

In my opinion, a worker node sends its health status to the master; if the heartbeats time out, the master will change the node status. After installing yurt on a worker node, the edge node sends its health status through yurthub, and the master should still change the node status as described in yurtctl.md.

But if you add -nodelifecycle to the kube-controller-manager configuration, the master will not change the node status when the heartbeats time out.

I am not sure whether my opinion is right; I will look for more details in the Kubernetes documentation.

Thank you very much for your answer, which solved my problem; OpenYurt is now running normally. I am not sure if anyone else has encountered this problem, but I think this parameter should be described in more detail in the documentation.

@HirazawaUi
Author

Isn't this expected behavior? With the yurt-controller-manager, which re-implements the node lifecycle controller, if an edge node stops sending heartbeats, the yurt-controller-manager will not mark the node NotReady.

In the "Test node autonomy" section of https://github.com/alibaba/openyurt/blob/master/docs/tutorial/yurtctl.md, the expected result is that after modifying yurthub's --server-addr=, the edge node enters the NotReady state while the pod stays in the Running state.

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

You are right, I take it back. The yurt-controller-manager just does not evict Pods; the node status is still updated to NotReady.

@xujunjie-cover
Contributor

Showing the node status after the "Convert a multi-nodes Kubernetes cluster" step in yurtctl.md may be better. Thanks.

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

It is recommended that the default nodelifecycle controller not be installed, otherwise it will conflict with yurt-controller-manager. In the yurtctl tool, to ease the workflow, we delete the nodelifecycle SA from kube-system, assuming that the default node controller will not work without the SA. Is it possible that in your setup the default nodelifecycle controller still works (using other SAs)?

@HirazawaUi
Author

I observed that my sample pod entered the Terminating state about five minutes after the edge node stopped sending heartbeats. I don't know if this is normal and meets expectations; there is no similar description in the document.

  1. Output of kubectl get pods

    # kubectl get pods
    NAME   READY   STATUS        RESTARTS   AGE
    bbox   1/1     Terminating   0          59m
    
  2. Output of kubectl describe pods bbox

    # kubectl describe pods bbox
    Name:                      bbox
    Namespace:                 default
    Priority:                  0
    Node:                      node3/10.10.13.85
    Start Time:                Thu, 20 Aug 2020 06:01:18 +0000
    Labels:                    <none>
    Annotations:               Status:  Terminating (lasts 54m)
    Termination Grace Period:  30s
    IP:                        10.20.3.8
    IPs:
      IP:  10.20.3.8
    Containers:
      bbox:
        Container ID:  docker://1d7e95fe71f632363f7e811813ce1fd7778cfa53258cc66b5eb5aae39babca68
        Image:         busybox
        Image ID:      docker-pullable://busybox@sha256:4f47c01fa91355af2865ac10fef5bf6ec9c7f42ad2321377c21e844427972977
        Port:          <none>
        Host Port:     <none>
        Command:
          top
        State:          Running
          Started:      Thu, 20 Aug 2020 06:01:37 +0000
        Ready:          True
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-ccjzj (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   True
      PodScheduled      True
    Volumes:
      default-token-ccjzj:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-ccjzj
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type    Reason   Age   From            Message
      ----    ------   ----  ----            -------
      Normal  Pulled   60m   kubelet, node3  Successfully pulled image "busybox"
      Normal  Created  60m   kubelet, node3  Created container bbox
      Normal  Started  60m   kubelet, node3  Started container bbox
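
The ~5-minute delay looks consistent with the default eviction timing: the pod carries the node.kubernetes.io/unreachable:NoExecute toleration with tolerationSeconds=300 (visible in the Tolerations section above), and the controller-manager's default pod eviction timeout is also 5m, so an active default node lifecycle controller would evict the pod about five minutes after the node becomes unreachable. A quick way to inspect those tolerations (a sketch, assuming the defaults were not changed):

# print the tolerations injected into the pod by the DefaultTolerationSeconds admission plugin
kubectl get pod bbox -o jsonpath='{.spec.tolerations}'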
    

@HirazawaUi
Author

It is recommended that the default nodelifecycle controller not be installed, otherwise it will conflict with yurt-controller-manager. In the yurtctl tool, to ease the workflow, we delete the nodelifecycle SA from kube-system, assuming that the default node controller will not work without the SA. Is it possible that in your setup the default nodelifecycle controller still works (using other SAs)?

I did not find anything in the Kubernetes documentation about the default nodelifecycle controller still working through other SAs. In fact, this test was done on a brand-new Kubernetes cluster installed with kubeadm for OpenYurt; I haven't made any changes to it.

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

I observed that my sample pod entered the Terminating state about five minutes after the edge node stopped sending heartbeats. I don't know if this is normal and meets expectations; there is no similar description in the document.

This looks like the default node controller behavior. Can you run kubectl get sa -n kube-system | grep node and see if the node controller SA is still there?

@HirazawaUi
Author

I observed that my sample pod entered the Terminating state about five minutes after the edge node stopped sending heartbeats. I don't know if this is normal and meets expectations; there is no similar description in the document.

This looks like the default node controller behavior. Can you run kubectl get sa -n kube-system | grep node and see if the node controller SA is still there?

Yes, it exists in the cluster. Do I need to delete it?

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

I observed that my sample pod entered the Terminating state about five minutes after the edge node stopped sending heartbeats. I don't know if this is normal and meets expectations; there is no similar description in the document.

This looks like the default node controller behavior. Can you run kubectl get sa -n kube-system | grep node and see if the node controller SA is still there?

Yes, it exists in the cluster. Do I need to delete it?

Yes, yurtctl is supposed to delete this SA (https://github.com/alibaba/openyurt/blob/ea19a211e43324f71a318a2236799b18291df4d8/pkg/yurtctl/cmd/convert/convert.go#L215). You can manually delete it for now. We should figure out why this deletion fails.
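
A sketch of the check and the manual cleanup (assuming the SA created by kubeadm for the node lifecycle controller is named node-controller; confirm with the grep output before deleting):

# list SAs related to the node controllers
kubectl -n kube-system get sa | grep node
# if the node lifecycle controller's SA is still present, remove it, e.g.:
kubectl -n kube-system delete sa node-controller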

@charleszheng44
Member

@xujunjie-cover your understanding is correct. If the nodelifecycle controller is disabled, the node status will never change. However, to my knowledge, the yurt-controller-manager is responsible for managing the node lifecycle, so the default nodelifecycle controller should still be disabled. @rambohe-ch would you verify this?

@HirazawaUi I just tried the manual setup process (disabling the nodelifecycle controller with the option --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle), and everything works as expected (the node becomes NotReady after being disconnected from the apiserver). Could you check whether the edge node is marked as autonomous? (The edge node should have the node.beta.alibabacloud.com/autonomy=true annotation.)
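
A sketch of how to check this on the edge node object, and how to set the annotation if it is missing (assuming the node is named node3):

# show the autonomy annotation and the edge-worker label
kubectl describe node node3 | grep -E 'autonomy|is-edge-worker'
# mark the node as autonomous if the annotation is absent
kubectl annotate node node3 node.beta.alibabacloud.com/autonomy=true --overwrite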

@Fei-Guo
Member

Fei-Guo commented Aug 20, 2020

Ah, I realized that you are doing the conversion manually instead of using yurtctl. Then you should use the --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle option to disable the default nodelifecycle controller. Basically, follow the step described in https://github.com/alibaba/openyurt/blob/master/docs/tutorial/manually-setup.md#disable-the-default-nodelifecycle-controller

Can you please do that and repeat the test? We will go from there and see what does not meet expectations.
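
A sketch of that step on the master, assuming the kubeadm layout where kube-controller-manager runs as a static pod:

# edit the static pod manifest; the kubelet restarts kube-controller-manager automatically
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# make sure the command list contains:
#   - --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle
# then confirm the pod has come back up
kubectl -n kube-system get pods | grep kube-controller-manager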

@HirazawaUi
Author

HirazawaUi commented Aug 21, 2020

@xujunjie-cover your understanding is correct. If the nodelifecycle controller is disabled, the node status will never change. However, to my knowledge, the yurt-controller-manager is responsible for managing the node lifecycle, so the default nodelifecycle controller should still be disabled. @rambohe-ch would you verify this?

@HirazawaUi I just tried the manual setup process (disabling the nodelifecycle controller with the option --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle), and everything works as expected (the node becomes NotReady after being disconnected from the apiserver). Could you check whether the edge node is marked as autonomous? (The edge node should have the node.beta.alibabacloud.com/autonomy=true annotation.)

This is the label and annotation information of the edge node; both the alibabacloud.com/is-edge-worker=true label and the node.beta.alibabacloud.com/autonomy: true annotation exist:

Name:               node3
Roles:              <none>
Labels:             alibabacloud.com/is-edge-worker=true
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node3
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"fe:5d:79:c2:90:e5"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.10.13.85
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    node.beta.alibabacloud.com/autonomy: true
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 17 Aug 2020 13:55:19 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  node3
  AcquireTime:     <unset>
  RenewTime:       Fri, 21 Aug 2020 02:29:27 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 18 Aug 2020 08:38:30 +0000   Tue, 18 Aug 2020 08:38:30 +0000   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Fri, 21 Aug 2020 02:28:53 +0000   Fri, 21 Aug 2020 02:24:51 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 21 Aug 2020 02:28:53 +0000   Fri, 21 Aug 2020 02:24:51 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 21 Aug 2020 02:28:53 +0000   Fri, 21 Aug 2020 02:24:51 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 21 Aug 2020 02:28:53 +0000   Fri, 21 Aug 2020 02:24:51 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.10.13.85
  Hostname:    node3
Capacity:
  cpu:                4
  ephemeral-storage:  102684600Ki
  hugepages-2Mi:      0
  memory:             8168092Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  94634127204
  hugepages-2Mi:      0
  memory:             8065692Ki
  pods:               110
System Info:
  Machine ID:                 0500c02d85d74055b90a973b0bd7a4cc
  System UUID:                6A2B2942-28F3-C25F-F60C-2D14B7B068F2
  Boot ID:                    92c16623-c1be-4064-aa23-4f372029c12b
  Kernel Version:             4.15.0-99-generic
  OS Image:                   Ubuntu 18.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.12
  Kubelet Version:            v1.16.6
  Kube-Proxy Version:         v1.16.6
PodCIDR:                      10.20.3.0/24
PodCIDRs:                     10.20.3.0/24
Non-terminated Pods:          (2 in total)
  Namespace                   Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                           ------------  ----------  ---------------  -------------  ---
  kube-system                 kube-flannel-ds-amd64-nj6q8    100m (2%)     100m (2%)   50Mi (0%)        50Mi (0%)      3d12h
  kube-system                 kube-proxy-n9nxr               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d12h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (2%)  100m (2%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type    Reason                   Age                 From            Message
  ----    ------                   ----                ----            -------
  Normal  NodeHasSufficientMemory  12m (x21 over 20h)  kubelet, node3  Node node3 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    12m (x21 over 20h)  kubelet, node3  Node node3 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     12m (x21 over 20h)  kubelet, node3  Node node3 status is now: NodeHasSufficientPID
  Normal  NodeReady                12m (x3 over 20h)   kubelet, node3  Node node3 status is now: NodeReady

@HirazawaUi
Author

HirazawaUi commented Aug 21, 2020

Ah, I realized that you are doing the conversion manually instead of using yurtctl. Then you should use the --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle option to disable the default nodelifecycle controller. Basically, follow the step described in https://github.com/alibaba/openyurt/blob/master/docs/tutorial/manually-setup.md#disable-the-default-nodelifecycle-controller

Can you please do that and repeat the test? We will go from there and see what does not meet expectations.

I have run kubectl delete -f yurt-controller-manager.yaml to delete the controller Deployment and then recreated it, but the result is still the same as above: the edge node status still stays Ready. Should I try another version of Kubernetes, or follow the document's example with a cluster of only one master and one node?

@Fei-Guo
Member

Fei-Guo commented Aug 21, 2020

I have run kubectl delete -f yurt-controller-manager.yaml to delete the controller Deployment and then recreated it, but the result is still the same as above: the edge node status still stays Ready. Should I try another version of Kubernetes, or follow the document's example with a cluster of only one master and one node?

Please note that you should restart the default Kubernetes controller manager (kube-controller-manager in kube-system) with the correct option, not the yurt-controller-manager.
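
A quick sketch to verify which --controllers value the running kube-controller-manager actually has (assuming the kubeadm default labels on the static pod):

kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep -e '--controllers'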

@HirazawaUi
Author

I observed that my sample pod entered the Terminating state about five minutes after the edge node stopped sending heartbeats. I don't know if this is normal and meets expectations; there is no similar description in the document.

This looks like the default node controller behavior. Can you run kubectl get sa -n kube-system | grep node and see if the node controller SA is still there?

Before this, I had already modified the --controllers option of kube-controller-manager, and it restarted automatically.

@charleszheng44
Member

Ah, I realized that you are doing the conversion manually instead of using yurtctl. Then you should use the --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle option to disable the default nodelifecycle controller. Basically, follow the step described in https://github.com/alibaba/openyurt/blob/master/docs/tutorial/manually-setup.md#disable-the-default-nodelifecycle-controller

Can you please do that and repeat the test? We will go from there and see what does not meet expectations.

I have run kubectl delete -f yurt-controller-manager.yaml to delete the controller Deployment and then recreated it, but the result is still the same as above: the edge node status still stays Ready. Should I try another version of Kubernetes, or follow the document's example with a cluster of only one master and one node?

@HirazawaUi Thanks for the detailed log output. I will verify the manual setup process on a multi-node 1.16 Kubernetes cluster and let you know if the problem can be reproduced.

@HirazawaUi
Author

@charleszheng44 @Fei-Guo
I think I know where the problem is; I fell into a thinking trap. I re-checked the yurt-controller-manager log and found that the problem is simply that the default serviceaccount in kube-system has insufficient permissions.
I then created a serviceaccount from the command line and modified the yurt-controller-manager manifest to bind it to yurt-controller-manager. Now everything works as expected.

Showing the log of yurt-controller-manager again:

E0905 07:02:52.347282       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0905 07:02:55.841780       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0905 07:03:00.146093       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0905 07:03:04.198585       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0905 07:03:08.111697       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
E0905 07:03:12.270039       1 leaderelection.go:330] error retrieving resource lock kube-system/yurt-controller-manager: leases.coordination.k8s.io "yurt-controller-manager" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"

Create the serviceaccount:

# kubectl -n kube-system create sa yurt-cm
serviceaccount/yurt-cm created
# kubectl create clusterrolebinding yurt-cm --clusterrole=cluster-admin --serviceaccount=kube-system:yurt-cm
clusterrolebinding.rbac.authorization.k8s.io/yurt-cm created

Modify the deployment of yurt-controller-manager:

      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: yurt-cm
      serviceAccountName: yurt-cm
      terminationGracePeriodSeconds: 30
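
For reference, an equivalent way to point the Deployment at the new ServiceAccount without editing the manifest by hand (a sketch, assuming the Deployment is named yurt-controller-manager in kube-system):

kubectl -n kube-system patch deployment yurt-controller-manager \
  -p '{"spec":{"template":{"spec":{"serviceAccountName":"yurt-cm"}}}}'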

@vincent-pli
Member

@charleszheng44
I hit the issue again today; it seems we did not resolve it: #52

@charleszheng44
Member

@charleszheng44
I hit the issue again today; it seems we did not resolve it: #52

Thanks for reporting the issue, I forgot to fix it 😅. #123 should resolve the problem.
