Instance + Broker Pod disappears #372
Closed · ammmze opened this issue Sep 2, 2021 · 9 comments · Fixed by #373
Labels: bug (Something isn't working)

@ammmze (Contributor) commented Sep 2, 2021

Describe the bug
I'm having an issue where my Instance and its associated Broker pod disappear and won't come back until I either restart my agent and udev discovery pods, or simply modify my Configuration resource, which triggers them to come back.

Output of kubectl get pods,akrii,akric -o wide

NAME                                                 READY   STATUS    RESTARTS   AGE     IP             NODE                 NOMINATED NODE   READINESS GATES
pod/aeotec-gen-5-79e1a1-pod                          1/1     Running   0          9m21s   10.244.0.8     talos-192-168-32-2   <none>           <none>
pod/akri-agent-daemonset-cv95v                       1/1     Running   0          23d     10.244.1.5     talos-192-168-32-3   <none>           <none>
pod/akri-agent-daemonset-kpqzq                       1/1     Running   0          13h     10.244.0.224   talos-192-168-32-2   <none>           <none>
pod/akri-agent-daemonset-pwdj6                       1/1     Running   0          25d     10.244.2.8     talos-192-168-32-4   <none>           <none>
pod/akri-controller-deployment-5496f7c977-p75v6      1/1     Running   0          17h     10.244.2.85    talos-192-168-32-4   <none>           <none>
pod/akri-udev-discovery-daemonset-6kw69              1/1     Running   0          23d     10.244.1.3     talos-192-168-32-3   <none>           <none>
pod/akri-udev-discovery-daemonset-gkhzv              1/1     Running   0          25d     10.244.2.7     talos-192-168-32-4   <none>           <none>
pod/akri-udev-discovery-daemonset-np5dt              1/1     Running   0          13h     10.244.0.223   talos-192-168-32-2   <none>           <none>
pod/chrony-5jqss                                     1/1     Running   0          17d     10.244.0.253   talos-192-168-32-2   <none>           <none>
pod/chrony-5vhn5                                     1/1     Running   0          17d     10.244.2.2     talos-192-168-32-4   <none>           <none>
pod/chrony-94xlv                                     1/1     Running   0          17d     10.244.1.6     talos-192-168-32-3   <none>           <none>
pod/intel-gpu-plugin-2wkdw                           1/1     Running   0          46h     10.244.2.3     talos-192-168-32-4   <none>           <none>
pod/intel-gpu-plugin-rrj4w                           1/1     Running   4          75d     10.244.0.162   talos-192-168-32-2   <none>           <none>
pod/intel-gpu-plugin-s8j6k                           1/1     Running   0          75d     10.244.1.4     talos-192-168-32-3   <none>           <none>
pod/node-feature-discovery-master-7f48f68ccf-v7jb2   1/1     Running   0          17h     10.244.2.75    talos-192-168-32-4   <none>           <none>
pod/node-feature-discovery-worker-kssph              1/1     Running   13         75d     10.244.1.2     talos-192-168-32-3   <none>           <none>
pod/node-feature-discovery-worker-n7vmv              1/1     Running   8          75d     10.244.2.4     talos-192-168-32-4   <none>           <none>
pod/node-feature-discovery-worker-zs5cn              1/1     Running   75         75d     10.244.0.164   talos-192-168-32-2   <none>           <none>

NAME                                   CONFIG         SHARED   NODES                    AGE
instance.akri.sh/aeotec-gen-5-79e1a1   aeotec-gen-5   false    ["talos-192-168-32-2"]   9m24s

NAME                                 CAPACITY   AGE
configuration.akri.sh/aeotec-gen-5   1          43d
configuration.akri.sh/conbee-ii      1          43d
configuration.akri.sh/ti-cc2531      1          43d

Kubernetes Version: 1.21.3

Logs
I believe these are the applicable logs from the controller; I just had my conbee-ii instance disappear on me. The discussion we had in Slack points to potential race conditions when the Configuration resources are updated, typically as a result of Flux updating metadata on the resources to manage its state, so multiple resources are likely being updated at nearly the same time.

[2021-09-02T18:00:41Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance - enter
[2021-09-02T18:00:41Z INFO  controller::util::instance_action] handle_instance - deleted Akri Instance aeotec-gen-5-79e1a1: Instance { configuration_name: "aeotec-gen-5", broker_properties: {"UDEV_DEVNODE": "/dev/ttyACM1"}, shared: false, nodes: ["talos-192-168-32-2"], device_usage: {"aeotec-gen-5-79e1a1-0": "talos-192-168-32-2"} }
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - enter Remove
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked from instance={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: NoAction }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - find all pods that have akri.sh/instance=aeotec-gen-5-79e1a1
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=aeotec-gen-5-79e1a1") field_selector=None
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - found 1 pods
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - update actions based on the existing pods
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] select_pod_action phase="Running" action=Remove unknown_node=false
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pod_action phase="Running" action=Remove unknown_node=false trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pods_on_known_nodes phase="Running" action=Remove trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods action=Remove trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods - Running Pod (tracked) ... PodAction::Remove ("aeotec-gen-5-79e1a1-pod")
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked after querying existing pods={"talos-192-168-32-2": PodContext { node_name: Some("talos-192-168-32-2"), namespace: Some("host-system"), action: Remove }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::create_pod_app_name("aeotec-gen-5-79e1a1", "talos-192-168-32-2", false, "pod")
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::remove_pod name="aeotec-gen-5-79e1a1-pod", namespace="host-system"
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] remove_pod enter
[2021-09-02T18:00:41Z INFO  akri_shared::k8s::pod] remove_pod pods.delete(...).await?:
[2021-09-02T18:00:41Z INFO  akri_shared::k8s::pod] remove_pod pods.delete return: "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::remove_pod succeeded
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - exit
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance - enter
[2021-09-02T18:00:41Z INFO  controller::util::instance_action] handle_instance - deleted Akri Instance conbee-ii-55a336: Instance { configuration_name: "conbee-ii", broker_properties: {"UDEV_DEVNODE": "/dev/ttyACM2"}, shared: false, nodes: ["talos-192-168-32-2"], device_usage: {"conbee-ii-55a336-0": "talos-192-168-32-2"} }
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - enter Remove
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked from instance={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: NoAction }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - find all pods that have akri.sh/instance=conbee-ii-55a336
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=conbee-ii-55a336") field_selector=None
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - found 1 pods
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - update actions based on the existing pods
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] select_pod_action phase="Running" action=Remove unknown_node=false
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pod_action phase="Running" action=Remove unknown_node=false trace_node_name="conbee-ii-55a336-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pods_on_known_nodes phase="Running" action=Remove trace_node_name="conbee-ii-55a336-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods action=Remove trace_node_name="conbee-ii-55a336-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods - Running Pod (tracked) ... PodAction::Remove ("conbee-ii-55a336-pod")
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked after querying existing pods={"talos-192-168-32-2": PodContext { node_name: Some("talos-192-168-32-2"), namespace: Some("host-system"), action: Remove }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::create_pod_app_name("conbee-ii-55a336", "talos-192-168-32-2", false, "pod")
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::remove_pod name="conbee-ii-55a336-pod", namespace="host-system"
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] remove_pod enter
[2021-09-02T18:00:41Z INFO  akri_shared::k8s::pod] remove_pod pods.delete(...).await?:
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - pod name "conbee-ii-55a336-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:41Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:41Z INFO  akri_shared::k8s::pod] remove_pod pods.delete return: "conbee-ii-55a336-pod"
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_deletion_work - pod::remove_pod succeeded
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - exit
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance - enter
[2021-09-02T18:00:41Z INFO  controller::util::instance_action] handle_instance - added Akri Instance aeotec-gen-5-79e1a1: Instance { configuration_name: "aeotec-gen-5", broker_properties: {"UDEV_DEVNODE": "/dev/ttyACM1"}, shared: false, nodes: ["talos-192-168-32-2"], device_usage: {"aeotec-gen-5-79e1a1-0": ""} }
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - enter Add
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked from instance={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: Add }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - find all pods that have akri.sh/instance=aeotec-gen-5-79e1a1
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=aeotec-gen-5-79e1a1") field_selector=None
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:41Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - found 1 pods
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - update actions based on the existing pods
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] select_pod_action phase="Running" action=Add unknown_node=false
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pod_action phase="Running" action=Add unknown_node=false trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_pods_on_known_nodes phase="Running" action=Add trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods action=Add trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:41Z TRACE controller::util::pod_action] choice_for_running_pods - Running Pod (tracked) ... PodAction::NoAction ("aeotec-gen-5-79e1a1-pod")
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked after querying existing pods={"talos-192-168-32-2": PodContext { node_name: Some("talos-192-168-32-2"), namespace: Some("host-system"), action: NoAction }}
[2021-09-02T18:00:41Z TRACE controller::util::instance_action] handle_instance_change - exit
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - pod name "conbee-ii-55a336-pod"
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:42Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - pod name "conbee-ii-55a336-pod"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Deleted event]
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - Deleted: "conbee-ii-55a336-pod"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - call handle_non_running_pod
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_non_running_pod - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] get_instance_and_configuration_from_pod - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - find_pods_with_label(akri.sh/instance=conbee-ii-55a336)
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=conbee-ii-55a336") field_selector=None
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - found 0 pods
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - num_non_terminating_pods: 0
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service app_name="conbee-ii-55a336-svc", namespace="host-system"
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service enter
[2021-09-02T18:00:43Z INFO  akri_shared::k8s::service] remove_service svcs.create(...).await?:
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service - service already deleted
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service succeeded
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - find_pods_with_label(akri.sh/configuration=conbee-ii)
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/configuration=conbee-ii") field_selector=None
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - found 0 pods
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - num_non_terminating_pods: 0
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service app_name="conbee-ii-svc", namespace="host-system"
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service enter
[2021-09-02T18:00:43Z INFO  akri_shared::k8s::service] remove_service svcs.create(...).await?:
[2021-09-02T18:00:43Z INFO  akri_shared::k8s::service] remove_service svcs.delete return: "Success"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service succeeded
[2021-09-02T18:00:43Z TRACE akri_shared::akri::instance] find_instance enter
[2021-09-02T18:00:43Z TRACE akri_shared::akri::instance] find_instance kube_client.request::<KubeAkriInstance>(akri_instance_type.get(...)?).await?
[2021-09-02T18:00:43Z TRACE akri_shared::akri::instance] find_instance kube_client.request returned kube error: ErrorResponse { status: "Failure", message: "instances.akri.sh \"conbee-ii-55a336\" not found", reason: "NotFound", code: 404 }
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Deleted event]
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_pod - Deleted: "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - last_known_state: Running
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_deleted_pod_if_needed - call handle_non_running_pod
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] handle_non_running_pod - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] get_instance_and_configuration_from_pod - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - find_pods_with_label(akri.sh/instance=aeotec-gen-5-79e1a1)
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=aeotec-gen-5-79e1a1") field_selector=None
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - found 0 pods
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - num_non_terminating_pods: 0
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service app_name="aeotec-gen-5-79e1a1-svc", namespace="host-system"
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service enter
[2021-09-02T18:00:43Z INFO  akri_shared::k8s::service] remove_service svcs.create(...).await?:
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service - service already deleted
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service succeeded
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - enter
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - find_pods_with_label(akri.sh/configuration=aeotec-gen-5)
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/configuration=aeotec-gen-5") field_selector=None
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] find_pods_and_cleanup_svc_if_unsupported - found 0 pods
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - num_non_terminating_pods: 0
[2021-09-02T18:00:43Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service app_name="aeotec-gen-5-svc", namespace="host-system"
[2021-09-02T18:00:43Z TRACE akri_shared::k8s::service] remove_service enter
[2021-09-02T18:00:43Z INFO  akri_shared::k8s::service] remove_service svcs.create(...).await?:
[2021-09-02T18:00:44Z INFO  akri_shared::k8s::service] remove_service svcs.delete return: "Success"
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] cleanup_svc_if_unsupported - service::remove_service succeeded
[2021-09-02T18:00:44Z TRACE akri_shared::akri::instance] find_instance enter
[2021-09-02T18:00:44Z TRACE akri_shared::akri::instance] find_instance kube_client.request::<KubeAkriInstance>(akri_instance_type.get(...)?).await?
[2021-09-02T18:00:44Z TRACE akri_shared::akri::instance] find_instance return
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - enter Update
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked from instance={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: Add }}
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - find all pods that have akri.sh/instance=aeotec-gen-5-79e1a1
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=aeotec-gen-5-79e1a1") field_selector=None
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - found 0 pods
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - update actions based on the existing pods
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked after querying existing pods={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: Add }}
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - find configuration for "aeotec-gen-5"
[2021-09-02T18:00:44Z TRACE akri_shared::akri::configuration] find_configuration enter
[2021-09-02T18:00:44Z TRACE akri_shared::akri::configuration] find_configuration kube_client.request::<KubeAkriConfig>(akri_config_type.get(...)?).await?
[2021-09-02T18:00:44Z TRACE akri_shared::akri::configuration] find_configuration return
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - found configuration for "aeotec-gen-5"
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_addition_work - Create new Pod for Node="talos-192-168-32-2"
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] create_new_pod_from_spec enter
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] create_new_pod_from_spec return
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_addition_work - New pod spec=Pod { metadata: Some(ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: None, deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: None, labels: Some({"akri.sh/configuration": "aeotec-gen-5", "akri.sh/instance": "aeotec-gen-5-79e1a1", "akri.sh/target-node": "talos-192-168-32-2", "app": "aeotec-gen-5-79e1a1-pod", "controller": "akri.sh"}), managed_fields: None, name: Some("aeotec-gen-5-79e1a1-pod"), namespace: Some("host-system"), owner_references: Some([OwnerReference { api_version: "akri.sh/v0", block_owner_deletion: Some(true), controller: Some(true), kind: "Instance", name: "aeotec-gen-5-79e1a1", uid: "e2ef5ff0-20a1-447a-8cc6-73244dc20a35" }]), resource_version: None, self_link: None, uid: None }), spec: Some(PodSpec { active_deadline_seconds: None, affinity: Some(Affinity { node_affinity: Some(NodeAffinity { preferred_during_scheduling_ignored_during_execution: None, required_during_scheduling_ignored_during_execution: Some(NodeSelector { node_selector_terms: [NodeSelectorTerm { match_expressions: None, match_fields: Some([NodeSelectorRequirement { key: "metadata.name", operator: "In", values: Some(["talos-192-168-32-2"]) }]) }] }) }), pod_affinity: None, pod_anti_affinity: None }), automount_service_account_token: None, containers: [Container { args: None, command: None, env: Some([EnvVar { name: "DEVICE_OPTIONS", value: Some("9600"), value_from: None }]), env_from: None, image: Some("ammmze/akri-ser2net-broker:1.0.4"), image_pull_policy: Some("IfNotPresent"), lifecycle: None, liveness_probe: None, name: "aeotec-gen-5-broker", ports: Some([ContainerPort { container_port: 2000, host_ip: None, host_port: None, name: Some("device"), protocol: None }]), readiness_probe: None, resources: Some(ResourceRequirements { limits: Some({"akri.sh/aeotec-gen-5-79e1a1": Quantity("1"), "cpu": Quantity("100m"), "memory": Quantity("200Mi")}), requests: Some({"akri.sh/aeotec-gen-5-79e1a1": Quantity("1"), "cpu": Quantity("10m"), "memory": Quantity("50Mi")}) }), security_context: Some(SecurityContext { allow_privilege_escalation: None, capabilities: None, privileged: Some(true), proc_mount: None, read_only_root_filesystem: None, run_as_group: None, run_as_non_root: None, run_as_user: None, se_linux_options: None, windows_options: None }), startup_probe: None, stdin: None, stdin_once: None, termination_message_path: None, termination_message_policy: None, tty: None, volume_devices: None, volume_mounts: None, working_dir: None }], dns_config: None, dns_policy: None, enable_service_links: None, ephemeral_containers: None, host_aliases: None, host_ipc: None, host_network: None, host_pid: None, hostname: None, image_pull_secrets: None, init_containers: None, node_name: None, node_selector: None, overhead: None, preemption_policy: None, priority: None, priority_class_name: None, readiness_gates: None, restart_policy: None, runtime_class_name: None, scheduler_name: None, security_context: None, service_account: None, service_account_name: None, share_process_namespace: None, subdomain: None, termination_grace_period_seconds: None, tolerations: None, topology_spread_constraints: None, volumes: None }), status: None }
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] create_pod enter
[2021-09-02T18:00:44Z INFO  akri_shared::k8s::pod] create_pod pods.create(...).await?:
[2021-09-02T18:00:44Z INFO  akri_shared::k8s::pod] create_pod pods.create return: "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_addition_work - pod::create_pod succeeded
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_addition_work - POST nodeInfo.SetNode 

[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - exit
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Added event]
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Pending"
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Pending"
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance - enter
[2021-09-02T18:00:44Z INFO  controller::util::instance_action] handle_instance - modified Akri Instance aeotec-gen-5-79e1a1: Instance { configuration_name: "aeotec-gen-5", broker_properties: {"UDEV_DEVNODE": "/dev/ttyACM1"}, shared: false, nodes: ["talos-192-168-32-2"], device_usage: {"aeotec-gen-5-79e1a1-0": "talos-192-168-32-2"} }
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - enter Update
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked from instance={"talos-192-168-32-2": PodContext { node_name: None, namespace: None, action: Add }}
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - find all pods that have akri.sh/instance=aeotec-gen-5-79e1a1
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector with label_selector=Some("akri.sh/instance=aeotec-gen-5-79e1a1") field_selector=None
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector PRE pods.list(...).await?
[2021-09-02T18:00:44Z TRACE akri_shared::k8s::pod] find_pods_with_selector return
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - found 1 pods
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - update actions based on the existing pods
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] select_pod_action phase="Pending" action=Update unknown_node=false
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] choice_for_pod_action phase="Pending" action=Update unknown_node=false trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] choice_for_pods_on_known_nodes phase="Pending" action=Update trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] choice_for_non_running_pods action=Update trace_node_name="aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] time_choice_for_non_running_pods
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - no start time found ... give it more time? ("aeotec-gen-5-79e1a1-pod")
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - give_it_more_time: (true)
[2021-09-02T18:00:44Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - Pending Pod (tracked) ... PodAction::NoAction ("aeotec-gen-5-79e1a1-pod")
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - nodes tracked after querying existing pods={"talos-192-168-32-2": PodContext { node_name: Some("talos-192-168-32-2"), namespace: Some("host-system"), action: NoAction }}
[2021-09-02T18:00:44Z TRACE controller::util::instance_action] handle_instance_change - exit
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:44Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Pending"
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Modified event]
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_pod - pod name "aeotec-gen-5-79e1a1-pod"
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_pod - pod phase "Running"
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - enter
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - last_known_state: Pending
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_running_pod_if_needed - call handle_running_pod
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] handle_running_pod - enter
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] get_instance_and_configuration_from_pod - enter
[2021-09-02T18:00:45Z TRACE akri_shared::akri::configuration] find_configuration enter
[2021-09-02T18:00:45Z TRACE akri_shared::akri::configuration] find_configuration kube_client.request::<KubeAkriConfig>(akri_config_type.get(...)?).await?
[2021-09-02T18:00:45Z TRACE akri_shared::akri::configuration] find_configuration return
[2021-09-02T18:00:45Z TRACE akri_shared::akri::instance] find_instance enter
[2021-09-02T18:00:45Z TRACE akri_shared::akri::instance] find_instance kube_client.request::<KubeAkriInstance>(akri_instance_type.get(...)?).await?
[2021-09-02T18:00:45Z TRACE akri_shared::akri::instance] find_instance return
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] add_instance_and_configuration_services - instance="aeotec-gen-5-79e1a1"
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] create_or_update_service - instance="aeotec-gen-5-79e1a1" with ownership:OwnershipInfo { object_type: Configuration, object_uid: "2fcf85d6-d59c-4236-8424-5b0f3788c38e", object_name: "aeotec-gen-5" }
[2021-09-02T18:00:45Z TRACE akri_shared::k8s::service] find_services_with_selector with selector="akri.sh/configuration=aeotec-gen-5"
[2021-09-02T18:00:45Z TRACE akri_shared::k8s::service] find_services_with_selector PRE svcs.list(...).await?
[2021-09-02T18:00:45Z TRACE akri_shared::k8s::service] find_services_with_selector return
[2021-09-02T18:00:45Z TRACE controller::util::pod_watcher] create_or_update_service - New instance svc spec=Service { metadata: Some(ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: None, deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: None, labels: Some({"akri.sh/configuration": "aeotec-gen-5", "app": "aeotec-gen-5-svc", "controller": "akri.sh"}), managed_fields: None, name: Some("aeotec-gen-5-svc"), namespace: Some("host-system"), owner_references: Some([OwnerReference { api_version: "akri.sh/v0", block_owner_deletion: Some(true), controller: Some(true), kind: "Configuration", name: "aeotec-gen-5", uid: "2fcf85d6-d59c-4236-8424-5b0f3788c38e" }]), resource_version: None, self_link: None, uid: None }), spec: Some(ServiceSpec { cluster_ip: None, external_ips: None, external_name: None, external_traffic_policy: None, health_check_node_port: None, ip_family: None, load_balancer_ip: None, load_balancer_source_ranges: None, ports: Some([ServicePort { name: Some("device"), node_port: None, port: 2000, protocol: Some("TCP"), target_port: Some(String("device")) }]), publish_not_ready_addresses: None, selector: Some({"akri.sh/configuration": "aeotec-gen-5", "controller": "akri.sh"}), session_affinity: None, session_affinity_config: None, type_: Some("ClusterIP") }), status: None }
[2021-09-02T18:00:45Z TRACE akri_shared::k8s::service] create_service enter
[2021-09-02T18:00:45Z INFO  akri_shared::k8s::service] create_service svcs.create(...).await?:
[2021-09-02T18:00:45Z INFO  akri_shared::k8s::service] create_service services.create return: "aeotec-gen-5-svc"
ammmze added the bug (Something isn't working) label on Sep 2, 2021
@kate-goldenring (Contributor) commented:

I think there are two issues happening here:

  1. The Akri Controller is deleting and re-applying Configurations when the YAML metadata or status is changed, which Flux does, as @ammmze mentioned above. (This could be a good first issue.)
  2. The delete and re-add of the Configuration sometimes does not work properly. There may be race conditions here; maybe the discovery handler connection is not being dropped before reconnecting.

@ammmze (Contributor, Author) commented Sep 2, 2021

👍🏻 I have zero Rust knowledge, but I think for issue 1 (which I think would in large part stabilize my usage of Akri), the area of the code we'd be looking at is here.

So I guess the first thing is that we need to be able to check what changed: if it's just metadata or status, ignore the change; otherwise, do what we do today.

@kate-goldenring (Contributor) commented:

@ammmze, I agree that's where the action needs to happen.

I don't think there is a simple way of checking whether metadata or status changed, since we aren't storing the previous Configuration. We could add that to the ConfigInfo struct here in handle_config_add when the Configuration is applied. ConfigInfo is an entry of ConfigMap, so when we get a Modified event, we could pull the previous Configuration from the ConfigMap to see what changed and act accordingly.

It's not super trivial, but would you be interested in putting in the fix? Trying out Rust?
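
A minimal sketch of that idea, with illustrative names only (ConfigSpec and the ConfigInfo fields here are assumptions, not Akri's actual types): each ConfigMap entry remembers the last-applied spec, and a Modified event is ignored when only metadata or status moved.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the Configuration spec; the real type
// lives in akri_shared::akri::configuration.
#[derive(Clone, PartialEq, Debug)]
struct ConfigSpec {
    capacity: i32,
}

// Sketch of storing the previous Configuration in each ConfigInfo
// entry, as suggested above. Field names are illustrative only.
struct ConfigInfo {
    last_applied_spec: ConfigSpec,
    // ... existing fields (stop-discovery senders, etc.)
}

type ConfigMap = Arc<Mutex<HashMap<String, ConfigInfo>>>;

// On a Modified event, diff the incoming spec against the stored one
// and skip the delete/re-add cycle when only metadata/status moved.
fn handle_config_modify(name: &str, incoming: ConfigSpec, config_map: &ConfigMap) {
    let mut map = config_map.lock().unwrap();
    if let Some(info) = map.get_mut(name) {
        if info.last_applied_spec == incoming {
            // Metadata/status-only change (e.g. Flux bookkeeping): ignore.
            return;
        }
        // The spec really changed: record it, then fall through to
        // today's delete + re-add handling.
        info.last_applied_spec = incoming;
    }
}

fn main() {
    let map: ConfigMap = Arc::new(Mutex::new(HashMap::new()));
    map.lock().unwrap().insert(
        "conbee-ii".to_string(),
        ConfigInfo { last_applied_spec: ConfigSpec { capacity: 1 } },
    );
    // Same spec: treated as a metadata-only update, no reconcile.
    handle_config_modify("conbee-ii", ConfigSpec { capacity: 1 }, &map);
    // Changed spec: this one would trigger the existing delete/re-add path.
    handle_config_modify("conbee-ii", ConfigSpec { capacity: 2 }, &map);
}
```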

@ammmze (Contributor, Author) commented Sep 2, 2021

👍🏻 Yeah, I am up for giving it a go. It probably won't happen super quickly, since it'll happen between work and family life, but it seems reasonably simple for an intro to Rust 😀

@ammmze (Contributor, Author) commented Sep 3, 2021

For the race condition issue, maybe this doesn't do what I'm thinking (again, still very much a noob with Rust ... I come from primarily Java and JavaScript), but it seems odd to me that we clone the config map when we do the delete, and then the add happens on the original config_map. I'm probably missing something or not fully understanding it, but I figured I'd toss it out there since I'm looking at this thing.

@kate-goldenring (Contributor) commented:

That is cloning a reference to the shared map, which is of type Arc<Mutex<HashMap<String, ConfigInfo>>>. Arc is a reference counter that is often used in Rust when sharing data between threads. Its implementation of clone increments the number of references to the map rather than copying the data, so the delete and the add both operate on the same underlying map. You've pointed out a very Rust-y behavior. I anticipate that the race condition is in the discovery operator.
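
A tiny self-contained illustration of that behavior (nothing Akri-specific): cloning the Arc copies the handle, not the map, so mutations through either handle are visible through both.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let map: Arc<Mutex<HashMap<String, u32>>> = Arc::new(Mutex::new(HashMap::new()));

    // clone() only bumps Arc's reference count; both handles point at
    // the same Mutex<HashMap> allocation.
    let handle_for_delete = map.clone();
    handle_for_delete.lock().unwrap().insert("conbee-ii".to_string(), 1);

    // The original handle sees the insert made through the clone.
    assert_eq!(map.lock().unwrap().get("conbee-ii"), Some(&1));
    println!("both handles share one map");
}
```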

@romoh (Contributor) commented Sep 7, 2021

The first part of the fix is in PR #373.

romoh moved this from Triage needed to In progress in Akri Roadmap on Sep 7, 2021
@kate-goldenring (Contributor) commented:
I did some investigation into the instances not being recreated, and like you mention in #373, I think this started after v0.6.5. Specifically, I think it is a result of updating to the latest kube-rs in #361 and using Api::watch instead of kube::runtime::watcher. I was able to repro the unexpected behavior. I put additional notes in PR #378 with the change and created issue #374 to track it. Thanks for bringing this up @ammmze!
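
For context, a minimal sketch of the watcher approach, assuming a kube-rs from that era (~0.59) and watching generic Pods rather than Akri's actual resources: kube::runtime::watcher transparently re-lists and re-establishes the watch when it expires or desyncs, whereas a raw Api::watch stream simply ends and events are silently missed until the caller restarts it.

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::runtime::watcher::{watcher, Event};
use kube::{api::ListParams, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::namespaced(client, "host-system");

    // Unlike a raw Api::watch stream, which simply ends when the watch
    // expires or desyncs, this stream re-lists and re-watches on its
    // own, so Added/Deleted events keep flowing.
    let mut stream = Box::pin(watcher(pods, ListParams::default()));
    while let Some(event) = stream.try_next().await? {
        match event {
            Event::Applied(p) => println!("applied: {:?}", p.metadata.name),
            Event::Deleted(p) => println!("deleted: {:?}", p.metadata.name),
            Event::Restarted(all) => println!("watch restarted with {} pods", all.len()),
        }
    }
    Ok(())
}
```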

@ammmze (Contributor, Author) commented Sep 10, 2021

Awesome! Thank you! Excited to see these changes 🎉 !

Akri Roadmap automation moved this from In progress to Done on Sep 13, 2021