Describe the bug
I am deploying Pixie via the operator Helm chart and facing another issue that appears related to the pl-nats pod(s).
Once the vizier-metadata pod comes up, the pl-nats pod(s) terminate after ~20 seconds (I assume they are passing their liveness/readiness probes, given this timing). Then the vizier-pem pods report the following (the kelvin pod reports something similar):
F20230620 15:23:15.315608 1551459 statusor.h:148] Check failed: _s.ok() Bad Status: Unknown : Failed to connect to NATS, nats_status=6
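A quick way to confirm the same failure across the PEM pods (a sketch; the name=vizier-pem label selector is an assumption about the default Vizier pod labels):
$ kubectl logs -n pl -l name=vizier-pem --tail=100 | grep -i "Failed to connect to NATS"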
I have tried increasing the pl-nats pod count, but the behaviour persists, and there are no useful logs from the pl-nats pods:
$ kubectl logs -n pl pl-nats-0 --follow
[7] 2023/06/19 19:30:01.646266 [INF] Starting nats-server
[7] 2023/06/19 19:30:01.646383 [INF] Version: 2.9.17
[7] 2023/06/19 19:30:01.646391 [INF] Git: [4f2c9a5]
[7] 2023/06/19 19:30:01.646405 [INF] Name: NBCXEHKCGCSRNBM43XQXGD3TGJFQP3NIATOSAGG74CQCK6HILV2E4TVH
[7] 2023/06/19 19:30:01.646443 [INF] ID: NBCXEHKCGCSRNBM43XQXGD3TGJFQP3NIATOSAGG74CQCK6HILV2E4TVH
[7] 2023/06/19 19:30:01.646463 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/06/19 19:30:01.648401 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/06/19 19:30:01.648587 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/06/19 19:30:01.648601 [INF] TLS required for client connections
[7] 2023/06/19 19:30:01.649319 [INF] Server is ready
[7] 2023/06/19 19:36:18.629639 [INF] Initiating Shutdown...
[7] 2023/06/19 19:36:18.632981 [INF] Server Exiting..
I am able to see the following event on the pl-nats pod, but it doesn't make a great deal of sense, nor does it explain the terminations:
15s Warning FailedPreStopHook pod/pl-nats-0 Exec lifecycle hook ([/bin/sh -c /nats-server -sl=ldm=/var/run/nats/nats.pid && /bin/sleep 60]) for Container "pl-nats" in Pod "pl-nats-0_pl(f2658e64-5d8d-4632-9133-cec9b8d5be00)" failed - error: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "2cf016760be7c1f00852ced2432672f156678ab79b902e983fe3857e96525efc": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown, message: ""
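The hook in that event calls /bin/sh, which the nats image apparently does not ship, so the lame-duck shutdown it is meant to trigger (the -sl=ldm signal) never runs. Inspecting how the hook is rendered on the StatefulSet is one way to check this (a sketch; the StatefulSet name pl-nats is inferred from the pod name and is an assumption, while the container name pl-nats is taken from the event above):
$ kubectl get statefulset -n pl pl-nats -o jsonpath='{.spec.template.spec.containers[?(@.name=="pl-nats")].lifecycle}'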
To Reproduce
Steps to reproduce the behavior:
- Configure a values file for the helm chart
- Deploy the helm chart against a K8s cluster
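For completeness, the deploy step is roughly the following (a sketch; the repo URL, release name, and pl namespace are assumptions about a standard operator-chart install):
$ helm repo add pixie-operator https://artifacts.px.dev/helm_charts/operator
$ helm repo update
$ helm install pixie pixie-operator/pixie-operator-chart -n pl --create-namespace -f values.yaml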
Config
values.yaml file:
deployOLM: ""
olmNamespace: "olm"
olmOperatorNamespace: "px-operator"
olmBundleChannel: "stable"
olmCatalogSource:
  annotations: {}
  labels: {}
name: "pixie"
clusterName: "{{ cluster_name }}"
version: ""
deployKey: "{{ pixielabs.deploymentKey }}"
customDeployKeySecret: ""
disableAutoUpdate: false
useEtcdOperator: true
cloudAddr: "{{ pixielabs.webui.url }}"
devCloudNamespace: "{{ pixielabs.webui.namespace }}"
pemMemoryLimit: ""
pemMemoryRequest: ""
dataAccess: "Full"
pod:
  annotations: {}
  labels: {}
  resources: {}
  nodeSelector: {}
patches:
  vizier-metadata: '{"spec": {"template": {"spec": {"containers": [{"name":"app","livenessProbe": {"initialDelaySeconds": 360, "periodSeconds": 30 }}]}}}}'
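A hypothetical variation on the same patch that I have not verified (it assumes the patches mechanism can also target the pl-nats StatefulSet and that a null value drops the failing preStop lifecycle hook from the pl-nats container):
patches:
  pl-nats: '{"spec": {"template": {"spec": {"containers": [{"name":"pl-nats","lifecycle": null}]}}}}'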
Expected behaviour
All pods come up
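For reference, the check for this would be roughly (a sketch; the pl namespace is assumed from the pod names above):
$ kubectl get pods -n pl --watch
$ kubectl get events -n pl --sort-by=.lastTimestamp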
App information (please complete the following information):
- Helm chart: pixie-operator-chart-0.1.4
- K8s cluster version: 1.25
Additional context
I did come across this issue reported via the New Relic forum, as well as this one, which contains the error message we are seeing. Neither of those resolutions nor the suggestion made there resolves the issue.