Describe the bug
I am deploying Pixie via the operator Helm chart and facing another issue that appears related to the pl-nats pod(s).
Once the vizier-metadata pod comes up, the pl-nats pod(s) terminate after ~20 seconds (I assume they are passing their liveness/readiness probes, given this timing). Then the vizier-pem pods report the following (the kelvin pod reports something similar):
F20230620 15:23:15.315608 1551459 statusor.h:148] Check failed: _s.ok() Bad Status: Unknown : Failed to connect to NATS, nats_status=6
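A quick way to confirm the same failure across the PEM pods (a sketch; the name=vizier-pem label selector is an assumption about the default Vizier pod labels):
$ kubectl logs -n pl -l name=vizier-pem --tail=100 | grep -i "Failed to connect to NATS"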
I have tried increasing the pl-nats pod count, but the behaviour persists, and there are no useful logs from the pl-nats pods:
$ kubectl logs -n pl pl-nats-0 --follow
[7] 2023/06/19 19:30:01.646266 [INF] Starting nats-server
[7] 2023/06/19 19:30:01.646383 [INF] Version: 2.9.17
[7] 2023/06/19 19:30:01.646391 [INF] Git: [4f2c9a5]
[7] 2023/06/19 19:30:01.646405 [INF] Name: NBCXEHKCGCSRNBM43XQXGD3TGJFQP3NIATOSAGG74CQCK6HILV2E4TVH
[7] 2023/06/19 19:30:01.646443 [INF] ID: NBCXEHKCGCSRNBM43XQXGD3TGJFQP3NIATOSAGG74CQCK6HILV2E4TVH
[7] 2023/06/19 19:30:01.646463 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/06/19 19:30:01.648401 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/06/19 19:30:01.648587 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/06/19 19:30:01.648601 [INF] TLS required for client connections
[7] 2023/06/19 19:30:01.649319 [INF] Server is ready
[7] 2023/06/19 19:36:18.629639 [INF] Initiating Shutdown...
[7] 2023/06/19 19:36:18.632981 [INF] Server Exiting..
I am able to see the following event on the pl-nats pod, but it doesn't make a great deal of sense, nor does it explain the terminations:
15s Warning FailedPreStopHook pod/pl-nats-0 Exec lifecycle hook ([/bin/sh -c /nats-server -sl=ldm=/var/run/nats/nats.pid && /bin/sleep 60]) for Container "pl-nats" in Pod "pl-nats-0_pl(f2658e64-5d8d-4632-9133-cec9b8d5be00)" failed - error: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "2cf016760be7c1f00852ced2432672f156678ab79b902e983fe3857e96525efc": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown, message: ""
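The hook in that event calls /bin/sh, which the nats image apparently does not ship, so the lame-duck shutdown it is meant to trigger (the -sl=ldm signal) never runs. Inspecting how the hook is rendered on the StatefulSet is one way to check this (a sketch; the StatefulSet name pl-nats is inferred from the pod name and is an assumption, while the container name pl-nats is taken from the event above):
$ kubectl get statefulset -n pl pl-nats -o jsonpath='{.spec.template.spec.containers[?(@.name=="pl-nats")].lifecycle}'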
To Reproduce
Steps to reproduce the behavior:
- Configure a values file for the helm chart
- Deploy the helm chart against a K8s cluster
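For completeness, the deploy step is roughly the following (a sketch; the repo URL, release name, and pl namespace are assumptions about a standard operator-chart install):
$ helm repo add pixie-operator https://artifacts.px.dev/helm_charts/operator
$ helm repo update
$ helm install pixie pixie-operator/pixie-operator-chart -n pl --create-namespace -f values.yaml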
Config
values.yaml file:
deployOLM: ""
olmNamespace: "olm"
olmOperatorNamespace: "px-operator"
olmBundleChannel: "stable"
olmCatalogSource:
  annotations: {}
  labels: {}
name: "pixie"
clusterName: "{{ cluster_name }}"
version: ""
deployKey: "{{ pixielabs.deploymentKey }}"
customDeployKeySecret: ""
disableAutoUpdate: false
useEtcdOperator: true
cloudAddr: "{{ pixielabs.webui.url }}"
devCloudNamespace: "{{ pixielabs.webui.namespace }}"
pemMemoryLimit: ""
pemMemoryRequest: ""
dataAccess: "Full"
pod:
  annotations: {}
  labels: {}
  resources: {}
  nodeSelector: {}
patches:
  vizier-metadata: '{"spec": {"template": {"spec": {"containers": [{"name":"app","livenessProbe": {"initialDelaySeconds": 360, "periodSeconds": 30 }}]}}}}'
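A hypothetical variation on the same patch that I have not verified (it assumes the patches mechanism can also target the pl-nats StatefulSet and that a null value drops the failing preStop lifecycle hook from the pl-nats container):
patches:
  pl-nats: '{"spec": {"template": {"spec": {"containers": [{"name":"pl-nats","lifecycle": null}]}}}}'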
Expected behaviour
All pods come up
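For reference, the check for this would be roughly (a sketch; the pl namespace is assumed from the pod names above):
$ kubectl get pods -n pl --watch
$ kubectl get events -n pl --sort-by=.lastTimestamp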
App information (please complete the following information):
- Helm chart: pixie-operator-chart-0.1.4
- K8s cluster version: 1.25
Additional context
I did come across this issue reported via the New Relic forum, as well as this one, which contains the error message we are seeing. Neither of those resolutions nor the suggestion made there resolves the issue.