Describe the bug
While installing Pixie, Vizier can fail its healthcheck even when the pl-nats pod is running and the NATS monitor endpoint is healthy.
The failure appears to happen when Pixie checks the NATS monitor endpoint using the generated pod DNS name:
<pod-ip-with-dashes>.<namespace>.pod.cluster.local:8222
In our cluster, this DNS name was not resolvable from the component performing the healthcheck. As a result, Pixie marked the pl-nats-0 pod as unhealthy, the Vizier operator deleted the pod, and the installation did not complete successfully.
We temporarily worked around this by adding hostAliases so the generated pod DNS name resolves to the NATS pod IP. However, this is fragile because pod IPs can change and the workaround has to be maintained outside Pixie.
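For reference, the interim workaround looked roughly like the snippet below (the pod IP and hostname are illustrative; the real pod IP has to be kept in sync by hand, which is why this is fragile):

```yaml
# Illustrative only: pins the generated pod DNS name to a hard-coded pod IP.
# Breaks silently whenever the NATS pod is rescheduled and gets a new IP.
spec:
  hostAliases:
    - ip: "10.8.0.17"   # current pl-nats-0 pod IP (changes on reschedule)
      hostnames:
        - "10-8-0-17.pl.pod.cluster.local"
```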
Observed Behavior
- Pixie install waits for the Vizier healthcheck.
- The pl-nats pod is running.
- The NATS monitor endpoint on port 8222 is reachable by pod IP.
- The healthcheck fails because the generated pod DNS name cannot be resolved.
- Adding hostAliases for the generated pod DNS name allows the healthcheck to pass.
Example failing endpoint format:
http://<pod-ip-with-dashes>.<namespace>.pod.cluster.local:8222
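As a minimal sketch of how that endpoint string appears to be derived (format reconstructed from the failing URL above; `pod_dns_endpoint` is a hypothetical helper, not Pixie code):

```shell
# Hypothetical helper: derive the generated pod DNS endpoint from a pod IP.
# The dots in the IP become dashes, per the <pod-ip-with-dashes> form above.
pod_dns_endpoint() {
  local pod_ip="$1" namespace="$2"
  printf 'http://%s.%s.pod.cluster.local:8222\n' "${pod_ip//./-}" "$namespace"
}

pod_dns_endpoint 10.8.0.17 pl
# -> http://10-8-0-17.pl.pod.cluster.local:8222
```

This name only resolves if the cluster's DNS provider serves pod DNS records, which kube-dns (unlike CoreDNS's `pods insecure` mode) may not do.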
Expected behavior
Pixie’s NATS healthcheck should succeed when the NATS monitor endpoint is reachable, even if pod DNS names such as *.pod.cluster.local are not resolvable in the cluster.
For the NATS monitor endpoint specifically, Pixie should avoid relying on pod DNS resolution where TLS hostname validation is not required.
App information (please complete the following information):
- Pixie version: release/cloud/v0.1.9
- K8s cluster version: v1.33.9-gke.1060000
- Node Kernel version: 6.6.122+
Additional context
The GKE cluster is running kube-dns (schema "1.0.0") with NodeLocal DNSCache enabled.
Recommendation
For StatefulSets behind a headless Service, could Pixie build the address from the pod's hostname and subdomain fields in the pod status instead?
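To illustrate the suggestion (the `pl-nats` service name is an assumption, not confirmed from Pixie's manifests): a StatefulSet pod's hostname and subdomain yield a stable per-pod name under the headless Service, which does not depend on pod DNS records and survives pod rescheduling:

```shell
# Hypothetical sketch: stable per-pod endpoint of the form
# <hostname>.<subdomain>.<namespace>.svc.cluster.local, built from the
# pod's hostname/subdomain rather than its (changeable) IP.
stable_nats_endpoint() {
  local hostname="$1" subdomain="$2" namespace="$3"
  printf 'http://%s.%s.%s.svc.cluster.local:8222\n' "$hostname" "$subdomain" "$namespace"
}

stable_nats_endpoint pl-nats-0 pl-nats pl
# -> http://pl-nats-0.pl-nats.pl.svc.cluster.local:8222
```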
Existing references:
#1544
#1581