Intermittent 502 Bad Gateway issue when service is meshed #4870
Comments
@mbelang To confirm what we discussed on Slack: you added CPU resource requests to your nginx ingress controller's proxy, and there were no CPU spikes on the proxy when the connection timeout happened. You then tried increasing the outbound connection timeout on the nginx ingress's proxy. The TCP dump also shows the 502 coming back from the datadog agent, which is an uninjected workload running on the same cluster, with no error logs on the upstream datadog agent.
Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent? Let me know if my understanding is correct.
I'm not sure where I can see the specific metrics for the proxy container. I didn't check that, but from what I've seen so far I get some 502s while the CPU on the ingress pod stays completely flat.
The TCP dump above is from the ingress controller trying to contact the upstream app, not the datadog agent, but I did get a trace for the datadog agent and it is exactly the same.
I did play with that, trying different values from 5s to 15s, with no luck. I also suspected the keep-alive, which I tried lowering to 4s and raising to 90s, without any luck either.
Yes. Is there a way to set the read timeout? https://github.com/linkerd/linkerd2-proxy/blob/13b5fd65da6999f1d3d4d166983af8d54034d6e4/linkerd/app/integration/src/tcp.rs#L165 I didn't manage to see where that function is used or what the default value is. As you can see above, I have 2 problems.
For 1) I'm trying to blame keep-alive or connection timeouts, but no luck so far.
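For reference, a minimal sketch of the kind of keep-alive override being tried here; the env var names come from the injected proxy manifest pasted below, and the values are just the ones mentioned in this thread:

```yaml
# Sketch: proxy container env overrides on the ingress Deployment.
- name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE
  value: 4000ms    # tried values between 4s and 90s with no effect on the 502s
- name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE
  value: 10000ms   # value from the injected manifest below
```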
So it sounds like the errors are only seen on the nginx ingress controller's outbound side? Do you have a minimal set of YAML that we can use for a repro? Thanks.
So far yes, but I do have problems with a meshed app trying to reach the datadog agent. I only have the ingress and one app meshed in the production environment. So far the app has had no 502s on requests to other apps, which is good. The biggest problem right now is the ingress.

apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "14"
meta.helm.sh/release-name: ingress
meta.helm.sh/release-namespace: ingress
generation: 7161
labels:
app: nginx-ingress
app.kubernetes.io/component: controller
app.kubernetes.io/managed-by: Helm
chart: nginx-ingress-1.39.0
heritage: Helm
release: ingress
name: ingress-nginx-ingress-controller
namespace: ingress
resourceVersion: "62830529"
selfLink: /apis/apps/v1/namespaces/ingress/deployments/ingress-nginx-ingress-controller
uid: c041898e-78dd-11ea-ad31-0e9b9c5b4912
spec:
progressDeadlineSeconds: 600
replicas: 3
revisionHistoryLimit: 10
selector:
matchLabels:
app: nginx-ingress
release: ingress
strategy:
rollingUpdate:
maxSurge: 33%
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
annotations:
ad.datadoghq.com/nginx-ingress-controller.check_names: '["nginx_ingress_controller"]'
ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
ad.datadoghq.com/nginx-ingress-controller.instances: '[{"prometheus_url":
"http://%%host%%:10254/metrics"}]'
config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
kubectl.kubernetes.io/restartedAt: "2020-08-04T16:10:44-04:00"
linkerd.io/created-by: linkerd/cli stable-2.8.1
linkerd.io/identity-mode: default
linkerd.io/proxy-version: stable-2.8.1
labels:
app: nginx-ingress
app.kubernetes.io/component: controller
component: controller
linkerd.io/control-plane-ns: linkerd
linkerd.io/proxy-deployment: ingress-nginx-ingress-controller
linkerd.io/workload-ns: ingress
release: ingress
spec:
containers:
- args:
- /nginx-ingress-controller
- --default-backend-service=ingress/ingress-nginx-ingress-default-backend
- --election-id=ingress-controller-leader
- --ingress-class=nginx
- --configmap=ingress/ingress-nginx-ingress-controller
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: DD_AGENT_HOST
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: nginx-ingress-controller
ports:
- containerPort: 80
name: http
protocol: TCP
- containerPort: 443
name: https
protocol: TCP
- containerPort: 10254
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 512Mi
requests:
cpu: 150m
memory: 512Mi
securityContext:
allowPrivilegeEscalation: true
capabilities:
add:
- NET_BIND_SERVICE
drop:
- ALL
runAsUser: 101
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- env:
- name: LINKERD2_PROXY_LOG
value: warn,linkerd=info
- name: LINKERD2_PROXY_DESTINATION_SVC_ADDR
value: linkerd-dst.linkerd.svc.cluster.local:8086
- name: LINKERD2_PROXY_DESTINATION_GET_NETWORKS
value: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
- name: LINKERD2_PROXY_CONTROL_LISTEN_ADDR
value: 0.0.0.0:4190
- name: LINKERD2_PROXY_ADMIN_LISTEN_ADDR
value: 0.0.0.0:4191
- name: LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR
value: 127.0.0.1:4140
- name: LINKERD2_PROXY_INBOUND_LISTEN_ADDR
value: 0.0.0.0:4143
- name: LINKERD2_PROXY_DESTINATION_GET_SUFFIXES
value: svc.cluster.local.
- name: LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES
value: svc.cluster.local.
- name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE
value: 10000ms
- name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE
value: 90000ms
- name: _pod_ns
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: LINKERD2_PROXY_DESTINATION_CONTEXT
value: ns:$(_pod_ns)
- name: LINKERD2_PROXY_IDENTITY_DIR
value: /var/run/linkerd/identity/end-entity
- name: LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS
value: |
-----BEGIN CERTIFICATE-----
REDACTED
-----END CERTIFICATE-----
- name: LINKERD2_PROXY_IDENTITY_TOKEN_FILE
value: /var/run/secrets/kubernetes.io/serviceaccount/token
- name: LINKERD2_PROXY_IDENTITY_SVC_ADDR
value: linkerd-identity.linkerd.svc.cluster.local:8080
- name: _pod_sa
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
- name: _l5d_ns
value: linkerd
- name: _l5d_trustdomain
value: cluster.local
- name: LINKERD2_PROXY_IDENTITY_LOCAL_NAME
value: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
- name: LINKERD2_PROXY_IDENTITY_SVC_NAME
value: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
- name: LINKERD2_PROXY_DESTINATION_SVC_NAME
value: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
- name: LINKERD2_PROXY_TAP_SVC_NAME
value: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
image: gcr.io/linkerd-io/proxy:stable-2.8.1
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/bash
- -c
- sleep 40
livenessProbe:
httpGet:
path: /live
port: 4191
initialDelaySeconds: 10
name: linkerd-proxy
ports:
- containerPort: 4143
name: linkerd-proxy
- containerPort: 4191
name: linkerd-admin
readinessProbe:
httpGet:
path: /ready
port: 4191
initialDelaySeconds: 2
resources:
limits:
memory: 250Mi
requests:
cpu: 100m
memory: 20Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_BIND_SERVICE
drop:
- ALL
readOnlyRootFilesystem: true
runAsUser: 2102
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/run/linkerd/identity/end-entity
name: linkerd-identity-end-entity
dnsPolicy: ClusterFirst
initContainers:
- args:
- --incoming-proxy-port
- "4143"
- --outgoing-proxy-port
- "4140"
- --proxy-uid
- "2102"
- --inbound-ports-to-ignore
- 4190,4191
image: gcr.io/linkerd-io/proxy-init:v1.3.3
imagePullPolicy: IfNotPresent
name: linkerd-init
resources:
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 10m
memory: 10Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_ADMIN
- NET_RAW
- NET_BIND_SERVICE
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsNonRoot: false
runAsUser: 0
terminationMessagePolicy: FallbackToLogsOnError
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: ingress-nginx-ingress
serviceAccountName: ingress-nginx-ingress
terminationGracePeriodSeconds: 60
volumes:
- emptyDir:
medium: Memory
name: linkerd-identity-end-entity
status:
availableReplicas: 3
conditions:
- message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- message: ReplicaSet "ingress-nginx-ingress-controller-59fd9b7b85" has successfully
progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
observedGeneration: 7161
readyReplicas: 3
replicas: 3
updatedReplicas: 3
---
To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?
I just saw this: hyperium/hyper#2136. I imagine the linkerd proxy is using that lib, right? According to them it is a keep-alive problem, and setting the client keep-alive lower than the upstream's would fix it. My upstream keep-alive timeout is 5s, so I set it to 2s for the proxy... no luck so far. I'm going to try putting the proxy's outbound keep-alive timeout at 0ms so that a new connection is used every time.
I mitigate all 502s on GETs with the nginx retry mechanism. I could also do it for non-idempotent requests, but that is a bit dangerous at the moment.
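For anyone following along, that mitigation corresponds to the ingress-nginx retry annotations that also appear on the test Ingress further down, minus the non_idempotent flag; a sketch:

```yaml
# Sketch: retry idempotent requests on the next upstream after an error/timeout/502.
nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502
nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: "30s"
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
```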
@ihcsim here is an extract of an Ingress resource for a test application:

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
acme.cert-manager.io/http01-edit-in-place: "true"
acme.cert-manager.io/http01-ingress-class: "true"
cert-manager.io/cluster-issuer: letsencrypt
certmanager.k8s.io/acme-challenge-type: dns01
certmanager.k8s.io/acme-dns01-provider: route53
certmanager.k8s.io/cluster-issuer: letsencrypt
external-dns.alpha.kubernetes.io/target: REDACTED.
kubernetes.io/ingress.class: nginx
kubernetes.io/tls-acme: "true"
meta.helm.sh/release-name: hello-k8s
meta.helm.sh/release-namespace: hello-k8s
nginx.ingress.kubernetes.io/configuration-snippet: |
proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 non_idempotent
nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
creationTimestamp: "2020-04-08T13:45:30Z"
generation: 1
labels:
app: hello-k8s
app.kubernetes.io/instance: hello-k8s
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: app
branch-slug: master
cci-build-number: "3877"
cci-workflow-id: 1a3b6a13-6400-41e6-a9a0-3eafa8d420d8
component: app
helm.sh/chart: app-2.6.0
place: ca
pr-number: ""
sha: 89e64f20862883f79fe25347958459076e281f4d
short-sha: 89e64f2
stage: prod
tag: v0.19.2
version: v0.19.2
name: app
namespace: hello-k8s
resourceVersion: "63104678"
selfLink: /apis/extensions/v1beta1/namespaces/hello-k8s/ingresses/app
uid: 3639f2c6-799f-11ea-ad31-0e9b9c5b4912
spec:
rules:
- host: hello-k8s.example.com
http:
paths:
- backend:
serviceName: app
servicePort: http
tls:
- hosts:
- hello-k8s.example.com
secretName: hello-k8s.example.com-tls
status:
loadBalancer:
ingress:
- ip: x.x.x.x
- ip: x.x.x.x
- ip: x.x.x.x
kind: List
metadata:
resourceVersion: ""
selfLink: "" |
I discovered that 0ms is not supported. I also tried raising the keep-alive timeout to 90s (higher than nginx's outbound 60s), without any luck either.
@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:
Then, we should try putting consistent load on the ingress. Ideally, we'd test this all on EKS with the latest AWS CNI, as it seems plausible that it's a bad interaction at the network layer. If we can reproduce it with this kind of setup, I think it should be pretty straightforward to diagnose and fix. If we can't, we can start digging into how this repro setup differs from @mbelang's actual system.
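A rough skeleton of that repro, purely illustrative (names and image are placeholders): a meshed nginx ingress controller, configured as in the Deployment pasted earlier, routing to an unmeshed plain HTTP/1.1 backend, with steady load pointed at the ingress.

```yaml
# Hypothetical repro backend: plain HTTP/1.1 server, deliberately NOT meshed.
apiVersion: v1
kind: Namespace
metadata:
  name: repro-backend             # no linkerd.io/inject annotation -> unmeshed
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: repro-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: nginxdemos/hello   # any plain HTTP/1.1 server works here
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: repro-backend
spec:
  selector:
    app: backend
  ports:
  - port: 80
    targetPort: 80
```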
@mbelang reports that this problem goes away when all pods are meshed, which points strongly to the HTTP/1.1 client.
We're seeing this break DNS in our cluster at the moment wherever anything tries to use TCP, including the linkerd-proxy instances. Here's an example of a curl from a non-meshed pod showing the port opens fine (normal no-talk hangup after a while):
Meshed pod gets bad gateways:
Meshing DNS isn't an option for us. The environment is edge-20.9.1 running Kubernetes 1.17 on AWS EKS. Direct port-forwards to the DNS pods all work, and talking directly to services works, but interestingly this seems to be preventing the proxy itself from establishing identities for things:
@olix0r I still notice some
@steve-gray We think what you are seeing is closer to #4831. There is some ongoing investigation per #4831 (comment). Feel free to subscribe to that issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Even with all my services meshed, I still get intermittent 502s. I can mitigate with retries at the client level, but that is not suitable. I could also set up retries in the mesh itself, but Linkerd doesn't support retries for requests with a body...
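For context, mesh-level retries in Linkerd are configured on a ServiceProfile; a minimal sketch, using the test app's service from earlier in the thread as an example name. The limitation mentioned above is that only requests without a body can be retried this way.

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: app.hello-k8s.svc.cluster.local   # example; must match the target Service's FQDN
  namespace: hello-k8s
spec:
  routes:
  - name: GET all
    condition:
      method: GET
      pathRegex: /.*
    isRetryable: true        # retried requests draw from the retry budget below
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```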
@cpretzer I have yet to update to 2.9.1. Were there any fixes or improvements to the proxy related to this issue?
I'm not confident that retries would help in this case. It really depends on where we're encountering the issue, but I don't think we have enough data to know yet. There were substantial changes between 2.8 and 2.9, especially around caching and discovery (for instance, there's no longer any DNS resolution in the data path). It would be good to test this more recent version, if only to ensure that the problem doesn't persist; even if we are able to identify the underlying cause, we're unlikely to backport fixes onto 2.8. If the issue persists, it would be helpful to at least get debug logs from both the client and server proxies.
@olix0r I know that retries would not solve the issue, but they would at least mitigate it. I will plan an upgrade to 2.9.1 and see how it goes from there. I did try putting the proxy in debug mode, but I didn't manage to get more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I don't want to debug it in the production cluster, though I suspect the elasticity of the production cluster could have an impact on the issue.
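For reference, per-workload debug logging is typically enabled with the proxy log-level annotation on the pod template; a minimal sketch (the filter string follows the same syntax as the LINKERD2_PROXY_LOG value in the Deployment above):

```yaml
# Sketch: raise the proxy log level for one workload only.
template:
  metadata:
    annotations:
      config.linkerd.io/proxy-log-level: warn,linkerd=debug
```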
I came across this thread because I was also running into an issue using linkerd and the Datadog agent together. In my setup, Datadog is installed as a DaemonSet using this Helm chart, so it is not meshed. I get errors similar to those described above. These logs are coming from the Go Datadog client:
If I disable linkerd, then I no longer have any communication issues with the Datadog agent.
I ended up meshing the datadog pods as well, and that resulted in no more 502s from apps to the datadog agent. I do still see some 502s from the datadog agent to the linkerd proxy's metrics collection API, and I suspect this is why some metrics in datadog miss some requests. I haven't had the chance to upgrade to 2.9.1 yet, but I will soon. @olix0r any reason you tagged the issue for the 2.10 release?
I want to make sure that we take a deeper look at issues like this before we cut another stable release.
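For anyone else hitting this, "meshing the datadog pods" as described a couple of comments up essentially amounts to adding the inject annotation to the agent DaemonSet's pod template; a minimal sketch (how you set this depends on the Helm chart in use):

```yaml
# Sketch: opt the datadog agent DaemonSet into the mesh.
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
```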
In my case, 100% of my requests to the datadog tracer fail if I have linkerd injected into one pod and try to send spans to datadog.
@shamsalmon I didn't face that problem at all. It could be a bad configuration 🤷♂️
If anyone on this thread would like to try the most recent edge
If you run into this issue again, please feel free to open a new one with more recent logs and a description. The Datadog issue was fixed in #5904, and the more recent edge and stable releases include more helpful logs. Thanks!
Bug Report
What is the issue?
Without the linkerd proxy, there is no trace of 502 Bad Gateway at the ingress level or the app level.
With the linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway errors appear in the system.
I see 2 types of errors from the proxy:
and
both lead to (on ingress)
or (on an app)
I have ruled out scale-up and scale-down events of the ingress and the app. I have also ruled out graceful termination, for which I gave the proxy a sleep period of 30s on the app and 40s on the ingress.
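For context, the graceful-termination settings referred to here are visible on the ingress Deployment pasted earlier in the thread; the alpha annotation appears to correspond to the proxy container's sleep 40 preStop hook in that manifest:

```yaml
# Abridged pod template fragment from the ingress Deployment above.
template:
  metadata:
    annotations:
      config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
  spec:
    terminationGracePeriodSeconds: 60
```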
How can it be reproduced?
Logs, error output, etc
Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:
I've got similar traces with tcpdump via the ksniff kubectl plugin.
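For anyone wanting to capture the same kind of data, the Linkerd debug sidecar (which ships packet-capture tooling such as tshark/tcpdump) can be enabled per workload with an annotation; a minimal sketch:

```yaml
# Sketch: enable the Linkerd debug sidecar on a workload's pod template.
template:
  metadata:
    annotations:
      config.linkerd.io/enable-debug-sidecar: "true"
```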
linkerd check output
Environment
nginx ingress service annotations
nginx ingress deployment annotation
Possible solution
Tuning all timeouts per app/service with proxy annotations, and also globally via the global configuration in the chart deployment.
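On the nginx side, per-Ingress timeout tuning would use the standard ingress-nginx annotations; a sketch with arbitrary example values (not recommendations):

```yaml
# Sketch: per-Ingress upstream timeouts (seconds).
nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
```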
Additional context