Intermittent 502 Bad Gateway issue when service is meshed #4870

Closed
mbelang opened this issue Aug 12, 2020 · 27 comments

@mbelang

mbelang commented Aug 12, 2020

Bug Report

What is the issue?

Without the Linkerd proxy, there is no trace of 502 Bad Gateway errors at either the ingress level or the app level.
With the Linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway errors appear in the system.

I see two types of errors from the proxy:

## for this one the requests actually made it through to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection closed before message completed

and

## for this one the requests never made it to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)

both lead to (on ingress)

"GET /v1/users?ids=xxxxxxx,xxxxx HTTP/1.1" 502 0 "-" "python-requests/2.23.0" 1708 0.097 [xxxxx-app-http] [] 10.4.27.106:80 0 0.100 502 7543fb48ce22d5c6145b97daadde93d9

or (on an app)

## Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent daemonset

Failed to send traces to Datadog Agent at http://10.4.87.60:8126: HTTP error status 502, reason Bad Gateway, message content-length: 0
date: Wed, 12 Aug 2020 17:58:22 GMT

I have eliminated scale-up and scale-down events of the ingress and app as a cause. I have also ruled out graceful termination, for which I gave the proxy a sleep period of 30s on the app and 40s on the ingress.

How can it be reproduced?

Logs, error output, etc

Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:

1829200 2893.254017273    127.0.0.1 → 127.0.0.1    TCP 68 443 → 57590 [ACK] Seq=3277 Ack=1894 Win=175744 Len=0 TSval=378310645 TSecr=378310644
1829201 2893.254172427   10.3.45.50 → 127.0.0.1    HTTP 1701 GET /v1/users/xxxxxxxx?include=active_contract HTTP/1.1 
1829202 2893.254471803   10.3.45.50 → 10.3.23.252  TCP 76 47414 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2090260178 TSecr=0 WS=128
1829203 2893.254837416  10.3.23.252 → 10.3.45.50   TCP 56 80 → 47414 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1829204 2893.254974304  10.3.23.252 → 10.3.45.50   HTTP 152 HTTP/1.1 502 Bad Gateway 

I got similar traces with tcpdump via the ksniff kubectl plugin.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2020-08-13T08:39:27Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

Environment

  • Kubernetes Version: 1.16, 1.17
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2 (eks optimized ami)
  • Linkerd version: 2.8.1
  • nginx ingress version: 0.34.1
  • upstream app: python 3.7 with hypercorn server

nginx ingress service annotations

    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    service.beta.kubernetes.io/aws-load-balancer-type: elb

nginx ingress deployment annotations

   config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
   linkerd.io/inject: enabled

Possible solution

Tune all timeouts per app/service with proxy annotations, and also globally via the chart deployment's global configuration (see the sketch after this list):

  • keep-alive
  • connection
  • read timeout (I didn't see this one in Linkerd, but it would be similar to nginx's)
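
As a rough illustration of the kind of tuning meant above, here is a sketch of overriding the proxy's timeout-related environment variables on the injected linkerd-proxy container. The variable names appear elsewhere in this issue; the values are placeholders, not recommendations.

    # Sketch only: env overrides on the injected linkerd-proxy container.
    # Values are illustrative placeholders, not recommendations.
    - name: LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT    # time allowed to establish outbound connections
      value: 10000ms
    - name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE  # outbound TCP keep-alive
      value: 10000ms
    - name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE    # inbound TCP keep-alive
      value: 10000ms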

Additional context

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

@mbelang To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened. Then you tried increasing the outbound connection timeout on the proxy of the nginx ingress by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per #4759, but to no avail.

Also, the TCP dump shows the 502 coming back from the datadog agent, which is an uninjected workload running on the same cluster. There are no error logs on the upstream datadog agent.

Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent daemonset

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Let me know if my understanding is correct.

@mbelang
Author

mbelang commented Aug 13, 2020

To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened

I'm not sure where I can see the specific metrics for the proxy container. I didn't check that, but from what I've seen so far, I get some 502s while the CPU on the ingress pod stays completely flat.

Also the TCP dump shows the 502 coming back from the datadog agent

The TCP dump above is from the ingress controller trying to contact the upstream app, not the datadog agent, but I did get a trace for the datadog agent and it is exactly the same.

by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per

I did play with that, trying different values from 5s to 15s, with no luck. I also suspected the keep-alive, which I tried lowering to 4s and raising to 90s, without any luck either.

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Yes

Is there a way to set the read timeout (https://github.com/linkerd/linkerd2-proxy/blob/13b5fd65da6999f1d3d4d166983af8d54034d6e4/linkerd/app/integration/src/tcp.rs#L165)? I didn't manage to see where that function is used or what the default value is.

As you can see above, I have two problems:

  1. connection reset by peer (the request never made it to the upstream service)
  2. connection closed before message completed (I managed to find that the request actually made it through to the upstream service, but the connection was cut, I imagine by the proxy)

For 1) I suspect keep-alive or connection timeouts, but no luck so far.
For 2) I suspect the read timeout, but I have no proof...

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

So it sounds like the errors are only seen on the nginx ingress controller's outbound side? Do you have a minimal set of YAML that we can use for a repro? Thanks.

@mbelang
Author

mbelang commented Aug 13, 2020

So far yes, but I do have problems with a meshed app trying to reach the datadog agent. I only have the ingress and one app meshed in the production environment. So far, for the app, there are no 502s for requests to other apps, which is good.

The biggest problem now is the ingress.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "14"
    meta.helm.sh/release-name: ingress
    meta.helm.sh/release-namespace: ingress
  generation: 7161
  labels:
    app: nginx-ingress
    app.kubernetes.io/component: controller
    app.kubernetes.io/managed-by: Helm
    chart: nginx-ingress-1.39.0
    heritage: Helm
    release: ingress
  name: ingress-nginx-ingress-controller
  namespace: ingress
  resourceVersion: "62830529"
  selfLink: /apis/apps/v1/namespaces/ingress/deployments/ingress-nginx-ingress-controller
  uid: c041898e-78dd-11ea-ad31-0e9b9c5b4912
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-ingress
      release: ingress
  strategy:
    rollingUpdate:
      maxSurge: 33%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/nginx-ingress-controller.check_names: '["nginx_ingress_controller"]'
        ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
        ad.datadoghq.com/nginx-ingress-controller.instances: '[{"prometheus_url":
          "http://%%host%%:10254/metrics"}]'
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
        kubectl.kubernetes.io/restartedAt: "2020-08-04T16:10:44-04:00"
        linkerd.io/created-by: linkerd/cli stable-2.8.1
        linkerd.io/identity-mode: default
        linkerd.io/proxy-version: stable-2.8.1
      labels:
        app: nginx-ingress
        app.kubernetes.io/component: controller
        component: controller
        linkerd.io/control-plane-ns: linkerd
        linkerd.io/proxy-deployment: ingress-nginx-ingress-controller
        linkerd.io/workload-ns: ingress
        release: ingress
    spec:
      containers:
      - args:
        - /nginx-ingress-controller
        - --default-backend-service=ingress/ingress-nginx-ingress-default-backend
        - --election-id=ingress-controller-leader
        - --ingress-class=nginx
        - --configmap=ingress/ingress-nginx-ingress-controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: nginx-ingress-controller
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 10254
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 150m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: LINKERD2_PROXY_LOG
          value: warn,linkerd=info
        - name: LINKERD2_PROXY_DESTINATION_SVC_ADDR
          value: linkerd-dst.linkerd.svc.cluster.local:8086
        - name: LINKERD2_PROXY_DESTINATION_GET_NETWORKS
          value: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
        - name: LINKERD2_PROXY_CONTROL_LISTEN_ADDR
          value: 0.0.0.0:4190
        - name: LINKERD2_PROXY_ADMIN_LISTEN_ADDR
          value: 0.0.0.0:4191
        - name: LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR
          value: 127.0.0.1:4140
        - name: LINKERD2_PROXY_INBOUND_LISTEN_ADDR
          value: 0.0.0.0:4143
        - name: LINKERD2_PROXY_DESTINATION_GET_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE
          value: 10000ms
        - name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE
          value: 90000ms
        - name: _pod_ns
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LINKERD2_PROXY_DESTINATION_CONTEXT
          value: ns:$(_pod_ns)
        - name: LINKERD2_PROXY_IDENTITY_DIR
          value: /var/run/linkerd/identity/end-entity
        - name: LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS
          value: |
            -----BEGIN CERTIFICATE-----
            REDACTED
            -----END CERTIFICATE-----
        - name: LINKERD2_PROXY_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/kubernetes.io/serviceaccount/token
        - name: LINKERD2_PROXY_IDENTITY_SVC_ADDR
          value: linkerd-identity.linkerd.svc.cluster.local:8080
        - name: _pod_sa
          valueFrom:
            fieldRef:
              fieldPath: spec.serviceAccountName
        - name: _l5d_ns
          value: linkerd
        - name: _l5d_trustdomain
          value: cluster.local
        - name: LINKERD2_PROXY_IDENTITY_LOCAL_NAME
          value: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_IDENTITY_SVC_NAME
          value: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_DESTINATION_SVC_NAME
          value: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_TAP_SVC_NAME
          value: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        image: gcr.io/linkerd-io/proxy:stable-2.8.1
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - sleep 40
        livenessProbe:
          httpGet:
            path: /live
            port: 4191
          initialDelaySeconds: 10
        name: linkerd-proxy
        ports:
        - containerPort: 4143
          name: linkerd-proxy
        - containerPort: 4191
          name: linkerd-admin
        readinessProbe:
          httpGet:
            path: /ready
            port: 4191
          initialDelaySeconds: 2
        resources:
          limits:
            memory: 250Mi
          requests:
            cpu: 100m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 2102
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/run/linkerd/identity/end-entity
          name: linkerd-identity-end-entity
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --incoming-proxy-port
        - "4143"
        - --outgoing-proxy-port
        - "4140"
        - --proxy-uid
        - "2102"
        - --inbound-ports-to-ignore
        - 4190,4191
        image: gcr.io/linkerd-io/proxy-init:v1.3.3
        imagePullPolicy: IfNotPresent
        name: linkerd-init
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
            - NET_BIND_SERVICE
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePolicy: FallbackToLogsOnError
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-nginx-ingress
      serviceAccountName: ingress-nginx-ingress
      terminationGracePeriodSeconds: 60
      volumes:
      - emptyDir:
          medium: Memory
        name: linkerd-identity-end-entity
status:
  availableReplicas: 3
  conditions:
  - message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - message: ReplicaSet "ingress-nginx-ingress-controller-59fd9b7b85" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 7161
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
---

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

but I do have problems with a meshed app trying to reach out to datadog agent.

To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?

@mbelang
Author

mbelang commented Aug 14, 2020

I just saw this: hyperium/hyper#2136.

I imagine the Linkerd proxy is using that lib, right? According to them it is a keep-alive problem, and the fix is to set the client's keep-alive lower than the upstream's.

My upstream keep-alive timeout is 5s, so I set it to 2s for the proxy... no luck so far. I'm going to try setting the proxy's outbound keep-alive timeout to 0ms so that a new connection is used every time.

@mbelang
Author

mbelang commented Aug 14, 2020

I mitigated all 502s on GETs with the nginx retry mechanism. I could also do it for non-idempotent requests, but that is a bit dangerous at the moment.
I have fewer problems now, but I'd still like to fix/understand what is going wrong with the Linkerd proxy.
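
For reference, this mitigation uses the nginx ingress retry annotations also visible in the Ingress manifest posted further down in this thread; a minimal sketch (timeout/tries values illustrative, and without the non_idempotent flag, so only idempotent requests such as GETs are retried):

    nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502
    nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"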

@mbelang
Author

mbelang commented Aug 17, 2020

@ihcsim here is an extract of an Ingress resource for a test application:

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    annotations:
      acme.cert-manager.io/http01-edit-in-place: "true"
      acme.cert-manager.io/http01-ingress-class: "true"
      cert-manager.io/cluster-issuer: letsencrypt
      certmanager.k8s.io/acme-challenge-type: dns01
      certmanager.k8s.io/acme-dns01-provider: route53
      certmanager.k8s.io/cluster-issuer: letsencrypt
      external-dns.alpha.kubernetes.io/target: REDACTED.
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
      meta.helm.sh/release-name: hello-k8s
      meta.helm.sh/release-namespace: hello-k8s
      nginx.ingress.kubernetes.io/configuration-snippet: |
        proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
        grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
      nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 non_idempotent
      nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
      nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    creationTimestamp: "2020-04-08T13:45:30Z"
    generation: 1
    labels:
      app: hello-k8s
      app.kubernetes.io/instance: hello-k8s
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: app
      branch-slug: master
      cci-build-number: "3877"
      cci-workflow-id: 1a3b6a13-6400-41e6-a9a0-3eafa8d420d8
      component: app
      helm.sh/chart: app-2.6.0
      place: ca
      pr-number: ""
      sha: 89e64f20862883f79fe25347958459076e281f4d
      short-sha: 89e64f2
      stage: prod
      tag: v0.19.2
      version: v0.19.2
    name: app
    namespace: hello-k8s
    resourceVersion: "63104678"
    selfLink: /apis/extensions/v1beta1/namespaces/hello-k8s/ingresses/app
    uid: 3639f2c6-799f-11ea-ad31-0e9b9c5b4912
  spec:
    rules:
    - host: hello-k8s.example.com
      http:
        paths:
        - backend:
            serviceName: app
            servicePort: http
    tls:
    - hosts:
      - hello-k8s.example.com
      secretName: hello-k8s.example.com-tls
  status:
    loadBalancer:
      ingress:
      - ip: x.x.x.x
      - ip: x.x.x.x
      - ip: x.x.x.x
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@mbelang
Author

mbelang commented Aug 18, 2020

I'm going to try setting the proxy's outbound keep-alive timeout to 0ms so that a new connection is used every time.

I discovered that 0ms is not supported.

I also tried raising the keep-alive timeout to 90s (higher than nginx's outbound timeout of 60s), without any luck either.

@olix0r
Member

olix0r commented Aug 18, 2020

@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:

  • nginx ingress, injected with proxy
  • app with python HTTP server, uninjected

Then, we should try putting consistent load on the ingress. Ideally, we'd test this all on EKS with the latest AWS CNI, as it seems plausible that it's a bad interaction at the network layer.

If we can reproduce this with this kind of setup, then I think it should be pretty straightforward to diagnose/fix. If we can't, we can start digging into more details about how this repro setup differs from @mbelang's actual system.
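
To make the backend half of that repro concrete, here is a sketch of an uninjected Python HTTP server Deployment (the name and image are hypothetical, not taken from this issue); the ingress side is just the injected nginx controller already shown above.

    # Hypothetical minimal repro backend: uninjected Python HTTP server.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: repro-backend              # hypothetical name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: repro-backend
      template:
        metadata:
          labels:
            app: repro-backend
          # note: no linkerd.io/inject annotation, so this pod stays unmeshed
        spec:
          containers:
          - name: http
            image: python:3.7-slim     # hypothetical image choice
            command: ["python", "-m", "http.server", "8080"]
            ports:
            - containerPort: 8080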

@olix0r
Member

olix0r commented Aug 20, 2020

@mbelang reports that this problem goes away when all pods are meshed, so this points strongly to the HTTP/1.1 client.

@steve-gray
Contributor

steve-gray commented Sep 17, 2020

We're seeing this break DNS in the cluster for us at the moment wherever anything tries to use TCP, including the linkerd-proxy instances. Here's an example of a curl from a non-meshed pod showing that the port opens fine (a normal no-talk hangup after a while):

* Expire in 0 ms for 6 (transfer 0x5615dec03f50)
*   Trying 172.20.0.10...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5615dec03f50)
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.64.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 172.20.0.10 left intact
curl: (52) Empty reply from server

Meshed pod gets bad gateways:

curl -vvv 172.20.0.10:53
* Rebuilt URL to: 172.20.0.10:53/
*   Trying 172.20.0.10...
* TCP_NODELAY set
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.52.1
> Accept: */*
> 
< HTTP/1.1 502 Bad Gateway
< content-length: 0
< date: Thu, 17 Sep 2020 22:24:10 GMT
< 
* Curl_http_done: called premature == 0
* Connection #0 to host 172.20.0.10 left intact

Meshing DNS isn't an option for us. The environment is edge-20.9.1 running Kubernetes 1.17 on AWS EKS. Direct port forwards to the DNS pods all work, and talking directly to services works - but interestingly this seems to be preventing the proxy itself from establishing identities for things:

linkerd-proxy [149249.354645611s]  WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)    
linkerd-proxy [149253.519311877s]  WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)   

@mbelang
Author

mbelang commented Sep 19, 2020

@olix0r I still notice some WARN inbound:accept{peer.addr=10.4.25.103:55298}:source{target.addr=10.4.25.215:80}: linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104) from time to time. I'm still unsure whether it is the same problem, but we haven't changed anything in our configs since I meshed all the apps.

@ihcsim
Contributor

ihcsim commented Sep 21, 2020

@steve-gray We think what you are seeing is closer to #4831. There is some ongoing investigation per #4831 (comment). Feel free to subscribe to that issue.

@stale

stale bot commented Dec 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 24, 2020
@mbelang
Author

mbelang commented Jan 9, 2021

Even with all my services meshed, I still get intermittent 502s. I can mitigate with retries at the client level, but that is not suitable. I could also set up retries in the mesh itself, but Linkerd doesn't support retries for requests with a body...
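
For context, mesh-side retries in Linkerd are configured through a ServiceProfile with routes marked isRetryable; a rough sketch for a hypothetical service (the service name, route, and budget values are illustrative, and as noted above this only helps for requests without a body):

    # Sketch of a ServiceProfile enabling retries for a hypothetical service.
    apiVersion: linkerd.io/v1alpha2
    kind: ServiceProfile
    metadata:
      name: app.hello-k8s.svc.cluster.local   # hypothetical service FQDN
      namespace: hello-k8s
    spec:
      routes:
      - name: GET /v1/users
        condition:
          method: GET
          pathRegex: /v1/users.*
        isRetryable: true            # retry this route on failure
      retryBudget:
        retryRatio: 0.2              # at most 20% extra load from retries
        minRetriesPerSecond: 10
        ttl: 10s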

@stale stale bot removed the wontfix label Jan 9, 2021
@cpretzer
Contributor

cpretzer commented Jan 9, 2021

Hi @mbelang, have you been able to reproduce this with 2.9.1?

You're right that retries only work for GET requests at the moment. There is an open issue (#3985) that we'd love help with, if you're interested.

@mbelang
Author

mbelang commented Jan 10, 2021

@cpretzer I have yet to update to 2.9.1. Were there any fixes or improvements to the proxy related to this issue?

@olix0r
Member

olix0r commented Jan 10, 2021

I'm not confident that retries would help in this case. It really depends on where we're encountering the issue, but I don't think we have enough data to know yet.

There were substantial changes between 2.8 and 2.9, especially around caching and discovery. (For instance, there is no longer any DNS resolution in the data path.) It would be good to test this more recent version, if only to ensure that the problem does not persist -- even if we are able to identify the underlying cause, we're unlikely to backport fixes onto 2.8.

If the issue persists, it would be helpful to at least get debug logs from both the client and server proxies, via config.linkerd.io/proxy-log-level: linkerd=debug,warn
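
In case it is useful, that annotation goes on the workload's pod template metadata; a minimal sketch (the surrounding structure is just a generic Deployment fragment):

    spec:
      template:
        metadata:
          annotations:
            config.linkerd.io/proxy-log-level: linkerd=debug,warn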

@mbelang
Author

mbelang commented Jan 11, 2021

@olix0r I know that retries would not solve the issue but would at least mitigate. I will plan an upgrade to 2.9.1 and see how it goes from there.

I did try putting the proxy in debug mode, but I didn't manage to get more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I don't want to debug it in the production cluster, though I suspect that the elasticity of the production cluster could have an impact on the issue.

@ewhauser

I came across this thread because I was also running into an issue using Linkerd and the Datadog agent together. In the setup I'm using, Datadog is installed as a daemonset using this Helm chart, so it is not meshed.

I get similar errors as described above. These logs are coming from the Go Datadog client:

2021/01/14 00:19:20 Datadog Tracer v1.27.0 ERROR: lost 2 traces: Bad Gateway, 11 additional messages skipped (first occurrence: 14 Jan 21 00:18 UTC)%                                                         

If I disable linkerd, then I no longer have any communications issues with the Datadog agent.

@olix0r olix0r added this to the stable-2.10 milestone Jan 14, 2021
@mbelang
Author

mbelang commented Jan 16, 2021

I ended up meshing the datadog pods as well, and that resulted in no more 502s from apps to the datadog agent. I do still see some 502s from the datadog agent to the Linkerd proxy's metrics collection API, and I suspect this is why some of my metrics in datadog are missing requests. I haven't had the chance to upgrade to 2.9.1 yet, but I will soon.

@olix0r any reason you tagged the issue for 2.10 release?

@olix0r
Member

olix0r commented Jan 17, 2021

@mbelang

any reason you tagged the issue for 2.10 release?

I want to make sure that we take a deeper look at issues like this before we cut another stable release.

@shamsalmon

In my case, 100% of my requests to the datadog tracer fail if I have Linkerd injected into a pod and try to send spans to datadog.

@mbelang
Author

mbelang commented Feb 4, 2021

@shamsalmon I didn't face that problem at all. It could be a bad configuration 🤷‍♂️

@kleimkuhler
Contributor

If anyone on this thread would like to try the most recent edge release, edge-21.3.4, it includes PR #5904, which fixed some other issues with Datadog (and host-network pods in general). I'll keep this open a little while longer for follow-up questions or comments.

@kleimkuhler
Contributor

If you run into this issue again, please feel free to open a new one with more recent logs and a description. The Datadog issue was fixed in #5904, and the more recent edge and stable releases include more helpful logs. Thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021