Intermittent 502 Bad Gateway issue when service is meshed #4870

Closed
mbelang opened this issue Aug 12, 2020 · 27 comments

@mbelang

mbelang commented Aug 12, 2020

Bug Report

What is the issue?

Without the Linkerd proxy, there is no trace of 502 Bad Gateway errors at either the ingress level or the app level.
With the Linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway errors appear in the system.

I see two types of errors from the proxy:

## for this one the requests actually made it through to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection closed before message completed

and

## for this one the requests never made it to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)

both lead to (on ingress)

"GET /v1/users?ids=xxxxxxx,xxxxx HTTP/1.1" 502 0 "-" "python-requests/2.23.0" 1708 0.097 [xxxxx-app-http] [] 10.4.27.106:80 0 0.100 502 7543fb48ce22d5c6145b97daadde93d9

or (on an app)

## Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent daemonset

Failed to send traces to Datadog Agent at http://10.4.87.60:8126: HTTP error status 502, reason Bad Gateway, message content-length: 0
date: Wed, 12 Aug 2020 17:58:22 GMT

I have eliminated scale-up and scale-down events of the ingress and app as a cause. I have also ruled out graceful termination, for which I gave the proxy a sleep period of 30s on the app and 40s on the ingress.

How can it be reproduced?

Logs, error output, etc

Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:

1829200 2893.254017273    127.0.0.1 → 127.0.0.1    TCP 68 443 → 57590 [ACK] Seq=3277 Ack=1894 Win=175744 Len=0 TSval=378310645 TSecr=378310644
1829201 2893.254172427   10.3.45.50 → 127.0.0.1    HTTP 1701 GET /v1/users/xxxxxxxx?include=active_contract HTTP/1.1 
1829202 2893.254471803   10.3.45.50 → 10.3.23.252  TCP 76 47414 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2090260178 TSecr=0 WS=128
1829203 2893.254837416  10.3.23.252 → 10.3.45.50   TCP 56 80 → 47414 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1829204 2893.254974304  10.3.23.252 → 10.3.45.50   HTTP 152 HTTP/1.1 502 Bad Gateway 

I got similar traces with tcpdump via the ksniff kubectl plugin.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2020-08-13T08:39:27Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

Environment

  • Kubernetes Version: 1.16, 1.17
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2 (eks optimized ami)
  • Linkerd version: 2.8.1
  • nginx ingress version: 0.34.1
  • upstream app: python 3.7 with hypercorn server

nginx ingress service annotations

    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    service.beta.kubernetes.io/aws-load-balancer-type: elb

nginx ingress deployment annotations

   config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
   linkerd.io/inject: enabled

Possible solution

Tune all timeouts per app/service with proxy annotations, and also globally via the chart deployment's global configuration (see the sketch after this list):

  • keep-alive
  • connection
  • read timeout (I didn't see this one in Linkerd, but it would be similar to nginx's)
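
As a rough illustration of the kind of tuning meant above, here is a sketch of overriding the proxy's timeout-related environment variables on the injected linkerd-proxy container. The variable names appear elsewhere in this issue; the values are placeholders, not recommendations.

    # Sketch only: env overrides on the injected linkerd-proxy container.
    # Values are illustrative placeholders, not recommendations.
    - name: LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT    # time allowed to establish outbound connections
      value: 10000ms
    - name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE  # outbound TCP keep-alive
      value: 10000ms
    - name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE    # inbound TCP keep-alive
      value: 10000ms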

Additional context

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

@mbelang To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened. Then you tried increasing the outbound connection timeout on the proxy of the nginx ingress by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per #4759, but to no avail.

Also, the TCP dump shows the 502 coming back from the datadog agent, which is an uninjected workload running on the same cluster. There are no error logs on the upstream datadog agent.

Here I suspect a connection timeout that could be tuned to give the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent daemonset

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Let me know if my understanding is correct.

@mbelang
Author

mbelang commented Aug 13, 2020

To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened

I'm not sure where I can see the specific metrics for the proxy container. I didn't check that, but from what I've seen so far, I get some 502s while the CPU on the ingress pod stays completely flat.

Also the TCP dump shows the 502 coming back from the datadog agent

The TCP dump above is from the ingress controller trying to contact the upstream app, not the datadog agent, but I did get a trace for the datadog agent and it is exactly the same.

by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per

I did play with that, trying different values from 5s to 15s, with no luck. I also suspected the keep-alive, which I tried lowering to 4s and raising to 90s, without any luck either.

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Yes

Is there a way to set the read timeout (https://github.com/linkerd/linkerd2-proxy/blob/13b5fd65da6999f1d3d4d166983af8d54034d6e4/linkerd/app/integration/src/tcp.rs#L165)? I didn't manage to see where that function is used or what the default value is.

As you can see above, I have two problems:

  1. connection reset by peer (the request never made it to the upstream service)
  2. connection closed before message completed (I managed to find that the request actually made it through to the upstream service, but the connection was cut, I imagine by the proxy)

For 1) I suspect keep-alive or connection timeouts, but no luck so far.
For 2) I suspect the read timeout, but I have no proof...

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

So it sounds like the errors are only seen on the nginx ingress controller's outbound side? Do you have a minimal set of YAML that we can use for a repro? Thanks.

@mbelang
Author

mbelang commented Aug 13, 2020

So far yes, but I do have problems with a meshed app trying to reach the datadog agent. I only have the ingress and one app meshed in the production environment. So far, for the app, there are no 502s for requests to other apps, which is good.

The biggest problem now is the ingress.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "14"
    meta.helm.sh/release-name: ingress
    meta.helm.sh/release-namespace: ingress
  generation: 7161
  labels:
    app: nginx-ingress
    app.kubernetes.io/component: controller
    app.kubernetes.io/managed-by: Helm
    chart: nginx-ingress-1.39.0
    heritage: Helm
    release: ingress
  name: ingress-nginx-ingress-controller
  namespace: ingress
  resourceVersion: "62830529"
  selfLink: /apis/apps/v1/namespaces/ingress/deployments/ingress-nginx-ingress-controller
  uid: c041898e-78dd-11ea-ad31-0e9b9c5b4912
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-ingress
      release: ingress
  strategy:
    rollingUpdate:
      maxSurge: 33%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/nginx-ingress-controller.check_names: '["nginx_ingress_controller"]'
        ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
        ad.datadoghq.com/nginx-ingress-controller.instances: '[{"prometheus_url":
          "http://%%host%%:10254/metrics"}]'
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
        kubectl.kubernetes.io/restartedAt: "2020-08-04T16:10:44-04:00"
        linkerd.io/created-by: linkerd/cli stable-2.8.1
        linkerd.io/identity-mode: default
        linkerd.io/proxy-version: stable-2.8.1
      labels:
        app: nginx-ingress
        app.kubernetes.io/component: controller
        component: controller
        linkerd.io/control-plane-ns: linkerd
        linkerd.io/proxy-deployment: ingress-nginx-ingress-controller
        linkerd.io/workload-ns: ingress
        release: ingress
    spec:
      containers:
      - args:
        - /nginx-ingress-controller
        - --default-backend-service=ingress/ingress-nginx-ingress-default-backend
        - --election-id=ingress-controller-leader
        - --ingress-class=nginx
        - --configmap=ingress/ingress-nginx-ingress-controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: nginx-ingress-controller
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 10254
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 150m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: LINKERD2_PROXY_LOG
          value: warn,linkerd=info
        - name: LINKERD2_PROXY_DESTINATION_SVC_ADDR
          value: linkerd-dst.linkerd.svc.cluster.local:8086
        - name: LINKERD2_PROXY_DESTINATION_GET_NETWORKS
          value: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
        - name: LINKERD2_PROXY_CONTROL_LISTEN_ADDR
          value: 0.0.0.0:4190
        - name: LINKERD2_PROXY_ADMIN_LISTEN_ADDR
          value: 0.0.0.0:4191
        - name: LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR
          value: 127.0.0.1:4140
        - name: LINKERD2_PROXY_INBOUND_LISTEN_ADDR
          value: 0.0.0.0:4143
        - name: LINKERD2_PROXY_DESTINATION_GET_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE
          value: 10000ms
        - name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE
          value: 90000ms
        - name: _pod_ns
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LINKERD2_PROXY_DESTINATION_CONTEXT
          value: ns:$(_pod_ns)
        - name: LINKERD2_PROXY_IDENTITY_DIR
          value: /var/run/linkerd/identity/end-entity
        - name: LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS
          value: |
            -----BEGIN CERTIFICATE-----
            REDACTED
            -----END CERTIFICATE-----
        - name: LINKERD2_PROXY_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/kubernetes.io/serviceaccount/token
        - name: LINKERD2_PROXY_IDENTITY_SVC_ADDR
          value: linkerd-identity.linkerd.svc.cluster.local:8080
        - name: _pod_sa
          valueFrom:
            fieldRef:
              fieldPath: spec.serviceAccountName
        - name: _l5d_ns
          value: linkerd
        - name: _l5d_trustdomain
          value: cluster.local
        - name: LINKERD2_PROXY_IDENTITY_LOCAL_NAME
          value: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_IDENTITY_SVC_NAME
          value: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_DESTINATION_SVC_NAME
          value: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_TAP_SVC_NAME
          value: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        image: gcr.io/linkerd-io/proxy:stable-2.8.1
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - sleep 40
        livenessProbe:
          httpGet:
            path: /live
            port: 4191
          initialDelaySeconds: 10
        name: linkerd-proxy
        ports:
        - containerPort: 4143
          name: linkerd-proxy
        - containerPort: 4191
          name: linkerd-admin
        readinessProbe:
          httpGet:
            path: /ready
            port: 4191
          initialDelaySeconds: 2
        resources:
          limits:
            memory: 250Mi
          requests:
            cpu: 100m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 2102
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/run/linkerd/identity/end-entity
          name: linkerd-identity-end-entity
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --incoming-proxy-port
        - "4143"
        - --outgoing-proxy-port
        - "4140"
        - --proxy-uid
        - "2102"
        - --inbound-ports-to-ignore
        - 4190,4191
        image: gcr.io/linkerd-io/proxy-init:v1.3.3
        imagePullPolicy: IfNotPresent
        name: linkerd-init
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
            - NET_BIND_SERVICE
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePolicy: FallbackToLogsOnError
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-nginx-ingress
      serviceAccountName: ingress-nginx-ingress
      terminationGracePeriodSeconds: 60
      volumes:
      - emptyDir:
          medium: Memory
        name: linkerd-identity-end-entity
status:
  availableReplicas: 3
  conditions:
  - message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - message: ReplicaSet "ingress-nginx-ingress-controller-59fd9b7b85" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 7161
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
---

@ihcsim
Contributor

ihcsim commented Aug 13, 2020

but I do have problems with a meshed app trying to reach out to datadog agent.

To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?

@mbelang
Author

mbelang commented Aug 14, 2020

I just saw this: hyperium/hyper#2136.

I imagine the Linkerd proxy is using that lib, right? According to them it is a keep-alive problem, and the fix is to set the client's keep-alive lower than the upstream's.

My upstream keep-alive timeout is 5s, so I set it to 2s for the proxy... no luck so far. I'm going to try setting the proxy's outbound keep-alive timeout to 0ms so that a new connection is used every time.

@mbelang
Author

mbelang commented Aug 14, 2020

I mitigated all 502s on GETs with the nginx retry mechanism. I could also do it for non-idempotent requests, but that is a bit dangerous at the moment.
I have fewer problems now, but I'd still like to fix/understand what is going wrong with the Linkerd proxy.
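
For reference, this mitigation uses the nginx ingress retry annotations also visible in the Ingress manifest posted further down in this thread; a minimal sketch (timeout/tries values illustrative, and without the non_idempotent flag, so only idempotent requests such as GETs are retried):

    nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502
    nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"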

@mbelang
Author

mbelang commented Aug 17, 2020

@ihcsim here is an extract of an Ingress resource for a test application:

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    annotations:
      acme.cert-manager.io/http01-edit-in-place: "true"
      acme.cert-manager.io/http01-ingress-class: "true"
      cert-manager.io/cluster-issuer: letsencrypt
      certmanager.k8s.io/acme-challenge-type: dns01
      certmanager.k8s.io/acme-dns01-provider: route53
      certmanager.k8s.io/cluster-issuer: letsencrypt
      external-dns.alpha.kubernetes.io/target: REDACTED.
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
      meta.helm.sh/release-name: hello-k8s
      meta.helm.sh/release-namespace: hello-k8s
      nginx.ingress.kubernetes.io/configuration-snippet: |
        proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
        grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
      nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 non_idempotent
      nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
      nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    creationTimestamp: "2020-04-08T13:45:30Z"
    generation: 1
    labels:
      app: hello-k8s
      app.kubernetes.io/instance: hello-k8s
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: app
      branch-slug: master
      cci-build-number: "3877"
      cci-workflow-id: 1a3b6a13-6400-41e6-a9a0-3eafa8d420d8
      component: app
      helm.sh/chart: app-2.6.0
      place: ca
      pr-number: ""
      sha: 89e64f20862883f79fe25347958459076e281f4d
      short-sha: 89e64f2
      stage: prod
      tag: v0.19.2
      version: v0.19.2
    name: app
    namespace: hello-k8s
    resourceVersion: "63104678"
    selfLink: /apis/extensions/v1beta1/namespaces/hello-k8s/ingresses/app
    uid: 3639f2c6-799f-11ea-ad31-0e9b9c5b4912
  spec:
    rules:
    - host: hello-k8s.example.com
      http:
        paths:
        - backend:
            serviceName: app
            servicePort: http
    tls:
    - hosts:
      - hello-k8s.example.com
      secretName: hello-k8s.example.com-tls
  status:
    loadBalancer:
      ingress:
      - ip: x.x.x.x
      - ip: x.x.x.x
      - ip: x.x.x.x
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@mbelang
Author

mbelang commented Aug 18, 2020

I'm going to try setting the proxy's outbound keep-alive timeout to 0ms so that a new connection is used every time.

I discovered that 0ms is not supported.

I also tried raising the keep-alive timeout to 90s (higher than nginx's outbound timeout of 60s), without any luck either.

@olix0r
Member

olix0r commented Aug 18, 2020

@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:

  • nginx ingress, injected with proxy
  • app with python HTTP server, uninjected

Then, we should try putting consistent load on the ingress. Ideally, we'd test this all on EKS with the latest AWS CNI, as it seems plausible that it's a bad interaction at the network layer.

If we can reproduce this with this kind of setup, then I think it should be pretty straightforward to diagnose/fix. If we can't, we can start digging into more details about how this repro setup differs from @mbelang's actual system.
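
To make the backend half of that repro concrete, here is a sketch of an uninjected Python HTTP server Deployment (the name and image are hypothetical, not taken from this issue); the ingress side is just the injected nginx controller already shown above.

    # Hypothetical minimal repro backend: uninjected Python HTTP server.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: repro-backend              # hypothetical name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: repro-backend
      template:
        metadata:
          labels:
            app: repro-backend
          # note: no linkerd.io/inject annotation, so this pod stays unmeshed
        spec:
          containers:
          - name: http
            image: python:3.7-slim     # hypothetical image choice
            command: ["python", "-m", "http.server", "8080"]
            ports:
            - containerPort: 8080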

@olix0r
Member

olix0r commented Aug 20, 2020

@mbelang reports that this problem goes away when all pods are meshed, so this points strongly to the HTTP/1.1 client.

@steve-gray
Contributor

steve-gray commented Sep 17, 2020

We're seeing this break DNS in the cluster for us at the moment wherever anything tries to use TCP, including the linkerd-proxy instances. Here's an example of a curl from a non-meshed pod showing that the port opens fine (a normal no-talk hangup after a while):

* Expire in 0 ms for 6 (transfer 0x5615dec03f50)
*   Trying 172.20.0.10...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5615dec03f50)
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.64.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 172.20.0.10 left intact
curl: (52) Empty reply from server

Meshed pod gets bad gateways:

curl -vvv 172.20.0.10:53
* Rebuilt URL to: 172.20.0.10:53/
*   Trying 172.20.0.10...
* TCP_NODELAY set
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.52.1
> Accept: */*
> 
< HTTP/1.1 502 Bad Gateway
< content-length: 0
< date: Thu, 17 Sep 2020 22:24:10 GMT
< 
* Curl_http_done: called premature == 0
* Connection #0 to host 172.20.0.10 left intact

Meshing DNS isn't an option for us. The environment is edge-20.9.1 running Kubernetes 1.17 on AWS EKS. Direct port forwards to the DNS pods all work, and talking directly to services works - but interestingly this seems to be preventing the proxy itself from establishing identities for things:

linkerd-proxy [149249.354645611s]  WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)    
linkerd-proxy [149253.519311877s]  WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)   

@mbelang
Author

mbelang commented Sep 19, 2020

@olix0r I still notice some WARN inbound:accept{peer.addr=10.4.25.103:55298}:source{target.addr=10.4.25.215:80}: linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104) from time to time. I'm still unsure whether it is the same problem, but we haven't changed anything in our configs since I meshed all the apps.

@ihcsim
Contributor

ihcsim commented Sep 21, 2020

@steve-gray We think what you are seeing is closer to #4831. There is some ongoing investigation per #4831 (comment). Feel free to subscribe to that issue.

@stale

stale bot commented Dec 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 24, 2020
@mbelang
Author

mbelang commented Jan 9, 2021

Even with all my services meshed, I still get intermittent 502s. I can mitigate with retries at the client level, but that is not suitable. I could also set up retries in the mesh itself, but Linkerd doesn't support retries for requests with a body...
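
For context, mesh-side retries in Linkerd are configured through a ServiceProfile with routes marked isRetryable; a rough sketch for a hypothetical service (the service name, route, and budget values are illustrative, and as noted above this only helps for requests without a body):

    # Sketch of a ServiceProfile enabling retries for a hypothetical service.
    apiVersion: linkerd.io/v1alpha2
    kind: ServiceProfile
    metadata:
      name: app.hello-k8s.svc.cluster.local   # hypothetical service FQDN
      namespace: hello-k8s
    spec:
      routes:
      - name: GET /v1/users
        condition:
          method: GET
          pathRegex: /v1/users.*
        isRetryable: true            # retry this route on failure
      retryBudget:
        retryRatio: 0.2              # at most 20% extra load from retries
        minRetriesPerSecond: 10
        ttl: 10s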

@stale stale bot removed the wontfix label Jan 9, 2021
@cpretzer
Contributor

cpretzer commented Jan 9, 2021

Hi @mbelang, have you been able to reproduce this with 2.9.1?

You're right that retries only work for GET requests at the moment. There is an open issue (#3985) that we'd love help with, if you're interested.

@mbelang
Author

mbelang commented Jan 10, 2021

@cpretzer I have yet to update to 2.9.1. Were there any fixes or improvements to the proxy related to this issue?

@olix0r
Member

olix0r commented Jan 10, 2021

I'm not confident that retries would help in this case. It really depends on where we're encountering the issue, but I don't think we have enough data to know yet.

There were substantial changes between 2.8 and 2.9, especially around caching and discovery. (For instance, there is no longer any DNS resolution in the data path.) It would be good to test this more recent version, if only to ensure that the problem does not persist -- even if we are able to identify the underlying cause, we're unlikely to backport fixes onto 2.8.

If the issue persists, it would be helpful to at least get debug logs from both the client and server proxies, via config.linkerd.io/proxy-log-level: linkerd=debug,warn
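
In case it is useful, that annotation goes on the workload's pod template metadata; a minimal sketch (the surrounding structure is just a generic Deployment fragment):

    spec:
      template:
        metadata:
          annotations:
            config.linkerd.io/proxy-log-level: linkerd=debug,warn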

@mbelang
Author

mbelang commented Jan 11, 2021

@olix0r I know that retries would not solve the issue but would at least mitigate. I will plan an upgrade to 2.9.1 and see how it goes from there.

I did try putting the proxy in debug mode, but I didn't manage to get more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I don't want to debug it in the production cluster, though I suspect that the elasticity of the production cluster could have an impact on the issue.

@ewhauser

I came across this thread because I was also running into an issue using Linkerd and the Datadog agent together. In the setup I'm using, Datadog is installed as a daemonset using this Helm chart, so it is not meshed.

I get similar errors as described above. These logs are coming from the Go Datadog client:

2021/01/14 00:19:20 Datadog Tracer v1.27.0 ERROR: lost 2 traces: Bad Gateway, 11 additional messages skipped (first occurrence: 14 Jan 21 00:18 UTC)%                                                         

If I disable linkerd, then I no longer have any communications issues with the Datadog agent.

@olix0r olix0r added this to the stable-2.10 milestone Jan 14, 2021
@mbelang
Author

mbelang commented Jan 16, 2021

I ended up meshing the datadog pods as well, and that resulted in no more 502s from apps to the datadog agent. I do still see some 502s from the datadog agent to the Linkerd proxy's metrics collection API, and I suspect this is why some of my metrics in datadog are missing requests. I haven't had the chance to upgrade to 2.9.1 yet, but I will soon.

@olix0r any reason you tagged the issue for 2.10 release?

@olix0r
Member

olix0r commented Jan 17, 2021

@mbelang

any reason you tagged the issue for 2.10 release?

I want to make sure that we take a deeper look at issues like this before we cut another stable release.

@shamsalmon

In my case, 100% of my requests to the datadog tracer fail if I have Linkerd injected into a pod and try to send spans to datadog.

@mbelang
Author

mbelang commented Feb 4, 2021

@shamsalmon I didn't face that problem at all. It could be a bad configuration 🤷‍♂️

@kleimkuhler
Contributor

If anyone on this thread would like to try the most recent edge release, edge-21.3.4, it includes PR #5904, which fixed some other issues with Datadog (and host-network pods in general). I'll keep this open a little while longer for follow-up questions or comments.

@kleimkuhler
Contributor

If you run into this issue again, please feel free to open a new one with more recent logs and a description. The Datadog issue was fixed in #5904, and the more recent edge and stable releases include more helpful logs. Thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021