helm: connection failure #648

Closed
HofmannZ opened this issue Apr 14, 2021 · 6 comments
Labels
waiting-reply Waiting on the issue creator for a response before taking further action

Comments

@HofmannZ

Overnight, some of our services started throwing the following error:

upstream connect error or disconnect/reset before headers. reset reason: connection failure

The odd thing is that the behaviour persists on a fresh cluster.

Allow me to describe the current setup:

We have one deployment (A) that proxies traffic from the internet through a gateway (G). The deployment (A) calls another deployment (B) over HTTP. Both deployments have the protocol in the respective ServiceDefaults set to http2. And the ServiceIntentions are set up so that G can call A and A can call B.

These two deployments and gateway work fine.
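
For completeness, deployment (A) follows the same pattern as the full configs below. A simplified sketch of its ServiceDefaults and ServiceIntentions (the source name ingress-gateway is just a placeholder for the gateway (G), not our actual gateway name):

apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: deployment-a
  namespace: foo
spec:
  protocol: http2
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: deployment-a
  namespace: foo
spec:
  destination:
    name: deployment-a
  sources:
    - name: ingress-gateway # placeholder: the gateway (G) that proxies traffic from the internet
      action: allow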

We also have another deployment (C), which is called by deployment (B) over gRPC. This deployment has the protocol in the ServiceDefaults set to grpc. And the ServiceIntentions are set up so that B can call C.

This used to work, but broke overnight. Any idea what may have caused this?

Config of deployment (B)

apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: deployment-b
  namespace: foo
spec:
  protocol: http2
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: deployment-b
  namespace: foo
spec:
  destination:
    name: deployment-b
  sources:
    - name: deployment-a
      action: allow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-b
  namespace: foo
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: deployment-b
  name: deployment-b
  namespace: foo
spec:
  type: ClusterIP
  selector:
    app: deployment-b
  ports:
    - name: http
      port: 80
      targetPort: 3000
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deployment-b
  name: deployment-b-deployment
  namespace: foo
spec:
  selector:
    matchLabels:
      app: deployment-b
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 75%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: "deployment-b"
        consul.hashicorp.com/connect-service-upstreams: deployment-c:3001
      labels:
        app: deployment-b
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: deployment-b
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: deployment-b
        image: deployment-b:v1.0.0
        env:
        - name: DEPLOYMENT_C_ADDRESS
          value: localhost:3001
        livenessProbe:
          ...
        readinessProbe:
          ...
        resources:
          limits:
            cpu: 160m
            memory: 400Mi
          requests:
            cpu: 80m
            memory: 200Mi
      serviceAccountName: deployment-b
      terminationGracePeriodSeconds: 30

Config of deployment (C)

apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: deployment-c
  namespace: foo
spec:
  protocol: grpc
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: deployment-c
  namespace: foo
spec:
  destination:
    name: deployment-c
  sources:
    - name: deployment-b
      action: allow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-c
  namespace: foo
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: deployment-c
  name: deployment-c
  namespace: foo
spec:
  type: ClusterIP
  selector:
    app: deployment-c
  ports:
    - name: grpc
      port: 3000
      targetPort: 3000
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deployment-c
  name: deployment-c-deployment
  namespace: foo
spec:
  selector:
    matchLabels:
      app: deployment-c
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 75%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: "deployment-c"
      labels:
        app: deployment-c
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: deployment-c
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: deployment-c
        image: deployment-c:v1.0.0
        livenessProbe:
          ...
        readinessProbe:
          ...
        resources:
          limits:
            cpu: 160m
            memory: 400Mi
          requests:
            cpu: 80m
            memory: 200Mi
      serviceAccountName: deployment-c
      terminationGracePeriodSeconds: 30
@ndhanushkodi
Contributor

Hi @HofmannZ, I tried to reproduce this by deploying a grpc-client (B) ---grpc--> grpc-server (C), and I saw the requests go through successfully using a very similar setup. I omitted just the Kubernetes Services and used my own images for the grpc-client and grpc-server. See the details below for the exact configuration.

First I deployed using the following `consul-values.yaml` with consul-helm 0.31.1, then I did `kubectl apply -f` on both the grpc service files, looked at the logs for deployment (B), and saw that it was able to reach deployment (C).

consul-values.yaml

global:
  domain: consul
  datacenter: dc1
  metrics:
    enabled: true
    enableAgentMetrics: true

acls:
  manageSystemACLs: true

server:
  replicas: 1
  bootstrapExpect: 1
client:
  enabled: true
  grpc: true
controller:
  enabled: true

# Installs Prometheus and Grafana
prometheus:
  enabled: true

grafana:
  enabled: true

# UI metrics values
ui:
  enabled: true
  metrics:
    enabled: true
    baseURL: http://prometheus-server
    provider: "prometheus"

# Connect service metrics and metrics merging values
connectInject:
  enabled: true
  metrics:
    defaultEnabled: true

grpc-server.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-c
  namespace: default
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: deployment-c
  namespace: default
spec:
  protocol: grpc
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: deployment-c
  namespace: default
spec:
  destination:
    name: deployment-c
  sources:
    - name: deployment-b
      action: allow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deployment-c
  name: deployment-c-deployment
  namespace: default
spec:
  selector:
    matchLabels:
      app: deployment-c
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: "deployment-c"
      labels:
        app: deployment-c
    spec:
      serviceAccountName: deployment-c
      containers:
      - name: deployment-c
        image: gcr.io/nitya-293720/grpc-demo-server
        imagePullPolicy: Always
        ports:
          - containerPort: 50051

and grpc-client.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-b
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deployment-b
  name: deployment-b-deployment
  namespace: default
spec:
  selector:
    matchLabels:
      app: deployment-b
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: "deployment-b"
        consul.hashicorp.com/connect-service-upstreams: deployment-c:50051
      labels:
        app: deployment-b
    spec:
      serviceAccountName: deployment-b
      containers:
      - name: deployment-b
        image: gcr.io/nitya-293720/grpc-demo-client
        imagePullPolicy: Always

In testing, I originally did not have any container ports exposed on (C), which caused an upstream connect error. Once I added the container port, that went away.

Can you confirm that deployment-c has container port 3001 exposed? In your config, deployment-b is trying to reach deployment-c on that port, and I don't see that port in your deployment-c yaml. (I know you had this working before and then it stopped working, but can you check?)

      containers:
      - name: deployment-c
        image: deployment-c:v1.0.0
        ports:
          - containerPort: 3001

If that is not the issue, maybe we can narrow this down to the environment. Would you be able to deploy the grpc-client and grpc-server files to your Kubernetes cluster with your setup and see if the grpc-client is logging "Greeting: Hello World" successfully? If it fails in your environment, we can continue to debug further.

@HofmannZ
Author

Hi @ndhanushkodi! Thanks for having a look.

The client code seems to be missing one important part of our case: it takes the server address as an environment variable:

        env:
        - name: DEPLOYMENT_C_ADDRESS
          value: localhost:3001

With hardcoded addresses, we were already able to get it to work 😅.

Regarding the containerPort: that is set in Docker when the container image is built. In K8s, as far as I'm aware, containerPort is simply a way to override the port set in the container.

@thisisnotashwin
Contributor

Hey @HofmannZ! I'm glad you were able to get it to work! Which address did you have to hardcode in this case in order to get things working?

@HofmannZ
Author

@thisisnotashwin - What I meant to say is that we already had it working with hardcoded ports before opening the issue. The problem occurs when we use Consul for service discovery for localhost:3001 (which used to work fine and stopped working overnight 🤷).

We're happy to move back to K8s service discovery once we get our hands on the new transparent proxy 🎉.

Nevertheless, I thought it was a good idea to report the issue and dig into why this is happening suddenly.
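
For reference, my understanding is that the transparent proxy variant of deployment (B) would look roughly like the excerpt below: no connect-service-upstreams annotation, and the client dials the Kubernetes Service for deployment (C) directly. The annotation name is taken from the consul-k8s docs and is an assumption on my part; we haven't verified it yet.

  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service: "deployment-b"
        # Assumption: transparent proxy is enabled per pod with this annotation,
        # and no connect-service-upstreams entry is needed in that mode.
        consul.hashicorp.com/transparent-proxy: "true"
      labels:
        app: deployment-b
    spec:
      containers:
      - name: deployment-b
        image: deployment-b:v1.0.0
        env:
        - name: DEPLOYMENT_C_ADDRESS
          # Dial the Kubernetes Service instead of the localhost upstream listener.
          value: deployment-c.foo.svc.cluster.local:3000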

t-eckert changed the title from "upstream connect error or disconnect/reset before headers. reset reason: connection failure" to "helm: connection failure" Aug 24, 2021
t-eckert transferred this issue from hashicorp/consul-helm Aug 24, 2021
@lkysow
Member

lkysow commented Nov 4, 2021

Hi @HofmannZ, I know this is pretty old so sorry for bugging you, but I'm wondering if you've tried out transparent proxy and whether the issues described here are still occurring?

lkysow added the waiting-reply label Nov 4, 2021
@lkysow
Member

lkysow commented Nov 17, 2021

I'm going to close this for now but please ping us if we should re-open.

lkysow closed this as completed Nov 17, 2021