
http2 keepalive issue when mixing http2 and http1 which are routing to the same pod port #4836

Closed
thirdeyenick opened this issue Dec 16, 2019 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@thirdeyenick

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

NGINX Ingress controller version:
0.26.1

Kubernetes version (use kubectl version):
v1.13.11-gke.14

Environment:

  • Cloud provider or hardware configuration:
    GKE cluster at GCP

  • OS (e.g. from /etc/os-release):
    COS

  • Kernel (e.g. uname -a):

  • Install tools:

  • Others:

What happened:

We are using the following 2 ingress resources in a GKE cluster (taken from the documentation of ArgoCD with example hostnames):

apiVersion: v1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    generation: 1
    labels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/managed-by: Tiller
      app.kubernetes.io/name: argocd
      app.kubernetes.io/part-of: argocd
    name: argocd-server
    namespace: argocd
  spec:
    rules:
    - host: examplehost.example.com
      http:
        paths:
        - backend:
            serviceName: argocd-server
            servicePort: 80
          path: /
    tls:
    - hosts:
      - examplehost.example.com
      secretName: argocd-ingress-https
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: GRPC
    creationTimestamp: "2019-10-31T13:32:39Z"
    generation: 1
    labels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/managed-by: Tiller
      app.kubernetes.io/name: argocd
      app.kubernetes.io/part-of: argocd
    name: argocd-server-grpc
    namespace: argocd
  spec:
    rules:
    - host: cli.examplehost.example.com
      http:
        paths:
        - backend:
            serviceName: argocd-server
            servicePort: 443
          path: /
    tls:
    - hosts:
      - cli.examplehost.example.com
      secretName: argocd-ingress-grpc

Both ingress resources are routing to the following k8s service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: argocd-server
    app.kubernetes.io/part-of: argocd
  name: argocd-server
  namespace: argocd
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/name: argocd-server
  sessionAffinity: None
  type: ClusterIP

This way, the NGINX ingress terminates the TLS session and forwards the traffic to the target pod. The application in the target pod then decides on port 8080 whether the incoming traffic is HTTP/1 or HTTP/2 (gRPC).

We noticed that when we make gRPC calls to the "cli.examplehost.example.com" ingress, we sometimes get an error with status code 500. A subsequent call to the same endpoint then works without an issue.

In the log of the nginx ingress controller we see the following message in the case of an error:

2019/12/13 14:35:42 [error] 37#37: *366606 no connection data found for keepalive http2 connection while sending request to upstream, client: <ip address>, server: cli.examplehost.example.com, request: "POST /version.VersionService/Version HTTP/2.0", upstream: "grpc://172.16.4.48:8080", host: "cli.examplehost.example.com:443"

It looks like there is an issue with the keepalive connections from nginx to the target pod. We only see the problem when using the gRPC ingress host; calls to the HTTPS ingress resource always work.

What you expected to happen:

Requests to the gRPC ingress should not fail intermittently.

How to reproduce it (as minimally and precisely as possible):

Have a target application which can serve HTTP/2 and HTTP/1 on the same port (like ArgoCD). Then create two ingress resources (one for gRPC and one for HTTPS) which route to the same target pod and port.

Anything else we need to know:

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 15, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 14, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@glittershark

Can we reopen this?

@ramanNarasimhan77

I am still looking for some solution/workaround for this issue. Can this issue be reopened?

@longwuyuan
Contributor

I know nothing about multiplexing HTTP/1 & HTTP/2 on the same port, but I am curious whether the problem persists if you try different targetPorts, because the official Argo CD installer uses different targetPorts for http & grpc: https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: dex-server
    app.kubernetes.io/name: argocd-dex-server
    app.kubernetes.io/part-of: argocd
  name: argocd-dex-server
spec:
  ports:
  - name: http
    port: 5556
    protocol: TCP
    targetPort: 5556
  - name: grpc
    port: 5557
    protocol: TCP
    targetPort: 5557
  - name: metrics
    port: 5558
    protocol: TCP
    targetPort: 5558
  selector:
    app.kubernetes.io/name: argocd-dex-server

Also curious whether ArgoCD was installed with a custom configuration that uses one single port, 8080, on the argocd-server pod for both http as well as grpc.
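
To make that suggestion concrete, below is a minimal sketch of what the argocd-server Service could look like with separate targetPorts. Note that the dedicated gRPC container port (8081 here) is purely hypothetical: it only applies if the argocd-server pod can actually be configured to serve gRPC on a port of its own, which the default installation does not do.

apiVersion: v1
kind: Service
metadata:
  name: argocd-server
  namespace: argocd
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8081  # hypothetical dedicated gRPC port; requires the server to listen separately for gRPC

With distinct targetPorts, HTTP/1 and HTTP/2 traffic would reach different upstream endpoints (pod IP:8080 vs pod IP:8081), so nginx's keepalive pool should no longer hand an HTTP/1 connection to a gRPC request.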

@brocaar

brocaar commented Jun 30, 2021

I am experiencing the same issue and might have found the solution.

In my case I wanted to expose a single endpoint serving HTTP, WebSockets and gRPC. Without the GRPC backend annotation, HTTP and WebSockets would work, but gRPC did not work. With the annotation, gRPC and HTTP requests did work, but WebSockets were broken.

I created two ingresses, one matching the gRPC endpoint paths with the GRPC backend annotation (in my case everything prefixed with "/api.", e.g. "/api.MyService/Hello") and one for handling all the other requests.
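
For illustration, here is a rough sketch of that two-ingress split, using an assumed host, service name and port, and the networking.k8s.io/v1 Ingress API; pathType ImplementationSpecific lets ingress-nginx treat "/api." as a plain nginx location prefix:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-grpc  # assumed name
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com  # assumed host
    http:
      paths:
      - path: /api.  # matches gRPC calls such as /api.MyService/Hello
        pathType: ImplementationSpecific
        backend:
          service:
            name: myapp  # assumed service
            port:
              number: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-http  # assumed name
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 8080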

When only making HTTP and WebSockets requests, everything would work fine. However, I noticed that the first gRPC request(s) would fail, after which they would succeed. After making some gRPC requests I noticed the same for the HTTP and WebSocket requests; again, a few requests later these would work.

Looking at the configuration that the nginx-ingress generates, there is indeed one location matching the gRPC prefix and it contains all the gRPC related configs. The other location matching the other requests does not contain the gRPC config, so that is good. I did notice that both locations are using the same upstream:

        upstream upstream_balancer {
                ### Attention!!!
                #
                # We no longer create "upstream" section for every backend.
                # Backends are handled dynamically using Lua. If you would like to debug
                # and see what backends ingress-nginx has in its memory you can
                # install our kubectl plugin https://kubernetes.github.io/ingress-nginx/kubectl-plugin.
                # Once you have the plugin you can use "kubectl ingress-nginx backends" command to
                # inspect current backends.
                #
                ###

                server 0.0.0.1; # placeholder

                balancer_by_lua_block {
                        balancer.balance()
                }

                keepalive 32;

                keepalive_timeout  60s;
                keepalive_requests 100;

        }

My assumption

My assumption is that the error is caused by the keepalive. The upstream keeps a connection pool open for both HTTP/2 and HTTP/1 requests. These get mixed up, as the connections are handed out randomly: a gRPC request can end up on a connection which was previously used for HTTP/1, and the other way around. That would also explain why, after making enough requests of the same kind, things start working again: the initial requests fail and those connections are dropped from the pool.

Possible solution

I'm currently testing overriding these settings using a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller
data:
  upstream-keepalive-requests: "1"

Initially I tried upstream-keepalive: "0", but this removed all keepalive settings from the upstream section and did not work (as the default values might not be 0). Setting upstream-keepalive-requests: "1" means that each connection is only used for a single request, which effectively disables the connection pool as well.

This seems to be working for me!

Note: it took me some time to figure out the correct name for the ConfigMap, as different ConfigMap-related questions gave different names. To find out the correct name to use, I ran kubectl describe deployment nginx-ingress-controller. Somewhere in the output you'll find:

    Args:
      /nginx-ingress-controller
      --default-backend-service=default/nginx-ingress-default-backend
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=default/nginx-ingress-controller

The correct ConfigMap name can be taken from the last line (in my case in the default namespace).
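
If the controller is managed by the ingress-nginx Helm chart instead, the same override can usually be expressed through chart values rather than editing the ConfigMap directly; a sketch, assuming the chart's controller.config mechanism:

# values.yaml (sketch) for the ingress-nginx Helm chart
controller:
  config:
    upstream-keepalive-requests: "1"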

@thetruechar

thetruechar commented Nov 10, 2021

Can we reopen this?

@k8s-ci-robot
Contributor

@thetruechar: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RemiBou

RemiBou commented Oct 23, 2023

Can we reopen this? We have to use a nasty workaround (a rough sketch follows below):

  • use two k8s Services, one ClusterIP for REST and one headless for gRPC

  • set the ingress for REST with service-upstream true

Now nginx doesn't reuse connections between HTTP/2 and HTTP/1, but:

  • the headless Service means client-side load balancing, which has its own issues

  • we still have to use service-upstream for HTTP/1, which might cause poor load balancing
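
A rough sketch of this workaround, with assumed names (the headless Service is the one the gRPC ingress points at, while the REST ingress uses the ClusterIP Service together with the service-upstream annotation):

apiVersion: v1
kind: Service
metadata:
  name: myapp-rest  # assumed: ClusterIP Service used for HTTP/1 (REST)
spec:
  selector:
    app: myapp
  ports:
  - name: http
    port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-grpc  # assumed: headless Service used for gRPC
spec:
  clusterIP: None  # headless: nginx balances across the pod IPs itself
  selector:
    app: myapp
  ports:
  - name: grpc
    port: 8080
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-rest  # assumed name
  annotations:
    nginx.ingress.kubernetes.io/service-upstream: "true"  # proxy to the Service ClusterIP instead of the pod endpoints
spec:
  ingressClassName: nginx
  rules:
  - host: rest.example.com  # assumed host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-rest
            port:
              number: 80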
