
Slow TLS handshake when injecting the proxy with --tls=optional on an nginx-ingress pod #1880

Closed
jon-walton opened this issue Nov 28, 2018 · 15 comments
Assignees
Labels
area/identity (Automatic transport identity), area/proxy, bug, needs/more, priority/P1 (Planned for Release)

Comments

@jon-walton
Contributor

jon-walton commented Nov 28, 2018

Bug Report

What is the issue?

When using nginx-ingress and injecting the linkerd2 proxy with --tls=optional, any new TLS handshake originating from outside the cluster has a 2+ second delay between the ClientHello and ServerHello packets.

client                     server
      >>>> ClientHello >>>>
      <<<<     ACK     <<<<
      ---- 2+  Seconds ----
      <<<< ServerHello <<<<
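
The gap also shows up in curl's timing write-out variables, which is a quick way to measure it (the URL placeholder matches the repro below):

# curl -sk -o /dev/null -w 'tcp connect: %{time_connect}s, tls handshake done: %{time_appconnect}s\n' https://[ingress ip]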

How can it be reproduced?

  • install nginx ingress
  • terminate ssl on the ingress controller (we use cert-manager with letsencrypt)
  • inject the linkerd2 proxy with --tls=optional into the nginx ingress deployment
# helm fetch --untar stable/nginx-ingress
# helm template ./nginx-ingress | linkerd inject --tls=optional - | kubectl apply -f -
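
To confirm the sidecar was actually added, something like this works (the label selector assumes the stable chart's default app: nginx-ingress label; linkerd-proxy should show up next to the controller container):

# kubectl get pods -l app=nginx-ingress -o jsonpath='{.items[*].spec.containers[*].name}'

The Ingress used for testing: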
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-ingress
  labels:
    app.kubernetes.io/name: test-ingress
  annotations:
    certmanager.k8s.io/cluster-issuer: letsencrypt
    nginx.ingress.kubernetes.io/upstream-vhost: $service_name.$namespace.svc.cluster.local
spec:
  tls:
    - hosts:
        - example.com
      secretName: example-com-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: example-svc
              servicePort: http
# curl -v https://[ingress ip]
* Rebuilt URL to: https://[ingress ip]
*   Trying 10.1.1.94...
* Connected to [ingress ip] port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1


---- 2+ second delay ----



* SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: [ingress ip/domain] (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=[ingress ip/domain]
*        start date: Fri, 23 Nov 2018 06:11:49 GMT
*        expire date: Thu, 21 Feb 2019 06:11:49 GMT
*        issuer: C=US,O=Let's Encrypt,CN=Let's Encrypt Authority X3
*        compression: NULL
* ALPN, server accepted to use http/1.1

Test again without linkerd TLS on the nginx-ingress-controller:

# helm template ./nginx-ingress | linkerd inject - | kubectl apply -f -
# curl -v https://[ingress ip]

There is no delay between the ClientHello and ServerHello.

Logs, error output, etc


linkerd check output

kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can initialize the client.....................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]

Status check results are [ok]

Environment

  • Kubernetes Version: v1.12.1
  • Cluster Environment: Azure (acs-engine)
  • Host OS:
  • Linkerd version: edge-18.11.2


@grampelberg added the area/proxy, priority/P1 (Planned for Release), bug, and area/identity (Automatic transport identity) labels on Nov 28, 2018
@klingerf added this to To do in Post 2.1 Polish via automation on Nov 29, 2018
@hawkw
Member

hawkw commented Dec 3, 2018

Hi @jon-walton, I've tried to reproduce this using a self-signed certificate, and I haven't seen the delay (running Kubernetes in Docker Desktop for Mac).

Creating the TLS secret:

$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout ${KEY_FILE} -out ${CERT_FILE} -subj "/CN=${HOST}/O=${HOST}"
$ kubectl create secret tls example-tls --key ${KEY_FILE} --cert ${CERT_FILE}

Installing nginx-ingress:

$ helm template ./nginx-ingress --name ingress --set controller.extraArgs.default-ssl-certificate=default/example-tls | bin/linkerd inject --tls=optional --proxy-log-level=info,linkerd_proxy=trace - | kubectl apply -f -

Ingress configuration:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-ingress
  labels:
    app.kubernetes.io/name: test-ingress
spec:
  tls:
    - hosts:
        - example.com
      secretName: example-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: ingress-nginx-ingress-default-backend
              servicePort: http

Curl output (with timestamps, truncated):

$ curl -v https://127.0.0.1 -H "Host:example.com" 2>&1 -k | ts '[%Y-%m-%d %H:%M:%S]'
[2018-12-03 15:27:36] * Rebuilt URL to: https://127.0.0.1/
[2018-12-03 15:27:36]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[2018-12-03 15:27:36]                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1...
[2018-12-03 15:27:36] * TCP_NODELAY set
[2018-12-03 15:27:36] * Connected to 127.0.0.1 (127.0.0.1) port 443 (#0)
[2018-12-03 15:27:36] * ALPN, offering h2
[2018-12-03 15:27:36] * ALPN, offering http/1.1
[2018-12-03 15:27:36] * Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
[2018-12-03 15:27:36] * successfully set certificate verify locations:
[2018-12-03 15:27:36] *   CAfile: /etc/ssl/cert.pem
[2018-12-03 15:27:36]   CApath: none
[2018-12-03 15:27:36] * TLSv1.2 (OUT), TLS handshake, Client hello (1):
[2018-12-03 15:27:36] } [197 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS handshake, Server hello (2):
[2018-12-03 15:27:36] { [98 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS handshake, Certificate (11):
[2018-12-03 15:27:36] { [738 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
[2018-12-03 15:27:36] { [300 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS handshake, Server finished (14):
[2018-12-03 15:27:36] { [4 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
[2018-12-03 15:27:36] } [37 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (OUT), TLS change cipher, Client hello (1):
[2018-12-03 15:27:36] } [1 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (OUT), TLS handshake, Finished (20):
[2018-12-03 15:27:36] } [16 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS change cipher, Client hello (1):
[2018-12-03 15:27:36] { [1 bytes data]
[2018-12-03 15:27:36] * TLSv1.2 (IN), TLS handshake, Finished (20):
[2018-12-03 15:27:36] { [16 bytes data]
[2018-12-03 15:27:36] * SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
[2018-12-03 15:27:36] * ALPN, server accepted to use h2
[2018-12-03 15:27:36] * Server certificate:
[2018-12-03 15:27:36] *  subject: CN=example.com; O=example.com
[2018-12-03 15:27:36] *  start date: Dec  3 23:06:29 2018 GMT
[2018-12-03 15:27:36] *  expire date: Dec  3 23:06:29 2019 GMT
[2018-12-03 15:27:36] *  issuer: CN=example.com; O=example.com
[2018-12-03 15:27:36] *  SSL certificate verify result: self signed certificate (18), continuing anyway.
[2018-12-03 15:27:36] * Using HTTP2, server supports multi-use
[2018-12-03 15:27:36] * Connection state changed (HTTP/2 confirmed)
[2018-12-03 15:27:36] * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
[2018-12-03 15:27:36] * Using Stream ID: 1 (easy handle 0x7f8878800400)
[2018-12-03 15:27:36] > GET / HTTP/2
[2018-12-03 15:27:36] > Host:example.com
[2018-12-03 15:27:36] > User-Agent: curl/7.54.0
[2018-12-03 15:27:36] > Accept: */*
...

Is there anything I've overlooked? I notice that your example-svc seems to only support HTTP/1, while the default-backend prefers HTTP/2 --- I'll continue trying to reproduce this with an HTTP/1-only backend.
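
For anyone following along, a minimal HTTP/1-only backend to point the Ingress at could look like the following (names and image are illustrative, not taken from the original report):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-svc
  labels:
    app: example-svc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-svc
  template:
    metadata:
      labels:
        app: example-svc
    spec:
      containers:
        - name: nginx
          # plain nginx serves HTTP/1.1 only unless http2 is enabled in its config
          image: nginx:1.15
          ports:
            - name: http
              containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  selector:
    app: example-svc
  ports:
    - name: http
      port: 80
      targetPort: http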

@jon-walton
Contributor Author

Hi @hawkw,

Sorry for the slow reply...

I'm also not able to reproduce this in Docker for Windows/Mac. We're currently using Kubernetes 1.12.1 deployed via acs-engine, so I'll see if I can track down whether it's an Azure/acs-engine issue or a Kubernetes version issue.

@hawkw
Member

hawkw commented Dec 7, 2018

@jon-walton okay, thanks for the additional information. I'll look into reproducing the issue in Azure as well.

@hawkw
Member

hawkw commented Dec 7, 2018

Hmm, interesting --- attempting to reproduce this on Azure, I've noticed that, when injected with the proxy, the nginx ingress controller starts failing health checks and gets killed by Kubernetes. The new pod will also immediately fail health checks and get killed, so this continues endlessly. When the proxy isn't injected, the health checks seem to pass. I wonder if that's related?

Update: when TLS isn't enabled on the injected proxy, nginx doesn't fail the health checks. This only happens with --tls=optional.
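
The probe failures are easy to spot in the pod events, e.g. (assuming the chart's default app: nginx-ingress label):

$ kubectl describe pod -l app=nginx-ingress | grep -A 3 Unhealthy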

@hawkw
Member

hawkw commented Dec 7, 2018

After reinjecting several times, my health check is no longer failing, and I can confirm that there's about a 3-second delay between the Client hello and the Server hello:

curl --connect-to example.com:443:40.114.127.2:443 -vv https://example.com --cacert ca1.pem 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'
[2018-12-07 15:48:42] * Rebuilt URL to: https://example.com/
[2018-12-07 15:48:42] * Connecting to hostname: 40.114.127.2
[2018-12-07 15:48:42] * Connecting to port: 443
[2018-12-07 15:48:42]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[2018-12-07 15:48:42]                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 40.114.127.2...
[2018-12-07 15:48:42] * TCP_NODELAY set
[2018-12-07 15:48:42] * Local port: 6666
[2018-12-07 15:48:42] * Connected to 40.114.127.2 (40.114.127.2) port 443 (#0)
[2018-12-07 15:48:42] * ALPN, offering h2
[2018-12-07 15:48:42] * ALPN, offering http/1.1
[2018-12-07 15:48:42] * Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
[2018-12-07 15:48:42] * successfully set certificate verify locations:
[2018-12-07 15:48:42] *   CAfile: ca1.pem
[2018-12-07 15:48:42]   CApath: none
[2018-12-07 15:48:42] * TLSv1.2 (OUT), TLS handshake, Client hello (1):
[2018-12-07 15:48:42] } [217 bytes data]
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0* TLSv1.2 (IN), TLS handshake, Server hello (2):
[2018-12-07 15:48:45] { [102 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (IN), TLS handshake, Certificate (11):
[2018-12-07 15:48:45] { [443 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
[2018-12-07 15:48:45] { [115 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (IN), TLS handshake, Server finished (14):
[2018-12-07 15:48:45] { [4 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
[2018-12-07 15:48:45] } [37 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (OUT), TLS change cipher, Client hello (1):
[2018-12-07 15:48:45] } [1 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (OUT), TLS handshake, Finished (20):
[2018-12-07 15:48:45] } [16 bytes data]
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0* TLSv1.2 (IN), TLS change cipher, Client hello (1):
[2018-12-07 15:48:45] { [1 bytes data]
[2018-12-07 15:48:45] * TLSv1.2 (IN), TLS handshake, Finished (20):
[2018-12-07 15:48:45] { [16 bytes data]
[2018-12-07 15:48:45] * SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
[2018-12-07 15:48:45] * ALPN, server accepted to use h2
...

The linkerd proxy logs the following for that connection:

2018-12-07T23:48:46.406446138Z TRCE proxy={server=in listen=0.0.0.0:4143} linkerd2_proxy::transport::tls::conditional_accept match_client_hello: parsed correctly up to SNI; matches: NotMatched
2018-12-07T23:48:46.40869624Z DBUG proxy={server=in listen=0.0.0.0:4143 remote=10.244.0.1:6666} linkerd2_proxy::app::inbound::rewrite_loopback_addr rewriting inbound address to loopback; target=Target { addr: V4(10.244.0.30:443), tls: None(NoIdentity(NotHttp)), _p: () }
2018-12-07T23:48:46.40874274Z DBUG proxy={server=in listen=0.0.0.0:4143 remote=10.244.0.1:6666} linkerd2_proxy::transport::metrics::io client connection open
2018-12-07T23:48:46.621439281Z INFO proxy={server=in listen=0.0.0.0:4143 remote=10.244.0.1:6666} linkerd2_proxy::proxy::tcp forward duplex error: Connection reset by peer (os error 104)

I wonder if the delay is due to the proxy having to parse the client hello up to SNI in order to determine whether or not it should terminate TLS for the connection? I don't think that should take three whole seconds, though...
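
For reference, the trace lines above can be pulled straight from the injected sidecar with something like this (assuming the ingress release name from the install command above and the default linkerd-proxy container name):

$ kubectl logs deploy/ingress-nginx-ingress-controller -c linkerd-proxy | grep match_client_hello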

@hawkw
Member

hawkw commented Dec 11, 2018

I've tested the exact same reproduction on GKE, and there doesn't seem to be any significant delay between the ClientHello and the ServerHello, so this issue does appear to be Azure-specific.

@jon-walton
Contributor Author

OK, thanks. I'll see whether there's a difference between Azure CNI and kubenet (probably late this week or next week).

@hawkw
Member

hawkw commented Dec 11, 2018

@jon-walton great, keep me posted!

@klingerf moved this from To do to In progress in Post 2.1 Polish on Dec 13, 2018
@jon-walton
Contributor Author

jon-walton commented Dec 27, 2018

hi @hawkw

Happy Christmas, here's a present for you 🤣

Add externalTrafficPolicy: Local to the nginx controller Service (this is not set by default in the nginx helm chart), and there's no longer a delay when injecting linkerd into the ingress controller pods with TLS.

Edit: it also fixed the odd behavior where the controller pods would fail health checks with --tls=optional.
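
For anyone hitting the same thing, the setting can be applied either through the chart or by patching the existing Service; the value name below is from the stable/nginx-ingress chart and may differ in other versions, and the Service name is a placeholder:

# helm template ./nginx-ingress --set controller.service.externalTrafficPolicy=Local | linkerd inject --tls=optional - | kubectl apply -f -
# kubectl patch svc [nginx controller service] -p '{"spec":{"externalTrafficPolicy":"Local"}}'

The handshake after the change: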

09:26:04.858855 == Info:   Trying [ip]...
09:26:04.860391 == Info: Connected to [url] ([ip]) port 443 (#0)
09:26:04.869530 == Info: found 148 certificates in /etc/ssl/certs/ca-certificates.crt
09:26:04.913916 == Info: found 592 certificates in /etc/ssl/certs
09:26:04.914011 == Info: ALPN, offering http/1.1
09:26:04.961205 == Info: SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
09:26:04.962189 == Info:         server certificate verification OK
09:26:04.962216 == Info:         server certificate status verification SKIPPED
09:26:04.962315 == Info:         common name: [url] (matched)
09:26:04.962334 == Info:         server certificate expiration date OK
09:26:04.962346 == Info:         server certificate activation date OK
09:26:04.962363 == Info:         certificate public key: RSA
09:26:04.962374 == Info:         certificate version: #3
09:26:04.962394 == Info:         subject: CN=[url]
09:26:04.962410 == Info:         start date: Mon, 24 Dec 2018 02:44:26 GMT
09:26:04.962423 == Info:         expire date: Sun, 24 Mar 2019 02:44:26 GMT
09:26:04.962453 == Info:         issuer: C=US,O=Let's Encrypt,CN=Let's Encrypt Authority X3
09:26:04.962503 == Info:         compression: NULL
09:26:04.962519 == Info: ALPN, server accepted to use http/1.1
09:26:04.962596 => Send header, 126 bytes (0x7e)
0000: HEAD [path] HTTP/1.1
002c: Host: [url]
0056: User-Agent: curl/7.47.0
006f: Accept: */*
007c:
09:26:04.966213 <= Recv header, 33 bytes (0x21)
0000: HTTP/1.1 405 Method Not Allowed
HTTP/1.1 405 Method Not Allowed
09:26:04.966268 <= Recv header, 37 bytes (0x25)
0000: Date: Thu, 27 Dec 2018 09:26:04 GMT
Date: Thu, 27 Dec 2018 09:26:04 GMT
09:26:04.966294 <= Recv header, 19 bytes (0x13)
0000: Content-Length: 0
Content-Length: 0
09:26:04.966331 <= Recv header, 24 bytes (0x18)
0000: Connection: keep-alive
Connection: keep-alive
09:26:04.966371 <= Recv header, 12 bytes (0xc)
0000: allow: GET
allow: GET
09:26:04.966384 <= Recv header, 68 bytes (0x44)
0000: request-context: appId=cid-v1:618c6dfc-0116-44d5-8341-44fbc2af2f
0040: 80
request-context: appId=cid-v1:618c6dfc-0116-44d5-8341-44fbc2af2f80
09:26:04.966405 <= Recv header, 64 bytes (0x40)
0000: Strict-Transport-Security: max-age=15724800; includeSubDomains
Strict-Transport-Security: max-age=15724800; includeSubDomains

09:26:04.966425 <= Recv header, 2 bytes (0x2)
0000:
09:26:04.966441 == Info: Connection #0 to host [url] left intact

@jon-walton
Contributor Author

I spoke too soon on the health checks. The pods fell over when I put a few hundred RPS through them. Something for tomorrow, I think 🤣

@jon-walton
Contributor Author

@hawkw So after further testing, and with the ingress controllers running mostly OK for the past week, I believe the slowness is related to not setting externalTrafficPolicy: Local, while the other crashes are most likely due to #2012.

Let's put this issue on ice for now and I'll come back to it once #2012 is out of the way 👍

@hawkw
Member

hawkw commented Jan 5, 2019

@jon-walton sounds good to me. I think #2012 will be resolved soon!

@jon-walton
Contributor Author

I think this can be closed now. Both externalTrafficPolicy: Local and #2012 were needed 👍

Post 2.1 Polish automation moved this from In progress to Done Jan 21, 2019
@hawkw
Member

hawkw commented Jan 22, 2019

@grampelberg do you think we should document that externalTrafficPolicy: Local should be set when injecting nginx ingress controllers?

@grampelberg
Contributor

@hawkw it looks like there isn't a ton of support for externalTrafficPolicy: Local. Why does it fix the problem?
