404 NR when using browser on multiple ingress gateways #9429

Closed
yciabaud opened this issue Oct 19, 2018 · 28 comments

@yciabaud

yciabaud commented Oct 19, 2018

Describe the bug
When using browsers (tested on multiple browsers, multiple OS, multiple devices and multiple connections), we are having many 404 NR responses when contacting our services on an additional ingress gateway.

The service loads correctly whenever I clear the browser cache, and when using command-line clients.

Routing is done with hostname in separate gateway and virtual services configurations. All services are exposed on port 443 using different port names but the same TLS certificate per gateway.
The ingress log prints the correct hostname, but Envoy does not find any matching route.

Expected behavior
We expect to have the requests routed to the service or a way to find if there is something missing in the configuration.

Steps to reproduce the bug

  • Install another ingress gateway using helm
  • Deploy 2 services and expose them using a gateway with TLS on port 443
  • Use a browser to access the services

Version
Kubernetes v1.10.5 on AWS (same issue on v1.9.9)
Calico
Istio: 1.0.2

Installation
Istio installed using helm chart with a first ingress gateway.
Second gateway installed using helm in a namespace.
Services installed using helm charts.

Environment
Kubernetes deployed on AWS using Kops with coreos images.

Cluster state
istio-dump.tar.gz
I could not dump pods and deployments due to private information in environment variables

@bernardmo

We faced the same issue, and it seems to be connected to an open TCP connection from the browser. We ended up using a single gateway with multiple services.

Scenario:

  • Use wildcard certificate (*.test.com)
  • Have 2 different services where each service has its own gateway (service1.test.com, service2.test.com)
  • Opening tab with service1.test.com works fine
  • Opening tab with service2.test.com results with 404 error

Checking TCP connections (used incognito window on chrome and connection list from chrome://net-internals/#sockets):

  • After opening a tab with service1.test.com, a TCP connection is created and it stays open even after the request is finished (even though there's no keep-alive connection setting)
  • After opening a tab with service2.test.com, no new TCP connection is created, so it must be reusing the same connection
  • Force-closing all connections and refreshing the tab with service2.test.com works fine (but service1.test.com then fails)

@yciabaud
Author

Thanks @bernardmo, this is exactly the issue we are facing! I am trying your workaround and I will let you know.

@yciabaud
Author

I can confirm that this workaround is effective but managing all services in the same configuration object makes automation really difficult.

Thanks again @bernardmo for your help.

@mandarjog
Contributor

Can you provide the output of
'curl http://localhost:15000/config_dump' from the ingress gateway?

@rshriram this looks like Envoy making routing decisions per connection, based on SNI, rather than per request.

@stirno

stirno commented Nov 18, 2018

I'm fairly certain I am now encountering this in a 1.1 test cluster as well... As soon as I added a second service I began seeing 404s from the browser. Whichever service I hit first in a browser session would work, following requests to other services would 404 consistently.

@blackbaud-brandonstirnaman

@mandarjog My cluster is currently using a 1.1 daily build. About to update to a newer daily to verify it still happens.

Config dump from current cluster: https://gist.github.com/blackbaud-brandonstirnaman/a21d578814ba4abe54eda480b5a95674

Edit: Can reproduce on latest daily as well.

@rshriram
Member

Sorry I missed this. When you said two ingress gateways, what are their gateway specs like? Do they have distinct ways to be identified using the selector labels in the gateway?

@rshriram
Member

Also what do you mean by adding a service to the gateway? Did you mean adding a gateway spec and a virtual service?

@yciabaud
Author

yciabaud commented Nov 19, 2018

Well it looks like this is not related to having multiple ingressgateway instances.

This occurs when you have 2 VirtualServices configured on 2 subdomains and each has its own Gateway configured on that subdomain.
The ingressgateway is using a wildcard TLS certificate and the handshakes are OK, but persistent connections from the browser fail when switching from one service to the other.

Am I clear @rshriram?

@hpohl

hpohl commented Nov 21, 2018

Can confirm, istio 1.0.3, same behavior. Going to chrome://net-internals/#sockets and clicking Close idle sockets closes the connection and lets you 'switch' services.

@frankbu
Contributor

frankbu commented Nov 21, 2018

@yciabaud Let me try to describe an example that fits the problem scenario.

  • wildcard certificate *.test.com installed in istio-ingressgateway
  • Gateway configuration gw1 with host service1.test.com, selector istio: ingressgateway, and tls config using ingressgateway's mounted (wildcard) certs
  • Gateway configuration gw2 with host service2.test.com, selector istio: ingressgateway, and tls config using ingressgateway's mounted (wildcard) certs
  • VirtualService configuration vs1 with host service1.test.com and gateway gw1
  • VirtualService configuration vs2 with host service2.test.com and gateway gw2
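
For readers following along, the scenario listed above could be written as manifests roughly like the following. This is only a sketch, not the reporter's actual configuration: the port name, the certificate paths (here the default mount of the istio-ingressgateway-certs secret), and the backend service name and port are assumptions made for illustration.

```yaml
# Sketch only. gw2 and vs2 are assumed to be identical except for their
# names, the port name, and the host service2.test.com.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: gw1
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https-service1        # assumed port name
      protocol: HTTPS
    hosts:
    - service1.test.com
    tls:
      mode: SIMPLE
      # Assumed: the wildcard (*.test.com) cert mounted in the ingress gateway pod
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vs1
spec:
  hosts:
  - service1.test.com
  gateways:
  - gw1
  http:
  - route:
    - destination:
        host: service1            # hypothetical Kubernetes Service name
        port:
          number: 8080            # hypothetical service port
```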

Am I correct that this describes a configuration that will have the connection problem?

If so, I'm wondering why the two Gateway configurations (gw1 and gw2) are used, given that they are both using the same wildcard certs. Wouldn't it make more sense to have one Gateway and bind both VirtualServices to it like this:

  • Gateway configuration gw with host *.test.com, selector istio: ingressgateway, and tls config using ingressgateway's mounted (wildcard) certs
  • VirtualService configuration vs1 with host service1.test.com and gateway gw
  • VirtualService configuration vs2 with host service2.test.com and gateway gw
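
A minimal sketch of this single-Gateway layout, assuming the wildcard certificate is mounted at the default ingress gateway path and that the backends are Kubernetes Services named service1 and service2 on port 8080 (all of these are illustrative assumptions, not taken from the thread):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: gw
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*.test.com"                # quoted: a bare * starts a YAML alias
    tls:
      mode: SIMPLE
      # Assumed default mount path of the wildcard cert
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vs1
spec:
  hosts:
  - service1.test.com
  gateways:
  - gw                            # both VirtualServices bind to the same Gateway
  http:
  - route:
    - destination:
        host: service1            # hypothetical backend
        port:
          number: 8080
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vs2
spec:
  hosts:
  - service2.test.com
  gateways:
  - gw
  http:
  - route:
    - destination:
        host: service2            # hypothetical backend
        port:
          number: 8080
```

With this layout, adding a new host means adding only a VirtualService; the Gateway itself does not change.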

Would that have the same problem, or is the problem only because there are two Gateways using the same certs?

@blackbaud-brandonstirnaman

Changed my Gateway configurations to make sure 2 gateways don't share a certificate (as described above by @frankbu), and I no longer have the issue. The downside is that I'm unable to differentiate things like the TLS protocol for services that use this cert, but at least I can connect to services correctly now.

@yciabaud
Author

Well @frankbu, the original need was to have each service work independently of the certificates used in the gateways, so each exposed service has its own gateway on a common ingressgateway. It may not be the way it was designed, but we wanted to define each host without a wildcard in the gateways and we wanted each service to be configured independently.

I will try a different configuration, but this side effect still looks like an issue.

@frankbu
Contributor

frankbu commented Nov 26, 2018

@yciabaud I think it also works if each gateway uses its own certs, as is shown in this example: https://preliminary.istio.io/docs/tasks/traffic-management/secure-ingress/#configure-a-tls-ingress-gateway-for-multiple-hosts. Unless, maybe that also has the same browser issue? I'm trying to understand exactly what does and doesn't work so we can figure out if it's a bug that can be fixed, or an intrinsic limitation.

@yciabaud
Author

OK, I will try using separate certs for each host and then I will try to duplicate my wildcard cert in the gateway.

@yciabaud
Author

yciabaud commented Dec 6, 2018

@frankbu I can confirm that using different certificates on different gateways is working as expected.
The issue is only related to having a wildcard certificate used in 2 gateways on 2 hosts.
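
As an illustration of the working variant, two Gateways with distinct per-host certificates might look like the sketch below. The certificate paths are hypothetical and assume that a separate secret for each host has been mounted into the ingress gateway pod; the port names are also assumptions.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: gw1
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https-service1
      protocol: HTTPS
    hosts:
    - service1.test.com
    tls:
      mode: SIMPLE
      # Hypothetical mount path: cert valid for service1.test.com only
      serverCertificate: /etc/istio/service1-certs/tls.crt
      privateKey: /etc/istio/service1-certs/tls.key
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: gw2
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https-service2
      protocol: HTTPS
    hosts:
    - service2.test.com
    tls:
      mode: SIMPLE
      # Hypothetical mount path: cert valid for service2.test.com only
      serverCertificate: /etc/istio/service2-certs/tls.crt
      privateKey: /etc/istio/service2-certs/tls.key
```

Presumably this works because neither certificate covers the other host, so a browser has no grounds to reuse a connection opened for one host when contacting the other.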

@rshriram
Member

rshriram commented Dec 7, 2018

@yciabaud can you give me the envoy configuration from the gateway when you have the wildcard cert?

curl localhost:15000/config_dump should be sufficient. I am specifically looking for the configuration of the listeners on the 443 port that is hosting the two services with single wildcard cert.

@rshriram
Member

rshriram commented Dec 7, 2018

How long does this issue last after adding the new gateway with the same wildcard cert?

@yciabaud
Author

yciabaud commented Dec 7, 2018

I don't have access to my cluster ATM but the issue occurs right at the moment we add the second gateway.
The persistent connection for the first service in the browser conflicts with the second one and it lasts as long as I could wait.

@rshriram
Member

rshriram commented Dec 7, 2018

Okay, now I know the issue and only part of it is solvable with Istio.
Both services (service1.test.com and service2.test.com) resolve to the same IP, and service1.test.com is already returning a wildcard cert (*.test.com), indicating that connections to service2.test.com can use the same cert. So both Chrome and Firefox reuse the existing connection to talk to the gateway.

In the background, the gateway has been updated with new configs (for service2.test.com). This causes Envoy to spin up new listener threads (crudely speaking) while the old one is still serving the browser with service1.test.com. The new one has both, but no one is talking to it yet. When you clear the browser cache, you are effectively releasing the old one, allowing Envoy to shut down the older listener and always use the new one.

These are H2 semantics, unfortunately: https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/

Now, we could partially mitigate this problem by not doing a listener update in Envoy (just a route update should work for your use case). But doing so involves making the following compromise:

  1. Two hosts with the same path to the certificate (i.e. what you are doing today) on the same port:
     we treat these as the same and merge them, i.e. two gateway specs like

gateway:
  servers:
  - port:
      number: 443
      name: https1
      protocol: HTTPS
    hosts:
    - service1.test.com
    tls:
      mode: SIMPLE
      serverCertificate: foo/bar/cert.pem
      privateKey: foo/bar/privatekey
---
gateway:
  servers:
  - port:
      number: 443
      name: https1
      protocol: HTTPS
    hosts:
    - service2.test.com
    tls:
      mode: SIMPLE
      serverCertificate: foo/bar/cert.pem
      privateKey: foo/bar/privatekey

will be treated effectively as one gateway

gateway:
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - service1.test.com
    - service2.test.com
    tls:
      mode: SIMPLE
      serverCertificate: foo/bar/cert.pem
      privateKey: foo/bar/privatekey

This is the best we can do code wise. That said, I would suggest the following alternative. Instead of creating redundant gateways, create a gateway once for every unique certificate and use the certificate's domain (in your case *.test.com). For example

name: test-com-wildcard-gateway
gateway:
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*.test.com"
    tls:
      # ...wildcard certs matching *.test.com

And use individual virtual services for each host, i.e. one for service1.test.com, and another for service2.test.com, both referring to the same gateway. This way, you have to create the gateway only once. You can keep adding virtual services for the hosts you expose out of the *.test.com domain. And you can refer to this gateway in your individual virtual services.

VirtualService:
  hosts:
  - service1.test.com # all the fun stuff for service1.test.com routing
  gateways:
  - test-com-wildcard-gateway # bind to this gateway so that you use the wildcard cert
  http:
    ...
---
VirtualService:
  hosts:
  - service2.test.com # all the fun stuff for service2.test.com routing
  gateways:
  - test-com-wildcard-gateway # bind to this gateway so that you use the wildcard cert
  http:
    ...

@yciabaud
Author

yciabaud commented Dec 7, 2018

Thank you @rshriram, you're right and I get it now. The first solution mitigates the problem, but I will follow your advice and use a wildcard gateway even if this is less easy to automate for me.

I guess this issue can be closed now since your investigation explained the root cause and that there is no perfect solution.

Thinking about it, maybe Istio validation should reject creating 2 gateways with the same certificate, since it may lead to this issue.

@frankbu
Contributor

frankbu commented Dec 12, 2018

Added documentation for this: istio/istio.io#2970

@yciabaud
Author

Closing since the problem is identified and there is no solution to provide.
Thanks @frankbu for updating the documentation.

@trevorlinton

trevorlinton commented Apr 18, 2019

Not to be a pest, but is there a way to have a wildcard certificate (*.example.com) on one gateway and a regular certificate (foo.example.com) on another gateway, such that Chromium/Firefox will not accidentally route down the wildcard connection when both are on the same ingress-gateway? We encountered this problem, and the workaround proposed in this ticket didn't help (because, obviously, we can't use two different certificates on one gateway).

I also have opened a bug on chromium to see if there's some consensus on how to properly solve this issue (or at least provoke some dialog).

https://bugs.chromium.org/p/chromium/issues/detail?id=954160

@haf

haf commented Apr 14, 2020

This issue should be reopened with an external ref envoyproxy/envoy#6767

@istio/wg-networking-maintainers

@haf

haf commented Apr 15, 2020

Here's the CVE for this vulnerability https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11767

@craigbox
Contributor

/reopen

@craigbox craigbox reopened this Apr 20, 2020
@howardjohn
Member

This is tracked in #13589
