Ingress gateway Envoy Memory/Connections leak #13355

Closed
howardjohn opened this issue Apr 15, 2019 · 20 comments

8 participants
@howardjohn
Member

commented Apr 15, 2019

Running with ingressgateway-sds on 1.1.1

Memory usage grows ~2.5 GB over several days.

Envoy stats show a very large number of connections:

server.memory_allocated: 421730408
server.memory_heap_size: 2643460096
server.parent_connections: 0
server.total_connections: 83036
server.uptime: 580934
server.version: 5621769
server.watchdog_mega_miss: 2
server.watchdog_miss: 7
root@istio-ingressgateway-598b5d6fdc-95ftf:/# netstat -npa | grep envoy | wc -l
83469
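
For readers who want the raw counters above in friendlier units, here is a minimal Go sketch. It assumes the stats arrive as plain `name: value` lines, as in the paste above; `parseStats` is a hypothetical helper written for this note, not part of Envoy or Istio.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStats reads "name: value" lines and keeps the entries whose value
// is an unsigned integer. Lines in any other shape are skipped.
func parseStats(text string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		// Split on the last colon so that names containing ':' still parse.
		i := strings.LastIndex(line, ":")
		if i < 0 {
			continue
		}
		n, err := strconv.ParseUint(strings.TrimSpace(line[i+1:]), 10, 64)
		if err != nil {
			continue // non-integer value, ignore
		}
		stats[strings.TrimSpace(line[:i])] = n
	}
	return stats
}

func main() {
	// Values copied from the report above.
	s := parseStats(`server.memory_allocated: 421730408
server.memory_heap_size: 2643460096
server.total_connections: 83036
server.uptime: 580934`)

	fmt.Printf("heap size:   %.2f GiB\n", float64(s["server.memory_heap_size"])/(1<<30))
	fmt.Printf("allocated:   %.2f MiB\n", float64(s["server.memory_allocated"])/(1<<20))
	fmt.Printf("uptime:      %.1f days\n", float64(s["server.uptime"])/86400)
	fmt.Printf("connections: %d\n", s["server.total_connections"])
}
```

With the numbers in this report, that works out to a heap of roughly 2.5 GiB after a little under a week of uptime.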

Config dump doesn't show anything interesting

cc @jcetkov @mandarjog @JimmyCYJ

@costinm

Contributor

commented Apr 16, 2019

What are the connections? Can you include a sample? If they are from external hosts, that's quite normal.

@JimmyCYJ

Contributor

commented Apr 16, 2019

Is the memory usage from ingress-sds or istio-proxy (Envoy)? Did you build the ingress-sds image with the latest fix (#13251)? Without that fix, performance degrades.

@howardjohn

Member Author

commented Apr 18, 2019

The cause is that Envoy isn't closing connections.

@howardjohn

Member Author

commented Apr 22, 2019

@JimmyCYJ what ever came out of this?

@jcetkov

commented Apr 22, 2019

I believe the conclusion was that the agent has to expose Envoy's idle connection timeout property and set it to a more reasonable value than the default of 'no timeout' (probably for both the ingress gateway and the injected sidecar).

@JimmyCYJ

Contributor

commented Apr 22, 2019

Let me check the timeout settings in the gateway agent. We already have long-running tests for workload SDS, and I am not aware of such an issue.

@JimmyCYJ

Contributor

commented Apr 22, 2019

@jcetkov May I ask for more context about the issue? I remember last time we found that there were lots of connections to port 80 which caused the issue. Are there any new findings after that?

@howardjohn

Member Author

commented Apr 22, 2019

@JimmyCYJ Open a bunch of connections to the ingress gateway with netcat or a similar tool, then look at :15000/stats on the gateway. You will see server.total_connections: some_big_number. The connections are never closed.

@jcetkov

commented Apr 22, 2019

In essence, if a client opens a keep-alive connection and abandons it, the connection hangs there, possibly indefinitely. We observed this going up to ~100k connections, at which point we were forced to restart the gateway: the connections were coming through NAT from a limited set of IPs, which started causing port collisions and timeouts.

@JimmyCYJ

Contributor

commented Apr 22, 2019

I need to take a look at the ingress-sds log to find out whether the number of SDS connections grows. By default it should keep one connection, since SDS uses a gRPC streaming service over a single connection.

@howardjohn

Member Author

commented Apr 22, 2019

I don't think it's related to SDS; the same thing happens without SDS.

@JimmyCYJ

Contributor

commented Apr 22, 2019

@howardjohn could you ask Costin or other ingress gateway experts from networking team to take a look? I don't think this is related to SDS, since it happens without SDS.

JimmyCYJ assigned costinm and unassigned JimmyCYJ Apr 22, 2019

@JimmyCYJ

Contributor

commented Apr 22, 2019

Tentatively reassign to Costin for triage.

JimmyCYJ changed the title from "Ingress SDS Envoy Memory/Connections leak" to "Ingress gateway Envoy Memory/Connections leak" Apr 22, 2019

duderino assigned PiotrSikora and unassigned costinm Apr 22, 2019

@duderino

Contributor

commented Apr 22, 2019

@silentdai will take a look (while I figure out how to get him added to the GitHub org so I can assign issues to him). @PiotrSikora is OOO this week but should take a look too when he can.

@howardjohn

Member Author

commented Apr 22, 2019

// connectionManager.IdleTimeout = &notimeout

Setting this will close the connections. I'm not sure what the difference is between that and the stream idle timeout, or what impact it may have on other areas, though. @silentdai

@jcetkov

commented Apr 22, 2019

I don't think this requires much code change, just allow the envoy wrapper (agent) to pass in the configuration @howardjohn mentioned (and possibly update the helm charts to use it by default)

@silentdai

Member

commented Apr 22, 2019

@JimmyCYJ @howardjohn Thank you for confirming that it's not an SDS issue. That makes life easier :)
@fpesce No action needed from you. This may (or may not) be related to the keepalive bug you discovered.

@duderino

Contributor

commented Apr 23, 2019

Current state of this bug is that we have exposed the ability to set idle timeouts in #13515 which will be in 1.1.4, but it's currently opt-in.

@geeknoid

Contributor

commented Apr 24, 2019

Can we close this, given #13515, or is there still more work needed?

@howardjohn

Member Author

commented Apr 24, 2019

In 1.1.4 set it with --set gateways.istio-ingressgateway.env.ISTIO_META_IDLE_TIMEOUT=5s

The value should be in Go duration format.

howardjohn closed this Apr 24, 2019
