Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python memory leaks on exit with outstanding calls #7121

Closed
kpayson64 opened this issue Jun 28, 2016 · 24 comments
Closed

Python memory leaks on exit with outstanding calls #7121

kpayson64 opened this issue Jun 28, 2016 · 24 comments
Assignees

Comments

@kpayson64
Copy link
Contributor

kpayson64 commented Jun 28, 2016

When python exits with outstanding calls, a hanging reference at the Cython level prevents proper cleanup, leading to a memory leak warning.

D0628 16:18:07.661735081   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:08.663269688   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:09.664535323   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:10.665863923   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:11.667386620   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:12.668871835   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:13.670009823   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:14.671078765   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:15.672284821   10465 iomgr.c:99]                 Waiting for 3 iomgr objects to be destroyed
D0628 16:18:16.673620721   10465 iomgr.c:118]                Failed to free 3 iomgr objects before shutdown deadline: memory leaks are likely
D0628 16:18:16.673701411   10465 iomgr.c:80]                 LEAKED OBJECT: tcp-client:ipv4:216.58.194.106:443 fd=6 0x7fea300045a8
D0628 16:18:16.673723755   10465 iomgr.c:80]                 LEAKED OBJECT: tcp-client:ipv4:216.58.218.138:443 fd=12 0x7fea30005168
D0628 16:18:16.673746282   10465 iomgr.c:80]                 LEAKED OBJECT: tcp-client:ipv4:216.58.194.42:443 fd=23 0x7fea30006888
D0628 16:18:16.673830891   10465 metadata.c:238]             WARNING: 1 metadata elements were leaked
D0628 16:18:16.673872448   10465 metadata.c:238]             WARNING: 3 metadata elements were leaked
D0628 16:18:16.673903482   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.673927907   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.673955019   10465 metadata.c:238]             WARNING: 1 metadata elements were leaked
D0628 16:18:16.673982805   10465 metadata.c:238]             WARNING: 9 metadata elements were leaked
D0628 16:18:16.674006559   10465 metadata.c:238]             WARNING: 3 metadata elements were leaked
D0628 16:18:16.674031913   10465 metadata.c:238]             WARNING: 3 metadata elements were leaked
D0628 16:18:16.674055336   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.674124369   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.674160625   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.674188014   10465 metadata.c:238]             WARNING: 3 metadata elements were leaked
D0628 16:18:16.674212776   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.674313468   10465 metadata.c:238]             WARNING: 7 metadata elements were leaked
D0628 16:18:16.674342804   10465 metadata.c:238]             WARNING: 4 metadata elements were leaked
D0628 16:18:16.674363545   10465 metadata.c:238]             WARNING: 2 metadata elements were leaked
D0628 16:18:16.674384519   10465 metadata.c:251]             WARNING: 2 metadata strings were leaked
D0628 16:18:16.674404965   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:40 GMT
D0628 16:18:16.674425889   10465 metadata.c:255]             LEAKED: 3
D0628 16:18:16.674449029   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.674470734   10465 metadata.c:255]             LEAKED: 443:quic
D0628 16:18:16.674491956   10465 metadata.c:251]             WARNING: 2 metadata strings were leaked
D0628 16:18:16.674518349   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:18:05 GMT
D0628 16:18:16.674541054   10465 metadata.c:255]             LEAKED: quic=":443"; ma=2592000; v="34,33,32,31,30,29,28,27,26,25"
D0628 16:18:16.674566538   10465 metadata.c:251]             WARNING: 5 metadata strings were leaked
D0628 16:18:16.674587736   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:55 GMT
D0628 16:18:16.674609252   10465 metadata.c:255]             LEAKED: /google.logging.v2.ConfigServiceV2/CreateSink
D0628 16:18:16.674633129   10465 metadata.c:255]             LEAKED: x-goog-api-client
D0628 16:18:16.674654447   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:35 GMT
D0628 16:18:16.674677346   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:29 GMT
D0628 16:18:16.674699780   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.674719762   10465 metadata.c:255]             LEAKED: Requested entity was not found.
D0628 16:18:16.674741234   10465 metadata.c:251]             WARNING: 2 metadata strings were leaked
D0628 16:18:16.674763658   10465 metadata.c:255]             LEAKED: logging.googleapis.com:443
D0628 16:18:16.674786116   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:18:02 GMT
D0628 16:18:16.674809218   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.674831358   10465 metadata.c:255]             LEAKED: 29600m
D0628 16:18:16.674853004   10465 metadata.c:251]             WARNING: 3 metadata strings were leaked
D0628 16:18:16.674875729   10465 metadata.c:255]             LEAKED: /google.logging.v2.LoggingServiceV2/ListLogEntries
D0628 16:18:16.674933951   10465 metadata.c:255]             LEAKED: Bearer ya29.CjAPA_i1yKBinTNZ2b9wS5FgZElPA8fZG4Xg24nF9mkKcyVlTSNZvNGKGLQVKah_X0U
D0628 16:18:16.674954541   10465 metadata.c:255]             LEAKED: 5
D0628 16:18:16.674974456   10465 metadata.c:251]             WARNING: 2 metadata strings were leaked
D0628 16:18:16.674993382   10465 metadata.c:255]             LEAKED: alt-svc
D0628 16:18:16.675010531   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:36 GMT
D0628 16:18:16.675029725   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675048026   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:56 GMT
D0628 16:18:16.675066608   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675084165   10465 metadata.c:255]             LEAKED: 30S
D0628 16:18:16.675104658   10465 metadata.c:251]             WARNING: 5 metadata strings were leaked
D0628 16:18:16.675123718   10465 metadata.c:255]             LEAKED: alternate-protocol
D0628 16:18:16.675142083   10465 metadata.c:255]             LEAKED: attachment
D0628 16:18:16.675159526   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:59 GMT
D0628 16:18:16.675177156   10465 metadata.c:255]             LEAKED: Invalid resource: id is missing
D0628 16:18:16.675194886   10465 metadata.c:255]             LEAKED: /google.logging.v2.MetricsServiceV2/ListLogMetrics
D0628 16:18:16.675217420   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675236510   10465 metadata.c:255]             LEAKED: /google.logging.v2.MetricsServiceV2/CreateLogMetric
D0628 16:18:16.675259646   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675278980   10465 metadata.c:255]             LEAKED: /google.logging.v2.LoggingServiceV2/WriteLogEntries
D0628 16:18:16.675306325   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675325605   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:58 GMT
D0628 16:18:16.675345407   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675364507   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:48 GMT
D0628 16:18:16.675384439   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675402924   10465 metadata.c:255]             LEAKED: Requested metric was not found
D0628 16:18:16.675422288   10465 metadata.c:251]             WARNING: 4 metadata strings were leaked
D0628 16:18:16.675440743   10465 metadata.c:255]             LEAKED: /google.logging.v2.LoggingServiceV2/DeleteLog
D0628 16:18:16.675520524   10465 metadata.c:255]             LEAKED: /google.logging.v2.MetricsServiceV2/UpdateLogMetric
D0628 16:18:16.675546900   10465 metadata.c:255]             LEAKED: Bearer ya29.CjAPAzw3OKeovYMPda0Dx5OOoAtN8irv6v_WXnxPZqHU0Ryo2cs_gyxXbnVdnrUaGh8
D0628 16:18:16.675567006   10465 metadata.c:255]             LEAKED: /google.logging.v2.ConfigServiceV2/GetSink
D0628 16:18:16.675588591   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675608092   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:18:00 GMT
D0628 16:18:16.675626798   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675644204   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:18:06 GMT
D0628 16:18:16.675663752   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675682391   10465 metadata.c:255]             LEAKED: Python-gRPC-0.15.0 grpc-c/0.15.0 (manylinux; chttp2)
D0628 16:18:16.675703338   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675722238   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:31 GMT
D0628 16:18:16.675740482   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675758867   10465 metadata.c:255]             LEAKED: /google.logging.v2.ConfigServiceV2/DeleteSink
D0628 16:18:16.675778098   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675794892   10465 metadata.c:255]             LEAKED: /google.logging.v2.ConfigServiceV2/UpdateSink
D0628 16:18:16.675813321   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.675831682   10465 metadata.c:255]             LEAKED: gax/0.12.1 gapic/0.1.0 gax/0.12.1 python/2.7.11
D0628 16:18:16.675852964   10465 metadata.c:251]             WARNING: 6 metadata strings were leaked
D0628 16:18:16.675870778   10465 metadata.c:255]             LEAKED: /google.logging.v2.ConfigServiceV2/ListSinks
D0628 16:18:16.675891221   10465 metadata.c:255]             LEAKED: Requested sink was not found
D0628 16:18:16.675909202   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:18:01 GMT
D0628 16:18:16.675927607   10465 metadata.c:255]             LEAKED: Bearer ya29.CjAPA5-BwfB1hVHRXhtfoCZPg6D-7EYUUxcVTSPm3hvnTOd3Z25uwnFbd1MSyH_Tz58
D0628 16:18:16.675945394   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:27 GMT
D0628 16:18:16.675963916   10465 metadata.c:255]             LEAKED: /google.logging.v2.MetricsServiceV2/DeleteLogMetric
D0628 16:18:16.675984275   10465 metadata.c:251]             WARNING: 2 metadata strings were leaked
D0628 16:18:16.676002156   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:53 GMT
D0628 16:18:16.676017818   10465 metadata.c:255]             LEAKED: Tue, 28 Jun 2016 20:17:39 GMT
D0628 16:18:16.676033360   10465 metadata.c:251]             WARNING: 1 metadata strings were leaked
D0628 16:18:16.676048260   10465 metadata.c:255]             LEAKED: /google.logging.v2.MetricsServiceV2/GetLogMetric
@nathanielmanistaatgoogle
Copy link
Member

What's the reference cycle?

@kpayson64
Copy link
Contributor Author

Sorry, I mispoke. It is a hanging reference. When we put an operation on the wire, we manually call Py_INCREF() on the operation tag which holds a reference to the call. The call never gets cancelled when it goes out of context, and so we never retrieve the operation tag on the other side of the completion queue.

@nathanielmanistaatgoogle
Copy link
Member

I think I'm still not understanding "When python exits with outstanding calls" - doesn't the use of a cleanup thread inside the channel mean that calls will be cancelled when that thread is joined?

Are we seeing this leak-on-exit in the output of our unit tests? If so, which ones? If not, what would a unit test to exercise this defect look like? Or a sample user story, if a unit test would be too weird?

How certain are we that leak-on-exit-and-spew-some-logging is a GA blocker? To me it sounds like a minor annoyance that would be merely nice to fix.

@kpayson64
Copy link
Contributor Author

Yes, the _exit_tests.py show this behavior.

gcloud + gax feels like a tier 1 use case to me. I'm amenable to pushing this off the GA milestone if you disagree.

@nathanielmanistaatgoogle
Copy link
Member

I'm still not understanding the defect itself - why doesn't the call get cancelled? It's a managed call, so why doesn't the cleanup thread cancel it? You mention a Py_INCREF on the operation tag and the tag holding a reference to the call, but why isn't there a corresponding Py_DECREF on that same tag? What's happening that the completion queue isn't getting drained at exit? Or is it, but somehow the tags that come out of it aren't getting Py_DECREFed as they should be?

@kpayson64
Copy link
Contributor Author

kpayson64 commented Jun 30, 2016

Sorry, to clarify this is for synchronous calls. They cannot use the cleanup thread.

@nathanielmanistaatgoogle
Copy link
Member

Synchronous response-unary calls?

@kpayson64
Copy link
Contributor Author

Yes.

@nathanielmanistaatgoogle
Copy link
Member

... and what's the interpreter exit mechanism? Ctrl-C at the keyboard, or a signal, or something the application initiates?

@kpayson64
Copy link
Contributor Author

Ctrl-C at the keyboard. In the tests it is simulated as a SIGINT sent to the process, but the intent is for Ctrl-C interpreter exit. I haven't tested calling sys.exit(0) from another thread, but I imagine it would have a similar effect.

@nathanielmanistaatgoogle
Copy link
Member

I'm just not making the connections I need to make. I'm not getting "the operation tag which holds a reference to the call", because for synchronous (blocking) response-unary calls we use None as the tag (right?). None certainly doesn't hold a reference to the call, right?

@kpayson64
Copy link
Contributor Author

The operation tag reference is done at the Cython layer:
https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_cython/_cygrpc/call.pyx.pxi#L46

@kpayson64
Copy link
Contributor Author

This issue could probably be solved by removing only the Cython reference, and leaving our reference to managed calls to be cleaned up by the CleanupThread.

I'd like to move in the direction of less state tracking because of possible performance gains, but I'd be fine with making a minimal change to fix the issue pre-GA and revisiting un-needed state tracking post-GA

@soltanmm
Copy link
Contributor

I'd originally added in the tag-as-GC-root thing to attempt to ensure memory safety whether or not things went wrong in the Python layers (attempting to ensure no use-after frees from Python). That said, I can't immediately come up with a situation where removing that reference could threaten that, so, I think I'm (tentatively) fine with that reference going away.

@soltanmm
Copy link
Contributor

Although... regardless of that reference being there (or not), I'm feeling like a more proper solution would be pumping all queues on exit...

@kpayson64
Copy link
Contributor Author

AFAICT, its only safe to pump a queue after shutting down (so you know when to stop), and that should only be done on completion queue destruction. The issue there is that calls have references to their completion queues:
https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi#L69

I actually played around with removing that reference to do what you are suggesting #7122

As it turns out, it is possible for the completion queue to go out of scope before the calls, and if that happens, nothing cancels the calls, and then the completion queue hangs trying to shut down, so I abandoned that idea.

@nathanielmanistaatgoogle
Copy link
Member

As I read and reread this issue, I get the sense that the invariant that's being violated is "calls that have operations started on them either get those operations finished or get cancelled", and that it's being violated just in the blocking sections of _channel.py, and that an appropriate fix would be to wrap the blocking synchronous RPC invocation code in a try/finally that cancels the call in the finally. Would that not be a small fix to make? Would it remedy this issue? Would it also remedy #7090?

@kpayson64
Copy link
Contributor Author

I don't believe that will work. We still need to get the call off the completion queue by calling grpc_completion_queue_next() This completion queues get pumped as part of their destruction, but outstanding calls hold references to the completion queues. We might be able to make this work, but it would end up being a larger change than what is proposed.

As far as #7090, the issue is much harder to address. #7090 is a result of a cancellation before connection. An AVL tree is set up with "possible" connections, and gets destroyed when one of the connections is actually used. When grpc_shutdown() attempts to free the AVL tree, it needs a valid "ext_ctx" to do so, but no such context is available during final cleanup. Ensuring the cancelled call gets destroyed seems to fix the issue, and I've verified that grpc_call_destroy() never gets called when the error occurs.

@kpayson64
Copy link
Contributor Author

@tseaver
#7275 should have resolved the client side leaks with outstanding calls. Could you verify that gcloud + gax no longer has this leak on the master branch?

@tseaver
Copy link

tseaver commented Jul 8, 2016

@kpayson64 I'd be glad to try it. Given that the repository is not Python-only, can you suggest a pip command-line that would install from the master?

@kpayson64
Copy link
Contributor Author

@tseaver
From the repo root you could try:
python setup.py install

If that doesn't work you could try,

python setup.py sdist
pip install dist/grpcio-0.16.0.dev0.tar.gz

@soltanmm
Copy link
Contributor

soltanmm commented Jul 8, 2016

Quick note:
GRPC_PYTHON_BUILD_WITH_CYTHON=1 needs to be set and you must have Cython installed if you're going from a fresh repository check-out., e.g.:

GRPC_PYTHON_BUILD_WITH_CYTHON=1 pip install .

@kpayson64
Copy link
Contributor Author

I've verified that this is fixed with the helloworld example.

@tejasmanohar
Copy link

Has this change been released? If not, is there an ETA? I'd also prefer to use pip and whatnot.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants