fd leaks #9

Closed
tarekziade opened this issue Jun 4, 2012 · 8 comments

@tarekziade
Contributor

The token server is leaking fds on stage2.

This is probably in powerhose: either in the client Pool or in the restarting of the workers.

I will write a test that counts the number of fds before and after each request to find out where the problem happens.

/cc @fetep @ametaireau
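Something along these lines for the counting test -- a rough sketch only; make_request stands in for an actual tokenserver request:

import os

FD_DIR = "/proc/self/fd"   # Linux-only, same trick as `ls /proc/<pid>/fd | wc -l`

def count_fds():
    return len(os.listdir(FD_DIR))

def assert_no_fd_leak(make_request, iterations=100):
    # the fd count should stay flat across repeated requests if nothing leaks
    before = count_fds()
    for _ in range(iterations):
        make_request()
    after = count_fds()
    assert after <= before, "possible fd leak: %d -> %d fds" % (before, after)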

@ghost assigned tarekziade Jun 4, 2012
@tarekziade
Contributor Author

It's clear that it's coming from the client side of powerhose, since there are no crypto workers running on stage2, and most gunicorn processes eat more than 10k fds:

token2
[root@token2 tarek]# ps aux|grep guni
token      847 19.9  0.5 19398712 393676 ?     Sl   01:24   0:04 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token     9619 11.7  0.7 32770976 487452 ?     Sl   01:24   0:05 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    10343 24.8  0.2 6670920 155860 ?      Sl   01:25   0:02 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    10984 13.7  0.7 33275668 498964 ?     Sl   01:24   0:06 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    11796  0.3  0.0 109372 11800 ?        S    Jun01  12:51 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    14189  6.7  0.7 31951172 468020 ?     Sl   01:24   0:04 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    14191  8.8  0.7 39765532 487920 ?     Sl   01:24   0:06 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    14193  7.3  0.7 34042692 465516 ?     Sl   01:24   0:05 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    14201 10.1  0.7 41437216 503880 ?     Sl   01:24   0:07 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    14203 11.0  0.7 42490944 517628 ?     Sl   01:24   0:07 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    23989 14.2  0.7 28161028 464228 ?     Sl   01:24   0:05 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    26006 11.1  0.7 40459876 509476 ?     Sl   01:24   0:07 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
token    30224 15.8  0.7 25082540 472260 ?     Sl   01:24   0:04 /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application

[root@token2 tarek]# ls /proc/14189/fd|wc -l
9674
[root@token2 tarek]# ls /proc/14191/fd|wc -l
14481
[root@token2 tarek]# ls /proc/14193/fd|wc -l
14260
[root@token2 tarek]# ls /proc/14201/fd|wc -l
17261
[root@token2 tarek]# ls /proc/14203/fd|wc -l
18490

Now looking at what's happening with the powerhose pool.

@tarekziade
Contributor Author

I was unable to find any leaks in Powerhose with these tests:

  • count how many FDs are open when a powerhose cluster is started
  • every worker has --max-age = 10 seconds
  • a client is constantly sending jobs

However, running gunicorn in stage2, I found out that the number of FDs in use is very high even when you just start it and do nothing.

The current formula is:

FDs = 12 + NUMWORKERS + for I IN NUMWORKERS (1305 + I)

For 12 workers:
12 + 12 + 1305 + 1306 + 1307 + .... + 1317 = 15715

So just running gunicorn with 12 workers already eats 15715 fds, which is a lot.

We have 60 to 65 SQL connectors per worker, and 50 connectors for the crypto clients pool.

So I don't know where the 1000+ extra fds are going... maybe membase? :s

Continuing the investigation.
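To narrow that down, something like the following helper (assuming the Linux /proc layout) can group a worker's fds by what they point at -- sockets, pipes, eventpoll handles, plain files:

import os
from collections import Counter

def fd_breakdown(pid):
    # group a process's fds by the target of the /proc/<pid>/fd symlinks
    fd_dir = "/proc/%d/fd" % pid
    kinds = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue   # fd closed while we were looking
        # targets look like "socket:[12345]", "pipe:[6789]", "anon_inode:[eventpoll]", or a path
        kinds[target.split(":")[0] if ":" in target else "file"] += 1
    return kinds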

@tarekziade
Contributor Author

Each powerhose client in the pool eats 25 fds, so with 50 clients per pool that's why we have 1250+ fds per gunicorn worker.

Now looking at why, and whether this can be reduced.

But so far I have seen no leaks, just large amounts of fds used by pyzmq.

@tarekziade
Contributor Author

We have 6 KQUEUEs and 19 sockets opened in each client.

The number of KQUEUEs can be reduced to 2 by setting the io_threads value from 5 to 1 at https://github.com/mozilla-services/powerhose/blob/master/powerhose/client.py#L42

That should not impact speed, and it would bring us back to 21 fds per client, so down to 1050 per gunicorn worker.

I don't think I can reduce the number of sockets -- still looking.
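The change would amount to something like this sketch (io_threads is the pyzmq constructor argument; the powerhose code may spell it differently):

import zmq

# one I/O thread is plenty for a request/reply client, and each extra
# I/O thread costs its own kqueue/epoll and signaling fds
context = zmq.Context(io_threads=1)   # was 5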

@tarekziade
Contributor Author

These lines (which are the gist of the client) already create 2 KQUEUEs and 10 sockets:

import zmq

c = zmq.Context()               # the context starts I/O threads, each with its own kqueue/epoll and signaling fds
s = c.socket(zmq.REQ)           # every zmq socket carries its own mailbox/signaling fds as well
poller = zmq.Poller()
poller.register(s, zmq.POLLIN)  # watch the socket for incoming replies
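To check that count on your own box (the exact numbers vary by platform and pyzmq version), the same snippet can be wrapped with a /proc fd count:

import os
import zmq

def count_fds():
    return len(os.listdir("/proc/self/fd"))

before = count_fds()
c = zmq.Context()
s = c.socket(zmq.REQ)
poller = zmq.Poller()
poller.register(s, zmq.POLLIN)
print("fds used by one bare client: %d" % (count_fds() - before))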

@tarekziade
Contributor Author

I have found a way to share some FDs between clients of the same pool. Doing the change now and will try it on stage2.
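Roughly the idea, as a sketch rather than the actual patch (PoolClient and endpoint are placeholders): all clients of a pool reuse a single zmq.Context, so the I/O thread's kqueue and signaling fds are paid once per pool instead of once per client.

import zmq

shared_ctx = zmq.Context(io_threads=1)   # one context shared by the whole pool

class PoolClient(object):
    # hypothetical client that reuses the shared context instead of creating its own
    def __init__(self, endpoint, ctx=shared_ctx):
        self.socket = ctx.socket(zmq.REQ)
        self.socket.connect(endpoint)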

@jbonacci

@tarekziade how did the change and retest go? Is this considered fixed?

@rfk
Contributor

rfk commented Jun 11, 2014

Closing this out since we no longer have the powerhose stuff directly in this repo; if it's still a problem it can be moved to the powerhose repo.

@rfk rfk closed this as completed Jun 11, 2014