
Some sockets are opened but never read. UDP socket buffer memory usage increases #171

Closed
nicoladefranceschi opened this issue May 24, 2019 · 24 comments


@nicoladefranceschi

In the Zeroconf class init, some sockets:
https://github.com/jstasiak/python-zeroconf/blob/c7876108150cd251786db4ab52dadd1b2283d262/zeroconf.py#L1810

are created in both the unicast and multicast case.

But here:
https://github.com/jstasiak/python-zeroconf/blob/c7876108150cd251786db4ab52dadd1b2283d262/zeroconf.py#L1858-L1862
in the multicast case, the _respond_sockets are never read.

This causes a problem with the UDP Recv queue: the OS (in my case Ubuntu) keeps all the packets in memory, waiting for the socket to read them, but that read never happens and the memory keeps growing "forever"!

This is my output of sudo ss -nlpu:

State     Recv-Q     Send-Q                                Local Address:Port            Peer Address:Port                                                     
UNCONN    156416     0                                           0.0.0.0:5353                 0.0.0.0:*         users:(("python",pid=10571,fd=54))             

To work around it, I tried running these two lines
https://github.com/jstasiak/python-zeroconf/blob/c7876108150cd251786db4ab52dadd1b2283d262/zeroconf.py#L1861-L1862
even in the multicast case, and the problem was "solved".

I'm sure that is not the right solution, and it may actually be logically wrong in this case.
I'm not an expert, but maybe there is a way to tell the OS that those sockets are not interested in receiving packets.
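
For reference, here is a rough sketch of the change I tried (assuming those two lines register the respond sockets as readers; the attribute names below follow the ones used later in this thread, and the exact code at that revision may differ):

# Sketch only, not the verbatim patch: also register the respond sockets as
# readers in the multicast case, so their receive queues get drained.
for s in self._respond_sockets:
    self.engine.add_reader(self.listener, s)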

@psidlo

psidlo commented Oct 19, 2019

@nicoladefranceschi I have the same issue. Could you please be more specific about how you hotfixed it, or provide a hotfixed version of your zeroconf.py?

@kdvlr

kdvlr commented Nov 5, 2019

I tried using @nicoladefranceschi's suggestion; while the UDP receive-queue issue seems to be mitigated, there is a CPU spike whenever the sockets are cleaned up, which is causing other services to fail.

The underlying problem is that zeroconf attaches to all the interfaces. If the host is a docker host, there are obviously multiple interfaces, and zeroconf attaches to all of them, which leads to UDP memory continuously increasing and eventually crashing.
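
As a partial mitigation (a sketch, not a fix for the leak itself), the Zeroconf constructor accepts an interfaces argument, so it can be limited to the addresses you actually care about instead of every interface on the docker host; "192.168.1.10" below is a placeholder:

from zeroconf import Zeroconf

# Sketch: restrict zeroconf to a single interface address instead of all of
# the docker host's interfaces. "192.168.1.10" is a placeholder LAN address.
zeroconf = Zeroconf(interfaces=["192.168.1.10"])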

@meme1337

I have the same issue that is being reported: the docker host saturates memory because zeroconf attaches to all interfaces.

@kdvlr

kdvlr commented Nov 26, 2019

I solved the problem by using a macvlan network for my docker container - https://blog.oddbit.com/post/2018-03-12-using-docker-macvlan-networks/

@meme1337

Thanks @kdvlr!
I will definitely look into that.

@fireinice

fireinice commented Jan 26, 2020

I used macvlan as @kdvlr suggested, but zeroconf still binds five sockets to port 5353 while only reading from one.

@meme1337

I used macvlan as @kdvlr suggested, but zeroconf still binds five sockets to port 5353 while only reading from one.

Sorry for the late reply. I haven't solved this issue either, and I'm wondering whether it also happens on other installations but is simply ignored, or whether it can be solved by configuring zeroconf properly, though I doubt it.

@meme1337

meme1337 commented May 1, 2020

Bump.
I didn't manage to solve this issue. Are there any updates on this?

@ioerror

ioerror commented May 20, 2020

The recvq issue hit a few of my programs as well. I'd be willing to write a fix for zeroconf if the authors are interested in merging one. In the meantime, here's some code that works around the problem for users of the zeroconf library until upstream releases a fix for this issue.

For ServiceBrowsers:

zeroconf = Zeroconf()
# This is a workaround for a memory leak that exists in the zeroconf library
zeroconf_fix_memleak(zeroconf)
browser = ServiceBrowser(zeroconf, _LABEL, YourCustomServiceListener())
...

For publishing services:

zeroconf = Zeroconf(interfaces=interfaces)
zeroconf.register_service(service_info)
# This is a workaround for a memory leak that exists in the zeroconf library
zeroconf_fix_memleak(zeroconf)
...

The zeroconf_fix_memleak code needs the following:

from logging import getLogger
from zeroconf import _MAX_MSG_ABSOLUTE

class FixMemoryLeak(object):
    """
    See https://github.com/jstasiak/python-zeroconf/issues/171 to understand
    why we need this hack.
    """

    def __init__(self):
        self.log = getLogger()

    def handle_read(self, socket_):
        # Drain one datagram from the otherwise-unread respond socket so the
        # kernel does not keep it queued in memory.
        try:
            data, (addr, port) = socket_.recvfrom(_MAX_MSG_ABSOLUTE)
            self.log.debug('FixMemoryLeak: %s', data)
        except Exception as ex:
            self.log.debug('FixMemoryLeak exception: %s', ex)

def zeroconf_fix_memleak(zeroconf):
    """
    This is a workaround for a memory leak that exists in the zeroconf library.
    """
    log = getLogger()
    for s in zeroconf._respond_sockets:
        log.debug('zeroconf leaky socket: %s', s)
        zeroconf.engine.add_reader(FixMemoryLeak(), s)

This hacky fix is released under the same license as python-zeroconf.

@emontnemery
Collaborator

@ioerror please open a PR if you have a fix in mind👍

@jstasiak
Collaborator

Agreed, a fix would be most appreciated.

@jstasiak
Collaborator

All: #188 (and its rebased version, #270) is likely to resolve this.

@jstasiak
Collaborator

jstasiak commented Jun 2, 2020

#270 is almost ready to be merged (I think specifying IPv6 interfaces by index may be broken now, but I'll verify) – I'd appreciate it if someone verified that the PR (branch rebase-188) works for them without regressions.

@ioerror

ioerror commented Jun 13, 2020

@jstasiak I'm happy to test a branch once the conflicts are resolved. I have a test application which regularly triggers this issue once my helper function (see above) is removed.
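
For reference, a reproducer along these lines (a sketch, not my actual application; names, address and port are placeholders) is enough to show the growing Recv-Q on an affected version:

import socket
import time

from zeroconf import ServiceInfo, Zeroconf

# Sketch: publish one service and keep the process alive. On an affected
# version, mDNS traffic from other hosts piles up in the respond sockets'
# Recv-Q (visible with `ss -nlpu`) because those sockets are never read.
info = ServiceInfo(
    "_http._tcp.local.",
    "demo._http._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.10")],  # placeholder address
    port=8080,
)
zeroconf = Zeroconf()
zeroconf.register_service(info)
try:
    time.sleep(3600)  # leave it running and watch the Recv-Q grow
finally:
    zeroconf.unregister_service(info)
    zeroconf.close()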

@jstasiak
Collaborator

@ioerror the branch (rebase-188, PR #270) is ready to be tested. I'm happy to merge it after I get a confirmation that it works.

@jstasiak
Collaborator

@ioerror did you have a chance to test this? I'd like to merge and release this ASAP but I'll at least need a confirmation that this doesn't break stuff that was working before. :)

@ioerror

ioerror commented Jun 30, 2020

@ioerror did you have a chance to test this? I'd like to merge and release this ASAP but I'll at least need a confirmation that this doesn't break stuff that was working before. :)

@jstasiak the branch appears to fail in CI - is it really ready for testing? I could ignore that and try anyway, but I thought it better to wait for CI to report that it is A-OK first.

@jstasiak
Collaborator

@ioerror it's just linting failing, the actual tests pass.

@ioerror

ioerror commented Jul 6, 2020

@jstasiak I checked out https://github.com/jstasiak/python-zeroconf/tree/rebase-188 today. I'm sad to report that without my zeroconf_fix_memleak(zeroconf) fix from above, I still see the same Recv-Q problem as with the current git tip:

UNCONN 58880  0                                  0.0.0.0:5353       0.0.0.0:*    users:(("discover",pid=2681534,fd=5))                                   
UNCONN 58880  0                                  0.0.0.0:5353       0.0.0.0:*    users:(("discover",pid=2681534,fd=4))                                   
UNCONN 0      0                                  0.0.0.0:5353       0.0.0.0:*    users:(("discover",pid=2681534,fd=3))                                   

Branch rebase-188 doesn't look like it solves the problem.

@jstasiak
Collaborator

jstasiak commented Jul 6, 2020

Thank you. This is really strange: are you absolutely sure you're running the checked-out version of the library, as opposed to one installed on your system in a virtualenv or globally? Only one socket should be bound to 0.0.0.0, and I see three such sockets here, which makes me suspicious.

Can you set the zeroconf logger level to DEBUG and provide the logs from startup until the problematic situation occurs?
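
For reference, something along these lines is enough (standard library logging; nothing zeroconf-specific beyond the logger name):

import logging

# Route zeroconf's internal logging to the console at DEBUG level.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("zeroconf").setLevel(logging.DEBUG)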

@ioerror

ioerror commented Jul 6, 2020

Just to test, I've removed every trace of python-zeroconf from my system: all packages and all source installs except a single repo:

$ git clone https://github.com/jstasiak/python-zeroconf.git
Cloning into 'python-zeroconf'...
remote: Enumerating objects: 90, done.
remote: Counting objects: 100% (90/90), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 1516 (delta 41), reused 63 (delta 25), pack-reused 1426
Receiving objects: 100% (1516/1516), 823.93 KiB | 1.90 MiB/s, done.
Resolving deltas: 100% (893/893), done.
$ git checkout remotes/origin/rebase-188
HEAD is now at eecb9b1 Fix the typing
$ sudo python3 setup.py install
[sudo] password for user: 
running install
running bdist_egg
running egg_info
creating zeroconf.egg-info
writing zeroconf.egg-info/PKG-INFO
writing dependency_links to zeroconf.egg-info/dependency_links.txt
writing requirements to zeroconf.egg-info/requires.txt
writing top-level names to zeroconf.egg-info/top_level.txt
writing manifest file 'zeroconf.egg-info/SOURCES.txt'
reading manifest file 'zeroconf.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'zeroconf.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/zeroconf
copying zeroconf/__init__.py -> build/lib/zeroconf
copying zeroconf/test.py -> build/lib/zeroconf
copying zeroconf/py.typed -> build/lib/zeroconf
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/zeroconf
copying build/lib/zeroconf/__init__.py -> build/bdist.linux-x86_64/egg/zeroconf
copying build/lib/zeroconf/test.py -> build/bdist.linux-x86_64/egg/zeroconf
copying build/lib/zeroconf/py.typed -> build/bdist.linux-x86_64/egg/zeroconf
byte-compiling build/bdist.linux-x86_64/egg/zeroconf/__init__.py to __init__.cpython-38.pyc
byte-compiling build/bdist.linux-x86_64/egg/zeroconf/test.py to test.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying zeroconf.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating dist
creating 'dist/zeroconf-0.27.0-py3.8.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing zeroconf-0.27.0-py3.8.egg
creating /usr/local/lib/python3.8/dist-packages/zeroconf-0.27.0-py3.8.egg
Extracting zeroconf-0.27.0-py3.8.egg to /usr/local/lib/python3.8/dist-packages
Adding zeroconf 0.27.0 to easy-install.pth file

Installed /usr/local/lib/python3.8/dist-packages/zeroconf-0.27.0-py3.8.egg
Processing dependencies for zeroconf==0.27.0
Searching for ifaddr==0.1.7
Best match: ifaddr 0.1.7
Processing ifaddr-0.1.7-py3.8.egg
ifaddr 0.1.7 is already the active version in easy-install.pth

Using /usr/local/lib/python3.8/dist-packages/ifaddr-0.1.7-py3.8.egg
Finished processing dependencies for zeroconf==0.27.0

In one terminal, I have opened my discover program (without the hacky fix applied) and in another I've checked for open sockets:

# ss -nlpu|grep -i discover
UNCONN 0      0                                127.0.0.1:5353       0.0.0.0:*    users:(("discover",pid=2693357,fd=5))
UNCONN 0      0                                172.1.2.3:5353       0.0.0.0:*    users:(("discover",pid=2693357,fd=4))
UNCONN 0      0                                  0.0.0.0:5353       0.0.0.0:*    users:(("discover",pid=2693357,fd=3))

When I also open my publish program (without the hacky fix applied), I additionally see the following:

UNCONN 0      0                                  0.0.0.0:5353       0.0.0.0:*    users:(("publish",pid=2693801,fd=4))                                   
UNCONN 0      0                                  0.0.0.0:5353       0.0.0.0:*    users:(("publish",pid=2693801,fd=3))     

I think that means I was wrong! Hooray! I must have somehow had a stale branch interfering with my test. If there are other tests, I'm happy to do more. It looks like it is fixed now, though there are two 0.0.0.0 sockets with the publisher - perhaps that is expected, as one is for reading and one is for writing? My discover script should only have one for reading, I think.

@jstasiak
Collaborator

jstasiak commented Jul 6, 2020

There should be one socket bound to 0.0.0.0 for every Zeroconf instance, so I can't say for sure what's going on there without seeing the code. No need for more tests, thank you, I think this is enough; I'll see about merging and releasing this soon.

jstasiak added a commit that referenced this issue Jul 7, 2020
This contains two major changes:

* Listen on data from respond_sockets in addition to listen_socket
* Do not bind respond sockets to 0.0.0.0 or ::/0

The description of the original change by Emil:

<<<
Without either of these changes, I get no replies at all when browsing for
services using the browser example. I'm on a corporate network, and when
connecting to a different network it works without these changes, so maybe
it's something about the network configuration in this particular network
that breaks the previous behavior.

Unfortunately, I have no idea how this affects other platforms, or what
the changes really mean. However, it works for me and it seems reasonable
to get replies back on the same socket where they are sent.
>>>

The tests pass and it's been confirmed to a reasonable degree that this
doesn't break the previously working use cases.

Additionally, this removes a memory leak where data sent to some of the
respond sockets would never be read from them (#171).

Co-authored-by: Emil Styrke <emil.styrke@axis.com>
@jstasiak
Collaborator

jstasiak commented Jul 7, 2020

@ioerror When I run examples/registration.py I only have one socket bound to 0.0.0.0, so I think your publish application creates two Zeroconf instances.
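
In other words, registering every service through one shared instance keeps you at a single 0.0.0.0 socket. A sketch with placeholder ServiceInfo values:

import socket

from zeroconf import ServiceInfo, Zeroconf

# Sketch: share one Zeroconf instance for all registrations instead of
# creating one instance per service. Names, address and ports are placeholders.
addresses = [socket.inet_aton("192.168.1.10")]
web = ServiceInfo("_http._tcp.local.", "web._http._tcp.local.", addresses=addresses, port=8080)
api = ServiceInfo("_http._tcp.local.", "api._http._tcp.local.", addresses=addresses, port=8081)

zeroconf = Zeroconf()  # one instance, one socket bound to 0.0.0.0
zeroconf.register_service(web)
zeroconf.register_service(api)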

@jstasiak
Collaborator

jstasiak commented Jul 7, 2020

Well, this took a while, but it should be fixed in 0.28.0. Please reopen if I'm wrong.
