parse_address leaks memory on Python2.7 when passing language and/or country #10

Closed
a9rolf-nb opened this issue Jul 6, 2016 · 3 comments

@a9rolf-nb

Tested with Python 2.7 and Python 3.4.
Passing any string value for either language or country in a call to postal.parser.parse_address leaks memory on Python 2.7. It makes no difference whether the value is unicode or str. Garbage collection has no effect, and neither does regularly reloading the Python modules. I also tried reloading the _postal shared library, but the Python reload machinery apparently does not release the .so once it is loaded, so I gave up on that.

Known workarounds:
use Python3
pass only the address: never pass a non-empty language or a non-empty country
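
The second workaround can be wrapped in a small helper. This is a minimal sketch, not part of pypostal: the name `parse_address_safely` is hypothetical, and the parser function is passed in as `parse` so that the helper does not depend on how postal is installed. It assumes `parse` accepts `language` and `country` keyword arguments, as `postal.parser.parse_address` does in this report.

```python
import sys


def parse_address_safely(parse, address, language=None, country=None):
    """Call parse(address, ...) while avoiding the arguments that leak.

    Hypothetical helper: `parse` is expected to be
    postal.parser.parse_address or a compatible callable.
    """
    if sys.version_info[0] < 3:
        # Python 2: pass only the address, as in the workaround above.
        return parse(address)

    # Python 3: forward only non-empty values.
    kwargs = {}
    if language:
        kwargs["language"] = language
    if country:
        kwargs["country"] = country
    return parse(address, **kwargs)
```

With postal installed, a call would look like `parse_address_safely(postal.parser.parse_address, "Hello", language="en")`.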

The leak is around 40 bytes per passed country/language argument, per invocation, for any str of two or more characters. Empty strings do not leak. Single-character strings do not leak either (presumably because internal Python optimizations reuse single-character str instances).

Observed on this system:

$ uname -a
Linux ron-VirtualBox 3.13.0-91-generic #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/version
Linux version 3.13.0-91-generic (buildd@lgw01-21) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) ) #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016
$ python2.7 --version && python3 --version
Python 2.7.6
Python 3.4.3
$ (pip2.7 freeze && pip3 freeze) | grep postal
postal==0.3
postal==0.3

Note: 2.7.6 is Ubuntu's stock Python version. I have seen the same behavior with Python 2.7.11 compiled from source (with --enable-unicode=ucs4) on the same machine, and also on a CentOS box.
libpostal (the C library) is installed system-wide, so both Python versions' postal packages definitely layer onto the same libpostal.so.

Reproduction script (supports both Python2 and Python3)

"""Demonstrate leaking memory on calling postal.parser.parse_address with
`language` and/or `country` argument(s).
Leak only observable on Python 2.7. Python 3.4 memory usage completely stable.
Module reloading and garbage collection have zero impact.
Using unicode or str for country/language arguments has zero impact.
Length of country/language strings has impact: longer strings leak more memory.
"""


from __future__ import print_function

import resource
import gc
import imp

import postal.parser

try:
    # Python 2: xrange exists, nothing to do
    _ = xrange
except NameError:
    # Python 3: xrange is gone, use range instead
    xrange = range

def reload_all_postal_python_modules():
    # reload everything implicitly pulled in by importing postal.parser
    # >>> import sys
    # >>> import postal.parser
    # >>> [name for (name, mod) in sys.modules.items() if 'post' in name and mod]
    # ['postal', 'postal.utils.encoding', 'postal._parser', 'postal.parser', 'postal.utils']
    import postal.parser
    import postal
    import postal.utils
    import postal.utils.encoding
    postal.utils.encoding = imp.reload(postal.utils.encoding)
    postal.utils = imp.reload(postal.utils)
    globals()['postal'].parser = imp.reload(postal.parser)
    globals()['postal'] = imp.reload(postal)

def format_maxrss():
    return "%dkiB" % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, )

def run(per_spin=50000, spins=20):
    parse_invocations = 0
    print("maxrss=%s at startup" % (format_maxrss(), ))
    for spin in xrange(spins):
        for invocation in xrange(per_spin):
            # memory continues to grow linearly per invocation
            # ONLY IF either country or language or both are passed.
            # Length of passed value for country / language directly affects rate of memory usage growth.
            # * language='idontknow' => ~47MB per million invocations
            # * country='idontknow' => ~47MB per million invocations
            # * language='idontknowbutletsmakethisalittlelongernow' => ~80MB per million invocations
            # * country='idontknow', language='idontknow' => ~95MB per million invocations
            _ = postal.parser.parse_address("Hello", language='idontknow')
        parse_invocations += per_spin
        print("maxrss=%s after spin %d (%d calls to postal.parser.parse_address)" % (format_maxrss(), spin, parse_invocations))

        # reloading postal Python modules regularly does not influence memory usage at all
        reload_all_postal_python_modules()
        # garbage collection does not influence memory usage at all
        gc.collect()

if __name__ == '__main__':
    run()

Example output:

$ python2.7 pypostal_memory_leak_demo.py 
maxrss=961376kiB at startup
maxrss=963748kiB after spin 0 (50000 calls to postal.parser.parse_address)
maxrss=966008kiB after spin 1 (100000 calls to postal.parser.parse_address)
maxrss=968380kiB after spin 2 (150000 calls to postal.parser.parse_address)
maxrss=971024kiB after spin 3 (200000 calls to postal.parser.parse_address)
maxrss=973140kiB after spin 4 (250000 calls to postal.parser.parse_address)
maxrss=975784kiB after spin 5 (300000 calls to postal.parser.parse_address)
maxrss=977904kiB after spin 6 (350000 calls to postal.parser.parse_address)
maxrss=980548kiB after spin 7 (400000 calls to postal.parser.parse_address)
maxrss=982664kiB after spin 8 (450000 calls to postal.parser.parse_address)
maxrss=985308kiB after spin 9 (500000 calls to postal.parser.parse_address)
maxrss=987688kiB after spin 10 (550000 calls to postal.parser.parse_address)
maxrss=990072kiB after spin 11 (600000 calls to postal.parser.parse_address)
maxrss=992188kiB after spin 12 (650000 calls to postal.parser.parse_address)
maxrss=994832kiB after spin 13 (700000 calls to postal.parser.parse_address)
maxrss=997212kiB after spin 14 (750000 calls to postal.parser.parse_address)
maxrss=999596kiB after spin 15 (800000 calls to postal.parser.parse_address)
maxrss=1001712kiB after spin 16 (850000 calls to postal.parser.parse_address)
maxrss=1004356kiB after spin 17 (900000 calls to postal.parser.parse_address)
maxrss=1006736kiB after spin 18 (950000 calls to postal.parser.parse_address)
maxrss=1009120kiB after spin 19 (1000000 calls to postal.parser.parse_address)
$ python3 pypostal_memory_leak_demo.py 
maxrss=962880kiB at startup
maxrss=962880kiB after spin 0 (50000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 1 (100000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 2 (150000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 3 (200000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 4 (250000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 5 (300000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 6 (350000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 7 (400000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 8 (450000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 9 (500000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 10 (550000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 11 (600000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 12 (650000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 13 (700000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 14 (750000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 15 (800000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 16 (850000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 17 (900000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 18 (950000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 19 (1000000 calls to postal.parser.parse_address)
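
For reference, the per-call growth implied by the two runs above can be worked out from the first and last maxrss values. The sketch below only redoes that arithmetic; the function name `bytes_per_call` is made up for this illustration, and it assumes ru_maxrss is reported in kilobytes, as it is on Linux.

```python
def bytes_per_call(start_kib, end_kib, calls):
    """Average growth of peak RSS per call, in bytes.

    Assumes the inputs are ru_maxrss readings in kiB (Linux behavior).
    """
    growth_bytes = (end_kib - start_kib) * 1024
    return growth_bytes / float(calls)


# Python 2.7 run: startup value vs. value after 1,000,000 calls
py27 = bytes_per_call(961376, 1009120, 1000000)
# Python 3 run: maxrss never moved
py3 = bytes_per_call(962880, 962880, 1000000)

print("Python 2.7: %.1f bytes per call" % py27)
print("Python 3:   %.1f bytes per call" % py3)
```

For the Python 2.7 run this comes to roughly 49 bytes per call with language='idontknow', in line with the "~47MB per million invocations" noted in the script comments.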

HTH

@albarrentine
Contributor

Thanks for the report - was able to reproduce the issue and committed a fix.

Generally, I wouldn't recommend passing country/language as they have no real effect on the current parser. Language technically had a small effect in that it controlled which street-level dictionaries were searched, but since that can change the model's decision thresholds, libpostal now ignores that parameter.

Those two options were originally added as a placeholder in case predicting country/language or knowing them a priori was useful in parsing. Intuitively it seemed that must be true, but the global model performed better than one with per-country/per-language parameters. They may reappear at some point, but more for the purpose of training smaller country-specific/language-specific parser models, in which case they'd be used to select the appropriate model.

@a9rolf-nb
Author

Thank you for the quick resolution and the explanation.
I wasn't sure what those arguments really do, but I think I've seen some slight effects from passing a country argument into expand_address, though I'm not using that method at all anymore and don't have any details on hand. I just thought it couldn't really hurt to pass in a country when we already have that information with high confidence.

These arguments seem ... slightly underdocumented ;)

@albarrentine
Contributor

Totally right about documentation. For expand_address the option is actually language code rather than country code (for Germany I suppose it doesn't matter since the two are identical). If the language is known a priori it's a good idea to pass it to expand_address. That should yield better/fewer results, as otherwise a language classifier has to be used to predict the language. For statistically close languages, if the classifier said e.g. "there's a 90% probability that it's German and a 10% probability it's Dutch" then it's possible that both German and Dutch expansions would be used.
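
As a rough illustration of passing a known language to expand_address, here is a minimal sketch. The helper name `expand_with_known_language` is hypothetical. It assumes the Python binding exposes `postal.expand.expand_address` with a `languages` keyword that takes a list of language codes; check this against the installed postal version before relying on it. The `expand` parameter exists only so the helper can be exercised without libpostal installed.

```python
def expand_with_known_language(address, language=None, expand=None):
    """Expand an address, passing the language only when it is known.

    Hypothetical helper. By default it uses postal.expand.expand_address
    (assumed signature: expand_address(address, languages=None)).
    """
    if expand is None:
        # Imported lazily so this module loads without libpostal present.
        from postal.expand import expand_address as expand

    if language:
        # Known language: no language classifier guess is needed.
        return expand(address, languages=[language])

    # Unknown language: let the library predict it.
    return expand(address)
```

For a German address this would be called as `expand_with_known_language("Friedrichstr. 10", language="de")`, which passes a language code, not a country code.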
