parse_address leaks memory on Python2.7 when passing language and/or country #10

a9rolf-nb · 2016-07-06T11:54:05Z

Tested with Python2.7 and Python3.4.
Passing any string value for either language or country in a call to postal.parser.parse_address leaks memory on Python2.7. Unicode vs string does not change it. Garbage collection has no impact. Neither has regular reloading of the Python code modules. I've tried reloading the _postal shared library, but apparently the Python reload machinery does not actually release the .so once loaded, so I gave up on this.

Known workarounds:
use Python3
Pass only the address string; never pass a non-empty language or a non-empty country
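A minimal sketch of the second workaround, assuming a thin wrapper is acceptable in the calling code. The name `parse_without_hints` is hypothetical and not part of postal; the parser is passed in as an argument so the sketch runs without postal installed. In real use the first argument would be `postal.parser.parse_address`.

```python
def parse_without_hints(parse_address, address, **ignored_hints):
    """Call the parser with the address only.

    Any language/country keyword arguments are accepted and silently
    dropped, so the leaking code path is never reached.
    (Hypothetical helper, not part of postal.)
    """
    return parse_address(address)


def fake_parser(address, language=None, country=None):
    # Stand-in for postal.parser.parse_address so the sketch is runnable
    # on its own; it just records which arguments it received.
    return (address, language, country)


result = parse_without_hints(fake_parser, "Hello", language="idontknow")
print(result)
```

With the stand-in parser, the recorded language and country are both `None`, which shows that the hint never reaches the underlying call.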

We're talking around 40 bytes lost per passed country/language argument, per invocation, starting from two-character str. Empty strings do not leak. Single-character strings do not leak (assuming internal Python optimizations kick in to reuse single-char str instances).

Observed on this system:

$ uname -a
Linux ron-VirtualBox 3.13.0-91-generic #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/version
Linux version 3.13.0-91-generic (buildd@lgw01-21) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) ) #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016
$ python2.7 --version && python3 --version
Python 2.7.6
Python 3.4.3
$ (pip2.7 freeze && pip3 freeze) | grep postal
postal==0.3
postal==0.3

Note: 2.7.6 is Ubuntu's standard Python version. I have seen the same behavior with Python 2.7.11 compiled from source (with --enable-unicode=ucs4) on the same machine, and also on a CentOS box.
Libpostal (the C library) is installed system-wide, so both Python versions' postal packages definitely layer onto the same libpostal.so.

Reproduction script (supports both Python2 and Python3)

"""Demonstrate leaking memory on calling postal.parser.parse_address with
`language` and/or `country` argument(s).
Leak only observable on Python 2.7. Python 3.4 memory usage completely stable.
Module reloading and garbage collection have zero impact.
Using unicode or str for country/language arguments has zero impact.
Length of country/language strings has impact: longer strings leak more memory.
"""


from __future__ import print_function

import resource
import gc
import imp

import postal.parser

try:
    # Python 2: xrange exists, nothing to do
    _ = xrange
except NameError:
    # Python 3: xrange is gone, range is already lazy
    xrange = range

def reload_all_postal_python_modules():
    # reload everything implicitly pulled in by importing postal.parser
    # >>> import sys
    # >>> import postal.parser
    # >>> [name for (name, mod) in sys.modules.items() if 'post' in name and mod]
    # ['postal', 'postal.utils.encoding', 'postal._parser', 'postal.parser', 'postal.utils']
    import postal.parser
    import postal
    import postal.utils
    import postal.utils.encoding
    postal.utils.encoding = imp.reload(postal.utils.encoding)
    postal.utils = imp.reload(postal.utils)
    globals()['postal'].parser = imp.reload(postal.parser)
    globals()['postal'] = imp.reload(postal)

def format_maxrss():
    return "%dkiB" % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, )

def run(per_spin=50000, spins=20):
    parse_invocations = 0
    print("maxrss=%s at startup" % (format_maxrss(), ))
    for spin in xrange(spins):
        for invocation in xrange(per_spin):
            # memory continues to grow linearly per invocation
            # ONLY IF either country or language or both are passed.
            # Length of passed value for country / language directly affects rate of memory usage growth.
            # * language='idontknow' => ~47MB per million invocations
            # * country='idontknow' => ~47MB per million invocations
            # * language='idontknowbutletsmakethisalittlelongernow' => ~80MB per million invocations
            # * country='idontknow', language='idontknow' => ~95MB per million invocations
            _ = postal.parser.parse_address("Hello", language='idontknow')
        parse_invocations += per_spin
        print("maxrss=%s after spin %d (%d calls to postal.parser.parse_address)" % (format_maxrss(), spin, parse_invocations))

        # reloading postal Python modules regularly does not influence memory usage at all
        reload_all_postal_python_modules()
        # garbage collection does not influence memory usage at all
        gc.collect()

if __name__ == '__main__':
    run()

Example output:

$ python2.7 pypostal_memory_leak_demo.py 
maxrss=961376kiB at startup
maxrss=963748kiB after spin 0 (50000 calls to postal.parser.parse_address)
maxrss=966008kiB after spin 1 (100000 calls to postal.parser.parse_address)
maxrss=968380kiB after spin 2 (150000 calls to postal.parser.parse_address)
maxrss=971024kiB after spin 3 (200000 calls to postal.parser.parse_address)
maxrss=973140kiB after spin 4 (250000 calls to postal.parser.parse_address)
maxrss=975784kiB after spin 5 (300000 calls to postal.parser.parse_address)
maxrss=977904kiB after spin 6 (350000 calls to postal.parser.parse_address)
maxrss=980548kiB after spin 7 (400000 calls to postal.parser.parse_address)
maxrss=982664kiB after spin 8 (450000 calls to postal.parser.parse_address)
maxrss=985308kiB after spin 9 (500000 calls to postal.parser.parse_address)
maxrss=987688kiB after spin 10 (550000 calls to postal.parser.parse_address)
maxrss=990072kiB after spin 11 (600000 calls to postal.parser.parse_address)
maxrss=992188kiB after spin 12 (650000 calls to postal.parser.parse_address)
maxrss=994832kiB after spin 13 (700000 calls to postal.parser.parse_address)
maxrss=997212kiB after spin 14 (750000 calls to postal.parser.parse_address)
maxrss=999596kiB after spin 15 (800000 calls to postal.parser.parse_address)
maxrss=1001712kiB after spin 16 (850000 calls to postal.parser.parse_address)
maxrss=1004356kiB after spin 17 (900000 calls to postal.parser.parse_address)
maxrss=1006736kiB after spin 18 (950000 calls to postal.parser.parse_address)
maxrss=1009120kiB after spin 19 (1000000 calls to postal.parser.parse_address)
$ python3 pypostal_memory_leak_demo.py 
maxrss=962880kiB at startup
maxrss=962880kiB after spin 0 (50000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 1 (100000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 2 (150000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 3 (200000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 4 (250000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 5 (300000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 6 (350000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 7 (400000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 8 (450000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 9 (500000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 10 (550000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 11 (600000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 12 (650000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 13 (700000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 14 (750000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 15 (800000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 16 (850000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 17 (900000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 18 (950000 calls to postal.parser.parse_address)
maxrss=962880kiB after spin 19 (1000000 calls to postal.parser.parse_address)
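As a rough cross-check of the per-call figure, the growth can be computed from the first and last maxrss values of the Python 2.7 run above. This is only a sketch: it assumes `ru_maxrss` is reported in KiB (as it is on Linux) and that all of the growth is due to the leak. The helper name `leaked_bytes_per_call` is made up for this illustration.

```python
# Figures copied from the Python 2.7 run above (values in KiB).
startup_kib = 961376
final_kib = 1009120
calls = 1000000


def leaked_bytes_per_call(start_kib, end_kib, n_calls):
    """Average growth in bytes per call between two maxrss readings."""
    return (end_kib - start_kib) * 1024.0 / n_calls


per_call = leaked_bytes_per_call(startup_kib, final_kib, calls)
print("%.1f bytes leaked per call" % per_call)

# Same calculation for the Python 3 run, where maxrss never moved.
per_call_py3 = leaked_bytes_per_call(962880, 962880, calls)
print("%.1f bytes leaked per call" % per_call_py3)
```

For the 9-character `'idontknow'` argument this comes out a little under 50 bytes per call on Python 2.7 and zero on Python 3, in line with the "~47MB per million invocations" note in the script.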

HTH


albarrentine · 2016-07-06T19:33:10Z

Thanks for the report - was able to reproduce the issue and committed a fix.

Generally, I wouldn't recommend passing country/language as they have no real effect on the current parser. Language technically had a small effect in that it controlled which street-level dictionaries were searched, but that may change the model's decision thresholds, so libpostal now ignores that parameter.

Those two options were originally added as a placeholder in case predicting country/language or knowing them a priori was useful in parsing. Intuitively it seemed that must be true, but the global model performed better than one with per-country/per-language parameters. They may reappear at some point, but more for the purpose of training smaller country-specific/language-specific parser models, in which case they'd be used to select the appropriate model.

a9rolf-nb · 2016-07-11T04:52:59Z

Thank you for the quick resolution and the explanation.
I wasn't sure what those arguments really do, but I think I've seen some slight effects from passing a country argument into expand_address, though I'm not using that method at all anymore and don't have any details on hand. I just thought it couldn't really hurt to pass in a country when we already have that information with high confidence.

These arguments seem ... slightly underdocumented ;)

albarrentine · 2016-07-13T07:34:57Z

Totally right about documentation. For expand_address the option is actually language code rather than country code (for Germany I suppose it doesn't matter since the two are identical). If the language is known a priori it's a good idea to pass it to expand_address. That should yield better/fewer results, as otherwise a language classifier has to be used to predict the language. For statistically close languages, if the classifier said e.g. "there's a 90% probability that it's German and a 10% probability it's Dutch" then it's possible that both German and Dutch expansions would be used.
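To illustrate the behavior described above, here is a toy model in plain Python. It is not libpostal's code: the function name `languages_to_expand`, the shape of the classifier output, and the `min_prob` cutoff of 0.05 are all assumptions made up for this sketch. It only shows why a known language yields fewer expansions than a classifier guess between close languages.

```python
def languages_to_expand(classifier_probs, known_language=None, min_prob=0.05):
    """Pick the languages whose expansions would be applied.

    classifier_probs: dict mapping language code -> predicted probability
                      (hypothetical shape, for illustration only).
    known_language:   language supplied by the caller, if any.
    min_prob:         invented cutoff; the real threshold is not stated
                      in this thread.
    """
    if known_language is not None:
        # Language known a priori: the classifier is not consulted.
        return [known_language]
    ranked = sorted(classifier_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [lang for lang, prob in ranked if prob >= min_prob]


guess = {"de": 0.9, "nl": 0.1}
without_hint = languages_to_expand(guess)
with_hint = languages_to_expand(guess, known_language="de")
print(without_hint)
print(with_hint)
```

In this model the 90%/10% German/Dutch example selects both languages when no hint is given, and only German when the caller passes the language explicitly.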

albarrentine added a commit that referenced this issue Jul 6, 2016

[fix] #10, memory leak in Python 2 when passing country/language

771d545

albarrentine closed this as completed Aug 23, 2016
