parse_address leaks memory on Python2.7 when passing language and/or country #10
Comments
Thanks for the report - was able to reproduce the issue and committed a fix. Generally, I wouldn't recommend passing country/language as they have no real effect on the current parser. Language technically had a small effect in that it controlled which street-level dictionaries were searched, but that may change the model's decision thresholds, so libpostal now ignores that parameter. Those two options were originally added as a placeholder in case predicting country/language or knowing them a priori was useful in parsing. Intuitively it seemed that must be true, but the global model performed better than one with per-country/per-language parameters. They may reappear at some point, but more for the purpose of training smaller country-specific/language-specific parser models, in which case they'd be used to select the appropriate model.
Thank you for the quick resolution and the explanation. These arguments seem ... slightly underdocumented ;)
Totally right about documentation. For expand_address the option is actually language code rather than country code (for Germany I suppose it doesn't matter since the two are identical). If the language is known a priori it's a good idea to pass it to expand_address. That should yield better/fewer results, as otherwise a language classifier has to be used to predict the language. For statistically close languages, if the classifier said e.g. "there's a 90% probability that it's German and a 10% probability it's Dutch" then it's possible that both German and Dutch expansions would be used.
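A minimal sketch of passing a known language to expand_address. It assumes the Python binding accepts the language codes through a `languages` keyword taking a list, which is how I understand current pypostal; check your installed version. The helper name `expand_with_known_language` and the `expand` parameter (used to swap in a stand-in when libpostal is not installed) are made up for illustration.

```python
def expand_with_known_language(address, language, expand=None):
    """Expand an address while telling libpostal which language it is in.

    `language` is a language code such as "de", not a country code.
    `expand` is a hypothetical hook that lets a stand-in be injected.
    """
    if expand is None:
        # Imported lazily so the sketch loads without libpostal installed.
        from postal.expand import expand_address as expand
    # Assumption: the binding takes a list of codes via `languages`.
    return expand(address, languages=[language])
```

With a known language the classifier step described above should not be needed, so expansions from a statistically close language should not be mixed in.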
Tested with Python2.7 and Python3.4.
Passing any string value for either language or country in a call to postal.parser.parse_address leaks memory on Python2.7. Unicode vs string does not change it. Garbage collection has no impact. Neither has regular reloading of the Python code modules. I've tried reloading the _postal shared library, but apparently the Python reload machinery does not actually release the .so once loaded, so I gave up on this.
Known workarounds:
use Python3
Only pass the address; never pass a non-empty language or a non-empty country
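The second workaround can be sketched as a thin wrapper that accepts language and country for call-site compatibility but never forwards them. The name `parse_address_no_hints` and the `parse` parameter are hypothetical; only postal.parser.parse_address comes from the report above.

```python
def parse_address_no_hints(address, language=None, country=None, parse=None):
    """Call parse_address with the address only.

    `language` and `country` are accepted and deliberately dropped, so the
    leaking code path is never reached. `parse` is a hypothetical hook that
    lets a stand-in be injected for testing.
    """
    if parse is None:
        # Imported lazily so the sketch loads without libpostal installed.
        from postal.parser import parse_address as parse
    return parse(address)
```

Existing callers that pass language or country can switch to the wrapper without other changes; according to the maintainer's comment, dropping the two arguments should not change parser results.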
We're talking around 40 bytes lost per passed country/language argument, per invocation, starting from two-character str. Empty strings do not leak. Single-character strings do not leak (assuming internal Python optimizations kick in to reuse single-char str instances).
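One way to arrive at a per-call figure like this is to compare the process's peak resident set size before and after many calls. The sketch below is not the reproduction script from this report; the function names are made up, it relies on the Unix-only `resource` module, and it assumes `ru_maxrss` is reported in kilobytes (true on Linux, while macOS reports bytes).

```python
import gc
import resource


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth_per_call(func, iterations=100000, warmup=1000):
    """Estimate the bytes of peak-RSS growth per call of `func`."""
    # Warm up first so one-time allocations are not counted as a leak.
    for _ in range(warmup):
        func()
    gc.collect()
    before = peak_rss_kb()
    for _ in range(iterations):
        func()
    gc.collect()
    after = peak_rss_kb()
    return (after - before) * 1024.0 / iterations


# Usage against the parser (requires libpostal; the address is made up):
# from postal.parser import parse_address
# print(rss_growth_per_call(lambda: parse_address("Hauptstr. 5", language="de")))
```

Because a leak of this size is only tens of bytes per call, the iteration count has to be large for the growth to rise above allocator noise.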
Observed on this system:
Note: the 2.7.6 is Ubuntu's standard Python version. I have seen the same behavior with Python2.7.11 compiled from source (with --enable-unicode=ucs4) on the same machine, and also on a CentOS box.
Libpostal (the C library) is installed system-wide, so both Python versions' postal packages definitely layer onto the same libpostal.so.
Reproduction script (supports both Python2 and Python3)
Example output:
HTH