Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expansion fails with some addresses #351

Closed
rinigus opened this issue May 20, 2018 · 5 comments
Closed

Expansion fails with some addresses #351

rinigus opened this issue May 20, 2018 · 5 comments

Comments

@rinigus
Copy link
Contributor

rinigus commented May 20, 2018

Hi,

I am working on incorporating the latest libpostal version into my geocoder. Changes of API from .37 to 1.0 are accounted for and it seems to work on mobile as well, after splitting address parser into country based datasets (#132).

As a part of the geocoder data import, I run all address parts through libpostal_expand_address, as in https://github.com/rinigus/geocoder-nlp/blob/master/importer/src/main.cpp#L537. As a result, all addresses from the planet are expanded and, later, used in the search. You could also consider it as a test for expansion code of libpostal. So, I am hitting some examples which fail.

Let's start with the first one:

Somewhere in Cyprus, Russian word is used leading to hanging of expansion function when used with the datasets provided by libpostal (not country-based ones):

./libpostal Хозтовары

Backtrace:

#0  0x00007ffff74152f9 in trie_get_node (index=<optimized out>, self=0x5555558122a0) at trie.c:109
#1  trie_get_prefix_from_index (self=self@entry=0x5555558122a0, key=key@entry=0x5555558110d0 "ы", len=len@entry=2, start_index=start_index@entry=1113220, tail_pos=tail_pos@entry=0)
    at trie.c:803
#2  0x00007ffff740b955 in next_prefix_or_set (trie=trie@entry=0x5555558122a0, str=str@entry=0x5555558110d0 "ы", len=len@entry=2, last_result=..., in_set=false, check_set_only=true)
    at transliterate.c:120
#3  0x00007ffff740e995 in transliterate (trans_name=trans_name@entry=0x555555811310 "russian-latin-bgn", str=0x5555558110c0 "Хозтовары", 
    str@entry=0x7fffffffdc36 "Хозтовары", len=18) at transliterate.c:855
#4  0x00007ffff74208c4 in normalize_string_languages (str=str@entry=0x7fffffffdc36 "Хозтовары", options=options@entry=575, num_languages=num_languages@entry=6, 
    languages=languages@entry=0x5555558111f0) at normalize.c:315
#5  0x00007ffff74096a5 in expand_address_phrase_option (input=0x7fffffffdc36 "Хозтовары", options=..., n=0x7fffffffd640, phrase_option=phrase_option@entry=EXPAND_PHRASES)
    at expand.c:1556
#6  0x00007ffff74099d7 in expand_address (input=<optimized out>, options=..., n=<optimized out>) at expand.c:1619
#7  0x00007ffff7401bdd in libpostal_expand_address (input=<optimized out>, options=..., n=<optimized out>) at libpostal.c:51
#8  0x000055555557b3ad in print_output (use_json=false, options=..., address=0x7fffffffdc36 "Хозтовары") at main.c:19
#9  main (argc=<optimized out>, argv=<optimized out>) at main.c:94

By running it further, I get similar backtraces. The memory gets allocated more and more (could get to 10+GB RAM, have also seen 100+GB RAM) until its killed by the kernel.

When using country specific datasets (CY or RU), all works nicely and gets normalized as expected.

I am using 43795a3 version, but I think it would be similar issue for 1.0.0 as well. Running on Linux, Gentoo.

PS: I was checking out the awesome libpostal, and saw something that could be improved <--- could not agree more :)

@rinigus
Copy link
Contributor Author

rinigus commented May 20, 2018

Correction - it fails when using country-based database as well. There seem to be cases when it does and when it doesn't. So, while the issue is there, it maybe not resolved by using smaller database for address parser.

@rinigus rinigus changed the title Expansion fails with some addresses when using global address parser Expansion fails with some addresses May 20, 2018
@rinigus
Copy link
Contributor Author

rinigus commented May 20, 2018

Just confirmed it on Ubuntu 16.04 as well. From further experimentation and looking at the log (LOG_LEVEL_DEBUG), it looks to be locked in transliterate. Let me know if you want to see the log as well, its quite large and gets populated quickly. It maybe easier to generate it onsite if you can reproduce the issue

@rinigus
Copy link
Contributor Author

rinigus commented May 24, 2018

Since I didn't have this issue earlier, I looked into recent commits of transliteration.c and found that reversal of 2290b09 (for that file) fixes the problem.

Without fix:

src/libpostal "Хозтовары" hangs

src/libpostal "Хозтовары " works and produces:

хозтовары
khoztovary
khoztovarы
hoztovarы
hoztovary

With "fix"

src/libpostal "Хозтовары" works and produces the same as above.

Same issue is with expansion of "Sopruse sild / Нарвскии мост Дружбы" that was hanging before "fix".

Hence the question, what was the reason for adding state = start_state; and how could we fix this issue.

@kidmeier
Copy link

kidmeier commented Jul 5, 2018

I don't really understand this code, but working backwards from the logs "at&t", "at&t" and "Хозтовары" I came up with the following patch:

diff --git a/src/transliterate.c b/src/transliterate.c
index 7f5fff7..83e502b 100644
--- a/src/transliterate.c
+++ b/src/transliterate.c
@@ -938,7 +938,9 @@ char *transliterate(char *trans_name, char *str, size_t len) {
                     if (context_no_match && !prev_state.empty_transition && prev_state.phrase_len > 0) {
                         log_debug("Previous phrase stays as is %.*s\n", (int)prev_state.phrase_len, str+prev_state.phrase_start);
                         char_array_cat_len(new_str, str + prev_state.phrase_start, prev_state.phrase_len);
-                        state = start_state;
+                        if (match_candidate_state.state != TRANS_STATE_PARTIAL_MATCH) {
+                            state = start_state;
+                        }
                     }
                     
                     if (state.state == TRANS_STATE_BEGIN && !prev_state.empty_transition) {

Test suite passes and breaks the infinite loop. Beyond that, I have no idea if this is correct as I barely understand the code I just changed.

@rinigus
Copy link
Contributor Author

rinigus commented Jul 6, 2018

@kidmeier - thank you very much! Have the same issue as you - I don't really understand the code. Let's see if @albarrentine can get a look at it and tell if its a right approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants