expand_address hangs with certain strings, with invalid UTF-8 warnings #448

@nitinvijay94

Hi

Firstly, thanks for all the work you have done. In order to avoid fluff, I'll be posting the context serially.

  1. United States
  2. I'm using Libpostal to dedupe addresses within the Hadoop ecosystem (Hive on Tez).
  3. I have a fairly large set of over 200 million addresses, a sizeable chunk of which are human-entered values. Given the nature of my data, I have encountered a few cases that cause the expand_address function to hang and stall my job.

a) The most baffling case.

>>> expand_address(u'5-19�� Nakamachi')
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None

This looks like an entirely ASCII string, yet it halts the program. Using parse_address also emits the warnings, but continues gracefully.

>>> parse_address(u'5-19�� Nakamachi')
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: None
[(u'5-19&#56256 &#56321 Nakamachi', u'house')]
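For what it's worth, the `&#56256` and `&#56321` entities in the output above are decimal code points (0xDBC0 and 0xDC01) that fall in the UTF-16 surrogate range, which suggests the input is not pure ASCII after all. Below is a minimal Python 3 sketch for spotting such characters. The helper name `non_ascii_code_points` is made up for illustration, and the sample string is an assumption rebuilt from those two entities, not the verbatim input.

```python
def non_ascii_code_points(text):
    """Return (index, code point, is_surrogate) for every non-ASCII character."""
    return [
        (i, ord(ch), 0xD800 <= ord(ch) <= 0xDFFF)
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Assumed sample: the two code points are taken from the parse_address output.
sample = "5-19" + chr(56256) + chr(56321) + " Nakamachi"

for index, code_point, is_surrogate in non_ascii_code_points(sample):
    kind = "surrogate" if is_surrogate else "non-ASCII"
    print(index, hex(code_point), kind)
```

Running a check like this over the raw data before it reaches libpostal would at least show which rows carry hidden surrogates.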

b) Address: "No. \uD835\uDFE3\uD835\uDFE3"
This looks like "No. 11". It works fine using pypostal; however, it similarly halts the program when using jpostal. My guess is that this has something to do with the C interface's GetStringUTFChars not working well with 4-byte UTF-8 characters, since Java converts its internal UTF-16 String to a Modified UTF-8 format.
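To illustrate the guess about Modified UTF-8, here is a rough Python 3 sketch that encodes the same address both ways. The function `modified_utf8` is a hand-written approximation of the encoding documented for JNI's GetStringUTFChars (NUL as `C0 80`, supplementary characters as two 3-byte surrogates); it is an illustration, not code taken from jpostal or libpostal, and it does not prove this is what triggers the hang.

```python
def modified_utf8(text):
    """Approximate Java's Modified UTF-8 encoding of a Python string."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is written as an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Supplementary characters are split into a surrogate pair,
            # and each surrogate is encoded as its own 3-byte sequence.
            cp -= 0x10000
            high = chr(0xD800 + (cp >> 10))
            low = chr(0xDC00 + (cp & 0x3FF))
            out += (high + low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# U+1D7E3 is the character written as \uD835\uDFE3 in the address above.
address = "No. \U0001D7E3\U0001D7E3"
standard = address.encode("utf-8")
modified = modified_utf8(address)

print(standard.hex())  # each digit is one 4-byte sequence: f0 9d 9f a3
print(modified.hex())  # each digit is two 3-byte sequences: ed a0 b5 ed bf a3

try:
    modified.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    # A strict UTF-8 decoder rejects encoded surrogates.
    strict_decode_ok = False
print("strict UTF-8 accepts Modified UTF-8 bytes:", strict_decode_ok)
```

If the bytes handed to the C side look like the second form, a strict UTF-8 reader would see six invalid sequences per address, which would be consistent with the "invalid UTF-8" warnings.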

These cases are rare, but they can block whole processes, which makes them problematic. Is there some way we can have this function exit gracefully in case of UTF-8 parsing errors?
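Until something like that exists in the library, one possible caller-side workaround is to clean each string before it reaches expand_address. The following is only a Python 3 sketch under stated assumptions: `sanitize_address` and `safe_expand` are hypothetical helpers, dropping surrogates and applying NFKC normalization are my own choices and do alter the input, and the expansion function is passed in as a plain callable so the sketch does not depend on any particular binding.

```python
import unicodedata


def sanitize_address(text):
    """Return a string that encodes as strict UTF-8, or None if nothing is left."""
    # Drop surrogate code points, which strict UTF-8 cannot represent.
    cleaned = "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)
    # NFKC folds compatibility characters, e.g. U+1D7E3 becomes a plain "1".
    cleaned = unicodedata.normalize("NFKC", cleaned)
    # Sanity check: this must not raise once surrogates are gone.
    cleaned.encode("utf-8")
    cleaned = cleaned.strip()
    return cleaned or None


def safe_expand(text, expand):
    """Call expand(cleaned_text), or return [] when the input is unusable.

    `expand` is any callable taking a str, for example pypostal's
    expand_address (assumed here, not imported).
    """
    cleaned = sanitize_address(text)
    if cleaned is None:
        return []
    return expand(cleaned)


# Stand-in for the real expansion function, just to show the call shape.
print(safe_expand("No. \U0001D7E3\U0001D7E3", lambda s: [s.lower()]))
```

This does not fix the hang itself; it only keeps the problematic inputs from reaching the library, at the cost of silently modifying some addresses.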

Thanks,
Nitin
