Be notified of new releases
Create your free GitHub account today to subscribe to this repository for new releases and build software alongside 36 million developers.Sign up
This release adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.
Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes. In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process to identify candidate dupes, and can be thought of as the blocking function in record linkage (this is a highly selective version of a blocking function) or as a form of locally sensitive hashing in the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library:
Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers where "Apt 2" and "#2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses. For street names, we additionally strip whitespace from the root expansion so "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".
Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might written in multiple scripts in different data sets e.g. Arabic or Japanese.
Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid faultlines) are used to narrow down the comparisons to addresses/places in a similar location. If there's no lat/lon, and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, and e.g. an NYC address that uses a neighborhood name in place of the city e.g. "Harlem, NY" could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.
Acronym generation: when generating hash keys for names, we also try to guess acronyms when there are two or more tokens. While the acronym alignments downstream can be more exact, there are cases like "Brooklyn Academy of Music" vs "BAM" which would not match a single token using the previous methods, but are the same place nonetheless. Acronym generation attempts to solve this problem for venue names, human initials, etc. It is stopword-aware in various languages, but generates multiple potential forms of the acronym for cases like "Museum of Modern Art" vs. "MoMA" where the stopword is included in the acronym. The method also includes rudimentary support for sub-acronyms i.e. generating new tokens at every other non-contiguous stopword (so "University of Texas at Austin" will produce "UT"), as well as at punctuation marks like commas or colons (e.g. "University of Texas, Austin"). Acronym generation also leaves intact any known dictionary phrases that resolve to multi-word phrases so e.g. "Foo High School" and "Foo HS" can share a hash key, and also keeps any acronyms identified as part of the tokenizer from the internal period structure like "A.B.C." All of this also works across scripts for any non-ideographic languages. Though this method will not cover all valid acronyms directly, we use a double metaphone on the output in Latin scripts so with certain types of more complex acronyms like "sonar" vs. "Sound Navigation and Ranging", where more than the initial character of each word is used, the generated acronym may still be phonetically the same as the real acronym (in this case, both resolve to "SNR"), especially if the other letters used are vowels.
Name quad-grams: in order to handle a variety of spelling differences, as well as languages with longer concatenated words as in German, all of the name hashes mentioned above are converted in to unique sequences of 4 characters. Simple numeric tokens consisting of digits and hyphens, etc. are included as-is, without the quadgrams or double metaphone encoding. For the words that do use quadgrams, there is no token boundary disambiguation i.e. the double metaphone for "Nationalgalerie" would be "NXNLKLR" which would then generate strings
["NXNL", "XNLK", "NLKL", "LKLR"](as opposed to the behavior of our languages classifier features which would generate
["NXNL_", "_XNLK_", "_NLKL_", "_LKLR"], using the underscore to indicate whether the 4-gram is at the beginning, middle, or end of the word). Since the fully concatenated name without whitespace is also double metaphone encoded and hashed to 4-grams in this fashion, it should also account for variance in the phonetic output resulting from spacing/concatenation differences (vowels at the beginning of a word are usually converted to "A" in double metaphone, whereas they're ignored in the middle of a word). Overall, though this method does slightly reduce the precision of near-dupe hashing (causes us to identify more potential dupes and thus make more pairwise comparisons in the next stage), it also increases recall of the process (we don't miss as many true dupes).
Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining if they're duplicates. Each relevant address component has it own function, with certain logic for each, including which libpostal dictionaries to use, and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English whereas we wouldn't want to make that kind of assumption for street names e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.
The duplicate functions return one of the following values:
The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.
Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).
Since identifying the correct language is crucial to effective matching, and individual components like
unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.
Fuzzy deduping for names
For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (see this often with human names, which can sometimes be listed as Last, First Middle), and removing tokens that may not be ignorable in terms of libpostal's dictionaries but are very common, or very common in a particular geography.
In this release, we build on the idea of Soft-TFIDF, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method), with global corpus statistics (TFIDF weights or other similar term weights, supplied by the user in our case, see the lieu project for details on constructing indices based on TFIDF or information gain (performs better in the venue deduping setting), and their per-geo variants.
Here's how it works:
for strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similarity threshold. This allows for small spelling mistakes in the individual words and also makes the method invariant to word-order.
given a vector of scores for each string, the final similarity is, for each token t1 in s1 and its closest match t2 in s2 (if local_sim >= theta): local_sim * scores1[t1] * scores2[t2]. In our case, the final fuzzy dot product is then normalized by the product of the L2 norms of each score vector, so it can be thought of as a form of soft cosine similarity. The two main weighting schemes are information gain and TFIDF. Using TFIDF means that rare words are given more weight in the similarity metric than very common words like "Salon" or "Barbershop". Using information gain means terms are weighted by relevance i.e. how much the given word tells us about other words, and can be thought of as running a mini feature selection algorithm or organizing the branches of a small decision tree for each string. Overall information gain tends to widen the gap between high-information, core name words and simple low-information words which can be safely ignored if the rest of the tokens match. Finally, it's possible to give all words equal weight using a uniform distribution (give each token a weight of 1 / # of tokens), and this is the approach we take for fuzzy street name comparisons.
Assuming the chosen token scores add up to 1, the total similarity score for the string that's between 0 and 1, and there are user-specified thresholds for when to consider the records as various classes of dupes. The default threshold is 0.9 for likely dupes and 0.7 for needs_review, but may be changed depending on tolerance for false positives.
Note: for the lieu project we use a linear combination of information gain (or TFIDF) and a geo-specific variant score where the index is computed for a specific, roughly city-sized geohash tile, and smaller tiles are lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match because the words "San Francisco" are very common in the region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.
Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score:
Jaro-Winkler similarity: this is a string similarity metric developed for comparing names in the U.S. Census. It detects small spelling differences in words based on the number of matches and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words with a shared common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly-cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. Note: all of our string similarity methods use unicode characters rather than bytes in their calculations.
Damerau-Levenshtein distance: the traditional edit distance metric, where transpositions of two characters count as a single edit. If a string does not meet the Jaro-Winkler threshold, but has a maximum edit distance of 1 (could be that the first character was transposed), and a minimum length of 4 (many short strings are within edit distance 1 of each other, so don't want to generate too many false positives). Note: since distances and similarities are not on the same scale, we use the Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally used for alignment of DNA sequences, but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-". In other words, we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. In the original Smith-Waterman, O(mn) time and space was required to compute this alignment (where m is the length of the first string and n is the length of the second). Gotoh's improvement still needs O(mn) time and O(m) space (where m is the length of the longer string), but it does not store the sequence of operations, only a single cost where each type of edit pays a particular penalty, where the affine gap penalty is the idea that we should pay more for opening a gap than extending it. The problem with the single cost is it's not always clear what to make of that single combined score. The new method we use in libpostal stores and returns a breakdown of the counts and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return or compute the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method we know that for "Service" and "Svc", the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, and the two share a common prefix of "S", so "Svc" can be considered a possible abbreviation for "Service". When we find one of these possible abbreviations, and none of the other thresholds are met (which can easily happen with abbreviations), it qualifies both tokens for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation. For strict abbreviations (match the criteria for possible abbreviations and also share a common suffix e.g. "Festival" and "Fstvl") that are greater than a certain length, an optional static score is used (if it's higher than the Jaro-Winkler score), so that cases that really look like abbreviations can be weighted as more similar than the Jaro-Winkler score might indicate. We also use the square of the larger of the two term scores in place of their product, since we're effectively considering these words an exact match, and we calculate an appropriate offset for the lower-scoring vector's L2 norm to compensate.
Acronym alignments: especially prevalent in universities, non-profits, museums, government agencies, etc. We provide a language-based stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the above are included in the similarity score with a 1.0 local similarity (so those tokens' scores will be counted as evidence for a match, not against it). For cases like "AB" vs. "A B" where the scores may be significantly different between the combined version and the separate single letters version, we take the greater of a) the acronym score in the shorter string or b) the L2 norm of the individual words for the longer string, and again use an offset for the norm of the lower-scoring vector.
Known-phrase alignments: similar to acronym alignment, libpostal's dictionaries are used at deduping time as well, and it two strings contain known phrases with the same canonical, those phrase spans are given the maximum similarity (product of L2 norms of both).
Multi-word phrase alignments for spacing variations: again similar to acronym alignment, but for handling spacing variations like "de la Cruz" vs. "dela Cruz" or "Sea Grape Dr" vs. "Seagrape Dr" as exact matches. In a token-based approach, the first word in a phrase might match on Jaro-Winkler similarity alone, but the subsequent words in the string with spaces would count against the similarity value. Having multiple words align to their concatenated form pays no penalty for the whitespace.
Match demotion for differing sets of initials: depending on the term weighting scheme used and the type of names in the corpus, the weights for single letters (found in human initials) may have a low enough weight that they don't affect the similarity much. This is especially true for information gain, where a single letter like "A" or "B" may cooccur with many different names and end up conveying no more information than a common stopword like "the" or "of". When this is true, "A & B Jewelry" vs. "B & C Jewelry" might generate a false positive because the scores for "A" and "C" are so low compared to "Jewelry" that they are ignored. In cases where there's a likely dupe and, given sets of single letter tokens S1 and S2 in the two strings respectively, we demote the match to needs_review if there is a symmetric difference and both sides (S1 - S2 and S2 - S1) are non-empty. Note: the typical case of human names with/without a middle initial will still match under this heuristic i.e. "Yvette Clarke" matches "Yvette D Clarke" because the initial only exists in one string and there is no conflict. However, something like "J Dilla" vs. "K Dilla" would be classified as needing human review.
The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF/information gain scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.
The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.
For fuzzy deduplication of street names, we implement a special case of "soft set containment". Here, the Soft-TFIDF/Soft-Information-Gain method reports the number of tokens that were matched (where matched means the soft notion of equality under local similarity functions, minus acronym alignments and the sequence alignment method so i.e. only Jaro Winkler and Levenshtein). When one set of tokens is contained within the other i.e. the number of matched tokens is equal to the number of tokens in the shorter of the two strings, we classify the pair as a likely duplicate regardless of the similarity value. This allows e.g. "Park" and "Park Ave" to match without allowing "Park Ave" and "Park St" to match (though they would be classified as needing review under the exact dupe method since they share a root expansion).
The great parser-data merge is complete. Libpostal 1.0 features a better-than-ever international address parser which achieves 99.45% full-parse accuracy on held-out addresses. The release title is a reference to the TV show (libpostal was also created in Brooklyn and this was the first version of the model to surpass 99% accuracy). Check out the blog post for the details. Here's a sample of what it can do in a GIF:
Breaking API Changes
- Every function, struct, constant, etc. defined in the public header (libpostal.h) now uses a "libpostal_" prefix . This affects all bindings that call the C API. The bindings that are part of this Github org all have 1.0 branches.
- unit: an apartment, unit, office, lot, or other secondary unit designator
- level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
- staircase: numbered/lettered staircase
- entrance: numbered/lettered entrance
- po_box: post office box: typically found in non-physical (mail-only) addresses
- category: for category queries like "restaurants", etc.
- near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"
New admin tags
- island: named islands e.g. "Maui"
- country_region: informal subdivision of a country without any political status
- world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”
No more accent-stripping/transliteration of input
There's a new transliterator which only makes simple modifications to the input (HTML entity normalization, NFC unicode normalization). Latin-ASCII transliteration is no longer used at runtime. Instead, addresses are transliterated to multiple forms during training so the parser has to deal with all the variants rather than normalizing to a single variant (which previously was not even correct in cases like Finnish, Turkish, etc.) in both places.
Trained on > 1 billion examples in every inhabited country on Earth
The training data for libpostal's parser has been greatly expanded to include every country and dependency in OpenStreetMap. We also train on a places-only data set where every city name from OSM gets some representation even if there are no addresses (higher-population cities get examples proportional to their population). A similar training set is constructed for streets, so even places which have very few addresses but do have a road network in OSM can be included.
1.0 also moves beyond OSM, training on most of the data sets in OpenAddresses, and postal codes + associated admins from Yahoo's GeoPlanet, which includes virtually every postcode in the UK, Canada, etc.
Almost 100GB of public training data
All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted like:
- formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
- formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
- formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
- geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
- openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
- uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK
If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.
Better feature extraction
- n-grams for the "unknown" words (occurred fewer than n times in the training set)
- for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their ngrams otherwise
- an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode).
- for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct it will get the entire string correct.
More powerful machine learning model (CRF)
libpostal 1.0 uses a Conditional Random Field (CRF) instead of the greedy averaged perceptron. This more powerful machine learning method scores sequences rather than individual decisions, and can revise its previous decision if that would help a subsequent token score higher (Viterbi inference).
Improves upon the CRFsuite implementation in terms of:
- performance: Viterbi inference sped up by 2x
- scalability: training set doesn't need to fit in memory
- model expressiveness: libpostal's CRF adds state-transition features which can make use of both the state of the current token and the previous tag. These act just like normal features except their weights are LxL matrices (tags we could have transitioned from by tags we could transition to) instead of L vectors.
FTRL-Proximal optimization for the language classifier
The language classifier now uses a multinomial version of Google's FTRL-Proximal method, which uses a combination of L1 and L2 regularization, inducing sparsity while maintaining high accuracy. This results in a model that is more accurate than the previous classifier while being 1/10th the size. The runtime classifier is now able to load either sparse or dense weights depending on the file header.
This release adds functions for configuring the datadir at runtime.
- libpostal_setup_datadir(char *dir)
- libpostal_setup_language_classifier_datadir(char *dir)
- libpostal_setup_parser_datadir(char *dir)
Merged a few of the commits from the parser-data branch of libpostal into master to fix address parser training from the master branch.
Coincides with the release of some of the parser training data generated in the parser-data branch:
- OSM training addresses (27GB, ODBL)
This is (a much-improved version of) the original data set used to train libpostal.
- OSM formatted place names/admins (4GB, ODBL)
Helpful for making sure all the place names (cities, suburbs, etc.) in a country are part of the training set for libpostal, even if there are no addresses for that place.
- GeoPlanet postal codes with admins (11GB, CC-BY)
Contains many postal codes from around the world, including the 1M+ postcodes in the UK, and their associated admins. If training on master, this may or may not help because it still relies pretty heavily on GeoNames for postcodes.
- OpenAddresses training addresses (30GB, various licenses)
By far the largest data set. It's not every source from OpenAddresses, just the ones that are suitable for ingestion into libpostal. It's heavy on North America but also contains many of the EU countries. Most of the sources only require attribution, some have share-alike clauses. See openaddresses.io for more details.
Users are encouraged to QA the data for problematic patterns, etc. Note: while it's possible now to train per-country/language parsers using slices of the data, there will be no support offered for custom parsers.
Release is named after the largest train station the world: https://en.wikipedia.org/wiki/Nagoya_Station.
- Data download script separates S3 files into 64MB chunks (as used in awscli) and uses a process pool
- Various dictionary updates submitted by users
Initial release of libpostal as described in the introductory blog post.