TSearch dictionaries for addresses
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
LICENSE.md
META.json
Makefile
README.md
addressing_dictionary--1.0--1.1.sql fix to have an upgrade path from 1.0 to 1.1 Mar 31, 2015
addressing_dictionary--1.1.sql
addressing_dictionary.control
addressing_en.stop
addressing_en.syn
addressing_en.ths
addressing_fr.stop
addressing_fr.syn
addressing_fr.ths

README.md

TSearch Address Dictionaries

Motivation

Trying to use full-text indexing for address data with the default language configurations doesn't work well: if you use an 'english' configuration, you end up getting street names stemmed by the english dictionary. Using the 'simple' configurations is better but still not optimal: there is no handling of obvious abbreviations, so you might fail to match "12 N Oak st" to "12 North Oak Street", or vice versa.

This extension provides an 'addressing_en' configuration that has common (Canada Post enumerated) street type abbreviations, direction tokens (n, s, e, w, etc) and numeric street variants (first/1st) handled. There's sure to be piles more, as well as many cases that cannot be handled with a simple full-text tokenizing strategy (st = street or saint?). However, using full-text features for basic geocoding is too convenient to ignore.

Using full-text indexing for nationwide address data might still be a bad idea, since there are so many possible false or multiple matches to find, but for a single city, county or even state, this approach can generate a quick'n'dirty address lookup routine perfect to tying to an autocomplete form field on a web page.

Currently there is only support for English (really, North American English) addressing -- pull requests gratefully received.

Installation

Clone the repository, ensure that pg_config is on your path, and run make install to copy the dictionary files into place.

Examples

> CREATE EXTENSION addressing_dictionary;

> SELECT to_tsvector('addressing_en', '1234 n main st');

                 to_tsvector                  
----------------------------------------------
 '1234':1 'main':4 'n':2 'north':3 'street':5

Caveats

If using v1.1, then cardinal directions are tokenized as both the single letter abbreviation, and the word. This is to avoid ambiguity. See example above.