mb16/geocoder

Geocoder::US

Geocoder::US 2.0 is a software package designed to geocode US street addresses. Although it is primarily intended for use with the US Census Bureau’s free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data.

Geocoder::US 2.0 implements a Ruby interface to parse US street addresses, and perform fuzzy lookup against an SQLite 3 database. Geocoder::US is designed to return the best matches found, with geographic coordinates interpolated from the street range dataset. Geocoder::US will fill in missing information, and it knows about standard and common non-standard postal abbreviations, ordinal versus cardinal numbers, and more.

Geocoder::US 2.0 is shipped with a free US ZIP code data set, compiled from public domain sources.
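As a rough illustration of what "coordinates interpolated from the street range dataset" means, here is a minimal sketch of linear interpolation along an address range. It is not the library's actual code: the `interpolate` helper, the straight two-point segment, and the sample range and endpoints are all assumptions made for the example.

```ruby
# Linearly interpolate a house number along a street segment.
# Simplification (assumption): the segment is a straight line between
# two [lon, lat] endpoints, and numbers are spread evenly along it.
def interpolate(number, from_number, to_number, start_point, end_point)
  span = (to_number - from_number).to_f
  # A single-number range has no extent; place the address at the start.
  fraction = span.zero? ? 0.0 : (number - from_number) / span
  fraction = [[fraction, 0.0].max, 1.0].min # clamp to the segment
  lon = start_point[0] + fraction * (end_point[0] - start_point[0])
  lat = start_point[1] + fraction * (end_point[1] - start_point[1])
  [lon.round(6), lat.round(6)]
end

# Hypothetical range 1600-1698 between two made-up endpoints.
midpoint = interpolate(1649, 1600, 1698, [-77.0400, 38.8980], [-77.0300, 38.8990])
p midpoint
```

House number 1649 sits halfway through the hypothetical range, so the result lands halfway between the two endpoints.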

Spellfix Update

This fork includes several modifications to improve address matching. First, it was found that the full address was included in the match when identifying place names. This meant that if the street name was similar to a city name, that entry might receive a higher score, and the wrong city would be selected. To correct this, the city lookup is now done in reverse order, adding one word at a time until a strong match is found. So an address like “10 Columbia St Silver Spring” would look first for a “Spring” match, then “Silver Spring”, then “St Silver Spring”. Since “Silver Spring” is a valid city in MD, this should provide the strongest match.
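The reverse-order city lookup can be sketched roughly as follows. This is only an illustration: `KNOWN_CITIES`, `trailing_candidates` and `find_city` are hypothetical names, and exact set membership stands in for the fuzzy scoring the geocoder actually performs against its place table.

```ruby
require 'set'

# Hypothetical stand-in for the place names stored in the database.
KNOWN_CITIES = Set["silver spring", "columbia"]

# Build trailing word groups, shortest first:
# "Spring", "Silver Spring", "St Silver Spring", ...
def trailing_candidates(address)
  words = address.split
  (1..words.length).map { |n| words.last(n).join(" ") }
end

# Return the first (shortest) trailing group that is a known city.
def find_city(address, cities = KNOWN_CITIES)
  trailing_candidates(address).find { |c| cities.include?(c.downcase) }
end

p trailing_candidates("10 Columbia St Silver Spring").first(3)
p find_city("10 Columbia St Silver Spring")
```

Because the search starts from the end of the address, “Columbia” in the street name is never considered as the city once “Silver Spring” has matched.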

The second major error that was fixed involved ZIP code searches. The original version would search in the specified ZIP code and, if nothing was found, would then search all ZIP codes matching its first three digits. This is unsatisfactory, since adjacent ZIP codes were often omitted from the search. ZIP codes also change from time to time, so we needed a method to address this problem. Since the location is likely to be nominally correct, we decided to identify the county of the original ZIP code, and then search all ZIP codes in that county for the supplied city. This did a much better job of finding locations where the ZIP code was probably correct at some time in the past.
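A minimal sketch of the county-based fallback is shown below. The two lookup tables are small illustrative samples rather than data taken from the geocoder database, and `fallback_zips` is a hypothetical helper, not a function in this project.

```ruby
# Illustrative sample tables (assumption); the real data lives in SQLite.
ZIP_TO_COUNTY = { "20901" => "24031", "20910" => "24031", "21044" => "24027" }
ZIP_TO_CITIES = {
  "20901" => ["Silver Spring"],
  "20910" => ["Silver Spring"],
  "21044" => ["Columbia"]
}

# Other ZIP codes in the same county as `zip` that serve `city`.
def fallback_zips(zip, city)
  county = ZIP_TO_COUNTY[zip]
  return [] if county.nil?
  matches = ZIP_TO_COUNTY.keys.select { |z|
    z != zip &&
      ZIP_TO_COUNTY[z] == county &&
      ZIP_TO_CITIES.fetch(z, []).include?(city)
  }
  matches.sort
end

p fallback_zips("20901", "Silver Spring")
```

Unlike a three-digit prefix match, this widens the search by county and city, so neighboring ZIP codes with a different prefix would still be considered.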

The last major problem involved spelling errors. The metaphone technique employed does not do an adequate job, even when handling valid variations like “St John Street” versus “Saint John Street”. After looking at various options, it was discovered that SQLite has a spellfix library well suited for address spelling lookups where there are a limited number of errors. This helped identify many streets which formerly went unfound in searches. While this addition proved very helpful, there are still some areas where the match is incorrect. These might be addressed at a future time, but overall the spellfix addition handled a large number of addresses that were formerly poorly matched. I have included a section at the bottom of this document describing how to build and install this library. Note that this version of the project loads the spellfix library near the top of the database.rb file, so it will not work without this addition.

Finally, some score thresholds were added to force additional searches if weak candidates were returned.
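The idea of a score threshold forcing additional searches might look something like the sketch below. The threshold value, the `search_with_fallback` and `top_score` helpers, and the sample strategies are all hypothetical; the actual thresholds and search order used in this fork are not shown here.

```ruby
MIN_SCORE = 0.7 # hypothetical threshold, not taken from the source

# Highest :score in a result list, or 0.0 when the list is empty.
def top_score(results)
  results.map { |r| r[:score] }.max || 0.0
end

# Try each search strategy in turn until one returns a strong candidate.
# Each strategy is a callable returning an array of result hashes.
def search_with_fallback(strategies, min_score = MIN_SCORE)
  best = []
  strategies.each do |strategy|
    results = strategy.call
    best = results if top_score(results) > top_score(best)
    break if top_score(best) >= min_score
  end
  best
end

weak   = lambda { [{:street => "Pensylvania", :score => 0.41}] }
strong = lambda { [{:street => "Pennsylvania", :score => 0.91}] }
unused = lambda { raise "should not be reached" }

found = search_with_fallback([weak, strong, unused])
p found
```

The third strategy is never called, because the second one already returned a candidate above the threshold.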

Synopsis

>> require 'geocoder/us'
>> db = Geocoder::US::Database.new("/opt/tiger/geocoder.db")
>> p db.geocode("1600 Pennsylvania Av, Washington DC")

[{:pretyp=>"", :street=>"Pennsylvania", :sufdir=>"NW", :zip=>"20502",
  :lon=>-77.037528, :number=>"1600", :fips_county=>"11001", :predir=>"",
  :precision=>:range, :city=>"Washington", :lat=>38.898746, :suftyp=>"Ave",
  :state=>"DC", :prequal=>"", :sufqual=>"", :score=>0.906, :prenum=>""}]
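Since geocode returns an array of hashes, a caller will typically pick the highest-scoring entry and format it. The sketch below reuses the sample result shown above so that it runs without a database; `format_match` is a hypothetical helper and uses only some of the address component keys.

```ruby
# Sample result copied from the synopsis above.
results = [{:pretyp=>"", :street=>"Pennsylvania", :sufdir=>"NW", :zip=>"20502",
  :lon=>-77.037528, :number=>"1600", :fips_county=>"11001", :predir=>"",
  :precision=>:range, :city=>"Washington", :lat=>38.898746, :suftyp=>"Ave",
  :state=>"DC", :prequal=>"", :sufqual=>"", :score=>0.906, :prenum=>""}]

# Join the non-empty street components and append city, state, ZIP
# and the coordinates.
def format_match(m)
  parts = [m[:number], m[:predir], m[:street], m[:suftyp], m[:sufdir]]
  street = parts.reject { |s| s.to_s.empty? }.join(" ")
  "#{street}, #{m[:city]}, #{m[:state]} #{m[:zip]} (#{m[:lat]}, #{m[:lon]})"
end

best = results.max_by { |m| m[:score] }
line = best ? format_match(best) : "no match"
puts line
```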

Web service

There is a small example web service included in the gem at lib/geocoder/us/rest.rb. It requires Sinatra to run, and the GEOCODER_DB environment variable must be set. To run it:

GEOCODER_DB=/path/to/geocoder/geocoder.db ruby /path/to/gem/geocoder/lib/geocoder/us/rest.rb

You can then query the GeoCommons geocoder by calling the web service like:

http://localhost:9393/geocode.html?q=303+11th+St+NE,Washington,DC
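Addresses contain spaces and commas, so the query string should be encoded before it is sent. A small sketch using Ruby's standard uri library follows; the host, port and path are taken from the example URL above, and `geocode_url` is a hypothetical helper.

```ruby
require 'uri'

# Build a request URL for the example web service, encoding the address
# as a form parameter (spaces become "+", commas become "%2C").
def geocode_url(address, host = "localhost", port = 9393)
  query = URI.encode_www_form(:q => address)
  "http://#{host}:#{port}/geocode.html?#{query}"
end

url = geocode_url("303 11th St NE,Washington,DC")
puts url
```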

Options

dbtype Geocoder::US::Database.new("/opt/tiger/geocoder.db", {:dbtype => 1})

The dbtype option is used when your database encodes its geometry blobs according to a special format. The first and default value, 1, is for a series of little-endian 4-byte ints (“V*”). The second is for the ‘CVVD*’ format; use option value 2 for this type.
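To show what the “V*” directive does, the sketch below packs two coordinates into a little-endian blob and unpacks them again. The micro-degree scale factor, the lon/lat ordering and the helper names are assumptions made for the example, not details confirmed by this document.

```ruby
SCALE = 1_000_000 # assumption: degrees stored as integer micro-degrees

# "V" unpacks unsigned 32-bit values, so map them back to signed ints.
def to_signed32(n)
  n >= 2**31 ? n - 2**32 : n
end

# Decode a blob of little-endian 4-byte ints into [lon, lat] pairs
# (the pair ordering is an assumption).
def unpack_points(blob)
  values = blob.unpack("V*").map { |n| to_signed32(n) / SCALE.to_f }
  values.each_slice(2).to_a
end

# Build a sample blob from one hypothetical point.
blob = [-77037528, 38898746].pack("l<*")
points = unpack_points(blob)
p points
```

The sign conversion matters for US data, since western longitudes are negative.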

debug Geocoder::US::Database.new("/opt/tiger/geocoder.db", {:debug => false})

cache Geocoder::US::Database.new("/opt/tiger/geocoder.db", {:cache => 50000})

The cache option is measured in kilobytes and is used to set the SQLite cache size; larger values will trade memory for speed in long-running processes.

Prerequisites

To build Geocoder::US, you will need gcc/g++, make, bash or equivalent, the standard *NIX ‘unzip’ utility, and the SQLite 3 executable and development files installed on your system.

To use the Ruby interface, you will need the ‘Text’ gem installed from rubyforge. To run the tests, you will also need the ‘fastercsv’ gem.

Additionally, you will need a custom build of the ‘sqlite3-ruby’ gem that supports loading extension modules in SQLite. You can get a patched version of this gem from github.com/schuyler/sqlite3-ruby/. Until the sqlite3-ruby maintainers roll in the relevant patch, you will need this version.

NOTE: If you do not have /usr/include/sqlite3ext.h installed, then your sqlite3 binaries are probably not configured to support dynamic extension loading. If not, you must compile and install SQLite from source, or rebuild your system packages. This is not believed to be a problem on Debian/Ubuntu, but is known to be a problem with Red Hat/CentOS.

NOTE: If you do have to install from source, make sure that the source-installed ‘sqlite3’ program is in your path before proceeding (and not the system-installed version), using ‘which sqlite3’. Also, be sure that you’ve added your source install prefix (usually /usr/local) to /etc/ld.so.conf (or its moral equivalent) and that you’ve run /sbin/ldconfig.

Thread safety

SQLite 3 is not designed for concurrent use of a single database handle across multiple threads. Therefore, to prevent segfaults, Geocoder::US::Database implements a global mutex that wraps all database access. The use of this mutex will ensure stability in multi-threaded applications, but incurs a performance penalty. However, since the database is read-only from Ruby, there’s no reason in principle why multi-threaded apps can’t each have their own database handle.

To disable the mutex for better performance, you can do the following:

* Read the following and make sure you understand them:
   * http://www.sqlite.org/faq.html#q6
   * http://www.sqlite.org/cvstrac/wiki?p=MultiThreading
* Make sure you have compiled SQLite 3 with thread safety enabled.
* Instantiate a separate Geocoder::US::Database object for *each* thread
  in your Ruby script, and pass :threadsafe => true to new() to disable mutex
  synchronization.

Per the SQLite 3 documentation, do not attempt to retain a Geocoder::US::Database object across a fork! “Problems will result if you do.”
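The one-handle-per-thread pattern described above can be sketched as follows. So that the example runs without the gem or a database file, it uses a hypothetical `StubDatabase` class; in real use the factory would call Geocoder::US::Database.new with :threadsafe => true, as described above.

```ruby
# Stand-in for Geocoder::US::Database (assumption, for illustration only).
class StubDatabase
  def initialize(path, options = {})
    @path, @options = path, options
  end

  # Return a fake result that records which handle served the query.
  def geocode(address)
    [{:query => address, :handle => object_id}]
  end
end

# In real use this would be:
#   Geocoder::US::Database.new("/opt/tiger/geocoder.db", :threadsafe => true)
DB_FACTORY = lambda { StubDatabase.new("/opt/tiger/geocoder.db", :threadsafe => true) }

# Lazily create one database handle per thread and reuse it afterwards.
def thread_local_db
  Thread.current[:geocoder_db] ||= DB_FACTORY.call
end

addresses = ["1600 Pennsylvania Av, Washington DC", "303 11th St NE, Washington DC"]
threads = addresses.map { |a| Thread.new { thread_local_db.geocode(a) } }
results = threads.map(&:value)
handles = results.map { |r| r.first[:handle] }
puts "distinct handles: #{handles.uniq.length}"
```

Each thread ends up with its own handle, so no handle is shared across threads.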

Building Geocoder::US

Unpack the source and run ‘make’. This will compile the SQLite 3 extension needed by Geocoder::US, the Shapefile import utility, and the Geocoder-US gem.

You can run ‘make install’ as root to install the gem systemwide.

Generating a Geocoder::US Database

Build the package from source as described above. Generating the database involves three basic steps:

  • Import the Shapefile data into an SQLite database.

  • Build the database indexes.

  • Optionally, rebuild the database to cluster indexed rows.

We will presume that you are building a Geocoder::US database from TIGER/Line, and that you have obtained the complete set of TIGER/Line ZIP files, and put the entire tree in /opt/tiger. Please adjust these instructions as needed.

A full TIGER/Line database import takes ten hours to run on a normal Amazon EC2 instance, and takes up a little over 5 gigabytes after all is said and done. You will need to have at least 12 gigabytes of free disk space after downloading the TIGER/Line dataset, if you are building the full database.

Import TIGER/Line

From inside the Geocoder::US source tree, run the following:

$ build/tiger_import /opt/tiger/geocoder.db /opt/tiger

This will unpack each TIGER/Line ZIP file to a temporary directory, and perform the extract/transform/load sequence to incrementally build the database. The process takes about 10-12 hours on a normal Amazon EC2 instance, or about 5 CPU hours flat out on a modern PC. Note that not all TIGER/Line source files contain address range information, so you will see error messages for some counties, but this is normal.

If you only want to import specific counties, you can pipe a list of TIGER/Line county directories to tiger_import on stdin. For example, the following will install just the data for the state of Delaware:

$ ls -d /opt/tiger/10_DELAWARE/1* | build/tiger_import ~/delaware.db

The tiger_import process uses a binary utility, shp2sqlite, which is derived from shp2pgsql, which ships with PostGIS. The shp2sqlite utility converts .shp and .dbf files into SQL suitable for import into SQLite. This SQL is then piped into the sqlite3 command line tool, where it is loaded into temporary tables, and then a set of static SQL statements (kept in the sql/ directory) are used to transform this data and import it into the database itself.

Build metaphones using Ruby metaphone

Run the following:

$ bin/rebuild_metaphones /opt/tiger/geocoder.db

This creates the metaphones using Ruby’s metaphone function and will produce better geocoding results.

Build the indexes

After the database import is complete, you will want to construct the database indexes:

$ build/build_indexes /opt/tiger/geocoder.db

This process takes 25 minutes on an EC2 instance (8 CPU minutes), but it’s a lot faster than building the indexes incrementally during the import process. Basically, this process simply feeds SQL statements to the sqlite3 utility to construct the indexes on the existing database.

Cluster the database tables (optional)

As a final optional step, you can cluster the database tables according to their indexes, which will make the database smaller, and lookups faster. This process will take an hour or two, and may be a micro-optimization.

$ build/rebuild_cluster /opt/tiger/geocoder.db

You will need as much free disk space to run rebuild_cluster as the database takes up, because the process essentially reconstructs the database in a new file, and then it renames the new database over top of the old.

Running the unit tests

From within the source tree, you can run the following:

$ ruby tests/run.rb

This tests the libraries, except for the database routines. If you have a database built, you can run the test harness like so:

$ ruby tests/run.rb /opt/tiger/geocoder.db

The full test suite may take 30 or so seconds to run completely.

SQLite Spellfix Install

(Note: the paths below are optional, but are provided to help prevent errors.)

Create an sqlite directory in /opt, then change into it:

cd /opt/sqlite

Download the source code into this directory. (There is likely a newer version of the source code, but testing was done with this version.)

sudo wget http://www.sqlite.org/2013/sqlite-src-3071700.zip

sudo unzip sqlite-src-3071700.zip

cd sqlite-src-3071700

cd ext

cd misc

The spellfix file will not compile as it stands in this version. Therefore we need to make several modifications to the C code.

Noteworthy links: www.sqlite.org/c3ref/c_abort_rollback.html (extended result codes: SQLITE_CONSTRAINT_NOTNULL) and sqlite.1065341.n5.nabble.com/SQLite-version-3-7-16-td67776.html (see the middle of the page for the missing init function).

Edit spellfix.c, and add at the top:

#define SQLITE_CONSTRAINT_NOTNULL      (SQLITE_CONSTRAINT | (5<<8))

static int spellfix1Register(sqlite3 *db); // need prototype for this function since it is at the bottom.

int sqlite3_extension_init(sqlite3 *db, char ** pxErrMsg, const sqlite3_api_routines *pApi)
{ SQLITE_EXTENSION_INIT2(pApi); return spellfix1Register(db); }

int sqlite3_stricmp(const char *zLeft, const char *zRight)
{ return strcasecmp(zLeft, zRight); }

Noteworthy link for compiling: stackoverflow.com/questions/18421465/compiling-spellfix-for-sqlite3

Execute this command to compile the library. (Note: we leave .so off the library name, since we need to load it in sqlite, and sqlite treats “.” as a special character.)

gcc -shared -fPIC -Wall -I/opt/sqlite/sqlite-src-3071700/ spellfix.c -o spellfix1

Now start the sqlite binary and open the database where all the data was loaded.

sqlite3 geocoder.db

From the sqlite command line, run the following to load the new spellfix1 library. (Note: fix the path if necessary.)

.load /opt/sqlite/sqlite-src-3071700/spellfix1  (or path to spellfix1)

Noteworthy link: www.sqlite.org/spellfix1.html (spellfix documentation)

CREATE VIRTUAL TABLE feature_spellfix1 USING spellfix1;

(If this produces the error ‘unable to open database’, make sure the database file has write permissions; the directory must also be writable.)

INSERT INTO feature_spellfix1(word, langid) SELECT street, zip FROM feature;

(This uses langid as the ZIP code field.)
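Once the table is populated, it can be queried for close spellings of a street name within one ZIP code. The sketch below only builds the SQL and bind values. The `word`, `distance`, `score`, `langid` and `top` columns come from the spellfix1 documentation, while the `spellfix_query` helper is hypothetical and is not necessarily how database.rb performs the lookup.

```ruby
# Build a spellfix1 lookup restricted to one ZIP code (stored as langid).
# The table name is interpolated directly, so it must be a trusted value.
# Note (assumption): langid is an integer, so the ZIP is converted with
# to_i and any leading zeros are dropped.
def spellfix_query(table, street, zip, top = 5)
  sql = "SELECT word, distance, score FROM #{table} " \
        "WHERE word MATCH ? AND langid = ? AND top = ?"
  [sql, [street, zip.to_i, top]]
end

sql, binds = spellfix_query("feature_spellfix1", "Pensylvania", "20502")
puts sql
p binds
```

With the sqlite3 gem, a pair like this could be passed to the database handle's execute method after the spellfix1 extension has been loaded.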

Note: the file database.rb contains the path to the spellfix library, so if the path is different on your system, that Ruby file needs updating.

License

Geocoder::US 2.0 was based on earlier work by Schuyler Erle on a Perl module of the same name. You can find it at search.cpan.org/~sderle/.

Geocoder::US 2.0 was written by Schuyler Erle, of Entropy Free LLC, with the gracious support of FortiusOne, Inc. Please send bug reports, patches, kudos, etc. to patches at geocoder.us.

Copyright © 2009 FortiusOne, Inc.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
