Mapnik vector tile-based geocoder with support for swappable data sources
JavaScript

5.1.1

latest commit 796e840f0f
@yhahn yhahn authored

README.md

carmen

Mapnik vector tile-based geocoder with support for swappable data sources. This is an implementation of some of the concepts of Error-Correcting Geocoding by Dennis Luxen.

Depends

  • Node v0.10.x

Install

npm install

Carmen no longer ships with any default or sample data. Sample data will be provided in a future release.

Command line

Carmen comes with command line utilities that also act as examples of API usage.

To query the default indexes:

./scripts/carmen.js --query="new york"

To analyze an index:

./scripts/carmen-analyze.js tiles/01-ne.country.mbtiles

Carmen API

Carmen(options)

Create a new Carmen geocoder instance. Takes a hash of index objects to use, keyed by index id. Each index object should be an instance of a CarmenSource object.

var Carmen = require('carmen');
var MBTiles = require('mbtiles');
var geocoder = new Carmen({
    country: new MBTiles('./country.mbtiles'),
    province: new MBTiles('./province.mbtiles')
});

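// The callback receives (err, results), as described under geocode() below.
geocoder.geocode('New York', {}, function(err, results) {
    if (err) throw err;
    console.log(results);
});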

Each CarmenSource is a tilelive API source that has additional geocoder methods (see Carmen Source API below). In addition, the following tilelive#getInfo keys affect how Carmen source objects operate.

attribute description
maxzoom The assumed zoom level of the zxy geocoder grid index.
geocoder_layer Optional. A string in the form layer.field. layer is used to determine what layer to query for context operations. Defaults to the first layer found in a vector source.
geocoder_address Optional. Either a flag (0/1) indicating that an index can geocode address (house number) queries, or a string describing how to format the street name and house number, e.g. "{name} {num}". Defaults to 0. The default format is "{num} {name}".
geocoder_resolution Optional. Integer bonus against maxzoom used to increase the grid index resolution when indexing. Defaults to 0.
geocoder_group Optional + advanced. For indexes that share the exact same tile source, IO operations can be grouped. No default.
geocoder_tokens Optional + advanced. An object with a 1:1 from => to mapping of token strings to replace in input queries. e.g. 'Streets' => 'St'.
geocoder_name Optional + advanced. A string to use instead of the provided config index id/key allowing multiple indexes to be treated as a single "logical" index.
geocoder_version Required. Should be set to 3 for carmen@v5.1.x. Index versions <= 1 can be used for reverse geocoding but not forward.

Note: The sum of maxzoom + geocoder_resolution must be no greater than 14.
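
The constraint in the note can be checked before indexing. The following is a minimal sketch; `checkGeocoderZoom` is a hypothetical helper and not part of the carmen API. It only assumes the `maxzoom` and `geocoder_resolution` keys described above.

```javascript
// Hypothetical helper (not part of carmen): enforces maxzoom + geocoder_resolution <= 14.
function checkGeocoderZoom(info) {
    var resolution = info.geocoder_resolution || 0; // documented default is 0
    var total = info.maxzoom + resolution;
    if (total > 14) {
        throw new Error('maxzoom + geocoder_resolution is ' + total + ', must be <= 14');
    }
    return total;
}

console.log(checkGeocoderZoom({ maxzoom: 12, geocoder_resolution: 2 })); // 14
```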

geocoder_version < 1

attribute description
geocoder_shardlevel Deprecated. An integer order of magnitude that geocoder data is sharded. Defaults to 0.

geocode(query, options, callback)

Given a query string, call callback with (err, results) of possible contexts represented by that string.

index(from, to, pointer, callback)

Indexes docs using from as the source and to as the destination. Options can be passed to pointer or omitted.

analyze(source, callback)

Analyze index relations for a given source. Generates stats on degenerate terms, term => phrase relations, etc.

wipe(source, callback)

Clear all geocoding indexes on a source.

copy(from, to, callback)

Copy an index wholesale between from and to.


Carmen Source API

Carmen sources often inherit from tilelive sources.

getFeature(id, callback)

Retrieves the feature with the given id and calls callback with (err, result).

putFeature(id, data, callback)

Inserts feature data and calls callback with (err, result).

startWriting(callback)

Create necessary indexes or structures in order for this carmen source to be written to.

putGeocoderData(index, shard, buffer, callback)

Put buffer into a shard with index index, and call callback with (err).

getGeocoderData(index, shard, callback)

Get the carmen record at shard in index and call callback with (err, buffer).

getIndexableDocs(pointer, callback)

Get documents needed to create a forward geocoding datasource.

pointer is an optional object that has different behavior between sources - it indicates the state of the database or dataset like a cursor would, allowing you to page through documents.

callback is called with (error, documents, pointer), in which documents is a list of objects. Each object may have any attributes but the following are required:

attribute description
_id An integer ID for this feature.
_text Text to index for this feature. Synonyms, translations, etc. should be separated using commas.
_geometry A geojson geometry object. Required if no _zxy provided.
_zxy An array of xyz tile coordinates covered by this feature. Required if no _geometry provided.
_center An array in the form [lon,lat]. _center must be on the _geometry surface, or the _center will be recalculated. Required only if no _geometry provided.
_bbox Optional. A bounding box in the form [minx,miny,maxx,maxy].
_score Optional. A float or integer to sort equally relevant results by. Higher values appear first. Docs with negative scores can contribute to a final result but are only returned if included in matches of a forward search query.
_cluster Optional. Used with geocoder_address. A json object of clustered addresses in the format { number: { geojson point geom } }
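
As an illustration of the attributes above, here is a sample document and a small check of the required fields. This is a sketch only: `validateDoc` is a hypothetical helper, not carmen's own validation, and the coordinates are made up for the example.

```javascript
// Hypothetical validator for the required attributes listed above; not carmen's code.
function validateDoc(doc) {
    var problems = [];
    if (typeof doc._id !== 'number' || doc._id % 1 !== 0) problems.push('_id must be an integer');
    if (typeof doc._text !== 'string' || !doc._text) problems.push('_text is required');
    if (!doc._geometry && !doc._zxy) problems.push('either _geometry or _zxy is required');
    if (!doc._geometry && !doc._center) problems.push('_center is required when _geometry is absent');
    return problems;
}

// Illustrative document; synonyms in _text are comma separated.
var doc = {
    _id: 1,
    _text: 'West Lake View Rd, W Lake View Rd',
    _geometry: { type: 'LineString', coordinates: [[-74.0, 40.9], [-73.99, 40.91]] },
    _score: 2
};

console.log(validateDoc(doc).length);                             // 0
console.log(validateDoc({ _id: 2, _text: 'Englewood' }).length);  // 2
```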

TIGER address interpolation

Carmen has basic support for interpolating geometries based on TIGER address range data. To make use of this feature the following additional keys must be present.

attribute description
_rangetype The type of range data available. The only possible value at the moment is 'tiger'.
_geometry A LineString or MultiLineString geometry object.
_lfromhn Single (LineString) or array of values (Multi) of TIGER LFROMHN field.
_ltohn Single (LineString) or array of values (Multi) of TIGER LTOHN field.
_rfromhn Single (LineString) or array of values (Multi) of TIGER RFROMHN field.
_rtohn Single (LineString) or array of values (Multi) of TIGER RTOHN field.
_parityl Single (LineString) or array of values (Multi) of TIGER PARITYL field.
_parityr Single (LineString) or array of values (Multi) of TIGER PARITYR field.
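
To give a rough idea of what interpolating along a range means, the sketch below places a house number along a LineString in proportion to its position between the from and to house numbers. It is not carmen's implementation: `interpolate` is a hypothetical function, it measures length in plain coordinate units, and it ignores parity and the left/right side of the street.

```javascript
// Sketch only: planar linear interpolation of a house number along a LineString.
// coords is an array of [x, y] pairs; fromhn/tohn are the range ends (e.g. _lfromhn, _ltohn).
function interpolate(coords, fromhn, tohn, num) {
    var lo = Math.min(fromhn, tohn), hi = Math.max(fromhn, tohn);
    if (num < lo || num > hi) return null; // house number is outside the range

    // Fraction of the way from the "from" end to the "to" end.
    var t = tohn === fromhn ? 0 : (num - fromhn) / (tohn - fromhn);

    // Length of every segment and of the whole line.
    var lengths = [], total = 0;
    for (var i = 1; i < coords.length; i++) {
        var dx = coords[i][0] - coords[i - 1][0];
        var dy = coords[i][1] - coords[i - 1][1];
        var len = Math.sqrt(dx * dx + dy * dy);
        lengths.push(len);
        total += len;
    }

    // Walk the segments until the target distance is reached.
    var target = t * total;
    for (var j = 0; j < lengths.length; j++) {
        if (target <= lengths[j] || j === lengths.length - 1) {
            var f = lengths[j] === 0 ? 0 : target / lengths[j];
            return [
                coords[j][0] + f * (coords[j + 1][0] - coords[j][0]),
                coords[j][1] + f * (coords[j + 1][1] - coords[j][1])
            ];
        }
        target -= lengths[j];
    }
    return null; // line has fewer than two points
}

console.log(interpolate([[0, 0], [10, 0]], 100, 200, 150)); // [ 5, 0 ]
```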

How does carmen work?

A user searches for

West Lake View Rd Englewood

How does an appropriately indexed carmen geocoder come up with its results?

For the purpose of this example, we will assume the carmen geocoder is working with the following indexes:

01 country
02 region
03 place
04 street

0. Indexing

The heavy lifting in carmen occurs when indexes are generated. As an index is generated for a datasource carmen tokenizes the text into distinct terms. For example, for a street feature:

"West Lake View Rd" => ["west", "lake", "view", "rd"]

Each term in the dataset is tallied, generating a frequency index which can be used to determine the relative importance of terms against each other. In this example, because west and rd are very common terms while lake and view are comparatively less common the following weights might be assigned:

west lake view rd
0.2  0.5  0.2  0.1

The indexer then generates all possible subqueries that might match this feature:

0.2 west
0.7 west lake
0.9 west lake view
1.0 west lake view rd
0.5 lake
0.7 lake view
0.8 lake view rd
0.2 view
0.3 view rd
0.1 rd

It then drops any subquery whose weight falls below a threshold (e.g. 0.4). This also keeps the index from bloating with low-value phrases like rd:

0.5 lake
0.7 west lake
0.7 lake view
0.8 lake view rd
0.9 west lake view
1.0 west lake view rd
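
The subquery generation and threshold step can be sketched as follows. This is an illustration using the example weights from above, not the indexer's actual code; `subqueries` is a hypothetical name, and rounding to one decimal is only there to avoid floating point noise in the sums.

```javascript
var terms = ['west', 'lake', 'view', 'rd'];
var weights = { west: 0.2, lake: 0.5, view: 0.2, rd: 0.1 }; // example weights from the text

// Enumerate every contiguous run of terms, sum its weights, keep runs at or above threshold.
function subqueries(terms, weights, threshold) {
    var out = [];
    for (var start = 0; start < terms.length; start++) {
        var sum = 0;
        for (var end = start; end < terms.length; end++) {
            sum += weights[terms[end]];
            var relev = Math.round(sum * 10) / 10; // avoid results like 0.8999999999999999
            if (relev >= threshold) {
                out.push({ text: terms.slice(start, end + 1).join(' '), relev: relev });
            }
        }
    }
    return out;
}

subqueries(terms, weights, 0.4).forEach(function(q) {
    console.log(q.relev.toFixed(1) + ' ' + q.text);
});
```

With a threshold of 0.4 this keeps the same six subqueries as the list above, in generation order rather than sorted by weight.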

Next, the indexer generates degenerates for all these subqueries, making it possible to match using typeahead, like this:

0.5 l
0.5 la
0.5 lak
0.5 lake
0.7 w
0.7 we
0.7 wes
0.7 west
0.7 west l
0.7 west la
...
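
A degenerate is simply a prefix of the phrase. The sketch below generates them for one subquery; `degenerates` is a hypothetical name, and skipping prefixes that end in a space is an assumption based on the listing above (which goes from "west" straight to "west l").

```javascript
// Sketch of degenerate (prefix) generation; the real indexer may differ.
function degenerates(phrase) {
    var out = [];
    for (var i = 1; i <= phrase.length; i++) {
        var prefix = phrase.slice(0, i);
        // Assumption: prefixes ending in a space are not stored.
        if (prefix.charAt(prefix.length - 1) !== ' ') out.push(prefix);
    }
    return out;
}

console.log(degenerates('west lake').join(', '));
// w, we, wes, west, west l, west la, west lak, west lake
```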

Finally, the indexer stores the results of all this using phrase_id in the grid index:

lake      => [ grid, grid, grid, grid ... ]
west lake => [ grid, grid, grid, grid ... ]

The phrase_id uses the final bit to mark whether the phrase is a "degen" (0) or "complete" (1), e.g.:

west lak          0
west lake         1

Grids encode the following information for each XYZ x,y coordinate covered by a feature geometry:

x            14 bits
y            14 bits
feature id   20 bits  (previously 25)
phrase relev  2 bits  (0 1 2 3 => 0.4, 0.6, 0.8, 1)
score         3 bits  (0 1 2 3 4 5 6 7)

This is done for both our 03 place and 04 street indexes. Now we're ready to search.

1. Phrasematch

Ok so what happens at runtime when a user searches?

We take the entire query and break it into all the possible subquery permutations. We then look up all possible matches in all the indexes for all of these permutations:

West Lake View Englewood USA

Leads to 15 subquery permutations:

1  west lake view englewood usa
2  west lake view englewood
3  lake view englewood usa
4  west lake view
5  lake view englewood
6  view englewood usa
7  west lake
8  lake view
9  view englewood
10 englewood usa
11 west
12 lake
13 view
14 englewood
15 usa
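
The 15 entries are every contiguous run of the 5 query tokens (5 + 4 + 3 + 2 + 1). A minimal sketch of generating them, together with the position bitmask used in the next listing, could look like this. `permutations` is a hypothetical name and the whitespace tokenizer is a simplification.

```javascript
// Sketch: enumerate contiguous subqueries, longest first, with a position bitmask each.
function permutations(query) {
    var tokens = query.toLowerCase().split(/\s+/); // simplified tokenizer (assumption)
    var n = tokens.length;
    var out = [];
    for (var len = n; len >= 1; len--) {
        for (var start = 0; start + len <= n; start++) {
            var mask = '';
            for (var i = 0; i < n; i++) {
                mask += (i >= start && i < start + len) ? '1' : '0';
            }
            out.push({ text: tokens.slice(start, start + len).join(' '), mask: mask });
        }
    }
    return out;
}

var subs = permutations('West Lake View Englewood USA');
console.log(subs.length); // 15
console.log(subs[3].mask + ' ' + subs[3].text); // 11100 west lake view
```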

Once phrasematch results are retrieved any subqueries that didn't match any results are eliminated.

4  west lake view   11100 street
7  west lake        11000 street
8  lake view        01100 street
11 west             10000 street, place, country
12 lake             01000 street, place
13 view             00100 street
14 englewood        00010 street, place
15 usa              00001 country

By assigning a bitmask to each subquery representing the positions of the input query it represents we can evaluate all the permutations that could be "stacked" to match the input query more completely. We can also calculate a potential max relevance score that would result from each permutation if the features matched by these subqueries do indeed stack spatially. Examples:

4  west lake view   11100 street
14 englewood        00010 place
15 usa              00001 country

potential relev 5/5 query terms = 1

14 englewood        00010 street
11 west             10000 place
15 usa              00001 country

potential relev 3/5 query terms = 0.6

etc.

Now we're ready to use the spatial properties of our indexes to see if these textual matches actually line up in space.

2. Spatial matching

To make sense of the "result soup" from step 1 -- sometimes thousands of potential resulting features match the same text -- the zxy coordinates in the grid index are used to determine which results overlap in geographic space. This is the grid index, which maps phrases to individual feature IDs and their respective zxy coordinates.

04 street
................
............x... <== englewood st
................
...x............
.......x........ <== west lake view rd
.........x......
................
................
.x..............

03 place
................
................
................
.......xx.......
......xxxxxx.... <== englewood
........xx......
x...............
xx..............
xxxx............ <== west town

Features which overlap in the grid index are candidates to have their subqueries combined. Non-overlapping features are still considered as potential final results, but have no partnering features to combine scores with, leading to a lower total relev.

4  west lake view   11100 street
14 englewood        00010 place
15 usa              00001 country

All three features stack, relev = 1

14 englewood        00010 street
11 west             10000 place
15 usa              00001 country

Englewood St does not overlap others, relev = 0.2
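
The overlap test itself is a set intersection over grid cells. The sketch below is illustrative only: `overlaps` is a hypothetical function, and the cell coordinates are read loosely from the diagrams above as [column, row] pairs rather than taken from real data.

```javascript
// Sketch: two features overlap if they share at least one grid cell.
function overlaps(a, b) {
    var cells = {};
    a.forEach(function(c) { cells[c[0] + '/' + c[1]] = true; });
    return b.some(function(c) { return cells[c[0] + '/' + c[1]] === true; });
}

// Illustrative cells, loosely read from the diagrams above.
var westLakeViewRd = [[7, 4], [9, 5]];
var englewoodSt = [[12, 1]];
var englewoodPlace = [[7, 3], [8, 3], [6, 4], [7, 4], [8, 4], [9, 4], [10, 4], [11, 4], [8, 5], [9, 5]];

console.log(overlaps(westLakeViewRd, englewoodPlace)); // true
console.log(overlaps(englewoodSt, englewoodPlace));    // false
```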

The stack of subqueries has a score of 1.0 if,

  1. all query terms are accounted for by features with 1.0 relev in the grid index,
  2. no two features are from the same index,
  3. no two subqueries have overlapping bitmasks.
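
These three conditions can be expressed as a small scoring function. The sketch below is an assumption-laden illustration, not carmen's scoring code: `stackRelev` is a hypothetical name, invalid stacks are simply scored 0, and each covered query term is assumed to contribute 1/termCount scaled by the feature's grid relev.

```javascript
// Count the set bits of a small non-negative integer.
function bitCount(n) {
    var count = 0;
    while (n) { count += n & 1; n >>>= 1; }
    return count;
}

// Sketch: score a stack of { mask, index, relev } subquery matches.
function stackRelev(stack, termCount) {
    var usedMask = 0, usedIndexes = {}, total = 0;
    for (var i = 0; i < stack.length; i++) {
        var mask = parseInt(stack[i].mask, 2);
        if (usedIndexes[stack[i].index]) return 0; // two features from the same index
        if (usedMask & mask) return 0;             // overlapping bitmasks
        usedIndexes[stack[i].index] = true;
        usedMask |= mask;
        total += bitCount(mask) * stack[i].relev;  // assumption: terms scaled by grid relev
    }
    return total / termCount;
}

console.log(stackRelev([
    { mask: '11100', index: 'street', relev: 1 },
    { mask: '00010', index: 'place', relev: 1 },
    { mask: '00001', index: 'country', relev: 1 }
], 5)); // 1
```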

3. Verify, interpolate

The grid index is fast but not 100% accurate. It answers the question "Do features A + B overlap?" with No/Maybe -- leaving open the possibility of false positives. The best results from step 2 are now verified by querying real geometries in vector tiles.

Finally, if a geocoding index supports address interpolation, an initial query token that might represent a housenumber like 350 can be used to interpolate a point position along the line geometry of the matching feature.

4. Challenging cases

Most challenging cases are solvable but stress performance/optimization assumptions in the carmen codebase.

Continuity of feature hierarchy

5th st new york

The user intends to match 5th st in New York City with this query. She may, instead, receive equally relevant results that match a 5th st in Albany or any other 5th st in the state of New York. To address this case, carmen introduces a slight penalty for "index gaps" when query matching. Consider the two following query matches:

04 street   5th st    1100
03 place    new york  0011

04 street   5th st    1100
02 region   new york  0011

Based on score and subquery bitmask both should have a relevance of 1.0. However, because there is a "gap" in the index hierarchy for the second match it receives an extremely small penalty (0.01) -- one that would not affect its standing amongst other scores other than a perfect tie.

Carmen thus prefers queries that contain contiguous hierarchy over ones that do not. This works:

seattle usa => 0.99

But this works better:

seattle washington => 1.00
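
A rough sketch of the gap penalty follows. It assumes index levels are the numeric prefixes used in this walkthrough (01 country through 04 street) and that the 0.01 penalty is applied once when any gap exists; whether carmen applies it per gap is not stated here. `applyGapPenalty` is a hypothetical name.

```javascript
// Sketch: subtract a small penalty when the matched index levels are not contiguous.
function applyGapPenalty(relev, levels) {
    var sorted = levels.slice().sort(function(a, b) { return a - b; });
    for (var i = 1; i < sorted.length; i++) {
        if (sorted[i] - sorted[i - 1] > 1) return relev - 0.01; // assumption: applied once
    }
    return relev;
}

console.log(applyGapPenalty(1, [3, 2]).toFixed(2)); // seattle washington: 1.00
console.log(applyGapPenalty(1, [3, 1]).toFixed(2)); // seattle usa: 0.99
```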

5. Carmen is more complex

Unfortunately, the carmen codebase is more complex than this explanation.

  1. There's more code cleanup, organization, and documentation to do.
  2. Indexes are sharded, designed for updates and hot-swapping with other indexes. This means algorithmic code is sometimes interrupted by lazy loading and other I/O.
  3. The use of integer hashes, bitmasks, and other performance optimizations (inlined code rather than function calls) makes it extremely challenging to identify the semantic equivalents in the middle of a geocode.

Dev notes

Some incomplete notes about the Carmen codebase.

Terminology

  • Cache: an object that quickly loads sharded data from JSON or protobuf files
  • Source: a Carmen source, such as S3, MBTiles, or memory

Source structure

lib/
  [operations that are exposed in the public ui and do i/o]
  util/
    [algorithmically simple utilities]
  pure/
    [pure algorithms]

Index structure

There are two types of index stores in Carmen.

  • cxxcache is used for storing the grid and freq indexes. Each index is sharded and each shard contains a one-to-many hash with 32-bit integer keys that map to arrays of arbitrary length containing 32-bit integer elements.
  • feature is used to store feature docs. Each index is sharded and each shard contains a one-to-many hash with 32-bit integer keys that map to a bundle of features. Each bundle contains feature documents keyed by their original, full id.

32-bit unsigned integers are widely used in the Carmen codebase because of their performance, especially in V8 as keys of a hash object. To convert arbitrary text (like tokenized text) to integers the FNV1a hash is used and sometimes truncated to make room for additional encoded data.

freq

Stores a mapping of term frequencies for all docs in an index. Terms are ID'd using a fnv1a hash.

term_id => [ count ]

Conceptual example with actual text rather than fnv1a hashes for readability:

street => [ 103120 ]
main   => [ 503 ]
market => [ 31 ]
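
A conceptual sketch of building such a tally is shown below, again keyed by text instead of hashes. `tokenize` and `buildFreq` are hypothetical names and the tokenizer is a deliberate simplification of whatever the indexer actually does.

```javascript
// Simplified tokenizer (assumption): lowercase, split on anything that is not a-z or 0-9.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Sketch: count term occurrences across all docs, stored as term => [ count ].
function buildFreq(docs) {
    var freq = {};
    docs.forEach(function(doc) {
        tokenize(doc._text).forEach(function(term) {
            var seen = Object.prototype.hasOwnProperty.call(freq, term);
            freq[term] = [(seen ? freq[term][0] : 0) + 1];
        });
    });
    return freq;
}

console.log(buildFreq([{ _text: 'Main Street' }, { _text: 'Market Street' }]).street); // [ 2 ]
```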

grid

Stores a mapping of phrase/phrase degenerate to feature cover grids.

phrase_id => [ grid, grid, grid, grid ... ]

A lookup against this index effectively answers the question: what and where are all the features that match (whole or partially) a given text phrase?

Grids are encoded as 53-bit integers (largest possible JS integer) storing the following information:

info bits description
x 14 x tile cover coordinate, up to z14
y 14 y tile cover coordinate, up to z14
relev 2 relev between 0.4 and 1.0 (possible values: 0.4, 0.6, 0.8, 1.0)
score 3 score scaled to a value between 0-7
id 20 feature id, truncated to 20 bits
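
The bit widths above add up to 53 (14 + 14 + 2 + 3 + 20), which is why a grid fits in a single JavaScript number. The sketch below packs and unpacks those fields using multiplication and division, since JavaScript bitwise operators only work on 32 bits. The field order (x highest, id lowest) is an assumption made for the example and may not match carmen's actual layout; `encodeGrid` and `decodeGrid` are hypothetical names.

```javascript
var RELEVS = [0.4, 0.6, 0.8, 1]; // 2-bit relev codes 0..3

// Sketch: pack x(14) y(14) relev(2) score(3) id(20) into one 53-bit integer.
function encodeGrid(g) {
    var relevCode = RELEVS.indexOf(g.relev);
    if (relevCode === -1) throw new Error('relev must be one of ' + RELEVS.join(', '));
    var n = g.x;
    n = n * 16384 + g.y;               // 14 bits
    n = n * 4 + relevCode;             // 2 bits
    n = n * 8 + g.score;               // 3 bits
    n = n * 1048576 + (g.id % 1048576); // 20 bits, id truncated
    return n;
}

// Reverse of encodeGrid: peel fields off from the low end.
function decodeGrid(n) {
    var id = n % 1048576;      n = Math.floor(n / 1048576);
    var score = n % 8;         n = Math.floor(n / 8);
    var relev = RELEVS[n % 4]; n = Math.floor(n / 4);
    var y = n % 16384;         n = Math.floor(n / 16384);
    return { x: n, y: y, relev: relev, score: score, id: id };
}

var grid = { x: 4823, y: 6160, relev: 0.8, score: 5, id: 12345 };
console.log(decodeGrid(encodeGrid(grid)));
```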

phrase_id

phrase degen
31-1 0

The first 31 bits of a phrase ID are the fnv1a hash of the phrase text. The last remaining bit is used to store whether the phrase_id is for a complete or degenerate phrase.
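
A sketch of deriving a phrase_id is given below. The FNV-1a constants are the standard 32-bit ones. The rest is assumption: `phraseId` is a hypothetical name, the hash is fed UTF-16 code units (fine for ASCII text only), the upper 31 bits of the hash are the ones kept, and the final bit is set to 1 for complete phrases and 0 for degenerates, following the example earlier in this document.

```javascript
// Standard 32-bit FNV-1a over the characters of an ASCII string.
function fnv1a(str) {
    var hash = 0x811c9dc5;                  // FNV offset basis
    for (var i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
    }
    return hash >>> 0;                      // as unsigned 32-bit integer
}

// Sketch: keep the upper 31 bits of the hash, use the final bit as the complete flag.
function phraseId(text, complete) {
    return (fnv1a(text) >>> 1) * 2 + (complete ? 1 : 0);
}

console.log(phraseId('west lak', false) % 2); // 0 (degen)
console.log(phraseId('west lake', true) % 2); // 1 (complete)
```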
