last-line parser for unstructured geographic text

This repository is part of the Pelias project. Pelias is an open-source, open-data geocoder originally sponsored by Mapzen. Our official user documentation is here.

Pelias Placeholder Service

natural language parser for geographic text

This engine takes unstructured input text, such as 'Neutral Bay North Sydney New South Wales' and attempts to deduce the geographic area the user is referring to.

Human beings (familiar with Australian geography) are able to quickly scan the text and establish that there are 3 distinct token groups: 'Neutral Bay', 'North Sydney' & 'New South Wales'.

The engine uses a similar technique to our brains, scanning across the text, cycling through a dictionary of learned terms and then trying to establish logical token groups.
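
The grouping step can be pictured with a minimal sketch. It assumes a tiny hand-written dictionary and a greedy longest-match strategy; `KNOWN_TERMS` and `groupTokens` are hypothetical names for illustration, and the engine's real dictionary and grouping logic may differ.

```javascript
// Hypothetical dictionary of known place names.
// (Illustration only: the real dictionary is built from Who's on First data.)
const KNOWN_TERMS = new Set([
  'neutral bay', 'north sydney', 'sydney', 'new south wales', 'wales'
]);

// Greedy longest-match grouping: at each position take the longest run of
// words that forms a known term; words which match nothing are skipped.
function groupTokens(text, dictionary) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const groups = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    for (let j = words.length; j > i; j--) {
      const candidate = words.slice(i, j).join(' ');
      if (dictionary.has(candidate)) {
        groups.push(candidate);
        i = j; // continue scanning after the matched group
        matched = true;
        break;
      }
    }
    if (!matched) { i += 1; }
  }
  return groups;
}

console.log(groupTokens('Neutral Bay North Sydney New South Wales', KNOWN_TERMS));
// [ 'neutral bay', 'north sydney', 'new south wales' ]
```

Preferring the longest match is what keeps 'new south wales' together instead of splitting off 'wales' as its own group.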

Once token groups have been established, a reductive algorithm is used to ensure that the token groups are logical in a geographic context. We don't want to return New York City for a term such as 'nyc france', so we need to only return things called 'nyc' inside places called 'france'.

The engine starts from the rightmost group, and works to the left, ensuring token groups represent geographic entities contained within those which came before. This process is repeated until it either runs out of groups, or would return 0 results.
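
The right-to-left reduction described above can be sketched as follows. This is a simplified illustration over a toy dataset with made-up ids, not the engine's actual implementation; `PLACES` and `reduceGroups` are hypothetical names.

```javascript
// Toy gazetteer with made-up ids; `ancestors` lists the ids of the places
// which contain each record.
const PLACES = [
  { id: 1, name: 'new south wales', ancestors: [] },
  { id: 2, name: 'north sydney', ancestors: [1] },
  { id: 3, name: 'neutral bay', ancestors: [2, 1] },
  { id: 4, name: 'france', ancestors: [] },
  { id: 5, name: 'nyc', ancestors: [] } // not contained in France
];

// Walk the groups from right to left, keeping only places which are
// contained within the places matched so far. Stop as soon as a step
// would produce 0 results and return the last non-empty set of ids.
function reduceGroups(groups, places) {
  let current = null;
  for (let i = groups.length - 1; i >= 0; i--) {
    const named = places.filter(p => p.name === groups[i]);
    const next = current === null
      ? named
      : named.filter(p => p.ancestors.some(a => current.includes(a)));
    if (next.length === 0) { break; }
    current = next.map(p => p.id);
  }
  return current || [];
}

console.log(reduceGroups(['neutral bay', 'north sydney', 'new south wales'], PLACES)); // [ 3 ]
console.log(reduceGroups(['nyc', 'france'], PLACES)); // [ 4 ]
```

In the second call there is nothing called 'nyc' inside 'france', so the sketch stops and returns the best estimation it had, which is France itself rather than New York City.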

The best estimation is then returned, either as a set of integers representing the ids of those regions, or as a JSON structure which also contains additional information such as population counts etc.

The data is sourced from the whosonfirst project, which also includes translations of place names in different languages.

Placeholder supports searching on and retrieving tokens in different languages and also offers support for synonyms and abbreviations.


nodejs version

nodejs v6.11.4 or greater is required; running the library on an older version of node will result in an error:

bash-3.2$ node --version
v4.9.1

bash-3.2$ node -e 'require("better-sqlite3")'
FATAL ERROR: v8::ToLocalChecked Empty MaybeLocal.
Abort trap: 6

install

$ git clone git@github.com:pelias/placeholder.git && cd placeholder
$ npm install

download the required database files

$ mkdir data
$ curl -s https://s3.amazonaws.com/pelias-data.nextzen.org/placeholder/store.sqlite3.gz | gunzip > data/store.sqlite3;

confirm the build was successful

$ npm test
$ npm run cli -- san fran

> pelias-placeholder@1.0.0 cli
> node cmd/cli.js "san" "fran"

san fran

took: 3ms
 - 85922583	locality 	San Francisco

run server

$ PORT=6100 npm start;

open browser

the server should now be running and you should be able to access the http API:

http://localhost:6100/

try the following paths:

/demo
/parser/search?text=london
/parser/findbyid?ids=101748479
/parser/query?text=london
/parser/tokenize?text=sydney new south wales
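
when calling these paths from code, the query values need to be URL-encoded. a minimal sketch, assuming the server is listening on port 6100 as in the 'run server' step; `buildUrl` is a hypothetical helper and not part of this repository:

```javascript
// Base URL assumes the server was started with PORT=6100.
const BASE_URL = 'http://localhost:6100';

// Hypothetical helper: joins an endpoint path and a params object into a URL,
// encoding each key and value so that spaces etc. are transmitted safely.
function buildUrl(path, params) {
  const query = Object.keys(params)
    .map(key => encodeURIComponent(key) + '=' + encodeURIComponent(params[key]))
    .join('&');
  return BASE_URL + path + (query ? '?' + query : '');
}

console.log(buildUrl('/parser/search', { text: 'london' }));
// http://localhost:6100/parser/search?text=london
console.log(buildUrl('/parser/tokenize', { text: 'sydney new south wales' }));
// http://localhost:6100/parser/tokenize?text=sydney%20new%20south%20wales
```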

changing languages

the /parser/search endpoint accepts a ?lang=xxx parameter which can be used to vary the language of data returned.

for example, the following urls will return strings in Japanese / Russian where available:

/parser/search?text=germany&lang=jpn
/parser/search?text=germany&lang=rus

documents returned by /parser/search contain a boolean property named languageDefaulted which indicates if the service was able to find a translation in the language you request (false) or whether it returned the default language (true).
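
a client can use this flag to tell translated names apart from fallbacks. the sketch below assumes the response has already been parsed into an array of documents; only languageDefaulted is described in this README, the `name` field and the sample values are illustrative.

```javascript
// Hypothetical documents shaped like a parsed /parser/search response.
// Only `languageDefaulted` is documented here; `name` is an assumption.
const docs = [
  { name: 'ドイツ', languageDefaulted: false }, // translation was found
  { name: 'Germany', languageDefaulted: true } // fell back to the default language
];

// Keep only the documents for which a translation was found.
function onlyTranslated(documents) {
  return documents.filter(doc => doc.languageDefaulted === false);
}

console.log(onlyTranslated(docs).map(doc => doc.name)); // [ 'ドイツ' ]
```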

the demo is also able to serve responses in different languages by providing the language code in the URL anchor:

/demo#jpn
/demo#chi
/demo#eng
/demo#fra
... etc.

filtering by placetype

the /parser/search endpoint accepts a ?placetype=xxx parameter which can be used to control the placetype of records which are returned.

this filter does not provide any performance benefits; it is simply a convenience for restricting results to a whitelist of placetypes.

you may specify multiple placetypes by separating them with commas, such as ?placetype=xxx,yyy; these are matched as OR conditions, eg: (xxx OR yyy)

for example:

the query search?text=luxemburg will return results for the country, region, locality etc.

you can use the placetype filter to control which records are returned:

# all matching results
search?text=luxemburg

# only return matching country records
search?text=luxemburg&placetype=country

# return matching country or region records
search?text=luxemburg&placetype=country,region
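
the OR semantics of the whitelist can be sketched as a plain array filter. this is an illustration of the behaviour described above and not the service's own code; `filterByPlacetype` and the sample records are hypothetical.

```javascript
// Hypothetical records, each with a `placetype` as shown in the 'id' output
// of the interactive shell.
const results = [
  { name: 'Luxembourg', placetype: 'country' },
  { name: 'Luxembourg', placetype: 'region' },
  { name: 'Luxembourg', placetype: 'locality' }
];

// Split the comma-separated parameter into a whitelist and keep any record
// whose placetype matches at least one entry (xxx OR yyy).
function filterByPlacetype(documents, placetypeParam) {
  const allowed = placetypeParam.split(',').map(s => s.trim()).filter(Boolean);
  if (allowed.length === 0) { return documents; } // no filter requested
  return documents.filter(doc => allowed.includes(doc.placetype));
}

console.log(filterByPlacetype(results, 'country,region').map(d => d.placetype));
// [ 'country', 'region' ]
```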

live mode (BETA)

the /parser/search endpoint accepts a ?mode=live parameter pair which can be used to enable an autocomplete-style API.

in this mode the final token of each input text is considered as 'incomplete', meaning that the user has potentially only typed part of a token.

this mode is currently in BETA; the interface and behaviour may change over time.
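
one way to picture an 'incomplete' final token is as a prefix match: every token except the last must match a whole word, while the last only needs to match the start of a word. the sketch below is an assumption about how such matching could work, not a description of the actual implementation; `matchesLive` and `TERMS` are hypothetical.

```javascript
// Hypothetical list of known terms.
const TERMS = ['san francisco', 'san fernando', 'santa cruz'];

// All typed tokens except the last must equal the corresponding word of the
// term; the last typed token only has to be a prefix of its word.
function matchesLive(input, term) {
  const typed = input.toLowerCase().split(/\s+/).filter(Boolean);
  const words = term.split(' ');
  if (typed.length === 0 || typed.length > words.length) { return false; }
  const last = typed.length - 1;
  for (let i = 0; i < last; i++) {
    if (typed[i] !== words[i]) { return false; }
  }
  return words[last].startsWith(typed[last]);
}

console.log(TERMS.filter(term => matchesLive('san fr', term)));
// [ 'san francisco' ]
console.log(TERMS.filter(term => matchesLive('san f', term)));
// [ 'san francisco', 'san fernando' ]
```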


run the interactive shell

$ npm run repl

> pelias-placeholder@1.0.0 repl
> node cmd/repl.js

placeholder >

try the following commands:

placeholder > london on
 - 101735809	locality 	London

placeholder > search london on
 - 101735809	locality 	London

placeholder > tokenize sydney new south wales
 [ [ 'sydney', 'new south wales' ] ]

placeholder > token kelburn
 [ 85772991 ]

placeholder > id 85772991
 { name: 'Kelburn',
   placetype: 'neighbourhood',
   lineage:
    { continent_id: 102191583,
      country_id: 85633345,
      county_id: 102079339,
      locality_id: 101915529,
      neighbourhood_id: 85772991,
      region_id: 85687233 },
   names: { eng: [ 'Kelburn' ] } }
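
the lineage object lists the ids of the places which contain a record, along with the record's own id under its placetype. a small sketch of reading the parent ids from the record shown above; `parentIds` is a hypothetical helper:

```javascript
// The record printed by the 'id 85772991' command above.
const record = {
  name: 'Kelburn',
  placetype: 'neighbourhood',
  lineage: {
    continent_id: 102191583,
    country_id: 85633345,
    county_id: 102079339,
    locality_id: 101915529,
    neighbourhood_id: 85772991,
    region_id: 85687233
  },
  names: { eng: ['Kelburn'] }
};

// Collect every lineage id except the entry for the record's own placetype.
function parentIds(rec) {
  const ownKey = rec.placetype + '_id';
  return Object.keys(rec.lineage)
    .filter(key => key !== ownKey)
    .map(key => rec.lineage[key]);
}

console.log(parentIds(record));
// [ 102191583, 85633345, 102079339, 101915529, 85687233 ]
```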

configuration for pelias API

While Placeholder can be used as a stand-alone application or included with other geographic software / search engines, it is designed for the Pelias geocoder.

To connect Placeholder service to the Pelias API, configure the pelias config file with the port that placeholder is running on.
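
As a rough illustration, the relevant fragment of the config could look like the object printed below. The key layout (`api.services.placeholder.url`) is an assumption based on the Pelias API documentation and should be checked against it; the port matches the `PORT=6100` used in the 'run server' step.

```javascript
// Port the Placeholder service is listening on (see 'run server').
const PLACEHOLDER_PORT = 6100;

// Assumed layout of the Pelias config fragment; verify the key names
// against the Pelias API documentation before relying on them.
const peliasConfigFragment = {
  api: {
    services: {
      placeholder: {
        url: 'http://localhost:' + PLACEHOLDER_PORT
      }
    }
  }
};

// Print the fragment as JSON, ready to merge into the config file.
console.log(JSON.stringify(peliasConfigFragment, null, 2));
```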


tests

run the test suite

$ npm test

run the functional cases

there are more exhaustive test cases included in test/cases/.

to run all the test cases:

$ npm run funcs

generate a ~500,000 line test file

this command requires the data/wof.extract file mentioned below in the 'building the database' section.

$ npm run gentests

once complete you can find the generated test cases in test/cases/generated.txt.


docker

build the service image

$ docker-compose build

run the service in the background

$ docker-compose up -d

building the database

prerequisites

  • jq 1.5+ must be installed
    • on ubuntu: sudo apt-get install jq
    • on mac: brew install jq
  • Who's on First data download

steps

the database is created from geographic data sourced from the whosonfirst project.

the whosonfirst project is distributed as geojson files, so in order to speed up development we first extract the relevant data into a file: data/wof.extract.

the following command will iterate over all the geojson files under the WOF_DIR path, extracting the relevant properties into the file data/wof.extract.

this process can take 30-60 minutes to run and consumes ~350MB of disk space. you will only need to run this command once, or again when your local whosonfirst-data files are updated.

$ WOF_DIR=/data/whosonfirst-data/data npm run extract
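
conceptually, the extract step reduces each geojson feature to the handful of properties the build needs. the sketch below illustrates the idea on a single in-memory feature; the list of properties in `WANTED` is an assumption (the property names follow whosonfirst conventions, but the exact set used by the extract script is not listed here) and `extractProperties` is a hypothetical helper.

```javascript
// Hypothetical feature using whosonfirst-style property names.
const feature = {
  type: 'Feature',
  properties: {
    'wof:id': 85772991,
    'wof:name': 'Kelburn',
    'wof:placetype': 'neighbourhood',
    'wof:hierarchy': [{ locality_id: 101915529, neighbourhood_id: 85772991 }],
    'some:other': 'not needed for the build'
  },
  geometry: null
};

// Assumed list of properties worth keeping; the real script may differ.
const WANTED = ['wof:id', 'wof:name', 'wof:placetype', 'wof:hierarchy'];

// Copy only the wanted properties, dropping geometry and everything else.
function extractProperties(feat, wanted) {
  const out = {};
  wanted.forEach(key => {
    if (Object.prototype.hasOwnProperty.call(feat.properties, key)) {
      out[key] = feat.properties[key];
    }
  });
  return out;
}

console.log(JSON.stringify(extractProperties(feature, WANTED)));
```

dropping the geometry and unused properties is what keeps the extract file small compared to the full whosonfirst-data checkout.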

alternatively you can download the extract file from our s3 bucket:

$ mkdir data
$ curl -s https://s3.amazonaws.com/pelias-data.nextzen.org/placeholder/wof.extract.gz | gunzip > data/wof.extract;

now you can rebuild the data/store.sqlite3 file with the following command, which should take 2-3 minutes to run:

$ npm run build

Using the Docker image

rebuild the image

you can rebuild the image on any system with the following command:

$ docker build -t pelias/placeholder .

download pre-built image

Up-to-date Docker images are built and automatically pushed to Docker Hub from our continuous integration pipeline.

You can pull the latest stable image with:

$ docker pull pelias/placeholder

download custom image tags

We publish each commit, and the latest commit of each branch, to separate tags.

A list of all available tags to download can be found at https://hub.docker.com/r/pelias/placeholder/tags/


uploading a new build to s3

this section is applicable to mapzen employees only and requires s3 credentials and the aws command to be installed and configured prior to running.

other organizations may elect to change the bucket name in the config and utilize the same script.

the script takes care of creating a date stamped archive and promoting the most recent build to the root of the bucket (with a public ACL).

$ AWS_PROFILE=nextzen ./cmd/s3_upload.sh

--- gzipping data files ---
--- uploading archive ---
upload: data/store.sqlite3.gz to s3://pelias-data.nextzen.org/placeholder/archive/2017-09-29/store.sqlite3.gz
upload: data/wof.extract.gz to s3://pelias-data.nextzen.org/placeholder/archive/2017-09-29/wof.extract.gz
--- list remote archive ---
2017-09-29 14:52:33   46.6 MiB store.sqlite3.gz
2017-09-29 14:53:08   53.8 MiB wof.extract.gz

> would you like to promote this build to production (yes/no)?
no
you did not answer yes, the build was not promoted to production