Geographic Name Resolution Service (GNRS)

Author: Brad Boyle (bboyle@email.arizona.edu)

Overview

The GNRS is a batch application for resolving and standardizing political division names against the GADM Global Administrative Divisions Database (https://gadm.org/), with additional names and codes from Geonames (http://www.geonames.org/) and Natural Earth (https://www.naturalearthdata.com/). The GNRS resolves political division names at three levels: country (admin_0), state/province (admin_1) and county/parish (admin_2). Resolution is performed in a series of steps, beginning with direct matching to standard names, followed by direct matching to alternate names in different languages, followed by direct matching to standard codes (such as ISO and FIPS codes). If direct matching fails, the GNRS attempts to match to standard and then alternate names using fuzzy matching, but does not perform fuzzy matching of political division codes. The GNRS works down the political division hierarchy, stopping at the current level if all matches fail. In other words, if a country cannot be matched, the GNRS does not attempt to match state or county.
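The order of the direct-matching steps can be illustrated with a small shell sketch. This is not the GNRS implementation, which runs against its PostgreSQL database. The lookup table, its sample entries, the function name resolve_country and the step labels are all hypothetical, and the fuzzy-matching step is omitted because its algorithm is not described here.

```shell
# Hypothetical lookup table: matching step, lowercase key, resolved name.
lookup=$(mktemp)
trap 'rm -f "$lookup"' EXIT
printf '%s\t%s\t%s\n' \
  standard  "mexico"         "Mexico" \
  standard  "united states"  "United States" \
  alternate "mejico"         "Mexico" \
  alternate "estados unidos" "United States" \
  code      "mx"             "Mexico" \
  code      "us"             "United States" > "$lookup"

# Try the direct-matching steps in order: standard names, then alternate
# names, then codes. Prints "<resolved name><TAB><step>" on the first hit
# and returns non-zero if no step matches.
resolve_country() {
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for step in standard alternate code; do
    match=$(awk -F'\t' -v s="$step" -v k="$key" \
      '$1 == s && $2 == k { print $3; exit }' "$lookup")
    if [ -n "$match" ]; then
      printf '%s\t%s\n' "$match" "$step"
      return 0
    fi
  done
  # Fuzzy matching of names (never codes) would be attempted here.
  return 1
}

for name in "Mexico" "Estados Unidos" "MX" "Atlantis"; do
  if result=$(resolve_country "$name"); then
    printf '%s -> %s\n' "$name" "$result"
  else
    # No country match, so state/province and county/parish are not attempted.
    printf '%s -> no match; lower levels skipped\n' "$name"
  fi
done
```

The same pattern would repeat at the state/province and county/parish levels, with each lookup restricted to the parent division that was already matched.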

Results returned by the GNRS include the original political division names, the resolved political division names and IDs from GADM and Geonames, with additional information on how each name was resolved and the quality of the overall match.

Software

Ubuntu 16.04 or higher
PostgreSQL 12.2 or higher
Perl v5.26.1 or higher
Perl module Text::CSV
PHP 7.2.24 or higher
PHP extensions:

php-cli
php-mbstring
php-curl
php-xml
php-json
php-services-json
php-pgsql

Dependencies

Local installation of database geonames

Required for building the GNRS database
See repo: https://github.com/ojalaquellueva/geonames.git

Local installation of database gadm

Required for building the GNRS database
See repo: https://github.com/ojalaquellueva/gadm.git

Installation and configuration

I recommend the following setup:

# Create application base directory (call it whatever you want)
mkdir -p gnrs
cd gnrs

# Create application code directory
mkdir src

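# Install application code to application code directory
# (the trailing "." clones into src/ itself rather than src/gnrs/,
# so that data/ and config/ are found by the mv commands below)
cd src
git clone https://github.com/ojalaquellueva/gnrs .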

# Move data and sensitive parameters directories outside of code directory
# Be sure to change paths to these directories (in params.sh) accordingly
mv data ../
mv config ../

Note: the temporary data directory /tmp/gnrs (used by the GNRS API) is created on the fly by the application.

Maintenance

To avoid filling up the GNRS temp directory, consider adding a crontab entry to delete files older than a certain number of days. For example, the following cron job finds and deletes all files older than 7 days, every day at 4:02 am:

02 4 * * * find /tmp/gnrs/* -type f -mtime +7 -print0 | xargs -0 rm

Another version for systems that don't support -print0:

02 4 * * * find /tmp/gnrs/* -type f -mtime +7 -exec rm {} \;

Whichever you use, be sure to test first to verify that the list of files makes sense:

find /tmp/gnrs/* -type f -mtime +7

Input/Output

Input File

The input file for the GNRS must be a UTF-8 plain text file with the following fields:

Field name	Required?	Meaning
user_id	No	User-supplied integer id for each row, if desired
country	Yes	Country name
state_province	No	State/province name
county_parish	No	County/parish name

Header user_id,country,state_province,county_parish must be the first line of the file. Place this file in the GNRS user data directory (data/user/; path and directory name set in file params.sh).

Example tab-delimited data

Numeric IDs are optional, but the header and all tabs must be included, even when a field is empty.

user_id<tab>country<tab>state_province<tab>county_parish
1<tab>Russia<tab>Lipetsk<tab>Dobrovskiy rayon
2<tab>Mexico<tab>Sonora, Estado de<tab>Huépac
3<tab>Guatemala<tab>Izabal<tab>
4<tab>USA<tab>Arizona<tab>Pima County
5<tab>U.S.A<tab>Arizona<tab>Pima
6<tab>Mexico<tab>Quintana Roo<tab>Lázaro Cárdenas
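A quick pre-flight check can catch a missing header or a row with the wrong number of tabs before a file is submitted. The sketch below is not part of the GNRS. The function name check_gnrs_input and the sample file are hypothetical. It checks only that the first line is the expected header, that every row has exactly four tab-separated fields, and that the required country field is not empty.

```shell
# Hypothetical helper: validate a tab-delimited GNRS input file.
# Prints one message per problem found and returns non-zero if there are any.
check_gnrs_input() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NR == 1 {
      if ($0 != "user_id\tcountry\tstate_province\tcounty_parish") {
        print "line 1: unexpected header"; bad = 1
      }
      next
    }
    NF != 4  { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1; next }
    $2 == "" { printf "line %d: country is required\n", NR; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Build a small sample file; an empty county/parish still needs its tab.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  user_id country state_province county_parish \
  1 Russia Lipetsk "Dobrovskiy rayon" \
  3 Guatemala Izabal "" \
  4 USA Arizona "Pima County" > "$sample"

check_gnrs_input "$sample" && echo "input file looks OK"
rm -f "$sample"
```

A row with a stray trailing tab after the last field is reported as having five fields.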

Input File Type

gnrspar.pl: must be tab delimited
gnrs_batch.sh: tab delimited or comma delimited. Specify on command line (see below).

Output File

GNRS output is saved as a UTF-8 CSV file with header. By default, the name of the output file is the basename of the input file plus the suffix "_gnrs_results.csv". Fields are as follows:

Field name	Meaning
id	gnrs ID of each record
poldiv_full	Verbatim country, state/province and county/parish, concatenated with '@' delimiter
country_verbatim	Verbatim country
state_province_verbatim	Verbatim state/province
county_parish_verbatim	Verbatim county/parish
country	Resolved country
state_province	Resolved state/province
county_parish	Resolved county/parish
country_id	Geonames ID of resolved country
state_province_id	Geonames ID of resolved state/province
county_parish_id	Geonames ID of resolved county/parish
match_method_country	Method used to match country
match_method_state_province	Method used to match state/province
match_method_county_parish	Method used to match county/parish
match_score_country	Country match score (if fuzzy matched)
match_score_state_province	State/province match score (if fuzzy matched)
match_score_county_parish	County/parish match score (if fuzzy matched)
poldiv_submitted	Lowest political division submitted
poldiv_matched	Lowest political division matched
match_status	Completeness of overall match
user_id	User id, if supplied
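To get a quick overview of a results file, the rows can be tallied by any column, for example match_status or poldiv_matched. The sketch below assumes the results were exported as TSV (delimiter option t), so that values containing commas need no CSV parsing. The function name count_by_column is hypothetical, and the set of values that match_status can take is not listed here, so the helper simply reports whatever values it finds.

```shell
# Hypothetical helper: count rows per distinct value of a named column
# in a tab-delimited results file that has a header line.
count_by_column() {
  # $1 = results file (TSV with header), $2 = column name
  awk -F'\t' -v col="$2" '
    NR == 1 {
      # Locate the requested column by name in the header
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
      next
    }
    { n[$idx]++ }
    END { for (v in n) printf "%s\t%d\n", v, n[v] }
  ' "$1" | sort
}

# Example call (the path is illustrative only):
# count_by_column "../data/user/gnrs_testfile_gnrs_results.tsv" match_status
```

Looking up the column by header name rather than by position keeps the helper working if fields are added or reordered.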

Place your input file in the GNRS user data directory (path and directory name set in the params file). The input file must be named "gnrs_submitted.csv".

Usage

GNRS (parallel processing)

This should be considered the default application, as it is by far the fastest.
Splits the submitted file into batches, removing duplicates, and processes several batches at once using multiple cores.
Reassembles the batches into a single file when all batches are complete.
Invokes gnrs_batch.sh (see below).
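The split step can be pictured with the following sketch. It is an illustration only and does not reproduce what gnrspar.pl does internally. The function name split_into_batches and the batch file names are hypothetical, and the sketch treats whole rows as the unit of duplication, which may differ from the rule the GNRS applies.

```shell
# Hypothetical helper: drop duplicate data rows and distribute the rest
# round-robin into N batch files, each starting with the original header.
split_into_batches() {
  # $1 = input file with header, $2 = number of batches, $3 = output directory
  awk -v n="$2" -v dir="$3" '
    NR == 1 { header = $0; next }
    seen[$0]++ { next }                    # skip rows already seen
    {
      out = sprintf("%s/batch_%d.tsv", dir, count % n)
      if (!(out in started)) { print header > out; started[out] = 1 }
      print > out
      count++
    }
  ' "$1"
}

# Example call (paths are illustrative only):
# split_into_batches "../data/user/gnrs_testfile.csv" 3 /tmp/gnrs_batches
```

Each batch file can then be processed independently, and the per-batch results concatenated once all batches have finished.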

Syntax

./gnrspar.pl -in <input_filename_and_path> -nbatch <batches> -opt <makeflow_options> <other options>

Options

Option	Meaning	Required?	Default value	Values
-in	Input file and path	Yes
-out	Output file and path	No	/path/to/_gnrs_results.tsv
-nbatch	Number of batches	Yes
-opt	Makeflow options	No
-d	Output file delimiter	No	t	c (CSV), t (TSV)

Example:

./gnrspar.pl -in "../data/user/gnrs_testfile.csv" -nbatch 3

GNRS batch (non-parallel)

Import, name resolution and export of results are run as a single operation by invoking the following script:

./gnrs_batch.sh [-option1] [-option2] ...

Options

Option	Purpose	Required?	Default value	Comments
-f	Input file and path	Yes
-o	Output file and path	No	/path/to/_gnrs_results.csv
-d	Output file delimiter	No	c	c=comma (CSV), t=tab (TSV)
-n	No header	No	FALSE	Input file does not contain header. Default value (FALSE) means file contains header as first line.
-a	Api call	No (yes for api)		invokes other options such as -s and -p
-s	Silent mode: suppress all (confirmations & progress messages)	No
-m	Send notification emails	No		Must be followed by valid email
-r	Remove from cache	No	FALSE	Remove any results corresponding to submitted political divisions from cache. Forces resolution from scratch of all values in current batch.
-c	Clear cache	No	FALSE	Clear entire cache

Example:

./gnrs_batch.sh -f "../data/user/gnrs_testfile.csv" -o "/home/boyle/testing/gnrs_testfile_scrubbed.csv"

The above assumes the command is run from the same directory as the target script, gnrs_batch.sh.
If running from a different directory, prepend the command with the path to gnrs_batch.sh, unless you have added this path to your environment.
In this example, the path to the data directory is relative to the working directory. You could also use the full path.
Output file "gnrs_testfile_scrubbed.csv" will be written to directory "/home/boyle/testing/".

API

For up-to-date examples of API usage in PHP and R, see the following example files in the api/ subdirectory of this repository:

gnrs_api_example.php
gnrs_api_example.R

Also see API documentation at http://bien.nceas.ucsb.edu/bien/tools/gnrs/gnrs-api/
