gnmatcher

import "github.com/gnames/gnmatcher"

The GNmatcher is a utility for checking scientific name-strings against canonical names derived from a variety of biodiversity datasets. The matching can be exact, partial, or fuzzy. Usually GNmatcher is used as a component of a scientific names verification service (GNames API).

Introduction

The GNmatcher takes an array of strings and returns zero or more canonical names for each string. If it is not important which biodiversity repositories provided the matched canonical names, the project can be used as a stand-alone RESTful service. If such information is important, the project is used as a component of a scientific names verification service (GNames API).

The project aims to match canonical names as quickly and accurately as possible. Quite often, humans or optical character recognition (OCR) software introduce misspellings in the strings. For this reason, GNmatcher uses fuzzy-matching algorithms when no exact match exists. Also, when the full string does not have a match, gnmatcher tries to match parts of the string. For example, if a string did not get a match at the subspecies level, the algorithm will try to match it at the species and genus levels.
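The partial-matching fallback described above can be illustrated with a minimal sketch. This is not gnmatcher's implementation: the function names `partialCandidates` and `matchWithFallback` are hypothetical, the lookup is a plain map rather than the project's real data structures, and the fuzzy-matching step is omitted entirely. It only shows the idea of dropping trailing words until something matches.

```go
package main

import (
	"fmt"
	"strings"
)

// partialCandidates returns progressively shorter forms of a canonical
// name, from the full string down to its first word (the genus).
// Hypothetical helper, for illustration only.
func partialCandidates(canonical string) []string {
	words := strings.Fields(canonical)
	res := make([]string, 0, len(words))
	for i := len(words); i > 0; i-- {
		res = append(res, strings.Join(words[:i], " "))
	}
	return res
}

// matchWithFallback tries an exact lookup of the full name first and then
// of each shorter candidate. It returns the first candidate found in known.
func matchWithFallback(canonical string, known map[string]bool) (string, bool) {
	for _, c := range partialCandidates(canonical) {
		if known[c] {
			return c, true
		}
	}
	return "", false
}

func main() {
	// A toy set of known canonical names (an assumption for this example).
	known := map[string]bool{"Pardosa moesta": true, "Pomatomus": true}

	// The subspecies-level string falls back to the species-level name.
	fmt.Println(matchWithFallback("Pardosa moesta alba", known))
	// The species-level string falls back to the genus.
	fmt.Println(matchWithFallback("Pomatomus saltator", known))
}
```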

Reconciliation is the normalization of lexical variants of the same name. Resolution is a determination of how a nomenclaturally registered name can be interpreted from the point of taxonomy. For example, a name can be an accepted name for species, a synonym, or a discarded one.

The gnmatcher app functions as an HTTP service. A client accesses it via the HTTP protocol. The API's methods and structures are described in the RESTful API documentation.

Input and Output

A user calls HTTP resource /match sending an array of strings to the service and gets back matched canonical names, the match type, as well as other metadata described as a Match object in the RESTful API documentation.
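A client call to the /match resource might look like the following sketch. Several details are assumptions rather than facts from the API documentation: the use of POST, the JSON array request body, and the exact path appended to the base URL. The response is kept as raw JSON because the shape of the Match object is defined in the RESTful API documentation, not here. The helper names `postNames` and `newEchoServer` are hypothetical, and the echo server only stands in for a running gnmatcher instance so that the example runs on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// postNames sends names as a JSON array to baseURL + "/match" and returns
// the raw JSON response. The HTTP method and body format are assumptions.
func postNames(baseURL string, names []string) (json.RawMessage, error) {
	body, err := json.Marshal(names)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/match", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, raw)
	}
	return json.RawMessage(raw), nil
}

// newEchoServer starts a stand-in server that echoes the request body.
// It is NOT gnmatcher; it only makes the example self-contained.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.Copy(w, r.Body)
	}))
}

func main() {
	srv := newEchoServer()
	defer srv.Close()

	// With a real service, replace srv.URL with its base URL.
	res, err := postNames(srv.URL, []string{"Pomatomus saltator", "Pardosa moesta"})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(res))
}
```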

The optimal size of the input is 5-10 thousand strings per array. Note that 10,000 is the maximum size, and larger arrays will be truncated.

If the service is used with the 'relaxed fuzzy matching' option, only 50 strings can be processed at a time.
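Because arrays above the limit are truncated, a client with a larger list needs to split it before sending. The sketch below shows one way to do that; `splitBatches` and `maxBatch` are hypothetical names, and the value 10,000 simply mirrors the limit stated above (pass 50 instead when the relaxed fuzzy matching option is in use).

```go
package main

import "fmt"

// maxBatch mirrors the 10,000-string limit stated in this section.
const maxBatch = 10000

// splitBatches cuts names into consecutive slices of at most size items.
// A non-positive size falls back to maxBatch. The returned slices share
// memory with the input and are not copies.
func splitBatches(names []string, size int) [][]string {
	if size <= 0 {
		size = maxBatch
	}
	var batches [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		batches = append(batches, names[start:end])
	}
	return batches
}

func main() {
	names := make([]string, 25000) // placeholder input
	for i, b := range splitBatches(names, maxBatch) {
		fmt.Printf("batch %d: %d strings\n", i, len(b))
	}
}
```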

Performance

To measure performance, we took 100,000 strings of which only 30% were 'real' names. On a modern CPU with 12 hyper-threads and the GNM_JOBS_NUM environment variable set to 8, the service was able to process about 8,000 strings per second. For 'clean' data where most of the names are 'real', you should see even higher performance.

Prerequisites

You will need PostgreSQL with a restored dump of gnames database.
For PostgreSQL collation to work correctly, set LC_COLLATE=C in /etc/default/locale.
Docker service

Usage

Usage with docker

Install docker gnmatcher image: docker pull gnames/gnmatcher.
Copy the .env.example file to your disk and change the values of the environment variables accordingly.
Start the service:
```
docker run -p 8080:8080 -d --env-file your-env-file \
gnames/gnmatcher -- rest -p 8080
```
This command starts the service on port 8080 inside the container and makes it available through port 8080 on the local machine.

Usage from command line

Download the [latest version] of the gnmatcher binary, untar it, and put it somewhere in PATH.
Run gnmatcher -V to generate configuration at ~/.config/gnmatcher.yaml
Edit ~/.config/gnmatcher.yaml accordingly.
Run gnmatcher rest -p 1234

The service will run on the given port (the default port is 8080).

Usage as a library

package main

import (
  "fmt"
  gnmatcher "github.com/gnames/gnmatcher/pkg"
  "github.com/gnames/gnmatcher/pkg/config"
  "github.com/gnames/gnmatcher/internal/io/bloom"
  "github.com/gnames/gnmatcher/internal/io/trie"
  "github.com/gnames/gnmatcher/internal/io/virusio"
)

func Example() {
	// Note that it takes several minutes to initialize lookup data structures.
	// Requirement for initialization: Postgresql database with loaded
	// http://opendata.globalnames.org/dumps/gnames-latest.sql.gz
	//
	// If data are imported already, it still takes several seconds to
	// load lookup data into memory.
	cfg := config.New()
	em := bloom.New(cfg)
	fm := trie.New(cfg)
	vm := virusio.New(cfg)
	gnm := gnmatcher.New(em, fm, vm, cfg)
	res := gnm.MatchNames([]string{"Pomatomus saltator", "Pardosa moesta"})
	for _, match := range res.Matches {
		fmt.Println(match.Name)
		fmt.Println(match.MatchType)
		for _, item := range match.MatchItems {
			fmt.Println(item.MatchStr)
			fmt.Println(item.EditDistance)
		}
	}
}

Configuration

You can use either a configuration file or environment variables. The configuration file is usually located at $HOME/.config/gnmatcher.yaml. When gnmatcher runs for the first time, it creates the configuration file. The logs report the location of the configuration file on every run of gnmatcher. The meaning of each configuration option is documented in the config file.

To make it easier to run gnmatcher in a container, or a Kubernetes pod, there are also environment variables that override configuration file values.

| Env. Var.     | Configuration |
|---------------|---------------|
| GNM_CACHE_DIR | CacheDir      |
| GNM_JOBS_NUM  | JobsNum       |
| GNM_PG_HOST   | PgHost        |
| GNM_PG_PORT   | PgPort        |
| GNM_PG_USER   | PgUser        |
| GNM_PG_PASS   | PgPass        |
| GNM_PG_DB     | PgDB          |
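The override behaviour can be pictured with a small sketch. It is not the project's config package: the `Config` struct and `applyEnv` function here are hypothetical, and the field types (int for JobsNum and PgPort, string for the rest) are assumptions. Only the variable names and the field names they map to come from the table above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config is a hypothetical stand-in for the settings listed in the table.
// Field types are assumptions made for this example.
type Config struct {
	CacheDir string
	JobsNum  int
	PgHost   string
	PgPort   int
	PgUser   string
	PgPass   string
	PgDB     string
}

// applyEnv returns cfg with every field replaced by its environment
// variable when that variable is set. Empty strings and integer values
// that fail to parse are ignored, leaving the file-based value in place.
func applyEnv(cfg Config, lookup func(string) (string, bool)) Config {
	setStr := func(key string, dst *string) {
		if v, ok := lookup(key); ok && v != "" {
			*dst = v
		}
	}
	setInt := func(key string, dst *int) {
		if v, ok := lookup(key); ok {
			if n, err := strconv.Atoi(v); err == nil {
				*dst = n
			}
		}
	}
	setStr("GNM_CACHE_DIR", &cfg.CacheDir)
	setInt("GNM_JOBS_NUM", &cfg.JobsNum)
	setStr("GNM_PG_HOST", &cfg.PgHost)
	setInt("GNM_PG_PORT", &cfg.PgPort)
	setStr("GNM_PG_USER", &cfg.PgUser)
	setStr("GNM_PG_PASS", &cfg.PgPass)
	setStr("GNM_PG_DB", &cfg.PgDB)
	return cfg
}

func main() {
	// Values as they might have been read from the configuration file.
	fileCfg := Config{PgHost: "localhost", PgPort: 5432, JobsNum: 4}
	cfg := applyEnv(fileCfg, os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the override logic testable without touching the real process environment.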

Client

A user can find an example of a client for the service in this test file.

The API is formally described in the RESTful API documentation.

Development

There is a docker-compose file that sets up the HTTP service for running tests. To use it, do the following:

Copy the .env.example file to the .env file in the project's root directory and change the settings accordingly.
Build the gnmatcher binary and docker image using the make dc command.
Start the service with docker compose up
Run tests via go test ./... -v
