spike: string matching algorithms #6

adamdecaf · 2019-01-08T17:18:58Z

OFAC searches are inherently messy and complicated because they deal with people's real names and/or aliases. This means there isn't a one-size-fits-all algorithm we could apply; instead we need to offer our customers lots of options.

In short, our search endpoint(s) should support multiple string comparison algorithms:

POST /v1/ofac/search/name?algos=levenshtein,exact
[
 {
   // ...
 }
]

We should also treat a missing algos parameter (or a value of all) as a request to run every string comparison algorithm. All searches run across both aliases and real names from the OFAC list.
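A minimal sketch of how the algos parameter could be parsed, assuming a comma-separated value. The function name `parse_algos`, the `SUPPORTED_ALGOS` tuple, and the exact identifier spellings are hypothetical and only illustrate the "missing or all means everything" rule described above.

```python
from typing import List, Optional

# Hypothetical identifiers for the algorithms proposed in this issue.
SUPPORTED_ALGOS = (
    "exact",
    "levenshtein",
    "jaro",
    "jaro-winkler",
    "hamming",
    "soundex",
    "ukkonen",
    "wagner-fischer",
)


def parse_algos(raw: Optional[str]) -> List[str]:
    """Turn the raw ?algos= value into a list of algorithm identifiers."""
    if raw is None:
        # No parameter at all: run every algorithm.
        return list(SUPPORTED_ALGOS)

    names = [part.strip().lower() for part in raw.split(",") if part.strip()]
    if not names or "all" in names:
        # Empty value or an explicit "all": run every algorithm.
        return list(SUPPORTED_ALGOS)

    unknown = [name for name in names if name not in SUPPORTED_ALGOS]
    if unknown:
        # Rejecting unknown names is an assumption; ignoring them is another option.
        raise ValueError("unknown algorithm(s): " + ", ".join(unknown))

    # Drop duplicates while keeping the order the caller asked for.
    return list(dict.fromkeys(names))
```

Rejecting unknown names keeps typos in the query string from silently producing an empty result set, which seems preferable for a compliance search.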

The initial list of algorithms could include:

case-insensitive and trimmed exact match
Levenshtein
Jaro
Jaro-Winkler
Hamming Distance
Soundex
Ukkonen
Wagner-Fischer

Many other algorithms are possible; see:

https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/
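Of the algorithms listed above, Jaro and Jaro-Winkler are convenient because they already produce a similarity between 0 and 1. Below is a rough sketch of both, written from the textbook definitions. The function names and the default prefix settings (scale 0.1, up to 4 characters) are assumptions, not something decided in this issue.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range 0..1 (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # positions where they disagree; half of that is the transposition count.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

The prefix boost rewards names that agree in their first few characters, which is worth keeping in mind when comparing names that differ in their first letter.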

I think the result body for an algorithm should be:

{
  "algorithm": "hamming",
  "score": 0.95
}

algorithm: lowercase name of the algorithm that produced the score, taken from the enumeration of supported algorithms.
score: a normalized match score between 0 and 1 (e.g. 0.95 -> 95% match).
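As an illustration of the proposed result body, here is a sketch that turns a Levenshtein edit distance into a 0-1 score. The normalization (one minus the distance divided by the longer length), the lowercasing and whitespace cleanup, and the names `levenshtein`, `normalize` and `score_result` are all assumptions; the issue does not specify how each algorithm's raw output should be scaled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the Wagner-Fischer dynamic program, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal runs of whitespace."""
    return " ".join(name.lower().split())


def score_result(query: str, candidate: str) -> dict:
    """Build a result body in the shape proposed above."""
    q, c = normalize(query), normalize(candidate)
    longest = max(len(q), len(c))
    # Assumed normalization: 1.0 for identical strings, 0.0 when every
    # character would have to change.
    score = 1.0 if longest == 0 else 1.0 - levenshtein(q, c) / longest
    return {"algorithm": "levenshtein", "score": round(score, 2)}
```

Other algorithms would need their own scaling to land on the same 0-1 range, so scores from different algorithms are not necessarily comparable with each other.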


adamdecaf · 2019-01-09T19:23:23Z

FYI, we should look at how ES (Elasticsearch) does Soundex and whether we're normalizing the result the way they do.

https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html (Should also check 6.x)

adamdecaf · 2019-01-09T21:47:08Z

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.

(From: https://en.wikipedia.org/wiki/Soundex)

We will likely need to find non-English alternatives to use alongside it.

Edit: From the paper directly below:

The Soundex and Metaphone algorithms are highly susceptible to missing matches where a name pair starts with a different letter (e.g. KAVANAGH, CAVANAGH and HAVANAH). This is especially true with names beginning with vowels - where typically many equivalents exist (e.g. EWELL, ULE, YOUL, WHEWELL, HEWEL), which should result in matches being reported.
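To make the first-letter problem from that quote concrete, here is a small sketch of American Soundex. This follows the commonly described rules (keep the first letter, encode consonants as digits, collapse repeats, pad to four characters); it is not necessarily how ES or any particular library implements it, and the function name is hypothetical.

```python
# Consonant groups and their Soundex digits. Vowels, Y, H and W have no digit.
SOUNDEX_CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if there are no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    code = letters[0]  # the first letter is kept verbatim
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal digits.
            continue
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # Vowels reset `previous`, so a repeated digit after a vowel is kept.
        previous = digit

    return (code + "000")[:4]
```

With this sketch, KAVANAGH encodes to K152, CAVANAGH to C152 and HAVANAH to H150. The digits for the first two agree, but the codes still differ because the first letter is kept as-is, which is exactly the kind of missed match the paper describes.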

adamdecaf · 2019-01-09T23:49:09Z

From https://waset.org/publications/8664/a-comparison-and-analysis-of-name-matching-algorithms-

Looks like the most accurate algorithms in that graph, in order, are:

NYSIIS
Phonex
LIG2

I've also found some sample implementations:

adamdecaf · 2019-01-18T03:59:33Z

FYI, if we need to come back to this, I think we should write custom ES scorers that use specific string comparison algorithms.

adamdecaf self-assigned this Jan 8, 2019

adamdecaf added this to In progress in Current Work Jan 8, 2019

adamdecaf changed the title ~~proposal: string matching algorithms~~ spike: string matching algorithms Jan 10, 2019

adamdecaf moved this from In progress to Blocked / Waiting in Current Work Jan 14, 2019

adamdecaf closed this as completed Jan 18, 2019

adamdecaf moved this from Blocked / Waiting to Done in Current Work Jan 18, 2019
