spike: string matching algorithms #6

Closed
adamdecaf opened this issue Jan 8, 2019 · 4 comments
@adamdecaf
Member

adamdecaf commented Jan 8, 2019

OFAC searches are inherently messy and complicated because they deal with people's real names and/or aliases. This means there isn't a "one-size-fits-all" algorithm we could apply; instead we need to offer our customers several options.

In short, our search endpoint(s) should be able to reflect multiple string comparison algorithms:

POST /v1/ofac/search/name?algos=levenshtein,exact
[
 {
   // ...
 }
]

We should also support omitting the algos parameter (or passing a value of all) to run every string comparison algorithm. All searches run across both aliases and real names from the OFAC list.
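The parameter handling described above could be sketched roughly like this. This is only an illustration: `supportedAlgos` and `parseAlgos` are hypothetical names, and the set of algorithms in the registry is an assumption, since the final list isn't settled in this issue.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// supportedAlgos is a hypothetical registry; the real list is still undecided.
var supportedAlgos = map[string]bool{
	"exact":       true,
	"hamming":     true,
	"levenshtein": true,
	"soundex":     true,
}

// parseAlgos interprets the raw value of the `algos` query parameter.
// An empty value or "all" selects every supported algorithm (sorted by name);
// otherwise the comma-separated names are validated and de-duplicated.
func parseAlgos(raw string) ([]string, error) {
	raw = strings.ToLower(strings.TrimSpace(raw))
	if raw == "" || raw == "all" {
		all := make([]string, 0, len(supportedAlgos))
		for name := range supportedAlgos {
			all = append(all, name)
		}
		sort.Strings(all)
		return all, nil
	}
	var out []string
	seen := make(map[string]bool)
	for _, name := range strings.Split(raw, ",") {
		name = strings.TrimSpace(name)
		if !supportedAlgos[name] {
			return nil, fmt.Errorf("unknown algorithm %q", name)
		}
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	q, _ := url.ParseQuery("algos=levenshtein,exact")
	algos, err := parseAlgos(q.Get("algos"))
	fmt.Println(algos, err) // prints "[levenshtein exact] <nil>"
}
```

Rejecting unknown names outright (rather than silently ignoring them) is a design choice made for this sketch, not something decided in the issue.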

The initial list of algorithms could include:

With many other possible algorithms,

I think the result body for an algorithm should be:

{
  "algorithm": "hamming",
  "score": 0.95
}
  • algorithm: the lowercase name of the algorithm, taken from the enumeration of supported algorithms.
  • score: the string match normalized to a value between 0 and 1 (e.g. 0.95 -> 95% match).
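To make the normalized score concrete, here is a minimal sketch that produces a result body of this shape for Levenshtein. The normalization used (1 minus the edit distance divided by the longer string's length) is an assumption for illustration; the issue doesn't specify how each algorithm should be scaled. `Result`, `levenshtein`, and `normalizedScore` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors the proposed per-algorithm response body.
type Result struct {
	Algorithm string  `json:"algorithm"`
	Score     float64 `json:"score"`
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

// normalizedScore maps the edit distance onto 0-1.
// Assumption: score = 1 - distance / length of the longer string.
func normalizedScore(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty strings are treated as identical
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	res := Result{Algorithm: "levenshtein", Score: normalizedScore("kavanagh", "cavanagh")}
	out, _ := json.Marshal(res)
	fmt.Println(string(out)) // prints {"algorithm":"levenshtein","score":0.875}
}
```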
@adamdecaf adamdecaf self-assigned this Jan 8, 2019
@adamdecaf adamdecaf added this to In progress in Current Work Jan 8, 2019
@adamdecaf
Member Author

FYI, we should look at how Elasticsearch (ES) does Soundex and whether we're normalizing the result the way they do.

https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html (Should also check 6.x)

@adamdecaf
Member Author

adamdecaf commented Jan 9, 2019

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.

(From: https://en.wikipedia.org/wiki/Soundex)

We will likely need to find non-English alternatives to use alongside it.

Edit: From the paper directly below:

The Soundex and Metaphone algorithms are highly susceptible to missing matches where a name pair starts with a different letter (e.g. KAVANAGH, CAVANAGH and HAVANAH). This is especially true with names beginning with vowels - where typically many equivalents exist (e.g. EWELL, ULE, YOUL, WHEWELL, HEWEL), which should result in matches being reported.
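The first-letter problem in the quote is easy to reproduce. Below is a small sketch of classic American Soundex as I understand it (not necessarily the variant Elasticsearch or any particular library implements); `soundex` is a hypothetical helper. Because the first letter is kept verbatim, the three names from the quote can never produce the same code.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexCodes maps consonants to their Soundex digit.
// Vowels, Y, H and W are intentionally absent.
var soundexCodes = map[rune]byte{
	'B': '1', 'F': '1', 'P': '1', 'V': '1',
	'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
	'D': '3', 'T': '3',
	'L': '4',
	'M': '5', 'N': '5',
	'R': '6',
}

// soundex returns the 4-character code: first letter plus three digits.
// Non A-Z characters are ignored.
func soundex(name string) string {
	var out []byte
	var last byte // digit of the previous coded letter; 0 after a vowel
	for _, r := range strings.ToUpper(name) {
		if r < 'A' || r > 'Z' {
			continue
		}
		code, coded := soundexCodes[r]
		if len(out) == 0 {
			out = append(out, byte(r)) // first letter is kept as-is
			last = code
			continue
		}
		switch {
		case coded:
			if code != last {
				out = append(out, code)
			}
			last = code
		case r != 'H' && r != 'W':
			last = 0 // vowels and Y separate repeated digits; H and W do not
		}
		if len(out) == 4 {
			break
		}
	}
	for len(out) < 4 {
		out = append(out, '0') // pad short codes with zeros
	}
	return string(out)
}

func main() {
	for _, n := range []string{"KAVANAGH", "CAVANAGH", "HAVANAH"} {
		fmt.Println(n, soundex(n))
	}
	// prints:
	// KAVANAGH K152
	// CAVANAGH C152
	// HAVANAH H150
}
```

KAVANAGH and CAVANAGH share the digits 152 but differ in the leading letter, so an exact comparison of Soundex codes reports no match.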

@adamdecaf adamdecaf changed the title proposal: string matching algorithms spike: string matching algorithms Jan 10, 2019
@adamdecaf adamdecaf moved this from In progress to Blocked / Waiting in Current Work Jan 14, 2019
@adamdecaf
Member Author

FYI, if we need to come back to this I think we should write custom ES scorers with specific string compare algorithms.

@adamdecaf adamdecaf moved this from Blocked / Waiting to Done in Current Work Jan 18, 2019