spike: string matching algorithms #6
Comments
FYI, we should look at how ES does Soundex and whether we normalize the result the same way they do. https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html (We should also check 6.x.)
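For reference while comparing, here is a minimal sketch of classic American Soundex in Python. It is an illustration under stated assumptions, not what ES or this project actually does: the function name `soundex`, the handling of empty input, and dropping non-ASCII letters are all choices made for this example.

```python
# Minimal sketch of American Soundex (illustrative, not the ES implementation).
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no ASCII letters."""
    # Assumption: non-ASCII letters are dropped rather than transliterated.
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return ("".join(out) + "000")[:4]  # pad with zeros, truncate to 4 characters


if __name__ == "__main__":
    for n in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(n, soundex(n))
```

Whatever normalization ES applies before or after encoding (lowercasing, token filters, and so on) would still need to be checked against their docs; this sketch only covers the encoding step.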
(From: https://en.wikipedia.org/wiki/Soundex) We will likely need to find non-English alternatives to use alongside it. Edit: From the paper directly below:
From https://waset.org/publications/8664/a-comparison-and-analysis-of-name-matching-algorithms- Looks like the order of most accurate in that graph:
I've also found some sample implementations:
FYI, if we need to come back to this, I think we should write custom ES scorers with specific string comparison algorithms.
OFAC searches are inherently messy and complicated because they deal with people's real names and/or aliases. This means there isn't a "one-size-fits-all" algorithm we could apply; instead, we need to offer our customers lots of options.
In short, our search endpoint(s) should support multiple string comparison algorithms:
We should also allow the `algos` parameter to be omitted (or set to `all`) to run all string comparison algorithms. All searches run across aliases and real names from the OFAC list.
The initial list of algorithms could include:
With many other possible algorithms,
I think the result body for an algorithm should be:
`algorithm`: lowercase enumeration of all algorithms.
`score`: a normalized match score between 0 and 1 (e.g. 0.95 -> 95% match).
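To make the proposal concrete, here is a rough Python sketch of how `algos` selection and the per-algorithm result body could fit together. Everything beyond the `algos` parameter, the `all` value, and the `algorithm`/`score` fields is an assumption for illustration: the comma-separated `algos` format, the algorithm names `exact` and `levenshtein`, and the helpers `select_algorithms` and `search_name` are hypothetical, and taking the best score across a name and its aliases is just one possible reading of "run across aliases and real names".

```python
from typing import Callable, Dict, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_score(a: str, b: str) -> float:
    """Normalize edit distance to 0-1, where 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_score(a: str, b: str) -> float:
    """1.0 on a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


# Hypothetical registry; keys are the lowercase `algorithm` enumeration values.
ALGORITHMS: Dict[str, Callable[[str, str], float]] = {
    "exact": exact_score,
    "levenshtein": levenshtein_score,
}


def select_algorithms(algos: Optional[str]) -> List[str]:
    """A missing `algos` parameter, or the value `all`, selects every algorithm."""
    if algos is None or algos.strip().lower() in ("", "all"):
        return sorted(ALGORITHMS)
    # Assumption: multiple algorithms are passed as a comma-separated list.
    names = [n.strip().lower() for n in algos.split(",") if n.strip()]
    unknown = [n for n in names if n not in ALGORITHMS]
    if unknown:
        raise ValueError("unknown algorithm(s): " + ", ".join(unknown))
    return names


def search_name(query: str, names: List[str], algos: Optional[str] = None) -> List[dict]:
    """Score query against a real name plus its aliases, one result per algorithm."""
    results = []
    for name in select_algorithms(algos):
        score_fn = ALGORITHMS[name]
        best = max((score_fn(query, candidate) for candidate in names), default=0.0)
        results.append({"algorithm": name, "score": round(best, 4)})
    return results


if __name__ == "__main__":
    print(search_name("jon smith", ["John Smith", "Johnny S."]))
```

Keeping the algorithms behind a name-to-function registry means adding another comparison later only touches the registry, and the `algorithm` enumeration in the response stays in sync with what `algos` accepts.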