
Experimental Improved Search Algorithm #524

Merged
merged 7 commits into from Dec 14, 2023

Conversation

tomdaffurn (Contributor) commented Dec 8, 2023

This is a rewrite of the jaroWinkler function with the goal of improving scoring performance. The new algorithm changes several things:

  1. Compare tokens in the search to the index tokens
    • i.e. "find matches for every search token" rather than "find match for every indexed token"
    • Improves scores of searches that don't include "middle" names
    • Prevents sanctioned names that are 1 word (HADI, EMMA, KAMILA) matching long searches
    • Has a side-effect that short search terms will have more false positives. I think this is a good trade off as the sanction lists will always contain the full name, but the search might not
  2. Once a token has matched something, it can't match a different token
    • This prevents names with repeated words having artificially high scores
    • e.g. prevents any search containing "Vladimir" matching "VLADIMIROV, Vladimir Vladimirovich"
  3. Weights each word-score by the length of the word, relative to the search and indexed name
    • This corrects for the error introduced by splitting names into tokens and doing piecewise Jaro-Winkler scoring
    • Combining word-scores with a simple average gives short words (like Li, Al) equal weight to much longer words
    • The length-weighted scores are comparable to what you get by doing whole-name to whole-name Jaro-Winkler comparison
  4. Punishes word-scores when the matching tokens have significantly different length
  5. Punishes word-scores when the matching tokens start with different letters
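The matching and weighting ideas above can be sketched in Go. This is an illustrative sketch, not the PR's actual code: the constants (`lengthDiffPenalty`, `differentLetterPenalty`) and the pluggable `sim` function standing in for Jaro-Winkler are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// combineTokenScores sketches the scoring ideas above: every search token
// matches at most one index token (greedy, best score first), each pair is
// weighted by the search token's length, and pairs whose tokens differ a lot
// in length or start with different letters are penalized. sim stands in for
// the Jaro-Winkler comparison.
func combineTokenScores(search, index []string, sim func(a, b string) float64) float64 {
	const (
		lengthDiffPenalty      = 0.9 // tokens of significantly different length
		differentLetterPenalty = 0.8 // tokens starting with different letters
	)
	type pair struct {
		s, i  int
		score float64
	}
	var pairs []pair
	for si, s := range search {
		for ii, ix := range index {
			sc := sim(s, ix)
			if d := len(s) - len(ix); d > 2 || d < -2 {
				sc *= lengthDiffPenalty
			}
			if len(s) > 0 && len(ix) > 0 && s[0] != ix[0] {
				sc *= differentLetterPenalty
			}
			pairs = append(pairs, pair{si, ii, sc})
		}
	}
	sort.Slice(pairs, func(a, b int) bool { return pairs[a].score > pairs[b].score })

	// One-to-one matching: once a token has matched, it can't match again.
	usedS, usedI := map[int]bool{}, map[int]bool{}
	total := 0.0
	for _, p := range pairs {
		if usedS[p.s] || usedI[p.i] {
			continue
		}
		usedS[p.s], usedI[p.i] = true, true
		total += p.score * float64(len(search[p.s]))
	}

	// Normalize by the total search length, so unmatched search tokens and
	// short words count in proportion to their length.
	weight := 0.0
	for _, s := range search {
		weight += float64(len(s))
	}
	if weight == 0 {
		return 0
	}
	return total / weight
}

func main() {
	eq := func(a, b string) float64 {
		if a == b {
			return 1
		}
		return 0
	}
	// A repeated "vladimir" in the search can only match once, halving the score.
	fmt.Println(combineTokenScores([]string{"vladimir", "putin"}, []string{"vladimirov", "vladimir", "vladimirovich"}, eq))
	fmt.Println(combineTokenScores([]string{"vladimir", "vladimir"}, []string{"vladimir"}, eq))
}
```

With an exact-match similarity, the second call scores 0.5: the duplicate token cannot match the single index token twice.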

The resulting search behaviour has significantly better true positive rate AND false positive rate. Examples of this are shown in cmd/server/new_algorithm_test.go.

I've done testing with 2000 real customer names, and with 50 sanctioned names. The aggregated results are below. I can share the 50 sanctioned names data, but the 2000 customer names are too sensitive to share.

I haven't fixed all of the tests or written enough new tests yet, but I'm happy to do so if you like this change.

(screenshots of the aggregated results)

{"the group for the preservation of the holy sites", "the group", 0.416},
{precompute("the group for the preservation of the holy sites"), precompute("the group"), 0.416},
{"group preservation holy sites", "group", 0.460},
{"the group for the preservation of the holy sites", "the group", 0.880},
tomdaffurn (Contributor, Author) commented Dec 8, 2023

This is a good example of results changing because of the preference for matching all of the search name over matching all of the indexed name. If this is undesirable for a user, they can increase UNMATCHED_INDEX_TOKEN_WEIGHT; it's currently set very low.
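One way to picture how such a knob could work, as a sketch under the assumption that unmatched index tokens contribute weight to the denominator but no score (the function name and exact formula are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// scoreWithUnmatched illustrates a hypothetical unmatched-index-token weight:
// unmatched index tokens add (their total length * weight) to the denominator.
// Near 0 the knob barely lowers the score; near 1.0 it punishes searches that
// cover only a small part of the indexed name.
func scoreWithUnmatched(matchedScore, matchedLen, unmatchedIndexLen, unmatchedWeight float64) float64 {
	return matchedScore * matchedLen / (matchedLen + unmatchedIndexLen*unmatchedWeight)
}

func main() {
	// A perfect match on a short search against a long indexed name:
	fmt.Println(scoreWithUnmatched(1.0, 8, 32, 0.0)) // weight 0: full score
	fmt.Println(scoreWithUnmatched(1.0, 8, 32, 1.0)) // weight 1: heavily punished
}
```

Raising the weight toward 1.0 drags a partial match down toward matchedLen/(matchedLen+unmatchedIndexLen), which is the behaviour a user would want if short searches matching long sanctioned names are producing false positives.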

Member

Yea I agree. Adding that knob will help some use-cases to lower these types of scores.

@tomdaffurn tomdaffurn marked this pull request as ready for review December 8, 2023 05:33
adamdecaf
adamdecaf previously approved these changes Dec 12, 2023
Comment on lines +370 to +371
// TODO: should use a phonetic comparison here, like Soundex
score = score * differentLetterPenaltyWeight
adamdecaf (Member) commented Dec 12, 2023

Agreed. I had been thinking we should detect the same. Soundex is fairly focused on English words though, so it may need to be adapted for international-to-English translations and names.

tomdaffurn (Contributor, Author)

Great info. Thanks Adam, I'll have a read.
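For reference, the classic American Soundex mentioned above can be sketched in a few lines of Go. This is the standard textbook algorithm, not code from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// soundex returns the classic American Soundex code (one letter + three
// digits) for an ASCII name. Vowels break a run of equal codes; H and W are
// skipped without breaking the run; the first letter is kept as-is.
func soundex(name string) string {
	codes := map[byte]byte{
		'B': '1', 'F': '1', 'P': '1', 'V': '1',
		'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
		'D': '3', 'T': '3',
		'L': '4',
		'M': '5', 'N': '5',
		'R': '6',
	}
	s := strings.ToUpper(name)
	if s == "" {
		return ""
	}
	out := []byte{s[0]}
	prev := codes[s[0]]
	for i := 1; i < len(s) && len(out) < 4; i++ {
		c := codes[s[i]]
		if c == 0 { // vowel, H, W, Y, or non-letter
			if s[i] != 'H' && s[i] != 'W' {
				prev = 0 // vowels separate runs; H and W do not
			}
			continue
		}
		if c != prev {
			out = append(out, c)
			prev = c
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Robert"), soundex("Ashcraft"), soundex("Tymczak")) // R163 A261 T522
}
```

As Adam notes, Soundex codes English phonetics, so transliterated names may need a different phonetic scheme or adapted coding tables.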

adamdecaf (Member)

This is excellent @tomdaffurn! Thank you for the contribution. I had been thinking about how to implement a couple of these improvements, but your solution is excellent. From the results I've seen this could be merged and replace the existing algorithm. We've made similar releases in the past.

tomdaffurn (Contributor, Author)

Thanks for the review and tick Adam! You've got a great tool here, and it's fun to work on.

There were some linting errors in my code, so I've fixed those and added to README.md

codecov-commenter
Codecov Report

Merging #524 (2532acd) into master (1ef25be) will increase coverage by 1.63%.
Report is 7 commits behind head on master.
The diff coverage is 0.00%.

❗ Current head 2532acd differs from pull request most recent head ddb46e7. Consider uploading reports for the commit ddb46e7 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #524      +/-   ##
=========================================
+ Coverage    8.18%   9.81%   +1.63%     
=========================================
  Files          44      38       -6     
  Lines        3531    2811     -720     
=========================================
- Hits          289     276      -13     
+ Misses       3219    2511     -708     
- Partials       23      24       +1     

@adamdecaf adamdecaf merged commit f476bbd into moov-io:master Dec 14, 2023
17 checks passed