
Experimental Improved Search Algorithm #524

Merged
merged 7 commits into from Dec 14, 2023

Conversation

tomdaffurn (Contributor) commented Dec 8, 2023

This is a rewrite of the jaroWinkler function with the goal of improving scoring performance. The new algorithm changes several things:

  1. Compare tokens in the search to the index tokens
    • i.e. "find matches for every search token" rather than "find match for every indexed token"
    • Improves scores of searches that don't include "middle" names
    • Prevents sanctioned names that are 1 word (HADI, EMMA, KAMILA) matching long searches
    • Has a side-effect that short search terms will have more false positives. I think this is a good trade off as the sanction lists will always contain the full name, but the search might not
  2. Once a token has matched something, it can't match a different token
    • This prevents names with repeated words having artificially high scores
    • e.g. prevents any search containing "Vladimir" matching "VLADIMIROV, Vladimir Vladimirovich"
  3. Weights each word-score by the length of the word, relative to the search and indexed name
    • This corrects for the error introduced by splitting names into tokens and doing piecewise Jaro-Winkler scoring
    • Combining word-scores with a simple average gives short words (like Li, Al) equal weight to much longer words
    • The length-weighted scores are comparable to what you get by doing whole-name to whole-name Jaro-Winkler comparison
  4. Punishes word-scores when the matching tokens have significantly different length
  5. Punishes word-scores when the matching tokens start with different letters
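The matching and weighting ideas above can be sketched in Go. This is an illustrative sketch, not the PR's actual code: the constants (`lengthDiffPenalty`, `differentLetterPenalty`) and the pluggable `sim` function standing in for Jaro-Winkler are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// combineTokenScores sketches the scoring ideas above: every search token
// matches at most one index token (greedy, best score first), each pair is
// weighted by the search token's length, and pairs whose tokens differ a lot
// in length or start with different letters are penalized. sim stands in for
// the Jaro-Winkler comparison.
func combineTokenScores(search, index []string, sim func(a, b string) float64) float64 {
	const (
		lengthDiffPenalty      = 0.9 // tokens of significantly different length
		differentLetterPenalty = 0.8 // tokens starting with different letters
	)
	type pair struct {
		s, i  int
		score float64
	}
	var pairs []pair
	for si, s := range search {
		for ii, ix := range index {
			sc := sim(s, ix)
			if d := len(s) - len(ix); d > 2 || d < -2 {
				sc *= lengthDiffPenalty
			}
			if len(s) > 0 && len(ix) > 0 && s[0] != ix[0] {
				sc *= differentLetterPenalty
			}
			pairs = append(pairs, pair{si, ii, sc})
		}
	}
	sort.Slice(pairs, func(a, b int) bool { return pairs[a].score > pairs[b].score })

	// One-to-one matching: once a token has matched, it can't match again.
	usedS, usedI := map[int]bool{}, map[int]bool{}
	total := 0.0
	for _, p := range pairs {
		if usedS[p.s] || usedI[p.i] {
			continue
		}
		usedS[p.s], usedI[p.i] = true, true
		total += p.score * float64(len(search[p.s]))
	}

	// Normalize by the total search length, so unmatched search tokens and
	// short words count in proportion to their length.
	weight := 0.0
	for _, s := range search {
		weight += float64(len(s))
	}
	if weight == 0 {
		return 0
	}
	return total / weight
}

func main() {
	eq := func(a, b string) float64 {
		if a == b {
			return 1
		}
		return 0
	}
	// A repeated "vladimir" in the search can only match once, halving the score.
	fmt.Println(combineTokenScores([]string{"vladimir", "putin"}, []string{"vladimirov", "vladimir", "vladimirovich"}, eq))
	fmt.Println(combineTokenScores([]string{"vladimir", "vladimir"}, []string{"vladimir"}, eq))
}
```

With an exact-match similarity, the second call scores 0.5: the duplicate token cannot match the single index token twice.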

The resulting search behaviour has significantly better true positive rate AND false positive rate. Examples of this are shown in cmd/server/new_algorithm_test.go.

I've done testing with 2000 real customer names, and with 50 sanctioned names. The aggregated results are below. I can share the 50 sanctioned names data, but the 2000 customer names are too sensitive to share.

I haven't fixed all of the tests or written enough new tests yet, but I'm happy to do so if you like this change.

(screenshots of the aggregated results)

{"the group for the preservation of the holy sites", "the group", 0.416},
{precompute("the group for the preservation of the holy sites"), precompute("the group"), 0.416},
{"group preservation holy sites", "group", 0.460},
{"the group for the preservation of the holy sites", "the group", 0.880},
tomdaffurn (Contributor, Author) commented Dec 8, 2023

This is a good example of results changing because of the preference for matching all of the search name over matching all of the indexed name. If this is undesirable for a user, they can increase UNMATCHED_INDEX_TOKEN_WEIGHT; it's currently set very low.
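One way to picture how such a knob could work, as a sketch under the assumption that unmatched index tokens contribute weight to the denominator but no score (the function name and exact formula are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// scoreWithUnmatched illustrates a hypothetical unmatched-index-token weight:
// unmatched index tokens add (their total length * weight) to the denominator.
// Near 0 the knob barely lowers the score; near 1.0 it punishes searches that
// cover only a small part of the indexed name.
func scoreWithUnmatched(matchedScore, matchedLen, unmatchedIndexLen, unmatchedWeight float64) float64 {
	return matchedScore * matchedLen / (matchedLen + unmatchedIndexLen*unmatchedWeight)
}

func main() {
	// A perfect match on a short search against a long indexed name:
	fmt.Println(scoreWithUnmatched(1.0, 8, 32, 0.0)) // weight 0: full score
	fmt.Println(scoreWithUnmatched(1.0, 8, 32, 1.0)) // weight 1: heavily punished
}
```

Raising the weight toward 1.0 drags a partial match down toward matchedLen/(matchedLen+unmatchedIndexLen), which is the behaviour a user would want if short searches matching long sanctioned names are producing false positives.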

Member

Yea I agree. Adding that knob will help some use-cases to lower these types of scores.

@tomdaffurn tomdaffurn marked this pull request as ready for review December 8, 2023 05:33
adamdecaf
adamdecaf previously approved these changes Dec 12, 2023
Comment on lines +370 to +371
// TODO: should use a phonetic comparison here, like Soundex
score = score * differentLetterPenaltyWeight
adamdecaf (Member) commented Dec 12, 2023

Agreed. I had been thinking we should detect the same. Soundex is fairly focused on English words though, so it may need to be adapted for international-to-English translations and names.

tomdaffurn (Contributor, Author)

Great info. Thanks Adam, I'll have a read.
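For reference, the classic American Soundex mentioned above can be sketched in a few lines of Go. This is the standard textbook algorithm, not code from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// soundex returns the classic American Soundex code (one letter + three
// digits) for an ASCII name. Vowels break a run of equal codes; H and W are
// skipped without breaking the run; the first letter is kept as-is.
func soundex(name string) string {
	codes := map[byte]byte{
		'B': '1', 'F': '1', 'P': '1', 'V': '1',
		'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
		'D': '3', 'T': '3',
		'L': '4',
		'M': '5', 'N': '5',
		'R': '6',
	}
	s := strings.ToUpper(name)
	if s == "" {
		return ""
	}
	out := []byte{s[0]}
	prev := codes[s[0]]
	for i := 1; i < len(s) && len(out) < 4; i++ {
		c := codes[s[i]]
		if c == 0 { // vowel, H, W, Y, or non-letter
			if s[i] != 'H' && s[i] != 'W' {
				prev = 0 // vowels separate runs; H and W do not
			}
			continue
		}
		if c != prev {
			out = append(out, c)
			prev = c
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Robert"), soundex("Ashcraft"), soundex("Tymczak")) // R163 A261 T522
}
```

As Adam notes, Soundex codes English phonetics, so transliterated names may need a different phonetic scheme or adapted coding tables.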

adamdecaf (Member)

This is excellent @tomdaffurn! Thank you for the contribution. I had been thinking about how to implement a couple of these improvements, but your solution is excellent. From the results I've seen this could be merged and replace the existing algorithm. We've made similar releases in the past.

tomdaffurn (Contributor, Author)

Thanks for the review and tick Adam! You've got a great tool here, and it's fun to work on.

There were some linting errors in my code, so I've fixed those and added to README.md

codecov-commenter
Codecov Report

Merging #524 (2532acd) into master (1ef25be) will increase coverage by 1.63%.
Report is 7 commits behind head on master.
The diff coverage is 0.00%.

❗ Current head 2532acd differs from pull request most recent head ddb46e7. Consider uploading reports for the commit ddb46e7 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #524      +/-   ##
=========================================
+ Coverage    8.18%   9.81%   +1.63%     
=========================================
  Files          44      38       -6     
  Lines        3531    2811     -720     
=========================================
- Hits          289     276      -13     
+ Misses       3219    2511     -708     
- Partials       23      24       +1     

@adamdecaf adamdecaf merged commit f476bbd into moov-io:master Dec 14, 2023
17 checks passed