Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out-of-order search #24

Open
orlade opened this issue Oct 1, 2015 · 2 comments
Open

Out-of-order search #24

orlade opened this issue Oct 1, 2015 · 2 comments

Comments

@orlade
Copy link

orlade commented Oct 1, 2015

I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.

tl;rd: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside of the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.

For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them use the same IDs or names consistently. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.

For example, when searching for a match for 2G8 Bahagian Pinjaman Perumahan, string_score with 0.5 fuzziness thinks that PMO is a better match than LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM. Or for a more English example, comparing university of oxford with oxford of university scores 0.027.

To address this failure mode, I've wrapped it in a pretty gnarly loop:

  1. Both the search string and the comparison string are split into words
  2. Each word of the search string is string_score'd against each word of the comparison string (concatenating the whole comparison string as an option for matching abbreviations)
  3. For each search word, the score for the most similar comparison word is recorded
  4. The score for the search string against the comparison string is taken as the sum of the (maximum) score of each word, normalised by dividing by the number of search words.

Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?

@joshaven
Copy link
Owner

joshaven commented Oct 6, 2015

Thanks for your interest in my code, it is an honor to know that people are still finding it useful.

Would it work to split the out-of-order text into segments and sum their scores?

var agScore = 0, segments = "World Hello".split(' ');
segments.forEach(function (word) { agScore = agScore + "Hello World".score(word); });
console.log('The aggregate score of all word segments is: ' + agScore);

I am super busy so I am afraid I cannot answer how feasible it is to modify the algorithm to support better solution at the moment.

If you would like to see this feature builtin to some degree can you describe the desired results in the test language? See the following for examples: https://github.com/joshaven/string_score/blob/master/tests/test_string_score.js

@orlade
Copy link
Author

orlade commented Oct 9, 2015

No problem, I hardly expect you to go and add features, just wondering if you had any brainwaves.

My previous example would be something like ok('University of Melbourne'.score('Melbourne Uni') < 'University of Melbourne'.score('Melbourne University'), 'Better out-of-order matches still score better than worse out-of-order matches'); (this passes with your method).

I think your method works in a basic sense, but the score needs to be penalised/normalised so that adding in additional unmatched words will be less good, i.e. ok('hello world'.score('world foo bar hello') < 'hello world'.score('world hello'), 'Unmatched words should be penalised');.

But normalising has to be a little intelligent such that ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Unimelb'), 'Multiple out-of-order better than one wrong'); and ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Uni versity of Melb ourne'), 'Fewer but more accurate words better than more close but incorrect fragments ');.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants