Searching with punctuation #52

Closed
PetesBreenCoding opened this issue Oct 13, 2022 · 6 comments
Labels
question Further information is requested

Comments

@PetesBreenCoding

Hi,

If I have
await index.AddAsync(1, "Murphy's law");

then
index.Search("*Murphy's*")
will match that entry, but
index.Search("*Murphys*")
will not, but I would like it to.

What's the best approach to solving this? I could strip punctuation from the strings I input to the index and the query, but there may be a nicer way. I have played around with fuzzy matching, but never quite got it right.

Thanks

@mikegoatly
Owner

Hi @PetesBreenCoding,

The short answer is that, as things stand, fuzzy matching will do the trick for you. I think the confusion comes from your use of wildcards in the query. This fiddle has some working code for you to play with:

using Lifti;

// Treat every search term as a fuzzy match unless the query says otherwise
var index = new FullTextIndexBuilder<int>()
    .WithQueryParser(o => o.AssumeFuzzySearchTerms())
    .Build();

await index.AddAsync(1, "Murphy's law");

Console.WriteLine(index.Search("Murphys").Count());
// Prints "1"

Console.WriteLine(index.Search("Murphy's").Count());
// Prints "1"

But I feel that your question deserves a bit of a deeper dive into what's going on.

By default LIFTI will split words on punctuation, including apostrophes. This means that (rightly or wrongly) "Murphy's law" will actually get tokenized as three words:

  • Murphy
  • s
  • law
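
The split described above can be imitated with a few lines of plain C#. This is only a toy model, not LIFTI's tokenizer: the `Tokenize` helper is hypothetical, and the upper-casing is an assumption that simply mirrors the upper-case terms printed by the parsed queries in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of LIFTI): split on whitespace and
// punctuation, upper-casing each token along the way.
static List<string> Tokenize(string text)
{
    var tokens = new List<string>();
    var current = new StringBuilder();

    foreach (var ch in text)
    {
        if (char.IsWhiteSpace(ch) || char.IsPunctuation(ch))
        {
            // A separator ends the current token, if there is one.
            if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        else
        {
            current.Append(char.ToUpperInvariant(ch));
        }
    }

    // Flush the final token.
    if (current.Length > 0)
    {
        tokens.Add(current.ToString());
    }

    return tokens;
}

Console.WriteLine(string.Join(", ", Tokenize("Murphy's law")));
// Prints "MURPHY, S, LAW"
```

The apostrophe counts as punctuation, so it acts as a separator in exactly the same way as the space does.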

When you search for "Murphy's", the query is actually parsed as "Murphy & s", because search terms are split using the same tokenization. You can see this using the following code (with fuzzy matching on by default, as configured above):

var query = index.QueryParser.Parse(index.FieldLookup, "Murphy's", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());

// Prints "?3,?MURPHY & ?0,?S"

This searches for documents containing fuzzy matches for both "murphy" and "s". Because "s" is so short, only an exact match is acceptable: the zero in ?0,? states that no edits are allowed, while the 3 in ?3,? allows up to three edits for "murphy".
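
To see why a fuzzy search for "Murphys" finds the document, it helps to count the edits by hand. The sketch below is a textbook Levenshtein distance in plain C#, offered as an approximation of the idea rather than LIFTI's own matching code; the `EditDistance` helper is hypothetical.

```csharp
using System;

// Hypothetical helper (not part of LIFTI): classic dynamic-programming
// Levenshtein distance, keeping only two rows of the table.
static int EditDistance(string a, string b)
{
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];

    // Distance from the empty prefix of a to each prefix of b.
    for (var j = 0; j <= b.Length; j++)
    {
        previous[j] = j;
    }

    for (var i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (var j = 1; j <= b.Length; j++)
        {
            var substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            var deletion = previous[j] + 1;
            var insertion = current[j - 1] + 1;
            current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }

        // The row just filled becomes the "previous" row for the next pass.
        (previous, current) = (current, previous);
    }

    return previous[b.Length];
}

Console.WriteLine(EditDistance("MURPHYS", "MURPHY"));
// Prints "1"
```

A single edit (dropping the trailing S) turns the search term MURPHYS into the indexed token MURPHY, so any budget of one or more edits is enough for the match.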

When you add wildcards into the mix, for your first example you get this:

var query = index.QueryParser.Parse(index.FieldLookup, "*Murphy's*", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());
// Prints "*MURPHY & S*"

This essentially means documents containing any word ending with "murphy" and any word starting with "s". It matches your document by coincidence, because the index holds both a "murphy" token and an "s" token.

But your second search looks like this:

var query = index.QueryParser.Parse(index.FieldLookup, "*Murphys*", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());
// Prints "*MURPHYS*"

This means documents containing any word that contains "murphys" in full. No token in your index does (the indexed tokens are "murphy", "s" and "law"), so there is no match.
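
The two wildcard outcomes can be replayed against the three indexed tokens in plain C#. This is a toy check built on string operations, not LIFTI's query engine; it assumes `*X` means "ends with X", `X*` means "starts with X" and `*X*` means "contains X", as described in this comment.

```csharp
using System;
using System.Linq;

// The tokens LIFTI is described as storing for "Murphy's law".
var tokens = new[] { "MURPHY", "S", "LAW" };

// "*MURPHY & S*": some token ends with MURPHY AND some token starts with S.
var firstQuery =
    tokens.Any(t => t.EndsWith("MURPHY", StringComparison.Ordinal)) &&
    tokens.Any(t => t.StartsWith("S", StringComparison.Ordinal));

// "*MURPHYS*": a single token must contain MURPHYS.
var secondQuery = tokens.Any(t => t.Contains("MURPHYS", StringComparison.Ordinal));

Console.WriteLine(firstQuery);  // Prints "True"
Console.WriteLine(secondQuery); // Prints "False"
```

The first query succeeds only because its two halves are satisfied by two different tokens; the second needs one token to carry the whole string.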

mikegoatly added the question label on Oct 13, 2022

@PetesBreenCoding
Author

Thanks very much @mikegoatly. I have been playing around with it a bit, and I have it working pretty well with the fuzzy logic. I also mixed in the wildcard search like so:
?1?query | *query*

This is so I can match on "murphys" and "murphy's", but also on "mur". I found the default fuzzy matching was returning too many irrelevant results in my database. I'll tweak this over time once I get used to the library a bit more.
Thanks again!
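
A small helper can assemble that combined query from a raw search term. This is a sketch in plain C#: `BuildQuery` is a hypothetical name, the `?1?` prefix is copied verbatim from the comment above (assumed here to cap the fuzzy match at one edit), and the input is assumed to be a single word with no spaces or query operators in it.

```csharp
using System;

// Hypothetical helper: a fuzzy match (assumed: at most one edit) OR a
// "contains" wildcard match on the same term.
static string BuildQuery(string term) => $"?1?{term} | *{term}*";

Console.WriteLine(BuildQuery("mur"));
// Prints "?1?mur | *mur*"
```

The resulting string would then be passed to index.Search in the same way as the queries earlier in the thread.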

@mikegoatly
Owner

No problem at all, glad to help.

One thing to be aware of is that using a * at the start of your search terms isn't particularly efficient for large indexes at the moment because of the way the index structure has to be recursively scanned to find the first character to match. The queries will be faster if you can just use a wildcard at the end, e.g. mur*, although you may not notice if your index isn't particularly big.
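
A toy character trie gives a feel for where the extra cost comes from. This is an assumption-laden sketch in plain C#, not LIFTI's index: the array-backed trie and the `Add`, `PrefixVisits` and `ContainsVisits` helpers are all hypothetical. A trailing-wildcard lookup follows one path from the root, while a leading wildcard has to try starting the match at every node.

```csharp
using System;
using System.Collections.Generic;

// Node i's children live in children[i]; node 0 is the root.
var children = new List<Dictionary<char, int>> { new Dictionary<char, int>() };

void Add(string word)
{
    var node = 0;
    foreach (var ch in word)
    {
        if (!children[node].TryGetValue(ch, out var next))
        {
            // Create the missing child node.
            next = children.Count;
            children.Add(new Dictionary<char, int>());
            children[node][ch] = next;
        }
        node = next;
    }
}

// "mur*": walk a single path from the root, counting the nodes touched.
int PrefixVisits(string prefix)
{
    var node = 0;
    var visits = 1; // the root
    foreach (var ch in prefix)
    {
        if (!children[node].TryGetValue(ch, out node)) break;
        visits++;
    }
    return visits;
}

// "*mur*": the match may start anywhere, so every node is a candidate start.
// Looping over all nodes stands in for a recursive scan of the whole trie.
int ContainsVisits(string infix)
{
    var visits = 0;
    for (var start = 0; start < children.Count; start++)
    {
        var node = start;
        visits++; // touching the candidate start node
        foreach (var ch in infix)
        {
            if (!children[node].TryGetValue(ch, out node)) break;
            visits++;
        }
    }
    return visits;
}

foreach (var word in new[] { "murphy", "s", "law", "murmur", "lemur" })
{
    Add(word);
}

var prefixCost = PrefixVisits("mur");
var containsCost = ContainsVisits("mur");
Console.WriteLine($"mur*  touches {prefixCost} nodes");
Console.WriteLine($"*mur* touches {containsCost} nodes out of {children.Count}");
```

In this model the prefix lookup cost depends only on the length of the search term, whereas the leading-wildcard cost grows with the number of nodes in the trie.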

@PetesBreenCoding
Author

At the moment, this particular index is only 2,000 records, and will probably never be any more than 10,000. I assume they would be considered small numbers?

@mikegoatly
Owner

It'll probably be fine - it was more just something to be aware of. Let me know if you run into any problems though.

@mikegoatly
Owner

Closing this as the question is resolved. Feel free to raise another issue if anything else comes up.
