Wildcard #58

Closed
markwriter opened this issue Nov 15, 2022 · 2 comments

Comments

@markwriter

I was hoping to use the % wildcard to find both "stationary" and "stationery". Searching for "station%ry" did not return any results, whereas "stationary stationery" works as expected.

The JSON is enclosed:
umClasses.zip

// Verbatim string so the backslashes in the path are not treated as escapes
var jsonFile = File.ReadAllText(@"C:\Users\manderson\Documents\umClasses.json");
var searchModels =
    JsonConvert.DeserializeObject<System.Collections.Generic.List>(jsonFile);
var index = new FullTextIndexBuilder()
    .WithDefaultTokenization(o => o.WithStemming())
    .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
    .WithTextExtractor()
    .WithObjectTokenization(
        itemOptions => itemOptions
            .WithKey(c => c.ClassDescriptionId)
            .WithField("GLGuidelines", f => f.GLGuidelines)
    ).Build();

// Await the add so that indexing has finished before searching
await index.AddRangeAsync(searchModels);

// Was hoping these two counts would be the same
var searchResults = index?.Search("stationary stationery").Count();
var searchResults2 = index?.Search("station%ry").Count();
@mikegoatly
Owner

The weirdness you're seeing is an unfortunate side-effect of using stemming in the index. The words "stationary" and "stationery" get stemmed to different forms:

Console.WriteLine(index.DefaultTokenizer.Process("stationary")[0].Value);
Console.WriteLine(index.DefaultTokenizer.Process("stationery")[0].Value);

// Output:
// STATIONARI
// STATIONERI

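This makes wildcard matching behave unexpectedly: the index holds the stemmed forms, which end in "RI", so a pattern ending in "ry" such as station%ry cannot match either of them. There's not really an effective way for LIFTI to stem a wildcard search like this to its index counterpart.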

Your options in this case are:

  • Use a trailing wildcard to perform the search, e.g. station*. You could make it even more selective by requiring that at least one character appears after "station": station%*
  • Armed with the knowledge of how the words are stemmed, search with station%ri
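
To see why these patterns behave differently, here is a minimal sketch that checks each one against the two stemmed forms shown above. It does not use LIFTI at all: `WildcardMatches` is a hypothetical helper that translates the wildcard syntax into a regular expression, on the assumption that `%` stands for exactly one character and `*` for zero or more. LIFTI's real matching walks its index rather than using regexes, so treat this only as an illustration of the matching logic.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Stemmed forms as reported earlier in this thread.
var stemmedTokens = new[] { "STATIONARI", "STATIONERI" };

// Patterns discussed in this thread.
var patterns = new[] { "station%ry", "station*", "station%*", "station%ri" };

foreach (var pattern in patterns)
{
    var matches = stemmedTokens.Count(token => WildcardMatches(pattern, token));
    Console.WriteLine($"{pattern}: {matches} of {stemmedTokens.Length} stemmed forms match");
}

// Hypothetical helper (not part of LIFTI): converts a wildcard pattern to a regex.
// Assumes '%' = exactly one character and '*' = zero or more characters.
static bool WildcardMatches(string pattern, string token)
{
    // Regex.Escape leaves '%' untouched and turns '*' into '\*'.
    var regex = "^" + Regex.Escape(pattern).Replace("%", ".").Replace(@"\*", ".*") + "$";
    return Regex.IsMatch(token, regex, RegexOptions.IgnoreCase);
}
```

Under those assumptions, station%ry matches neither stemmed form because both end in "RI", while station*, station%* and station%ri each match both.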

@markwriter
Author

Got it - I appreciate the answer & suggestions.
