Query syntax: Support wildcard field searches/searching across all dynamic fields from a specific provider #77

mikegoatly · 2023-08-14T17:55:52Z

An extension of #76 - I've just realised that wildcard field names are going to be a bit problematic. When parsing text from a query, the QueryTokenizer needs to know which index tokenizer to use when processing the search text.

Consider this index:

var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(t => t.WithStemming()) // Stemming on all fields by default
    .WithObjectTokenization<Customer>(o => o
        .WithKey(c => c.Id)
        .WithField(
            "Name",
            c => c.Name,
            tokenizationOptions: fo => fo.WithTokenization(t => t)) // No stemming on the Name field
        .WithDynamicFields("Tags", c => c.TagDictionary, "Tag_")
    )
    .Build();

The default index tokenizer uses stemming, whereas the field Name has its own index tokenizer configured without stemming. If we allowed wildcard field names like [Na*]=Something, it would no longer be clear which tokenizer to use for the search text Something (especially if we ended up with another field starting with Na).

So I think as things stand, the options are:

1. Support wildcards, but duplicate the search parts for each matched field, e.g. [Tag_*]=foo would be equivalent to searching for [Tag_One]=foo | [Tag_Two]=foo | [Tag_Three]=foo
2. Support searching across all fields emitted by a named dynamic field provider using some other syntax, e.g. [?Tags]=foo (syntax TBD). A single dynamic field provider will only ever have one index tokenizer associated with it, so this should work.
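
To make the first option concrete, here is a minimal sketch of the rewrite it describes. The class and method names are hypothetical and are not part of LIFTI; the sketch assumes only a trailing `*` is supported, that field names are matched with ordinal (case-sensitive) comparison, and that the list of known field names is available when the query is parsed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a LIFTI API: rewrites a wildcard field query
// into an OR of concrete field queries, using the syntax from this issue.
public static class WildcardFieldExpansion
{
    public static string Expand(
        string fieldPattern,
        string searchText,
        IEnumerable<string> knownFields)
    {
        // Assumption: only a trailing '*' is treated as a wildcard.
        if (!fieldPattern.EndsWith("*", StringComparison.Ordinal))
        {
            return $"[{fieldPattern}]={searchText}";
        }

        string prefix = fieldPattern.Substring(0, fieldPattern.Length - 1);

        // One field-restricted query part per matching field.
        var parts = knownFields
            .Where(f => f.StartsWith(prefix, StringComparison.Ordinal))
            .Select(f => $"[{f}]={searchText}");

        // Returns an empty string when no field matches the pattern.
        return string.Join(" | ", parts);
    }
}
```

Calling `WildcardFieldExpansion.Expand("Tag_*", "foo", new[] { "Name", "Tag_One", "Tag_Two", "Tag_Three" })` yields `[Tag_One]=foo | [Tag_Two]=foo | [Tag_Three]=foo`. Because each part is field-restricted, each can be parsed with its own field's tokenizer, which is also why the same predicate may end up being evaluated several times.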

The first option would have a performance impact on the query, and we're probably going to need to build in some search optimisations to cache the search results emitted by a query to save the same search predicate being performed multiple times.

The second option is a bit more limited, but at least solves the issue across a specific dynamic field source.


h0lg · 2023-08-18T14:29:59Z

I understand that in your example it is unclear which tokenizer to apply to the search text if the index itself uses a different tokenizer than the field(s) being searched. I never thought about this configuration and don't have an answer.

But how does LIFTI decide which tokenizer to use for the search text when searching across all fields with different configured tokenizers? Isn't that a similar question? Or am I missing some important difference?

mikegoatly · 2023-08-28T10:43:41Z

@h0lg If no field is specified, then currently the default index tokenizer is used to parse and normalize the search text. It's only when a specific field is being searched that LIFTI uses the index tokenizer that was configured for that field.
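
The selection rule just described can be pictured roughly as follows. This is an illustrative sketch only: the names are hypothetical, tokenizers are represented by plain string identifiers, and a field without its own tokenizer is assumed to fall back to the default one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the rule described above; not LIFTI code.
public static class TokenizerSelection
{
    // fieldName is null when the query does not target a specific field.
    public static string SelectTokenizer(
        string fieldName,
        IReadOnlyDictionary<string, string> tokenizerIdByField,
        string defaultTokenizerId)
    {
        if (fieldName == null)
        {
            // No field specified: the default index tokenizer parses the text.
            return defaultTokenizerId;
        }

        // Field specified: use its own tokenizer if one was configured,
        // otherwise (assumption) fall back to the default.
        return tokenizerIdByField.TryGetValue(fieldName, out var id)
            ? id
            : defaultTokenizerId;
    }
}
```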

In that respect, you're right in that searching across all fields will be a problem if different tokenization has been used for them, and that's exactly the same as the problem that needs to be solved here.

I'd need to spend a bit more time thinking about this than I have right now, but I'm wondering if when searching for text across multiple fields:

1. All affected fields are collected (all fields, or a subset when a wildcarded field name is specified).
2. Each unique tokenizer is used to parse the search text.
3. The distinct search terms yielded from the tokenizers are combined with a field filter operator with the appropriate field ids. (A search term in this context could be any number of tokens if a bracketed statement is encountered.)

Edge cases to consider:

When searching across all fields, if all tokenizers are the same or all unique tokenizers produce the same search terms, then no field filters need to be applied.
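
The steps and the edge case above could be sketched roughly like this. Everything here is a hypothetical stand-in rather than LIFTI's query parser: `SketchTokenizer`, `FieldScopedTerm` and `MultiFieldQueryPlanner` are invented for illustration, a tokenizer is reduced to a function that normalizes the whole search text into a single term, and tokenizers count as "the same" only when they are the same object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an index tokenizer; not a LIFTI type.
public sealed class SketchTokenizer
{
    public SketchTokenizer(string name, Func<string, string> normalize)
    {
        Name = name;
        Normalize = normalize;
    }

    public string Name { get; }
    public Func<string, string> Normalize { get; }
}

// A normalized search term plus the fields it should be restricted to.
// An empty Fields list means "no field filter needed".
public sealed class FieldScopedTerm
{
    public FieldScopedTerm(string term, IReadOnlyList<string> fields)
    {
        Term = term;
        Fields = fields;
    }

    public string Term { get; }
    public IReadOnlyList<string> Fields { get; }
}

public static class MultiFieldQueryPlanner
{
    public static IReadOnlyList<FieldScopedTerm> Plan(
        string searchText,
        IReadOnlyDictionary<string, SketchTokenizer> tokenizerByField,
        Func<string, bool> fieldMatches)
    {
        // Step 1: collect the affected fields.
        var affected = tokenizerByField.Where(kv => fieldMatches(kv.Key)).ToList();

        // Step 2: run each unique tokenizer (by reference) exactly once.
        var perTokenizer = affected
            .GroupBy(kv => kv.Value)
            .Select(g => new
            {
                Term = g.Key.Normalize(searchText),
                Fields = g.Select(kv => kv.Key).ToList()
            })
            .ToList();

        // Step 3: merge tokenizers that produced the same term and attach
        // the fields each distinct term applies to.
        var plan = perTokenizer
            .GroupBy(x => x.Term, StringComparer.Ordinal)
            .OrderBy(g => g.Key, StringComparer.Ordinal)
            .Select(g => new FieldScopedTerm(
                g.Key,
                g.SelectMany(x => x.Fields)
                    .OrderBy(f => f, StringComparer.Ordinal)
                    .ToList()))
            .ToList();

        // Edge case: all fields searched and only one distinct term
        // produced, so the field filter can be dropped entirely.
        bool searchingAllFields = affected.Count == tokenizerByField.Count;
        if (searchingAllFields && plan.Count == 1)
        {
            return new List<FieldScopedTerm>
            {
                new FieldScopedTerm(plan[0].Term, new List<string>())
            };
        }

        return plan;
    }
}
```

For example, with a toy "stemming" tokenizer (`s => s.ToLowerInvariant().TrimEnd('s')`) on `Tag_One` and `Tag_Two` and a plain lower-casing tokenizer on `Name`, planning the text `Cats` across all fields gives two entries: `cat` restricted to `Tag_One` and `Tag_Two`, and `cats` restricted to `Name`. Planning `cat` instead gives a single unrestricted entry, because both tokenizers agree on the term.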

I think this will require quite a bit of rework in the query parser logic, but it's certainly not impossible...

h0lg · 2023-08-28T19:43:54Z

I see, thanks for the clarification and sharing your thoughts.

Explaining the intricacies of the tokenization during the field search process and what happens in which case seems daunting to me. Maybe we're overcomplicating it? You could go with some rule that's easy to communicate and doesn't require you to explain the underlying mechanics, even if it has limitations, e.g.

If you search the same term/query across multiple fields (using wildcards or pipes or whatever), you can only do so if they share the same tokenizer. Otherwise you have to write separate field queries.

Would that make things easier?
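
A rule like that could be enforced with a simple check when the query is parsed. The following is only a sketch under assumptions: the class name is hypothetical, tokenizers are identified by plain strings, and every matched field is assumed to have an entry in the lookup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical validation of the "same tokenizer" rule; not LIFTI code.
public static class SharedTokenizerRule
{
    // Returns the single tokenizer id shared by all matched fields,
    // or throws when the fields disagree.
    public static string ResolveSharedTokenizer(
        IEnumerable<string> matchedFields,
        IReadOnlyDictionary<string, string> tokenizerIdByField)
    {
        var ids = matchedFields
            .Select(f => tokenizerIdByField[f])
            .Distinct()
            .ToList();

        if (ids.Count == 0)
        {
            throw new ArgumentException("No fields matched the query.");
        }

        if (ids.Count > 1)
        {
            throw new InvalidOperationException(
                "The matched fields use different tokenizers; " +
                "write a separate query for each field instead.");
        }

        return ids[0];
    }
}
```

The trade-off is that queries spanning differently tokenized fields are rejected rather than silently rewritten, which keeps the behaviour easy to document.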

mikegoatly mentioned this issue Aug 14, 2023

Query syntax: Add support for spaces in field names #76

Closed
