Morphology only for certain fields #7

Closed
barryhunter opened this Issue Sep 19, 2017 · 14 comments

@barryhunter
Contributor

barryhunter commented Sep 19, 2017

It would be nice to be able to apply morphology only to certain fields (or to exclude it from certain fields!) during indexing.

I know it would mean that different tokens exist in the index for different fields (so if a query keyword is morphed, it won't match the unmorphed field), but this could be mitigated by using expand_keywords on the index (then the excluded fields would only match via the 'exact keywords', not morphed at all).

Basically to avoid having to do

@!(place) Huntly | @place =Huntly

@airolg

airolg commented Sep 25, 2017

Thanks for the issue; I've added it to Manticore's backlog.

@airolg airolg self-assigned this Sep 27, 2017

@tomatolog

Contributor

tomatolog commented Nov 12, 2017

I want to clarify: if a new option disables morphology at indexing time, morphology will still be applied to the query, so there is a chance that the query will have no matches, e.g.

source s1 {
sql_query = SELECT 1, 'runs' no_m, 'tests' m UNION SELECT 2, 'tests' no_m, 'runs' m
}

index i1 {
source = s1
morph_disabled_fields = no_m
}

and this query will match only the 2nd doc

SELECT id FROM i1 WHERE MATCH ( 'runs' ); SHOW META;

because runs in the query gets transformed by morphology to 'run', and only the 2nd doc has that token, in field m.

To match all docs you would have to write the query like

SELECT id FROM i1 WHERE MATCH ( 'runs' ) option expand_keywords=1;
SELECT id FROM i1 WHERE MATCH ( 'runs | =runs' );
SELECT id FROM i1 WHERE MATCH ( '=runs' );

Could it be a source of additional errors?

Also, I'm not quite sure how this feature eases this example:

> Basically to avoid having to do
> @!(place) Huntly | @place =Huntly
@barryhunter

Contributor

barryhunter commented Nov 13, 2017

Yes, as mentioned, I would probably use it with expand_keywords=1 so it could match all columns.

       MATCH('Huntly') option expand_keywords=1;

Would work instead of having to do

       MATCH('@!(place) Huntly | @place =Huntly');

That is for the case where, on the place field only, I don't want non-exact matches. It's fairly trivial in this case, but for a query like MATCH('Huntly Church'), having to transform it into MATCH('(@!(place) Huntly | @place =Huntly) (@!(place) Church | @place =Church)') quickly gets unwieldy, especially if you add in phrase, OR, or other field-match operators.
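To show how mechanical (and how easy to get wrong) that rewrite is, here is a minimal Python sketch of it. The helper name per_field_exact is hypothetical, and it assumes the query is nothing but bare whitespace-separated keywords; it makes no attempt to handle phrases, OR, or existing field operators, which is exactly where it gets unwieldy.

```python
def per_field_exact(query: str, exact_field: str = "place") -> str:
    """Rewrite bare keywords so that `exact_field` only matches the exact form.

    Hypothetical helper: assumes `query` contains plain keywords only
    (no phrases, no operators, no field limits).
    """
    groups = [
        # every other field gets the normal (morphed) keyword,
        # the exact field gets the =keyword form
        f"(@!({exact_field}) {word} | @{exact_field} ={word})"
        for word in query.split()
    ]
    return " ".join(groups)


print(per_field_exact("Huntly"))
print(per_field_exact("Huntly Church"))
```

With the feature requested here, the application could send the plain keywords and rely on expand_keywords instead of building these strings.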

> Could it be source of additional errors?

I don't think so, or at least they wouldn't be unknown ones. Using expand_keywords (or doing it manually with 'runs | =runs') wouldn't add any false positives. At worst it would just change the ranking positions.

@tomatolog

Contributor

tomatolog commented Nov 17, 2017

Also, one thing is not clear. With morphology enabled, each term produces multiple entries in the dictionary; for example, runs gets transformed to

run
=runs

and these terms get stored in the dictionary.

However, if we just skip the morphology step, the source term runs gets stored in the dictionary as is. Then later there is no way to match that term in the dictionary:

  • query runs is morphed to run and there are no matches
  • query =runs skips morphology but also has no matches
  • query runs + expand_keywords=1 is transformed to run | =runs and also has no matches

It seems your request is to skip the morphed tokens at indexing time and store only the exact-form tokens. Am I right?

@tomatolog

Contributor

tomatolog commented Nov 23, 2017

Here are the variants collected so far for how to set the new option in the index section:

morphology_fields = enabled \ disabled: fields list
morphology_skip_fields = fields list
morphology = lemmatize_ru_all, lemmatize_en_all : fields list
morphology_fields = field, !field, !(fields list)

For example (only one of these options will actually be used):

morphology_fields = enabled: fi1, fi12
morphology_fields = disabled: tags
morphology_skip_fields = fi1, fi12
morphology = lemmatize_ru_all, lemmatize_en_all : fi1, fi12
morphology_fields = fi1, !tags, !(fi22, fi13)

I want to know your opinion on the option naming.

I also want to note that the feature as implemented works like this:
fields with disabled morphology collect only exact-form tokens (=token)

i.e. for a field with disabled morphology and the content runs forest as running, these tokens get stored: =runs =forest =as =running

That is why these queries match nothing

run
runs
running

but these queries match the document

=runs
=running

as does any query run with the expand_keywords option, or against an index with the expand_keywords option set.
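The behaviour described above can be modelled with a small Python toy. This is only a sketch under stated assumptions, not the actual implementation: toy_stem is a hypothetical stand-in for a real stemmer, and expand_keywords is simplified to the "run | =runs" expansion mentioned earlier in this thread.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer; strips two suffixes only."""
    for suffix in ("ning", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def index_tokens(text: str, skip_morphology: bool) -> set[str]:
    """Tokens stored for one field; skipped fields keep only =exact forms."""
    tokens = set()
    for word in text.lower().split():
        tokens.add("=" + word)
        if not skip_morphology:
            tokens.add(toy_stem(word))
    return tokens


def query_terms(keyword: str, expand_keywords: bool = False) -> set[str]:
    """Terms looked up for one query keyword (simplified expansion)."""
    keyword = keyword.lower()
    if keyword.startswith("="):
        return {keyword}  # exact-form operator bypasses morphology
    terms = {toy_stem(keyword)}
    if expand_keywords:
        terms.add("=" + keyword)
    return terms


def matches(keyword: str, tokens: set[str], expand_keywords: bool = False) -> bool:
    return bool(query_terms(keyword, expand_keywords) & tokens)


field = index_tokens("runs forest as running", skip_morphology=True)
for kw in ("run", "runs", "running", "=runs", "=running"):
    print(kw, matches(kw, field))
print("runs + expand_keywords", matches("runs", field, expand_keywords=True))
```

In this model the plain keywords miss the skipped field because the query side is still morphed, while the =keyword forms and the expanded query hit the stored exact tokens.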

@barryhunter

Contributor

barryhunter commented Nov 23, 2017

> Want to know your opinion on option naming.

I was kinda thinking of the very simple morphology_skip_fields: just list any fields you want to omit, on the basis that it's probably just a few (like person name, place name etc.) out of many. But I guess that would make it cumbersome if there are only a few fields where you do want morphology.

 morphology_fields = !tags, !fi22

seems simplest, no grouping.

But with the rule that if no positive (+) fields are listed, an automatic 'all' is implied at the start. So

 morphology_fields = fi1, fi12

would enable it only for those two specific fields (no others). When + and ! entries are used together, the ! fields would have no effect.
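As a rough illustration of that rule, here is a Python sketch of how such a morphology_fields value could be resolved into the set of fields that get morphology. The function name and the field names in the usage lines are hypothetical, and the !(a, b) grouping form is deliberately not handled.

```python
def morphology_enabled(spec: str, all_fields: list[str]) -> set[str]:
    """Resolve a proposed morphology_fields value to the morphed fields.

    Sketch of the rule proposed in this comment only; grouping such as
    !(a, b) is not supported.
    """
    entries = [e.strip() for e in spec.split(",") if e.strip()]
    included = {e for e in entries if not e.startswith("!")}
    excluded = {e[1:].strip() for e in entries if e.startswith("!")}
    if included:
        # positive entries present: only those fields are enabled,
        # so any "!" entries have no effect
        return included & set(all_fields)
    # no positive entries: implicit "all", minus the excluded fields
    return set(all_fields) - excluded


fields = ["title", "tags", "fi1", "fi12", "fi22"]
print(sorted(morphology_enabled("!tags, !fi22", fields)))
print(sorted(morphology_enabled("fi1, fi12", fields)))
print(sorted(morphology_enabled("fi1, !tags", fields)))
```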

> fields with disabled morphology collects only exact form tokens (=token)

Yes, I think that is what I was originally thinking.

(It may not matter: if the 'unmorphed', non-exact token were written to the index, it would never match, since the keyword in the query would be morphed. But it does seem best to just omit it completely.)

If you don't want to rewrite the query, then expand_keywords would allow query words to match morphed or unmorphed fields equally.

@klirichek

Contributor

klirichek commented Nov 25, 2017

@barryhunter, what do you think about extending the existing morphology clause?
morphology = lemmatize_ru_all, lemmatize_en_all : fi1, fi12
IMHO that is the simplest way, since it doesn't need any new option/keyword added to our huge set of existing rules.
It also provides the same simple syntax if somebody someday wants different morphology settings for different fields (then we just add another such line with another field set, and one more, and more). Say,
morphology = lemmatize_en_all : english_title
morphology = lemmatize_ru_all : rus_translated, rus_subject
morphology = lemmatize_de_all
that is, set the English lemmatizer for the english_title field, the Russian one for rus_translated and rus_subject, and the German one for all the rest (unspecified).
Or,
morphology = none : hashtags
morphology = stemmer_en, stemmer_de
that is, set English and German stemming, but exclude hashtags from it.

Of course, for now we can't assign different morphology settings to different fields this way, but I mean that the existing 'morphology' clause in the config can be easily extended like this when necessary (the second example is exactly the case of switching off morphology on certain fields).

@barryhunter

Contributor

barryhunter commented Dec 1, 2017

Only just seen this. Frankly, I'm not too bothered by the exact syntax. Even if it's cumbersome, the config isn't updated very often.

But on the idea of being able to specify different morphology for different fields, I'm not sure that would work. The morphology has to be applied to the query too, so with different morphology per field, keywords would match different fields incorrectly.

... but otherwise yes, I would be happy with extending the morphology variable. Perhaps it would be clearer if it was done this way round...

morphology = stemmer_en, stemmer_de
morphology = none : hashtags

... i.e. set it for all, then override for 'some' fields (so the morphology line with no fields specified affects everything, rather than just the 'not specified' ones).

@tomatolog

Contributor

tomatolog commented Dec 5, 2017

The team finally decided to add the simplest option, morphology_skip_fields = fi1, fi12, to suppress morphology on certain fields; later we might add some extension to that syntax.

@tomatolog

Contributor

tomatolog commented Dec 6, 2017

Just pushed the morphology_skip_fields feature to master at e0f8754.
Now, in the index section, you can use the morphology_skip_fields option to enumerate all fields where morphology will be disabled on indexing, i.e. only exact terms will be stored for such fields.

The option is not yet stored in the index and has no effect on RT indexes. These will be addressed soon.

@barryhunter

Contributor

barryhunter commented Dec 7, 2017

Just compiled 2.5.2 f8d9b03@171207 id64-dev - this seems to work.

... I only tested an indexer-built index, and true, the setting doesn't show in show index ... settings etc. But actual queries against the index seem to work.

sphinxQL>select COUNT(*) from agents where match ('@a2 Broker') OPTION expand_keywords=1;
|        7 |

sphinxQL>select COUNT(*) from agents where match ('@a2 Brokers') OPTION expand_keywords=1;
|       83 |

The index still has morphology on other fields (including the original agent_name field)

sphinxQL>select COUNT(*) from agents where match ('@agent_name Brokers') OPTION expand_keywords=1;
|       90 |

sphinxQL>select COUNT(*) from agents where match ('@agent_name Broker') OPTION expand_keywords=1;
|       90 |

(yes, this co-opts the new expand_keywords OPTION, which makes it much easier to enable without changing the whole index)


sphinxQL>show index agents settings;
index_exact_words = 1
charset_type = utf-8
dict = keywords
morphology = stem_en
@tomatolog

Contributor

tomatolog commented Dec 7, 2017

This option is not stored in the index yet, as that requires the RT index to catch up with the plain index version. I'll fix that later, along with show index settings.

@barryhunter

Contributor

barryhunter commented Dec 7, 2017

Ah yes, I was just confirming those limitations. The index settings dump was just to confirm that morphology was clearly enabled on the index.

Anyway, looks good so far. Thank you for continuing with this :)

@tomatolog

Contributor

tomatolog commented Jan 9, 2018

I've just added the feature to RT indexes at c5eb06e

@tomatolog tomatolog closed this Jan 9, 2018
