Percolate query doesn't support cyrillic tokenization #194

Closed
Legless opened this issue May 17, 2019 · 2 comments
Labels
bug

@Legless Legless commented May 17, 2019

Hello!

Using the Manticore 3.0.0 95141ec@190509 release, I hit this issue: when using CALL PQ with Cyrillic (Russian) text, the search query is not stemmed.

I have the following PQ index:

            
| id   | query            | tags | filters |
|  813 | test             |      |         |
|  814 | проверка         |      |         |
            

When I call PQ with Latin text, everything works as expected, and a different word form also matches:

            
CALL PQ ('pq_test', 'testing', 0 AS docs_json );
+------+
| id   |
+------+
|  813 |
+------+
            

But when I search with Cyrillic text, it is found only on an exact match:

            
CALL PQ ('pq_test', 'проверка', 0 AS docs_json );
+------+
| id   |
+------+
|  814 |
+------+
            

And when I search with another word form, it returns an empty set:

            
CALL PQ ('pq_test', 'проверки', 0 AS docs_json );
Empty set (0.00 sec)
            

In my config I have a base index:

            
index __base {
    min_stemming_len    = 2
    min_word_len        = 1
    index_exact_words   = 1
    min_infix_len       = 2
    expand_keywords     = 1

    docinfo         = extern
    dict            = keywords
    mlock           = 0
    morphology      = stem_ru

    blend_chars = -
    ignore_chars = U+0021, U+0023..U+002C, U+002F, U+003A..U+0040, U+0060, U+2019, U+00AB, U+00BB, U+0027
    charset_table = 0..9, _, A..Z->a..z, a..z,  \
            U+410..U+418->U+430..U+438,  \
            U+41A..U+42A->U+43A..U+44A,  \
            U+42C->U+44C,  \
            U+42E..U+42F->U+44E..U+44F,  \
            U+404->U+454,  \
            U+406->U+438,  \
            U+456->U+438,  \
            U+407->U+438,  \
            U+457->U+438,  \
            U+439->U+438,  \
            U+419->U+438,  \
            U+42B->U+438,  \
            U+44B->U+438,  \
            U+490->U+433,  \
            U+491->U+433,  \
            U+401->U+435,  \
            U+451->U+435,  \
            U+42D->U+435,  \
            U+44D->U+435,  \
            U+404->U+435,  \
            U+454->U+435,  \
            U+430..U+438,  \
            U+43A..U+44A,  \
            U+44C,         \
            U+44E..U+44F
}
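
One way to rule out the charset_table as the cause is to check how it folds the two word forms. The following is a rough Python sketch, not Manticore's own parser: it assumes only the rule forms used in this config (single characters, `..` ranges, and `->` mappings), treats characters missing from the table as separators, and uses a subset of the rules above with the conflicting U+404 entries left out. The helper names `code`, `span`, `build_table`, and `fold` are made up for illustration.

```python
# Subset of the charset_table from the config above (assumption: conflicting
# U+404 rules and the Ukrainian-letter rules are omitted for brevity).
CHARSET_TABLE = """
0..9, _, A..Z->a..z, a..z,
U+410..U+418->U+430..U+438,
U+41A..U+42A->U+43A..U+44A,
U+42C->U+44C,
U+42E..U+42F->U+44E..U+44F,
U+419->U+438, U+439->U+438,
U+42B->U+438, U+44B->U+438,
U+401->U+435, U+451->U+435,
U+42D->U+435, U+44D->U+435,
U+430..U+438, U+43A..U+44A, U+44C, U+44E..U+44F
"""


def code(token):
    """Turn 'U+41F' or a literal character such as 'A' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    return ord(token)


def span(part):
    """Return (low, high) code points for 'A..Z' or a single character."""
    ends = [code(p) for p in part.split("..")]
    return ends[0], ends[-1]


def build_table(spec):
    """Build a {source code point: folded code point} mapping from the rules."""
    table = {}
    for rule in spec.replace("\n", " ").split(","):
        rule = rule.strip()
        if not rule:
            continue
        src, _, dst = rule.partition("->")
        s_lo, s_hi = span(src)
        d_lo = span(dst)[0] if dst else s_lo  # no '->' means identity mapping
        for offset in range(s_hi - s_lo + 1):
            table[s_lo + offset] = d_lo + offset
    return table


def fold(text, table):
    """Apply the table; characters not listed become separators (spaces)."""
    return "".join(chr(table[ord(c)]) if ord(c) in table else " " for c in text)


table = build_table(CHARSET_TABLE)
for word in ("Проверка", "проверки", "TEST"):
    print(word, "->", fold(word, table))
```

Under these assumptions both 'проверка' and 'проверки' pass through the folding unchanged, so the charset_table alone does not explain the miss; the difference would have to be removed by the stem_ru step.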
            

And the PQ index, which extends it:

            
index pq_test : __base {
    path            = /var/lib/manticore/data/pq_test
    type            = percolate
    rt_field = Content
}
            

I tried to debug this using CALL KEYWORDS, but it fails:

            
call keywords( 'слово', 'pq_test' );
ERROR 1064 (42000): not implemented
            

Any ideas?

Contributor

@tomatolog tomatolog commented May 21, 2019

To use CALL KEYWORDS on the index, you could declare your base index as a template:

index base1
{
	type = template
...
index pq_test : base1
{  
    type            = percolate  

I tested your case and see an issue here that I'm going to investigate. I'll let you know when it is fixed.

Contributor

@tomatolog tomatolog commented Aug 26, 2019

I've just fixed this in commit 6b8c424: the percolate index now handles the stemmers / morphology option.

@tomatolog tomatolog closed this Aug 26, 2019
@tomatolog tomatolog removed the in backlog label Aug 26, 2019