Requiring 3 characters to perform a search works poorly for logographic corpora #73

DenialAdams · 2020-10-15T04:41:37Z

Code

Relevant TypeScript:

Line 134 in db1b958

if (query.length >= 3) {

But I don't know if this affects indexing as well, or if it's strictly a search interface issue.

Details

I have a corpus of documents which are mixed Chinese/English. Searching for the English "cat" works well. However, trying to search for the Chinese character 猫 (cat) is not fruitful, because the search will not trigger unless I input at least three characters.

Given that the ratio of semantics to character count varies across languages, I think that this can lead to a frustrating user experience.
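To make the mismatch concrete, here is a minimal sketch of a length gate like the one quoted above, written in Rust rather than TypeScript. `meets_minimum_length` is a hypothetical helper for illustration, not a function from Stork; it counts Unicode scalar values, which is an assumption about what "characters" should mean here.

```rust
/// Hypothetical helper (not part of Stork): returns true when `query`
/// has at least `min_chars` Unicode scalar values.
fn meets_minimum_length(query: &str, min_chars: usize) -> bool {
    // `chars().count()` counts scalar values, not UTF-8 bytes.
    query.chars().count() >= min_chars
}

/// Hypothetical helper: the UTF-8 byte length and the character count
/// of a query, to show how far apart the two can be.
fn byte_and_char_lengths(query: &str) -> (usize, usize) {
    (query.len(), query.chars().count())
}
```

With a minimum of 3, `meets_minimum_length("cat", 3)` passes while `meets_minimum_length("猫", 3)` does not, even though both queries express the same word. `byte_and_char_lengths("猫")` also shows that the one character occupies three UTF-8 bytes, so the result of a length check depends on which unit it counts.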

As an aside, thanks a lot for making this, I'm super excited to use this!

jameslittle230 · 2020-10-15T06:46:24Z

Hey @DenialAdams! Thanks for writing in, hope Stork is working well for you so far despite this!

I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble searching for any query less than three characters: the index file size itself gets pretty big and the search algorithm can get kind of slow.

I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
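As a rough illustration of what such an option could control, the sketch below generates the prefixes of a word that are at least `min_chars` characters long. It assumes the index stores prefixes of each word, which this thread does not spell out; `prefixes` and `min_chars` are hypothetical names, not Stork's API.

```rust
/// Hypothetical sketch (not Stork's implementation): every prefix of `word`
/// that is at least `min_chars` characters long, shortest first.
fn prefixes(word: &str, min_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    // Treat a minimum of 0 as 1 so the empty prefix is never produced.
    // If the word is shorter than the minimum, the range is empty.
    (min_chars.max(1)..=chars.len())
        .map(|end| chars[..end].iter().collect())
        .collect()
}
```

Under these assumptions, `prefixes("cat", 3)` yields only `"cat"`, while `prefixes("cat", 1)` yields `"c"`, `"ca"`, and `"cat"`. Lowering the minimum adds entries for every word, which is consistent with the index-size concern above, and `prefixes("猫", 3)` yields nothing at all.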

As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

Thanks again,
James

DenialAdams · 2020-10-15T12:42:13Z

> I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement) I do get some results with a one-character search. Well... I get exactly one result. I suspect this is because that's the only time this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:

stork/src/index_versions/v3/builder/word_list_generators/mod.rs

Lines 47 to 63 in db1b958

    
    impl WordListGenerator for PlainTextWordListGenerator {
        fn create_word_list(
            &self,
            _config: &InputConfig,
            buffer: &str,
        ) -> Result<Contents, WordListGenerationError> {
            Ok(Contents {
                word_list: buffer
                    .split_whitespace()
                    .map(|word| AnnotatedWord {
                        word: word.to_string(),
                        ..Default::default()
                    })
                    .collect(),
            })
        }
    }

Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
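The effect of `split_whitespace()` on unsegmented text can be seen in isolation. The sketch below is a standalone illustration, not Stork code: `whitespace_words` is a hypothetical wrapper, and the sample sentences are made up for the example.

```rust
/// Hypothetical wrapper (not part of Stork) around the same
/// `split_whitespace()` call used by the word list generator above.
fn whitespace_words(buffer: &str) -> Vec<&str> {
    buffer.split_whitespace().collect()
}

/// Hypothetical helper: true when `needle` appears as its own
/// whitespace-delimited token in `buffer`.
fn appears_as_word(buffer: &str, needle: &str) -> bool {
    whitespace_words(buffer).iter().any(|word| *word == needle)
}
```

An English sentence such as `"the cat sat"` splits into three words, but a Chinese sentence such as `"我的猫很可爱"` comes back as a single token, so `appears_as_word("我的猫很可爱", "猫")` is false. The character only becomes its own word when it happens to be surrounded by whitespace, as in `"我有一只 猫 在家"`, which matches the single result seen from the CLI.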

> Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)

Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?
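A minimal sketch of that idea follows. It is not the code from the prototype or from Stork; `is_cjk_ideograph` and `should_index` are hypothetical names, and the list of ranges is a deliberately small assumption covering only a few CJK ideograph blocks.

```rust
/// Hypothetical check (not Stork's code): true if `c` falls in one of a
/// few CJK ideograph blocks. The list is intentionally incomplete.
fn is_cjk_ideograph(c: char) -> bool {
    matches!(
        c as u32,
        0x3400..=0x4DBF         // CJK Unified Ideographs Extension A
            | 0x4E00..=0x9FFF   // CJK Unified Ideographs
            | 0xF900..=0xFAFF   // CJK Compatibility Ideographs
            | 0x20000..=0x2A6DF // CJK Unified Ideographs Extension B
    )
}

/// Hypothetical rule: index a substring if it meets the default minimum
/// length, or if it is shorter but consists only of CJK ideographs.
fn should_index(substring: &str, default_min: usize) -> bool {
    let len = substring.chars().count();
    len >= default_min || (len > 0 && substring.chars().all(is_cjk_ideograph))
}
```

With a default minimum of 3, `should_index("猫", 3)` is true while `should_index("ca", 3)` stays false, so short Latin substrings would not add to the index. Kana and Hangul are outside the listed ranges, so a rule like this would need more blocks to cover Japanese or Korean text.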

> As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts

DenialAdams · 2020-10-15T14:25:31Z

> Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

I did some profiling and was able to easily cut 3 hours down to 1 min (PR incoming), so this is much less of a concern for me now :)

(edit: #74)

DenialAdams · 2020-10-15T16:34:39Z

> A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?

Here's a prototype for this approach:
DenialAdams@fa2b24d

Let me know what you think of it. So far it's really improved the search results, but the searches do take a little longer now (worth it to me).

DenialAdams mentioned this issue Oct 21, 2020: Index small substrings if they seem ideographic #80 (Closed)

jameslittle230 mentioned this issue Nov 3, 2020: Configure minimum prefix/search string length with ideograph consideration #83 (Merged)

jameslittle230 closed this as completed in #83 Nov 6, 2020
