Requiring 3 characters to perform a search works poorly for logographic corpora #73

Closed
DenialAdams opened this issue Oct 15, 2020 · 4 comments · Fixed by #83

@DenialAdams
Contributor

DenialAdams commented Oct 15, 2020

Code

Relevant TypeScript:

if (query.length >= 3) {

But I don't know whether this affects indexing as well, or whether it's strictly a search interface issue.

Details

I have a corpus of documents which are mixed Chinese/English. Searching for the English "cat" works well. However, searching for the Chinese character 猫 (cat) is not fruitful, because the search will not trigger unless I input at least three characters.

Given that the ratio of semantics to character count varies across languages, I think that this can lead to a frustrating user experience.


As an aside, thanks a lot for making this, I'm super excited to use this!

@jameslittle230
Owner

Hey @DenialAdams! Thanks for writing in, hope Stork is working well for you so far despite this!

I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble searching for any query shorter than three characters: the index file itself gets pretty big and the search algorithm can get kind of slow.

I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
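To give a rough feel for why lowering the minimum length grows the index, here is a minimal sketch. It assumes, purely for illustration, that every distinct substring of a word becomes an index key; `substrings_at_least` and its `min_chars` parameter are hypothetical names, not Stork's actual indexing code or configuration.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of Stork): collects every distinct substring
/// of `word` that is at least `min_chars` characters long.
/// Lengths are counted in Unicode scalar values (`char`s), not bytes.
fn substrings_at_least(word: &str, min_chars: usize) -> BTreeSet<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out: BTreeSet<String> = BTreeSet::new();
    for start in 0..chars.len() {
        for end in (start + 1)..=chars.len() {
            // Keep the substring only if it meets the minimum length.
            if end - start >= min_chars {
                out.insert(chars[start..end].iter().collect::<String>());
            }
        }
    }
    out
}
```

Under that assumption, the word `search` yields 10 keys with a minimum of three characters and 21 keys with a minimum of one, while a one-character word such as `猫` yields no keys at all unless the minimum is lowered to one.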

As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

Thanks again,
James

@DenialAdams
Contributor Author

> I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement) I do get some results with a one character search. Well... I get exactly one result. I suspect this is because that's the only time this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:

impl WordListGenerator for PlainTextWordListGenerator {
    fn create_word_list(
        &self,
        _config: &InputConfig,
        buffer: &str,
    ) -> Result<Contents, WordListGenerationError> {
        Ok(Contents {
            word_list: buffer
                .split_whitespace()
                .map(|word| AnnotatedWord {
                    word: word.to_string(),
                    ..Default::default()
                })
                .collect(),
        })
    }
}

Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
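A small standalone sketch of the whitespace behavior, using only the standard library's `str::split_whitespace`. The helper name `whitespace_words` is made up for this example and only mirrors the splitting step of the generator quoted above; how Stork uses the resulting word list afterwards is not shown here.

```rust
/// Hypothetical helper: splits `buffer` on Unicode whitespace only,
/// the same splitting step used by the generator quoted above.
fn whitespace_words(buffer: &str) -> Vec<String> {
    buffer.split_whitespace().map(str::to_string).collect()
}

/// Counts how many whitespace-delimited tokens are exactly equal to `needle`.
fn standalone_occurrences(buffer: &str, needle: &str) -> usize {
    whitespace_words(buffer)
        .iter()
        .filter(|word| word.as_str() == needle)
        .count()
}
```

With this, `whitespace_words("I have a 猫 today")` produces five tokens, one of which is the standalone `猫`, while `whitespace_words("我的猫在睡觉")` produces a single six-character token. That would be consistent with a one-character query only matching where the character happens to be surrounded by whitespace.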

> Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)

Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?
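Roughly, the idea could look like the sketch below. The function names are hypothetical, the block list in `is_cjk` is an illustrative subset rather than every CJK-related Unicode block, and this is not the code from any Stork commit.

```rust
/// Rough CJK check covering a few common ideograph blocks (illustrative subset).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF         // CJK Unified Ideographs
            | 0x3400..=0x4DBF   // CJK Unified Ideographs Extension A
            | 0x20000..=0x2A6DF // CJK Unified Ideographs Extension B
            | 0xF900..=0xFAFF   // CJK Compatibility Ideographs
    )
}

/// Hypothetical helper: returns the 1- and 2-character substrings of `word`,
/// keeping a substring only when every character in it is CJK.
fn short_cjk_substrings(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    for len in 1..=2 {
        // `windows` yields nothing when `len` exceeds the number of characters.
        for window in chars.windows(len) {
            if window.iter().all(|&c| is_cjk(c)) {
                out.push(window.iter().collect::<String>());
            }
        }
    }
    out
}
```

For an English word like `cat` this adds nothing, so the extra index entries are limited to text where short substrings are likely to carry meaning. For `我的猫` it adds `我`, `的`, `猫`, `我的`, and `的猫`.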

> As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts

@DenialAdams
Contributor Author

DenialAdams commented Oct 15, 2020

> Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

I did some profiling and was able to easily cut 3 hours down to 1 min (PR incoming), so this is much less of a concern for me now :)

(edit: #74)

@DenialAdams
Contributor Author

> A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?

Here's a prototype for this approach:
DenialAdams@fa2b24d

Let me know what you think of it. So far it's really improved the search results, but the searches do take a little longer now (worth it to me).
