Requiring 3 characters to perform a search works poorly for logographic corpora #73
Comments
Hey @DenialAdams! Thanks for writing in, and I hope Stork is working well for you so far despite this! I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble handling queries shorter than three characters: the index file itself gets pretty big and the search algorithm can get kind of slow.

I think (off the top of my head, not looking at the code) that fixing this would involve regenerating the index so that Stork indexes one- and two-character substrings. Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (To set expectations, I probably wouldn't be able to get to this for a few weeks.)

As an aside, I haven't tested this at all with Chinese-language text, so I'd love to hear about any other unexpected behavior that you encounter! Thanks again.
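To make the proposal concrete, here is a minimal sketch (not Stork's actual code) of what a configurable minimum substring length could look like: generate each word's prefixes down to a `min_chars` setting, so one- and two-character queries can hit the index. The function name and the default of 3 are assumptions for illustration.

```rust
// Hypothetical sketch, not Stork's implementation: emit the prefixes of a
// word down to a configurable minimum length. A smaller minimum means more
// index entries (hence the file-size and speed concern mentioned above).
fn prefixes(word: &str, min_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    (min_chars..=chars.len())
        .map(|len| chars[..len].iter().collect())
        .collect()
}

fn main() {
    // With a minimum of 3, a single ideograph produces no entries at all...
    assert!(prefixes("猫", 3).is_empty());
    // ...but with a minimum of 1, even one character is indexed.
    assert_eq!(prefixes("猫", 1), vec!["猫"]);
    assert_eq!(prefixes("cat", 1), vec!["c", "ca", "cat"]);
}
```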
Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement), I do get some results with a one-character search. Well... I get exactly one result. I suspect this is because that's the only place this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:

stork/src/index_versions/v3/builder/word_list_generators/mod.rs, lines 47 to 63 (at db1b958)
Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
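The behavior described above can be illustrated with a small sketch (an assumption about what the linked word-list code does, not a copy of it): if words are produced by splitting on whitespace, a CJK character only becomes a standalone index entry when whitespace happens to surround it.

```rust
// Illustrative whitespace-based tokenization (assumed, not Stork's code).
fn words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // English text splits into individually searchable words...
    assert_eq!(words("the cat sat"), vec!["the", "cat", "sat"]);
    // ...but an unsegmented Chinese sentence stays one opaque "word",
    // so 猫 never appears as a standalone entry in the word list.
    assert_eq!(words("我有一只猫"), vec!["我有一只猫"]);
}
```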
Hmm. This might work well for a smaller corpus, but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

A thought here: could we examine 1- and 2-character substrings but only index them if the characters fall into the ranges specified by the CJK Unicode blocks?
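That filtering idea can be sketched as follows. This is a hypothetical helper (the block list is partial, for illustration only): short substrings are indexed only when every character lies in a CJK Unicode block, so Latin text keeps the 3-character minimum and the index doesn't blow up.

```rust
// Sketch of the CJK-filter proposal above; the block ranges are a partial
// list, not an exhaustive definition of "CJK".
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{F900}'..='\u{FAFF}' // CJK Compatibility Ideographs
    )
}

fn should_index_short(substring: &str) -> bool {
    !substring.is_empty() && substring.chars().all(is_cjk)
}

fn main() {
    assert!(should_index_short("猫"));    // U+732B, indexed even at length 1
    assert!(!should_index_short("ca"));   // Latin stays at the 3-char minimum
}
```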
I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts.
I did some profiling and was able to easily cut the 3-hour build down to 1 minute (PR incoming), so this is much less of a concern for me now :) (edit: #74)
Here's a prototype for this approach:

Let me know what you think of it; so far it has really improved the search results, but searches do take a little longer now (worth it to me).
Code

Relevant TypeScript: stork/js/entity.ts, line 134 (at db1b958)

But I don't know if this affects indexing as well, or if it's strictly a search-interface issue.
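For context, a minimum-query-length gate like the one referenced in entity.ts might look like the sketch below (the constant 3 and the counting method are assumptions, and this is Rust rather than the actual TypeScript). Counting Unicode scalar values, "猫" is a single character, so it is rejected even though it carries as much meaning as the three-letter "cat".

```rust
// Hedged sketch of a minimum-query-length gate; not Stork's actual code.
const MIN_QUERY_CHARS: usize = 3;

fn query_allowed(query: &str) -> bool {
    // chars() counts Unicode scalar values, not bytes: "猫" is 1 char
    // here even though it occupies 3 bytes in UTF-8.
    query.chars().count() >= MIN_QUERY_CHARS
}

fn main() {
    assert!(query_allowed("cat"));
    assert!(!query_allowed("猫"));
}
```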
Details
I have a corpus of documents which are mixed Chinese/English. Searching for the English "cat" works well. However, trying to search for the Chinese character 猫 (cat) is not fruitful, because the search will not trigger unless I input at least three characters.

Given that the ratio of semantics to character count varies across languages, I think this can lead to a frustrating user experience.
As an aside, thanks a lot for making this, I'm super excited to use this!