This repository has been archived by the owner on Nov 12, 2023. It is now read-only.

Stats wrong for Chinese text #55

Closed
jzohrab opened this issue Jul 15, 2023 · 5 comments
Labels: bug (Something isn't working); fixed (Fixed in develop branch, waiting for release)

Comments

@jzohrab
Owner

jzohrab commented Jul 15, 2023

Stats are currently calculated in a way that doesn't work for character-based languages such as Chinese. For example, take this single-page text, with completely garbage terms created:

image

Even though the terms are trash, they cover 100% of the text, so you'd expect the % to be pretty high ... but the index page shows 0% known:

image

Obviously not right.
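
To make the mismatch concrete, here is a minimal Python sketch. It is not Lute's code: the function names, the sample text, and the terms are all made up for illustration, and it assumes the parser yields one token per character for Chinese. It contrasts counting tokens that exactly equal a defined term with measuring how many characters are covered by defined terms.

```python
def token_match_percent(tokens, terms):
    """Percent of tokens that exactly equal a defined term."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in terms)
    return 100.0 * known / len(tokens)


def char_coverage_percent(text, terms):
    """Percent of characters covered by a greedy longest match of terms."""
    if not text:
        return 0.0
    max_len = max((len(t) for t in terms), default=0)
    covered = 0
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-character terms win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in terms:
                covered += size
                i += size
                break
        else:
            i += 1  # no term starts here; this character stays uncovered
    return 100.0 * covered / len(text)


text = "我喜欢学习中文"
terms = {"我喜", "欢学习", "中文"}  # arbitrary "garbage" terms
tokens = list(text)                 # assumption: one token per character

print(token_match_percent(tokens, terms))  # per-token view
print(char_coverage_percent(text, terms))  # per-character coverage view
```

With these inputs the per-token view reports nothing as known, while the coverage view reports the whole text as covered, which is the gap described above.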

@jzohrab jzohrab added the bug Something isn't working label Jul 15, 2023
@jzohrab
Owner Author

jzohrab commented Jul 23, 2023

Tried a fix in [issue_55_fix_chinese_stats](https://github.com/jzohrab/lute/tree/issue_55_fix_chinese_stats) but the code is way too slow for real prod usage. There may be a far better way to do this, not sure what just yet though.

@jzohrab
Owner Author

jzohrab commented Jul 28, 2023

Tried another way that didn't rely on a full render calculation, also failed spectacularly with timeout. New class TokenCoverage in same branch. Messy code too, which is great.

@jzohrab
Owner Author

jzohrab commented Jul 30, 2023

Tried yet another method using regex matches. It still comes nowhere near finishing before the 30s timeout, so it's way too slow for prod.

Have sunk several hours into this, because a) it's interesting, and b) if I could figure this out, I'd be able to drop the TextTokens table, which takes up a lot of space. Currently, I'm really only using the TextTokens table for calculating stats -- ... actually, the current methods would probably suffice for calculating stats, so I may revisit this idea for that.

Regardless, I'm still not sure how to calculate coverage accurately for Chinese at the moment. The first method used (do a fake render) seemed to be the best -- still slow-ish, but maybe there are some good optimizations possible in the rendering calculations which feel overcomplicated.
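
For readers unfamiliar with the idea, a regex-based coverage check could look roughly like the sketch below. This is only an illustration under assumptions, not the code in the branch: `regex_coverage_percent` is a hypothetical name, and terms are plain strings matched as one large alternation, longest first.

```python
import re


def regex_coverage_percent(text, terms):
    """Percent of characters in text matched by any of the given terms."""
    if not text or not terms:
        return 0.0
    # Longest terms first so the alternation prefers longer matches.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # finditer yields non-overlapping matches, so lengths can be summed.
    covered = sum(len(m.group(0)) for m in pattern.finditer(text))
    return 100.0 * covered / len(text)


print(regex_coverage_percent("我喜欢学习中文", {"我喜", "欢学习", "中文"}))
```

The pattern grows with the number of defined terms, and every page has to be scanned against it, so the cost adds up for a large term list and a long book.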

@jzohrab
Owner Author

jzohrab commented Jul 31, 2023

Returned to the first method (effectively rendering each page in code) and found some good simplifications to the renderable calculator class, but it's still not good enough. For a book of ~100K Spanish words, the stats calc takes ~20s on my Mac, which is not usable.

Still found some good code optimizations, they're pushed to the branch, and can be pulled into the develop branch. Will handle that separately. Leaving this issue open.

@jzohrab
Owner Author

jzohrab commented Aug 2, 2023

Reducing the calc size makes it workable. Merged into the dev branch, added wiki faq page about it -- https://github.com/jzohrab/lute/wiki/Stats-calculation -- and will include it in next launch. Phew.
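
The comment above does not spell out how the calc size was reduced; the linked wiki page is the place to check. One plausible reading is that stats are computed over only a sample of the text. The Python sketch below illustrates that idea under that assumption only: `sampled_percent` and `sample_size` are hypothetical names and values, not Lute's.

```python
def sampled_percent(tokens, known, sample_size=2000):
    """Percent of known tokens within the first sample_size tokens."""
    sample = tokens[:sample_size]  # assumption: sample from the start
    if not sample:
        return 0.0
    hits = sum(1 for t in sample if t in known)
    return 100.0 * hits / len(sample)


tokens = ["a", "b", "c", "d"]
known = {"a", "b"}
print(sampled_percent(tokens, known))                 # whole list
print(sampled_percent(tokens, known, sample_size=2))  # first two tokens only
```

The trade-off is that the result is an estimate: it reflects only the sampled part of the book, in exchange for a bounded calculation time.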

@jzohrab jzohrab added the fixed Fixed in develop branch, waiting for release label Aug 2, 2023
@jzohrab jzohrab closed this as completed Aug 4, 2023