Incomplete display of delimited dictionary entries #5168

Closed
ceaum opened this issue Aug 2, 2019 · 4 comments · Fixed by koreader/koreader-base#1431 or #8446

Comments

@ceaum

ceaum commented Aug 2, 2019

  • KOReader version: v2019.07
  • Device: Kobo Clara HD

Dictionary look-up of a word that contains an entry with delimited portions displays a seemingly arbitrary portion.

Two examples using the wikt-en-ALL-2018-05-15 dictionary:

  1. Querying "haan" in StarDict displays both the Dutch and Finnish entries (screenshot not preserved). The KOReader dictionary only displays the Dutch entry, which happens to be the first one.
  2. Querying "vraag" displays this (screenshot not preserved) in StarDict, but only the Dutch entry in KOReader, which here happens to be the last entry.

The .dict.dz, .idx and .ifo files are placed within a directory in /mnt/onboard/.adds/koreader/data/dict/.

E: Title and text edits because I accidentally posted before completing the Issue.
E2: formatting

@ceaum ceaum changed the title Incomplete display of dictionary entry Incomplete display of delimited dictionary entries Aug 2, 2019
@Frenzie Frenzie added the bug label Aug 2, 2019
@Frenzie
Member

Frenzie commented Aug 2, 2019

See Dushistov/sdcv#30.

@cyphar
Contributor

cyphar commented Oct 18, 2021

Unsurprisingly because this affects Japanese text I went and fixed it 😅. I've submitted Dushistov/sdcv#78 upstream which should fix this issue.

@Frenzie Frenzie closed this as completed Oct 18, 2021
@Frenzie Frenzie added this to the 2021.11 milestone Oct 18, 2021
@poire-z
Contributor

poire-z commented Oct 18, 2021

Just asking: how do you judge the performance impact of your upstream PR?
sdcv can be slow when you have lots of dicts and no results, and/or when it needs to fall back to fuzzy search.
It feels like your fix only kicks in after an entry is found and then reads before/after it, so it shouldn't be too expensive, and it will do nothing when nothing is found, right?

@cyphar
Contributor

cyphar commented Oct 18, 2021

Yes to your questions -- only after finding an entry (binary search) will it do the minimum possible extra work to find any extra entries (linearly looking before and after the found index, comparing each entry with the search string). If there are no identical entries it adds only two extra string comparisons; if there are identical entries, I doubt you can do better than O(number-of-identical-entries), which is what we are doing. There is a sort-and-remove-duplicates step at the end of the search (which I guess isn't strictly necessary), but that's O(n log n) where n is the number of results (which is going to be small).

All-in-all it shouldn't make lookups much slower than they already were. On my laptop, exact searches with my 7 relatively-large Japanese dictionaries take ~80ms for both the no-matches case and the lots-of-entries-matching case (>100 for はい). Fuzzy searching takes 300-500ms (depending on whether it finds anything during fuzzy searching). This is basically identical to the time taken with sdcv master.

EDIT: I added some micro-optimisations (using std::set so there is no need to sort the vector, and only iterating over the match block once rather than twice in the worst case). There wasn't any change to the timing, but now there are no low-hanging optimisations left to apply.
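
For readers following along, the lookup strategy described in this comment might be sketched roughly as below. This is only an illustration, not the code from the upstream PR: `find_all_matches` and the plain `std::vector<std::string>` index are hypothetical stand-ins, and the sketch assumes the index is sorted by ordinary `std::string` ordering, which does not model the comparison rules a real StarDict index uses.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not sdcv's actual API): given a sorted list of
// headwords, return the positions of every entry identical to `word`.
std::set<std::size_t> find_all_matches(const std::vector<std::string>& index,
                                       const std::string& word)
{
    std::set<std::size_t> matches; // keeps results ordered and unique

    // Step 1: ordinary binary search for any one matching entry.
    std::size_t lo = 0, hi = index.size();
    std::size_t found = index.size(); // sentinel meaning "no match"
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const int cmp = index[mid].compare(word);
        if (cmp == 0) { found = mid; break; }
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    if (found == index.size())
        return matches; // nothing found, so no extra work is done

    matches.insert(found);

    // Step 2: look at the entries before the hit until one differs.
    for (std::size_t i = found; i > 0 && index[i - 1] == word; --i)
        matches.insert(i - 1);

    // Step 3: look at the entries after the hit until one differs.
    for (std::size_t i = found + 1; i < index.size() && index[i] == word; ++i)
        matches.insert(i);

    return matches;
}
```

With an index such as `{"aap", "haan", "haan", "haan", "vraag"}`, a lookup of `"haan"` visits each duplicate once plus one differing neighbour on each side, which is where the "two extra string comparisons" in the no-duplicates case come from.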

4 participants