[ReaderHighlight] NT: Improve selection of hyphenated words#13129
Conversation
poire-z
left a comment
There was a problem hiding this comment.
Just some comments while reading for you to see if there's anything that makes sense - I'm not really interested in understanding it all as its quite contained to a NT only section.
There are already quite a few comments, but as it's not obvious, more would be welcome for when Little Timmy gets into coding in 10 years.
| -- With crengine, selected_text.sboxes return good coordinates. | ||
| local pos = self.selected_text.sboxes[1] | ||
| local margins = self.ui.document.configurable.h_page_margins[1] + self.ui.document.configurable.h_page_margins[2] | ||
| local two_column_mode = self.ui.document.configurable.visible_pages == 2 |
There was a problem hiding this comment.
(Wording in the comment: selected_text.sboxes is a table, so it does not return anything, it (or they?) has or contains.)
Tested with a PDF ? where I guess we don't have configurable.h_page_margins (so accessing its index may crash if nil) and neither visible_pages (but no crash).
There was a problem hiding this comment.
This beef, you should have had it when the comment was first introduced years ago, i am just moving it around ;). But i get your point
Good point regarding pdf, will do so
There was a problem hiding this comment.
works fine with pdf, we do have the if not (self.ui.rolling and self.selected_text and self.selected_text.sboxes and #self.selected_text.sboxes > 0) then early though
| -- We cannot precisely recognise hyphenated words under all circumstances, so a heuristic approach is used, goes in two steps. | ||
| -- First step: check if the box is huge. | ||
| local is_word_hyphenated = pos.w > 0.7 * effective_width | ||
| -- Second step: weed out false positives by comparing words at different box coordinates. | ||
| if is_word_hyphenated then |
There was a problem hiding this comment.
Is this function called (or this bit of code met) only on the first press, where only a single word will be selected ?
If no, I think you insisting in there on "word hyphenated" is wrong: it may be your main use case, but reading it is confusing. It would have a place in a comment.
There was a problem hiding this comment.
I'm inclined to agree. It's trying to detect multiline words which may or may not be hyphenated. But perhaps most important absolutely no hyphenation detection of any sort is involved.
There was a problem hiding this comment.
Function is only called when one word is selected. As i have stated already, i don’t know how to figure out if a word is being hyphenated, if you do know -i have been screaming about it- speak up. I am going by Sherlock's approach, “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
There was a problem hiding this comment.
Just because the word is split across multiple lines doesn't mean it's hyphenated. It just means it's split across multiple lines.
There was a problem hiding this comment.
^ That's it. You can't know if it's really hyphenated in the frontend context (and trying to know would probably be complicated).
In your context of boxes (that's all you have), it is "split across multiple lines". In 99.99% it will indeed be because of hyphenation - but writing about hyphenation in your frontend code is confusing because it is a little lie (but you could write in a comment "most often, it will be because the word is hyphenated").
There was a problem hiding this comment.
Yes, but by process of elimination, it's a lot more likely we are breaking it up and hyphenating it. Therefore the system is 99% accurate. I invite you to put to the test.
There was a problem hiding this comment.
My comments all along were just about wording and naming in the code.
is_word_hyphenated => is_word_splitted (and "hyphenated" mentionned only in comments) would make me satistfied :) (split/splitted, not sure it's correct English, so another word that looks different as an adjective would be better, unless it reads naturally to you Englishmen).
(About testing, effectiveness, accuray, I trust you - because I won't use it - but I can read that code :)
There was a problem hiding this comment.
is_word_split seems fine to me then.
There was a problem hiding this comment.
You can't know if it's really hyphenated in the frontend context (and trying to know would probably be complicated).
Nor does it even matter. We don't want to limit this behavior to hyphenation — a dumb split should be treated the same way. :-)
| if is_word_hyphenated then | ||
| local word_at_pos = self.ui.document:getWordFromPosition({ | ||
| x = BD.mirroredUILayout() and (pos.x + pos.w) or pos.x, | ||
| y = pos.y + pos.h * 1/4 |
There was a problem hiding this comment.
Not sure I understand this 1/4 and the 3/4 below - might be worth a comment.
There was a problem hiding this comment.
In a single line 1/2 is the middle, so for two lines 1/4 and 3/4 would be line 1 and 2. Or in case of a false positive, a bit higher/lower than it'd otherwise be, probably harmless.
There was a problem hiding this comment.
question: technically speaking to make the system 99.99% sure we are spliting words, we should also check the bottom right and compare the three results, what do you think? is it enough the way it is, or should i do both?
There was a problem hiding this comment.
I figure you should probably go for the bottom center if you did. The end of the line may not be reached.
There was a problem hiding this comment.
There was a problem hiding this comment.
but also in the check we first make sure we do in fact have a word before trying to call it.
I didn't mean in the sense of it causing a crash or something, just in the sense of having useful results.
I didn't realize it'd select things all the way over on the left of the line.
There was a problem hiding this comment.
Well, we could always make it 99.9999% accurate ;) aws offers 11 9s of data durability…
|
you are all going to love this one ;) <p class="C136">Farid opened his mouth, almost said something,
reconsid<span id="page_219"></span>ered and said nothing. He slumped
against the wall, muttered something under his breath, and crossed his mutilated
foot over the good one. His accusing eyes never left me.</p> |
Not that much :) it's a known "issue" among many small ones we can live with, when a text node is broken inside a word (by these stupid (kobo?) spans): we mostly handle things text node by text node, so when these span happen, it marks the end of a text node, and so of its last word. |
|
I see. Noted
btw, Little Timmy is more into construction and that sort of thing, he wants to be a quantity surveyor when he grows up Edit: @poire-z do you need more comments or is it fine now? |
Looks ok aesthetically :) (I haven't focused on the logic and results, trusting you.) |
|
just an idea, but should not be needed because centred text should not hyphenate in the first place. -- With crengine, selected_text.sboxes have good coordinates, so we'll borrow them.
local pos = self.selected_text.sboxes[1]
local font_size = self.ui.document.configurable.font_size
local saved_mbh = self.ui.doc_settings:readSetting("min_box_height")
if saved_mbh and saved_mbh.font_size ~= font_size then
saved_mbh = nil
end
local tally = saved_mbh and saved_mbh.tally or {}
tally[pos.h] = (tally[pos.h] or 0) + 1
self.ui.doc_settings:saveSetting("min_box_height", { tally = tally, font_size = font_size })
local min_box_height = pos.h
local best_count = 0
for h, count in pairs(tally) do
if count > best_count then
best_count = count
min_box_height = h
end
end
local margins = self.ui.document.configurable.h_page_margins[1] + self.ui.document.configurable.h_page_margins[2]
local two_column_mode = self.ui.document.configurable.visible_pages == 2
local effective_width = two_column_mode and (self.screen_w - margins) / 2 or self.screen_w - margins
-- When words are split (and hyphenated) due to line breaks, they create selection boxes that are almost as wide as the
-- effective_width, so we need to check if that is the case, in order to handle those cases properly. We cannot precisely
-- and easily recognise hyphenated words in the front end, so a heuristic approach is used, it goes in two steps.
-- Step one: check if our box is a 'big boy'. We must allow some room for unknown variables like publisher-embedded padding, etc.
local is_word_split = pos.w > 0.7 * effective_width
if pos.h > min_box_height then
-- If the box is also very tall, it's likely a split word that got an extra tall box instead of two separate boxes.
is_word_split = true
end |
|
this seems like a better idea local is_word_split = false
if pos.w > 0.5 * effective_width then
is_word_split = true
elseif pos.w > 0.25 * effective_width and (pos.h/pos.w) > 0.2 then
is_word_split = true
end |



what's new
The changes focus on improving the logic for handling highlight interactions and enhancing the accuracy of text selection.
Highlight Interaction Logic:
Text Selection Accuracy:
updatePositionsto centralize the logic for updating hold and indicator positions.screenshots
before
after
correct detection under extreme circumstances
correct detection in two column mode and landscape orientation
closes #13120
This change is