Skip to content

[ReaderHighlight] NT: Improve selection of hyphenated words#13129

Merged
Frenzie merged 14 commits into
koreader:masterfrom
Commodore64user:hyphen-select
Jan 28, 2025
Merged

[ReaderHighlight] NT: Improve selection of hyphenated words#13129
Frenzie merged 14 commits into
koreader:masterfrom
Commodore64user:hyphen-select

Conversation

@Commodore64user

@Commodore64user Commodore64user commented Jan 24, 2025

Copy link
Copy Markdown
Member

what's new

The changes focus on improving the logic for handling highlight interactions and enhancing the accuracy of text selection.

  • Highlight Interaction Logic:

    • Simplified the conditional checks and streamlined the logic for starting and stopping highlights.
    • Added a return statement to handle cases where there is no current indicator position early in the function.
  • Text Selection Accuracy:

    • Introduced a heuristic approach to determine if a word is hyphenated due to line breaks, improving the accuracy of text selection.
    • Created a helper function updatePositions to centralize the logic for updating hold and indicator positions.

screenshots

before

after

correct detection under extreme circumstances

correct detection in two column mode and landscape orientation

closes #13120


This change is Reviewable

@poire-z poire-z left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments while reading for you to see if there's anything that makes sense - I'm not really interested in understanding it all as its quite contained to a NT only section.
There are already quite a few comments, but as it's not obvious, more would be welcome for when Little Timmy gets into coding in 10 years.

Comment on lines +2481 to +2484
-- With crengine, selected_text.sboxes return good coordinates.
local pos = self.selected_text.sboxes[1]
local margins = self.ui.document.configurable.h_page_margins[1] + self.ui.document.configurable.h_page_margins[2]
local two_column_mode = self.ui.document.configurable.visible_pages == 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Wording in the comment: selected_text.sboxes is a table, so it does not return anything, it (or they?) has or contains.)

Tested with a PDF ? where I guess we don't have configurable.h_page_margins (so accessing its index may crash if nil) and neither visible_pages (but no crash).

@Commodore64user Commodore64user Jan 26, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This beef, you should have had it when the comment was first introduced years ago, i am just moving it around ;). But i get your point

Good point regarding pdf, will do so

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works fine with pdf, we do have the if not (self.ui.rolling and self.selected_text and self.selected_text.sboxes and #self.selected_text.sboxes > 0) then early though

Comment on lines +2487 to +2491
-- We cannot precisely recognise hyphenated words under all circumstances, so a heuristic approach is used, goes in two steps.
-- First step: check if the box is huge.
local is_word_hyphenated = pos.w > 0.7 * effective_width
-- Second step: weed out false positives by comparing words at different box coordinates.
if is_word_hyphenated then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function called (or this bit of code met) only on the first press, where only a single word will be selected ?
If no, I think you insisting in there on "word hyphenated" is wrong: it may be your main use case, but reading it is confusing. It would have a place in a comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree. It's trying to detect multiline words which may or may not be hyphenated. But perhaps most important absolutely no hyphenation detection of any sort is involved.

@Commodore64user Commodore64user Jan 26, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function is only called when one word is selected. As i have stated already, i don’t know how to figure out if a word is being hyphenated, if you do know -i have been screaming about it- speak up. I am going by Sherlock's approach, “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because the word is split across multiple lines doesn't mean it's hyphenated. It just means it's split across multiple lines.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ That's it. You can't know if it's really hyphenated in the frontend context (and trying to know would probably be complicated).
In your context of boxes (that's all you have), it is "split across multiple lines". In 99.99% it will indeed be because of hyphenation - but writing about hyphenation in your frontend code is confusing because it is a little lie (but you could write in a comment "most often, it will be because the word is hyphenated").

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but by process of elimination, it's a lot more likely we are breaking it up and hyphenating it. Therefore the system is 99% accurate. I invite you to put to the test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments all along were just about wording and naming in the code.
is_word_hyphenated => is_word_splitted (and "hyphenated" mentionned only in comments) would make me satistfied :) (split/splitted, not sure it's correct English, so another word that looks different as an adjective would be better, unless it reads naturally to you Englishmen).
(About testing, effectiveness, accuray, I trust you - because I won't use it - but I can read that code :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_word_split seems fine to me then.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't know if it's really hyphenated in the frontend context (and trying to know would probably be complicated).

Nor does it even matter. We don't want to limit this behavior to hyphenation — a dumb split should be treated the same way. :-)

Comment thread frontend/apps/reader/modules/readerhighlight.lua
if is_word_hyphenated then
local word_at_pos = self.ui.document:getWordFromPosition({
x = BD.mirroredUILayout() and (pos.x + pos.w) or pos.x,
y = pos.y + pos.h * 1/4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this 1/4 and the 3/4 below - might be worth a comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a single line 1/2 is the middle, so for two lines 1/4 and 3/4 would be line 1 and 2. Or in case of a false positive, a bit higher/lower than it'd otherwise be, probably harmless.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: technically speaking to make the system 99.99% sure we are spliting words, we should also check the bottom right and compare the three results, what do you think? is it enough the way it is, or should i do both?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figure you should probably go for the bottom center if you did. The end of the line may not be reached.

@Commodore64user Commodore64user Jan 26, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it's not really a problem (whether is bottom right or bottom centre), it seems to me crengine will always attach to the nearest word so you should always get 'something' but also in the check we first make sure we do in fact have a word before trying to call it.

@Frenzie Frenzie Jan 26, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but also in the check we first make sure we do in fact have a word before trying to call it.

I didn't mean in the sense of it causing a crash or something, just in the sense of having useful results.

I didn't realize it'd select things all the way over on the left of the line.

@Commodore64user Commodore64user Jan 26, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we could always make it 99.9999% accurate ;) aws offers 11 9s of data durability…

@Commodore64user

Commodore64user commented Jan 26, 2025

Copy link
Copy Markdown
Member Author

you are all going to love this one ;)

<p class="C136">Farid opened his mouth, almost said something, 
reconsid<span id="page_219"></span>ered and said nothing. He slumped 
against the wall, muttered something under his breath, and crossed his mutilated 
foot over the good one. His accusing eyes never left me.</p>

@poire-z

poire-z commented Jan 26, 2025

Copy link
Copy Markdown
Contributor

you are all going to love this one ;)

Not that much :) it's a known "issue" among many small ones we can live with, when a text node is broken inside a word (by these stupid (kobo?) spans): we mostly handle things text node by text node, so when these span happen, it marks the end of a text node, and so of its last word.
Searching for "reconsidered" would not find it. A ligature would not happen...
The one thing that may work - as seen there - is hyphenation.

@Commodore64user

Commodore64user commented Jan 26, 2025

Copy link
Copy Markdown
Member Author

I see. Noted

more would be welcome for when Little Timmy gets into coding in 10 years.

btw, Little Timmy is more into construction and that sort of thing, he wants to be a quantity surveyor when he grows up

Edit: @poire-z do you need more comments or is it fine now?

@poire-z

poire-z commented Jan 27, 2025

Copy link
Copy Markdown
Contributor

do you need more comments or is it fine now?

Looks ok aesthetically :) (I haven't focused on the logic and results, trusting you.)

@Frenzie Frenzie added this to the 2025.01 milestone Jan 28, 2025
Comment thread frontend/apps/reader/modules/readerhighlight.lua Outdated
@Commodore64user

Copy link
Copy Markdown
Member Author

just an idea, but should not be needed because centred text should not hyphenate in the first place.

    -- With crengine, selected_text.sboxes have good coordinates, so we'll borrow them.
    local pos = self.selected_text.sboxes[1]
    local font_size = self.ui.document.configurable.font_size
    local saved_mbh = self.ui.doc_settings:readSetting("min_box_height")
    if saved_mbh and saved_mbh.font_size ~= font_size then
        saved_mbh = nil
    end
    local tally = saved_mbh and saved_mbh.tally or {}
    tally[pos.h] = (tally[pos.h] or 0) + 1
    self.ui.doc_settings:saveSetting("min_box_height", { tally = tally, font_size = font_size })

    local min_box_height = pos.h
    local best_count = 0
    for h, count in pairs(tally) do
        if count > best_count then
            best_count = count
            min_box_height = h
        end
    end
    local margins = self.ui.document.configurable.h_page_margins[1] + self.ui.document.configurable.h_page_margins[2]
    local two_column_mode = self.ui.document.configurable.visible_pages == 2
    local effective_width = two_column_mode and (self.screen_w - margins) / 2 or self.screen_w - margins
    -- When words are split (and hyphenated) due to line breaks, they create selection boxes that are almost as wide as the
    -- effective_width, so we need to check if that is the case, in order to handle those cases properly. We cannot precisely
    -- and easily recognise hyphenated words in the front end, so a heuristic approach is used, it goes in two steps.
    -- Step one: check if our box is a 'big boy'. We must allow some room for unknown variables like publisher-embedded padding, etc.
    local is_word_split = pos.w > 0.7 * effective_width
    if pos.h > min_box_height then
        -- If the box is also very tall, it's likely a split word that got an extra tall box instead of two separate boxes.
        is_word_split = true
    end

@Commodore64user

Copy link
Copy Markdown
Member Author

this seems like a better idea

    local is_word_split = false
    if pos.w > 0.5 * effective_width then
        is_word_split = true
    elseif pos.w > 0.25 * effective_width and (pos.h/pos.w) > 0.2 then
        is_word_split = true
    end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NT: Wrong positioning when selecting hyphenated words

3 participants