Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fulltext search: all entries in entire document #11313

Merged
merged 30 commits into from Jan 13, 2024

Conversation

hius07
Copy link
Member

@hius07 hius07 commented Jan 2, 2024

New All button to search for all entries in a pdf document.

1

Search results showing two previous and two next words, and page number.

2

On tap jumps to the page and highlights the search string.

3

Extreme: search for "the".

4


This change is Reviewable

@hius07
Copy link
Member Author

hius07 commented Jan 2, 2024

Related issues #11038, #11203.

@@ -126,6 +127,13 @@ function ReaderSearch:onShowFulltextSearchInput()
UIManager:close(self.input_dialog)
end,
},
{
text = _("All"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably add a context. Whether it's "all search results" or "all aliens from Mars" might make a difference in some languages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@translators Find all entries in entire document, button displayed on the search bar, should be short. ?

Copy link
Member

@Frenzie Frenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance for just text boxes is okay? I can easily see going through all pages being quite heavy.

@hius07
Copy link
Member Author

hius07 commented Jan 2, 2024

Yes, it's not very fast but rather good. I'll try some acceleration.

table.insert(res_item.boxes, self:nativeToPageRectTransform(doc, page, word_box))

Transform is not needed for every item, can be done when drawing only.

@poire-z
Copy link
Contributor

poire-z commented Jan 2, 2024

Various thoughts:

Feels a bit Proof of Concept that we can have something working with PDF - while having nothing (yet) working with EPUBs. EPUBs is more text-centered and we may have more info to work with and present in a search result lists (ie. chapter titles of where the results is in). We may need to think about both to figure out the common needs/possibilities/API (and possibly have an API that would be reusable if we need something to present in Book map).
But I guess it's better to have something limited to PDF and possibly slow than nothing... (really? :) it was not somethign super needed). And the implementation is small and non-intrusive, so it should not hurt much if we ever need to rework it all.

For crengine, I think we'd need some work/addition to cre.cpp - even just removing scan limits to 2 page height to see how fast/slow it could be - which you can't do/compile/test I guess :/
And may be it would deserve something more complete than the existing findText() that returns only start and end xpointers, returning words and context and page numbers, to avoid so crengine api call for each result.
(Not too keen on having to work on that these days :/)

There's also the notion of "context" I mentionned at #11203 (comment).
You hardcoded it to 2 "words" (not really tunable, no loop), which will be little in CJK and may be very long in German :)
I remember there's been some discussion about context for the Vocabulary Builder plugin - and some stuff has been implemented by #9622 that you maybe could reuse?

About the context again, above, in your screenshot for "the", the middle word(s) is always lowercase, even when the surrounding words are uppercase. May be you should fetch the middle word from the text instead of reusing the lowercased pattern.
Also, the Vocab Builder plugin wraps the word in "context before【 word lookup 】context after" - may be you could do the same, so the search term jumps at you in each result context. (I can't find where these symbols were discussed.)

You could also keep the list of results to re-use/display it again if the users comes back to executing the same query to go see another page - to avoid the slowness.

@hius07
Copy link
Member Author

hius07 commented Jan 2, 2024

For crengine, I think we'd need some work/addition to cre.cpp

That's the main reason I started with pdf.

Thanks for the various ideas, I'll work with them.

end
end

function KoptInterface:findTextAll(doc, pattern, caseInsensitive)
Copy link
Member

@NiLuJe NiLuJe Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findAllText?

Copy link
Member

@NiLuJe NiLuJe Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Unless there's a previous pattern of functions being named findText<Something>?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For internal usage there is

function KoptInterface:findAllMatches(doc, pattern, caseInsensitive, page)

and for external calls
function KoptInterface:findText(doc, pattern, origin, reverse, caseInsensitive, pageno)

So, the new external one supposed to be findTextAll.

@poire-z
Copy link
Contributor

poire-z commented Jan 2, 2024

Here's a patch to base/cre.cpp that may help getting all results and more info so you can build something similar to what you have with PDF:
cre_cpp_findTextAll_1.diff.txt
May be some dev fellow with a Kindle toolchain could build it with this patch and just give you the generated libkoreader-cre.so ? @NiLuJe @yparitcher @mergen3107 ? (I guess they'll guess what kind of kindle build you need).

I tested it just on console with:

+++ b/frontend/apps/reader/modules/readersearch.lua
-    local retval, words_found = self.ui.document:findText(pattern, origin, direction, case_insensitive, page, regex, self.max_hits)
+    local retval, words_found = self.ui.document._document:findTextAll(pattern, case_insensitive and 1 or 0, regex and 1 or 0, self.max_hits or 200, 1, 3)
+    logger.warn(retval, words_found)

The last 2 arguments can be set to 0 to only get xpointers, as in findText(), if you need to benchmark how expensive is the next stuff.

If the previous to last is 1, you additionally get word (actually, what matched), prefix and suffix, which could allow you to reconstruct the full word whose only some bits did match.
Ie, searching for "oman":

  {
    ["end"] = "/body/DocFragment[16]/body/section/div/p[23]/text()[8].10",
    start = "/body/DocFragment[16]/body/section/div/p[23]/text()[8].6",
    word = "oman",
    word_prefix = "R",
    word_suffix = "ia"
  } --[[table: 0x7f0535156bd8]],
  {
    ["end"] = "/body/DocFragment[17]/body/section/div/p[63]/text()[2].32",
    start = "/body/DocFragment[17]/body/section/div/p[63]/text()[2].28",
    word = "oman",
    word_prefix = "w"
  } --[[table: 0x7f0535156da8]],

The last one is the nb of additional words you want before and after.
With 3:

  {
    ["end"] = "/body/DocFragment[16]/body/section/div/p[8]/text()[3].360",
    next_text = " of his caste",
    prev_text = "would marry a ",
    start = "/body/DocFragment[16]/body/section/div/p[8]/text()[3].356",
    word = "oman",
    word_prefix = "w"
  } --[[table: 0x7f33806e7d38]],
  {
    ["end"] = "/body/DocFragment[16]/body/section/div/p[23]/text()[8].10",
    next_text = " and ego in",
    prev_text = "pronounced yo) in ",
    start = "/body/DocFragment[16]/body/section/div/p[23]/text()[8].6",
    word = "oman",
    word_prefix = "R",
    word_suffix = "ia"
  } --[[table: 0x7f33806e7f68]],
  {
    ["end"] = "/body/DocFragment[17]/body/section/div/p[63]/text()[2].32",
    next_text = ", was imagined as",
    prev_text = "high-born ",
    start = "/body/DocFragment[17]/body/section/div/p[63]/text()[2].28",
    word = "oman",
    word_prefix = "w"
  } --[[table: 0x7f33806e81a0]],

I haven't checked how it behave when matching the first or last word of a book :)
Also, if we'd like to show the chapter a match is in, it's not for crengine to do that, as a user can have a custom ToC, which is only known to frontend.

@yparitcher
Copy link
Member

Not at my computer to build,
Can you please add it to dispatcher?

@mergen3107
Copy link
Contributor

@hius07 @poire-z
Here is the Yandex.Disk shared folder with the new v2023.10-82 build and the libkoreader-cre.so lib itself with the above changes:
https://disk.yandex.com/d/vWFIaaT40QRteg

@hius07
Copy link
Member Author

hius07 commented Jan 3, 2024

Updates:
(1) Adjustable (not from UI) number of words in context, is 3 now
(2) Special brackets in the result list (around the whole words, in pdf we do not highlight part of the word in search results)
(3) Cache the last search result

5

6

@poire-z
Copy link
Contributor

poire-z commented Jan 3, 2024

Have you got to testing the above libkoreader-cre.so, at least being sure that just replacing it does not make it crash, and that calling the function works?

image
I guess before/after these【 】, there might be some space from the neighbour words, that causes the large gap? How does it look when the immediate neighbour chars are a punctuation like a comma or parens?
I wonder if we could remove any near space to make the gap less large, as these【 】being CJK chars already bring some spacing on their outer side.

@hius07
Copy link
Member Author

hius07 commented Jan 3, 2024

Thank you and @mergen3107 for the lib, haven't tried yet, I want to finish with pdf and then dig cre.

It's better without additional spaces:

7

@hius07
Copy link
Member Author

hius07 commented Jan 3, 2024

About brackets: in cre we will need sometimes to put them around a part of a word, and even without additional spaces the gaps are rather wide.

@poire-z
Copy link
Contributor

poire-z commented Jan 3, 2024

Depends on how you want to present stuff: you could bracket the full word even if the pattern match only some part (the cre new functions would allow you to construct the whole word with the returned word_prefix/suffix).
There may be other thick but less wide unicode symbols - not sure why we went with these.

We could also use HtmlBoxWidget so you could just use pretext <b>word</b> posttext (and various font size and style to show the TOC chapter or other stuff). But it may be expensive having to have MuPDF render each HTML snippet it that list of 12791 results :)

@ryanwwest
Copy link
Contributor

ryanwwest commented Jan 3, 2024

What about inverting or highlighting the matched phrase, similar to VS Code? Color highlights could also work but I think there are some reasons against colors here.

image

@poire-z
Copy link
Contributor

poire-z commented Jan 3, 2024

In most of our UI list text stuff, we are in a plain-text world (single font size, no styling, not even bold or italic, single color, single bgcolor).
(Otherwise, we should go with HTML like I mentionned above, which will be expensive. We use it for Dict lookup results, but drawing only one stuff at a time.)

@ryanwwest
Copy link
Contributor

ryanwwest commented Jan 3, 2024

it may be expensive having to have MuPDF render each HTML snippet it that list of 12791 results :)

Is there a way we could just render the current page of results at a time to lower that to maybe a maximum of 40 results/page? Maybe this is covered above, I only skimmed.

@hius07
Copy link
Member Author

hius07 commented Jan 4, 2024

The cre findTextAll() is really fast (tested without regex so far).

I am not sure about the brackets around the part of a word. Query "copy":

12

@poire-z
Copy link
Contributor

poire-z commented Jan 4, 2024

I am not sure about the brackets around the part of a word (no mandatory yet):

Why ?
Oh, right: "copyright" does not read intuitively :)
May be swap to using the fullword, like you'll have for PDF ? It may be a bit odd, but somehow, a full word not being what was expected to be found will jump out more obviously ? (ie. if realy looking for "copy", 【copyright】will be easily distinguished from 【copy】).

We could also use HtmlBoxWidget so you could just use pretext word posttext (and various font size and style to show the TOC chapter or other stuff). But it may be expensive having to have MuPDF render each HTML snippet it that list of 12791 results :)

Or maybe not. MuPDF is probably not as good with text than our text/xtext stuff (RTL, Bidi, picking the right glyphs from the CJK font per language, using our long chain of fallback font...).
So, probably best to keep showing text, with some kind of bracket or some other idea...

@hius07
Copy link
Member Author

hius07 commented Jan 4, 2024

The whole word in brackets:

13

@hius07
Copy link
Member Author

hius07 commented Jan 12, 2024

Works good, thanks!

Comment on lines 520 to 523
local text = item.matched_text or ""
if item.matched_word_prefix then
text = item.matched_word_prefix .. text
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it local word and mention why we show the full word and not only the requested matched text.
-- PDF/Kopt shows full words when only some part matches; let's do the same with CRE

@poire-z
Copy link
Contributor

poire-z commented Jan 12, 2024

If no more comment about this PR, I'll merge (and bump into the koreader repo, dunno if you can do this as part of this PR with github web) koreader/koreader-base#1720 tomorrow.
@hius07 : available this saturday to undo last commit and merge this one?

@hius07
Copy link
Member Author

hius07 commented Jan 12, 2024

Very good for Saturday, thanks.

@hius07 hius07 merged commit 0ceb88a into koreader:master Jan 13, 2024
3 checks passed
@hius07 hius07 deleted the pdf-text-search-all branch January 13, 2024 10:58
@hius07 hius07 added this to the 2024.02 milestone Jan 13, 2024
@poire-z
Copy link
Contributor

poire-z commented Jan 13, 2024

I tried to pursue on the idea at #11313 (comment).

Simplest and least intrusive (and to not have to pass a simple_formatting=true all along from upper widgets down to TextBoxWidget) I could come up is:

--- a/frontend/apps/reader/modules/readersearch.lua
+++ b/frontend/apps/reader/modules/readersearch.lua
@@ -518,2 +518,3 @@ function ReaderSearch:showFindAllResults(not_cached)
     if self.ui.rolling and not_cached then -- for ui.paging: items are built in KoptInterface:findAllText()
+        local TextBoxWidget = require("ui/widget/textboxwidget")
         for _, item in ipairs(self.findall_results) do
@@ -528,3 +529,4 @@ function ReaderSearch:showFindAllResults(not_cached)
             -- append context before and after the word
-            local text = "ã<80><90>" .. word .. "ã<80><91>"
+            -- local text = "ã<80><90>" .. word .. "ã<80><91>"
+            local text = TextBoxWidget.FORMATTING_BOLD_START .. word .. TextBoxWidget.FORMATTING_BOLD_END
             if item.prev_text then
@@ -535,2 +537,3 @@ function ReaderSearch:showFindAllResults(not_cached)
             end
+            text = TextBoxWidget.FORMATTING_ENABLED .. text
             item.text = text
--- a/frontend/ui/widget/textboxwidget.lua
+++ b/frontend/ui/widget/textboxwidget.lua
@@ -112,2 +112,9 @@ local TextBoxWidget = InputContainer:extend{
     for_measurement_only = nil, -- When the widget is a one-off used to compute text height
+
+    -- Simple formatting chars that can be embedded in self.text
+    -- (these codepoints are not part of Unicode, so hopefully not present naturally in any provided self.text)
+    FORMATTING_ENABLED = "\u{FFF1}", -- should be put at start of 'text' to indicate we may find the next ones
+    FORMATTING_BOLD_START = "\u{FFF2}",
+    FORMATTING_BOLD_END = "\u{FFF3}",
+    _formatting_char_bold = nil,
 }
@@ -156,2 +163,34 @@ function TextBoxWidget:init()

+    if self.text and type(self.text) == "string" and self.text:sub(1, #TextBoxWidget.FORMATTING_ENABLED) == TextBoxWidget.FORMATTING_ENABLED then
+        -- Support for very simple formatting (bold only for now)
+        self._formatting_char_bold = {}
+        -- Alas, we can't let any of our flag characters be fed to xtext (even with ASCII control
+        -- chars, it would give them a width, which would result at best in spurious added spacing).
+        -- So, split text into a table of chars, filter our flags out keeping track of where they
+        -- start and end bold, and rebuild a string.
+        local charlist = util.splitToChars(self.text)
+        table.remove(charlist, 1)
+        local is_bold = false
+        local len = #charlist
+        local i = 1
+        while i <= len do
+            local ch = charlist[i]
+            if ch == TextBoxWidget.FORMATTING_BOLD_START then
+                is_bold = true
+                table.remove(charlist, i)
+                len = len - 1
+            elseif ch == TextBoxWidget.FORMATTING_BOLD_END then
+                is_bold = false
+                table.remove(charlist, i)
+                len = len - 1
+            else
+                if is_bold then
+                    self._formatting_char_bold[i] = true
+                end
+                i = i + 1
+            end
+        end
+        self.text = table.concat(charlist, "")
+    end
+
     self:_computeTextDimensions()
@@ -824,3 +863,4 @@ function TextBoxWidget:_renderText(start_row_idx, end_row_idx)
                         local face = self.face.getFallbackFont(xglyph.font_num) -- callback (not a method)
-                        local glyph = RenderText:getGlyphByIndex(face, xglyph.glyph, self.bold)
+                        local bolder = self._formatting_char_bold and self._formatting_char_bold[xglyph.text_index] or false
+                        local glyph = RenderText:getGlyphByIndex(face, xglyph.glyph, bold, bolder)
                         local color = self.fgcolor
--- a/frontend/ui/rendertext.lua
+++ b/frontend/ui/rendertext.lua
@@ -291,3 +291,3 @@ end
 -- @treturn glyph
-function RenderText:getGlyphByIndex(face, glyphindex, bold)
+function RenderText:getGlyphByIndex(face, glyphindex, bold, bolder)
     if face.is_real_bold then
@@ -295,3 +295,3 @@ function RenderText:getGlyphByIndex(face, glyphindex, bold)
     end
-    local hash = "xglyph|"..face.hash.."|"..glyphindex.."|"..(bold and 1 or 0)
+    local hash = "xglyph|"..face.hash.."|"..glyphindex.."|"..(bold and 1 or 0)..(bolder and "x" or "")
     local glyph = GlyphCache:check(hash)
@@ -301,3 +301,11 @@ function RenderText:getGlyphByIndex(face, glyphindex, bold)
     end
-    local rendered_glyph = face.ftsize:renderGlyphByIndex(glyphindex, bold and face.embolden_half_strength)
+    local embolden_strength
+    if bold or bolder then
+        embolden_strength = face.embolden_half_strength
+        if bolder then
+            -- Even if not bold, get it bolder than the strength we'd use for bold
+            embolden_strength = embolden_strength * 1.5 -- or 2 for very bold
+        end
+    end
+    local rendered_glyph = face.ftsize:renderGlyphByIndex(glyphindex, embolden_strength)
     if not rendered_glyph then

which would give us instead of our strange brackets:
image

(I think we can't really go too bold, as complex glyphs like CJK may get ugly and unreadable.)

Good enough and not too ugly to go with that ?

@hius07
Copy link
Member Author

hius07 commented Jan 13, 2024

I like the bold font, more than the brackets.

@poire-z
Copy link
Contributor

poire-z commented Jan 15, 2024

Just mentionning that:
when on a CRE document, and jumping from that results list to a match and being back in the book, the match is highlighted, and stays highlighted for a long time :) Until a next jump or until the next text selection/highlight.
Dunno if there's, or it if needs, a solution: when jumping, we close the search results list widget and are back in the book, where we might want to browse pages around the result and come back to it - or "go back to previous location" to get back to the original page - or stay there, so hard to decide when to clear the result highlighting.
But it's a bit inconsistent that this "behaviour" does not happen with PDF.

@hius07
Copy link
Member Author

hius07 commented Jan 15, 2024

hard to decide when to clear the result highlighting

Exactly, that's the reason.

@poire-z
Copy link
Contributor

poire-z commented Jan 15, 2024

A few remarks I had while playing with it on my Kobo (while testing bold):

  • The All button feels a bit strange (it's simple and states we'll see All results, but somehow, it doesn't invite me to tap it :)). Dunno if a Unicode symbol showing a list of items (but different than a hamburger icon, which means popup menu elsewhere) would be better and more similar to the arrow triangles that invite to tap them ("All" in text near "Cancel" invites to give up :))
  • Having a gesture action "Last fulltext search results" could be nice, to get back to it after having jumped to a result. It should also remember the Menu page we were on last time and put us on it.
  • (Having a gesture action "Last fulltext searches" , showing in a popup the 5-10 last search terms, and allowing to jump to their (saved/cached) results would be nice too, but probably overkill :))

@mergen3107
Copy link
Contributor

I agree with the last point.
Greatest example I've seen so far is search engine in Notepad++.

You can search terms within files of a directory and its subdirectories.

Each search query is like a chapter in the TOC: you can sxpand and collapse it, and all searches are always available as a cache.

image

Since book files usually don't change their contents, it would make a lot of sense to just keep piling up cache of searched terms attached to this book's sdr (?).
After all, there is a finite number of "Martin Eden" within a given book :D
So you search for this guy once, and then just keep reusing the cache over and over.

(This now starts to resemble Kindle's X-Ray feature a lot...)

@ryanwwest
Copy link
Contributor

I love the Notepad++ search for the same reason. Remembering and caching the last few results is nice as long as it doesn't take up tons of space.

@Commodore64user
Copy link
Contributor

this is a brilliant addition and I am using it more and more every day, however, there is no option to adjust the font size of the results. Would it possible to add this?

also, in some results, when the matched word is at the beginning of a paragraph, it will join it together to the previous word: example, let's say you are looking for the word "Train" the text is, [he departed that night. Train playing in the background] where each line is a separate paragraph, then it will displayed as [he departed that night.Train playing in the background] without any spacing between night. and Train (makes sense?)

@hius07
Copy link
Member Author

hius07 commented Feb 28, 2024

adjust the font size

It's not obvious, but it is the same font size that is used by the file browser in Classic display mode.
It is set in the file browser upper menu, the first tab - Settings - Classic mode settings - Item font size.

@Commodore64user
Copy link
Contributor

It's not obvious, but it is the same font size that is used by the file browser in Classic display mode. It is set in the file browser upper menu, the first tab - Settings - Classic mode settings - Item font size.

it would probably be best, if it was under settings like it is for dictionary lookup, and independent of the classic display mode.

Another thing, currently you can only set 'words in context' and 'results per page', I was thinking that it would be great to have an option that instead of worrying about how many words you may or may not need, shows you as many words as it can fit in say 2-4 lines. So user selects font size and how many lines per result to display and the software shows whatever it can fit there. One could also select, instead of showing 50/50 word content (before/after matched word), to be perhaps 25/75 or 33/67 or any combination really.

And implementing the blinky thing when you get back to original place where you started the search from a findAll search (I will die on this hill) would be REALLY useful.

@Commodore64user
Copy link
Contributor

Because we don't render HTML in the UI (you may be used to expecting that as it's what is done nearly everywhere else nowaydays) - so there is no "fill up the page". We have some UI widgets from the 90s :) where we render plain text - and we only have a "list widget" that displays fixed-height items, that we're reusing. (We're not ready to create a variable height items equivalent widget). So, to make text of variable length fit in this fixed height, there is the current solution to reduce the font size until it fits (or truncate on one of the side, which won't be nice in this context - another solution that doesn't exist and a little bit expensive would be to cut the text string, removing a word on each side, until no font reduction is needed.)

and each of the twenty identified words is 20 characters long...

It may be clumsy, but items per page and nb of words are the only "parameters" we have - so a user will probably choose them depending 1) on his language and how longs are word in that language :) and 2) his eyesight.

oops, just found this, in my defence: it was hidden.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants