
Exporting highlights runs out of memory #10745

Open · rixx opened this issue Jul 25, 2023 · 20 comments

rixx commented Jul 25, 2023

  • KOReader version: 202306
  • Device: Kobo Libra 2

Issue

When I try to export all highlights on my device, the process crashes when it runs out of memory. This happens with all three plain text export formats (Text, My Clippings, Markdown), whereas the JSON export works perfectly fine (which is maybe a place to start debugging?).

The text exports still produce a partial result that just stops partway through the export. Here are the file sizes of the partial exports, plus the successful JSON export:

ls -ahl clipboard/
-rw-r--r-- 1 rixx rixx 1.1M Jul 25  2023 2023-07-25-17-24-07-all-books.txt
-rw-r--r-- 1 rixx rixx 307K Jul 25 17:48 2023-07-25-17-47-51-myClippings.txt
-rw-r--r-- 1 rixx rixx 2.8M Jul 25 17:49 2023-07-25-17-49-33-all-books.md
-rw-r--r-- 1 rixx rixx  11M Jul 25 17:50 2023-07-25-17-50-23-all-books.json

Steps to reproduce

[Step 0: have a large number of highlights. I realise that I'm part of the problem.]

  1. Go to Tools -> Export highlights
  2. Go to "Choose formats and services", make sure only "Text" (or: "Markdown", or: "My Clippings") is selected.
  3. Go back and select "Export all notes in this book"
  4. Wait a while; the device seems unresponsive.
  5. Pop-up appears: "Exporting may take several seconds..."
  6. Crash screen appears. Last (and only relevant) log line is "./luajit: not enough memory"

I realise that having a ton of highlights is my fault; however, I do think KOReader should be able to handle exports larger than the available memory, at least for the plain text formats (Text, Markdown, My Clippings, etc.), since those permit sequential writes, and there is no real need to keep the entire set of highlights in memory to produce them.
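
For illustration only, a minimal sketch of that idea, assuming booknotes tables shaped roughly like the exporter's (a title plus nested entries of clippings with a text field); this is not the plugin's actual code:

-- Hypothetical sketch: write each clipping to disk as it is formatted,
-- so peak memory is roughly one clipping rather than the whole collection.
local function stream_text_export(path, all_booknotes)
    local file = io.open(path, "w")
    if not file then return false end
    for _, booknotes in ipairs(all_booknotes) do
        if booknotes.title then
            file:write(booknotes.title, "\n\n")
        end
        for _, entry in ipairs(booknotes) do
            for _, clipping in ipairs(entry) do
                if clipping.text then
                    file:write(clipping.text, "\n")
                end
                file:write("-=-=-=-=-=-\n")
            end
        end
    end
    file:close()
    return true
end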

pazos (Member) commented Jul 25, 2023

See https://github.com/koreader/koreader/blob/master/plugins/exporter.koplugin/base.lua#L87

The table of booknotes for all books is kept in memory while the exporter formats it to match its output.

In the case of the text exporter, it just concatenates each note onto one big string: https://github.com/koreader/koreader/blob/master/plugins/exporter.koplugin/target/text.lua#L12-L39

The JSON exporter works slightly differently: it builds an internal representation of the booknotes in its own table and then uses that table to encode the JSON file with rapidjson.

The rapidjson Lua binding is also known to cause OOM on its own (see #7790 (comment)), as it doesn't support partial encoding from a stream.

tl;dr:

If rapidjson doesn't OOM, then the exporters that do are simply not optimized. Given that nobody reported this issue until now, it looks like the people who use this plugin are not heavy highlighters.

Patches are very welcome.

pazos added the bug, Plugin, and help-wanted labels on Jul 25, 2023
offset-torque commented

@pazos I am asking for this information to be added to the guide.

If a user has this problem, currently their options for getting their highlights are:

  • Export as JSON
  • Use a third party app like KOHighlights or Calibre

Is this correct?

pazos (Member) commented Jul 30, 2023

@offset-torque: not sure it's worth mentioning in the guide, other than a link to this ticket.

But you tell me: go to your most highlighted document and try to export to text. Does it OOM?

If a user has this problem

Then please attach the book and the metadata sidecar here, so we can reproduce the ticket and eventually fix it :)

I wouldn't mention workarounds directly in the user guide. Not because I hate users, but because I hate the idea of affected users (1? 10? 100?) getting some sort of workable workaround that lets this issue stay unsolved.

rixx (Author) commented Jul 30, 2023

But you tell me: go to your most highlighted document and try to export to text. Does it OOM?

How do I determine the most highlighted document? I guess I can parse the JSON with some code and look for it, though I've since deleted some books in an attempt to fix the problem. Do you mean "most highlighted" by highlight count, character count, …?

pazos (Member) commented Jul 30, 2023

@rixx: any document that gets an OOM is enough. If you have one and you're OK with sharing it, please send me an email. I can try to reproduce it on a VM.

rixx (Author) commented Jul 30, 2023

Good news: my export still runs OOM. Bad news: I don't think any individual book runs OOM:

  • The book with the most highlights has 819 of them, and exporting it to text does not run OOM.
  • The book with the largest character count has 407,400 characters in highlights; exporting it does not run OOM either.

(In my defense, these are very long, dense books and I use highlights to write up summaries.)

pazos (Member) commented Jul 30, 2023

Good news: my export still runs OOM. Bad news: I don't think any individual book runs OOM:

  • The book with the most highlights has 819 of them, and exporting it to text does not run OOM.
  • The book with the largest character count has 407,400 characters in highlights; exporting it does not run OOM either.

(In my defense, these are very long, dense books and I use highlights to write up summaries.)

Could you log collectgarbage("count") just before https://github.com/koreader/koreader/blob/master/plugins/exporter.koplugin/target/text.lua#L48 and try to reproduce the OOM with the full history, targeting plain text?
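
For reference, the requested logging could look something like this, placed right before the linked line in text.lua (a sketch assuming KOReader's "logger" module; a plain print() would also do):

-- Hypothetical sketch: report the Lua heap size (in KB) just before the write.
local logger = require("logger")
logger.warn("text exporter: Lua heap before write:", collectgarbage("count"), "KB")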

pazos (Member) commented Jul 30, 2023

Also, @rixx, try with this patch:

diff --git a/plugins/exporter.koplugin/target/text.lua b/plugins/exporter.koplugin/target/text.lua
index dadc22849..6611fad46 100644
--- a/plugins/exporter.koplugin/target/text.lua
+++ b/plugins/exporter.koplugin/target/text.lua
@@ -9,6 +9,39 @@ local TextExporter = require("base"):new {
     mimetype = "text/plain",
 }
 
+local function fformat(booknotes)
+    local wide_space = "\227\128\128"
+    local tbl = {}
+
+    if booknotes.title then
+        table.insert(tbl, wide_space .. booknotes.title)
+        table.insert(tbl, wide_space)
+    end
+
+    for ___, entry in ipairs(booknotes) do
+        for ____, clipping in ipairs(entry) do
+            if clipping.chapter then
+                table.insert(tbl, wide_space .. clipping.chapter)
+                table.insert(tbl, wide_space)
+            end
+            local text = T(_("-- Page: %1, added on %2\n"), clipping.page, os.date("%c", clipping.time))
+            table.insert(tbl, wide_space .. wide_space .. text)
+            if clipping.text then
+                table.insert(tbl, clipping.text)
+            end
+            if clipping.note then
+                table.insert(tbl, "\n---\n" .. clipping.note)
+            end
+            if clipping.image then
+                table.insert(tbl, _("<An image>"))
+            end
+            table.insert(tbl, "\n-=-=-=-=-=-\n")
+        end
+    end
+    return tbl
+end
+
+
 local function format(booknotes)
     local wide_space = "\227\128\128"
     local content = ""
@@ -45,8 +78,8 @@ function TextExporter:export(t)
     local file = io.open(path, "a")
     if not file then return false end
     for __, booknotes in ipairs(t) do
-        local content = format(booknotes)
-        file:write(content)
+        local tbl = fformat(booknotes)
+        file:write(table.concat(tbl, "\n"))
     end
     file:close()
     return true

That should be much more performant, in both memory allocations and time spent doing garbage collection.
If that still OOMs, then meh :/. If it does not OOM, then please export a single document with and without the patch and compare the output files. There are a few newlines and indentations that are difficult to reproduce without hacky code.

rixx (Author) commented Jul 30, 2023

@pazos Nicely done, it works with your patch! Sorry, I don't have time to compare the different outputs atm though.

pazos (Member) commented Jul 30, 2023

@pazos Nicely done, it works with your patch!

Great, thanks for the feedback.

Sorry, I don't have time to compare the different outputs atm though.

Not in a hurry here :)

When/if you find some time, please check with clips that contain chapters, notes, and images (I don't really know if the latter is possible :p)

rixx (Author) commented Jul 30, 2023

Will do!

While I'm here, is there a way to increase log levels? My Kobo has been experiencing random KOReader crashes since I got it, and crash.log never indicates anything except the usual restart notices. If I can, I'd like to figure out what is going on. There's no obvious relation to time, or even to use (it has sometimes happened when not in use, sometimes after hours, sometimes 5 minutes after the last crash).

pazos (Member) commented Jul 30, 2023

In the File Manager: tools -> more tools -> developer options -> verbose logs.

It is a known issue on the Kobo Libra 2. See open tickets like #8414 and #10713.

You can try some of the workarounds in there. Sadly, IIRC, there's nothing to fix the underlying issue, since we're talking about kernel deadlocks here :/

offset-torque commented

I see it is already solved, pazos, that was fast. So no need to add anything to the guide.

pazos (Member) commented Jul 31, 2023

Sadly it isn't fixed yet.

Markdown needs a similar change in prepareBookContent.

My Clippings needs one in format.

Pinging @uroybd and @Mochitto as the creators (and hopefully users) of those exporters :)

To avoid OOM, at least to a certain extent, we need to remove all string concatenation in inner loops.

String concatenation in a loop makes a bunch of allocations.

So we need to replace

local content = ""
for index, notes in ipairs(t) do
  content = content .. "new content\n"
end

with

local tbl = {}
for index, notes in ipairs(t) do
  table.insert(tbl, "new content")
end
local content = table.concat(tbl, "\n")

I think Memos is OK as is, but let's ping @fierceX as well :)
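
As a rough editorial illustration of why this matters (a standalone sketch, not code from the thread): every `..` in the loop allocates a new string holding everything accumulated so far, so building the full output copies a quadratic amount of data and keeps the garbage collector busy, whereas collecting pieces in a table and calling table.concat once copies each piece only a couple of times.

-- Standalone sketch comparing the two patterns; timings vary by device,
-- but the ".." loop grows roughly quadratically with the amount of text,
-- while the table.concat version stays roughly linear.
local pieces = {}
for i = 1, 20000 do
    pieces[i] = "highlight text #" .. i
end

local t0 = os.clock()
local content = ""
for _, s in ipairs(pieces) do
    content = content .. s .. "\n"
end
print(string.format("concat in loop: %.2fs", os.clock() - t0))

t0 = os.clock()
local tbl = {}
for _, s in ipairs(pieces) do
    table.insert(tbl, s)
end
local content2 = table.concat(tbl, "\n") .. "\n"
print(string.format("table.concat:   %.2fs", os.clock() - t0))

-- Both approaches produce the same output.
assert(content == content2)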

uroybd (Contributor) commented Jul 31, 2023

I can open a PR containing the changes @pazos has recommended.

Is there any other scope for optimizations?

Mochitto (Contributor) commented

Will take care of the patch for My Clippings as well later today :) Thanks for solving it, @pazos

pazos (Member) commented Jul 31, 2023

Thank you both. We're not in a hurry. Just please check, while you're at it, that the new code produces the same output as the old one, or that you're OK with the differences.

I can't test myself as I don't highlight on KO.

Is there any other scope for optimizations?

I bet there are. Not relevant to this ticket, though.

smasher816 commented

I can confirm this is an issue with the current v2023.08 version.

For reference, I have 11k highlights (I collect phrases).

$ jq -r '.entries | .[] | .text' mybook.json | wc -l
   11120

I have not tested the patch, but I can confirm that using the JSON exporter works (while the txt exporter crashes).
Using the above jq command, I am able to convert the JSON into a txt file for importing into Anki, which works for my needs.

rixx (Author) commented Jan 17, 2024

Alright, I finally got around to comparing the outputs of the two code paths. There is a difference in output: in the new, patched version, there are added newlines surrounding the separator lines (-=-=-=-=-=-). I've attached the (shortened) export of a single book's highlights here for comparison.
highlights-unpatched.txt
highlights-patched.txt

--- /home/rixx/highlights-unpatched.txt	2024-01-17 20:28:45.396911516 +0100
+++ /home/rixx/highlights-patched.txt	2024-01-17 20:28:35.416853066 +0100
@@ -3,11 +3,15 @@
  Special Collections
  
   -- Page: 3, added on Wed Jan 17 12:57:13 2024
+
 Jude had been teaching Engineering Communication and Design online for so long that the temporary virtual classroom seemed permanent. Every year or two someone in the department brought up in-person classes, but he wasn’t surprised when the state of emergency extended another year, and they continued to wait for a return to normal that never came.
+
 -=-=-=-=-=-
+
  Special Collections
  
   -- Page: 4, added on Wed Jan 17 12:58:19 2024
+
 Twenty years before, he had lectured them about communication standards in complex infrastructure projects, and told them that they would spend their entire careers writing, and that they should learn to do it well. But now, Jude had no answer that meant anything to them.
--=-=-=-=-=-
 
+-=-=-=-=-=-

pazos (Member) commented Jan 18, 2024

Thank you, @rixx

there are added newlines surrounding the separator lines (-=-=-=-=-=-)

These are easy to change. Just see my patch above and remove both "\n" before and after the separator lines.

Feel free to submit a PR if you're happy with the changes :)
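
Concretely, in the fformat helper from the patch above, that would presumably mean changing the separator entry to something like the line below (a sketch only; the result should still be diffed against the unpatched output):

-- Let table.concat(tbl, "\n") supply the surrounding line breaks instead of
-- embedding "\n" around the separator itself.
table.insert(tbl, "-=-=-=-=-=-")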
