Merge redundant docx nodes to reduce memory footprint #5854
Comments
@jkr does this suggest anything?
@alecgibson, would you upload a test file which we can use to analyze the problem?
@tarleb there's one in the original post; does that not work?
Argh, I'm just blind. Thanks!
I had a look into this.
I made some strictness changes to These are just all the changes I made; I haven't checked which ones are actually necessary. I'm not sure what's best to do here. I wouldn't personally want to rely on the
Nice work @mpickering! Well, pandoc uses that lib in quite a few places... so guess it's either:
The nice thing about
Thanks @mpickering, this is great!
The only real contenders for parsing seem to be xml-conduit and tagsoup, but we're also using Anyhow, could tackling this make a good Summer of Code student project for next year?
Perhaps
But improving the
In my case, this blocks text extraction from docx.
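For readers who only need plain text and cannot wait for a fix, one possible workaround is to bypass pandoc and stream `word/document.xml` out of the docx archive directly. The following is a minimal sketch in Python, not anything pandoc does internally. The function name `iter_paragraph_text` is made up for illustration, and the sketch assumes a simple document: it ignores tabs, line breaks, footnotes, and paragraphs nested inside text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_paragraph_text(docx_file):
    """Yield the text of each w:p element, parsing the XML incrementally.

    Hypothetical helper for illustration; assumes no nested paragraphs.
    """
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open("word/document.xml") as xml_stream:
            parts = []
            for _event, elem in ET.iterparse(xml_stream, events=("end",)):
                if elem.tag == f"{{{W_NS}}}t":
                    # w:t holds the literal text of a run.
                    parts.append(elem.text or "")
                elif elem.tag == f"{{{W_NS}}}p":
                    yield "".join(parts)
                    parts = []
                    # Drop the paragraph's children so that per-word runs
                    # do not accumulate in memory.
                    elem.clear()


# Build a tiny in-memory docx with the same bold styling repeated per run.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>"
    '<w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> world</w:t></w:r>'
    "</w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
sample.seek(0)

paragraphs = list(iter_paragraph_text(sample))
```

Because each paragraph's children are cleared as soon as it has been read, memory use should stay roughly proportional to the largest paragraph rather than to the whole document, however many redundant runs it contains.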
Note that recent versions of pandoc use a different XML parsing library than the one that was used in 2.7 (the version originally tested in the above report). I would expect performance to be much better.
OK, just tested with pandoc 2.14.0.1.
Some improvement here but not enough.
I tried adding StrictData to T.P.XML.Light.Types.
Actually this does look like quite an improvement over the original heap profile.
A sample of the intermediate representation created by the docx reader before the AST is constructed:
and so on. One thing we could try would be doing a fusion operation on this representation (the
I tried fusing the PlainRuns at the paragraph parsing phase; no help. I think that, as before, the problem is occurring in the XML parser.
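To make the fusion idea concrete, here is a rough sketch of merging adjacent runs that carry the same styling. It is written in Python for brevity rather than in pandoc's Haskell, and the `Run` type and `fuse_runs` function are hypothetical stand-ins, not the docx reader's actual `PlainRun` representation. As reported above, fusing at this stage did not reduce memory use, since the cost appears to be paid earlier, in the XML parser.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import FrozenSet, List


@dataclass(frozen=True)
class Run:
    """Hypothetical styled text run; not pandoc's real data type."""
    style: FrozenSet[str]
    text: str


def fuse_runs(runs: List[Run]) -> List[Run]:
    """Merge consecutive runs whose style sets are equal."""
    fused = []
    # groupby only groups *adjacent* items with equal keys, which is what
    # we want: runs separated by a differently styled run stay separate.
    for style, group in groupby(runs, key=lambda run: run.style):
        fused.append(Run(style, "".join(run.text for run in group)))
    return fused


runs = [
    Run(frozenset({"bold"}), "Hello"),
    Run(frozenset({"bold"}), " "),
    Run(frozenset({"bold"}), "world"),
    Run(frozenset(), "!"),
]
fused = fuse_runs(runs)
```

Here the three bold runs collapse into a single `Run` with the text `Hello world`, and the unstyled `!` stays separate.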
It's possible to have some docx files with repeated, redundant styling applied on every word, like so:

When running these files through Pandoc, it consumes a vast amount of memory (>2GB when processing an 80k word document).
In contrast, if we copy-paste the contents of this file into a "fresh" document in MS Word and save, running the new document through Pandoc only consumes ~100MB memory.
Is there any way for Pandoc to be a bit "smarter" when building its AST to find these repeated nodes, and merge them in order to reduce the memory footprint?
I realise that the workaround is trivial, but we're trying to deal with arbitrary user input (always exciting), and technically this is a valid way of representing a document (if also a bit stupid), and it would be great if Pandoc could cope with this in a sensible way.
Pandoc version
Console output
Heap Profile
Test document
example.docx