Merge redundant docx nodes to reduce memory footprint #5854

alecgibson · 2019-10-25T11:40:55Z

It's possible to have some docx files with repeated, redundant styling applied on every word, like so:

<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>without</w:t>
</w:r>
<w:r w:rsidR="00C734B2" w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t xml:space="preserve"> </w:t>
</w:r>
<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>backup.</w:t>
</w:r>

Running these files through Pandoc consumes a vast amount of memory (>2 GB when processing an 80k-word document).

In contrast, if we copy-paste the contents of this file into a "fresh" document in MS Word and save, running the new document through Pandoc only consumes ~100MB memory.

Is there any way for Pandoc to be a bit "smarter" when building its AST to find these repeated nodes, and merge them in order to reduce the memory footprint?

I realise that the workaround is trivial, but we're trying to deal with arbitrary user input (always exciting). Technically this is a valid way of representing a document (if also a bit stupid), so it would be great if Pandoc could cope with it in a sensible way.
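To make the kind of merge being asked about concrete, here is a minimal sketch of it as a pre-processing step on the XML, outside Pandoc. It is an illustration only, not Pandoc code: `canon`, `run_key` and `merge_runs` are hypothetical helper names, it only touches runs that consist of an optional `w:rPr` followed by a single `w:t`, and it assumes the `w:rsid*` attributes on the merged runs can be discarded. In a real file the paragraphs would come from `word/document.xml` inside the docx zip archive.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def w(tag):
    """Qualify a tag name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def canon(el):
    """Whitespace-insensitive, hashable form of an element, used to compare w:rPr blocks."""
    return (el.tag, tuple(sorted(el.attrib.items())), (el.text or "").strip(),
            tuple(canon(child) for child in el))


def run_key(run):
    """Style key for a simple run, or None if the run should be left alone."""
    kinds = [child.tag for child in run]
    # Only runs shaped like [w:rPr] w:t are merged; tabs, breaks, drawings,
    # fields and so on are deliberately not handled in this sketch.
    if kinds not in ([w("rPr"), w("t")], [w("t")]):
        return None
    rpr = run.find(w("rPr"))
    return canon(rpr) if rpr is not None else ()


def merge_runs(paragraph):
    """Merge adjacent simple runs of one w:p whose w:rPr blocks are identical."""
    prev, prev_key = None, None
    for child in list(paragraph):
        key = run_key(child) if child.tag == w("r") else None
        if key is not None and key == prev_key:
            target = prev.find(w("t"))
            target.text = (target.text or "") + (child.find(w("t")).text or "")
            # The merged text may now start or end with a space, so keep it.
            target.set(XML_SPACE, "preserve")
            # Assumption: the removed run's w:rsid* attributes are not needed.
            paragraph.remove(child)
        else:
            prev, prev_key = child, key
    return paragraph


SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:color w:val="000000"/></w:rPr><w:t>without</w:t></w:r>
  <w:r><w:rPr><w:color w:val="000000"/></w:rPr><w:t xml:space="preserve"> </w:t></w:r>
  <w:r><w:rPr><w:color w:val="000000"/></w:rPr><w:t>backup.</w:t></w:r>
</w:p>"""

p = merge_runs(ET.fromstring(SAMPLE))
print(len(p.findall(w("r"))), repr(p.find(w("r")).find(w("t")).text))
```

On the three-run sample this leaves a single run whose text is "without backup.". Whether shrinking the XML this way actually brings Pandoc's memory use down for a given file would still need to be measured.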

Pandoc version

pandoc 2.7.2
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Console output

pandoc +RTS -s -h -RTS --from=docx --to=json --out=test.json example.docx
  23,178,301,504 bytes allocated in the heap
  11,219,862,864 bytes copied during GC
   2,350,551,128 bytes maximum residency (16 sample(s))
       5,922,728 bytes maximum slop
            2241 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22022 colls,     0 par    4.509s   4.563s     0.0002s    0.0013s
  Gen  1        16 colls,     0 par    6.978s   9.773s     0.6108s    3.2935s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    5.223s  (  5.938s elapsed)
  GC      time   11.487s  ( 14.336s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   16.711s  ( 20.286s elapsed)

  Alloc rate    4,437,656,505 bytes per MUT second

  Productivity  31.3% of total user, 29.3% of total elapsed

Heap Profile

Test document

example.docx


jgm · 2019-10-26T05:31:30Z

@jkr does this suggest anything?

tarleb · 2020-12-20T10:08:53Z

@alecgibson, would you upload a test file which we can use to analyze the problem?

alecgibson · 2020-12-20T10:11:48Z

@tarleb there's one in the original post; does that not work?

tarleb · 2020-12-20T10:23:54Z

Argh, I'm just blind. Thanks!

mpickering · 2020-12-20T14:44:26Z

I had a look into this.

It seems that the xml parsing library is not very well optimised so that there are a lot of thunks which are left in the AST once the XML parsing has finished. I'm not sure if there's a better (more optimised) alternative.
Almost all the allocations come from XML parsing. I think it uses more memory simply because the file is much bigger, owing to all the redundant styling. It's not clear, though, why it needs to allocate gigabytes of (,) and : constructors for a 250 kB file.
I added some more strictness to the xml library which flattened the profile after the initial parsing because the whole input stream is not retained. As far as I could tell, the whole input stream was being retained because the attributes field was never forced.

profiles.zip

mpickering · 2020-12-20T16:03:08Z

I made some strictness changes to xml, which halves the maximum residency.

mpickering/xml@ba346a8

profiles2.zip

These are simply all the changes I made; I haven't worked out which ones are actually necessary.

I'm not sure what's best to do here; personally, I wouldn't want to rely on the xml library.

mb21 · 2020-12-20T16:11:03Z

Nice work @mpickering! Well, pandoc uses that lib in quite a few places... so guess it's either:

1. improve performance in upstream (the lib) by introducing more strictness, and potentially also replacing String with Text
2. do the above but fork it
3. write an API-compatible wrapper around another xml lib... do you know of a good and small one?

The nice thing about xml is that it's quite minimal, so doing 1. or 2. potentially sounds like less hassle?

tarleb · 2020-12-20T16:36:41Z

Thanks @mpickering, this is great!

> write an API-compatible wrapper around another xml lib... do you know of a good and small one?

The only real contenders for parsing seem to be xml-conduit and tagsoup, but we're also using xml for writing XML. So replacing it is difficult.

Anyhow, could tackling this make a good Summer of Code student project for next year?

mpickering · 2020-12-20T16:56:27Z

Perhaps xeno is another option?

jgm · 2020-12-20T18:24:53Z

citeproc uses xml-conduit. pandoc depends on citeproc. So we could use xml-conduit instead of xml without incurring any more dependencies. And it has a renderer.

jgm · 2020-12-20T18:28:05Z

But improving the xml library by submitting patches upstream seems a good idea to me in any case. It is also used by texmath and 110 other packages:
https://packdeps.haskellers.com/reverse/xml
So improving it could really help the whole ecosystem (assuming this is not one of those cases where the extra strictness sometimes helps and sometimes hurts...)

milahu · 2021-06-04T12:17:49Z

in my case, this blocks text-extraction from docx
related:
https://github.com/Microsoft/Simplify-Docx
https://github.com/mwilliamson/mammoth.js (does merge adjacent text nodes)
https://stackoverflow.com/questions/7752932/simplify-clean-up-xml-of-a-docx-word-document

jgm · 2021-06-04T15:10:42Z

Note that recent versions of pandoc use a different xml parsing library than the one that was used in 2.7 (the version originally tested in the above report). I would expect performance would be much better.

jgm · 2021-06-04T15:12:49Z

OK, just tested with pandoc 2.14.0.1.

  34,687,033,432 bytes allocated in the heap
   5,041,403,792 bytes copied during GC
     889,977,368 bytes maximum residency (13 sample(s))
       5,154,280 bytes maximum slop
            1999 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4119 colls,     0 par    2.698s   2.731s     0.0007s    0.0048s
  Gen  1        13 colls,     0 par    1.784s   2.281s     0.1755s    0.8247s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    8.698s  (  8.759s elapsed)
  GC      time    4.483s  (  5.012s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   13.182s  ( 13.786s elapsed)

  Alloc rate    3,987,722,584 bytes per MUT second

  Productivity  66.0% of total user, 63.5% of total elapsed

Some improvement here but not enough.
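To put numbers on that, the two `+RTS -s` reports in this thread can be compared directly. The sketch below only does arithmetic on figures copied from those reports. The dictionary layout and the `gc_share` helper are made up for illustration, and it assumes the "MB" and "MiB" labels on the total-memory lines refer to comparable units.

```python
# Figures copied from the two `+RTS -s` reports in this thread.
runs = {
    "2.7.2": {
        "allocated": 23_178_301_504,
        "max_residency": 2_350_551_128,
        "total_mem": 2241,   # "MB total memory in use"
        "gc_s": 11.487,
        "total_s": 16.711,
    },
    "2.14.0.1": {
        "allocated": 34_687_033_432,
        "max_residency": 889_977_368,
        "total_mem": 1999,   # "MiB total memory in use"; assumed comparable
        "gc_s": 4.483,
        "total_s": 13.182,
    },
}


def gc_share(stats):
    """Fraction of total CPU time spent in garbage collection."""
    return stats["gc_s"] / stats["total_s"]


old, new = runs["2.7.2"], runs["2.14.0.1"]
residency_ratio = new["max_residency"] / old["max_residency"]
allocation_ratio = new["allocated"] / old["allocated"]
total_mem_ratio = new["total_mem"] / old["total_mem"]

print(f"max residency:   {residency_ratio:.0%} of the 2.7.2 figure")
print(f"bytes allocated: {allocation_ratio:.2f}x the 2.7.2 figure")
print(f"total memory:    {total_mem_ratio:.0%} of the 2.7.2 figure")
print(f"GC share of run: {gc_share(old):.0%} -> {gc_share(new):.0%}")
```

Maximum residency falls to a bit under 40% of the earlier figure and the GC share of the run roughly halves, but total memory in use drops only by about a tenth, and the newer run allocates about one and a half times as many bytes.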

jgm · 2021-06-04T15:18:43Z

I tried adding StrictData to Text.Pandoc.XML.Light.Types.
This did not change things much; if anything, it made them slightly worse.

jgm · 2021-06-04T15:22:30Z

Here's a heap profile with 2.14.0.1.

jgm · 2021-06-04T15:25:28Z

Actually this does look like quite an improvement over the original heap profile.

jgm · 2021-06-04T22:18:47Z

A sample of the intermediate representation created by the docx reader before the AST is constructed:

PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun "foo"]),PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun " "])

and so on. One thing we could try would be doing a fusion operation on this representation (the Document structure produced by archiveToDocument), before it is converted to a Pandoc. I don't know if this would help.
Probably a better approach would be to do the fusion in the process of parsing a Document.
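For readers unfamiliar with the idea, a fusion pass over such a run list could look roughly like this. It is a small Python model, not the docx reader's Haskell code: the `RunStyle` and `PlainRun` classes below are simplified stand-ins that borrow names from the dump above, with `None` playing the role of `Nothing`, and `fuse_runs` is a hypothetical name.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass(frozen=True)
class RunStyle:
    # Only a subset of the fields in the dump above, for brevity.
    is_bold: Optional[bool] = None
    is_italic: Optional[bool] = None
    r_parent_style: Optional[str] = None


@dataclass(frozen=True)
class PlainRun:
    style: RunStyle
    text: str


def fuse_runs(runs):
    """Collapse each stretch of adjacent runs with equal styles into one run."""
    return [
        PlainRun(style, "".join(run.text for run in group))
        for style, group in groupby(runs, key=lambda run: run.style)
    ]


plain = RunStyle()
runs = [
    PlainRun(plain, "foo"),
    PlainRun(plain, " "),
    PlainRun(RunStyle(is_bold=True), "bar"),
]
fused = fuse_runs(runs)
print(fused)
```

Here the first two runs share a style and collapse into one, while the bold run stays separate. The sketch only shows the transformation; it says nothing about whether doing it at this stage lowers peak memory.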

jgm · 2021-06-04T22:42:12Z

I tried fusing the PlainRuns at the paragraph parsing phase; no help. I think that, as before, the problem is occurring in the XML parser.

alecgibson mentioned this issue Oct 25, 2019

Reduce memory usage #3169

Closed

mb21 added format:Docx performance reader labels Oct 25, 2019
