
Merge redundant docx nodes to reduce memory footprint #5854

Open
alecgibson opened this issue Oct 25, 2019 · 19 comments

alecgibson commented Oct 25, 2019

It's possible to have some docx files with repeated, redundant styling applied on every word, like so:

<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>without</w:t>
</w:r>
<w:r w:rsidR="00C734B2" w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t xml:space="preserve"></w:t>
</w:r>
<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>backup.</w:t>
</w:r>

When these files are run through Pandoc, it consumes a vast amount of memory (>2GB when processing an 80k-word document).

In contrast, if we copy-paste the contents of this file into a "fresh" document in MS Word and save, running the new document through Pandoc only consumes ~100MB of memory.

Is there any way for Pandoc to be a bit "smarter" when building its AST to find these repeated nodes, and merge them in order to reduce the memory footprint?

I realise that the workaround is trivial, but we're trying to deal with arbitrary user input (always exciting). Technically this is a valid way of representing a document (if also a bit stupid), and it would be great if Pandoc could cope with it in a sensible way.
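
To make the idea concrete, here's a rough sketch of the kind of merging I mean. The Run/RunProps types below are made up for illustration and are not Pandoc's actual internals; the point is just to collapse adjacent runs whose formatting is identical into a single run.

import Data.Text (Text)

-- Hypothetical stand-ins for whatever the docx reader uses internally.
data Run = Run
  { runProps :: RunProps  -- formatting attached to a <w:r>
  , runText  :: Text      -- contents of its <w:t>
  } deriving (Show)

newtype RunProps = RunProps [(Text, Text)]  -- e.g. [("rFonts", "Times New Roman"), ("color", "000000")]
  deriving (Eq, Show)

-- Merge neighbouring runs with equal properties, so thousands of
-- identically-styled one-word runs become a handful of larger ones.
mergeRuns :: [Run] -> [Run]
mergeRuns (a : b : rest)
  | runProps a == runProps b =
      mergeRuns (Run (runProps a) (runText a <> runText b) : rest)
mergeRuns (a : rest) = a : mergeRuns rest
mergeRuns [] = []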

Pandoc version

pandoc 2.7.2
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Console output

pandoc +RTS -s -h -RTS --from=docx --to=json --out=test.json example.docx
  23,178,301,504 bytes allocated in the heap
  11,219,862,864 bytes copied during GC
   2,350,551,128 bytes maximum residency (16 sample(s))
       5,922,728 bytes maximum slop
            2241 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22022 colls,     0 par    4.509s   4.563s     0.0002s    0.0013s
  Gen  1        16 colls,     0 par    6.978s   9.773s     0.6108s    3.2935s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    5.223s  (  5.938s elapsed)
  GC      time   11.487s  ( 14.336s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   16.711s  ( 20.286s elapsed)

  Alloc rate    4,437,656,505 bytes per MUT second

  Productivity  31.3% of total user, 29.3% of total elapsed

Heap Profile

(heap profile screenshot)

Test document

example.docx

jgm commented Oct 26, 2019

@jkr does this suggest anything?

tarleb commented Dec 20, 2020

@alecgibson, would you upload a test file which we can use to analyze the problem?

alecgibson (Author) commented:

@tarleb there's one in the original post; does that not work?

tarleb commented Dec 20, 2020

Argh, I'm just blind. Thanks!

mpickering (Collaborator) commented:

I had a look into this.

  1. It seems that the xml parsing library is not very well optimised, so a lot of thunks are left in the AST once the XML parsing has finished. I'm not sure if there's a better (more optimised) alternative.
  2. Almost all of the allocations come from XML parsing; I think it simply uses more memory because the file is much bigger due to all the redundant styling. It's not clear, though, why it needs to allocate GBs of (,) and : constructors for a 250kB file.
  3. I added some more strictness to the xml library, which flattened the profile after the initial parsing because the whole input stream is no longer retained. As far as I could tell, the input stream was being retained because the attributes field was never forced (see the sketch below).
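
As a rough illustration of what I mean by "more strictness" (modelled loosely on the Element/Attr types in the xml package; this is not the actual patch):

data QName = QName
  { qName   :: !String
  , qURI    :: !(Maybe String)
  , qPrefix :: !(Maybe String)
  } deriving (Show, Eq, Ord)

data Attr = Attr
  { attrKey :: !QName
  , attrVal :: !String
  } deriving (Show, Eq, Ord)

data Content
  = Elem Element
  | Text String
  | CRef String
  deriving (Show)

data Element = Element
  { elName    :: !QName
  , elAttribs :: ![Attr]   -- a strict field here (plus forcing the attribute
                           -- values in the parser) keeps thunks from hanging
                           -- on to the whole input stream
  , elContent :: [Content] -- children can stay lazy
  , elLine    :: !(Maybe Integer)
  } deriving (Show)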

profiles.zip

mpickering commented Dec 20, 2020

I made some strictness changes to xml, which halves the maximum residency.

mpickering/xml@ba346a8

profiles2.zip

These are just all the changes I made; I haven't been scientific about which ones are actually necessary.

I'm not sure what's best to do here; I personally wouldn't want to rely on the xml library.

mb21 commented Dec 20, 2020

Nice work @mpickering! Well, pandoc uses that lib in quite a few places... so I guess it's either:

  1. improve performance upstream (in the lib) by introducing more strictness, and potentially also replacing String with Text
  2. do the above but fork it
  3. write an API-compatible wrapper around another xml lib... do you know of a good and small one?

The nice thing about xml is that it's quite minimal, so doing 1. or 2. potentially sounds like less hassle?
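
If option 3 were pursued, the reading side of the wrapper could look roughly like this, converting xml-conduit's types into xml's types so that call sites keep their current API. The field names below are from Text.XML (xml-conduit) and Text.XML.Light (xml) as I remember them; treat this as an untested sketch.

import qualified Data.Map as M
import qualified Data.Text as T
import qualified Text.XML as C          -- xml-conduit
import qualified Text.XML.Light as L    -- xml

fromConduitElement :: C.Element -> L.Element
fromConduitElement (C.Element name attrs nodes) =
  L.Element
    { L.elName    = fromName name
    , L.elAttribs = [ L.Attr (fromName k) (T.unpack v) | (k, v) <- M.toList attrs ]
    , L.elContent = concatMap fromNode nodes
    , L.elLine    = Nothing
    }

fromName :: C.Name -> L.QName
fromName n = L.QName
  { L.qName   = T.unpack (C.nameLocalName n)
  , L.qURI    = T.unpack <$> C.nameNamespace n
  , L.qPrefix = T.unpack <$> C.namePrefix n
  }

fromNode :: C.Node -> [L.Content]
fromNode (C.NodeElement e) = [L.Elem (fromConduitElement e)]
fromNode (C.NodeContent t) = [L.Text (L.CData L.CDataText (T.unpack t) Nothing)]
fromNode _                 = []  -- drop comments and processing instructions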

tarleb commented Dec 20, 2020

Thanks @mpickering, this is great!

"write an API-compatible wrapper around another xml lib... do you know of a good and small one?"

The only real contenders for parsing seem to be xml-conduit and tagsoup, but we're also using xml for writing XML. So replacing it is difficult.

Anyhow, could tackling this make a good Summer of Code student project for next year?

mpickering (Collaborator) commented:

Perhaps xeno is another option?

jgm commented Dec 20, 2020

citeproc uses xml-conduit. pandoc depends on citeproc. So we could use xml-conduit instead of xml without incurring any more dependencies. And it has a renderer.

jgm commented Dec 20, 2020

But improving the xml library by submitting patches upstream seems a good idea to me in any case. It is also used by texmath and 110 other packages:
https://packdeps.haskellers.com/reverse/xml
So improving it could really help the whole ecosystem (assuming this is not one of those cases where the extra strictness sometimes helps and sometimes hurts...)

jgm commented Jun 4, 2021

Note that recent versions of pandoc use a different xml parsing library than the one that was used in 2.7 (the version originally tested in the report above). I would expect performance to be much better.

jgm commented Jun 4, 2021

OK, just tested with pandoc 2.14.0.1.

  34,687,033,432 bytes allocated in the heap
   5,041,403,792 bytes copied during GC
     889,977,368 bytes maximum residency (13 sample(s))
       5,154,280 bytes maximum slop
            1999 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4119 colls,     0 par    2.698s   2.731s     0.0007s    0.0048s
  Gen  1        13 colls,     0 par    1.784s   2.281s     0.1755s    0.8247s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    8.698s  (  8.759s elapsed)
  GC      time    4.483s  (  5.012s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   13.182s  ( 13.786s elapsed)

  Alloc rate    3,987,722,584 bytes per MUT second

  Productivity  66.0% of total user, 63.5% of total elapsed

Some improvement here but not enough.

jgm commented Jun 4, 2021

I tried adding StrictData to T.P.XML.Light.Types.
This did not affect things much; it actually made things a bit worse.
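
(For reference, StrictData is a module-level extension; the change is roughly the following, though the real module header obviously differs.)

{-# LANGUAGE StrictData #-}
-- With StrictData, every field of every data type declared in this module is
-- strict by default, as if each carried an explicit bang.
module Text.Pandoc.XML.Light.Types where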

jgm commented Jun 4, 2021

Here's a heap profile with 2.14.0.1.

(heap profile screenshot)

jgm commented Jun 4, 2021

Actually this does look like quite an improvement over the original heap profile.

jgm commented Jun 4, 2021

A sample of the intermediate representation created by the docx reader before the AST is constructed:

PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun "foo"]),
PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun " "])

and so on. One thing we could try would be doing a fusion operation on this representation (the Document structure produced by archiveToDocument) before it is converted to a Pandoc. I don't know if this would help.
Probably a better approach would be to do the fusion in the process of parsing a Document.
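
A sketch of what that fusion could look like, using the constructor names visible in the dump above (ParPart and an Eq instance for RunStyle are assumptions about the reader's internals):

-- Fuse adjacent PlainRuns that share the same RunStyle, concatenating their
-- run elements, so thousands of identically styled one-word runs collapse
-- into a few larger ones.
fusePlainRuns :: [ParPart] -> [ParPart]
fusePlainRuns (PlainRun (Run s1 es1) : PlainRun (Run s2 es2) : rest)
  | s1 == s2 = fusePlainRuns (PlainRun (Run s1 (es1 ++ es2)) : rest)
fusePlainRuns (p : rest) = p : fusePlainRuns rest
fusePlainRuns [] = []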

jgm commented Jun 4, 2021

I tried fusing the PlainRuns at the paragraph parsing phase; no help. I think that, as before, the problem is occurring in the XML parser.
