Reduce memory usage #3169

Open
jgm opened this issue Oct 19, 2016 · 22 comments

@jgm
Owner

commented Oct 19, 2016

Pandoc is a memory hog.
https://groups.google.com/d/msg/pandoc-discuss/l6Xo0xk8NAQ/1KCKPyc2BgAJ

Do some profiling to figure out why and fix this.

@jgm

Owner Author

commented Oct 19, 2016

These tests are with a 143K input file.

[screenshot, 2016-10-19 17:12:45]

[screenshot, 2016-10-19 17:16:48]

Note that the Markdown writer is much more memory-hungry than the HTML writer:

[screenshot, 2016-10-19 17:18:16]

[screenshot, 2016-10-19 17:19:32]

@sergiocorreia

commented Oct 19, 2016

In case you want to profile with other inputs, for panflute I searched github for large md files, and collected some of them:

These files contain a diverse mix of elements, and some have maybe 100,000k elements (I used them to profile panflute, as it was a bit slow).

@jgm

Owner Author

commented Nov 18, 2016

jgm added a commit that referenced this issue Nov 18, 2016

@jgm

Owner Author

commented Nov 18, 2016

Case                     Bytes  GCs  Check
Pandoc document        107,672    0  OK   
markdown reader    116,843,688  227  OK   
html reader         85,325,688  165  OK   
docbook reader      61,925,288  119  OK   
latex reader        58,609,584  114  OK   
commonmark reader    3,053,696    5  OK   
markdown writer     15,704,536   30  OK   
html writer          6,776,128   13  OK   
docbook writer      12,567,392   24  OK   
latex writer         7,387,296   14  OK   
commonmark writer    3,315,184    6  OK   

@tarleb tarleb added the performance label Nov 27, 2016

@ickc

Contributor

commented Dec 20, 2016

There's a file that causes strange behavior when converted with pandoc -s -o -t md: Selected Hymns.docx. The file came from a PDF source and was converted to docx with an online service.

The strange behavior is this: while processing with a ton of CPU and memory usage, the resultant md file is just a newline character.

@sergiocorreia

commented Dec 20, 2016

The strange behavior is this: while processing with a ton of CPU and memory usage

I tried to open it with Word 2016 on Windows on a newish laptop, and gave up after a minute or so. Then I renamed it and opened the zip, and from the look of it the document seems like a mess (word/document.xml seems pretty garbled).

So this might be a problem with Word and not with Pandoc per se...

@ickc

Contributor

commented Dec 20, 2016

I recall that Word 2016 for Mac behaves similarly. I now suspect the online converter got it wrong and returned an invalid (or valid but really bad) docx.

The conversion is not important to me. Still, it is puzzling that pandoc spends a ton of CPU, memory, and time rendering the file, only to produce an empty file with no debug message.

@jgm

Owner Author

commented Jun 19, 2017

As a checkpoint, here's the current result of make weigh:

Case                 Allocated  GCs
Pandoc document        106,024    0
markdown reader    136,916,240  262
html reader        113,248,608  220
docbook reader      61,625,792  119
latex reader        92,679,280  179
commonmark reader    2,236,664    4
markdown writer     20,798,712   40
html writer         12,345,760   23
docbook writer      15,275,296   29
latex writer        19,628,304   37
commonmark writer    2,282,488    4

Mostly things have gotten worse as a result of changes since November 2016.
This needs looking into.
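
To put numbers on "mostly worse", here is a minimal sketch that compares the two checkpoints posted in this thread. It assumes the "Bytes" column of the November 2016 table and the "Allocated" column of the June 2017 table measure the same thing (bytes allocated). The dictionary and function names are made up for illustration and are not part of pandoc or weigh.

```python
# Bytes allocated per case, copied from the two tables in this thread.
# Assumption: "Bytes" (Nov 2016) and "Allocated" (Jun 2017) are comparable.
NOV_2016 = {
    "Pandoc document": 107_672,
    "markdown reader": 116_843_688,
    "html reader": 85_325_688,
    "docbook reader": 61_925_288,
    "latex reader": 58_609_584,
    "commonmark reader": 3_053_696,
    "markdown writer": 15_704_536,
    "html writer": 6_776_128,
    "docbook writer": 12_567_392,
    "latex writer": 7_387_296,
    "commonmark writer": 3_315_184,
}
JUN_2017 = {
    "Pandoc document": 106_024,
    "markdown reader": 136_916_240,
    "html reader": 113_248_608,
    "docbook reader": 61_625_792,
    "latex reader": 92_679_280,
    "commonmark reader": 2_236_664,
    "markdown writer": 20_798_712,
    "html writer": 12_345_760,
    "docbook writer": 15_275_296,
    "latex writer": 19_628_304,
    "commonmark writer": 2_282_488,
}


def relative_change(before, after):
    """Return {case: fractional change}; 0.25 means 25% more allocation."""
    return {case: (after[case] - before[case]) / before[case] for case in before}


changes = relative_change(NOV_2016, JUN_2017)

# Print the cases from largest regression to largest improvement.
for case, delta in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{case:<18} {delta:+8.1%}")
```

By this comparison seven of the eleven cases allocate more than before, with the latex writer showing the largest relative increase.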

@ids1024

commented Sep 15, 2017

Trying to convert this (75 MB) file makes my computer run out of memory (my computer has 16 GB): https://github.com/PerseusDL/lexica/blob/master/CTS_XML_TEI/perseus/pdllex/lat/ls/lat.ls.perseus-eng1.xml

It's not really a big deal, since pandoc probably wouldn't produce useable output for a TEI file of that sort (I was just interested in seeing what it results in), but the memory consumption seems surprising.

@jgm

Owner Author

commented Sep 15, 2017

@ickc

Contributor

commented Sep 16, 2017

Trying to convert this (75 MB) file makes my computer run out of memory (my computer has 16 GB): https://github.com/PerseusDL/lexica/blob/master/CTS_XML_TEI/perseus/pdllex/lat/ls/lat.ls.perseus-eng1.xml

It's not really a big deal, since pandoc probably wouldn't produce useable output for a TEI file of that sort (I was just interested in seeing what it results in), but the memory consumption seems surprising.

What's the command used? pandoc version and platform?

@ids1024

commented Sep 16, 2017

I ran pandoc lexica/CTS_XML_TEI/perseus/pdllex/lat/ls/lat.ls.perseus-eng1.xml -o test.md using pandoc 1.19.2.1 on Arch Linux.

@ickc

Contributor

commented Sep 16, 2017

Interesting:

# pandoc v1.19.1
$ /usr/local/bin/time -v pandoc -t native lat.ls.perseus-eng1.xml -o lat.ls.perseus-eng1.native
pandoc: Stack space overflow: current size 33624 bytes.
pandoc: Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 2
	Command being timed: "pandoc -t native lat.ls.perseus-eng1.xml -o lat.ls.perseus-eng1.native"
	User time (seconds): 253.71
	System time (seconds): 6.41
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 4:20.51
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 53923790848
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3291389
	Voluntary context switches: 3
	Involuntary context switches: 69123
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 23635
	Page size (bytes): 4096
	Exit status: 2
# pandoc master at commit 5849b89
$ /usr/local/bin/time -v /Users/kolen/Downloads/pandoc-osx-5849b89/pandoc -t native lat.ls.perseus-eng1.xml -o lat.ls.perseus-eng1.native
pandoc: Stack space overflow: current size 33624 bytes.
pandoc: Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 2
	Command being timed: "/Users/kolen/Downloads/pandoc-osx-5849b89/pandoc -t native lat.ls.perseus-eng1.xml -o lat.ls.perseus-eng1.native"
	User time (seconds): 4.34
	System time (seconds): 0.58
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.94
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6518554624
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 397980
	Voluntary context switches: 4
	Involuntary context switches: 530
	Swaps: 0
	File system inputs: 1
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 444
	Page size (bytes): 4096
	Exit status: 2

Note that the values labelled kbytes above are actually bytes (a glitch when using GNU time on macOS). The interesting bit is that the current master reaches "Stack space overflow" after about 4 s rather than about 4 min.
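
As a rough sketch of the unit correction, the snippet below normalizes the two "Maximum resident set size" figures from the output above. It relies on getrusage() reporting ru_maxrss in bytes on macOS and in kilobytes on Linux. The helper names `maxrss_bytes` and `gib` are hypothetical.

```python
import sys


def maxrss_bytes(reported, platform=sys.platform):
    """Normalize a max-RSS figure to bytes.

    getrusage() reports ru_maxrss in bytes on macOS ("darwin") and in
    kilobytes on Linux, so only the non-macOS value needs scaling.
    """
    return reported if platform == "darwin" else reported * 1024


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Figures copied from the two macOS runs above.
runs = {
    "pandoc v1.19.1": 53_923_790_848,
    "pandoc master 5849b89": 6_518_554_624,
}

for name, raw in runs.items():
    print(f"{name}: {gib(maxrss_bytes(raw, 'darwin')):.2f} GiB")
```

Read as bytes, the figures come to roughly 50 GiB for v1.19.1 and roughly 6 GiB for master, both reached before the stack overflow.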

@jgm

Owner Author

commented Sep 16, 2017

@ickc

Contributor

commented Sep 16, 2017

Edit: my guess is wrong, see @jgm's reply a minute earlier than mine.

@jgm

Owner Author

commented Aug 3, 2018

See another instance of the same issue here.

Note that in this case using --trace prevented running out of memory. Perhaps forcing evaluation of the intermediate Block data structures helps?

@archonic

Contributor

commented May 27, 2019

I have a background worker in my web app which uses pandoc through pandoc-ruby. Is there a way to configure pandoc so that it does not exceed a certain amount of memory? A docx of just 409 KB is causing evictions on a pod which has 1 GB of allocatable memory.

@ickc

Contributor

commented May 28, 2019

Which OS, Linux?

@archonic

Contributor

commented May 28, 2019

Currently slim but it will be alpine. Thanks @ickc!

@jgm

Owner Author

commented May 28, 2019

@archonic - If pandoc is compiled with -rtsopts (as it should be by default unless someone specifically turns this off), then you can specify the maximum heap size as follows:

pandoc +RTS -M30m -RTS -f markdown -t html MANUAL.txt

The +RTS -M30m -RTS sets max heap size to 30M. Max stack size defaults to 80% heap size.
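
For callers that launch pandoc from a script, the same flags can be passed as separate arguments. This is only a sketch: `pandoc_argv` and `run_limited` are hypothetical helpers, it assumes a `pandoc` binary on the PATH built with -rtsopts, and it assumes pandoc exits with a non-zero status when the heap cap is hit.

```python
import subprocess


def pandoc_argv(args, max_heap="30m", pandoc="pandoc"):
    """Build an argv list that caps the GHC heap via +RTS -M<size> -RTS."""
    return [pandoc, "+RTS", f"-M{max_heap}", "-RTS", *args]


def run_limited(args, max_heap="30m"):
    """Run pandoc under a heap cap and return the CompletedProcess.

    Assumption: a non-zero returncode signals failure, including the
    case where the heap limit was exceeded; the caller should check it.
    """
    return subprocess.run(
        pandoc_argv(args, max_heap), capture_output=True, text=True
    )


# Same invocation as the command line above, as an argv list.
print(pandoc_argv(["-f", "markdown", "-t", "html", "MANUAL.txt"]))
```

Passing the options as list items avoids shell quoting issues. `run_limited` is not called here because it needs pandoc installed.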

@archonic

Contributor

commented May 28, 2019

Awesome. If anyone else is looking for a way to limit memory usage through PandocRuby, just call this before convert:

PandocRuby.pandoc_path = "pandoc +RTS -M30m -RTS"