Start New Page using template and exponential file size increase #404

magpieuk opened this Issue Sep 21, 2012 · 16 comments




We are using or maybe misusing the start_new_page :template facility to merge two PDFs. The source file is approx 300KB and the additional file is 8 pages with a file size of approx 3MB.

            doc_hash = doc[:content]
            doc_pages = doc_hash.page_references.count
            (1..doc_pages).each do |pg|
              start_new_page :template => doc[:content], :template_page => pg
              # draw_text takes the font size via :size
              draw_text "#{doc[:file_name]}: #{doc[:info]}",
                        :at => [bounds.left, bounds.bottom + 10], :size => size_of_font - 2
            end

The document is created fine and the two documents are merged; however, the file size is now 30MB.

I think that every time I call start_new_page, the entire template file is stored instead of just the required page. Is there a way of getting just the single page, or alternatively of storing the template once and referencing it for each page?


It's possible to fix this by getting #import_page to accept an already-read PDF::Reader::ObjectHash.

A first attempt is below, and it produces viewable pdfs. Needs tests etc. One for ruby-northeast? (-:

module ImportFromHash
  # Accepts either a filename or an already-parsed PDF::Reader::ObjectHash,
  # so a template only has to be read and parsed once.
  def import_page(filename_or_hash, page_num)
    @loaded_objects = {}

    if filename_or_hash.is_a?(PDF::Reader::ObjectHash)
      hash = filename_or_hash
    else
      unless File.file?(filename_or_hash)
        raise ArgumentError, "#{filename_or_hash} does not exist"
      end
      hash = PDF::Reader::ObjectHash.new(filename_or_hash)
    end
    ref = hash.page_references[page_num - 1]

    # nil when the requested page does not exist in the template
    ref.nil? ? nil : load_object_graph(hash, ref).identifier

  rescue PDF::Reader::MalformedPDFError, PDF::Reader::InvalidObjectError
    msg = "Error reading template file. If you are sure it's a valid PDF, it may be a bug."
    raise Prawn::Errors::TemplateError, msg
  rescue PDF::Reader::UnsupportedFeatureError
    msg = "Template file contains unsupported PDF features"
    raise Prawn::Errors::TemplateError, msg
  end
end

prawnpdf member

@paulcc that's a nice solution.

I also have a branch [1] that allows PDF pages to be imported via the #image() method (like PNGs and JPGs). I had planned to work on merging it upstream once prawn 1.0 ships.


prawnpdf member

I did not test @paulcc's solution, but I don't see how it would avoid the excessive PDF file size, since it looks like it would not avoid the calls to load_object_graph.

I have another possible solution implemented here: jonsgreen@59a521e. Not the prettiest so I am open to suggestions.

I also want to apologize for having released the page template feature with this gaping issue and for not having given it any attention until now in spite of many complaints about the file size problem.



My fix solves magpieuk's issue because it only reads and parses the file once. (We're using the call to embed pdf docs inside a pdf doc, hence reading distinct pages in sequence - plus we're reading the pages from a parsed pdf doc.)

It won't prevent repeat calls to load_object_graph if the same page is used repeatedly - which jonsgreen's fix will handle - though only if reading from a filename.

One thought though - why not memoize reads from streams or parsed pdf objects as well? It gives more flexibility and would simplify the code.
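
A minimal sketch of that memoization idea, not Prawn's actual code. `TemplateCache` and its loader block are hypothetical names; the loader stands in for whatever parses the template (for example `PDF::Reader::ObjectHash.new`). Filenames are keyed by their string value, anything else by `object_id`:

```ruby
# Hypothetical helper: caches one parsed template per source so that
# repeated page imports reuse the same parsed object.
class TemplateCache
  def initialize(&loader)
    @loader = loader # stand-in for the real parser
    @parsed = {}
  end

  # Filenames are keyed by value; streams and already-parsed objects
  # are keyed by identity.
  def key_for(source)
    source.is_a?(String) ? [:file, source] : [:object, source.object_id]
  end

  # Runs the loader only the first time a given source is seen.
  def fetch(source)
    @parsed[key_for(source)] ||= @loader.call(source)
  end
end

calls = 0
cache = TemplateCache.new { |src| calls += 1; "parsed(#{src})" }
3.times { cache.fetch("template.pdf") }
puts calls # 1: the loader ran once for three fetches
```

One caveat with identity keys: the cache here does not hold on to the source itself, so an `object_id` could in principle be reused after the original object is garbage collected.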


prawnpdf member

So here is another attempt that indexes both the page template identifiers and the object hash incorporating both ideas: jonsgreen@bd6c1c3.

I am not sure how great it is to be indexing the IO by an md5 hash but it seemed like a cleaner key than the stream itself. I am open to suggestions.

This gets the same filesize for me as with paulcc's fix for sequential pages but it gets increasingly better results for repeated pages because of the reduced calls to load_object_graph.

I have to confess that I am still puzzled by why it helps to use the same hash when calling load_object_graph for sequential pages; frankly I get a bit lost figuring out what everything is pointing to. I wonder whether something is not quite right with how that all works, but I would need @yob's help in deciphering the intricacies of that method.

What do folks think of all this?

prawnpdf member

So there's a nonzero cost to MD5ing the contents of the template every time it is to be used (of course, it's still better than embedding the thing multiple times!). But is there an advantage over simply indexing by the IO object itself or its object_id? Is the concern that the caller might be opening the same file (or other stream) multiple times and acquiring different handles that point to the same content? I may be misunderstanding.
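
To make the trade-off concrete, here is a small sketch using only the Ruby standard library. `content_key` and `identity_key` are hypothetical helper names, and the `StringIO` objects stand in for two handles opened on the same template:

```ruby
require 'digest/md5'
require 'stringio'

# Key derived from the bytes: every call reads the whole stream,
# so the cost grows with the template size.
def content_key(io)
  io.rewind
  digest = Digest::MD5.hexdigest(io.read)
  io.rewind # leave the stream ready for the next reader
  digest
end

# Key derived from the handle: cheap, but distinct for each handle.
def identity_key(io)
  io.object_id
end

# Two separate handles onto identical content.
first  = StringIO.new("%PDF-1.4 sample bytes")
second = StringIO.new("%PDF-1.4 sample bytes")

puts content_key(first) == content_key(second)   # true
puts identity_key(first) == identity_key(second) # false
```

So keying by `object_id` only deduplicates when the caller passes the same handle each time, while keying by digest also catches separately opened copies at the price of reading the content.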

prawnpdf member

I had thought of using the #object_id but was not sure about the reliability of that having never really used that fundamental property of an object much.

I will give that a try and see if it simplifies matters and cleans things up.

prawnpdf member

I changed my commit to use object_id and submitted a pull request. Definitely nice to not need to MD5 contents unnecessarily.

prawnpdf member

@jonsgreen's #418 merged to master. Does this resolve the original issue?

prawnpdf member

I am closing this ticket since the merged fix should resolve the original issue.

@jonsgreen closed this Nov 3, 2012

Sorry for not replying earlier. We've not had a chance to test the fix in our system, but the code certainly looks ok.

Thanks for helping.


Did anyone actually test the proposed solution?
I'm running 1.0.0.rc2 and I still have the same problem: the whole template size is being added to output file even though only one template page is being used.

That is not a problem if the template is not that big; however as I'm trying to take advantage of the template feature to do something different, I'm generating HUGE files (e.g. 151MiB) from small ones (e.g. 1.9MiB).

What I want is to add a standard header to all pages of any given PDF.
The header is being added using Prawn's drawing functions.

So what I am doing is:

  • loading the existing PDF as a template
  • iterating through each of its pages
  • drawing the header using Prawn's drawing functions
  • generating a new PDF file with headers

It works, but as I said the output file is huge: its size equals original_pdf.size * original_pdf.pages.count.

Even though I'm aware Prawn was not designed to work with existing PDFs, I cannot think of another way of doing what I want other than using Prawn.

I'm sorry if I should've found this solution somewhere else but I couldn't find it anywhere; and boy, I looked for it...

And here's the code:

require "prawn"
require "pdf/reader"

original_file_path = "/path/to/original/file"
# PDF::Reader::ObjectHash is the class that exposes #page_references
original_file = PDF::Reader::ObjectHash.new(original_file_path)
pages = original_file.page_references.count

new_file = Prawn::Document.new(skip_page_creation: true) do |file|
  (1..pages).each do |page|
    file.start_new_page template: original_file_path, template_page: page
    add_header(file) # draws a header using Prawn drawing functions
  end
end

new_file.render_file "/path/to/new/file"

I'd be glad if someone could point out what I'm doing wrong.
However, if this is still a bug, is there a way of working around it?



Sorry, I've still not tested the code that was merged in. I've just re-read the commit, and it all still looks sensible. You could try my original attempt in #404 (comment) (but note a missing '=' in the "hash" line).

Is the blow-up linear (proportional to the #pages in the original) or is it worse than that? I'm wondering whether there's some sharing of objects in your original files which is being lost when you extract pages, eg if page 1 contained objects [A,B,C] where B and C were used on all pages, then the imported version might have its own duplicated copies of B and C etc. IIRC prawn is not too clever about sharing.
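
As a rough illustration of that sharing concern, here is a toy model in plain Ruby. It is not Prawn's importer: the page layout and the method names `import_without_sharing` and `import_with_sharing` are made up for the example, and strings stand in for PDF objects.

```ruby
# Hypothetical page -> objects map: B and C are used on every page.
pages = {
  1 => %w[A B C],
  2 => %w[D B C],
  3 => %w[E B C]
}

# Copies each page's objects independently, so B and C are duplicated
# once per page.
def import_without_sharing(pages)
  pages.values.flat_map { |objects| objects.map(&:dup) }
end

# Remembers which objects were already copied and reuses them.
def import_with_sharing(pages)
  copied = {}
  pages.values.flatten.each { |name| copied[name] ||= name.dup }
  copied.values
end

puts import_without_sharing(pages).size # 9
puts import_with_sharing(pages).size    # 5
```

In this model the duplication grows with the number of pages that use the shared objects, which is the kind of growth the object counts could help confirm or rule out.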

You might get some clues by counting how many objects and references to them are in the pdfs, eg try these on the original file then the generated file. Let us know what you find!

egrep -a '^[0-9]+ 0 obj' foo.pdf | wc -l
egrep -a '[0-9]+ 0 R' foo.pdf | wc -l
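
If egrep is not to hand, roughly the same counts can be taken in Ruby. This is a sketch with a hypothetical helper name, `pdf_object_stats`. Note one difference: `wc -l` counts matching lines, whereas `scan` counts every occurrence, so a line holding several references is counted several times here.

```ruby
# Counts indirect object definitions ("12 0 obj") and references ("12 0 R")
# in raw PDF bytes. Only generation-0 objects are matched, as in the
# egrep patterns.
def pdf_object_stats(bytes)
  data = bytes.dup.force_encoding(Encoding::BINARY)
  {
    objects:    data.scan(/^\d+ 0 obj/).size,
    references: data.scan(/\d+ 0 R/).size
  }
end

sample = "1 0 obj\n<< /Kids [2 0 R 3 0 R] >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
stats = pdf_object_stats(sample)
puts stats[:objects]    # 2
puts stats[:references] # 2
```

For a real file, something like `pdf_object_stats(File.binread("foo.pdf"))` could be run on the original and then on the generated PDF to compare the two.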


(re-posted - original formatting went awol)


@paulcc I think the problem is "simpler" than that. The final size is proportional to the number of pages in the original. As I said before, it's exactly equal to the number of pages of the original file times the original file size.
It's as if, even though I'm using only the first page of the template, the whole template is being attached to the final file and only the selected page is shown. So I don't think this problem is related to shared objects.

Anyway, I have also tried your solution (and I had noticed the missing equals sign 😄) but I had no success.

Thanks for helping me, by the way.


When @paulcc asked me to actually send the files and count their objects I decided to take a smaller PDF and edit it using Prawn. Then I noticed just a couple of bytes were added (due to the edits I had made); so it worked. So I decided to take a 2MiB file and guess what? It worked again.

I'm really not sure what happened. I was using version 0.12.0 and then updated to 1.0.rc2; that is all that changed in my code. Before posting here I made sure I was running the new version: I restarted the server, ran bundle, and ran some tests, and the huge files were still being generated. In the meantime I restarted my system, so maybe it was a caching issue; I really don't know. Or maybe I was just careless with one of the update steps. :/

I'm so sorry for wasting your time and at the same time so thankful for your cooperation.
Have a great weekend, all. 👍
