Export to PDF format: Include the text in the PDF #135

jflesch · 2013-03-16T21:14:56Z

Currently, when we export a document to the PDF format, it only contains the image of the document. Including the text would be nice.

jflesch · 2013-05-06T08:49:06Z

One corner case not to forget: imported PDFs with no text (scanned PDFs). OCR is run on them as well, so to export them we cannot just copy the PDF file as is done currently. They must be rebuilt to include the OCR text.

jflesch · 2013-12-30T17:32:26Z

Since this job will probably require a more dedicated library, it would also be a good opportunity to try to reduce the size of the PDFs.

zdenop · 2013-12-30T19:22:32Z

Have a look at these links:

https://github.com/jbrinley/HocrConverter/blob/master/HocrConverter.py
https://code.google.com/p/hocr-tools/source/browse/hocr-pdf
http://documentup.com/virantha/pypdfocr/

jflesch · 2013-12-30T19:32:52Z

Note to myself: pypdfocr uses reportlab.pdfgen.canvas.Canvas

jflesch · 2013-12-30T19:33:40Z

(same thing for the 2 others)

benoit-pierre · 2014-01-05T16:02:34Z

I made a small shell script to export a scanned Paperwork document to PDF: hocr2pdf (from the exactimage package on Ubuntu) is run on each page, and pdfunite creates the final PDF. It seems to work pretty well. Also interesting: the final PDF is much smaller than the one produced by Paperwork's export feature, 4.9 MB instead of 21 MB.

Here is the script:

#! /bin/sh

if [ 3 -ne $# ]
then
  echo 1>&2 "Usage: $0 <resolution in dpi> <input directory> <output pdf filename>"
  exit 1
fi

dpi="$1"
input="$2"
output="$3"
tmpdir="`mktemp -t -d export2pdf.XXXXXXXXXX`" || exit $?

code=1
pages=''
pagenum=1

while true
do
  image="$input/paper.$pagenum.jpg"
  if [ ! -r "$image" ]
  then
    break
  fi

  words="$input/paper.$pagenum.words"
  if [ ! -r "$words" ]
  then
    words='/dev/null'
  fi

  echo "Processing page $pagenum..."
  hocr2pdf -r "$dpi" -i "$image" -o "$tmpdir/$pagenum.pdf" <"$words"
  code=$?

  if [ 0 -ne $code ]
  then
    break
  fi

  pages="$pages $tmpdir/$pagenum.pdf"
  pagenum=$((pagenum+1))
done

if [ 0 -eq $code ]
then
  echo "Creating final PDF..."
  pdfunite $pages "$output"
  code=$?
fi

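# Remove the temporary directory and everything in it, including any
# partial output left behind by a failed hocr2pdf run (rmdir would
# refuse to delete a non-empty directory).
rm -rf "$tmpdir"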

exit $code
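
A possible variation on the cleanup at the end of the script, as a minimal sketch: the temporary directory is tied to a subshell, so it is removed on every exit path of that subshell, including a failing command. `with_tmpdir` is a hypothetical helper name, not part of the script above or of Paperwork, and it assumes a `mktemp` that supports `-d`, as the script already does.

```sh
#! /bin/sh

# Hypothetical helper (not part of the original script): run a command with
# the path of a fresh temporary directory appended as its last argument, and
# remove that directory once the command has finished.
with_tmpdir() {
  (
    tmpdir="$(mktemp -d)" || exit $?
    # The EXIT trap runs when this subshell exits, whether the command
    # succeeded or failed.
    trap 'rm -rf "$tmpdir"' EXIT
    "$@" "$tmpdir"
    # Propagate the command's exit status to the caller.
    exit $?
  )
}
```

For example, `with_tmpdir ls -ld` would list the temporary directory while it still exists; by the time `with_tmpdir` returns, the directory is gone.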

zdenop · 2014-01-05T17:23:21Z

If you are interested in small PDFs, then have a look at https://github.com/agl/jbig2enc (it does not add a text layer, but it should be possible to combine pdf.py from jbig2enc with the above-mentioned projects).

Interesting experiences with scanning, OCR, jbig2enc, hOCR, PDF... can be found on this blog: http://ssdigit.nothingisreal.com/

akarzim · 2014-11-26T18:09:13Z

Another way to drastically reduce the size of the produced PDF is to pipe it through pdf2ps and ps2pdf:

pdf2ps - - | ps2pdf - -

But this way we lose the OCR data.
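
To check whether a given PDF still carries a text layer after such a round trip, one option is to look at what `pdftotext` (from poppler-utils, the same package that ships `pdfunite`) extracts from it. A minimal sketch, where `has_text` is a hypothetical helper that reads the extracted text on standard input and succeeds only if it contains at least one letter or digit:

```sh
#! /bin/sh

# Hypothetical helper: succeed if the text on standard input contains at
# least one alphanumeric character. Blank lines and page separators
# (form feeds) alone do not count as text.
has_text() {
  grep -q '[[:alnum:]]'
}

# Intended use (assumes pdftotext is installed; "-" sends the extracted
# text to standard output):
#   pdftotext input.pdf - | has_text && echo "text layer present"
```

An image-only PDF typically yields nothing but whitespace and form feeds, in which case `has_text` fails.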

jflesch · 2016-06-29T16:07:11Z

Well, actually, Cairo can do it ... :)
openpaperwork/paperwork-backend@38632cd

r0bis · 2016-07-18T12:30:18Z

Well, including the text in any form would be a very good feature. If one needs to further optimise or edit the PDF, one could use MasterPDF Editor for Linux. Just getting the text in would be a huge step forward. I am using version 0.3.2 and it only generates image-only PDFs; no "selectable" or searchable text is found.

jflesch · 2016-07-18T13:29:41Z

I understand. What I tried to say in my comment above is that I've added this feature in paperwork-backend. In other words, it will be available in Paperwork 0.4.
(The problem is that, for now, Paperwork 0.4 is still very unstable when scanning ... :/)

r0bis · 2016-07-18T17:04:58Z

That is great, sorry I misunderstood. Fingers crossed that the instability in scanning will be resolved. This is a great piece of software. By the way, I was just curious about a design choice: I have not seen a way to add a document title; is all written identification effectively done through tags and additional text? It is nice to spend the absolute minimum of time on document processing when scanning. On the other hand, document titles / file names are such a time-honored tradition... :) I am thinking that when the number of documents reaches the hundreds, names might be handy.

jflesch mentioned this issue Mar 16, 2013

Export PDF #44

Closed

jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015

jflesch closed this as completed Jun 29, 2016
