This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Export to PDF format: Include the text in the PDF #135

Closed
jflesch opened this issue Mar 16, 2013 · 12 comments

Comments


jflesch commented Mar 16, 2013

Currently, when we export a document to the PDF format, it only contains the image of the document. Including the text would be nice.

@jflesch jflesch mentioned this issue Mar 16, 2013

jflesch commented May 6, 2013

One corner case not to forget: imported PDFs with no text (scanned PDFs). OCR is run on them as well. To export them, we cannot just copy the PDF file, as is done currently; they must be rebuilt to include the OCR text.


jflesch commented Dec 30, 2013

Since this job will probably require a more dedicated library, it would also be a good opportunity to try to reduce the size of the exported PDFs.


jflesch commented Dec 30, 2013

Note to myself: pypdfocr uses reportlab.pdfgen.canvas.Canvas


jflesch commented Dec 30, 2013

(same thing for the 2 others)

@benoit-pierre

I made a small shell script to export a scanned Paperwork document to PDF: it runs hocr2pdf (from the exactimage package on Ubuntu) on each page, then uses pdfunite to create the final PDF. It seems to work pretty well. Also interesting: the final PDF is much smaller than the one produced by Paperwork's export feature, 4.9 MB instead of 21 MB.

Here is the script:

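#!/bin/sh
# Export a scanned Paperwork document directory to a single PDF with a text layer.

if [ 3 -ne $# ]
then
  echo 1>&2 "Usage: $0 <resolution in dpi> <input directory> <output pdf filename>"
  exit 1
fi

dpi="$1"
input="$2"
output="$3"
tmpdir="$(mktemp -t -d export2pdf.XXXXXXXXXX)" || exit $?

# Reuse the positional parameters as the list of per-page PDFs,
# so that paths containing spaces are passed to pdfunite intact.
set --

code=1
pagenum=1

while true
do
  # Pages are stored as paper.<n>.jpg; stop at the first missing one.
  image="$input/paper.$pagenum.jpg"
  if [ ! -r "$image" ]
  then
    break
  fi

  # OCR output for the page; fall back to an empty text layer if missing.
  words="$input/paper.$pagenum.words"
  if [ ! -r "$words" ]
  then
    words='/dev/null'
  fi

  echo "Processing page $pagenum..."
  hocr2pdf -r "$dpi" -i "$image" -o "$tmpdir/$pagenum.pdf" <"$words"
  code=$?

  if [ 0 -ne $code ]
  then
    break
  fi

  set -- "$@" "$tmpdir/$pagenum.pdf"
  pagenum=$((pagenum+1))
done

if [ 0 -eq $code ]
then
  echo "Creating final PDF..."
  pdfunite "$@" "$output"
  code=$?
fi

# Remove the temporary directory, including any partial output of a failed page.
rm -rf "$tmpdir"

exit $code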
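
Because the script silently falls back to an empty text layer when a page has no .words file, it may be worth checking a document before exporting it. The following is only a sketch: the function name pages_without_text is made up for illustration, and it assumes the same paper.<n>.jpg / paper.<n>.words naming that the script above relies on.

```shell
#!/bin/sh
# Hypothetical helper: print the number of every page in a Paperwork
# document directory that has an image but no readable OCR file.
# Assumes the paper.<n>.jpg / paper.<n>.words naming used by the script above.
pages_without_text() {
  dir="$1"
  n=1
  # Walk the pages in order and stop at the first missing image.
  while [ -r "$dir/paper.$n.jpg" ]
  do
    if [ ! -r "$dir/paper.$n.words" ]
    then
      echo "$n"
    fi
    n=$((n+1))
  done
}
```

Calling pages_without_text on a document directory prints one page number per line; no output means every page found has an OCR file to feed to hocr2pdf.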


zdenop commented Jan 5, 2014

If you are interested in small PDFs, then have a look at https://github.com/agl/jbig2enc (it does not add a text layer, but it should be possible to combine pdf.py from jbig2enc with the projects mentioned above).

Interesting experiences with scanning, OCR, jbig2enc, hOCR, PDF... can be found on this blog: http://ssdigit.nothingisreal.com/


akarzim commented Nov 26, 2014

Another way to drastically reduce the size of the produced PDF is to pipe it through a PostScript round trip:

pdf2ps - - | ps2pdf - - 

But this way we lose the OCR data.
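
Since a size-reduction step like this can drop the text layer, a quick check on the resulting file helps. This is a rough sketch, not something from the thread: it assumes pdftotext (part of Poppler, like pdfunite) is installed, and the helper names has_text and pdf_has_text are made up for illustration.

```shell
#!/bin/sh
# has_text: succeed if standard input contains any non-whitespace character.
has_text() {
  grep -q '[^[:space:]]'
}

# pdf_has_text (hypothetical helper): succeed if pdftotext extracts any
# text from the given PDF. Assumes pdftotext from Poppler is installed.
pdf_has_text() {
  pdftotext "$1" - 2>/dev/null | has_text
}
```

For example, `pdf_has_text shrunk.pdf || echo "no text layer"` (with shrunk.pdf standing in for whatever file the pipeline produced) would flag an output that has become image-only.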

@jflesch jflesch modified the milestones: 0.4-unstable, 0.3-unstable Oct 9, 2015

jflesch commented Jun 29, 2016

Well, actually, Cairo can do it ... :)
openpaperwork/paperwork-backend@38632cd

@jflesch jflesch closed this as completed Jun 29, 2016

r0bis commented Jul 18, 2016

Well, including the text in any form would be a very good feature. If one needs to further optimise or edit the PDF, one could use MasterPDF Editor for Linux. Just getting the text in would be a huge step forward. I am using version 0.3.2 and it only generates image-only PDFs; no "selectable" or searchable text is found.


jflesch commented Jul 18, 2016

I understand. What I was trying to say in my previous comment (#135 (comment)) is that I've added this feature in paperwork-backend. In other words, it will be available in Paperwork 0.4.
(The problem is that, for now, Paperwork 0.4 is still very unstable when scanning ... :/)


r0bis commented Jul 18, 2016

That is great, sorry I misunderstood. Fingers crossed that the instability in scanning will be resolved. This is a great piece of software. I was just curious about a design choice, by the way: I have not seen a way to add a document title; effectively, all written identification is done through tags and additional text? It is nice to spend the absolute minimum time on document processing when scanning. On the other hand, document titles / file names are such a time-honored tradition... :) I am thinking that when the number of documents reaches the hundreds, names might be handy.
