Export to PDF format: Include the text in the PDF #135
Comments
There is one corner case not to forget: imported PDFs with no text (scanned PDFs). OCR is run on them as well. To export them, we cannot just copy the PDF file, as is done currently; they must be rebuilt to include the OCR text.
Since this will probably require a more dedicated library, it would also be a good opportunity to try to reduce the size of the PDFs.
Note to myself: pypdfocr uses reportlab.pdfgen.canvas.Canvas
(same thing for the 2 others)
I made a small shell script to export a scanned Paperwork document to PDF: hocr2pdf (from the exactimage package on Ubuntu) is run on each page, and pdfunite creates the final PDF. It seems to work pretty well. Interestingly, the final PDF is also much smaller than the one produced by Paperwork's export feature: 4.9 MB instead of 21 MB. Here is the script:
#! /bin/sh
if [ 3 -ne $# ]
then
echo 1>&2 "Usage: $0 <resolution in dpi> <input directory> <output pdf filename>"
exit 1
fi
dpi="$1"
input="$2"
output="$3"
tmpdir="`mktemp -t -d export2pdf.XXXXXXXXXX`" || exit $?
code=1
pages=''
pagenum=1
while true
do
image="$input/paper.$pagenum.jpg"
if [ ! -r "$image" ]
then
break
fi
words="$input/paper.$pagenum.words"
if [ ! -r "$words" ]
then
words='/dev/null'
fi
echo "Processing page $pagenum..."
hocr2pdf -r "$dpi" -i "$image" -o "$tmpdir/$pagenum.pdf" <"$words"
code=$?
if [ 0 -ne $code ]
then
break
fi
pages="$pages $tmpdir/$pagenum.pdf"
pagenum=$((pagenum+1))
done
if [ 0 -eq $code ]
then
echo "Creating final PDF..."
pdfunite $pages "$output"
code=$?
fi
# Remove the whole temporary directory, including any partial page PDF
# left behind by a failed hocr2pdf run.
rm -rf "$tmpdir"
exit $code
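The page-discovery loop in the script above can be pulled out into a small function, which makes it easy to check which pages would be exported before running hocr2pdf on them. This is only a minimal sketch: it assumes the same paper.N.jpg layout the script relies on, and the function name list_pages is hypothetical, not part of Paperwork.

```sh
# list_pages DIR: print the page images found in a document directory,
# one per line, in page order. Like the export script, it stops at the
# first missing page number. (Hypothetical helper, for illustration only.)
list_pages() {
    dir="$1"
    n=1
    while [ -r "$dir/paper.$n.jpg" ]
    do
        printf '%s\n' "$dir/paper.$n.jpg"
        n=$((n+1))
    done
}
```

For example, `list_pages "$input" | wc -l` gives the number of pages the export would process. Note that a gap in the numbering (paper.3.jpg missing, say) silently hides every later page, which is also how the original loop behaves.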
If you are interested in small PDFs, then have a look at https://github.com/agl/jbig2enc (it does not add a text layer, but it should be possible to combine pdf.py from jbig2enc with the projects mentioned above). Interesting experiences with scanning, OCR, jbig2enc, hOCR, and PDF can be found on this blog: http://ssdigit.nothingisreal.com/
Another way to drastically reduce the size of the produced pdf is a duplex stream using
But this way we lose the OCR data.
Well, actually, Cairo can do it ... :)
Well, it would be a very good feature to include the text in any way. If one needs to further optimise or edit the PDF, one could use Master PDF Editor for Linux. Just getting the text in would be a huge step forward. I am using version 0.3.2, and it only generates image-only PDFs; no selectable or searchable text is found.
I understand. What I tried to say in my comment #135 (comment) is that I've added this feature in paperwork-backend. In other words, it will be available in Paperwork 0.4.
That is great, sorry I misunderstood. Fingers crossed that the instability in scanning will be resolved. This is a great piece of software. By the way, I was curious about a design choice: I have not seen a way to add a document title; is all written identification effectively done through tags and additional text? It is nice to spend the absolute minimum time on document processing when scanning. On the other hand, document titles and file names are such a time-honored tradition... :) I am thinking that when the number of documents reaches the hundreds, names might be handy.
Currently, when we export a document to the PDF format, it only contains the image of the document. Including the text would be nice.