[Clarification request] Can OCRmyPDF be modified to also create a plain text output file ? #126
Comments
It could be modified to do this. I could be persuaded, but I'm reluctant to add this because this shell pipeline already gets you both: ocrmypdf input.pdf - | tee output.pdf | pdftotext - output.txt (Although in the most recent release it sometimes fails because stdout is not flushed correctly. That will be fixed.) |
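For readers who want to avoid piping the PDF through stdout (and the flushing problem mentioned above), a two-step variant is sketched below. The function name `ocr_pdf_and_text` is hypothetical and not part of OCRmyPDF; the sketch only assumes that `ocrmypdf` and `pdftotext` are on the PATH, and it costs the extra `pdftotext` call that the issue wanted to avoid.

```shell
# Sketch: write the OCRed PDF to disk first, then extract its text.
# ocr_pdf_and_text is a hypothetical helper name, not an OCRmyPDF command.
ocr_pdf_and_text() {
    input=$1
    output_pdf=$2
    output_txt=$3
    # Run pdftotext only if ocrmypdf succeeded.
    ocrmypdf "$input" "$output_pdf" && pdftotext "$output_pdf" "$output_txt"
}
```

Usage would look like `ocr_pdf_and_text input.pdf output.pdf output.txt`.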
Coming back to text extraction (for epub generation). I found that plain text results via ocrmypdf -> pdftotext are really useless (text is garbled), whereas the generation via tesseract (alone) is really perfect. Later this month I can present some detailed examples if you want; currently I cannot. To be clear: I am in favour of using your code, but currently my own tool chain gives much better text file results. Perhaps the problem is pdftotext, which garbles the text, and my original idea in this issue is worth reconsidering (ocrmypdf -> text output). |
Is it this issue?
tesseract-ocr/tesseract#357
https://bugs.ghostscript.com/show_bug.cgi?id=696874. (See especially
comment 4. My initial guess at the cause was way off.)
Use --output-type pdf
Ghostscript 9.20 might help if using the default --output-type pdfa
On Tue, Feb 7, 2017 at 09:52 Wikinaut wrote:
Coming back to text extraction (for epub generation).
I found that plain text results via ocrmypdf -> pdftotext are really
useless (text is garbled), whereas the generation via tesseract (alone) is
really perfect.
Later this month I can present you some detailed examples, if you want,
currently I cannot.
To make sure: I am in favour of using your code, but currently my own tool
chain gives much better text file results. Perhaps the problem is
pdftotext, which garbles the text, and my original idea of this issue is
worth to be resconsidered (ocrmypdf -> text output).
|
I used what I wrote here #124 (comment) (without the --debug-rendering), with the old Debian 8 Ghostscript.
I have installed a recent Debian 9 with Ghostscript 9.20 and can retry it there (in a few days). |
Can you send me the file?
|
This ended up leading into tesseract issue #712, in which we dramatically improved Tesseract PDFs after they pass through Ghostscript. Some viewers still have problems with Tesseract PDFs, and it's unlikely that will ever be completely solved. Because of that, Tesseract PDF conversion can be thought of as a possibly lossy operation with respect to word breaks. Because of PDF limitations, Tesseract knows more about where word breaks are located than it can put in a PDF, so it is more reliable to get this out of Tesseract directly. I'll add the feature. |
Added in v5 |
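As a sketch of how the added feature is used, assuming it is the `--sidecar` option documented for OCRmyPDF v5 and later (the thread itself does not name the flag): the wrapper name `ocr_with_sidecar` is hypothetical and exists only for illustration.

```shell
# Sketch: produce the OCRed PDF and a plain text file in one ocrmypdf call.
# Assumes the v5 feature is the --sidecar option; ocr_with_sidecar is a
# hypothetical wrapper name, not part of OCRmyPDF.
ocr_with_sidecar() {
    input=$1
    output_pdf=$2
    output_txt=$3
    ocrmypdf --sidecar "$output_txt" "$input" "$output_pdf"
}
```

Usage would look like `ocr_with_sidecar input.pdf output.pdf output.txt`.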
OCRmyPDF's main task is the creation of a mixed-mode (image, text) PDF.
Can OCRmyPDF create a single plain text output file (in addition to the PDF output)? (One could use pdftotext, but this requires a further call.) tesseract on its own can already create pdf, hocr and text in one go, and for certain applications, like the creation of an ebook (out.txt as input to calibre), this can be useful.
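The "in one go" behaviour of tesseract mentioned above can be sketched as follows. This assumes a Tesseract build that ships the `pdf`, `hocr` and `txt` config files (older releases may lack `txt`); the function name `tesseract_all_outputs` and the file names in the usage note are hypothetical.

```shell
# Sketch: ask tesseract for PDF, hOCR and plain text in a single run.
# Assumes the pdf, hocr and txt configs are installed with tesseract;
# tesseract_all_outputs is a hypothetical helper name.
tesseract_all_outputs() {
    image=$1
    output_base=$2
    # Writes output_base.pdf, output_base.hocr and output_base.txt.
    tesseract "$image" "$output_base" pdf hocr txt
}
```

Usage would look like `tesseract_all_outputs page.png out`, after which out.txt could be handed to calibre.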