[Clarification request] Can OCRmyPDF be modified to also create a plain text output file ? #126

Wikinaut · 2017-01-19T22:55:39Z

OCRmyPDF's main task is the creation of a mixed-mode (image, text) PDF.

Can OCRmyPDF create a single plain text output file (in addition to the PDF output)? (One could use pdftotext, but this requires a further call.)

tesseract on its own can already create pdf, hocr and text in one go, and for certain applications, such as creating an ebook (out.txt as input to calibre), this can be useful.


jbarlow83 · 2017-01-20T00:42:22Z

It could be modified to do this. I could be persuaded, but I'm reluctant to add this because pdftotext does a decent job. Other than saving typing, what's the benefit?

This shell pipeline gets you both:

ocrmypdf input.pdf - | tee output.pdf | pdftotext - output.txt

(Although in the most recent release it sometimes fails because stdout is not flushed correctly. That will be fixed.)

Wikinaut · 2017-02-07T17:52:55Z

Coming back to text extraction (for epub generation).

I found that plain text results via ocrmypdf -> pdftotext are really useless (text is garbled), whereas the generation via tesseract (alone) is really perfect.

Later this month I can provide some detailed examples if you want; currently I cannot.

To make sure: I am in favour of using your code, but currently my own tool chain gives much better text file results. Perhaps the problem is pdftotext, which garbles the text, and my original idea in this issue is worth reconsidering (ocrmypdf -> text output).

jbarlow83 · 2017-02-07T18:05:33Z

Is it this issue? tesseract-ocr/tesseract#357, https://bugs.ghostscript.com/show_bug.cgi?id=696874. (See especially comment 4. My initial guess at the cause was way off.)

Use --output-type pdf.

Ghostscript 9.20 might help if using the default --output-type pdfa.


Wikinaut · 2017-02-07T18:12:34Z

I used what I wrote here #124 (comment) (without the --debug-rendering), with the old debian 8 ghostscript.

ocrmypdf -j1 --output-type pdf -l deu --force-ocr in.pdf out.pdf
pdftotext out.pdf out.txt

I installed a recent debian 9 with ghostscript 9.20 and can retry it there (in a few days).

jbarlow83 · 2017-02-07T21:36:16Z

Can you send me the file?

--output-type pdf disables Ghostscript, so something else may be the cause.
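One way to test whether Ghostscript plays a role is to OCR the same input with both output types and compare the text that pdftotext extracts from each. The following is only a sketch, assuming ocrmypdf and pdftotext are on the PATH; the function name `compare_output_types` and the temporary file names are hypothetical and not part of either tool.

```shell
# Hypothetical helper: OCR the same input with both output types and
# diff the text that pdftotext extracts from each result.
compare_output_types() {
    input=$1
    workdir=$(mktemp -d) || return 1

    for otype in pdf pdfa; do
        # pdfa is the default and goes through Ghostscript;
        # pdf disables Ghostscript, per the comment above.
        ocrmypdf --force-ocr --output-type "$otype" "$input" "$workdir/$otype.pdf" || return 1
        pdftotext "$workdir/$otype.pdf" "$workdir/$otype.txt" || return 1
    done

    # Leave the files in place for inspection and say where they are.
    echo "results kept in $workdir" >&2

    # diff exits 0 when the two text files are identical, 1 when they differ.
    diff "$workdir/pdf.txt" "$workdir/pdfa.txt"
}
```

Calling `compare_output_types in.pdf` prints any differing lines. If the two text files match and both are garbled, the cause is likely somewhere other than Ghostscript.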

jbarlow83 · 2017-02-15T21:15:33Z

This ended up leading into tesseract issue #712, in which we dramatically improved tesseract PDFs after they pass through Ghostscript. Some viewers still have problems with Tesseract PDFs and it's unlikely that will ever be completely solved.

Because of that, Tesseract PDF conversion can be thought of as a possibly lossy operation with respect to word breaks. Because of PDF limitations, Tesseract knows more about where word breaks are located than it can put in a PDF, so it is more reliable to get the text out of tesseract directly.

I'll add the feature.
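For reference, getting text straight from Tesseract, as discussed above, might look like the sketch below. It assumes a tesseract binary on the PATH and a single page image as input (Tesseract reads images, not PDFs). The function name `tesseract_all_outputs` and its argument order are hypothetical.

```shell
# Hypothetical helper: OCR one page image with Tesseract alone, writing
# <base>.pdf, <base>.hocr and <base>.txt in a single call.
tesseract_all_outputs() {
    image=$1          # page image, for example a TIFF or PNG
    base=$2           # output base name, without extension
    lang=${3:-eng}    # language code, defaults to English

    # The trailing words are Tesseract config names; each one switches
    # on one output format.
    tesseract "$image" "$base" -l "$lang" pdf hocr txt
}
```

For example, `tesseract_all_outputs page.png out deu` would be expected to produce out.pdf, out.hocr and out.txt for a German page.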

jbarlow83 · 2017-05-12T22:30:10Z

Added in v5
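The thread does not show the resulting command. In OCRmyPDF's documentation the feature is exposed as the `--sidecar` option, which matches the "sidecar text files" wording of the commit referenced for this issue. A minimal sketch, assuming ocrmypdf v5 or later is on the PATH; the wrapper name `ocr_with_sidecar` is hypothetical.

```shell
# Hypothetical wrapper: produce the OCRed PDF and a plain text sidecar
# file in a single ocrmypdf run, with no separate pdftotext call.
ocr_with_sidecar() {
    input=$1
    output_pdf=$2
    output_txt=$3

    # --sidecar takes the path of the text file to write.
    ocrmypdf --sidecar "$output_txt" "$input" "$output_pdf"
}
```

Usage would be along the lines of `ocr_with_sidecar in.pdf out.pdf out.txt`, after which out.txt can be fed to a tool such as calibre.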

Wikinaut closed this as completed Jan 22, 2017

jbarlow83 reopened this Feb 7, 2017

jbarlow83 added the enhancement label May 1, 2017

jbarlow83 added this to the v5.0 milestone May 1, 2017

jbarlow83 pushed a commit that referenced this issue May 10, 2017

Implement sidecar text files (#126)

183eafa

jbarlow83 closed this as completed May 12, 2017
