[Clarification request] Can OCRmyPDF be modified to also create a plain text output file ? #126

Closed
Wikinaut opened this issue Jan 19, 2017 · 7 comments

@Wikinaut

OCRmyPDF's main task is the creation of a mixed-mode (image, text) PDF.

Can OCRmyPDF create a single plain text output file in addition to the PDF output? (One could use pdftotext, but that requires a further call.)

tesseract on its own can already create PDF, hOCR and text in one go, and for certain applications, such as creating an ebook (out.txt as input to calibre), this can be useful.
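For illustration, a minimal sketch of the one-pass Tesseract call described above. It assumes Tesseract's stock `pdf`, `hocr` and `txt` config files are installed; the function name `tess_all` and the file names are made up for this example.

```shell
# Hypothetical helper: run Tesseract once and request three output formats.
# Usage: tess_all IMAGE OUTPUT_BASE [LANG]
tess_all() {
    image=$1
    base=$2
    lang=${3:-eng}   # default language is an assumption for this sketch
    # The trailing words name Tesseract config files; each one switches on
    # one output format, all written next to OUTPUT_BASE.
    tesseract "$image" "$base" -l "$lang" pdf hocr txt
}
```

Calling `tess_all page.png out deu` would then be expected to leave a PDF, an hOCR file and `out.txt` side by side.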

@jbarlow83
Collaborator

It could be modified to do this. I could be persuaded, but I'm reluctant to add this because pdftotext does a decent job. Other than saving typing, what's the benefit?

This shell pipeline gets you both:

ocrmypdf input.pdf - | tee output.pdf | pdftotext - output.txt

(Although in the most recent release it sometimes fails because stdout is not flushed correctly. That will be fixed.)
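Until the stdout flushing problem is fixed, a two-step variant that writes the PDF to a file first should sidestep it. This is only a sketch; the function name `ocr_then_text` is hypothetical and the file names are placeholders.

```shell
# Hypothetical helper: OCR to a file, then extract text from that file,
# so nothing depends on ocrmypdf flushing stdout.
# Usage: ocr_then_text INPUT_PDF OUTPUT_PDF OUTPUT_TXT
ocr_then_text() {
    in_pdf=$1
    out_pdf=$2
    out_txt=$3
    # pdftotext runs only if ocrmypdf exits successfully.
    ocrmypdf "$in_pdf" "$out_pdf" && pdftotext "$out_pdf" "$out_txt"
}
```

For example, `ocr_then_text input.pdf output.pdf output.txt` yields the same two files as the pipeline, at the cost of the extra call.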

@Wikinaut
Author

Wikinaut commented Feb 7, 2017

Coming back to text extraction (for epub generation).

I found that the plain text results via ocrmypdf -> pdftotext are really useless (the text is garbled), whereas generation via tesseract alone is really perfect.

Later this month I can show you some detailed examples if you want; currently I cannot.

To make sure: I am in favour of using your code, but currently my own tool chain gives much better text file results. Perhaps the problem is pdftotext, which garbles the text, and the original idea of this issue (ocrmypdf -> text output) is worth reconsidering.

@jbarlow83
Collaborator

jbarlow83 commented Feb 7, 2017 via email

@Wikinaut
Author

Wikinaut commented Feb 7, 2017

I used what I wrote in #124 (comment), without the --debug-rendering option, with the old Debian 8 Ghostscript.

ocrmypdf -j1 --output-type pdf -l deu --force-ocr in.pdf out.pdf
pdftotext out.pdf out.txt

I have installed a recent Debian 9 with Ghostscript 9.20 and can retry it there (in a few days).

@jbarlow83 jbarlow83 reopened this Feb 7, 2017
@jbarlow83
Collaborator

jbarlow83 commented Feb 7, 2017

Can you send me the file?

--output-type pdf disables Ghostscript, so something else may be the cause.

@jbarlow83
Collaborator

This ended up leading into tesseract issue #712, in which we dramatically improved Tesseract PDFs after they pass through Ghostscript. Some viewers still have problems with Tesseract PDFs, and it's unlikely that will ever be completely solved.

Because of that, Tesseract PDF conversion can be thought of as a possibly lossy operation with respect to word breaks. Because of PDF limitations, Tesseract knows more about where word breaks are located than it can put in a PDF, so it is more reliable to get the text out of Tesseract directly.

I'll add the feature.

@jbarlow83 jbarlow83 added this to the v5.0 milestone May 1, 2017
jbarlow83 pushed a commit that referenced this issue May 10, 2017
@jbarlow83
Collaborator

Added in v5
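A minimal sketch of requesting the text file directly, assuming the feature referred to here is the `--sidecar` option described in the OCRmyPDF documentation (this thread does not name the option). The wrapper name `ocr_with_sidecar` and the file names are hypothetical.

```shell
# Hypothetical wrapper: produce the OCRed PDF and a plain text file in one call.
# Assumes an OCRmyPDF version (v5 or later) that supports --sidecar.
# Usage: ocr_with_sidecar INPUT_PDF OUTPUT_PDF OUTPUT_TXT
ocr_with_sidecar() {
    in_pdf=$1
    out_pdf=$2
    out_txt=$3
    # --sidecar takes the path of the text file to write alongside the PDF.
    ocrmypdf --sidecar "$out_txt" "$in_pdf" "$out_pdf"
}
```

With this, `ocr_with_sidecar in.pdf out.pdf out.txt` replaces the separate pdftotext step, and the text comes from the OCR engine rather than from the finished PDF.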
