Consider integrating OCR support #807

Open
laurent22 opened this issue Sep 19, 2018 · 12 comments

Comments

Owner

@laurent22 laurent22 commented Sep 19, 2018

It seems possible to add support for OCR content in Joplin via the Tesseract library: http://tesseract.projectnaptha.com

A first step would be to assess the feasibility of this project by integrating the lib in the desktop app and trying to OCR an image.

  • Is the image correctly OCRed?
  • Does it work with non-English text?
  • How slow/fast is it? Test with a very large image to be sure. It should not freeze the app while processing an image.

If everything works well, we can add the feature to the app.

Specification

  • On desktop app: Create a service that runs in the background and processes the resources that need to be OCRed.
  • When a document is OCRed: Append a block to the end of the note that contains the extracted plain text
  • When attaching a resource, ask what the user wants to do:
    • Always OCR all files
    • Never OCR any file
    • Always OCR files with extension ".ext"
    • Never OCR files with extension ".ext"
  • Can be changed in settings
  • Right-click resource, or note, to OCR content
  • Add an ocr_status field to the resource table. Can be: none, todo, processing, done
  • Add ocr_text to resource: must include detailed coordinates, and a way to get plain text back
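
The attach-time choice and the ocr_status lifecycle listed above could be sketched roughly as follows. This is only an illustration under assumptions: `OcrResource`, `OcrSettings` and the three function names are hypothetical and not part of Joplin; only the `ocr_status` values and the `ocr_text` field come from the spec. The actual OCR call would happen in the background between `claimNextResource` and `completeResource`.

```typescript
// Status values taken from the spec above.
type OcrStatus = "none" | "todo" | "processing" | "done";

// Hypothetical minimal resource record; only ocr_status and ocr_text come from the spec.
interface OcrResource {
  id: string;
  fileExtension: string;
  ocr_status: OcrStatus;
  ocr_text: string;
}

// Hypothetical settings shape covering the four choices offered when attaching.
interface OcrSettings {
  defaultAction: "always" | "never";
  alwaysExtensions: string[]; // lower case, without the dot
  neverExtensions: string[]; // lower case, without the dot
}

// Decide the status a newly attached resource starts with.
// Per-extension rules take precedence over the default action.
function initialOcrStatus(extension: string, settings: OcrSettings): OcrStatus {
  const ext = extension.toLowerCase();
  if (settings.neverExtensions.indexOf(ext) !== -1) return "none";
  if (settings.alwaysExtensions.indexOf(ext) !== -1) return "todo";
  return settings.defaultAction === "always" ? "todo" : "none";
}

// Claim the next queued resource: todo -> processing.
function claimNextResource(resources: OcrResource[]): OcrResource | null {
  for (const resource of resources) {
    if (resource.ocr_status === "todo") {
      resource.ocr_status = "processing";
      return resource;
    }
  }
  return null;
}

// Store the OCR result: processing -> done.
function completeResource(resource: OcrResource, text: string): void {
  if (resource.ocr_status !== "processing") {
    throw new Error("Resource " + resource.id + " is not being processed");
  }
  resource.ocr_text = text;
  resource.ocr_status = "done";
}
```

Keeping the state changes separate from the OCR call means the slow recognition step can run in a worker or child process without touching this logic.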

Advantages of doing it that way:

  • Search engine just works - no need for special indexing of OCR content since it is inside the note directly
  • Will work with all clients (mobile, desktop, terminal)
  • When a note is exported to Markdown, it will include the OCR content

Format of OCR text block

<!-- autogen-ocr :resource.id -->
* * *

**:resource.title**

:resource.ocr_text
<!-- autogen-ocr :resource.id -->

For example, for a resource called "TrainTicket.png":

<!-- autogen-ocr 2ee4eec909734f7197654a9a040dfba7 -->
* * *

**TrainTicket.png**

From: London
To: Paris
Date: 01/12/2019
Time: 15:00
...etc.
<!-- autogen-ocr 2ee4eec909734f7197654a9a040dfba7 -->

The advantage of this format is that it will render nicely in the viewer, and it will still be clearly identified as OCR content, which means later we can identify these blocks, update them, remove them, etc.
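
As a rough sketch of how such blocks could be generated, updated and removed, assuming the closing marker repeats the resource ID just like the opening one: the function names and the `OcrBlockSource` type are made up for illustration and are not existing Joplin code.

```typescript
// Hypothetical input type; the field names mirror the placeholders in the format above.
interface OcrBlockSource {
  id: string;
  title: string;
  ocr_text: string;
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Render the block in the format described above.
// Assumption: the closing marker carries the same resource ID as the opening one.
function renderOcrBlock(r: OcrBlockSource): string {
  const marker = "<!-- autogen-ocr " + r.id + " -->";
  return [marker, "* * *", "", "**" + r.title + "**", "", r.ocr_text, marker].join("\n");
}

// Remove the block for a given resource, including the blank lines before it.
function removeOcrBlock(body: string, resourceId: string): string {
  const marker = "<!-- autogen-ocr " + escapeRegExp(resourceId) + " -->";
  const pattern = new RegExp("\\n*" + marker + "[\\s\\S]*?" + marker, "g");
  return body.replace(pattern, "");
}

// Replace any existing block for this resource with a fresh one at the end of the note.
function upsertOcrBlock(body: string, r: OcrBlockSource): string {
  const base = removeOcrBlock(body, r.id).replace(/\s+$/, "");
  return (base ? base + "\n\n" : "") + renderOcrBlock(r);
}
```

Calling `upsertOcrBlock` twice for the same resource leaves a single block in the note, which is what makes re-running OCR safe.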

Later

  • Support PDF files - for example by converting each page to an image first, then passing it to Tesseract.
  • Make ocr_text searchable
  • Display search results directly on document. i.e. if it's an image, highlight the parts of the image that contain the search text.
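
To make the coordinates idea more concrete, here is a minimal sketch of what a structured ocr_text could look like, with a way to get plain text back and a way to find the regions to highlight for a search term. The types and function names are assumptions made for this example; the real storage format is still open.

```typescript
// Bounding box in image pixels (top-left and bottom-right corners).
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Hypothetical structure: each recognised word keeps its position in the image.
interface OcrWord {
  text: string;
  bbox: BBox;
}

interface OcrLine {
  words: OcrWord[];
}

// Get plain text back: words joined by spaces, lines joined by newlines.
function toPlainText(lines: OcrLine[]): string {
  return lines
    .map((line) => line.words.map((w) => w.text).join(" "))
    .join("\n");
}

// Return the boxes of all words containing the query (case-insensitive).
// Only single-word matches are handled; phrases spanning words are not.
function findHighlights(lines: OcrLine[], query: string): BBox[] {
  const q = query.trim().toLowerCase();
  const boxes: BBox[] = [];
  if (q === "") return boxes;
  for (const line of lines) {
    for (const word of line.words) {
      if (word.text.toLowerCase().indexOf(q) !== -1) boxes.push(word.bbox);
    }
  }
  return boxes;
}
```

The viewer could then draw a rectangle over the image for each returned box.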
@laurent22 laurent22 added the essential label Sep 19, 2018

@bmix bmix commented Oct 2, 2018

I would recommend against it, because adding a major feature like this requires a lot of development time that may be better spent on the core features of the app.

There are already capable applications that implement Tesseract on all major platforms:

  • gImageReader - can export to hOCR, which is a microformat for OCR export via HTML; see also the git repository and the obligatory Wikipedia entry. An importer would need to be written, since this HTML is very structured and contains data that is redundant for Joplin.

  • then, while not OCR but PDF extraction, there is also

    • Apache's pdfbox, which is available on all major platforms and can extract to HTML (I think even XHTML!)
    • and Poppler, also available on all major platforms; this exports to XML or HTML, I do not remember which. Again, just importers would need to be written.

(I use all three programs on Windows and they are very mature and powerful)
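
For illustration, an hOCR importer along these lines might look like the sketch below. It assumes Tesseract-style markup, where each word is a `span` with class `ocrx_word` whose `title` attribute holds `bbox x0 y0 x1 y1`, with `class` written before `title`, and where lines are `span` elements with class `ocr_line`. It uses regular expressions only to stay self-contained; a real importer would more likely use an HTML parser. All names here are hypothetical.

```typescript
interface HocrWord {
  text: string;
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Decode the handful of entities commonly found in hOCR output.
function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Extract words grouped by line, keeping only the text and bounding box.
function parseHocr(html: string): HocrWord[][] {
  const lines: HocrWord[][] = [];
  // Everything between two ocr_line opening tags is treated as one line.
  const chunks = html.split(/<span[^>]*class=["']ocr_line["'][^>]*>/);
  for (const chunk of chunks) {
    const wordRe = /<span[^>]*class=["']ocrx_word["'][^>]*title=["']([^"']*)["'][^>]*>([\s\S]*?)<\/span>/g;
    const words: HocrWord[] = [];
    let m: RegExpExecArray | null;
    while ((m = wordRe.exec(chunk)) !== null) {
      const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(m[1]);
      // Drop inline tags such as <strong> or <em> around the word.
      const text = decodeEntities(m[2].replace(/<[^>]+>/g, "")).trim();
      if (!bbox || text === "") continue;
      words.push({
        text: text,
        x0: Number(bbox[1]),
        y0: Number(bbox[2]),
        x1: Number(bbox[3]),
        y1: Number(bbox[4]),
      });
    }
    if (words.length > 0) lines.push(words);
  }
  return lines;
}

// Reduce the hOCR document to the plain text a note would need.
function hocrToPlainText(html: string): string {
  return parseHocr(html)
    .map((words) => words.map((w) => w.text).join(" "))
    .join("\n");
}
```

Everything else in the hOCR file (confidence values, baselines, page and paragraph containers) is simply ignored, which is the redundant data mentioned above.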

EDIT: Isn't the main feature of Evernote's OCR the recognition of handwriting? AFAIK Tesseract doesn't do that. It's still considered some magic wizardry, which may need performant systems and (patented?) algorithms. I wouldn't be surprised if Evernote did such OCR on their servers in the cloud.

@bernd-wechner bernd-wechner commented Oct 2, 2018

I concur with bmix. Not that OCR is a BAD idea, only that it seems a distraction when so many other things are calling and not a priority in the least given how many OCR options I have outside of Joplin. There are many features I'd prefer to see before this one!

Owner Author

@laurent22 laurent22 commented Oct 2, 2018

If it can be easily added using the Tesseract.js library it would be a good addition. The point is to be able to search attached documents, PDFs in particular, in the app itself. So for this external tools can't really help.

Most likely Evernote do this on their server, yes, but maybe it's possible to get the desktop app to do the job in the background. It's only under consideration at this point and if it cannot be done in a reliable way it won't be added.

@StrilGit StrilGit commented Nov 2, 2018

I would pay a lot to get handwriting recognition and OCR. For me, that is the most critical feature to make Joplin a real replacement for Evernote, OneNote, etc.
I am sadly not able to add this...

I do not want to use a keyboard in meetings. Using a pencil is better accepted...

@mkrauser mkrauser commented Feb 1, 2019

OCR is one of the Evernote features I enjoy the most! With it I can simply search scanned PDFs. In my case, handwriting recognition is not that important.
I would love to see this in Joplin.

@theredspoon theredspoon commented Feb 2, 2019

> If it can be easily added using the Tesseract.js library it would be a good addition. The point is to be able to search attached documents, PDFs in particular, in the app itself. So for this external tools can't really help.
>
> Most likely Evernote do this on their server, yes, but maybe it's possible to get the desktop app to do the job in the background. It's only under consideration at this point and if it cannot be done in a reliable way it won't be added.

After transitioning from Evernote to Joplin, I'm missing this feature the most. @laurent22 any suggestions on ways to proceed in evaluating whether Tesseract.js can do the work?

@lumogas lumogas commented Apr 15, 2019

For what it's worth, I would find it very useful. I think it's a very powerful feature of Evernote's search, and it would make the transition to Joplin a lot easier for me.

@StoltHD StoltHD commented May 29, 2019

Annotation extraction from PDFs can be done with the pdf.js library; the ZotFile plugin for Zotero uses this... But please keep it more up to date than ZotFile...

It would be great to have OCR, or at least reading of OCR-enabled PDFs, in Joplin. Some people use Joplin or other notebooks for historical research and add old newspapers, typewritten patent documents, census sheets (difficult to OCR, but can easily be annotated with comments)...

If you make it an add-on/plug-in, people can decide whether they want to enable it or not. You could even offer the plug-ins as a separate download and add a customizable path to a plug-in folder that Joplin reads, so that all plug-ins in that folder are selectable in the plug-in area of Joplin...

That way Joplin CORE will not be bloated for those who need a small footprint.
(Same could be done with reports, make them add-ons, and make a simple howto so users can make their own reports if they want to...).

@stale stale bot commented Sep 23, 2019

Hey there, it looks like there has been no activity on this issue recently. Has the issue been fixed, or does it still require the community's attention? This issue may be closed if no further activity occurs. You may also label this issue as "backlog" and I will leave it open. Thank you for your contributions.

@stale stale bot added the stale label Sep 23, 2019

@lumogas lumogas commented Sep 23, 2019

I still think this feature is worth looking into. Any news?

@stale stale bot removed the stale label Sep 23, 2019

@steve28 steve28 commented Nov 24, 2019

Oh please! If Joplin wants to be an Evernote replacement, this needs to be added. This was my main use of Evernote - archiving receipts, printed manuals, snail mail that needs to be saved for reference, etc. I rely on this being searchable.

@Shamp0o Shamp0o commented Dec 10, 2019

FYI somebody made a script that monitors a folder and uploads new files to Joplin. It also runs a Tesseract OCR scan of the files and attaches the text as a comment in the Markdown code of the note to make it searchable.
It works quite well with printed text but is terrible for handwritten text in my experience. If you're willing to put some effort in, you have quite a few options to customize Tesseract to fit your needs.
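
To give an idea of the "text as a comment" approach, a note body could be assembled roughly like this. This is not the script's actual code, just a sketch: `buildNoteBody` is a made-up name, and the link uses Joplin's `:/resourceId` syntax for attached resources.

```typescript
// Build a note body that links the attachment and hides the OCR text in an HTML comment.
// The comment is not rendered in the viewer but stays part of the searchable note body.
function buildNoteBody(fileName: string, resourceId: string, ocrText: string): string {
  // A literal "-->" in the OCR text would end the comment early, so break it up.
  const safeText = ocrText.replace(/-->/g, "-- >");
  return "[" + fileName + "](:/" + resourceId + ")\n\n<!--\n" + safeText + "\n-->";
}
```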
