Opensource OCR Service (PDF / TIFF / Scan to Text Conversion Service) #88

Closed
rufuspollock opened this issue Oct 31, 2014 · 62 comments

@rufuspollock rufuspollock commented Oct 31, 2014

Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/

Note: for generic PDF to text (including but not necessarily OCR) - see #52 (simple pdf to text service)

Quote from Tim:

Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service.

OCRopus does layout analysis, splitting the image into lines/words. These split files are then sent to Tesseract for OCR and reassembled to create hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity.
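
The reassembly step described in the quote can be sketched roughly as follows. This is an illustration only, not Tim's code: `Line`, `assemble_hocr` and the `ocr_line` callable are hypothetical names, and the markup is a minimal subset of hOCR. In a real pipeline the line images would come from OCRopus and `ocr_line` would call Tesseract.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Iterable, Tuple

BBox = Tuple[int, int, int, int]  # x0, y0, x1, y1 in page pixels


@dataclass
class Line:
    """One text line found by layout analysis (hypothetical structure)."""
    bbox: BBox
    image: bytes  # cropped line image that would be handed to the OCR engine


def assemble_hocr(lines: Iterable[Line], ocr_line: Callable[[bytes], str]) -> str:
    """Run OCR on each line and wrap the results in minimal hOCR markup."""
    spans = []
    for line in lines:
        text = ocr_line(line.image)  # assumption: a Tesseract call in practice
        bbox = " ".join(str(value) for value in line.bbox)
        spans.append(
            f"<span class='ocr_line' title='bbox {bbox}'>{escape(text)}</span>"
        )
    body = "\n".join(spans)
    return f"<div class='ocr_page'>\n{body}\n</div>"


# Usage with a stand-in OCR function (no real OCR engine involved).
fake_ocr = lambda image: image.decode("ascii").upper()
hocr = assemble_hocr(
    [Line((0, 0, 100, 20), b"hello"), Line((0, 25, 100, 45), b"a < b")],
    fake_ocr,
)
```

Because each line is processed independently, the per-line OCR call is the natural unit to hand to a task queue such as Celery.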

@pudo pudo commented Oct 31, 2014

Other libs:

PDF2 Text

  • Poppler utils (pdftohtml)
  • Apache Tika / Apache Stanbol
  • Tabula (I think it uses Tika?)

@mattfullerton mattfullerton commented Nov 14, 2014

I'm going to start building this next week. This is my attempt to pull together the various suggestions/ideas so far.

The question is what to start with, as there is so much out there. But as my primary interest is setting something up that can extract text from as many formats as possible, and also be accessible from multiple projects, it seems like wrapping textract in a web service is the best place to start. And it seems that nobody has done this yet. It might be good to start with celery from the outset so we can build capacity later, given that we will want parallel jobs in any case.

@pudo I'll try to bear in mind the philosophy in centipede and flesh out the API as I go.
@pudo There are thoughts on integrating Tika into textract as an additional processing method (deanmalmgren/textract#12)

@cleder If detecting no text and handing off the images to tesseract is not handled well by textract or tricky to implement, the code from the Plone work might be really helpful.

I think at a later stage we can add OCRopus as per @timClicks work to improve quality. @timClicks if you can contribute any of your code, that would be great.

Textract/Language choice

There is a textract for Python, which @pudo mentions. I like Python. Although I've worked with Flask, I do not have a lot of experience building web services with Python. There seems to be good support for using Flask or Django with celery, I'm sure the same goes for Pyramid. There is also a separate but identically named node.js module: https://github.com/dbashford/textract. I have worked with nodejs/express before and was tempted to fork webshot (https://github.com/okfn/webshot) as a starting point which is also nodejs (but see stuff below: I'm tempted to go with Python if webshot is the wrong starting point). Any strong opinions either way? The list of supported formats (https://github.com/dbashford/textract#currently-extracts vs. http://textract.readthedocs.org/en/latest/#currently-supporting) is similar (the big difference being text from sound formats, but do we need that!?). Neither is handling OCR for PDFs, but both offer it for images, so we may have to tweak that part for the case that pdf conversion returns no intelligible text, or offer it as an option (see comment above).

Slow REST

That being said about webshot, we are probably going to need something that returns a reference to a job that can be queried giving the status of the conversion, and potentially (embracing @pudo's efforts to frame services as part of a pipeline) multiple job stages (@pudo, correct me if I'm misunderstanding), so it may not be the best starting point. Do we know of any nice Python or node projects that implement such an asynchronous API that could be used as a starting point? I've seen some nice references on patterns in general but haven't looked for an example project yet.
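
To make the job-reference idea concrete, here is a minimal in-memory sketch using only the Python standard library. `JobStore` and its method names are hypothetical, and a thread pool stands in for Celery; a real service would expose `submit` and `status` as HTTP endpoints and keep job state somewhere persistent.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobStore:
    """In-memory job registry: submit work, get an ID back, poll for status."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, func: Callable[..., str], *args) -> str:
        """Start a job and return a reference the client can poll."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id: str) -> dict:
        """Return roughly what a GET on the job reference might answer."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id: str, timeout: float = 5.0) -> dict:
        """Block until the job finishes or the timeout passes, then report."""
        try:
            self._jobs[job_id].result(timeout=timeout)
        except Exception:
            pass  # failures and timeouts are reported through status()
        return self.status(job_id)


# Usage: the lambda is a stand-in for a slow conversion.
store = JobStore()
job_id = store.submit(lambda data: data.decode("utf-8").strip(), b"  some text \n")
final = store.wait(job_id)
```

The client only ever holds `job_id`, so the conversion can take as long as it needs without keeping a connection open.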

Security

Is our primary aim to produce a stack that people have to install themselves and can secure whatever way they like (seems almost a pity when we go to the effort to create a web service) or a publicly accessible service (like webshot)? The latter will require some request limiting and maybe the distribution of API keys, given how resource-intensive the operations could be.
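
For the publicly accessible variant, request limiting per API key could look something like the token-bucket sketch below. The class names, capacities and refill rates are all assumptions for illustration; nothing here comes from an existing service.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class KeyedLimiter:
    """One bucket per API key, created on first use."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        if api_key not in self._buckets:
            self._buckets[api_key] = TokenBucket(self._capacity, self._refill, self._clock)
        return self._buckets[api_key].allow()
```

A web handler would call `limiter.allow(key)` before queuing a conversion and answer with HTTP 429 when it returns `False`.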

@mattfullerton mattfullerton commented Nov 14, 2014

OK, there was one very important suggestion I missed, that we just set up an instance of the Data Science Toolkit (https://github.com/petewarden/dstk)

It is Ruby, and doesn't support the wealth of formats that textract does. Scalability would be done by load balancing to multiple instances.

Specifically it's the 'file2text' API that would interest us, handled here:
https://github.com/petewarden/dstk/blob/595e4b51261db715af4e71a5be0f37e0ecd75ab6/dstk_server.rb#L1114

@rufuspollock rufuspollock commented Nov 15, 2014

@mattfullerton i think nodejs has some nice benefits (e.g. the async setup means when you deploy on e.g. heroku you can serve many clients at once - one request won't block the system) which could be esp relevant here.

However, my guess is that this will be driven by the libraries available. Given that you have got textract in node (though it is just a wrapper on command line utilities) it may be worth going with that - though note we'll have to deploy on a "proper" machine, not heroku, if we want all those utilities (but we can use labs machines).

Lastly: am I right that textract python and textract node are pretty much identical in functionality? (My guess is that python one may be slightly stabler and better (??))

I have to say webshot might be quite nice as inspiration for both the UI and the API.

Last 2c: is it worth trying to write out some explicit user stories - even if obvious. I always find this invaluable :-)

@pudo pudo commented Nov 17, 2014

Just to be clear: we're talking about the OCR bit only, right? It'd be very cool to use this to work out bits of the API for the centipede API (ping @malev). One thing in particular is this: is it more useful to hand around the actual documents, or a link to the documents on an S3 bucket? Obviously pushing around the documents is simpler, but using references makes it more lightweight.

I'm very interested to see good implementation of slow REST (e.g. job references) vs. long waits (e.g. node running stuff synchronously and letting you wait on the line) - both have merits, I'd like to know which one is nicer in practice :)

@mattfullerton mattfullerton commented Nov 17, 2014

We're talking about file to text, including OCR if necessary, and not necessarily about general pipeline/document management.

I had already come to the conclusion that I want to try this with Python/Flask+Celery/Redis when I saw you (@pudo) have already started with that combination for centipede. I've forked that repo to get started and will try and build on the existing ideas for the API to create something useful also for other document operations.

I guess both slow-REST and long waits inflict some pain on the 'user' (client side developer). But if we're going with Python and wouldn't have been using Heroku for node anyway, I think we should try slow-REST first.

@chrismattmann chrismattmann commented Nov 28, 2014

Hi Guys, just FYI on this. Apache Tika provides a wrapped version of Tesseract, as a web service. See: http://wiki.apache.org/tika/TikaOCR

@rufuspollock rufuspollock commented Dec 28, 2014

@mattfullerton any updates here from your end?

@mattfullerton mattfullerton commented Dec 30, 2014

I was working on extending https://github.com/OpenNewsLabs/centipede, but ran out of time due to other project priorities. I like the pipeline/task concept.

But I think it would be easier if I just set up an instance of tika-server for us to test. Ping me again in a week if I haven't done that yet. It looks great:
http://wiki.apache.org/tika/TikaJAXRS#Tika_Resource
(link doesn't work for me)
http://webcache.googleusercontent.com/search?q=cache:MC8ekfYmifcJ:wiki.apache.org/tika/TikaJAXRS+&cd=1&hl=en&ct=clnk&gl=de

@mattfullerton mattfullerton commented Jan 6, 2015

I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika

Test by doing things like:

curl -T multipage_tiff_example.tif http://beta.offenedaten.de:9998/tika

Fuller instructions here:
https://wiki.apache.org/tika/TikaOCR

You can run your own using Docker by doing:

sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika
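
For callers who would rather not shell out to curl, the same upload can be sketched in Python with the standard library. `curl -T` sends an HTTP PUT, which is what this mirrors. The function names are hypothetical, the default `http://localhost:9998` assumes the Docker container above is running locally, and the `Accept: text/plain` header is an assumption about the response format you want back.

```python
import urllib.request


def build_tika_request(data: bytes,
                       base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build the PUT request that `curl -T file <base_url>/tika` would send."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # assumption: ask for plain text
    )


def extract_text(path: str, base_url: str = "http://localhost:9998",
                 timeout: float = 300.0) -> str:
    """Upload a file to a running Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), base_url)
    # OCR on multi-page scans can be slow, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the container running, `extract_text("multipage_tiff_example.tif")` should behave like the curl command above.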

I'm very open to improvements to the Docker build files, I am no expert there.

What is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
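
One possible client-side heuristic for that detection is sketched below. It is not something Tika provides: the function names, the thresholds and the idea of treating "very little text" or "mostly non-alphanumeric text" as failure are all assumptions, and `extract` and `ocr` are placeholders for whatever actually calls the service.

```python
from typing import Callable


def looks_like_failed_extraction(text: str, min_chars: int = 20,
                                 min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether plain extraction produced no intelligible text."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True  # empty or nearly empty: probably a scanned document
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio  # mostly symbols or junk


def extract_with_fallback(data: bytes,
                          extract: Callable[[bytes], str],
                          ocr: Callable[[bytes], str]) -> str:
    """Try normal extraction first and fall back to OCR if it looks empty."""
    text = extract(data)
    if looks_like_failed_extraction(text):
        return ocr(data)
    return text
```

The thresholds would need tuning on real documents; a short but valid one-line PDF would be sent to OCR by this version.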

@chrismattmann chrismattmann commented Jan 25, 2015

Hey @mattfullerton good work - we're still working through MultiCompositeParsers in Tika (having multiple for a single type instead of our AutoDetect algorithm which picks the best one). We did a work around in Tika 1.7 and 1.8-dev (so far) to combine the ImageExtractor for metadata and then call Tesseract on images. However, for PDF if you want Tesseract to be called, you can always override the declared Mime types for the parser and/or sub-class it and rebuild Tika to get it to work on PDFs.

@mattfullerton mattfullerton commented Jan 31, 2015

@chrismattmann Thanks for the tips. Concretely, does that mean that with some passed config there will be support for using tesseract on PDFs instead of the default PDF parser (i.e. client detects if OCR is needed)? Or do you intend to go further and detect the lack of text in the PDF internally (i.e. server detects if OCR is needed)?

@rufuspollock rufuspollock commented Feb 2, 2015

@mattfullerton just want to say this is really excellent - and do ping the labs list to let them know of your progress (and would you like to do a quick blog post?)

@pudo pudo commented Feb 2, 2015

Hi all, just wanted to share a quick update on the document processing pipeline I've been working on, which consists of docpipe (a document processing tool with configurable pipelines) and barn (an OFS knock-off with a slightly more comprehensive API, also used in the openspending S3 data storage branch).

I've invested quite a lot of time into both recently, making sure barn runs against S3 which should be good in terms of the original centipede idea of having pipeline components run on different hosts but access the same virtual data store. At the same time, I've hacked up docpipe to have full support for textract (which does roughly the same thing as Tika, in Python).

All of this is the backend to an app called aleph which I'm using to allow journos to search and tag documents. The whole pipeline is a bit slow, but getting there.

Would be cool to see if there are any docking points?

@mattfullerton mattfullerton commented Feb 3, 2015

@rgrp I made a post to the list at the time: https://lists.okfn.org/pipermail/okfn-labs/2015-January/001548.html
I'll work on a blog post

@pudo That's great that things are moving forward with the pipeline approach and that it includes textract. Am I right that what is still missing is the web api? Or maybe I missed it.

@rufuspollock rufuspollock commented Feb 21, 2015

This is fantastic @mattfullerton - will tweet out more monday!

I'd also like to offer a nice url for the service e.g. tika.okfnlabs.org (if we can think of an even cooler subdomain let me know!). This would not require a move of server - just configuring apache/nginx at your end and setting up DNS for the subdomain with open knowledge sysadmins.

wdyt?

@todrobbins todrobbins commented Feb 22, 2015

Does Tika offer JP2 support? Just curious about other archival image types.

@mattfullerton mattfullerton commented Feb 23, 2015

@rgrp Good idea, as long as @ddie has no objections. The alternative of course is to try out the docker image on a labs machine. ATM there is nothing to set up at our end (although nginx would let us drop the port). tika.okfnlabs is fairly clear, but we could also go for something fun like text. or givemetext. or x2text...

@todrobbins Yes: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml (search for jpeg or jp2)

@rufuspollock rufuspollock commented Feb 23, 2015

givemetext.okfnlabs.org sounds great. a docker image also sounds really great - but requires other work so let's start with dns.

@rufuspollock rufuspollock commented Mar 16, 2015

@mattfullerton shall we ask @nigelbabu to set up the dns for this and you put in the ServerName alias in Apache/Nginx? I'm sure @ddie has no objections re the domain name.

@nigelbabu nigelbabu commented Mar 16, 2015

This has already been requested and setup.

@rufuspollock rufuspollock commented Mar 16, 2015

@nigelbabu awesome! @mattfullerton you need to set the server alias - http://givemetext.okfnlabs.org/ is still offenedaten ;-)

@mattfullerton mattfullerton commented Mar 16, 2015

There's not a whole lot I can do about that given that there is no web front end (yet). On the Tika port it works.

@rufuspollock rufuspollock commented Mar 17, 2015

@mattfullerton not sure I understand. You just need to do a reverse proxy from givemetext.okfnlabs.org on port 80 to your thing running on whatever port you have. Is the site using nginx or apache as main webserver? if you let us know we can help.

@mattfullerton mattfullerton commented Mar 20, 2015

I think it's nginx - I was just doubting the logic of putting the thing on port 80. Right now these are the two possibilities for what to show people when they arrive at givemetext.okfnlabs.org:
http://beta.offenedaten.de:9998/
http://beta.offenedaten.de:9998/tika

I haven't promoted the service anywhere as anything other than a listener on that port for PUT requests. And if someone is using it in that way it's probably irrelevant what port is in use. If you think it adds value I can add the proxy, but I would rather serve up a simple one-page app on port 80 that allows uploading a file to the service and displays the returned text.

@rufuspollock rufuspollock commented Mar 20, 2015

Serving it on port 80 is great if you are happy to do that - makes life easier. I'd also serve /tika at base location if that's where the action is.

@mattfullerton mattfullerton commented Mar 20, 2015

That would be logical, just a pity the text there is so boring at present ;-)

@mattfullerton mattfullerton commented Mar 20, 2015

OK, Done

@rufuspollock rufuspollock commented May 7, 2015

@mattfullerton any further thoughts about the nicer front page?

@rufuspollock rufuspollock commented May 8, 2015

awesome - and we are about to have a standard labs bootstrap theme we can apply to make it look swish ;-)

@mattfullerton mattfullerton commented May 8, 2015

There's a typo in there somewhere. We do have one or we don't?

@rufuspollock rufuspollock commented May 8, 2015

fixed - we are about to have one (literally a couple of days).

@tbpalsulich tbpalsulich commented May 8, 2015

@mattfullerton, no, I don't think it has Tesseract installed (just tried parsing the Google logo -- nothin').

No, I'm not looking for a lot of traffic. I intended the site as more of a quick demonstration of what Tika can do. I'm happy you guys found it useful!

See https://issues.apache.org/jira/browse/TIKA-1585 for a little more detail. The Tika server is running on a server donated by Rackspace. We use it for testing Tika against large corpuses. So, I don't want to overload it with requests.

@chrismattmann chrismattmann commented May 9, 2015

@tpalsulich we could probably contact Rackspace and ask them what they think about the traffic, etc. @rgrp @mattfullerton would be happy to join forces! :) FYI too I just completed http://github.com/chrismattmann/tika-python/ which entirely relies now on the REST server and exposes Translation, Language Detection and the full suite of things to make it really usable entirely as a Python library to Tika. So, great timing.

@tpalsulich if worst comes to worst, can't they just fork your code and run it on their OKFN servers?

@mattfullerton mattfullerton commented May 9, 2015

@chrismattmann We already had a Tika instance running (actually generously hosted by OKF Germany), just without an HTML upload button. @tpalsulich's frontend does that and I forked it yesterday: http://givemetext.okfnlabs.org/.

The Python stuff sounds amazing! If I ever get to the point of using the server for what I actually wanted it for (full text search for a CKAN instance), I might be able to make good use of it.

@tbpalsulich tbpalsulich commented May 9, 2015

@mattfullerton, awesome! I'm happy you like it. But, I'm getting a DNS lookup error when loading http://www.givemetext.okfnlabs.org/.

@mattfullerton mattfullerton commented May 10, 2015

Oops! No www.

@tbpalsulich tbpalsulich commented May 10, 2015

That was it. Working now. 👍

@chrismattmann chrismattmann commented May 10, 2015

woot 👍 great work @mattfullerton @tpalsulich !

@rufuspollock rufuspollock commented May 11, 2015

@mattfullerton theme is now ready for you to try - see http://okfnlabs.org/app-theme/ and https://github.com/okfn/app-theme. This is generic and you can adapt as you want.

@rufuspollock rufuspollock commented Jun 20, 2015

@mattfullerton some quick thoughts / suggestions:

Website tweaks:

  • Would it be worth adding some small API examples to front page somewhere below main form (perhaps also link to blog post ...)
  • Could we add a small tagline making clear what this is for the uninitiated e.g. "An easy to use free web service to extract text from PDFs and other documents - OCR support included!"
  • Can we have support for passing a URL to a file online as well as uploading?

@mattfullerton mattfullerton commented Jun 22, 2015

@rgrp All good ideas, especially the examples bit I wanted to do. I'm not really finished with the theme either, that was done very quickly. Just a bit swamped at the moment.

Regarding the source repo, in case it's not clear - my only contribution here is creating a Dockerfile that gets the slightly complicated Tika, and specifically Tika-server, up and running with OCR support built in. Tika is an Apache project: https://tika.apache.org/
My small contribution is here: https://github.com/mattfullerton/tika-tesseract-docker

Hence getting support for file URLs within the API (if not already there, I'll have to look) would require modifying Tika itself. I'll look into it but the Tika developers would have the final say; the alternative of course would be our own little micro man-in-the-middle service to download the link in the background.

@rufuspollock rufuspollock commented Jun 22, 2015

@mattfullerton got you re time and thanks for flagging the repo - i can open issues there going forward - and people can contribute there (e.g. i assume the front page code is there).

Re the url point: do you have to modify Tika? I thought Tika could take a "stream" like object - and you can just open a url as a stream. Worst case you cache the URL contents to disk as file and then load.
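
A small download-first helper along these lines might look like the sketch below. It is only an illustration: `check_url`, `fetch_document` and the size cap are hypothetical, and the bytes it returns would then be uploaded to the text service in a separate step. Fetching arbitrary URLs on a server has security implications, so this version at least refuses anything that is not plain http(s); a real deployment would also need to block internal addresses.

```python
import urllib.request
from urllib.parse import urlsplit

MAX_BYTES = 20 * 1024 * 1024  # hypothetical cap on document size


def check_url(url: str) -> str:
    """Accept only http(s) URLs with a host; reject file://, ftp:// and so on."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"refusing to fetch {url!r}")
    return url


def fetch_document(url: str, max_bytes: int = MAX_BYTES,
                   timeout: float = 30.0) -> bytes:
    """Download the document so its bytes can be passed on for extraction."""
    with urllib.request.urlopen(check_url(url), timeout=timeout) as response:
        data = response.read(max_bytes + 1)  # read one extra byte to detect overflow
    if len(data) > max_bytes:
        raise ValueError("document too large")
    return data
```

Keeping the download outside Tika means the extraction service itself never has to follow user-supplied links.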

@jaakkokorhonen jaakkokorhonen commented Jun 29, 2015

@rgrp any plans to make OCR work with Froide? We are getting blacked FOI documents into http://tietopyynto.fi, and are looking for a solution to read the scanned or bitmapped documents back into data.

@chrismattmann chrismattmann commented Aug 20, 2015

@rgrp @mattfullerton Tika can take a stream and yes you can open a URL as a stream to it. See: http://wiki.apache.org/tika/TikaJAXRS#Extracting_A_Document_From_A_URL

@cleder cleder commented Aug 26, 2015

maybe a bit OT but have a look at https://pypi.python.org/pypi/ocrmypdf

@mattfullerton mattfullerton commented Aug 28, 2015

@jaakkokorhonen - It may be of little help as it's a Ruby project, but the service is being used by kleineanfragen.de: https://github.com/robbi5/kleineanfragen/blob/master/app/jobs/extract_last_modified_from_paper_job.rb#L34

@mattfullerton mattfullerton commented Aug 28, 2015

@chrismattmann I'm running a very recent version of the latest Tika and can't get the URL processing to work. Would you happen to have a complete curl command for a URL that works? The example in the link is not for a real URL.

@robbi5 robbi5 commented Aug 28, 2015

@mattfullerton @chrismattmann Tika removed the URL processing feature again, see https://issues.apache.org/jira/browse/TIKA-1690

@mattfullerton mattfullerton commented Aug 28, 2015

@rgrp I think I've responded to all of your points with an update to http://givemetext.okfnlabs.org/ and a new blog post at http://okfnlabs.org/blog/2015/08/28/give-me-text.html. The URL thing could take a while (see above!).

@chrismattmann chrismattmann commented Aug 28, 2015

hey @mattfullerton @robbi5 yeah there was a pretty big security hole with that. Sorry posted it before we found the hole.

@rufuspollock rufuspollock commented Nov 28, 2015

FIXED. http://okfnlabs.org/blog/2015/08/28/give-me-text.html

@mattfullerton I think we can mark this as closed :-) The service is now operational and i regularly use it. Big well done to you for all your efforts here and creating a great, useful, service.

@chrismattmann chrismattmann commented Nov 29, 2015

woot! @mattfullerton @rgrp great work

@mattfullerton mattfullerton commented Dec 2, 2015

Thanks guys. Please post feature requests etc. on https://issues.apache.org/jira/browse/TIKA/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel if they are specific to the file->text or OCR integration (Tika), on https://github.com/mattfullerton/tika-tesseract-docker if specific to the packaging (unlikely), or on https://github.com/mattfullerton/TikaExamples for the web frontend.

@mattfullerton mattfullerton commented Mar 2, 2016

@nigelbabu - could you update the DNS records as follows:
givemetext.okfnlabs.org AAAA 2a01:4f8:201:5006::2
www.givemetext.okfnlabs.org A 148.251.0.14
www.givemetext.okfnlabs.org AAAA 2a01:4f8:201:5006::2

Thanks!
