Opensource OCR Service (PDF / TIFF / Scan to Text Conversion Service) #88

Closed
rufuspollock opened this Issue Oct 31, 2014 · 62 comments

10 participants
@rufuspollock
Member

rufuspollock commented Oct 31, 2014

Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/

Note: for generic PDF to text (including but not necessarily OCR) - see #52 (simple pdf to text service)

Quote from Tim:

Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service.

OCRopus does layout analysis, splitting the image into lines/words. These split files are then sent to Tesseract for OCR and reassembled to create hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity.
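
To make the reassembly step a bit more concrete, here is a minimal Python sketch. It is not Tim's code: `LineResult` and `to_hocr` are hypothetical names, and it assumes each line has already been cropped and recognised, so all that is left is to put the per-line results back into reading order and emit hOCR `ocr_line` spans with their bounding boxes.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple


@dataclass
class LineResult:
    """One recognised text line (hypothetical structure, not from OCRopus)."""
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in page pixels
    text: str


def to_hocr(lines: List[LineResult]) -> str:
    """Reassemble per-line OCR results into a minimal hOCR page fragment."""
    # Workers may finish out of order, so restore reading order first:
    # top-to-bottom, then left-to-right.
    ordered = sorted(lines, key=lambda line: (line.bbox[1], line.bbox[0]))
    spans = []
    for index, line in enumerate(ordered, start=1):
        x0, y0, x1, y1 = line.bbox
        spans.append(
            f"<span class='ocr_line' id='line_{index}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(line.text)}</span>"
        )
    return "<div class='ocr_page'>\n" + "\n".join(spans) + "\n</div>"
```

Sorting by bounding box only works for single-column pages; a real layout analysis step would carry its own reading order.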

@pudo

pudo Oct 31, 2014

Member

Other libs:

PDF2 Text

  • Poppler utils (pdftohtml)
  • Apache Tika / Apache Stanbol
  • Tabula (I think it uses Tika?)
@mattfullerton

mattfullerton Nov 14, 2014

I'm going to start building this next week. This is my attempt to pull together the various suggestions/ideas so far.

The question is what to start with, as there is so much out there. But as my primary interest is setting something up that can extract text from as many formats as possible, and also be accessible from multiple projects, it seems like wrapping textract in a web service is the best place to start. And it seems that nobody has done this yet. It might be good to start with Celery from the outset so we can build capacity later, given that we will want parallel jobs in any case.

@pudo I'll try to bear in mind the philosophy in centipede and flesh out the API as I go.
@pudo There are thoughts on integrating Tika into textract as an additional processing method (deanmalmgren/textract#12)

@cleder If detecting no text and handing off the images to tesseract is not handled well by textract or tricky to implement, the code from the Plone work might be really helpful.

I think at a later stage we can add OCRopus as per @timClicks work to improve quality. @timClicks if you can contribute any of your code, that would be great.

Textract/Language choice

There is a textract for Python, which @pudo mentions. I like Python. Although I've worked with Flask, I do not have a lot of experience building web services with Python. There seems to be good support for using Flask or Django with celery, I'm sure the same goes for Pyramid. There is also a separate but identically named node.js module: https://github.com/dbashford/textract. I have worked with nodejs/express before and was tempted to fork webshot (https://github.com/okfn/webshot) as a starting point which is also nodejs (but see stuff below: I'm tempted to go with Python if webshot is the wrong starting point). Any strong opinions either way? The list of supported formats (https://github.com/dbashford/textract#currently-extracts vs. http://textract.readthedocs.org/en/latest/#currently-supporting) is similar (the big difference being text from sound formats, but do we need that!?). Neither is handling OCR for PDFs, but both offer it for images, so we may have to tweak that part for the case that pdf conversion returns no intelligible text, or offer it as an option (see comment above).

Slow REST

That being said about webshot, we are probably going to need something that returns a reference to a job that can be queried giving the status of the conversion, and potentially (embracing @pudo's efforts to frame services as part of a pipeline) multiple job stages (@pudo, correct me if I'm misunderstanding), so it may not be the best starting point. Do we know of any nice Python or node projects that implement such an asynchronous API that could be used as a starting point? I've seen some nice references on patterns in general but haven't looked for an example project yet.
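
As a rough illustration of the job-reference idea (not an existing project), the core of such an API can be sketched with the Python standard library alone. `JobStore` is a hypothetical name, the in-memory dict stands in for a real broker such as Celery with Redis, and the web layer that would expose `submit` and `status` as HTTP endpoints is left out.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobStore:
    """Hands out job ids immediately and lets clients poll for the result."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, convert: Callable[[bytes], str], payload: bytes) -> str:
        """Queue a conversion and return a reference the client can query."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(convert, payload)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report the job state; include the text once the job is done."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "text": future.result()}
```

A client would POST a file, get the job id back straight away, and then poll a status URL until the state is `done` or `failed`. Multiple job stages could be modelled as extra states between `pending` and `done`.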

Security

Is our primary aim to produce a stack that people have to install themselves and can secure whatever way they like (seems almost a pity when we go to the effort to create a web service) or a publicly accessible service (like webshot)? The latter will require some request limiting and maybe the distribution of API keys, given how resource-intensive the operations could be.
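
If we go the public route, request limiting per API key could look something like the token bucket below. This is only a sketch under assumptions: `TokenBucketLimiter` is a hypothetical name, the capacity and refill rate are placeholders, and state is kept in memory, so it would not work across multiple worker processes as written.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-key token bucket: each request spends one token, tokens refill over time."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, api_key: str) -> bool:
        """Return True if this key may make a request right now."""
        now = self._clock()
        tokens, last_seen = self._buckets.get(api_key, (float(self.capacity), now))
        # Refill according to elapsed time, never beyond the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last_seen) * self.refill_per_second)
        if tokens < 1:
            self._buckets[api_key] = (tokens, now)
            return False
        self._buckets[api_key] = (tokens - 1, now)
        return True
```

For OCR jobs it may make more sense to charge tokens per page than per request, since a single large scan can cost far more than many small ones.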

@mattfullerton

mattfullerton Nov 14, 2014

OK, there was one very important suggestion I missed, that we just set up an instance of the Data Science Toolkit (https://github.com/petewarden/dstk)

It is Ruby, and doesn't support the wealth of formats that textract does. Scalability would be done by load balancing to multiple instances.

Specifically it's the 'file2text' API that would interest us, handled here:
https://github.com/petewarden/dstk/blob/595e4b51261db715af4e71a5be0f37e0ecd75ab6/dstk_server.rb#L1114

@rufuspollock

rufuspollock Nov 15, 2014

Member

@mattfullerton i think nodejs has some nice benefits (e.g. the async setup means when you deploy on e.g. heroku you can serve many clients at once - one request won't block the system) which could be esp relevant here.

However, my guess is that this will be driven by the libraries available. Given that you have got textract in node (though it is just a wrapper on command line utilities) it may be worth going with that - though note we'll have to deploy on a "proper" machine not heroku if we want all those utilities (but we can use labs machines).

Lastly: am I right that textract python and textract node are pretty much identical in functionality? (My guess is that python one may be slightly stabler and better (??))

I have to say webshot might be quite nice as inspiration for the UI and API.

Last 2c: is it worth trying to write out some explicit user stories - even if obvious. I always find this invaluable :-)

@pudo

pudo Nov 17, 2014

Member

Just to be clear: we're talking about the OCR bit only, right? It'd be very cool to use this to work out bits of the API for the centipede API (ping @malev). One thing in particular is this: is it more useful to hand around the actual documents, or a link to the documents on an S3 bucket? Obviously pushing around the documents is simpler, but using references makes it more lightweight.

I'm very interested to see a good implementation of slow REST (e.g. job references) vs. long waits (e.g. node running stuff synchronously and letting you wait on the line) - both have merits, I'd like to know which one is nicer in practice :)

@mattfullerton

mattfullerton Nov 17, 2014

We're talking about file to text, including OCR if necessary, and not necessarily about general pipeline/document management.

I had already come to the conclusion that I want to try this with Python/Flask+Celery/Redis when I saw you (@pudo) have already started with that combination for centipede. I've forked that repo to get started and will try and build on the existing ideas for the API to create something useful also for other document operations.

I guess both slow-REST and long waits inflict some pain on the 'user' (client side developer). But if we're going with Python and wouldn't have been using Heroku for node anyway, I think we should try slow-REST first.

@chrismattmann

chrismattmann Nov 28, 2014

Hi Guys, just FYI on this. Apache Tika provides a wrapped version of Tesseract, as a web service. See: http://wiki.apache.org/tika/TikaOCR

@rufuspollock

rufuspollock Dec 28, 2014

Member

@mattfullerton any updates here from your end?

@mattfullerton

mattfullerton Dec 30, 2014

I was working on extending https://github.com/OpenNewsLabs/centipede, but ran out of time due to other project priorities. I like the pipeline/task concept.

But I think it would be easier if I just set up an instance of tika-server for us to test. Ping me again in a week if I haven't done that yet. It looks great:
http://wiki.apache.org/tika/TikaJAXRS#Tika_Resource
(link doesn't work for me)
http://webcache.googleusercontent.com/search?q=cache:MC8ekfYmifcJ:wiki.apache.org/tika/TikaJAXRS+&cd=1&hl=en&ct=clnk&gl=de

@mattfullerton

mattfullerton Jan 6, 2015

I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika

Test by doing things like:

curl -T multipage_tiff_example.tif http://beta.offenedaten.de:9998/tika

Fuller instructions here:
https://wiki.apache.org/tika/TikaOCR
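
For anyone calling the service from code rather than curl, the same upload can be sketched in Python with only the standard library. The function names are hypothetical, the default endpoint assumes a Tika server listening locally on port 9998, and the `Accept: text/plain` header is there to ask for plain text back.

```python
import urllib.request


def build_tika_put(body: bytes, endpoint: str) -> urllib.request.Request:
    """Build the PUT request that `curl -T <file> <endpoint>` would send."""
    return urllib.request.Request(
        endpoint,
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, endpoint: str = "http://localhost:9998/tika",
                 timeout: float = 120.0) -> str:
    """Upload a local file and return whatever text the server extracts."""
    with open(path, "rb") as handle:
        request = build_tika_put(handle.read(), endpoint)
    # OCR on multi-page scans can be slow, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of `extract_text("multipage_tiff_example.tif")`, with the endpoint swapped for whichever instance you are testing against.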

You can run your own using Docker by doing:

sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika

I'm very open to improvements to the Docker build files, I am no expert there.

What is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
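
One possible client-side stopgap, sketched below in Python, is a crude heuristic: treat the extraction as failed if almost nothing came back or if most of what came back is not letters or digits, and only then hand the file to OCR. The function names and both thresholds are assumptions picked for illustration, not anything Tika provides, and `extract_text` / `run_ocr` are placeholders for whatever calls do the real work.

```python
from typing import Callable


def looks_like_failed_extraction(text: str, min_chars: int = 50,
                                 min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether plain PDF text extraction produced nothing usable."""
    # Ignore all whitespace (including form feeds between pages).
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


def extract_with_fallback(pdf_bytes: bytes,
                          extract_text: Callable[[bytes], str],
                          run_ocr: Callable[[bytes], str]) -> str:
    """Try normal extraction first; fall back to OCR if the result looks empty."""
    text = extract_text(pdf_bytes)
    if looks_like_failed_extraction(text):
        return run_ocr(pdf_bytes)
    return text
```

A genuinely short document (a one-line cover page, say) would be sent to OCR unnecessarily with these thresholds, so they would need tuning against real files.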

@chrismattmann

chrismattmann Jan 25, 2015

Hey @mattfullerton good work - we're still working through MultiCompositeParsers in Tika (having multiple for a single type instead of our AutoDetect algorithm which picks the best one). We did a work around in Tika 1.7 and 1.8-dev (so far) to combine the ImageExtractor for metadata and then call Tesseract on images. However, for PDF if you want Tesseract to be called, you can always override the declared Mime types for the parser and/or sub-class it and rebuild Tika to get it to work on PDFs.

@mattfullerton

mattfullerton Jan 31, 2015

@chrismattmann Thanks for the tips. Concretely, does that mean that with some passed config there will be support for using tesseract on PDFs instead of the default PDF parser (i.e. client detects if OCR is needed)? Or do you intend to go further and detect the lack of text in the PDF internally (i.e. server detects if OCR is needed)?

@rufuspollock

rufuspollock Feb 2, 2015

Member

@mattfullerton just want to say this is really excellent - and do ping the labs list to let them know of your progress (and would you like to do a quick blog post?)

@pudo

pudo Feb 2, 2015

Member

Hi all, just wanted to share a quick update on the document processing pipeline I've been working on, which consists of docpipe (a document processing tool with configurable pipelines) and barn (an OFS knock-off with a slightly more comprehensive API, also used in the openspending S3 data storage branch).

I've invested quite a lot of time into both recently, making sure barn runs against S3 which should be good in terms of the original centipede idea of having pipeline components run on different hosts but access the same virtual data store. At the same time, I've hacked up docpipe to have full support for textract (which does roughly the same thing as Tika, in Python).

All of this is the backend to an app called aleph which I'm using to allow journos to search and tag documents. The whole pipeline is a bit slow, but getting there.

Would be cool to see if there are any docking points?

@mattfullerton

mattfullerton Feb 3, 2015

@rgrp I made a post to the list at the time: https://lists.okfn.org/pipermail/okfn-labs/2015-January/001548.html
I'll work on a blog post

@pudo That's great that things are moving forward with the pipeline approach and that it includes textract. Am I right that what is still missing is the web api? Or maybe I missed it.

@rufuspollock

rufuspollock Feb 21, 2015

Member

This is fantastic @mattfullerton - will tweet out more monday!

I'd also like to offer a nice url for the service e.g. tika.okfnlabs.org (if we can think of an even cooler subdomain let me know!). This would not require a move of server - just configuring apache/nginx at your end and setting up DNS for the subdomain with open knowledge sysadmins.

wdyt?

@todrobbins

todrobbins Feb 22, 2015

Does Tika offer JP2 support? Just curious about other archival image types.

@mattfullerton

mattfullerton Feb 23, 2015

@rgrp Good idea, as long as @ddie has no objections. The alternative of course is to try out the docker image on a labs machine. ATM there is nothing to set up at our end (although nginx would let us drop the port). tika.okfnlabs is fairly clear, but we could also go for something fun like text. or givemetext. or x2text...

@todrobbins Yes: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml (search for jpeg or jp2)

@rufuspollock

rufuspollock Feb 23, 2015

Member

givemetext.okfnlabs.org sounds great. a docker image also sounds really great - but requires other work so let's start with dns.

@rufuspollock

rufuspollock Mar 16, 2015

Member

@mattfullerton shall we ask @nigelbabu to set up the dns for this and you put in the ServerName alias in Apache/Nginx? I'm sure @ddie has no objections re the domain name.

@nigelbabu

nigelbabu Mar 16, 2015

This has already been requested and set up.

@rufuspollock

rufuspollock Mar 16, 2015

Member

@nigelbabu awesome! @mattfullerton you need to set the server alias - http://givemetext.okfnlabs.org/ is still offenedaten ;-)

@mattfullerton

mattfullerton Mar 16, 2015

There's not a whole lot I can do about that given that there is no web
front end (yet). On the Tika port it works.

@rufuspollock

rufuspollock Mar 17, 2015

Member

@mattfullerton not sure I understand. You just need to do a reverse proxy from givemetext.okfnlabs.org on port 80 to your thing running on whatever port you have. Is the site using nginx or apache as main webserver? if you let us know we can help.

@mattfullerton

mattfullerton Mar 20, 2015

I think it's nginx - I was just doubting the logic of putting the thing on port 80. Right now these are the two possibilities for showing people when they arrive at givemetext.okfnlabs.org:
http://beta.offenedaten.de:9998/
http://beta.offenedaten.de:9998/tika

I haven't promoted the service anywhere as anything other than a listener on that port for PUT requests. And if someone is using it in that way it's probably irrelevant what port is in use. If you think it adds value I can add the proxy, but I would rather serve up a simple one-page app on port 80 that allows uploading a file to the service and shows the returned text.

@rufuspollock

rufuspollock Mar 20, 2015

Member

Serving it on port 80 is great if you are happy to do that - makes life easier. I'd also serve /tika at the base location if that's where the action is.

@mattfullerton

mattfullerton Mar 20, 2015

That would be logical, just a pity the text there is so boring at present ;-)

@mattfullerton


OK, Done

@rufuspollock

rufuspollock May 7, 2015

Member

@mattfullerton any further thoughts about the nicer front page?

@mattfullerton

mattfullerton May 7, 2015

I started building a little Angular App to do the upload and show the result a while ago, and then I got busy :) Will get back to it soon...

@chrismattmann

chrismattmann May 8, 2015

FYI @tpalsulich built a Tika REST upload page

@tbpalsulich

tbpalsulich May 8, 2015

See http://tpalsulich.github.io/TikaExamples/. You can upload a file and see what text Tika pulls out.

@rufuspollock

rufuspollock May 8, 2015

Member

@chrismattmann / @tpalsulich that's great - @mattfullerton has already built http://givemetext.okfnlabs.org/ (see above part of the thread). Perhaps we can join forces?

@mattfullerton

mattfullerton May 8, 2015

I will use this instead. Have to update our tika instance so that it supports POSTing instead of PUTing.

@tpalsulich - Does your instance also include tesseract/ocr, and are you looking for traffic? We could include it as a backup server.

@mattfullerton

mattfullerton May 8, 2015

Done. The Docker image is now on Tika 1.9, and I added CORS for the service as well so that other web apps can use it.

http://givemetext.okfnlabs.org/ - Web UI, proxied from http://mattfullerton.github.io/TikaExamples/
http://givemetext.okfnlabs.org/tika - proxied from http://givemetext.okfnlabs.org:9998/tika (different from before, where / was proxied to this)

http://givemetext.okfnlabs.org:9998 is open as before but without CORS header

Some kind of friendly instructions on the page like on http://webshot.okfnlabs.org/ would be nice to have.
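
A minimal sketch of calling the `/tika` endpoint listed above from Python. It assumes the instance behaves like a stock Tika server, where the raw document bytes are sent with PUT and plain text comes back when `Accept: text/plain` is set. The helper names `build_tika_request` and `extract_text` are made up for illustration, and the hosted instance may no longer be reachable.

```python
import urllib.request

# Assumption: endpoint copied from the comment above; swap in your own Tika server.
TIKA_URL = "http://givemetext.okfnlabs.org/tika"


def build_tika_request(document: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        tika_url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text rather than XHTML
    )


def extract_text(path: str, tika_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Upload the file at `path` and return whatever text the server extracts."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Something like `print(extract_text("scan.pdf"))` would then print the extracted text, where `scan.pdf` is a placeholder file name. OCR of scanned pages can be slow, so the timeout is deliberately generous.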

@rufuspollock

rufuspollock May 8, 2015

Member

awesome - and we are about to have a standard labs bootstrap theme we can apply to make it look swish ;-)

@mattfullerton

mattfullerton May 8, 2015

There's a typo in there somewhere. We do have one or we don't?

@rufuspollock

rufuspollock May 8, 2015

Member

fixed - we are about to have one (literally a couple of days).

@tbpalsulich

tbpalsulich May 8, 2015

@mattfullerton, no, I don't think it has Tesseract installed (just tried parsing the Google logo -- nothin').

No, I'm not looking for a lot of traffic. I intended the site as more of a quick demonstration of what Tika can do. I'm happy you guys found it useful!

See https://issues.apache.org/jira/browse/TIKA-1585 for a little more detail. The Tika server is running on a server donated by Rackspace. We use it for testing Tika against large corpuses. So, I don't want to overload it with requests.

@chrismattmann

chrismattmann May 9, 2015

@tpalsulich we could probably contact Rackspace and ask them what they think about the traffic, etc. @rgrp @mattfullerton would be happy to join forces! :) FYI too I just completed http://github.com/chrismattmann/tika-python/ which entirely relies now on the REST server and exposes Translation, Language Detection and the full suite of things to make it really usable entirely as a Python library to Tika. So, great timing.

@tpalsulich if worst comes to worst, can't they just fork your code and run it on their OKFN servers?

@mattfullerton

mattfullerton May 9, 2015

@chrismattmann We already had a Tika instance running (actually generously hosted by OKF Germany), just without an HTML upload button. @tpalsulich's frontend does that and I forked it yesterday: http://givemetext.okfnlabs.org/.

The Python stuff sounds amazing! If I ever get to the point of using the server for what I actually wanted it for (full text search for a CKAN instance), I might be able to make good use of it.

@tbpalsulich

tbpalsulich May 9, 2015

@mattfullerton, awesome! I'm happy you like it. But, I'm getting a DNS lookup error when loading http://www.givemetext.okfnlabs.org/.

@mattfullerton

mattfullerton May 10, 2015

Oops! No www.

@tbpalsulich

tbpalsulich May 10, 2015

That was it. Working now. 👍

@chrismattmann

chrismattmann May 10, 2015

woot 👍 great work @mattfullerton @tpalsulich !

@rufuspollock referenced this issue in okfn/jekyll-template May 11, 2015: "Places to apply our generic theme when ready" #4 (open, 0 of 3 tasks complete)

@rufuspollock

rufuspollock May 11, 2015

Member

@mattfullerton theme is now ready for you to try - see http://okfnlabs.org/app-theme/ and https://github.com/okfn/app-theme. This is generic and you can adapt it as you want.

@rufuspollock

rufuspollock Jun 20, 2015

Member

@mattfullerton some quick thoughts / suggestions:

Website tweaks:

  • Would it be worth adding some small API examples to front page somewhere below main form (perhaps also link to blog post ...)
  • Could we add a small tagline making clear what this is for the uninitiated e.g. "An easy to use free web service to extract text from PDFs and other documents - OCR support included!"
  • Can we have support for passing a URL to a file online as well as uploading?
@mattfullerton

mattfullerton Jun 22, 2015

@rgrp All good ideas, especially the examples bit I wanted to do. I'm not really finished with the theme either, that was done very quickly. Just a bit swamped at the moment.

Regarding the source repo, in case it's not clear - my only contribution here is creating a Dockerfile that gets the slightly complicated Tika, and specifically Tika-server, up and running with OCR support built in. Tika is an Apache project: https://tika.apache.org/
My small contribution is here: https://github.com/mattfullerton/tika-tesseract-docker

Hence getting support for file URLs within the API (if not already there, I'll have to look) would require modifying Tika itself. I'll look into it but the Tika developers would have the final say; the alternative of course would be our own little micro man-in-the-middle service to download the link in the background.
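
The "micro man-in-the-middle service" idea could be sketched roughly as follows: download the linked file ourselves, then hand the bytes to Tika as an ordinary upload, so Tika itself never has to fetch anything. This is only an illustration under assumptions: the function names are hypothetical, the endpoint is the one mentioned in this thread, and the `opener` parameter exists purely so the network calls can be swapped out.

```python
import urllib.request
from urllib.parse import urlsplit

# Assumption: endpoint taken from this thread; point it at your own Tika server.
TIKA_URL = "http://givemetext.okfnlabs.org/tika"
ALLOWED_SCHEMES = {"http", "https"}


def check_remote_url(url: str) -> str:
    """Reject anything that is not a plain http(s) URL with a host name."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        raise ValueError(f"refusing to fetch {url!r}")
    return url


def extract_text_from_url(url, tika_url=TIKA_URL,
                          opener=urllib.request.urlopen, timeout=60.0):
    """Fetch `url` in the background, then forward its bytes to Tika."""
    check_remote_url(url)
    # Step 1: download the remote document on the caller's behalf.
    with opener(url, timeout=timeout) as download:
        document = download.read()
    # Step 2: send the bytes to Tika exactly like a local file upload.
    upload = urllib.request.Request(
        tika_url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with opener(upload, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The scheme check is only a starting point. A public-facing proxy would also need to decide which hosts it is willing to contact and how large a download it will accept.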

@rufuspollock

rufuspollock Jun 22, 2015

Member

@mattfullerton got you re time and thanks for flagging the repo - i can open issues there going forward - and people can contribute there (e.g. i assume the front page code is there).

Re the url point: do you have to modify Tika? I thought Tika could take a "stream" like object - and you can just open a url as a stream. Worst case you cache the URL contents to disk as a file and then load it.

@jaakkokorhonen

jaakkokorhonen Jun 29, 2015

@rgrp any plans to make OCR work with Froide? We are getting blacked-out FOI documents into http://tietopyynto.fi, and are looking for a solution to read the scanned or bitmapped documents back into data.

@jaakkokorhonen referenced this issue in okffi/tietopyynto Jun 29, 2015: "Testaa OCR" #64 (open)

@chrismattmann

chrismattmann Aug 20, 2015

@rgrp @mattfullerton Tika can take a stream and yes you can open a URL as a stream to it. See: http://wiki.apache.org/tika/TikaJAXRS#Extracting_A_Document_From_A_URL

@cleder

cleder Aug 26, 2015

maybe a bit OT but have a look at https://pypi.python.org/pypi/ocrmypdf

@mattfullerton

mattfullerton Aug 28, 2015

@jaakkokorhonen - It may be of little help as it's a Ruby project, but the service is being used by kleineanfragen.de: https://github.com/robbi5/kleineanfragen/blob/master/app/jobs/extract_last_modified_from_paper_job.rb#L34

@mattfullerton

mattfullerton Aug 28, 2015

@chrismattmann I'm running a very recent version of the latest Tika and can't get the URL processing to work. Would you happen to have a complete curl command for a URL that works? The example in the link is not for a real URL.

@robbi5

robbi5 commented Aug 28, 2015

@mattfullerton @chrismattmann Tika removed the URL processing feature again, see https://issues.apache.org/jira/browse/TIKA-1690

@mattfullerton

mattfullerton Aug 28, 2015

@rgrp I think I've responded to all of your points with an update to http://givemetext.okfnlabs.org/ and a new blog post at http://okfnlabs.org/blog/2015/08/28/give-me-text.html. The URL thing could take a while (see above!).

@chrismattmann

chrismattmann Aug 28, 2015

hey @mattfullerton @robbi5 yeah there was a pretty big security hole with that. Sorry posted it before we found the hole.

@rufuspollock

rufuspollock Nov 28, 2015

Member

FIXED. http://okfnlabs.org/blog/2015/08/28/give-me-text.html

@mattfullerton I think we can mark this as closed :-) The service is now operational and i regularly use it. Big well done to you for all your efforts here and creating a great, useful, service.

@chrismattmann

woot! @mattfullerton @rgrp great work

@mattfullerton

mattfullerton Dec 2, 2015

Thanks guys. Please post feature requests etc. on https://issues.apache.org/jira/browse/TIKA/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel if they are specific to the file->text or OCR integration (Tika), on https://github.com/mattfullerton/tika-tesseract-docker if specific to the packaging (unlikely), or on https://github.com/mattfullerton/TikaExamples for the web frontend.

@mattfullerton

mattfullerton Mar 2, 2016

@nigelbabu - could you update the DNS records as follows:
givemetext.okfnlabs.org AAAA 2a01:4f8:201:5006::2
www.givemetext.okfnlabs.org A 148.251.0.14
www.givemetext.okfnlabs.org AAAA 2a01:4f8:201:5006::2

Thanks!
