Image Data Creeping into Plain Text #138
Comments
Just re-ran the plain text extractor and this is still an issue. These are images where I think the MIME type is erroneously set to HTML. Related to #163, is there any easy way to refine
I suppose we could run Tika MIME type detection, but this would slow processing down considerably...
Given some of the other stuff you're planning, I'd be surprised if Tika slows you down much. If you do decide to try it, your best bet is to include just tika-core (and not tika-parsers) on your classpath. In that case, the MIME type detection will not open up and parse container/complex formats. It will only do binary signature detection, which is enough to spot common image formats. The only 'trick' is to use a buffered stream wrapper so the Tika code can read the first few KB and you can then reset the stream pointer to the start of the payload.
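To make the buffered-stream 'trick' concrete, here is a minimal sketch that uses only the Java standard library. It is not Tika: the hand-rolled signature checks below are a stand-in for Tika's detector, and the class name `SignatureSniffer`, the method `sniffImageType`, and the 16-byte peek size are all assumptions made for illustration. The part that carries over is the mark/peek/reset pattern.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical helper (not part of Tika or of this project): peeks at the
// first bytes of a payload, then rewinds so the caller can read it in full.
class SignatureSniffer {
    // Number of leading bytes to inspect; also used as the mark read-ahead limit.
    static final int PEEK = 16;

    /** Returns a guessed image MIME type, or null if no known signature matches. */
    static String sniffImageType(BufferedInputStream in) throws IOException {
        in.mark(PEEK);                       // remember the start of the payload
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < PEEK) {                   // read up to PEEK bytes, tolerating short reads
            int r = in.read(head, n, PEEK - n);
            if (r < 0) break;
            n += r;
        }
        in.reset();                          // rewind to the start of the payload

        if (startsWith(head, n, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) return "image/png";
        if (startsWith(head, n, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (startsWith(head, n, 'G', 'I', 'F', '8')) return "image/gif";
        return null;
    }

    // True if the first sig.length bytes of data equal the given unsigned values.
    static boolean startsWith(byte[] data, int len, int... sig) {
        if (len < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((data[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeGif = {'G', 'I', 'F', '8', '9', 'a', 0, 0};
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeGif));
        System.out.println(sniffImageType(in));  // detected type
        System.out.println((char) in.read());    // stream is back at the first byte
    }
}
```

A record whose declared Content-Type is text/html but whose payload matches one of these signatures could then be skipped before text extraction. With tika-core on the classpath, the same mark and reset calls would surround the Tika detection call instead of the manual checks.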
Ah, I see you've tried it before. Your current implementation instantiates Tika on every request:
Tika isn't intended to be used this way, and it will be very slow (it re-parses the signature files etc. every time). You could try re-using a singleton Tika instance instead. I believe it's supposed to be thread-safe, but even if it isn't you could wrap it as a ThreadLocal singleton.
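A minimal sketch of the two re-use options mentioned above, again using only the standard library. `ExpensiveDetector` is a hypothetical stand-in for a detector that is costly to construct (such as a Tika instance); its constructor counter exists only to show that repeated calls do not build new instances. The names `DetectorHolder`, `SHARED`, and `PER_THREAD` are likewise assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class DetectorHolder {
    /** Hypothetical stand-in for a detector that is expensive to construct. */
    static final class ExpensiveDetector {
        // Counts constructions so the re-use can be observed.
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();

        ExpensiveDetector() {
            CONSTRUCTED.incrementAndGet();   // imagine signature files being parsed here
        }

        // Toy detection: recognizes only the JPEG signature FF D8 FF.
        String detect(byte[] payload) {
            boolean jpeg = payload.length >= 3
                    && (payload[0] & 0xFF) == 0xFF
                    && (payload[1] & 0xFF) == 0xD8
                    && (payload[2] & 0xFF) == 0xFF;
            return jpeg ? "image/jpeg" : "application/octet-stream";
        }
    }

    // Option 1: a single shared instance, appropriate if the detector is thread-safe.
    static final ExpensiveDetector SHARED = new ExpensiveDetector();

    // Option 2: one lazily created instance per thread, if thread safety is in doubt.
    static final ThreadLocal<ExpensiveDetector> PER_THREAD =
            ThreadLocal.withInitial(ExpensiveDetector::new);

    public static void main(String[] args) {
        byte[] jpegHead = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        for (int i = 0; i < 1000; i++) {
            PER_THREAD.get().detect(jpegHead);   // same instance on every iteration
        }
        System.out.println(ExpensiveDetector.CONSTRUCTED.get());
    }
}
```

Either way, the construction cost is paid once per JVM or once per thread rather than once per record.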
Just keeping this alive – was playing around with an
My sense is this is binary data sneaking into things?
Yes, as a workaround for now, add a
I'm not quite sure how to grab the record name, as I've got limited errors thrown. I'll put the gist here and maybe we can quickly chat about it today when I'm up in DC.
What's the script that you're running? Can you isolate which WARC file the error is coming from? That would be a start...
Some images have been sneaking into the extracted plain text, perhaps because (as per @anjackson) we are trusting the server-supplied Content-Type. The binary data breaks text analysis workflows. See figure below:
Current workaround is:
(or can be baked into workflows)