This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

Image Data Creeping into Plain Text #138

Open
ianmilligan1 opened this issue Jul 15, 2015 · 8 comments
@ianmilligan1
Collaborator

Some images have been sneaking into the extracted plain text, perhaps because (as per @anjackson) we are trusting the server-supplied Content-Type. The binary data throws off or breaks text analysis workflows. See figure below:

[Figure: screenshot of the error message]

Current workaround is:

strings [input-file] > [output-file]

(or can be baked into workflows)

@ianmilligan1
Collaborator Author

Just re-ran the plain text extractor and this is still an issue. These are images whose MIME type, I think, is erroneously set to HTML. Related to #163: is there any easy way to refine keepValidPages further?

@lintool
Owner

lintool commented Nov 24, 2015

I suppose we could run Tika MIME type detection, but this would slow processing down considerably...

@anjackson
Contributor

Given some of the other stuff you're planning, I'd be surprised if Tika slows you down much. If you do decide to try it, your best bet is to just include tika-core (and not tika-parsers) on your classpath. In that case, the MIME-type detection will not open up and parse container/complex formats. It will just do binary signature detection, which is enough to spot common image formats. The only 'trick' is to use a buffered stream wrapper so the Tika code can parse the first few KB and you can then reset the stream pointer to the start of the payload.
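The buffered-stream "trick" described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `SniffSketch`, `sniff`, and the three hard-coded signatures are hypothetical stand-ins for Tika's signature detection, not Tika's or warcbase's actual code. The point it shows is the mark/peek/reset sequence that leaves the payload readable from the start afterwards.

```scala
import java.io.InputStream

// Hypothetical helper: a hand-rolled stand-in for signature-based detection.
object SniffSketch {
  // A few well-known image signatures (illustrative, far from complete).
  private val signatures: Seq[(Seq[Int], String)] = Seq(
    Seq(0x89, 0x50, 0x4E, 0x47) -> "image/png",
    Seq(0xFF, 0xD8, 0xFF)       -> "image/jpeg",
    Seq(0x47, 0x49, 0x46, 0x38) -> "image/gif"
  )

  /** Peeks at the first bytes of `in`, then rewinds it to where it started. */
  def sniff(in: InputStream, limit: Int = 8192): Option[String] = {
    require(in.markSupported(), "wrap the payload in a BufferedInputStream first")
    in.mark(limit) // remember the current position
    try {
      // Read at most 8 bytes as unsigned ints, stopping early at end of stream.
      val head = Iterator.continually(in.read()).take(8).takeWhile(_ != -1).toList
      signatures.collectFirst { case (sig, mime) if head.startsWith(sig) => mime }
    } finally in.reset() // rewind so downstream code sees the whole payload
  }
}
```

A caller would wrap the record payload in a `java.io.BufferedInputStream`, call `sniff`, and skip text extraction when an image type comes back; with tika-core on the classpath, the signature table would presumably be replaced by Tika's own detection.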

@anjackson
Contributor

Ah, I see you've tried it before. Your current implementation instantiates Tika on every request:

val mimetype = new Tika(detector, parser).detect(is)

Tika isn't intended to be used this way, and will be very slow (as it's re-parsing the signature files etc. every time). You could try re-using a singleton Tika instance instead. I believe it's supposed to be thread-safe, but even if it isn't you could wrap it as a ThreadLocal singleton.
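A minimal sketch of the ThreadLocal-singleton idea follows. To keep it self-contained it does not use Tika: `CostlyDetector`, `DetectorStats`, and `SharedDetector` are hypothetical names, and `CostlyDetector` merely stands in for an object that is expensive to construct. In real code the `initialValue` body would presumably build the Tika instance instead.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Counts constructions so the reuse is observable (illustration only).
object DetectorStats { val constructed = new AtomicInteger(0) }

// Hypothetical stand-in for a detector that is costly to construct.
final class CostlyDetector {
  DetectorStats.constructed.incrementAndGet()

  def detect(head: Array[Byte]): String =
    if (head.length >= 2 && (head(0) & 0xFF) == 0xFF && (head(1) & 0xFF) == 0xD8) "image/jpeg"
    else "application/octet-stream"
}

object SharedDetector {
  // One detector per thread, built lazily on first use and reused afterwards.
  private val perThread = new ThreadLocal[CostlyDetector] {
    override def initialValue(): CostlyDetector = new CostlyDetector
  }

  def detect(head: Array[Byte]): String = perThread.get().detect(head)
}
```

Calling `SharedDetector.detect` repeatedly on one thread constructs the detector once. If the real detector does turn out to be thread-safe, a plain `val` inside an `object` would be simpler than the ThreadLocal.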

@ianmilligan1
Collaborator Author

Just keeping this alive – was playing around with an ExtractEntities call on another test collection and crashed on:

Unparseable header line: [?slÑ???r???]QQoGâ?XyÚ  6?YÛ¤¶i·J­Ö¤Rö ?âØÌ6M·_¿Ï?©Òd ÙçóÝ}Çábw?Ó
                                                                              Á?vxÿg? ¹©¥ju?\iå
                                                                                               r3Ρ5wR«Øsp+EßØùí;à$
  fPLÓ=É?ÍYJ~Î
              ¨µiðí¡Ï£Ql eÃÖ´â<?+}Ê?å<Áë*K1 ©Òª&Yöu?¸½Pü?.  ½WIhIÓÈǺ¿3ñ²×õ3ÞÈNXÇ8Øxgä#¢S­µV`÷¡£s?ãJVj<Ej£ÖÄþÙô)o ÉzIHù"                                                   Ðt
iÙC§Ö5·.?6 ®´³ Õ=·VÖapP<Av¸X`Ì£uX¼Â8^fáÁw??ñ{Væ
                                               A6pr?¦y~C?¢?¥éd¸jô ¹|;Ïÿ¼?_þï ®çá¾ë ´éb` å°1]-?Èwé×? 1br??? ????????] (Offset 11).

My sense is this is binary data sneaking into things?

@lintool
Owner

lintool commented Dec 16, 2015

Yes, as a workaround for now, add a .filter(...) and exclude that page by hand?
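For illustration, such a filter might look like the sketch below. It runs on a plain Scala `Seq` rather than the project's record collection, and `Page`, `PageFilters`, `looksBinary`, and the 5% threshold are all assumptions made up for this example. It combines a hand-maintained exclusion set with a crude "looks binary" heuristic.

```scala
// Hypothetical record shape; the real records carry more fields.
final case class Page(url: String, text: String)

object PageFilters {
  /** True when a large share of characters are control or replacement characters. */
  def looksBinary(text: String, threshold: Double = 0.05): Boolean =
    if (text.isEmpty) false
    else {
      val suspicious = text.count { c =>
        c == '\uFFFD' || (c.isControl && !c.isWhitespace)
      }
      suspicious.toDouble / text.length > threshold
    }

  /** Drops pages listed by hand in `excluded` and pages whose text looks binary. */
  def keepTextPages(pages: Seq[Page], excluded: Set[String]): Seq[Page] =
    pages.filter(p => !excluded(p.url) && !looksBinary(p.text))
}
```

The same predicate could be passed to a `.filter(...)` on the real collection, assuming each record exposes its URL and extracted text.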

@ianmilligan1
Collaborator Author

I'm not quite sure how to grab the record name, as I've got limited errors thrown. I'll put the gist here and maybe we can quickly chat about it today when I'm up in DC.

https://gist.github.com/ianmilligan1/8822295cf487b98d083e

@lintool
Owner

lintool commented Dec 17, 2015

What's the script that you're running? Can you isolate which WARC file the error is coming from? That would be a start...
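One way to narrow this down, sketched under assumptions, is to run the job on each WARC file separately and record which ones throw. `IsolateFailures` and `failing` are hypothetical names, and `process` stands for whatever per-file job is being run (for example the entity extraction call), which this sketch does not try to reproduce.

```scala
import scala.util.{Failure, Try}

object IsolateFailures {
  /**
   * Runs `process` on each path on its own and returns the paths that threw,
   * paired with the exception message.
   */
  def failing[A](paths: Seq[String])(process: String => A): Seq[(String, String)] =
    paths
      .map(path => path -> Try(process(path)))
      .collect { case (path, Failure(e)) => path -> String.valueOf(e.getMessage) }
}
```

Running this over the collection's file list is slower than a single job, but it turns an anonymous "Unparseable header line" into a short list of files to inspect.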
