Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274
Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)
java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):
imageMap.add(current);
if (separatePages) {
    outputDir = Paths.get(path + "/tika-extracted/page-" + String.valueOf(pageNum + 1));
This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:
`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png
page-6 needs to be under the UUID directory.
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):
testTrack = tracks.get(3);
assertEquals("4", testTrack.getDetectionProperties().get("PAGE_NUM"));
assertTrue(testTrack.getDetectionProperties().get("SAVED_IMAGES").contains("image0.emf"));
Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.
For clarity, please use a different image in the pptx file to test out different image formats.
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):
// Test extraction of images from *.docx format.
// Six extracted images should be present in one track:
Do you mean "eight" extracted images?
When I look at the 8 images extracted, I see:
- 3 for the large flower
- 3 for the pink flowers
- 1 for the mountain
- 1 for the flower drawing
However, that's not what's shown in the doc. The doc shows:
- 4 for the large flower
- 2 for the pink flowers
- 1 for the mountain
- 1 for the flower drawing
Do you know why? It's a bit confusing.
java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):
Public Domain
# test-tika-image-extraction.pdf
Is this supposed to be test-tika-image-extraction.doc?
That file also contains:
# bloom-blossom-flora-40797.jpg
https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
Public Domain
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Also, more generally, can we avoid extracting the same image file more than once?
Reviewable status: 2 of 8 files reviewed, 4 unresolved discussions (waiting on @jrobble)
java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:
`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png
page-6 needs to be under the UUID directory.
Fixed, thanks for catching that. I've updated the outputDir with the uniqueId tag.
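For reference, the directory layout requested above could be built along these lines. This is only a minimal sketch, not the component's actual code: the class name OutputDirSketch and the helpers commonDir and pageDir are hypothetical, and the only things taken from the thread are the tika-extracted, common, and page-N directory names and the zero-based pageNum.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; names are illustrative, not from the PR.
final class OutputDirSketch {

    // Every directory for one job lives under <base>/tika-extracted/<uniqueId>.
    static Path jobDir(String basePath, String uniqueId) {
        return Paths.get(basePath, "tika-extracted", uniqueId);
    }

    // Directory shown as "common" in the tree quoted in this thread.
    static Path commonDir(String basePath, String uniqueId) {
        return jobDir(basePath, uniqueId).resolve("common");
    }

    // pageNum is zero-based, while directory names are one-based (page-1, page-2, ...).
    static Path pageDir(String basePath, String uniqueId, int pageNum) {
        return jobDir(basePath, uniqueId).resolve("page-" + (pageNum + 1));
    }

    public static void main(String[] args) {
        String id = "a39c7392-3c08-4bef-8ac5-3c74378223a5";
        System.out.println(commonDir("TestRun", id));
        System.out.println(pageDir("TestRun", id, 5));
    }
}
```

Building the path from separate segments, rather than concatenating strings with "/", keeps the page directory and the common directory under the same unique ID parent by construction.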
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.
For clarity, please use a different image in the pptx file to test out different image formats.
Thanks! I recreated the test documents with more distinct images. Along the way I also tested more cases of image duplication within the Microsoft documents (ppt/docx).
Overall, the Word document duplication issue appears to be resolved (see the comment below).
The PowerPoint document still yields an extra "emf" file. I'm looking into whether that can be cleaned up, but it could be an artifact: when I tried some other image formatting options, Tika sometimes extracted it as a separate embedded file.
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Also, more generally, can we avoid extracting the same image file more than once?
Hey Jeff, I was examining the tika-parsers/microsoft code and testing out more media files.
I think the issue is mainly how duplicate images are created in the Word/PowerPoint documents. When I created the new test files (with multiple copies of a flower and a brick/sphere scene), the duplicate images were only reported once. Perhaps other ways of creating duplicates (copying from other sources, or copying and then resizing/formatting images) generate unique copies that can't be filtered out by the parsers.
Looking through the Microsoft parsers right now, I don't see any immediate options for cleaning up embeddings of similar/duplicated images.
I'll keep looking to see if other options exist, but for now I've resolved the duplicate image extraction issue for our test documents.
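If the parsers cannot filter duplicates, one option on the component side would be to skip any embedded image whose bytes have already been saved. The sketch below is an assumption for discussion, not something Tika or this PR provides: ImageDedupSketch, isNew, and sha256Hex are hypothetical names, and hashing with SHA-256 is a swapped-in technique. It only catches byte-identical copies, so resized or reformatted copies would still be extracted separately.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not part of Tika or the TikaImageDetection component.
final class ImageDedupSketch {

    // Digests of every image seen so far for the current document.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time these exact bytes are seen, false for repeats.
    boolean isNew(byte[] imageBytes) {
        return seenDigests.add(sha256Hex(imageBytes));
    }

    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ImageDedupSketch dedup = new ImageDedupSketch();
        // Placeholder byte arrays standing in for extracted image data.
        byte[] flower = "flower-bytes".getBytes(StandardCharsets.UTF_8);
        byte[] resized = "flower-bytes-resized".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.isNew(flower));         // true: first copy
        System.out.println(dedup.isNew(flower.clone())); // false: identical bytes
        System.out.println(dedup.isNew(resized));        // true: different bytes
    }
}
```

In the extractor, a check like isNew would presumably sit just before the image is written to the output directory, with one instance per input document.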
java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Is this supposed to be test-tika-image-extraction.doc?
That file also contains:
# bloom-blossom-flora-40797.jpg
https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
Public Domain
Fixed, thanks! I've remade the test documents (with more distinct images) and updated the NOTICE file.
…r Tika components.
java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 238 at r3 (raw file):
Reviewed 3 of 6 files at r2, 6 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)