Update TikaImageDetection component to extract images from document (.doc, .docx) and PowerPoint (.ppt, .pptx) files. #274

hhuangMITRE · 2021-08-04T22:27:40Z

Issues:

Tika Image Detection misses images in PowerPoint and Word documents. openmpf#1372

jrobble

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):

            imageMap.add(current);
            if (separatePages) {
                outputDir = Paths.get(path + "/tika-extracted/page-" + String.valueOf(pageNum + 1));

This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:

`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png

page-6 needs to be under the UUID directory.

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):

        testTrack = tracks.get(3);
        assertEquals("4", testTrack.getDetectionProperties().get("PAGE_NUM"));
        assertTrue(testTrack.getDetectionProperties().get("SAVED_IMAGES").contains("image0.emf"));

Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.

For clarity, please use a different image in the pptx file to test out different image formats.

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

        // Test extraction of images from *.docx format.
        // Six extracted images should be present in one track:

Do you mean "eight" extracted images?

When I look at the 8 images extracted, I see:

3 for the large flower
3 for the pink flowers
1 for the mountain
1 for the flower drawing

However, that's not what's shown in the doc. The doc shows:

4 for the large flower
2 for the pink flowers
1 for the mountain
1 for the flower drawing

Do you know why? It's a bit confusing.

java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):

    Public Domain

# test-tika-image-extraction.pdf

Is this supposed to be test-tika-image-extraction.doc?

That file also contains:

    # bloom-blossom-flora-40797.jpg
    https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
    Public Domain

jrobble · 2021-08-05T16:09:29Z

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Do you mean "eight" extracted images?

When I look at the 8 images extracted, I see:

3 for the large flower
3 for the pink flowers
1 for the mountain
1 for the flower drawing

However, that's not what's shown in the doc. The doc shows:

4 for the large flower
2 for the pink flowers
1 for the mountain
1 for the flower drawing

Do you know why? It's a bit confusing.

Also, more generally, can we avoid extracting the same image file more than once?

hhuangMITRE

Reviewable status: 2 of 8 files reviewed, 4 unresolved discussions (waiting on @jrobble)

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:

`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png

page-6 needs to be under the UUID directory.

Fixed, thanks for catching that. I've updated the outputDir with the uniqueId tag.
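For illustration only, here is a minimal sketch of the directory layout discussed above, with page directories created under the per-job UUID directory next to `common`. The class and method names (`OutputDirSketch`, `commonDir`, `pageDir`) are hypothetical and not taken from `EmbeddedContentExtractor.java`; only the `tika-extracted`, `common`, and `page-N` names and the `pageNum + 1` offset come from the thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not the component's actual code.
class OutputDirSketch {

    // Root for one job: <basePath>/tika-extracted/<uniqueId>
    static Path jobRoot(String basePath, String uniqueId) {
        return Paths.get(basePath, "tika-extracted", uniqueId);
    }

    // Directory for images stored under "common", as in the tree above.
    static Path commonDir(String basePath, String uniqueId) {
        return jobRoot(basePath, uniqueId).resolve("common");
    }

    // Directory for a single page. pageNum is assumed to be zero-based,
    // matching the "pageNum + 1" in the reviewed snippet.
    static Path pageDir(String basePath, String uniqueId, int pageNum) {
        return jobRoot(basePath, uniqueId).resolve("page-" + (pageNum + 1));
    }
}
```

With this layout, `pageDir("TestRun", uuid, 5)` resolves to `TestRun/tika-extracted/<uuid>/page-6`, so `page-6` becomes a sibling of `common` rather than of the UUID directory.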

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.

For clarity, please use a different image in the pptx file to test out different image formats.

Thanks! I recreated the test documents with more distinct images. Along the way I also tested more cases of image duplication within the Microsoft documents (ppt/docx).

Overall, the Word document duplication issue appears to have been resolved (see the comment below).

The PowerPoint document still yields an extra "emf" file. I'm looking into whether that can be cleaned up, but it could be an artifact: when I tried some other image formatting options, Tika sometimes extracted it as a separate embedded file.

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Also, more generally, can we avoid extracting the same image file more than once?

Hey Jeff, I've been examining the tika-parsers/microsoft code and testing more media files.

I think the issue mainly comes down to how the duplicate images are created in the Word/PowerPoint documents. When I created the new test files (with multiple copies of a flower and a brick/sphere scene), each duplicated image was only reported once. Other ways of creating duplicates (copying from different sources, or copying and then resizing/formatting an image) may generate unique copies that the parsers can't filter out.

Looking through the Microsoft parsers right now, I don't see any immediate options for cleaning up embeddings of similar or duplicated images.

I'll keep looking to see if other options exist, but for now I've resolved the duplicate image extraction issue for our test documents.
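As one possible direction for the "avoid extracting the same image more than once" question, the sketch below skips byte-identical images by hashing their contents before saving. This is an assumption about how such a filter could look, not what this PR or the Tika parsers do; `ImageDeduplicator` and `register` are hypothetical names. It would not catch the resized or reformatted copies mentioned above, since those have different bytes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks which image contents have already been saved.
class ImageDeduplicator {

    // Maps the SHA-256 digest of an image's bytes to the file name it was first saved under.
    private final Map<String, String> savedByDigest = new HashMap<>();

    // Returns the name of the previously saved file with identical bytes,
    // or records candidateName for these bytes and returns it.
    String register(byte[] imageBytes, String candidateName) {
        String digest = sha256Hex(imageBytes);
        String existing = savedByDigest.putIfAbsent(digest, candidateName);
        return existing != null ? existing : candidateName;
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would write the file only when `register` returns the candidate name it passed in, and otherwise reuse the returned name when reporting the image.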

java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is this supposed to be test-tika-image-extraction.doc?

That file also contains:
    # bloom-blossom-flora-40797.jpg
    https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
    Public Domain

Fixed, thanks! I've remade the test documents (with more distinct images) and updated the NOTICE file.

jrobble · 2021-09-07T18:37:13Z

java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 238 at r3 (raw file):

Note to self: Make sure this is fixed:

NPE on jrobble-test.odp

jrobble

Reviewed 3 of 6 files at r2, 6 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)

Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files.

fd1705b

hhuangMITRE requested a review from jrobble August 5, 2021 14:56

jrobble requested changes Aug 5, 2021

hhuangMITRE added 3 commits August 13, 2021 03:25

Duplicate image report fix and duplicate test image extraction cleanup.

4c19245

Merge branch 'develop' of https://github.com/openmpf/openmpf-components into feature/tika-ppt-doc-image-extraction

4cf4c0c

Duplicate image report fix and duplicate test image extraction cleanup.

cfafcab

hhuangMITRE commented Aug 13, 2021

Update TikaImageDetection to process odp files. Minor code cleanup for Tika components.

e6e44fb

jrobble assigned hhuangMITRE Sep 7, 2021

jrobble added this to To do in OpenMPF: Development via automation Sep 7, 2021

Documentation update.

fc0717c

jrobble reviewed Jul 7, 2022

hhuangMITRE closed this Jul 12, 2022

OpenMPF: Development automation moved this from To do to Closed Jul 12, 2022

hhuangMITRE deleted the feature/tika-ppt-doc-image-extraction branch September 12, 2022 15:04
