Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274

Closed
wants to merge 6 commits into from

hhuangMITRE
Contributor

@hhuangMITRE hhuangMITRE commented Aug 4, 2021

…*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files.
Member

@jrobble jrobble left a comment

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):

            imageMap.add(current);
            if (separatePages) {
                outputDir = Paths.get(path + "/tika-extracted/page-" + String.valueOf(pageNum + 1));

This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:

`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png

page-6 needs to be under the UUID directory.


java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):

        testTrack = tracks.get(3);
        assertEquals("4", testTrack.getDetectionProperties().get("PAGE_NUM"));
        assertTrue(testTrack.getDetectionProperties().get("SAVED_IMAGES").contains("image0.emf"));

Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.

For clarity, please use a different image in the pptx file to test out different image formats.


java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

        // Test extraction of images from *.docx format.
        // Six extracted images should be present in one track:

Do you mean "eight" extracted images?

When I look at the 8 images extracted, I see:

  • 3 for the large flower
  • 3 for the pink flowers
  • 1 for the mountain
  • 1 for the flower drawing

However, that's not what's shown in the doc. The doc shows:

  • 4 for the large flower
  • 2 for the pink flowers
  • 1 for the mountain
  • 1 for the flower drawing

Do you know why? It's a bit confusing.


java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):

    Public Domain

# test-tika-image-extraction.pdf

Is this supposed to be test-tika-image-extraction.doc ?

That file also contains:

    # bloom-blossom-flora-40797.jpg
    https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
    Public Domain

@jrobble
Member

jrobble commented Aug 5, 2021


java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Do you mean "eight" extracted images?

When I look at the 8 images extracted, I see:

  • 3 for the large flower
  • 3 for the pink flowers
  • 1 for the mountain
  • 1 for the flower drawing

However, that's not what's shown in the doc. The doc shows:

  • 4 for the large flower
  • 2 for the pink flowers
  • 1 for the mountain
  • 1 for the flower drawing

Do you know why? It's a bit confusing.

Also, more generally, can we avoid extracting the same image file more than once?

Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Reviewable status: 2 of 8 files reviewed, 4 unresolved discussions (waiting on @jrobble)


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 139 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This might be an old bug since I didn't test it in your last PR. If I update testGetDetectionsPdf() to use ORGANIZE_BY_PAGE=TRUE, I see the following structure:

`-- TestRun
    `-- tika-extracted
        |-- a39c7392-3c08-4bef-8ac5-3c74378223a5
        |   `-- common
        |       |-- image0.jpg
        |       `-- image1.jpg
        `-- page-6
            |-- image2.jpg
            `-- image3.png

page-6 needs to be under the UUID directory.

Fixed, thanks for catching that. I've updated the outputDir with the uniqueId tag.
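For illustration, the intended layout can be sketched as below. This is a minimal sketch, not the component's actual code: the class `OutputDirSketch`, the helpers `pageDir` and `commonDir`, and their parameters are hypothetical names, and `pageNum` is assumed to be zero-based as in the reviewed line. The point is only that both the `common` directory and every `page-N` directory are resolved beneath the per-job UUID directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch; names are illustrative, not taken from EmbeddedContentExtractor.
final class OutputDirSketch {

    // Root for one extraction run: <basePath>/tika-extracted/<uniqueId>
    static Path runRoot(String basePath, String uniqueId) {
        return Paths.get(basePath, "tika-extracted", uniqueId);
    }

    // Directory for images that belong to a single page (pageNum assumed zero-based).
    static Path pageDir(String basePath, String uniqueId, int pageNum) {
        return runRoot(basePath, uniqueId).resolve("page-" + (pageNum + 1));
    }

    // Directory for images shared across pages.
    static Path commonDir(String basePath, String uniqueId) {
        return runRoot(basePath, uniqueId).resolve("common");
    }

    public static void main(String[] args) {
        String id = "a39c7392-3c08-4bef-8ac5-3c74378223a5";
        System.out.println(commonDir("TestRun", id));
        System.out.println(pageDir("TestRun", id, 5));
    }
}
```

Building the path with `Paths.get(first, more...)` and `resolve` rather than string concatenation also avoids hard-coding the `/` separator.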


java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 209 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Looking at the pptx, I see that there are 3 large flowers, but only two of the extracted files are large flowers. One is a .emf file, and one is a .jpeg file. I assume that one of those images is used twice, but I have no idea which one.

For clarity, please use a different image in the pptx file to test out different image formats.

Thanks! I recreated the test documents with more distinct images. Along the way I also tested out more image duplication within the Microsoft documents (ppt/docx).

Overall, the Word document duplication issue appears to have been resolved (see the comment below).

The PowerPoint document still yields an extra "emf" file. I'm looking into whether that can be cleaned up, but it could be an artifact: when I tested out some other image formatting options, Tika sometimes extracted it as a separate embedded file.


java/TikaImageDetection/src/test/java/org/mitre/mpf/detection/tika/TestTikaImageDetectionComponent.java, line 249 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Also, more generally, can we avoid extracting the same image file more than once?

Hey Jeff, so I was examining the tika-parsers/microsoft code + testing out more media files.

I think the issue mainly comes down to how duplicate images are created in the Word/PowerPoint documents. When I created the new test files (w/ multiple copies of a flower + brick/sphere scene), the duplicate images were only reported once. Other ways of creating duplicates (copying from different sources, or copying and then resizing/formatting images) may generate distinct copies that the parsers can't filter out.

Looking through the Microsoft parsers right now, I don't see any immediate options for cleaning up embeddings of similar/duplicated images.

I'll keep looking to see if other options exist, but for now I've resolved the duplicate image extraction issue for our test documents.
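One possible component-side fallback, sketched here purely as an assumption and not something this PR implements, would be to skip any embedded image whose bytes have already been written. The class `ImageDedupSketch` and its methods are hypothetical. Note that this only catches byte-identical copies: a copy that was resized, reformatted, or stored in another format (for example the extra "emf" file) has different bytes and would still be extracted.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of content-hash de-duplication; not part of the Tika API or this component.
final class ImageDedupSketch {

    // Hex SHA-256 digests of every image seen so far in the current document.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time these exact bytes are seen, false for a byte-identical repeat.
    boolean isNew(byte[] imageBytes) {
        return seenDigests.add(sha256Hex(imageBytes));
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for Java platform implementations.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ImageDedupSketch dedup = new ImageDedupSketch();
        byte[] flower = {1, 2, 3};
        byte[] flowerCopy = {1, 2, 3};
        System.out.println(dedup.isNew(flower));     // first occurrence
        System.out.println(dedup.isNew(flowerCopy)); // byte-identical repeat
    }
}
```

In an extractor, the check would run on the embedded stream's bytes before the file is written, so a repeat is never saved to disk.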


java/TikaImageDetection/src/test/resources/data/NOTICE, line 35 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Is this supposed to be test-tika-image-extraction.doc ?

That file also contains:

    # bloom-blossom-flora-40797.jpg
    https://www.pexels.com/photo/nature-flowers-blue-summer-40797/
    Public Domain

Fixed, thanks! I've remade the test documents (with more distinct images) and updated the NOTICE file.

@jrobble
Member

jrobble commented Sep 7, 2021

@jrobble jrobble added this to To do in OpenMPF: Development via automation Sep 7, 2021
Member

@jrobble jrobble left a comment

Reviewed 3 of 6 files at r2, 6 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)

OpenMPF: Development automation moved this from To do to Closed Jul 12, 2022
@hhuangMITRE hhuangMITRE deleted the feature/tika-ppt-doc-image-extraction branch September 12, 2022 15:04
2 participants