Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

hhuangMITRE · 2021-07-07T06:46:09Z

Issues:

Perform detection on images extracted from PDFs openmpf#792
Update Tika Image Detection to generate one track per piece of extracted media openmpf#803
Update Tika Image Detection to return MEDIA tracks openmpf#1303
Update text detection components to ignore feed-forward tracks which already have a TEXT property openmpf#1304
Update Tika Text Detection component to not use leading zeros for PAGE_NUM openmpf#808
Tika Image Detection misses images in PowerPoint and Word documents. openmpf#1372
Update Apache Tika to 2.4.1 for TikaImageDetection and TikaTextDetection Components. openmpf#1384

Related PRs:

This change is

…XT tracks.

… into feature/tika-pdf-tesseract-pipeline

…to the correct descriptor.json files.

jrobble

Reviewed 2 of 13 files at r1, 17 of 17 files at r2.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @hhuangMITRE)

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 57 at r1 (raw file):

    private int id, pagenum;
    private ArrayList<ArrayList<String>> imageMap;
    private LinkedHashMap<String, TreeSet<String>> imagePageMap;

@hhuangMITRE : Please leave a comment stating what imagePageMap contains. What is the key? What is the value a set of?

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

            imageMap.add(current);
            if (separatePages) {
                outputDir = Paths.get(path + "/tika-extracted/page-" + (pagenum + 1)); // TODO: Shouldn't this use the UUID?

@hhuangMITRE : Please address this.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

    public void startElement (String uri, String localName, String qName, Attributes atts) {
        if (pageTag.equals(qName) && (atts.getValue("class").equals("page"))) { // TODO: NPE on jrobble-test.odp

@hhuangMITRE : Please look into this. I will send you the file separately.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 166 at r1 (raw file):

        List<MPFGenericTrack> tracks = new LinkedList<MPFGenericTrack>();
        int page = 0;

page is not used anywhere, so I removed it.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

        Map<String,String> properties = mpfGenericJob.getJobProperties();
        boolean separatePages = false;
        boolean emptyPages = false; // TODO: This is not read anywhere

@hhuangMITRE : Please address this. Did you intend to use emptyPages for something?

java/TikaTextDetection/Dockerfile, line 39 at r1 (raw file):

COPY . .

ARG RUN_TESTS=true

I changed this so that it's false by default, like in other components. The Jenkinsfile will set it to true here:

            sh 'docker build -f openmpf_build/Dockerfile .. --build-arg RUN_TESTS=true ' +
                    "$commonBuildArgs $labelArgs -t openmpf_build:$inProgressTag"

It's false by default so we don't run the tests when doing local builds.

hhuangMITRE

Assuming we land the derivative branch changes first, I've fixed the UUID issues and cleaned up the unused job parameter. Let me know if any further updates are requested. Thanks!

Reviewable status: 14 of 17 files reviewed, 4 unresolved discussions (waiting on @brosenberg42 and @jrobble)

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 57 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please leave a comment stating what imagePageMap contains. What is the key? What is the value a set of?

Done.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please address this.

Done. This change is also included in the PDF image extraction PR, but I'm also placing the fix here if we land this first.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please look into this. I will send you the file separately.

Done. This issue was resolved in the image extraction PR. Let me know if you want me to move that change over here, thanks!

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please address this. Did you intend to use emptyPages for something?

I believe this originally was meant for the page-per-track version of the component. During development, I wasn't fully sure if we needed this property anymore as we are reporting by image-per-track with these changes (and reporting empty pages would create confusion). I've decided to remove this job property as a result.

jrobble

Reviewed 2 of 3 files at r3.
Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @brosenberg42, @hhuangMITRE, and @jrobble)

java/TikaImageDetection/README.md, line 3 at r3 (raw file):

# Overview

This directory contains source code for the OpenMPF Tika image detection component.

Please update this:

Still mentions ALLOW_EMPTY_PAGES .
Still mentions creating a track for each occurrence of the image. Instead, we're now listing all occurrences in the same track under PAGE_NUM, which can be a list.
Use tick marks around ORGANIZE_BY_PAGE.
IMAGES_FILES is now DERIVATIVE_MEDIA_PATH.
PAGE is nowPAGE_NUM.
In your other PR, mention that .odp files are supported, and list any other files types that are supported.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

Done. This change is also included in the PDF image extraction PR, but I'm also placing the fix here if we land this first.

Ok. The plan is to land this first.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

Done. This issue was resolved in the image extraction PR. Let me know if you want me to move that change over here, thanks!

I think it's fine to leave in the PR for now. I left a note to myself there.

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I believe this originally was meant for the page-per-track version of the component. During development, I wasn't fully sure if we needed this property anymore as we are reporting by image-per-track with these changes (and reporting empty pages would create confusion). I've decided to remove this job property as a result.

Ok. Yeah, I think it's best to remove it.

jrobble

Reviewed 1 of 3 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)

jrobble

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)

java/TikaImageDetection/README.md, line 3 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please update this:

Still mentions ALLOW_EMPTY_PAGES .

Still mentions creating a track for each occurrence of the image. Instead, we're now listing all occurrences in the same track under PAGE_NUM, which can be a list.

Use tick marks around ORGANIZE_BY_PAGE.

IMAGES_FILES is now DERIVATIVE_MEDIA_PATH.

PAGE is nowPAGE_NUM.

In your other PR, mention that .odp files are supported, and list any other files types that are supported.

@hhuangMITRE , forgot to tag you.

hhuangMITRE

Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @jrobble)

java/TikaImageDetection/README.md, line 3 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE , forgot to tag you.

Thanks Jeff! I've updated the README for both branches. I will merge this content into the image extraction PR after the derivative media update lands.

Co-authored-by: Brian Rosenberg <brosenberg@mitre.org>

jrobble

Reviewed 6 of 8 files at r5, 7 of 7 files at r6, 1 of 1 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @jrobble)

brosenberg42

Reviewed 6 of 8 files at r5.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jrobble)

brosenberg42

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @hhuangMITRE and @jrobble)

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/PageNumberExtractionContentHandler.java line 50 at r7 (raw file):

        var classAttr = atts.getValue("class");
        if (classAttr.equals("page")) { // TODO: NPE on jrobble-test.odp

Does this need to be addressed?

jrobble · 2022-07-05T20:50:13Z

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/PageNumberExtractionContentHandler.java line 50 at r7 (raw file):

Previously, brosenberg42 wrote…

Does this need to be addressed?

I test this and it's fixed in my recent updates.

jrobble

Reviewed 11 of 11 files at r8.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)

brosenberg42

Reviewed 2 of 7 files at r6, 10 of 11 files at r8, 1 of 1 files at r9.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)

brosenberg42

Reviewed 2 of 3 files at r10.
Reviewable status: 25 of 26 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 50 at r10 (raw file):

    private StringBuilder _sectionText;

    private final Map<Integer, List<StringBuilder>> _pageToSections = new LinkedHashMap<>();

This should be HashMap

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 99 at r10 (raw file):

    private void createPage() {
        _sectionText = new StringBuilder();
        _pageToSections.put(_pageNumber, new LinkedList<>());

This should be an ArrayList. It is accessed by index in TikaTextDetectionComponent.getDetections resulting in O(n^2) behavior. Also, in general, there are very few cases where LinkedList should be used. LinkedList is slower and takes up 3x the space (it's doubly linked) of an ArrayList.

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 106 at r10 (raw file):

        _pageNumber = 0;
        _pageToSections.clear();
        createPage();

Should _allText be cleared too?

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 111 at r10 (raw file):

    private void newSection() {
        // Skip blank sections.
        if (_sectionText.toString().isBlank()) {

In the case where this is true because the section is just white space, won't the white space be carried over in to the next section?

jrobble

Reviewed 1 of 1 files at r9, 2 of 3 files at r10, 11 of 11 files at r11.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 50 at r10 (raw file):

Previously, brosenberg42 wrote…

This should be HashMap

Done. Thanks for the sanity check. I ended up changing this to HashMap because of how we access by key.

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 99 at r10 (raw file):

Previously, brosenberg42 wrote…

This should be an ArrayList. It is accessed by index in TikaTextDetectionComponent.getDetections resulting in O(n^2) behavior. Also, in general, there are very few cases where LinkedList should be used. LinkedList is slower and takes up 3x the space (it's doubly linked) of an ArrayList.

Done.

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 106 at r10 (raw file):

Previously, brosenberg42 wrote…

Should _allText be cleared too?

Yes, thanks.

java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 111 at r10 (raw file):

Previously, brosenberg42 wrote…

In the case where this is true because the section is just white space, won't the white space be carried over in to the next section?

Yes, but it gets trimmed out later.

jrobble

Reviewed 6 of 6 files at r12.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

jrobble

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)

a discussion (no related file):
@hhuangMITRE :

Please verify that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied here properly. We can then close that PR.
Please sanity-check the changes Brian and I made to determine if we changed / omitted anything important that should be corrected.
Please remove the code where we add leading "0"s to PAGE_NUM and SECTION_NUM. I noticed that was done in TikaTextDetection but not TikaImageDetection. In retrospect, that may mess up the job consumers. Sorry to flip-flop on this. A better solution to enforcing track order in the JSON is to allow the components to provide a track index, but that's out of scope for this task.
Please apply your local changes not reflected in this PR yet.

hhuangMITRE

Reviewable status: 32 of 34 files reviewed, 5 unresolved discussions (waiting on @brosenberg42 and @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE :

Please verify that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied here properly. We can then close that PR.

Please sanity-check the changes Brian and I made to determine if we changed / omitted anything important that should be corrected.

Please remove the code where we add leading "0"s to PAGE_NUM and SECTION_NUM. I noticed that was done in TikaTextDetection but not TikaImageDetection. In retrospect, that may mess up the job consumers. Sorry to flip-flop on this. A better solution to enforcing track order in the JSON is to allow the components to provide a track index, but that's out of scope for this task.

Please apply your local changes not reflected in this PR yet.

@jrobble
Hi Jeff,

I've reviewed the code and gone through the action items:

Verified that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied properly. Key change for processing .odp files was transferred. I've closed the PR now.
Did a sanity check and local test of the changes in Feature: Update Tika and text components to process derivative images extracted from PDFs. #266. Looks good.
Removed code for adding leading zeros to PAGE_NUM and SECTION_NUM. This was a light change so I've included it as a follow-up commit for this branch.
I've done a sweep through my old VM environment and looked for any other changes that were missed. I found the PR update for Tika 2.0.0 that was intended to land in Milestone Update AUTO_ROTATE description #2. I've reviewed and transferred those changes into a follow-up PR, now updated to Tika 2.4.1. The changes are stored in PR Updating Tika (and associated PDF parser code) to version 2.4.1. #298, addressing issue #1384 (update Tika to 2.4.1) and set to merge into the current branch (or right after it lands).

The Tika 2.4.1 branch updates the PDF parser code + the JPEG output file extension in test cases (mainly to swap .jpg ->.jpeg extensions).

Please note that Optimaize Language Detection is still at version 1.28.1 after the Tika 2.4.1 update. Notes + docs for Apache Tika v2 do not list the library anymore, but I will keep looking.

jrobble

Reviewed 2 of 2 files at r14.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

Co-authored-by: jrobble <jrobble@mitre.org>

jrobble

Reviewed 23 of 23 files at r15.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

hhuangMITRE added 3 commits June 28, 2021 04:24

Updating Tika and Tesseract components to process derivative media.

b766694

Update text extraction components to skip any generic feed-forward TE…

b8c5664

…XT tracks.

Merge branch 'develop' of https://github.com/openmpf/openmpf-components…

e83a992

… into feature/tika-pdf-tesseract-pipeline

hhuangMITRE requested a review from jrobble July 7, 2021 06:46

hhuangMITRE mentioned this pull request Jul 7, 2021

Feature: Update WFM to process derivative images. openmpf/openmpf#1365

Merged

jrobble added 10 commits July 29, 2021 17:57

Cleanup.

a6d5e20

WIP.

a275274

Now use FACECV.

a78f1ef

Merge branch 'develop' into jrobble/tika-pdf-tesseract-pipeline

9041874

Revert some changes.

8adcf55

Cleanup merge. Fix tests.

d5cf75f

Save in "derivatives" dir instead of "artifacts" dir.

7a10fe8

Add task merging.

a819c4e

Switch from DERIVATIVE_MEDIA_URI to DERIVATIVE_MEDIA_PATH.

66e0337

Move source-only and derivative-only actions, tasks, and pipelines in…

201fd48

…to the correct descriptor.json files.

jrobble requested changes Aug 24, 2021

View reviewed changes

jrobble self-assigned this Aug 25, 2021

jrobble added this to To do in OpenMPF: Development via automation Aug 25, 2021

jrobble requested a review from brosenberg42 August 25, 2021 22:26

jrobble mentioned this pull request Aug 26, 2021

Add Derivative Media Guide. openmpf/openmpf.github.io#136

Merged

Code cleanup, fix UUID issue, and remove unused job parameter.

25909ce

hhuangMITRE commented Aug 30, 2021

View reviewed changes

jrobble requested changes Sep 7, 2021

View reviewed changes

jrobble reviewed Sep 7, 2021

View reviewed changes

jrobble requested changes Sep 7, 2021

View reviewed changes

Documentation update.

37c5357

hhuangMITRE commented Sep 9, 2021

View reviewed changes

jrobble and others added 3 commits June 22, 2022 22:01

Use DERIVATIVE_MEDIA_TEMP_PATH.

0ba824f

Merge branch 'develop' into feature/tika-pdf-tesseract-pipeline

82e1597

Test tika changes (#297)

ef42fdf

Co-authored-by: Brian Rosenberg <brosenberg@mitre.org>

Improve extension detection.

2710dd0

jrobble reviewed Jun 30, 2022

View reviewed changes

brosenberg42 approved these changes Jul 1, 2022

View reviewed changes

brosenberg42 requested changes Jul 1, 2022

View reviewed changes

Update tests for new document formats.

d294aa2

jrobble reviewed Jul 6, 2022

View reviewed changes

Fix test.

de78315

brosenberg42 approved these changes Jul 6, 2022

View reviewed changes

jrobble added 3 commits July 7, 2022 10:53

Refactor.

11531eb

Refactor.

217a97d

Minor update.

ab7a10f

brosenberg42 requested changes Jul 7, 2022

View reviewed changes

Update text tests for new document formats.

873e1f0

jrobble reviewed Jul 8, 2022

View reviewed changes

jrobble added 2 commits July 8, 2022 16:00

Remove blanket catch clause.

9013988

Use PAGE_NUM = -1 where appropriate.

d893dc0

jrobble reviewed Jul 9, 2022

View reviewed changes

jrobble requested changes Jul 9, 2022

View reviewed changes

jrobble and others added 2 commits July 11, 2022 17:43

Fix ODP test.

829b9aa

Remove leading 0's to PAGE_NUM and SECTION_NUM.

7ae11fe

hhuangMITRE commented Jul 12, 2022

View reviewed changes

jrobble reviewed Jul 13, 2022

View reviewed changes

Updating Tika (and associated PDF parser code) to version 2.4.1. (#298)

de8edad

Co-authored-by: jrobble <jrobble@mitre.org>

jrobble approved these changes Jul 13, 2022

View reviewed changes

jrobble merged commit ce4d315 into develop Jul 14, 2022

OpenMPF: Development automation moved this from To do to Closed Jul 14, 2022

jrobble deleted the feature/tika-pdf-tesseract-pipeline branch July 14, 2022 01:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

hhuangMITRE commented Jul 7, 2021 •

edited by jrobble

jrobble left a comment

hhuangMITRE left a comment

jrobble left a comment

jrobble left a comment

jrobble left a comment

hhuangMITRE left a comment

jrobble left a comment

brosenberg42 left a comment

brosenberg42 left a comment

jrobble commented Jul 5, 2022

jrobble left a comment

brosenberg42 left a comment

brosenberg42 left a comment

jrobble left a comment

jrobble left a comment

jrobble left a comment

hhuangMITRE left a comment

jrobble left a comment

jrobble left a comment

Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

Conversation

hhuangMITRE commented Jul 7, 2021 • edited by jrobble

jrobble left a comment

Choose a reason for hiding this comment

hhuangMITRE left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

hhuangMITRE left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

brosenberg42 left a comment

Choose a reason for hiding this comment

brosenberg42 left a comment

Choose a reason for hiding this comment

jrobble commented Jul 5, 2022

jrobble left a comment

Choose a reason for hiding this comment

brosenberg42 left a comment

Choose a reason for hiding this comment

brosenberg42 left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

hhuangMITRE left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

jrobble left a comment

Choose a reason for hiding this comment

hhuangMITRE commented Jul 7, 2021 •

edited by jrobble