Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Update Tika and text components to process derivative images extracted from PDFs. #266

Merged
merged 30 commits into from Jul 14, 2022

Conversation

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 13 files at r1, 17 of 17 files at r2.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @hhuangMITRE)


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 57 at r1 (raw file):

    private int id, pagenum;
    private ArrayList<ArrayList<String>> imageMap;
    private LinkedHashMap<String, TreeSet<String>> imagePageMap;

@hhuangMITRE : Please leave a comment stating what imagePageMap contains. What is the key? What is the value a set of?


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

            imageMap.add(current);
            if (separatePages) {
                outputDir = Paths.get(path + "/tika-extracted/page-" + (pagenum + 1)); // TODO: Shouldn't this use the UUID?

@hhuangMITRE : Please address this.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

    public void startElement (String uri, String localName, String qName, Attributes atts) {
        if (pageTag.equals(qName) && (atts.getValue("class").equals("page"))) { // TODO: NPE on jrobble-test.odp

@hhuangMITRE : Please look into this. I will send you the file separately.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 166 at r1 (raw file):

        List<MPFGenericTrack> tracks = new LinkedList<MPFGenericTrack>();
        int page = 0;

page is not used anywhere, so I removed it.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

        Map<String,String> properties = mpfGenericJob.getJobProperties();
        boolean separatePages = false;
        boolean emptyPages = false; // TODO: This is not read anywhere

@hhuangMITRE : Please address this. Did you intend to use emptyPages for something?


java/TikaTextDetection/Dockerfile, line 39 at r1 (raw file):

COPY . .

ARG RUN_TESTS=true

I changed this so that it's false by default, like in other components. The Jenkinsfile will set it to true here:

            sh 'docker build -f openmpf_build/Dockerfile .. --build-arg RUN_TESTS=true ' +
                    "$commonBuildArgs $labelArgs -t openmpf_build:$inProgressTag"

It's false by default so we don't run the tests when doing local builds.

@jrobble jrobble self-assigned this Aug 25, 2021
@jrobble jrobble added this to To do in OpenMPF: Development via automation Aug 25, 2021
Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we land the derivative branch changes first, I've fixed the UUID issues and cleaned up the unused job parameter. Let me know if any further updates are requested. Thanks!

Reviewable status: 14 of 17 files reviewed, 4 unresolved discussions (waiting on @brosenberg42 and @jrobble)


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 57 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please leave a comment stating what imagePageMap contains. What is the key? What is the value a set of?

Done.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please address this.

Done. This change is also included in the PDF image extraction PR, but I'm also placing the fix here if we land this first.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please look into this. I will send you the file separately.

Done. This issue was resolved in the image extraction PR. Let me know if you want me to move that change over here, thanks!


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE : Please address this. Did you intend to use emptyPages for something?

I believe this originally was meant for the page-per-track version of the component. During development, I wasn't fully sure if we needed this property anymore as we are reporting by image-per-track with these changes (and reporting empty pages would create confusion). I've decided to remove this job property as a result.

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r3.
Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @brosenberg42, @hhuangMITRE, and @jrobble)


java/TikaImageDetection/README.md, line 3 at r3 (raw file):

# Overview

This directory contains source code for the OpenMPF Tika image detection component.

Please update this:

  • Still mentions ALLOW_EMPTY_PAGES .

  • Still mentions creating a track for each occurrence of the image. Instead, we're now listing all occurrences in the same track under PAGE_NUM, which can be a list.

  • Use tick marks around ORGANIZE_BY_PAGE.

  • IMAGES_FILES is now DERIVATIVE_MEDIA_PATH.

  • PAGE is nowPAGE_NUM.

  • In your other PR, mention that .odp files are supported, and list any other files types that are supported.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/EmbeddedContentExtractor.java, line 119 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

Done. This change is also included in the PDF image extraction PR, but I'm also placing the fix here if we land this first.

Ok. The plan is to land this first.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/ImageExtractionContentHandler.java, line 49 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

Done. This issue was resolved in the image extraction PR. Let me know if you want me to move that change over here, thanks!

I think it's fine to leave in the PR for now. I left a note to myself there.


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/TikaImageDetectionComponent.java, line 131 at r2 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

I believe this originally was meant for the page-per-track version of the component. During development, I wasn't fully sure if we needed this property anymore as we are reporting by image-per-track with these changes (and reporting empty pages would create confusion). I've decided to remove this job property as a result.

Ok. Yeah, I think it's best to remove it.

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)


java/TikaImageDetection/README.md, line 3 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please update this:

  • Still mentions ALLOW_EMPTY_PAGES .

  • Still mentions creating a track for each occurrence of the image. Instead, we're now listing all occurrences in the same track under PAGE_NUM, which can be a list.

  • Use tick marks around ORGANIZE_BY_PAGE.

  • IMAGES_FILES is now DERIVATIVE_MEDIA_PATH.

  • PAGE is nowPAGE_NUM.

  • In your other PR, mention that .odp files are supported, and list any other files types that are supported.

@hhuangMITRE , forgot to tag you.

Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @jrobble)


java/TikaImageDetection/README.md, line 3 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE , forgot to tag you.

Thanks Jeff! I've updated the README for both branches. I will merge this content into the image extraction PR after the derivative media update lands.

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 8 files at r5, 7 of 7 files at r6, 1 of 1 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @jrobble)

Copy link
Member

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 8 files at r5.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jrobble)

Copy link
Member

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/PageNumberExtractionContentHandler.java line 50 at r7 (raw file):

        var classAttr = atts.getValue("class");
        if (classAttr.equals("page")) { // TODO: NPE on jrobble-test.odp

Does this need to be addressed?

@jrobble
Copy link
Member

jrobble commented Jul 5, 2022

java/TikaImageDetection/src/main/java/org/mitre/mpf/detection/tika/PageNumberExtractionContentHandler.java line 50 at r7 (raw file):

Previously, brosenberg42 wrote…

Does this need to be addressed?

I test this and it's fixed in my recent updates.

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r8.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)

Copy link
Member

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 7 files at r6, 10 of 11 files at r8, 1 of 1 files at r9.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @hhuangMITRE)

Copy link
Member

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r10.
Reviewable status: 25 of 26 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 50 at r10 (raw file):

    private StringBuilder _sectionText;

    private final Map<Integer, List<StringBuilder>> _pageToSections = new LinkedHashMap<>();

This should be HashMap


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 99 at r10 (raw file):

    private void createPage() {
        _sectionText = new StringBuilder();
        _pageToSections.put(_pageNumber, new LinkedList<>());

This should be an ArrayList. It is accessed by index in TikaTextDetectionComponent.getDetections resulting in O(n^2) behavior. Also, in general, there are very few cases where LinkedList should be used. LinkedList is slower and takes up 3x the space (it's doubly linked) of an ArrayList.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 106 at r10 (raw file):

        _pageNumber = 0;
        _pageToSections.clear();
        createPage();

Should _allText be cleared too?


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 111 at r10 (raw file):

    private void newSection() {
        // Skip blank sections.
        if (_sectionText.toString().isBlank()) {

In the case where this is true because the section is just white space, won't the white space be carried over in to the next section?

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r9, 2 of 3 files at r10, 11 of 11 files at r11.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 50 at r10 (raw file):

Previously, brosenberg42 wrote…

This should be HashMap

Done. Thanks for the sanity check. I ended up changing this to HashMap because of how we access by key.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 99 at r10 (raw file):

Previously, brosenberg42 wrote…

This should be an ArrayList. It is accessed by index in TikaTextDetectionComponent.getDetections resulting in O(n^2) behavior. Also, in general, there are very few cases where LinkedList should be used. LinkedList is slower and takes up 3x the space (it's doubly linked) of an ArrayList.

Done.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 106 at r10 (raw file):

Previously, brosenberg42 wrote…

Should _allText be cleared too?

Yes, thanks.


java/TikaTextDetection/src/main/java/org/mitre/mpf/detection/tika/TextExtractionContentHandler.java line 111 at r10 (raw file):

Previously, brosenberg42 wrote…

In the case where this is true because the section is just white space, won't the white space be carried over in to the next section?

Yes, but it gets trimmed out later.

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r12.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)

a discussion (no related file):
@hhuangMITRE :

  1. Please verify that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied here properly. We can then close that PR.

  2. Please sanity-check the changes Brian and I made to determine if we changed / omitted anything important that should be corrected.

  3. Please remove the code where we add leading "0"s to PAGE_NUM and SECTION_NUM. I noticed that was done in TikaTextDetection but not TikaImageDetection. In retrospect, that may mess up the job consumers. Sorry to flip-flop on this. A better solution to enforcing track order in the JSON is to allow the components to provide a track index, but that's out of scope for this task.

  4. Please apply your local changes not reflected in this PR yet.


Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 32 of 34 files reviewed, 5 unresolved discussions (waiting on @brosenberg42 and @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

@hhuangMITRE :

  1. Please verify that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied here properly. We can then close that PR.

  2. Please sanity-check the changes Brian and I made to determine if we changed / omitted anything important that should be corrected.

  3. Please remove the code where we add leading "0"s to PAGE_NUM and SECTION_NUM. I noticed that was done in TikaTextDetection but not TikaImageDetection. In retrospect, that may mess up the job consumers. Sorry to flip-flop on this. A better solution to enforcing track order in the JSON is to allow the components to provide a track index, but that's out of scope for this task.

  4. Please apply your local changes not reflected in this PR yet.

@jrobble
Hi Jeff,

I've reviewed the code and gone through the action items:

  1. Verified that the changes in Update TikaImageDetection component to extract images from document (*.doc, *.docx) and PowerPoint (*.ppt, *.pptx) files. #274 have been applied properly. Key change for processing .odp files was transferred. I've closed the PR now.
  2. Did a sanity check and local test of the changes in Feature: Update Tika and text components to process derivative images extracted from PDFs. #266. Looks good.
  3. Removed code for adding leading zeros to PAGE_NUM and SECTION_NUM. This was a light change so I've included it as a follow-up commit for this branch.
  4. I've done a sweep through my old VM environment and looked for any other changes that were missed. I found the PR update for Tika 2.0.0 that was intended to land in Milestone Update AUTO_ROTATE description #2. I've reviewed and transferred those changes into a follow-up PR, now updated to Tika 2.4.1. The changes are stored in PR Updating Tika (and associated PDF parser code) to version 2.4.1. #298, addressing issue #1384 (update Tika to 2.4.1) and set to merge into the current branch (or right after it lands).

The Tika 2.4.1 branch updates the PDF parser code + the JPEG output file extension in test cases (mainly to swap .jpg ->.jpeg extensions).

Please note that Optimaize Language Detection is still at version 1.28.1 after the Tika 2.4.1 update. Notes + docs for Apache Tika v2 do not list the library anymore, but I will keep looking.


Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r14.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 23 of 23 files at r15.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42)

@jrobble jrobble merged commit ce4d315 into develop Jul 14, 2022
OpenMPF: Development automation moved this from To do to Closed Jul 14, 2022
@jrobble jrobble deleted the feature/tika-pdf-tesseract-pipeline branch July 14, 2022 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

None yet

3 participants