Updating Tika (and associated PDF parser code) to version 2.4.1. #298

@hhuangMITRE (Contributor) commented Jul 12, 2022

@jrobble added this to To do in OpenMPF: Development via automation Jul 13, 2022

@jrobble (Member) left a comment

Reviewed 22 of 22 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)

a discussion (no related file):
We should also update the version of Tika used by the WFM. I created a separate task. I don't plan on landing that in 7.0.



java/TikaImageDetection/src/main/java/org/apache/tika/parser/pdf/image/ImageGraphicsEngine.java line 1 at r1 (raw file):

/******************************************************************************

Please add a comment in the code near your modifications that explains them. For example:

// OpenMPF modification: blah blah

java/TikaTextDetection/pom.xml line 74 at r1 (raw file):

        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>

Add a comment here explaining why this is still 1.28.1.

@hhuangMITRE (Contributor, Author) left a comment

Reviewable status: 20 of 23 files reviewed, 3 unresolved discussions (waiting on @brosenberg42, @hhuangMITRE, and @jrobble)


java/TikaImageDetection/src/main/java/org/apache/tika/parser/pdf/image/ImageGraphicsEngine.java line 1 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please add a comment in the code near your modifications that explains them. For example:

// OpenMPF modification: blah blah

Done! I've added more details to the OpenMPF edit lines below. Let me know if any other edits are needed, thanks!


java/TikaTextDetection/pom.xml line 74 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Add a comment here explaining why this is still 1.28.1.

Done. I did another sweep through the documents and then dug into the source code for the Optimaize package, which I found next to four other language detection modules. I've updated the POM file and the import for the Optimaize library, then confirmed that the changes work with mvn test. I'm creating a new task to investigate the other modules next.
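
For readers following along, the POM change described above might look roughly like the fragment below. This is only a sketch: the groupId matches the pom.xml line quoted in this thread and the version is the one named in the PR title, but the artifactId `tika-langdetect-optimaize` is an assumption and is not taken from this PR's diff.

```xml
<!-- Sketch only: the artifactId below is an assumption, not copied from this PR's diff. -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-langdetect-optimaize</artifactId>
    <!-- Version taken from the PR title (Tika 2.4.1). -->
    <version>2.4.1</version>
</dependency>
```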

@jrobble (Member) left a comment

Reviewed 2 of 3 files at r2, all commit messages.
Reviewable status: 22 of 23 files reviewed, 1 unresolved discussion (waiting on @brosenberg42, @hhuangMITRE, and @jrobble)


java/TikaImageDetection/src/main/java/org/apache/tika/parser/pdf/image/ImageGraphicsEngine.java line 1 at r1 (raw file):

Previously, hhuangMITRE (Howard W Huang) wrote…

Done! I've added more details to the OpenMPF edit lines below. Let me know if any other edits are needed, thanks!

Thanks. I'm going to update the lines to say "OpenMPF" just because that's the search term I use and why I didn't see the lines the first time. Not that I would expect you to know that :)

@jrobble (Member) left a comment

Reviewed 1 of 3 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42 and @hhuangMITRE)

@jrobble (Member) left a comment

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

@jrobble merged commit de8edad into feature/tika-pdf-tesseract-pipeline Jul 13, 2022
@jrobble deleted the feature/hhuang-collect-updates-tika-pdf-tesseract-pipeline branch July 13, 2022 20:05
OpenMPF: Development automation moved this from To do to Closed Jul 13, 2022
jrobble added a commit that referenced this pull request Jul 14, 2022
* Support new document formats.

* Use PAGE_NUM = -1 where appropriate.

* Remove leading 0's to PAGE_NUM and SECTION_NUM.

* Updating Tika (and associated PDF parser code) to version 2.4.1. (#298)

Co-authored-by: Jeff Robble <jrobble@mitre.org>
Co-authored-by: Brian Rosenberg <brosenberg@mitre.org>
2 participants