
Grobid parser for Scientific Articles from PDF #6729

Merged (18 commits) Jun 29, 2023

Conversation

corranmac
Contributor

@corranmac corranmac commented Jun 25, 2023

Scientific Article PDF Parsing via Grobid

Description:
This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text, etc. The GrobidParser uses a local Grobid server to return PDF documents as XML, then parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, PDF-relative bboxes, pages (text may span two pages), section title (Introduction, Methodology, etc.), section_number (e.g., 1.1, 2.3), the title of the paper, and finally the file path.

Grobid parsing is useful beyond standard PDF parsing because it accurately outputs sections and the paragraphs within them. This allows for post-filtering of results to specific sections, e.g., limiting results to the methodology or results section. While sections are currently split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, and conclusion. I'm currently experimenting with GPT-3.5 for this, which could later be implemented as a text splitter.
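To make the post-filtering idea concrete, here is a minimal, hypothetical sketch (not the merged implementation): it filters parsed documents down to one section using the metadata fields described above. The `section_title` key follows the description, but the exact field names and document shape are assumptions for illustration.

```python
# Hypothetical sketch: post-filter parsed documents to a single section.
# Plain dicts stand in for LangChain Documents; "section_title" is the
# metadata key described in this PR, but names may differ in the real code.
docs = [
    {"text": "We trained the model on ...", "metadata": {"section_title": "Methodology"}},
    {"text": "Prior work shows ...", "metadata": {"section_title": "Introduction"}},
]

def filter_by_section(documents, section):
    """Keep only documents whose section_title metadata matches `section`."""
    return [d for d in documents if d["metadata"].get("section_title") == section]

methodology_docs = filter_by_section(docs, "Methodology")
```

The same pattern would extend naturally to filtering on `section_number` or page ranges.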

Dependencies:
To use it, the Grobid repo must be cloned and Java must be installed. On Colab, for example:

```
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via

```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
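Since the command above starts the server in the background, a script may try to parse PDFs before Grobid has finished booting. Below is a small, hedged sketch of a readiness poller; the probe function is injectable, and in practice it might wrap a request to Grobid's health endpoint (commonly `http://localhost:8070/api/isalive`, but check the Grobid docs for your version).

```python
import time

def wait_for_server(probe, retries=30, delay=2.0):
    """Poll `probe()` until it returns True or retries are exhausted.

    `probe` is any zero-argument callable returning a bool. In practice it
    might wrap something like:
        requests.get("http://localhost:8070/api/isalive").ok
    (endpoint name assumed here -- verify against the Grobid documentation).
    """
    for _ in range(retries):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Injecting the probe keeps the helper testable without a running server.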

@rlancemartin, @eyurtsev

Twitter Handle: @corranmac

Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

@rlancemartin
Collaborator

> While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, conclusion. I'm currently experimenting with chatgpt-3.5 for this function, which could later be implemented as a textsplitter.

Very interesting. This is related to the functionality of MarkdownHeaderTextSplitter, but appears to work more generally on documents. Great addition.

Collaborator

@rlancemartin rlancemartin left a comment


Please add a notebook w/ example usage to this directory.

Also add unit tests here.

@@ -0,0 +1,59 @@
import requests
from bs4 import BeautifulSoup
Collaborator


Remove. Make this a local import inside the class definition, which you already have. See here for reference.

self.xml_data=xml_data
self.segment_sentences=segment_sentences

def load(self) -> List[Document]:
Collaborator


Add lazy_load. See example here.
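For readers unfamiliar with the request, a lazy loader yields documents one at a time instead of materializing the whole list. Here is a minimal sketch of the pattern using toy types (plain dicts rather than the real LangChain base classes, whose exact interface is not shown in this thread):

```python
from typing import Iterator, List

class ToyLoader:
    """Illustrative only: lazy_load yields documents one at a time,
    and load() is defined in terms of it. Not the real LangChain API."""

    def __init__(self, records: List[str]):
        self.records = records

    def lazy_load(self) -> Iterator[dict]:
        for i, text in enumerate(self.records):
            # Each record becomes a document only when requested, so callers
            # can stop early without parsing the entire corpus.
            yield {"text": text, "metadata": {"paragraph_number": i}}

    def load(self) -> List[dict]:
        return list(self.lazy_load())
```

The benefit for Grobid-sized inputs is that a caller iterating over `lazy_load()` never holds every parsed paragraph in memory at once.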

@rlancemartin
Collaborator

Finally, run make format to resolve lint errors raised in tests.

Looks very interesting overall. Looking forward to seeing an example notebook to test!

@rlancemartin rlancemartin self-assigned this Jun 26, 2023
@dev2049 dev2049 added 03 enhancement Enhancement of existing functionality Ɑ: doc loader Related to document loader module (not documentation) labels Jun 26, 2023
@rlancemartin
Collaborator

I just took care of formatting and added a notebook.

Nice work! Everything works on my end (on a Mac). Added instructions for Mac setup.

Will merge once tests pass.

Created the grobid document loader. 

This is a long-maintained Python library that uses machine learning to process the PDF of a scientific article and generate an XML file containing the paper title, authors, references, sections, section text, and more.

See https://grobid.readthedocs.io/en/latest/
Adds parsing of the XML file produced by Grobid into a Langchain Document
Add grobid loader
Fixed duplicate variable
@rlancemartin
Collaborator

One added point: currently, any parser accepts blobs and inherits from BaseBlobParser. In this case, it seemed easier to move your parser code into a new text splitter rather than modify your loader to yield blobs (since you do some processing and produce XML data). Open to feedback from others and will discuss w/ @eyurtsev when he is back from travels. IMO this is something we need to clarify in the documentation (specifically, the differences between loaders -> splitters vs. blob loaders -> blob parsers).

Contributor

@hwchase17 hwchase17 left a comment


Left some comments; this is a bit of a misuse of a "splitter".

This appears to be a preprocessing step fairly tightly coupled to the Grobid loader. I would just leave it in that page; that's the only place it's used currently. Could it be used in other places? Maybe, but then let's move it when that happens, because I'm struggling to see it.

@@ -1056,3 +1057,97 @@ def __init__(self, **kwargs: Any) -> None:
"""Initialize a LatexTextSplitter."""
separators = self.get_separators_for_language(Language.LATEX)
super().__init__(separators=separators, **kwargs)


class GrobidSplitter:
Contributor


This follows a completely different API than the text splitter.

It's fine to not inherit from the base text splitter, but this SHOULD inherit from BaseDocumentTransformer and expose the relevant methods there.
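For context, the document-transformer shape being suggested looks roughly like the sketch below. This is a hypothetical illustration with plain dicts and a naive sentence splitter, not the real BaseDocumentTransformer interface or the code under review:

```python
from typing import Any, List, Sequence

class SentenceSplittingTransformer:
    """Hypothetical sketch of the transformer shape: expose a generic
    transform_documents(documents) -> documents method rather than a
    splitter-specific API. Not the real LangChain BaseDocumentTransformer."""

    def transform_documents(self, documents: Sequence[dict], **kwargs: Any) -> List[dict]:
        out = []
        for doc in documents:
            # Naive sentence segmentation, purely for illustration.
            for sentence in doc["text"].split(". "):
                if sentence:
                    out.append({"text": sentence.rstrip("."),
                                "metadata": dict(doc["metadata"])})
        return out
```

The design point is that any downstream code can call `transform_documents` without knowing whether the transformer splits, filters, or rewrites.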

Collaborator


Right, there is some confusion about where it should live.

the current flow is:
(1) use grobid loader to ingest doc and produce an XML file.
(2) use grobid parser to parse the XML file.

parsers currently operate on blob loaders (inherit from BaseBlobParser).


a few options:

(1) use an existing blob loader to load PDFs and add the XML generation to the blob parser.

  • pro: the blob loader -> parser flow is a useful design pattern, avoiding loader boilerplate duplication
  • con: large code change, UX* less obvious, but does follow the pattern from recent PRs:

```python
loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```

(2) use existing loader and move parsing logic to text splitter.

  • pro: intuitive, historical precedent
  • con: the splitter is specific to this loader (due to ingestion of XML), bloating text_splitter

(3) move "parser" as a function in the loader.

  • pro: clear, intuitive UX*, encapsulation of all function in the loader,
  • con: breaks from blob loader -> parser paradigm, which we may want to move towards

*UX would be the same as we currently have: `loader = GrobidLoader(file_path, segment_sentences=False)`

@@ -1056,3 +1057,97 @@ def __init__(self, **kwargs: Any) -> None:
"""Initialize a LatexTextSplitter."""
separators = self.get_separators_for_language(Language.LATEX)
super().__init__(separators=separators, **kwargs)


class GrobidSplitter:
Contributor


This doesn't use Grobid at all?

Collaborator


It doesn't use Grobid directly, but it accepts the XML produced by Grobid, so there is indeed tight coupling.

@rlancemartin
Collaborator

rlancemartin commented Jun 29, 2023

I went with option (1) mentioned in code above.

use an existing blob loader to load PDFs and add all the logic (XML generation) to the Grobid blob parser.

pro:

  • simplifies the code by consolidating all logic into the parser (rather than separating it between loader and parser)
  • re-uses the existing blob loader for flexibility
  • avoids adding bloat in text_splitter
  • UX is consistent w/ recent context aware splitters

UX:

```python
loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```

will add better documentation to help contributors recognize this.

@rlancemartin rlancemartin changed the title Integrate Grobid: Scientific Article Parsing from PDF via DocumentLoader & Parser Grobid parser for Scientific Articles from PDF Jun 29, 2023
@rlancemartin rlancemartin force-pushed the master branch 2 times, most recently from 309ebf6 to d13b3a7 on June 29, 2023 at 20:13
Contributor

@hwchase17 hwchase17 left a comment


nice, lgtm

@rlancemartin rlancemartin merged commit 20c6ade into langchain-ai:master Jun 29, 2023
13 checks passed
vowelparrot pushed a commit that referenced this pull request Jul 4, 2023
aerrober pushed a commit to aerrober/langchain-fork that referenced this pull request Jul 24, 2023