
Grobid parser for Scientific Articles from PDF #6729

Merged (18 commits) Jun 29, 2023

Conversation

corranmac
Contributor

@corranmac corranmac commented Jun 25, 2023

Scientific Article PDF Parsing via Grobid

Description:
This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text, etc. The GrobidParser uses a local Grobid server to return PDF documents as XML, then parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, PDF-relative bboxes, pages (text may span two pages), section title (Introduction, Methodology, etc.), section_number (e.g., 1.1, 2.3), the title of the paper, and finally the file path.

Grobid parsing is useful beyond standard PDF parsing because it accurately outputs sections and the paragraphs within them. This allows for post-filtering of results to specific sections, e.g., limiting results to the methodology or results section. While sections are currently split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, and conclusion. I'm currently experimenting with GPT-3.5 for this, which could later be implemented as a text splitter.
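To make the post-filtering idea concrete, here is a minimal, hypothetical sketch (not the merged implementation): it filters parsed documents down to one section using the metadata fields described above. The `section_title` key follows the description, but the exact field names and document shape are assumptions for illustration.

```python
# Hypothetical sketch: post-filter parsed documents to a single section.
# Plain dicts stand in for LangChain Documents; "section_title" is the
# metadata key described in this PR, but names may differ in the real code.
docs = [
    {"text": "We trained the model on ...", "metadata": {"section_title": "Methodology"}},
    {"text": "Prior work shows ...", "metadata": {"section_title": "Introduction"}},
]

def filter_by_section(documents, section):
    """Keep only documents whose section_title metadata matches `section`."""
    return [d for d in documents if d["metadata"].get("section_title") == section]

methodology_docs = filter_by_section(docs, "Methodology")
```

The same pattern would extend naturally to filtering on `section_number` or page ranges.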

Dependencies:
To use it, the Grobid repo must be cloned and Java must be installed. On Colab, for example:

```
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via

```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
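Since the command above starts the server in the background, a script may try to parse PDFs before Grobid has finished booting. Below is a small, hedged sketch of a readiness poller; the probe function is injectable, and in practice it might wrap a request to Grobid's health endpoint (commonly `http://localhost:8070/api/isalive`, but check the Grobid docs for your version).

```python
import time

def wait_for_server(probe, retries=30, delay=2.0):
    """Poll `probe()` until it returns True or retries are exhausted.

    `probe` is any zero-argument callable returning a bool. In practice it
    might wrap something like:
        requests.get("http://localhost:8070/api/isalive").ok
    (endpoint name assumed here -- verify against the Grobid documentation).
    """
    for _ in range(retries):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Injecting the probe keeps the helper testable without a running server.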

@rlancemartin, @eyurtsev

Twitter Handle: @corranmac

Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

@rlancemartin
Collaborator

> While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, conclusion. I'm currently experimenting with chatgpt-3.5 for this function, which could later be implemented as a textsplitter.

Very interesting. This is related to the functionality of MarkdownHeaderTextSplitter, but appears to work more generally on documents. Great addition.

Collaborator

@rlancemartin rlancemartin left a comment


Please add a notebook w/ example usage to this directory.

Also add unit tests here.

@@ -0,0 +1,59 @@
import requests
from bs4 import BeautifulSoup
Collaborator


Remove. Make this a local import inside the class definition, which you already have. See here for reference.

self.xml_data=xml_data
self.segment_sentences=segment_sentences

def load(self) -> List[Document]:
Collaborator


Add lazy_load. See example here.
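For readers unfamiliar with the request, a lazy loader yields documents one at a time instead of materializing the whole list. Here is a minimal sketch of the pattern using toy types (plain dicts rather than the real LangChain base classes, whose exact interface is not shown in this thread):

```python
from typing import Iterator, List

class ToyLoader:
    """Illustrative only: lazy_load yields documents one at a time,
    and load() is defined in terms of it. Not the real LangChain API."""

    def __init__(self, records: List[str]):
        self.records = records

    def lazy_load(self) -> Iterator[dict]:
        for i, text in enumerate(self.records):
            # Each record becomes a document only when requested, so callers
            # can stop early without parsing the entire corpus.
            yield {"text": text, "metadata": {"paragraph_number": i}}

    def load(self) -> List[dict]:
        return list(self.lazy_load())
```

The benefit for Grobid-sized inputs is that a caller iterating over `lazy_load()` never holds every parsed paragraph in memory at once.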

@rlancemartin
Collaborator

Finally, run make format to resolve lint errors raised in tests.

Looks very interesting overall. Looking forward to seeing an example notebook to test!

@rlancemartin rlancemartin self-assigned this Jun 26, 2023
@dev2049 dev2049 added 03 enhancement Enhancement of existing functionality Ɑ: doc loader Related to document loader module (not documentation) labels Jun 26, 2023
@rlancemartin
Collaborator

I just took care of formatting and added a notebook.

Nice work! Everything works on my end (on a Mac). Added instructions for Mac setup.

Will merge once tests pass.

Created the grobid document loader. 

This is a long-maintained Python library that uses machine learning to process the PDF of a scientific article and generate an XML file containing the paper title, authors, references, sections, section text, and more.

See https://grobid.readthedocs.io/en/latest/
Adds parsing of the XML file produced by Grobid into a Langchain Document
Add grobid loader
Fixed duplicate variable
@rlancemartin
Collaborator

One added point: currently, any parser accepts blobs and inherits from BaseBlobParser. In this case, it seemed easier to move your parser code into a new text splitter rather than modify your loader to yield blobs (since you do some processing and produce XML data). Open to feedback from others and will discuss w/ @eyurtsev when he is back from travels. IMO this is something we need to clarify in the documentation (specifically, the differences between loaders -> splitters vs. blob loaders -> blob parsers).

Contributor

@hwchase17 hwchase17 left a comment


Left some comments; this is a bit of a misuse of a "splitter".

This appears to be a preprocessing step fairly tightly coupled to the Grobid loader. I would just leave it in that page; that's the only place it's used currently. Could it be used in other places? Maybe, but then let's move it when that happens, because I'm struggling to see it.

@@ -1056,3 +1057,97 @@ def __init__(self, **kwargs: Any) -> None:
"""Initialize a LatexTextSplitter."""
separators = self.get_separators_for_language(Language.LATEX)
super().__init__(separators=separators, **kwargs)


class GrobidSplitter:
Contributor


This follows a completely different API than the text splitter.

It's fine to not inherit from the base text splitter, but this SHOULD inherit from BaseDocumentTransformer and expose the relevant methods there.
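For context, the document-transformer shape being suggested looks roughly like the sketch below. This is a hypothetical illustration with plain dicts and a naive sentence splitter, not the real BaseDocumentTransformer interface or the code under review:

```python
from typing import Any, List, Sequence

class SentenceSplittingTransformer:
    """Hypothetical sketch of the transformer shape: expose a generic
    transform_documents(documents) -> documents method rather than a
    splitter-specific API. Not the real LangChain BaseDocumentTransformer."""

    def transform_documents(self, documents: Sequence[dict], **kwargs: Any) -> List[dict]:
        out = []
        for doc in documents:
            # Naive sentence segmentation, purely for illustration.
            for sentence in doc["text"].split(". "):
                if sentence:
                    out.append({"text": sentence.rstrip("."),
                                "metadata": dict(doc["metadata"])})
        return out
```

The design point is that any downstream code can call `transform_documents` without knowing whether the transformer splits, filters, or rewrites.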

Collaborator


Right, there is some confusion about where it should live.

the current flow is:
(1) use grobid loader to ingest doc and produce an XML file.
(2) use grobid parser to parse the XML file.

parsers currently operate on blob loaders (inherit from BaseBlobParser).


a few options:

(1) use an existing blob loader to load PDFs and add the XML generation to the blob parser.

  • pro: the blob loader -> parser flow is a useful design pattern, avoiding loader boilerplate duplication
  • con: large code change, UX* less obvious, but does follow the pattern from recent PRs:

```python
loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```

(2) use existing loader and move parsing logic to text splitter.

  • pro: intuitive, historical precedent
  • con: the splitter is specific to this loader (due to ingestion of XML), bloating text_splitter

(3) move "parser" as a function in the loader.

  • pro: clear, intuitive UX*, encapsulation of all function in the loader,
  • con: breaks from blob loader -> parser paradigm, which we may want to move towards

*UX would be the same as we currently have: `loader = GrobidLoader(file_path, segment_sentences=False)`

@@ -1056,3 +1057,97 @@ def __init__(self, **kwargs: Any) -> None:
"""Initialize a LatexTextSplitter."""
separators = self.get_separators_for_language(Language.LATEX)
super().__init__(separators=separators, **kwargs)


class GrobidSplitter:
Contributor


This doesn't use Grobid at all?

Collaborator


It doesn't use Grobid directly, but it accepts the XML produced by Grobid, so there is indeed tight coupling.

@rlancemartin
Collaborator

rlancemartin commented Jun 29, 2023

I went with option (1) mentioned in code above.

use an existing blob loader to load PDFs and add all the logic (XML generation) to the Grobid blob parser.

pro:

  • simplifies the code by consolidating all logic into the parser (rather than separating it between loader and parser)
  • re-uses the existing blob loader for flexibility
  • avoids adding bloat in text_splitter
  • UX is consistent w/ recent context aware splitters

UX:

```python
loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```

will add better documentation to help contributors recognize this.

@rlancemartin rlancemartin changed the title Integrate Grobid: Scientific Article Parsing from PDF via DocumentLoader & Parser Grobid parser for Scientific Articles from PDF Jun 29, 2023
@rlancemartin rlancemartin force-pushed the master branch 2 times, most recently from 309ebf6 to d13b3a7 on June 29, 2023 at 20:13
Contributor

@hwchase17 hwchase17 left a comment


nice, lgtm

@rlancemartin rlancemartin merged commit 20c6ade into langchain-ai:master Jun 29, 2023
13 checks passed
vowelparrot pushed a commit that referenced this pull request Jul 4, 2023
aerrober pushed a commit to aerrober/langchain-fork that referenced this pull request Jul 24, 2023