Grobid parser for Scientific Articles from PDF #6729
Conversation
Very interesting. This is related to the functionality of MarkdownHeaderTextSplitter, but appears to work more generally on documents. Great addition.
Please add a notebook w/ example usage to this directory.
Also add unit tests here.
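To make the "add unit tests" request concrete, here is a minimal pytest-style sketch. The helper and its behavior are assumptions based on this PR (real tests should mock the Grobid server rather than call it):

```python
def segment_paragraphs(text: str):
    """Hypothetical helper mirroring the loader's paragraph segmentation."""
    return [p for p in text.split("\n\n") if p.strip()]


def test_segment_paragraphs():
    # Paragraphs are separated by blank lines; empty chunks are dropped.
    assert segment_paragraphs("intro\n\nmethods") == ["intro", "methods"]
    assert segment_paragraphs("a\n\n\n\nb") == ["a", "b"]
```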
langchain/document_loaders/grobid.py (outdated)

import requests
from bs4 import BeautifulSoup
Remove. Make this a local import inside the class definition, which you already have. See here for reference.
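The deferred-import pattern the reviewer is pointing at looks roughly like this. The class and method names are illustrative, not the PR's actual code:

```python
class GrobidLoaderSketch:
    """Illustrative sketch of the local-import pattern (names are assumptions)."""

    def load(self):
        # Import bs4 only when the loader actually runs, so importing the
        # document_loaders package does not require beautifulsoup4.
        try:
            from bs4 import BeautifulSoup  # noqa: F401
        except ImportError:
            raise ImportError(
                "beautifulsoup4 is required; install with `pip install beautifulsoup4`."
            )
        return []
```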
langchain/document_loaders/grobid.py (outdated)

        self.xml_data = xml_data
        self.segment_sentences = segment_sentences

    def load(self) -> List[Document]:
Add lazy_load. See example here.
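The lazy_load pattern being requested can be sketched like this. The Document class here is a minimal stand-in for langchain's, and the loader is hypothetical:

```python
from typing import Iterator, List, Optional


class Document:
    """Minimal stand-in for langchain.schema.Document (illustrative only)."""

    def __init__(self, page_content: str, metadata: Optional[dict] = None):
        self.page_content = page_content
        self.metadata = metadata or {}


class GrobidLoaderSketch:
    """Hypothetical loader showing the lazy_load pattern."""

    def __init__(self, paragraphs: List[str]):
        self.paragraphs = paragraphs

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document at a time instead of materializing the full
        # list, which matters for large articles.
        for i, text in enumerate(self.paragraphs):
            yield Document(page_content=text, metadata={"para": i})

    def load(self) -> List[Document]:
        # load() simply exhausts the lazy iterator.
        return list(self.lazy_load())
```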
Finally, run the formatting and lint checks. Looks very interesting overall. Looking forward to seeing an example notebook to test!
I just took care of formatting and added the notebook. Nice work! Everything works on my end (on a Mac). Added instructions for Mac setup. Will merge once tests pass.
Created the Grobid document loader. Grobid is a long-maintained library that uses machine learning to process the PDF of a scientific article and generate an XML file containing the paper title, authors, references, sections, section text and more. See https://grobid.readthedocs.io/en/latest/
Adds parsing of the XML file produced by Grobid into a LangChain Document.
Add grobid loader
Fixed duplicate variable
Fix typo
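Grobid emits TEI XML, so the parsing step described above amounts to walking section `<div>` elements and their `<p>` paragraphs. A dependency-free sketch using the stdlib ElementTree (the PR itself uses BeautifulSoup; the sample XML is made up):

```python
import xml.etree.ElementTree as ET

# TEI documents live in this namespace; ElementTree needs it spelled out.
TEI = "{http://www.tei-c.org/ns/1.0}"

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body></text>
</TEI>"""


def parse_sections(xml_data: str):
    """Collect section titles and their paragraph texts from TEI XML."""
    root = ET.fromstring(xml_data)
    sections = []
    for div in root.iter(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        title = head.text if head is not None else ""
        paras = [p.text or "" for p in div.findall(f"{TEI}p")]
        sections.append({"section_title": title, "paragraphs": paras})
    return sections
```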
One added point: currently any
left some comments, this is a bit of a misuse of a "splitter"
this appears to be a preprocessing step fairly tightly coupled to the grobid loader. i would just leave it in that page, that's the only place it's used currently. could it be used other places? maybe, but then let's move it there when that happens, because i'm struggling to see it now
langchain/text_splitter.py (outdated)

    """Initialize a LatexTextSplitter."""
    separators = self.get_separators_for_language(Language.LATEX)
    super().__init__(separators=separators, **kwargs)


class GrobidSplitter:
this follows a completely different API than the text splitter.
it's fine to not inherit from the base text splitter, but this SHOULD inherit from BaseDocumentTransformer and expose the relevant methods there
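A minimal sketch of what "inherit from BaseDocumentTransformer" would look like. Both classes below are stand-ins written for this example, not langchain's actual code:

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Sequence


class Document:
    """Minimal stand-in for langchain.schema.Document."""

    def __init__(self, page_content: str, metadata: Optional[dict] = None):
        self.page_content = page_content
        self.metadata = metadata or {}


class DocumentTransformerSketch(ABC):
    """Stand-in for the BaseDocumentTransformer interface."""

    @abstractmethod
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> List[Document]:
        ...


class GrobidSplitterSketch(DocumentTransformerSketch):
    """Hypothetical splitter exposing the transformer API instead of a bespoke one."""

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> List[Document]:
        out: List[Document] = []
        for doc in documents:
            # Split each document into paragraphs, carrying metadata through.
            for para in doc.page_content.split("\n\n"):
                if para.strip():
                    out.append(Document(para, dict(doc.metadata)))
        return out
```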
right, there is some confusion on where it should live.
the current flow is:
(1) use the grobid loader to ingest the doc and produce an XML file.
(2) use the grobid parser to parse the XML file.
parsers currently operate on blob loaders (inherit from BaseBlobParser).
a few options:
(1) use an existing blob loader to load PDFs and add the XML generation to the blob parser.
- pro: the blob loader -> parser flow is a useful design pattern that avoids duplicating loader boilerplate
- con: large code change, UX* less obvious, but does follow the pattern from recent PRs:
loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
(2) use the existing loader and move the parsing logic to a text splitter.
- pro: intuitive, historical precedent
- con: the splitter is specific to this loader (due to ingestion of XML), bloating text_splitter
(3) move the "parser" to a function on the loader.
- pro: clear, intuitive UX*, encapsulates all functionality in the loader
- con: breaks from the blob loader -> parser paradigm, which we may want to move towards
*UX would be the same as we currently have: loader = GrobidLoader(file_path, segment_sentences=False)
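Option (1) means subclassing BaseBlobParser and putting all the Grobid logic in `lazy_parse`. A self-contained sketch under stated assumptions (the parser class and stand-ins below are written for this example; a real implementation would POST the blob's bytes to a Grobid server and parse the returned TEI XML):

```python
from typing import Iterator, Optional


class Document:
    """Minimal stand-in for langchain.schema.Document."""

    def __init__(self, page_content: str, metadata: Optional[dict] = None):
        self.page_content = page_content
        self.metadata = metadata or {}


class BlobParserSketch:
    """Stand-in for BaseBlobParser: lazy_parse yields Documents from a blob."""

    def lazy_parse(self, blob):
        raise NotImplementedError


class GrobidParserSketch(BlobParserSketch):
    """Hypothetical parser for option (1): all Grobid logic lives here."""

    def __init__(self, segment_sentences: bool = False):
        self.segment_sentences = segment_sentences

    def lazy_parse(self, blob: str) -> Iterator[Document]:
        # A plain string stands in for the PDF blob / Grobid round-trip.
        units = blob.split(". ") if self.segment_sentences else blob.split("\n\n")
        for i, unit in enumerate(units):
            if unit.strip():
                yield Document(unit.strip(" ."), {"index": i})
```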
this doesn't use grobid at all?
it doesn't use Grobid directly, but accepts the XML produced by Grobid, so there is indeed tight coupling.
I went with option (1) mentioned in the code above: use an existing blob loader to load PDFs and add all the logic (XML generation) to the grobid blob parser. pro: the UX stays the same.
will add better documentation to help contributors recognize this.
Force-pushed from 309ebf6 to d13b3a7: … grobid loader and remove
nice, lgtm
### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text, etc. The GrobidParser uses a local Grobid server to return PDF documents as XML and parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, PDF-relative bboxes, pages (text may overlap two pages), section title (Introduction, Methodology, etc.), section number (e.g. 1.1, 2.3), the title of the paper and finally the file path.

Grobid parsing is useful beyond standard PDF parsing as it accurately outputs sections and the paragraphs within them. This allows for post-filtering of results for specific sections, e.g. limiting results to the methodology or results section. While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion and conclusion. I'm currently experimenting with gpt-3.5 for this function, which could later be implemented as a text splitter.

`Dependencies:`
For use, the Grobid repo must be cloned and Java must be installed; for Colab this is:

```
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via:

```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```

@rlancemartin, @eyurtsev
Twitter Handle: @corranmac
Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

Co-authored-by: rlm <pexpresss31@gmail.com>
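The post-filtering described above can be sketched directly over the parser's metadata. The records and values below are made up for illustration; the field names follow the PR description:

```python
# Parsed documents represented as plain dicts with the metadata fields
# the PR describes (text, section_title, pages).
docs = [
    {"text": "We survey prior work.", "section_title": "Introduction", "pages": ("1", "1")},
    {"text": "We fine-tune a model.", "section_title": "Methodology", "pages": ("2", "3")},
    {"text": "Accuracy improves.", "section_title": "Results", "pages": ("3", "3")},
]

# Limit results to the methodology section, as described above.
methodology = [d for d in docs if d["section_title"] == "Methodology"]
```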