Docugami DataLoader #4727

Merged
merged 19 commits into from May 15, 2023


eyurtsev
Collaborator

Adds a document loader for Docugami

Specifically:

  1. Adds a data loader that talks to the Docugami API to download processed documents as semantic XML
  2. Parses the semantic XML into chunks, with additional metadata capturing chunk semantics
  3. Adds a detailed notebook showing how you can use additional metadata returned by Docugami for techniques like the self-querying retriever
  4. Adds an integration test, and related documentation
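Steps 1 and 2 above can be pictured with a minimal sketch of chunking semantic XML. This is not the loader's actual implementation: the sample XML, its tag names, the `Chunk` class, and the `parse_chunks` helper are all hypothetical and only illustrate how a chunk's position in the XML tree can be carried along as metadata.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical semantic XML. The tag names are invented for illustration
# and are NOT the actual schema returned by the Docugami API.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Holdings LLC</Landlord>
    <Tenant>Birch Street Bakery</Tenant>
  </Parties>
  <Term>
    <CommencementDate>June 1, 2023</CommencementDate>
  </Term>
</Lease>
"""


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus metadata describing its semantics."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_chunks(xml_text: str) -> List[Chunk]:
    """Emit one chunk per non-empty leaf element, tagged with its path in the tree."""
    root = ET.fromstring(xml_text.strip())
    chunks: List[Chunk] = []

    def walk(node: ET.Element, path: List[str]) -> None:
        current = path + [node.tag]
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                metadata = {"tag": node.tag, "xpath": "/" + "/".join(current)}
                chunks.append(Chunk(text=text, metadata=metadata))
            return
        for child in children:
            walk(child, current)

    walk(root, [])
    return chunks


if __name__ == "__main__":
    for chunk in parse_chunks(SAMPLE_XML):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

Metadata of this kind (the element tag and its path) is what a retriever could later filter on, which is the idea behind the self-querying retriever example in the notebook.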

Here is an example of a result that is not possible without the capabilities added by Docugami (from the notebook):

image


## What is Docugami?

Docugami converts business documents into a Document XML Knowledge Graph, generating forests of XML semantic trees representing entire documents. This is a rich representation that includes the semantic and structural characteristics of various chunks in the document as an XML tree.
Collaborator Author

nit: "generating forests of XML semantic trees" -- I don't know what this phrase means, and I suspect other readers will be confused as well. I'm going to land this as is, but feel free to send a new PR to update the language or to address any of the nits below.

access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY")
docset_id: Optional[str]
document_ids: Optional[Sequence[str]]
file_paths: Optional[Sequence[Path]]
Collaborator Author

@tjaffri I updated Lists to Sequences since we want things to be immutable. Consider adding support for str, since users are likely to use either Path or str to work with files.

Contributor

@tjaffri tjaffri May 15, 2023

Thank you! Good idea for str input. I can send a follow-up change for your review at your leisure.
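The suggestion in this thread can be sketched as follows. This is an illustration only, not the code that landed: the `PathLike` alias and the `normalize_paths` helper are hypothetical names. It shows a `Sequence` annotation that accepts lists and tuples alike, with elements that may be either `Path` or `str`.

```python
from pathlib import Path
from typing import List, Sequence, Union

# Hypothetical alias: each entry may be a Path or a plain string.
PathLike = Union[Path, str]


def normalize_paths(file_paths: Sequence[PathLike]) -> List[Path]:
    """Convert every entry to a Path so downstream code handles one type only."""
    return [Path(p) for p in file_paths]


if __name__ == "__main__":
    # Lists and tuples both satisfy Sequence; str and Path entries can be mixed.
    print(normalize_paths(["a.xml", Path("b.xml")]))
    print(normalize_paths(("c.xml",)))
```

Because `Sequence` exposes no mutating methods such as `append`, a type checker will flag attempts to modify the argument, even though the object passed at runtime may still be a mutable list.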

@@ -0,0 +1,28 @@
"""Test DocugamiLoader."""
Collaborator Author

@tjaffri this was moved to unit tests

Contributor

Awesome, thank you.

@eyurtsev eyurtsev merged commit 3c490b5 into master May 15, 2023
13 checks passed
@eyurtsev eyurtsev deleted the eugene/docugami branch May 15, 2023 14:53
@eyurtsev
Collaborator Author

cc @tjaffri the PR has been merged. I made minor changes to resolve merge conflicts, adjusted some type annotations, and moved the test to the unit tests folder.

@tjaffri
Contributor

tjaffri commented May 15, 2023

@eyurtsev thank you!

dev2049 pushed a commit that referenced this pull request May 17, 2023
# Docs and code review fixes for Docugami DataLoader

1. I noticed a couple of hyperlinks that are not loading in the
langchain docs (I guess they need explicit anchor tags). Added those.
2. In code review @eyurtsev had a
[suggestion](#4727 (comment))
to allow string paths. Turns out just updating the type works (I tested
locally with string paths).

# Pre-submission checks
I ran `make lint` and `make tests` successfully.

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>