In [1]:
%load_ext autoreload
%autoreload 2

# Docugami
This notebook covers how to load documents from `Docugami`. See [here](../../../../ecosystem/docugami.md) for more details, and the advantages of using this system over alternative data loaders.

## Prerequisites
1. Follow the Quick Start section in [this document](../../../../ecosystem/docugami.md)
2. Grab an access token for your workspace, and make sure it is set as the DOCUGAMI_API_KEY environment variable
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api

In [None]:
# You need the lxml package to use the DocugamiLoader
!poetry run pip -q install lxml

In [2]:
import os
from langchain.document_loaders import DocugamiLoader

## Load Documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the loader explicitly. Otherwise, you can pass the token in as the `access_token` parameter.

In [18]:
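# The loader reads DOCUGAMI_API_KEY from the environment by default,
# so passing access_token explicitly is optional
DOCUGAMI_API_KEY = os.environ.get("DOCUGAMI_API_KEY")

# To load all docs in the given docset ID, omit document_ids
loader = DocugamiLoader(
    docset_id="ecxqpipcoe2p",
    document_ids=["43rj0ds7s0ur"],
    access_token=DOCUGAMI_API_KEY,
)
docs = loader.load()
docs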

[Document(page_content='MUTUAL NON-DISCLOSURE AGREEMENT', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MutualNon-disclosure', 'documentId': '43rj0ds7s0ur', 'structure': 'h1', 'tag': 'MutualNon-disclosure', 'projects': []}),
 Document(page_content='This  Mutual Non-Disclosure Agreement  (this “ Agreement ”) is entered into and made effective as of  April  4 ,  2018  between  Docugami Inc. , a  Delaware  corporation , whose address is  150  Lake Street South ,  Suite  221 ,  Kirkland ,  Washington  98033 , and  Caleb Divine , an individual, whose address is  1201  Rt  300 ,  Newburgh  NY  12550 .', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:ThisMutualNon-disclosureAgreement', 'documentId': '43rj0ds7s0ur', 'structure': 'p', 'tag': 'ThisMutualNon-disclosureAgreement', 'projects': []}),


The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. documentId: ID of the actual PDF, DOC or DOCX the chunk is sourced from within Docugami.
2. xpath: XPath of the chunk inside the XML representation of the document. Useful for source citations that point directly to the chunk inside the document XML.
3. structure: Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
4. tag: Semantic tag for the chunk, generated using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks
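
As a minimal sketch of how these fields might be used, the snippet below filters chunks by `structure` and builds a citation string from `documentId` and `xpath`. The `Chunk` dataclass is a hypothetical stand-in for `Document` so the snippet runs without Docugami access; `cite` and the sample values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(chunk):
    """Build a source citation from the chunk's documentId and xpath."""
    return f"{chunk.metadata['documentId']}#{chunk.metadata['xpath']}"


# Illustrative sample data, not real loader output
chunks = [
    Chunk("MUTUAL NON-DISCLOSURE AGREEMENT",
          {"documentId": "doc1", "xpath": "/docset:A/docset:Title",
           "structure": "h1", "tag": "Title"}),
    Chunk("This Agreement is entered into ...",
          {"documentId": "doc1", "xpath": "/docset:A/docset:Intro",
           "structure": "p", "tag": "Intro"}),
    Chunk("Rent schedule",
          {"documentId": "doc1", "xpath": "/docset:A/docset:Rent",
           "structure": "table", "tag": "Rent"}),
]

# Keep headings and paragraphs, drop tables
kept = [c for c in chunks if c.metadata["structure"] in {"h1", "p"}]
citations = [cite(c) for c in kept]
print(citations)
```

The same pattern should carry over to the real `Document` objects, since they expose `page_content` and `metadata` in the same way.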

## Basic Use: Docugami Loader for Document QA

You can use the Docugami Loader like a standard loader for Document QA over multiple docs, albeit with much better chunks that follow the natural contours of the document. There are many great tutorials on how to do this, e.g. [this one](https://www.youtube.com/watch?v=3yPBVii7Ct0). We can just use the same code, but use the `DocugamiLoader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.

In [None]:
!poetry run pip -q install openai tiktoken chromadb 

In [None]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# In this example, we already have a docset with a set of lease documents
loader = DocugamiLoader(docset_id="wh2kned25uqm")
documents = loader.load()
documents[:5] # dump a few docs to output

In [None]:
# The documents returned by the loader are already split, but we can filter out
# some chunks e.g. tables in this example
texts = [Document(page_content=d.page_content) for d in documents if d.metadata["structure"] not in ["table"]]

In [None]:
# Set up embeddings and retrieval QA the usual way
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding)
retriever = vectordb.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

In [None]:
# Try out the QA chain with an example query
qa_chain("Are tenants allowed to place signs on their property?")

Notice in the results (specifically the `source_documents`) that some tenants are actually allowed to place signs, while others are not. The LLM used all that context and decided to answer in the negative, even though that is not completely correct.
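
One way to check this is to print the answer next to the chunks that were retrieved. The sketch below assumes the chain returns a dict with `result` and `source_documents` keys, which is what `return_source_documents=True` is expected to produce. The `response` dict is a hand-written stand-in rather than real output, and `SourceDoc` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SourceDoc:
    """Hypothetical stand-in for a retrieved Document."""
    page_content: str


def summarize(response, width=60):
    """Return the answer followed by a truncated preview of each source chunk."""
    lines = [f"Answer: {response['result'].strip()}"]
    for i, doc in enumerate(response["source_documents"], start=1):
        lines.append(f"[{i}] {doc.page_content[:width]}")
    return lines


# Hand-written stand-in for the value returned by qa_chain(...)
response = {
    "result": " No.",
    "source_documents": [
        SourceDoc("Tenant may place signs with Landlord's prior written consent."),
        SourceDoc("Tenant shall not place any signs on the Premises."),
    ],
}

for line in summarize(response):
    print(line)
```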

## Using Docugami to Add Metadata to Chunks for High Accuracy Document QA

One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, struggle to give the LLM sufficient context to answer such questions. With upcoming very-large-context LLMs, it may be possible to stuff many tokens, perhaps even entire documents, into the context, but this will still hit limits at some point with very long documents or large numbers of documents.
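
To make the problem concrete, here is a small sketch of fixed-size chunking with overlap. `chunk_text`, the chunk sizes, and the sample text are all made up for illustration; the point is only that two facts far apart in a document never end up in the same chunk.

```python
def chunk_text(text, size, overlap):
    """Split text into windows of `size` characters, each sharing
    `overlap` characters with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Made-up document: landlord name at the start, address at the very end
filler = "Lorem ipsum dolor sit amet. " * 40
text = "Landlord: Menlo Group. " + filler + "Premises: 1 Example Street."

chunks = chunk_text(text, size=200, overlap=40)

# Look for any chunk that contains both facts
both = [c for c in chunks if "Menlo Group" in c and "Example Street" in c]
print(len(chunks), len(both))
```

Each fact is retrievable on its own, but no single chunk links the landlord to the address, so a retriever that returns only a few chunks may never hand the LLM both.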

For example, if we ask a more complex question that requires the LLM to draw on chunks from different parts of the document, even OpenAI's powerful LLM is unable to answer correctly. In the following example, the chunking of the document does not end up putting the Landlord name and the address in the same context, and instead ends up finding a few other unrelated chunks from other documents not even related to the Menlo Group landlord.

In [None]:
qa_chain("What is the address of the property leased out by Menlo Group?")

Docugami offers a better approach. If a user has been [using Docugami](https://help.docugami.com/home/reports), chunks are annotated with additional metadata created using different techniques. More technical approaches will be added later.

For example, we can build the texts again, but this time read the metadata on the documents returned by Docugami, and put some simple key/value pairs on all the text chunks based on that metadata:

In [None]:
texts = []
for d in documents:
    if d.metadata["structure"] not in ["table"]: # for parity with the example above, continue to filter out tables
        metadata = {"tag": d.metadata["tag"]}
        for project in d.metadata["projects"]:
            for entry in project["entries"]:
                metadata[entry["heading"]] = entry["value"]
        texts.append(Document(page_content=d.page_content, metadata=metadata))

vectordb = Chroma.from_documents(documents=texts, embedding=embedding)
retriever = vectordb.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)
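
To see what the loop above produces for a single chunk, the sketch below applies the same flattening to one hand-written metadata dict. The `projects` layout (a list of projects, each with `entries` holding `heading`/`value` pairs) is inferred from the loop itself, `flatten_metadata` is a hypothetical helper, and the sample headings and values are invented.

```python
def flatten_metadata(metadata):
    """Copy the tag and lift each project entry into a flat key/value pair."""
    flat = {"tag": metadata["tag"]}
    for project in metadata.get("projects", []):
        for entry in project.get("entries", []):
            flat[entry["heading"]] = entry["value"]
    return flat


# Invented sample; real headings depend on what has been set up in Docugami
sample = {
    "structure": "p",
    "tag": "RentClause",
    "projects": [
        {"entries": [
            {"heading": "Landlord", "value": "Menlo Group"},
            {"heading": "Lease Date", "value": "2020-01-01"},
        ]}
    ],
}

print(flatten_metadata(sample))
```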


Let's run the same question again. It returns better results because every chunk now carries metadata key/value pairs with key information about the document, even when that information sits far away from the chunk in the source text.

In [None]:
qa_chain("What is the address of the property leased out by Menlo Group?")