<br>

# <font color="#76b900">Working with Large Documents</font>


### **Environment Setup:**


In [None]:
%pip install -qq langchain langchain-nvidia-ai-endpoints gradio
%pip install -qq arxiv pymupdf

import os
os.environ["NVIDIA_API_KEY"] = "xxxxxxxxxxxxxxxxxxxxxxxxxx"

from functools import partial
from rich.console import Console
from rich.style import Style
from rich.theme import Theme

console = Console()
base_style = Style(color="#76B900", bold=True)
pprint = partial(console.print, style=base_style)

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
ChatNVIDIA.get_available_models()

[Model(id='01-ai/yi-large', model_type=None),
 Model(id='adept/fuyu-8b', model_type=None),
 Model(id='ai21labs/jamba-1.5-large-instruct', model_type=None),
 Model(id='ai21labs/jamba-1.5-mini-instruct', model_type=None),
 Model(id='aisingapore/sea-lion-7b-instruct', model_type=None),
 Model(id='baai/bge-m3', model_type=None),
 Model(id='baichuan-inc/baichuan2-13b-chat', model_type=None),
 Model(id='bigcode/starcoder2-15b', model_type=None),
 Model(id='bigcode/starcoder2-7b', model_type=None),
 Model(id='deepseek-ai/deepseek-coder-6.7b-instruct', model_type=None),
 Model(id='google/codegemma-1.1-7b', model_type=None),
 Model(id='google/deplot', model_type=None),
 Model(id='google/gemma-2-27b-it', model_type=None),
 Model(id='google/gemma-2-2b-it', model_type=None),
 Model(id='google/gemma-2-9b-it', model_type=None),
 Model(id='google/paligemma', model_type=None),
 Model(id='google/shieldgemma-9b', model_type=None),
 Model(id='ibm/granite-34b-code-instruct', model_type=None),
 Model(id='i

In [None]:
## Useful utility method for printing intermediate states
from langchain_core.runnables import RunnableLambda
from functools import partial

def RPrint(preface="State: "):
    def print_and_return(x, preface=""):
        print(f"{preface}{x}")
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))

def PPrint(preface="State: "):
    def print_and_return(x, preface=""):
        pprint(preface, x)
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))

----

<br>

## Loading Documents

The cell below uses `ArxivLoader` to load a paper directly from arXiv. Papers such as [MRKL](https://arxiv.org/abs/2205.00445) or [ReAct](https://arxiv.org/abs/2210.03629) are good candidates, since you're likely to run into them at some point in your continued chat model research. The active line loads the GraphRAG paper, and several alternatives are left commented out.

In [None]:
%%time
from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import ArxivLoader

## Loading in the file

## Unstructured File Loader: Good for arbitrary "probably good enough" loader
# documents = UnstructuredFileLoader("llama2_paper.pdf").load()

## More specialized loader, won't work for everything, but simple API and usually better results
documents = ArxivLoader(query="2404.16130").load()  ## GraphRAG
# documents = ArxivLoader(query="2404.03622").load()  ## Visualization-of-Thought
# documents = ArxivLoader(query="2404.19756").load()  ## KAN: Kolmogorov-Arnold Networks
# documents = ArxivLoader(query="2404.07143").load()  ## Infini-Attention
# documents = ArxivLoader(query="2210.03629").load()  ## ReAct

CPU times: user 409 ms, sys: 71.1 ms, total: 480 ms
Wall time: 768 ms


<br>

We can see from our import that this connector gives us access to two different components:
- The `page_content` is the actual body of the document in some human-interpretable format.
- The `metadata` is relevant information about the document that is provided by the connector via its data source.

Below, we check the length of the document body and print a sample of its content. You will probably notice that the full document is too long to comfortably pass to a chat model in a single prompt:

In [None]:
## Printing out a sample of the content
print("Number of Documents Retrieved:", len(documents))
print(f"Sample of Document 1 Content (Total Length: {len(documents[0].page_content)}):")
print(documents[0].page_content[:1000])

Number of Documents Retrieved: 1
Sample of Document 1 Content (Total Length: 53880):
From Local to Global: A Graph RAG Approach to
Query-Focused Summarization
Darren Edge1†
Ha Trinh1†
Newman Cheng2
Joshua Bradley2
Alex Chao3
Apurva Mody3
Steven Truitt2
Jonathan Larson1
1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO
{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}
@microsoft.com
†These authors contributed equally to this work
Abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant informa-
tion from an external knowledge source enables large language models (LLMs)
to answer questions over private and/or previously unseen document collections.
However, RAG fails on global questions directed at an entire text corpus, such
as “What are the main themes in the dataset?”, since this is inherently a query-
focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods
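
To put the total length printed above in perspective, it can help to convert characters into an approximate token count. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text and not the behavior of any particular tokenizer; `estimate_tokens` is a hypothetical helper, not part of LangChain.

```python
import math

def estimate_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic, not a tokenizer."""
    return math.ceil(n_chars / chars_per_token)

doc_chars = 53880  # total length reported for the loaded paper above
approx_tokens = estimate_tokens(doc_chars)
print(f"~{approx_tokens} tokens for {doc_chars} characters")
```

For an exact figure, count tokens with the tokenizer of the model you actually plan to call.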

<br>

In contrast, the metadata is much more compact, small enough to be a viable context component for your favorite chat model:

In [None]:
pprint(documents[0].metadata)
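
As a minimal sketch of how metadata could be turned into a context component, the helper below renders selected fields as a short text block. The key names (`Title`, `Authors`, `Published`) are assumptions about what the loader returns, and `metadata_to_context` is a hypothetical helper; compare against your own `pprint` output before relying on it.

```python
def metadata_to_context(metadata: dict, keys=("Title", "Authors", "Published")) -> str:
    """Render selected metadata fields as 'Key: value' lines, skipping missing keys."""
    lines = [f"{key}: {metadata[key]}" for key in keys if key in metadata]
    return "\n".join(lines)

# Hand-written sample built from values visible in the document text above.
# The key names are assumptions, and only the first three authors are listed.
sample_metadata = {
    "Title": "From Local to Global: A Graph RAG Approach to Query-Focused Summarization",
    "Authors": "Darren Edge, Ha Trinh, Newman Cheng",
    "Published": "2024-04-24",
}
context = metadata_to_context(sample_metadata)
print(context)
```

In the notebook itself, the same function could be called on `documents[0].metadata` and the result prepended to a prompt.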

----

<br>

## Transforming The Documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

## Some nice custom preprocessing
# documents[0].page_content = documents[0].page_content.replace(". .", "")
docs_split = text_splitter.split_documents(documents)

# def include_doc(doc):
#     ## Some chunks will be overburdened with useless numerical data, so we'll filter it out
#     string = doc.page_content
#     if len([l for l in string if l.isalpha()]) < (len(string)//2):
#         return False
#     return True

# docs_split = [doc for doc in docs_split if include_doc(doc)]
print(len(docs_split))

50
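
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the separators first and then merges pieces), and `sliding_chunks` is a hypothetical helper written only for illustration.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int = 1200, chunk_overlap: int = 100) -> List[str]:
    """Fixed-width windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# Synthetic 3000-character text so the example needs no downloaded paper
sample = "".join(chr(ord("a") + i % 26) for i in range(3000))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])  # neighbouring windows share the overlap region
```

The real splitter prefers to break at paragraph, line, and sentence boundaries, so its chunks are usually shorter than `chunk_size` and the overlap is not always exactly `chunk_overlap` characters.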


In [None]:
for i in (0, 1, 2, 15, -1):
    pprint(f"[Document {i}]")
    print(docs_split[i].page_content)
    pprint("="*64)

From Local to Global: A Graph RAG Approach to
Query-Focused Summarization
Darren Edge1†
Ha Trinh1†
Newman Cheng2
Joshua Bradley2
Alex Chao3
Apurva Mody3
Steven Truitt2
Jonathan Larson1
1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO
{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}
@microsoft.com
†These authors contributed equally to this work
Abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant informa-
tion from an external knowledge source enables large language models (LLMs)
to answer questions over private and/or previously unseen document collections.
However, RAG fails on global questions directed at an entire text corpus, such
as “What are the main themes in the dataset?”, since this is inherently a query-
focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical
RAG systems. 

a Graph RAG approach to question answering over private text corpora that scales
with both the generality of user questions and the quantity of source text to be in-
dexed. Our approach uses an LLM to build a graph-based text index in two stages:
first to derive an entity knowledge graph from the source documents, then to pre-
generate community summaries for all groups of closely-related entities. Given a
question, each community summary is used to generate a partial response, before
all partial responses are again summarized in a final response to the user. For a
class of global sensemaking questions over datasets in the 1 million token range,
we show that Graph RAG leads to substantial improvements over a na¨
ıve RAG
baseline for both the comprehensiveness and diversity of generated answers. An
open-source, Python-based implementation of both global and local Graph RAG
approaches is forthcoming at https://aka.ms/graphrag.
1
Introduction
Human endeavors across a range of domains rely

collections of documents, often reaching conclusions that go beyond anything stated in the source
texts themselves. With the emergence of large language models (LLMs), we are already witnessing
attempts to automate human-like sensemaking in complex domains like scientific discovery (Mi-
crosoft, 2023) and intelligence analysis (Ranade and Joshi, 2023), where sensemaking is defined as
Preprint. Under review.
arXiv:2404.16130v1  [cs.CL]  24 Apr 2024
Source Documents
Text Chunks
text extraction
and chunking
Element Instances
domain-tailored
summarization
Element Summaries
domain-tailored
summarization
Graph Communities
community
detection
Community Summaries
domain-tailored
summarization
Community Answers
query-focused
summarization
Global Answer
query-focused
summarization
Indexing Time
Query Time
Pipeline Stage
Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This
index spans nodes (e.g., entities), edges (e.g., relationships), and covariates (e.g.,

2.6
Community Summaries →Community Answers →Global Answer
Given a user query, the community summaries generated in the previous step can be used to generate
a final answer in a multi-stage process. The hierarchical nature of the community structure also
means that questions can be answered using the community summaries from different levels, raising
the question of whether a particular level in the hierarchical community structure offers the best
balance of summary detail and scope for general sensemaking questions (evaluated in section 3).
For a given community level, the global answer to any user query is generated as follows:
• Prepare community summaries. Community summaries are randomly shuffled and divided
into chunks of pre-specified token size. This ensures relevant information is distributed
across chunks, rather than concentrated (and potentially lost) in a single context window.
• Map community answers. Generate intermediate answers in parallel, one for each chunk.
The LLM i

HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on
Empirical Methods in Natural Language Processing (EMNLP).
Yao, J.-g., Wan, X., and Xiao, J. (2017). Recent advances in document summarization. Knowledge
and Information Systems, 53:297–336.
14
Yao, L., Peng, J., Mao, C., and Luo, Y. (2023). Exploring large language models for knowledge
graph completion.
Zhang, J. (2023). Graph-toolformer: To empower llms with graph reasoning ability via prompt
augmented by chatgpt. arXiv preprint arXiv:2304.11116.
Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wang, C. (2024). Causal graph discovery with retrieval-
augmented generation based large language models. arXiv preprint arXiv:2402.15301.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing,
E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural
Information Processing Systems, 36.
15


----

<br>

## Refining Summaries


#### **The DocumentSummaryBase Model**

The `DocumentSummaryBase` structure below is designed to encapsulate the essence of a document. The `running_summary` field holds the summary we ultimately want from the model, while the `main_ideas` and `loose_ends` fields act as a bottleneck that keeps the running summary from changing too quickly. The schema alone cannot enforce this behavior, so we rely on prompt engineering: the accompanying `summary_prompt` shows how this information will be used. Feel free to modify it as necessary to make it work for your model of choice.

In [None]:
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.output_parsers import PydanticOutputParser

from langchain_nvidia_ai_endpoints import ChatNVIDIA

from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List
from IPython.display import clear_output


class DocumentSummaryBase(BaseModel):
    running_summary: str = Field("", description="Running description of the document. Do not override; only update!")
    main_ideas: List[str] = Field([], description="Most important information from the document (max 3)")
    loose_ends: List[str] = Field([], description="Open questions that would be good to incorporate into summary, but that are yet unknown (max 3)")


summary_prompt = ChatPromptTemplate.from_template(
    "You are generating a running summary of the document. Make it readable by a technical user."
    " After this, the old knowledge base will be replaced by the new one. Make sure a reader can still understand everything."
    " Keep it short, but as dense and useful as possible! The information should flow from chunk to (loose ends or main ideas) to running_summary."
    " The updated knowledge base should keep all of the information from running_summary here: {info_base}."
    "\n\n{format_instructions}. Follow the format precisely, including quotations and commas"
    "\n\nWithout losing any of the info, update the knowledge base with the following: {input}"
)

In [None]:
def RExtract(pydantic_class, llm, prompt):
    '''
    Runnable Extraction module
    Returns a knowledge dictionary populated by slot-filling extraction
    '''
    parser = PydanticOutputParser(pydantic_object=pydantic_class)
    instruct_merge = RunnableAssign({'format_instructions' : lambda x: parser.get_format_instructions()})
    def preparse(string):
        if '{' not in string: string = '{' + string
        if '}' not in string: string = string + '}'
        string = (string
            .replace("\\_", "_")
            .replace("\n", " ")
            .replace("\\]", "]")  ## Escaped backslash avoids an invalid-escape warning
            .replace("\\[", "[")
        )
        # print(string)  ## Good for diagnostics
        return string
    return instruct_merge | prompt | llm | preparse | parser
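
To see what the `preparse` step is for, the sketch below runs a standalone copy of it on a made-up model response. The raw string is hypothetical, and `json.loads` stands in for `PydanticOutputParser` so that the example runs without any LangChain dependency.

```python
import json

def preparse(string: str) -> str:
    """Standalone copy of the cleanup step used inside RExtract."""
    if '{' not in string: string = '{' + string
    if '}' not in string: string = string + '}'
    return (string
        .replace("\\_", "_")
        .replace("\n", " ")
        .replace("\\]", "]")
        .replace("\\[", "[")
    )

# Hypothetical raw output: missing braces, escaped underscores, and a raw newline
raw = '"running\\_summary": "Graph RAG overview",\n"main\\_ideas": ["indexing"], "loose\\_ends": []'
cleaned = preparse(raw)
parsed = json.loads(cleaned)  # stand-in for the Pydantic parser
print(parsed["running_summary"])
```

The cleanup is heuristic: it repairs a few common formatting slips but cannot rescue output that is not close to valid JSON in the first place.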


<br>

With this in mind, the following code invokes the running-state chain in a for-loop to iterate over your documents. The only modification necessary should be the `parse_chain` implementation, which should pass the state through a properly configured `RExtract` chain. After this, the system should work decently to maintain a running summary of the document (though some tweaking of the prompt may be required depending on the model used).

In [None]:
latest_summary = ""

## TODO: Use the techniques from the previous notebook to complete the exercise
def RSummarizer(knowledge, llm, prompt, verbose=False):
    '''
    Exercise: Create a chain that summarizes
    '''
    ###########################################################################################
    ## START TODO:

    def summarize_docs(docs):
        ## TODO: Initialize the parse_chain appropriately; should include an RExtract instance.
        ## HINT: You can get a class using the <object>.__class__ attribute...
        parse_chain = RunnableAssign({'info_base' : RExtract(knowledge.__class__, llm, prompt)})

        ## TODO: Initialize a valid starting state. Should be similar to notebook 4
        state = {'info_base' : knowledge}

        global latest_summary  ## If your loop crashes, you can check out the latest_summary

        for i, doc in enumerate(docs):
            ## TODO: Update the state as appropriate using your parse_chain component
            state['input'] = doc.page_content
            state = parse_chain.invoke(state)

            assert 'info_base' in state
            latest_summary = state['info_base']  ## Updated every step so it survives a crash, even if verbose=False
            if verbose:
                print(f"Considered {i+1} documents")
                pprint(state['info_base'])
                clear_output(wait=True)

        return state['info_base']

    ## END TODO
    ###########################################################################################

    return RunnableLambda(summarize_docs)

# instruct_model = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1").bind(max_tokens=4096)
instruct_model = ChatNVIDIA(model="mistralai/mixtral-8x22b-instruct-v0.1").bind(max_tokens=4096)
instruct_llm = instruct_model | StrOutputParser()

## Take the first 15 document chunks and accumulate a DocumentSummaryBase
summarizer = RSummarizer(DocumentSummaryBase(), instruct_llm, summary_prompt, verbose=True)
summary = summarizer.invoke(docs_split[:15])

Considered 15 documents


In [None]:
pprint(latest_summary)
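
The refinement loop can also be exercised offline, which is handy when debugging the state flow without spending API calls. The sketch below replaces the LLM with a toy update function; `SummaryState`, `stub_update`, and `refine` are hypothetical stand-ins for `DocumentSummaryBase`, the `RExtract` chain, and `RSummarizer`, not the notebook's actual implementation.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List

@dataclass(frozen=True)
class SummaryState:
    """Plain-Python stand-in for DocumentSummaryBase (illustration only)."""
    running_summary: str = ""
    main_ideas: List[str] = field(default_factory=list)
    loose_ends: List[str] = field(default_factory=list)

def stub_update(state: SummaryState, chunk: str) -> SummaryState:
    """Toy 'LLM' step: take the chunk's first sentence as a main idea and move the
    oldest idea into the running summary once more than 3 ideas are held."""
    idea = chunk.split(".")[0].strip()
    ideas = state.main_ideas + [idea]
    summary = state.running_summary
    if len(ideas) > 3:  # mirrors the "(max 3)" bottleneck in the schema
        summary = (summary + " " + ideas[0] + ".").strip()
        ideas = ideas[1:]
    return replace(state, running_summary=summary, main_ideas=ideas)

def refine(chunks: List[str], update: Callable[[SummaryState, str], SummaryState]) -> SummaryState:
    """Fold the update step over the chunks, carrying the state forward."""
    state = SummaryState()
    for chunk in chunks:
        state = update(state, chunk)
    return state

chunks = ["A is first. detail", "B is second. detail", "C is third. detail", "D is fourth. detail"]
final = refine(chunks, stub_update)
print(final.running_summary)
print(final.main_ideas)
```

The shape matches the exercise: each step sees only the previous state and one new chunk, so information reaches the running summary only after passing through the bounded `main_ideas` list.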