## Parent Document Retriever

A method that splits large pieces of text (parent chunks) into smaller ones (child chunks). Because each child chunk covers a narrower span of text, the information it conveys is more focused than that of a full paragraph or document, which makes it easier to match against a query.

There is one tension in all of this:

We need to split our documents into small, manageable parts if we want to be precise when searching for the most relevant information.
However, it is also crucial to give the LLM solid context, which means handing it larger portions of text.
The main idea is to split the large pieces (the parent chunks or documents) further into smaller ones (the child chunks or documents). The search for the top K most relevant results then runs over the child chunks, and the retriever returns the parent chunks to which those top K children belong.
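
The idea can be sketched in plain Python before bringing in a vector store. This is only a toy illustration under stated assumptions, not the implementation used in the rest of this notebook: the helper names (`split_into_children`, `overlap_score`, `retrieve_parents`) and the sample `parents` dictionary are made up for the example, and a simple word-overlap count stands in for embedding similarity.

```python
from collections import Counter


def split_into_children(text, chunk_size=60):
    """Greedy splitter: pack whole words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def overlap_score(query, text):
    """Toy relevance score (assumption): number of lowercase words shared with the query."""
    query_words = Counter(query.lower().split())
    text_words = Counter(text.lower().split())
    return sum((query_words & text_words).values())


def retrieve_parents(query, parents, k=2, chunk_size=60):
    # Index step: every child chunk remembers the id of its parent.
    children = [
        (parent_id, chunk)
        for parent_id, text in parents.items()
        for chunk in split_into_children(text, chunk_size)
    ]
    # Search over the small, focused child chunks ...
    top_children = sorted(
        children, key=lambda child: overlap_score(query, child[1]), reverse=True
    )[:k]
    # ... but return the parents they belong to, without duplicates.
    parent_ids = list(dict.fromkeys(parent_id for parent_id, _ in top_children))
    return [parents[parent_id] for parent_id in parent_ids]


# Hypothetical sample data, for illustration only.
parents = {
    "a": "LangSmith is a framework for tracing and debugging LLM applications. It records every run.",
    "b": "Chroma is a vector store. It keeps embeddings on disk and supports metadata filters.",
}
best_parents = retrieve_parents("what is langsmith", parents, k=1)
```

The search touches only the short child chunks, while the caller receives the full parent text, which is the trade-off described above.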




In [2]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
    
# Text splitting and document loading
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

from langchain.embeddings import HuggingFaceBgeEmbeddings
import glob

In [3]:
# Specify the directory path (adjust as needed)
directory_path = "/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post"

# Use glob to find all .txt files in the directory and its subdirectories
# (the "/" before "**" is needed for the recursive match to apply)
txt_files = glob.glob(f"{directory_path}/**/*.txt", recursive=True)

# Create one TextLoader per file
loaders = [TextLoader(path) for path in txt_files]

# Initialize an empty list to store documents
docs = []

# Loop through each loader, load the document, and extend the docs list
for loader in loaders:
    docs.extend(loader.load())

In [4]:
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)
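
The comment on `normalize_embeddings` can be checked with a few lines of plain Python. This is a small sketch with made-up two-dimensional vectors (real embeddings have many more dimensions), and the helper names are illustrative: once both vectors are scaled to unit length, their dot product equals their cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical vectors, for illustration only.
a, b = [3.0, 4.0], [4.0, 3.0]
normalized_dot = dot(normalize(a), normalize(b))
cosine = cosine_similarity(a, b)
```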



In [8]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index both parent documents and child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=bge_embeddings,  # or your preferred embedding function
)

# Function to index parent documents and their child chunks with metadata
def index_documents_with_metadata(parent_documents):
    for doc in parent_documents:
        # Add or update metadata for the parent document
        doc.metadata.update({
            'document_id': doc.metadata.get('source', 'unknown'),  # Example to add unique identifier
        })
        # Index the parent document with metadata
        vectorstore.add_texts([doc.page_content], metadatas=[{"type": "parent", **doc.metadata}])
        
        # Split the parent document into child chunks and index those with metadata
        child_chunks = child_splitter.split_text(doc.page_content)
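        # Every child chunk records the id of the parent it was cut from
        child_metadatas = [
            {"type": "child", "parent_id": doc.metadata.get('document_id')}
            for _ in child_chunks
        ]
        vectorstore.add_texts(child_chunks, metadatas=child_metadatas)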



# Indexing all documents with metadata into the vectorstore
index_documents_with_metadata(docs)

In [11]:
class UnifiedParentDocumentRetriever:
    def __init__(self, vectorstore, child_splitter, parent_splitter=None):
        self.vectorstore = vectorstore
        self.child_splitter = child_splitter
        self.parent_splitter = parent_splitter

    def retrieve(self, query, retrieve_parents=True, retrieve_children=True):
        # Prepare filters based on what you want to retrieve
        filters = []
        if retrieve_parents:
            filters.append({"type": "parent"})
        if retrieve_children:
            filters.append({"type": "child"})
        
        results = []
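        # Run one filtered search per requested type; avoid shadowing the built-in `filter`
        for metadata_filter in filters:
            results.extend(self.vectorstore.similarity_search(query, filter=metadata_filter))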
        
        return results

# Set up the retriever
retriever = UnifiedParentDocumentRetriever(
    vectorstore=vectorstore, 
    child_splitter=child_splitter,
    parent_splitter=None  # Optional: each parent document is indexed whole, so no parent splitter is passed
)


In [12]:
# Retrieve both parent and child documents
results = retriever.retrieve(query="what is langsmith")


In [13]:
results

[Document(metadata={'document_id': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'source': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'type': 'parent'}, page_content='URL: https://blog.langchain.dev/peering-into-the-soul-of-ai-decision-making-with-langsmith/\nTitle: Peering Into the Soul of AI Decision-Making with LangSmith\n\nEditor\'s Note: This post was written by Paul Thomson from Commandbar. They\'ve been awesome partners as they brought their application into production with LangSmith, and we\'re excited to share their story getting there.\n\nDo you ever wonder why you’re getting unhinged responses from ChatGPT sometimes? Or why the heck Midjourney is giving your creations 7 weird fingers? As intelligent as AI is supposed to be, it does produce some prett

In [14]:
# Retrieve only parent documents
parent_results = retriever.retrieve(query="what is langsmith", retrieve_parents=True, retrieve_children=False)

In [15]:
parent_results

[Document(metadata={'document_id': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'source': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'type': 'parent'}, page_content='URL: https://blog.langchain.dev/peering-into-the-soul-of-ai-decision-making-with-langsmith/\nTitle: Peering Into the Soul of AI Decision-Making with LangSmith\n\nEditor\'s Note: This post was written by Paul Thomson from Commandbar. They\'ve been awesome partners as they brought their application into production with LangSmith, and we\'re excited to share their story getting there.\n\nDo you ever wonder why you’re getting unhinged responses from ChatGPT sometimes? Or why the heck Midjourney is giving your creations 7 weird fingers? As intelligent as AI is supposed to be, it does produce some prett

In [16]:
# Retrieve only child documents
child_results = retriever.retrieve(query="what is langsmith", retrieve_parents=False, retrieve_children=True)

In [17]:
child_results

[Document(metadata={'parent_id': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'type': 'child'}, page_content='What Is LangSmith?\n\nLangSmith is a framework built on the shoulders of LangChain. It’s designed to track the inner workings of LLMs and AI agents within your product.\n\nThose LLM inner-workings can be categorized into 4 main buckets - each with its own flair of usefulness. Here’s a breakdown of how they all work in unison and what you can expect.\n\n\n\nDebugging:'),
 Document(metadata={'parent_id': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt', 'type': 'child'}, page_content='What Are LangSmith Traces?'),
 Document(metadata={'parent_id': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-o


If you start with a search that returns information from child chunks and then want to retrieve more comprehensive information (e.g., the entire parent document or additional related chunks), you can adjust your retrieval process as follows:

In [21]:
# child_results holds the top 4 child chunks retrieved above; print the parent each one belongs to.
for result in child_results:
    parent_id = result.metadata.get('parent_id')
    print(parent_id)

/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt
/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt
/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt
/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_peering-into-the-soul-of-ai-decision-making-with-langsmith_.txt
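
All four child chunks printed above point to the same parent, so it helps to collapse the ids before fetching anything. The following is a minimal sketch under stated assumptions: `Doc` is a stand-in for a LangChain `Document`, `parents_by_id` is a hypothetical in-memory mapping from `document_id` to parent text, and the file names are made up. The lookup is an exact match on the id rather than a second similarity search.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_parent_ids(child_docs):
    """Return parent ids in first-seen (best-ranked) order, without duplicates."""
    ids = (doc.metadata.get("parent_id") for doc in child_docs)
    return list(dict.fromkeys(pid for pid in ids if pid is not None))


def fetch_parents(child_docs, parents_by_id):
    """Look up each distinct parent by exact id match."""
    return [
        parents_by_id[pid]
        for pid in unique_parent_ids(child_docs)
        if pid in parents_by_id
    ]


# Hypothetical data, for illustration only.
children = [
    Doc("What Is LangSmith?", {"type": "child", "parent_id": "a.txt"}),
    Doc("What Are LangSmith Traces?", {"type": "child", "parent_id": "a.txt"}),
    Doc("Chroma stores embeddings.", {"type": "child", "parent_id": "b.txt"}),
]
parents_by_id = {"a.txt": "Full text of a.txt", "b.txt": "Full text of b.txt"}
parent_texts = fetch_parents(children, parents_by_id)
```

With a vector store, the same exact match can be expressed as a metadata filter on `document_id`, in the same way the `type` filter is passed to `similarity_search` in the retriever class.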


In [22]:
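# Retrieve the parent document using the parent_id from the child chunk's metadata. For illustration, pick the first one.
parent_id = child_results[0].metadata.get('parent_id')
# Filter on document_id so that only the matching parent comes back; passing the file path
# as the query text would only rank parents by their similarity to the path string.
parent_results = vectorstore.similarity_search("what is langsmith", k=1, filter={"document_id": parent_id})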

# Display the entire parent document content
for parent_result in parent_results:
    print(parent_result.page_content)


URL: https://blog.langchain.dev/peering-into-the-soul-of-ai-decision-making-with-langsmith/
Title: Peering Into the Soul of AI Decision-Making with LangSmith

Editor's Note: This post was written by Paul Thomson from Commandbar. They've been awesome partners as they brought their application into production with LangSmith, and we're excited to share their story getting there.

Do you ever wonder why you’re getting unhinged responses from ChatGPT sometimes? Or why the heck Midjourney is giving your creations 7 weird fingers? As intelligent as AI is supposed to be, it does produce some pretty unintelligent responses sometimes.

Now, if you’re using GPT to write your next “let ‘em down easy breakup message”, the stakes are low - it doesn’t really matter. But if a core product feature is leveraging AI and your customers depend on super-intelligent perfection, you’re going to want some security and assurances that the outputs are up to scratch. Enter, LangSmith.

Since the launch of HelpHub