Combine ParentdocumentRetriever with Reranking: Rerank retrieved child chunks #21966
-
Checked other resources
Commit to Help
Example Code:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.retrievers import MultiVectorRetriever
# PG Vector von Kheiri
from langchain_community.vectorstores.pgvector import PGVector
# Neues PG Vector von Langchain
#from langchain_postgres import PGVector
#from langchain_postgres.vectorstores import PGVector
import os
import shutil
import json
from langchain.load import dumps, loads
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
def build_parent_document_retriever_store(directory_path, embeddings, collection_name, connection, llm):
    """Build and persist a ParentDocumentRetriever backed by PGVector.

    Reads and chunks the documents under ``directory_path``, wipes any
    previously persisted parent-document store for ``collection_name``,
    then indexes child chunks into PGVector while storing the parent
    chunks in a local key/value docstore.

    Args:
        directory_path: Folder containing the source documents.
        embeddings: Embedding function used by the PGVector store.
        collection_name: Name of the PGVector collection (also used to
            derive the on-disk docstore path).
        connection: PGVector connection string.
        llm: Passed through to the preprocessing step.

    Returns:
        A ``ParentDocumentRetriever`` with all documents already added.
    """
    # read_preprocess_and_chunk_data returns LangChain Documents
    # (processed with unstructured-io) — see its definition elsewhere.
    documents = read_preprocess_and_chunk_data(directory_path=directory_path, llm=llm)
    parent_retriever_file_store = f"vectorstore/parent_document_retriever/parent_doc_store_{collection_name}"
    # Start from an empty docstore directory so stale parent documents
    # from a previous build cannot leak into the new collection.
    # shutil.rmtree handles files, symlinks and subdirectories in one call.
    if os.path.exists(parent_retriever_file_store):
        print(f"Clearing existing files in {parent_retriever_file_store}")
        shutil.rmtree(parent_retriever_file_store)
    # Recreate the (now empty) directory, as the original code kept it.
    os.makedirs(parent_retriever_file_store, exist_ok=True)
    fs = LocalFileStore(parent_retriever_file_store)
    store = create_kv_docstore(fs)
    # Parent chunks are what we return to the caller; child chunks are
    # what actually gets embedded and searched.
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=cfg.PARENT_CHUNK_SIZE)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=cfg.CHILD_CHUNK_SIZE)
    vectorstore = PGVector(
        embedding_function=embeddings,
        collection_name=collection_name,
        connection_string=connection,
        use_jsonb=True,
        # Drop any existing vectors for this collection before indexing.
        pre_delete_collection=True,
    )
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents(documents=documents)
    return retriever
# This is how I can load the created and saved ParentDocumentRetriever
def rebuild_parent_document_retriever(embeddings, COLLECTION_NAME, CONNECTION_STRING):
"""Recreate Retriever Object to be reused."""
# only do what's needed to recreate the retriever
# no need to actually load or split docs
parent_retriever_file_store = f"vectorstore/parent_document_retriever/parent_doc_store_{COLLECTION_NAME}"
vectorstore = load_PG_vectorstore(embeddings, COLLECTION_NAME, CONNECTION_STRING)
fs = LocalFileStore(parent_retriever_file_store)
store = create_kv_docstore(fs)
big_chunks_retriever = MultiVectorRetriever(
vectorstore=vectorstore,
docstore=store,
)
big_chunks_retriever.search_kwargs['k'] = cfg.VECTOR_COUNT
return big_chunks_retriever DescriptionHi, I want to combine ParentDocument-Retrieval with Reranking (e.g. ColBERT). But I am not sure how I can implement this. This would be how the chunking works:
So how can I use this reranking method after retrieving the child chunks and before resolving them to their parent chunks? So again: child documents should be searched and reranked, and the reranked child documents should then be used to fetch their parent documents.

System Info

conda list langchain — packages in environment at /Users/mweissenba001/anaconda3:
Name Version Build Channel
langchain 0.1.17 pypi_0 pypi |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 4 replies
-
To implement reranking of child chunks before referencing their parent chunks, you can follow these steps:
This approach ensures that you first rerank the child chunks and then organize them by their parent chunks, avoiding the issue of exceeding the token limit of your reranking model.
|
Beta Was this translation helpful? Give feedback.
-
You can implement your own version of ParentDocumentRetriever:
from typing import List
from langchain.retrievers import ParentDocumentRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain.retrievers.multi_vector import SearchType
class ParentDocumentReranker(ParentDocumentRetriever):
def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
"""Get documents relevant to a query.
Args:
query: String to find relevant documents for
run_manager: The callbacks handler to use
Returns:
List of relevant documents
"""
if self.search_type == SearchType.mmr:
sub_docs = self.vectorstore.max_marginal_relevance_search(
query, **self.search_kwargs
)
else:
sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
# Reranking logic here with sub_docs (child documents)
...
# We do this to maintain the order of the ids that are returned
ids = []
for d in sub_docs:
if self.id_key in d.metadata and d.metadata[self.id_key] not in ids:
ids.append(d.metadata[self.id_key])
docs = self.docstore.mget(ids)
return [d for d in docs if d is not None] In this code I commented the section where you need to create your reranking logic using sub documents. Let me know if you have any questions ! |
Beta Was this translation helpful? Give feedback.
Hi, thanks for your suggestion! I was able to resolve it by modifying the MultiVectorRetriever (multi_vector.py). I added the function "get_matching_reranked_docs", which returns the reranked results in the right order as LangChain Documents with a "score" key in the metadata. In "_get_relevant_documents" I implemented the reranking changes.
Here is the full code that works for me: