
ParentDocumentRetriever need splitter and not transformer #11968

Closed

Conversation

@pprados (Contributor) commented Oct 18, 2023

Description:
The current implementation of ParentDocumentRetriever requires two splitters: child_splitter and parent_splitter.

But a splitter is just one kind of BaseDocumentTransformer.

This is a limitation. If you want to apply other transformations to the documents (for each fragment, generate questions, generate a summary, etc.), it is not possible.

I propose a new version of ParentDocumentRetriever (in the file parent_document_retriever_v2.py for the moment). This version expects a child_transformer and a parent_transformer. It can still be used with splitters.

I had to rename the attributes, because they now hold transformers. It may be possible to declare child_splitter and parent_splitter as aliases, to stay compatible with the previous version. I can do this if you accept the principle of my pull request.

The idea behind this is to improve RAG by increasing the number of embedded versions of each fragment.
Combined with another pull request, it may become possible to write:

vs = ParentDocumentVectorStore(
    vectorstore=vs,
    docstore=self.docstore,
    id_key=CV_ID_KEY,
    parent_transformer=RecursiveCharacterTextSplitter(),
    child_transformer=DocumentTransformerPipeline(
        transformers=[
            GenerateQuestions.from_llm(llm),
            SummarizeTransformer.from_llm(llm)
        ]
    ),
)
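The intent (each parent fragment fans out into several transformed child versions, each keeping a reference back to its parent so the original fragment can be recovered at retrieval time) can be sketched in plain Python. All names below are hypothetical and not part of the actual LangChain API:

```python
import uuid

# Hypothetical sketch: fan a parent fragment out into several child
# versions, each carrying its parent's id in its metadata.

def index_with_transformer(parent_text, child_transformer,
                           docstore, vectorstore, id_key="doc_id"):
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = parent_text          # keep the original fragment
    for version in child_transformer(parent_text):
        # Each transformed version is embedded/stored separately,
        # but points back to its parent via id_key.
        vectorstore.append({"text": version, "metadata": {id_key: parent_id}})
    return parent_id

# A toy "transformer" producing two versions per fragment:
def questions_and_summary(text):
    yield f"Question about: {text}"
    yield f"Summary of: {text}"

docstore, vectorstore = {}, []
pid = index_with_transformer("Paris is the capital of France.",
                             questions_and_summary, docstore, vectorstore)

# Retrieval recovers the parent fragment via the stored id:
hit = vectorstore[0]
assert docstore[hit["metadata"]["doc_id"]] == "Paris is the capital of France."
```

A real implementation would of course embed each version and store it in a true VectorStore; the sketch only shows the parent/child bookkeeping.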

New version of parent_document_retriever to use "transformer" in place of "splitter"

Tag maintainer:
@baskaryan


@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 18, 2023
child_transformer: BaseDocumentTransformer
Collaborator

@pprados let's change this on the original parent document retriever.

If there is a way to change names using pydantic features without making breaking changes, that's great. If not, that's fine -- better to have a slightly unintuitive parameter name than a separate class with essentially identical functionality.

@pprados (Contributor, Author) commented Oct 19, 2023

@eyurtsev, I am working on a better implementation.

A question about the usage of parsers.
When I use a parser in a prompt, the method predict_and_parse() now throws a warning:
"The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain."

Does this mean that it is no longer possible to have a parser in a prompt?
However, this would be a good idea for complex algorithms requiring several prompts, such as map_reduce, refine, etc.
I don't know how it's possible to pass an output parser directly to LLMChain. To do that, I would always have to receive an LLMChain, and not a prompt, as a parameter.

The map_reduce_chain receives a question_prompt, combine_prompt and collapse_prompt. Will they be deprecated?

And how is it possible to reuse the same LLMChain with different parsers?

Collaborator

I am working on a better implementation.

Sounds good -- I would just suggest discussing implementation / design before putting in the effort, to make sure that the new code gets merged in.


  1. LLMChain is a legacy object. It's not a good idea to build new functionality on top of it. Use LCEL runnables instead.

  2. LLMChain accepts a parser: https://github.com/langchain-ai/langchain/blob/68599d98c20b1e3fbdabaef1b1fbe54cd06b98a4/libs/langchain/langchain/chains/llm.py#L54C1-L54C1. (But again we don't want to be using LLMChain in new code)

  3. The map_reduce_chain receives a question_prompt, combine_prompt and collapse_prompt -- no plans to deprecate them, as users are using them. We may at some point provide new implementations with LCEL.

@pprados (Contributor, Author) commented Oct 23, 2023

Hello @eyurtsev

My goal is to be able to combine all the advanced features of langchain. That's impossible at the moment.

I want to be able to:

  • cut a document into fragments
  • for each fragment, transform it into several versions
  • persist each version in a vectorstore
  • when retrieving a version, recover the original fragment
  • manage the lifecycle of all fragments and associated versions
  • when using the vectorstore,
    • combine SelfQueryRetriever, MultiQueryRetriever, ParentDocumentRetriever, etc.
    • combine filters
  • manage the lifecycle of fragments and their associated transformations with index()

It's that simple.

My idea is to offer a wrapper around the VectorStore. This wrapper can be used in place of any VectorStore.
Then, if I extend the SelfQueryRetriever, it's possible to use an equivalent of ParentDocumentRetriever, etc.

The code is not published yet, but you can read a notebook with the ideas (see the chapter "Short version").

I will propose:

  • a solution to generate multiple versions of each fragment (DocumentTransformerPipeline; the name will be changed)
    • generate 3 questions for the fragment
    • summarize the fragment
    • etc.
  • a VectorStoreWrapper (to be compatible with SelfQueryRetriever)
  • an implementation of VectorStoreWrapper to manage multiple transformations of a fragment: ParentVectorStore

I think it's possible to enrich LCEL to introduce these new ideas:

  • Combine transformations with a pipe
  • etc.

What are your thoughts on this?

@pprados (Contributor, Author) commented Oct 23, 2023

I think the BaseDocumentTransformer should become a RunnableSerializable[Sequence[Document], Sequence[Document]], to allow a better syntax.

But then all subclasses (TextSplitter, etc.) must use Pydantic (I am trying this).
I think it's also the opportunity to add a lazy_transform_documents() that yields the transformations lazily.

I've started to implement this.

@eyurtsev eyurtsev self-assigned this Oct 29, 2023
@eyurtsev (Collaborator) commented Nov 1, 2023

  • cut a document into fragments
  • for each fragment, transform them into several versions
  • persist each version in a vs
  • when recovering a version, recover the original fragment
  • manage the lifecycle of all fragments and associated versions
  • when using VS,
    • combine SelfQueryRetriever, MultiQueryRetriever, ParentDocumentRetriever, etc.
    • combine filters
  • manage the life cycle of fragment and associated transformation with index()

It's that simple.

I don't know that storing all the information in the same vectorstore will be possible with all vectorstores. Some of the vectorstores are limited in terms of which operations they support and what kind of data can be stored in them.

Would a document store that keeps track of document content and metadata, and allows basic read, write, and list operations based on metadata, plus a layer that caches intermediate processing results, support this use case? This would allow fetching the relevant data (potentially cached) and indexing it into an arbitrary vectorstore.
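Such a document store (content plus metadata, with read/write/list by metadata and a cache for intermediate transformation results) can be sketched in plain Python. All class and method names below are hypothetical, not an existing LangChain API:

```python
# Hypothetical sketch of a document store with metadata-based listing
# and a cache for intermediate transformation results, keyed by
# (doc_id, transformation name).

class DocumentStore:
    def __init__(self):
        self._docs = {}    # doc_id -> (content, metadata)
        self._cache = {}   # (doc_id, transform_name) -> cached result

    def write(self, doc_id, content, metadata):
        self._docs[doc_id] = (content, dict(metadata))

    def read(self, doc_id):
        return self._docs[doc_id]

    def list(self, **metadata_filter):
        # Return ids of documents whose metadata matches every filter key.
        return [doc_id for doc_id, (_, md) in self._docs.items()
                if all(md.get(k) == v for k, v in metadata_filter.items())]

    def transformed(self, doc_id, name, transform):
        # Return a cached transformation result, computing it only once.
        key = (doc_id, name)
        if key not in self._cache:
            content, _ = self._docs[doc_id]
            self._cache[key] = transform(content)
        return self._cache[key]

store = DocumentStore()
store.write("a", "hello world", {"source": "cv"})
store.write("b", "goodbye", {"source": "web"})
assert store.list(source="cv") == ["a"]
assert store.transformed("a", "upper", str.upper) == "HELLO WORLD"
```

Indexing into an arbitrary vectorstore would then read (potentially cached) transformed versions from this layer rather than recomputing them.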

I think it's possible to enrich LCEL to introduce these new ideas:

Combine transformations with a pipe

We've considered doing this, but we haven't pushed on it yet, some of the issues that needed to be worked out:

  1. Unclear that a document pipeline should leverage LCEL syntax -- would need an obvious benefit (e.g., async support out of the box)
  2. If using LCEL need to figure out the following issues:
    -- The existing loader interface takes everything via __init__, not load, which doesn't play nicely with runnables. We can do tricks like DocumentLoaderClass.as_runnable( init params ) to work around this, but it does add complexity
    -- Needs to handle memory consumption
    -- May need to handle multi-threading and multi-processing optimizations

@eyurtsev (Collaborator) commented Nov 1, 2023

@pprados notebook -- what's the new location?

@pprados (Contributor, Author) commented Nov 3, 2023

@eyurtsev

I strongly urge you to look at the notebook. It demonstrates all the advantages of fully integrating the LangChain features.

RagVectorStore

I don't know that storing all the information in the same vectorstore will be possible with all vectorstores. Some ofthe vectorstores are limited in terms of which operations they support and what kind of data can be stored in them.

Like the ParentDocumentRetriever, I combine a VectorStore and a DocStore. The code is here.

The new version of the notebook is here.

For a complete integration, with all advanced features, I propose to add some objects:

Why is this important?

  • MemoryRecordManager is for tests, demos, or working entirely in memory with index() or aindex()
  • SQLStore is for when you want to merge the VectorStore, the DocStore and the RecordManager into the same database (possibly with support for a single shared transaction); it may be possible to reuse the engine. A first implementation is here. I will propose to accept an optional vectorstore.
  • WrapperVectorStore is to be compatible with SelfQueryRetriever. With this wrapper, it's possible to add the class in _get_builtin_translator(). At the moment I'm using a hack. Then, the RagVectorStore can use other VectorStores with SelfQueryRetriever. Another advantage for developers is that when the VectorStore interface evolves, the wrappers generally remain compatible. Occasionally, it is necessary to review overloaded methods whose signatures have changed.
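The delegation idea behind such a wrapper can be sketched in a few lines of plain Python (hypothetical names; a real WrapperVectorStore would subclass VectorStore and forward its abstract methods explicitly):

```python
# Hypothetical sketch: every call not overridden by the wrapper is
# delegated to the wrapped vectorstore, so the wrapper tends to stay
# compatible when the underlying interface grows.

class WrapperVectorStore:
    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself;
        # delegate them to the wrapped instance.
        return getattr(self._inner, name)

class ToyVectorStore:
    def __init__(self):
        self.items = []
    def add_texts(self, texts):
        self.items.extend(texts)
    def similarity_search(self, query):
        return [t for t in self.items if query in t]

vs = WrapperVectorStore(ToyVectorStore())
vs.add_texts(["alpha beta", "gamma"])    # delegated to ToyVectorStore
assert vs.similarity_search("beta") == ["alpha beta"]
```

Only the methods a specific wrapper overrides (for example, to fan out child versions) would need review when the wrapped interface changes its signatures.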

LCEL with Transformer

In my implementation, I propose an optional approach to using LCEL with transformers.
You can read the unit tests here and the code here.

I propose two ways of making transformers compatible with LCEL.
The first keeps the current protocol (RunnableDocumentTransformer).
The second takes the opportunity to propose a lazy approach to transformations (RunnableGeneratorDocumentTransformer).
It's better for memory, pipelines, etc.

Now, it's possible to create a pipeline of transformers, for example:

    class UpperTransformer(RunnableGeneratorDocumentTransformer):
        def lazy_transform_documents(
                self,
                documents: Iterator[Document],
                **kwargs: Any
        ) -> Iterator[Document]:
            ...

        async def alazy_transform_documents(
                self,
                documents: Union[AsyncIterator[Document],Iterator[Document]],
                **kwargs: Any
        ) -> AsyncIterator[Document]:
            ...

    runnable = (UpperTransformer() | LowerTransformer())
    result = list(runnable.invoke(documents))
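A minimal pure-Python analogue of this lazy pipeline may look as follows (hypothetical, simplified classes; the real implementation would operate on Document objects and build on the Runnable protocol):

```python
# Sketch of lazy, generator-based transformers chained with `|`.
# Each transformer consumes an iterator of "documents" (plain strings
# here, for simplicity) and yields transformed ones one by one.

class LazyTransformer:
    def __or__(self, other):
        return _Pipeline(self, other)

    def invoke(self, documents):
        return self.lazy_transform_documents(iter(documents))

class _Pipeline(LazyTransformer):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def lazy_transform_documents(self, documents):
        # Feed the first stage's output lazily into the second stage,
        # so no intermediate list is ever materialized.
        return self.second.lazy_transform_documents(
            self.first.lazy_transform_documents(documents))

class UpperTransformer(LazyTransformer):
    def lazy_transform_documents(self, documents):
        for doc in documents:
            yield doc.upper()

class ExclaimTransformer(LazyTransformer):
    def lazy_transform_documents(self, documents):
        for doc in documents:
            yield doc + "!"

runnable = UpperTransformer() | ExclaimTransformer()
result = list(runnable.invoke(["hello", "world"]))
assert result == ["HELLO!", "WORLD!"]
```

Because each stage is a generator, documents stream through the whole pipeline one at a time, which is the memory benefit claimed above.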

If you accept my future pull request, all the Transformer subclasses will have to be reviewed.

LCEL with retriever

Maybe it's possible to propose a syntax to chain retrievers:

vs.as_retriever() | EmbeddingsFilter(...) | ContextualCompressionRetriever(...)

Pull-request

At this time, I still have to add some tests. Then I would like to propose one or several pull requests.

But it's very complicated to maintain this for each new version. I've already suffered a lot with the Google Drive and qa_with_reference integrations.

Perhaps you can encourage me in my approach and give me more direct access, in order to complete this proposal as quickly as possible?

@eyurtsev, what would be the most effective communication channel?

@pprados pprados marked this pull request as draft November 4, 2023 07:29
@pprados pprados marked this pull request as ready for review November 10, 2023 16:39
@pprados (Contributor, Author) commented Nov 28, 2023

@eyurtsev,
PR #13910 replaces this.

@pprados pprados closed this Nov 28, 2023
@pprados pprados deleted the pprados/transformer_retriever branch October 7, 2024 07:14