
ParentDocumentRetriever need splitter and not transformer #11968

Closed

Conversation

@pprados (Contributor) commented Oct 18, 2023

Description:
The current implementation of ParentDocumentRetriever requires two splitters: child_splitter and parent_splitter.

But a splitter is just one kind of BaseDocumentTransformer.

This is a limitation. If you want to apply other transformations to the documents (for each fragment, generate questions, generate a summary, etc.), it is not possible.

I propose a new version of ParentDocumentRetriever (in the file parent_document_retriever_v2.py for the moment). This version expects a child_transformer and a parent_transformer. It can still be used with splitters.

I had to rename the attributes, because they now hold transformers. It may be possible to declare child_splitter and parent_splitter as aliases, to stay compatible with the previous version. I can do this if you accept the principle of my pull request.

The idea behind this is to improve RAG by increasing the number of embedded versions of each fragment.
Combined with another pull request, it may become possible to write:

vs = ParentDocumentVectorStore(
    vectorstore=vs,
    docstore=self.docstore,
    id_key=CV_ID_KEY,
    parent_transformer=RecursiveCharacterTextSplitter(),
    child_transformer=DocumentTransformerPipeline(
        transformers=[
            GenerateQuestions.from_llm(llm),
            SummarizeTransformer.from_llm(llm)
        ]
    ),
)
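The intent (each parent fragment fans out into several transformed child versions, each keeping a reference back to its parent so the original fragment can be recovered at retrieval time) can be sketched in plain Python. All names below are hypothetical and not part of the actual LangChain API:

```python
import uuid

# Hypothetical sketch: fan a parent fragment out into several child
# versions, each carrying its parent's id in its metadata.

def index_with_transformer(parent_text, child_transformer,
                           docstore, vectorstore, id_key="doc_id"):
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = parent_text          # keep the original fragment
    for version in child_transformer(parent_text):
        # Each transformed version is embedded/stored separately,
        # but points back to its parent via id_key.
        vectorstore.append({"text": version, "metadata": {id_key: parent_id}})
    return parent_id

# A toy "transformer" producing two versions per fragment:
def questions_and_summary(text):
    yield f"Question about: {text}"
    yield f"Summary of: {text}"

docstore, vectorstore = {}, []
pid = index_with_transformer("Paris is the capital of France.",
                             questions_and_summary, docstore, vectorstore)

# Retrieval recovers the parent fragment via the stored id:
hit = vectorstore[0]
assert docstore[hit["metadata"]["doc_id"]] == "Paris is the capital of France."
```

A real implementation would of course embed each version and store it in a true VectorStore; the sketch only shows the parent/child bookkeeping.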

New version of parent_document_retriever to use "transformer" in place of "splitter"

Tag maintainer:
@baskaryan


@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 18, 2023
child_transformer: BaseDocumentTransformer
Collaborator

@pprados let's change this on the original parent document retriever.

If there is a way to change names using pydantic features without making breaking changes, that's great. If not, that's fine -- better to have a slightly unintuitive parameter name than a separate class with essentially identical functionality.

@pprados (Contributor, Author) commented Oct 19, 2023

@eyurtsev, I am working on a better implementation.

A question about the usage of parsers.
When I use a parser in a prompt, the method predict_and_parse() now throws a warning:
"The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain."

Does this mean that it is no longer possible to have a parser in a prompt?
However, this would be a good idea for complex algorithms requiring several prompts, such as map_reduce, refine, etc.
I don't know how it's possible to pass an output parser directly to LLMChain. To do that, I would always have to receive an LLMChain, and not a prompt, as a parameter.

The map_reduce_chain receives a question_prompt, combine_prompt and collapse_prompt. Will they be deprecated?

And how is it possible to reuse the same LLMChain with different parsers?

Collaborator

I am working on a better implementation.

Sounds good -- I would just suggest discussing implementation / design before putting in the effort, to make sure that the new code gets merged in.


  1. LLMChain is a legacy object. It's not a good idea to build new functionality on top of it. Use LCEL runnables instead.

  2. LLMChain accepts a parser: https://github.com/langchain-ai/langchain/blob/68599d98c20b1e3fbdabaef1b1fbe54cd06b98a4/libs/langchain/langchain/chains/llm.py#L54C1-L54C1. (But again we don't want to be using LLMChain in new code)

  3. The map_reduce_chain receives a question_prompt, combine_prompt and collapse_prompt -- no plans to deprecate them, as users are using them. We may at some point provide new implementations with LCEL.

@pprados (Contributor, Author) commented Oct 23, 2023

Hello @eyurtsev

My goal is to be able to combine all the advanced features of langchain. That's impossible at the moment.

I want to be able to:

  • cut a document into fragments
  • for each fragment, transform it into several versions
  • persist each version in a vectorstore
  • when retrieving a version, recover the original fragment
  • manage the lifecycle of all fragments and associated versions
  • when using the vectorstore,
    • combine SelfQueryRetriever, MultiQueryRetriever, ParentDocumentRetriever, etc.
    • combine filters
  • manage the lifecycle of fragments and their associated transformations with index()

It's that simple.

My idea is to offer a wrapper around the VectorStore. This wrapper can be used in place of any VectorStore.
Then, if I extend the SelfQueryRetriever, it's possible to use an equivalent of ParentDocumentRetriever, etc.

The code is not published yet, but you can read a notebook with the ideas (see the chapter "Short version").

I will propose:

  • a solution to generate multiple versions of each fragment (DocumentTransformerPipeline; the name will be changed)
    • generate 3 questions for the fragment
    • summarize the fragment
    • etc.
  • a VectorStoreWrapper (to be compatible with SelfQueryRetriever)
  • an implementation of VectorStoreWrapper to manage multiple transformations of a fragment: ParentVectorStore

I think it's possible to enrich LCEL to introduce these new ideas:

  • Combine transformations with a pipe
  • etc.

What are your thoughts on this?

@pprados (Contributor, Author) commented Oct 23, 2023

I think the BaseDocumentTransformer should become a RunnableSerializable[Sequence[Document], Sequence[Document]], to allow a better syntax.

But then all subclasses (TextSplitter, etc.) must use Pydantic (I am trying this).
I think it's also the opportunity to add a lazy_transform_documents() that yields the transformations lazily.

I've started to implement this.

@eyurtsev eyurtsev self-assigned this Oct 29, 2023
@eyurtsev (Collaborator) commented Nov 1, 2023

  • cut a document into fragments
  • for each fragment, transform them into several versions
  • persist each version in a vs
  • when recovering a version, recover the original fragment
  • manage the lifecycle of all fragments and associated versions
  • when using VS,
    • combine SelfQueryRetriever, MultiQueryRetriever, ParentDocumentRetriever, etc.
    • combine filters
  • manage the life cycle of fragment and associated transformation with index()

It's that simple.

I don't know that storing all the information in the same vectorstore will be possible with all vectorstores. Some of the vectorstores are limited in terms of which operations they support and what kind of data can be stored in them.

Would a document store that keeps track of document content and metadata, and allows basic read, write, and list operations based on metadata, plus a layer that caches intermediate processing results, support this use case? This would allow fetching the relevant data (potentially cached) and indexing it into an arbitrary vectorstore.
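Such a document store (content plus metadata, with read/write/list by metadata and a cache for intermediate transformation results) can be sketched in plain Python. All class and method names below are hypothetical, not an existing LangChain API:

```python
# Hypothetical sketch of a document store with metadata-based listing
# and a cache for intermediate transformation results, keyed by
# (doc_id, transformation name).

class DocumentStore:
    def __init__(self):
        self._docs = {}    # doc_id -> (content, metadata)
        self._cache = {}   # (doc_id, transform_name) -> cached result

    def write(self, doc_id, content, metadata):
        self._docs[doc_id] = (content, dict(metadata))

    def read(self, doc_id):
        return self._docs[doc_id]

    def list(self, **metadata_filter):
        # Return ids of documents whose metadata matches every filter key.
        return [doc_id for doc_id, (_, md) in self._docs.items()
                if all(md.get(k) == v for k, v in metadata_filter.items())]

    def transformed(self, doc_id, name, transform):
        # Return a cached transformation result, computing it only once.
        key = (doc_id, name)
        if key not in self._cache:
            content, _ = self._docs[doc_id]
            self._cache[key] = transform(content)
        return self._cache[key]

store = DocumentStore()
store.write("a", "hello world", {"source": "cv"})
store.write("b", "goodbye", {"source": "web"})
assert store.list(source="cv") == ["a"]
assert store.transformed("a", "upper", str.upper) == "HELLO WORLD"
```

Indexing into an arbitrary vectorstore would then read (potentially cached) transformed versions from this layer rather than recomputing them.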

I think it's possible to enrich LCEL to introduce these new ideas:

Combine transformations with a pipe

We've considered doing this, but we haven't pushed on it yet, some of the issues that needed to be worked out:

  1. Unclear that a document pipeline should leverage LCEL syntax -- would need an obvious benefit (e.g., async support out of the box)
  2. If using LCEL need to figure out the following issues:
    -- The existing loader interface takes everything via __init__, not load, which doesn't play nicely with runnables. We can do tricks like DocumentLoaderClass.as_runnable( init params ) to work around this, but it does add complexity
    -- Needs to handle memory consumption
    -- May need to handle multi-threading and multi-processing optimizations

@eyurtsev (Collaborator) commented Nov 1, 2023

@pprados notebook -- what's the new location?

@pprados (Contributor, Author) commented Nov 3, 2023

@eyurtsev

I strongly urge you to look at the notebook. It demonstrates all the advantages of fully integrating the LangChain features.

RagVectorStore

I don't know that storing all the information in the same vectorstore will be possible with all vectorstores. Some ofthe vectorstores are limited in terms of which operations they support and what kind of data can be stored in them.

Like the ParentDocumentRetriever, I combine a VectorStore and a DocStore. The code is here.

The new version of the notebook is here.

For a complete integration, with all advanced features, I propose to add some objects:

Why is this important?

  • MemoryRecordManager is for tests, demos, or working entirely in memory with index() or aindex()
  • SQLStore is for when you want to merge the VectorStore, the DocStore and the RecordManager into the same database (possibly with support for a single shared transaction); it may be possible to reuse the engine. A first implementation is here. I will propose to accept an optional vectorstore.
  • WrapperVectorStore is to be compatible with SelfQueryRetriever. With this wrapper, it's possible to add the class in _get_builtin_translator(). At the moment I'm using a hack. Then, the RagVectorStore can use other VectorStores with SelfQueryRetriever. Another advantage for developers is that when the VectorStore interface evolves, the wrappers generally remain compatible. Occasionally, it is necessary to review overloaded methods whose signatures have changed.
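The delegation idea behind such a wrapper can be sketched in a few lines of plain Python (hypothetical names; a real WrapperVectorStore would subclass VectorStore and forward its abstract methods explicitly):

```python
# Hypothetical sketch: every call not overridden by the wrapper is
# delegated to the wrapped vectorstore, so the wrapper tends to stay
# compatible when the underlying interface grows.

class WrapperVectorStore:
    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself;
        # delegate them to the wrapped instance.
        return getattr(self._inner, name)

class ToyVectorStore:
    def __init__(self):
        self.items = []
    def add_texts(self, texts):
        self.items.extend(texts)
    def similarity_search(self, query):
        return [t for t in self.items if query in t]

vs = WrapperVectorStore(ToyVectorStore())
vs.add_texts(["alpha beta", "gamma"])    # delegated to ToyVectorStore
assert vs.similarity_search("beta") == ["alpha beta"]
```

Only the methods a specific wrapper overrides (for example, to fan out child versions) would need review when the wrapped interface changes its signatures.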

LCEL with Transformer

In my implementation, I propose an optional approach to using LCEL with transformers.
You can read the unit tests here and the code here.

I propose two ways of making transformers compatible with LCEL.
The first keeps the current protocol (RunnableDocumentTransformer).
The second takes the opportunity to propose a lazy approach to transformations (RunnableGeneratorDocumentTransformer).
It's better for memory, pipelines, etc.

Now, it's possible to create a pipeline of transformers, for example:

    class UpperTransformer(RunnableGeneratorDocumentTransformer):
        def lazy_transform_documents(
                self,
                documents: Iterator[Document],
                **kwargs: Any
        ) -> Iterator[Document]:
            ...

        async def alazy_transform_documents(
                self,
                documents: Union[AsyncIterator[Document],Iterator[Document]],
                **kwargs: Any
        ) -> AsyncIterator[Document]:
            ...

    runnable = (UpperTransformer() | LowerTransformer())
    result = list(runnable.invoke(documents))
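A minimal pure-Python analogue of this lazy pipeline may look as follows (hypothetical, simplified classes; the real implementation would operate on Document objects and build on the Runnable protocol):

```python
# Sketch of lazy, generator-based transformers chained with `|`.
# Each transformer consumes an iterator of "documents" (plain strings
# here, for simplicity) and yields transformed ones one by one.

class LazyTransformer:
    def __or__(self, other):
        return _Pipeline(self, other)

    def invoke(self, documents):
        return self.lazy_transform_documents(iter(documents))

class _Pipeline(LazyTransformer):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def lazy_transform_documents(self, documents):
        # Feed the first stage's output lazily into the second stage,
        # so no intermediate list is ever materialized.
        return self.second.lazy_transform_documents(
            self.first.lazy_transform_documents(documents))

class UpperTransformer(LazyTransformer):
    def lazy_transform_documents(self, documents):
        for doc in documents:
            yield doc.upper()

class ExclaimTransformer(LazyTransformer):
    def lazy_transform_documents(self, documents):
        for doc in documents:
            yield doc + "!"

runnable = UpperTransformer() | ExclaimTransformer()
result = list(runnable.invoke(["hello", "world"]))
assert result == ["HELLO!", "WORLD!"]
```

Because each stage is a generator, documents stream through the whole pipeline one at a time, which is the memory benefit claimed above.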

If you accept my future pull request, all the Transformer subclasses will have to be reviewed.

LCEL with retriever

Maybe it's possible to propose a syntax to chain retrievers:

vs.as_retriever() | EmbeddingsFilter(...) | ContextualCompressionRetriever(...)

Pull-request

At this time, I still have to add some tests. Then I would like to propose one or several pull requests.

But it's very complicated to maintain this for each new version. I've already suffered a lot with the Google Drive and qa_with_reference integrations.

Perhaps you can encourage me in my approach and give me more direct access, in order to complete this proposal as quickly as possible?

@eyurtsev, what would be the most effective communication channel?

@pprados pprados marked this pull request as draft November 4, 2023 07:29
@pprados pprados marked this pull request as ready for review November 10, 2023 16:39
@pprados (Contributor, Author) commented Nov 28, 2023

@eyurtsev,
PR #13910 replaces this.

@pprados pprados closed this Nov 28, 2023
@pprados pprados deleted the pprados/transformer_retriever branch October 7, 2024 07:14