ParentDocumentRetriever need splitter and not transformer #11968
Conversation
Reviewed line: `child_transformer: BaseDocumentTransformer`
@pprados let's change this on the original parent document retriever.
If there is a way to change names using pydantic features without making breaking changes, that's great. If not, that's fine -- better to have a slightly unintuitive parameter name than a separate class with essentially identical functionality.
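If the rename route is taken, pydantic's `Field(alias=...)` is one option; here is a plain-Python sketch of the same backward-compatibility idea (the class and attribute names mirror the discussion, but this is not the real `ParentDocumentRetriever`):

```python
# Hypothetical sketch: store the new attribute name `child_transformer`
# while keeping the legacy name `child_splitter` working as an alias,
# so existing user code does not break.
class RetrieverSketch:
    def __init__(self, child_transformer=None, child_splitter=None):
        # Accept either spelling; `child_splitter` is the legacy name.
        if child_transformer is None:
            child_transformer = child_splitter
        self.child_transformer = child_transformer

    @property
    def child_splitter(self):
        # Legacy reads are forwarded to the new attribute.
        return self.child_transformer

    @child_splitter.setter
    def child_splitter(self, value):
        # Legacy writes update the new attribute too.
        self.child_transformer = value
```

Either construction path ends up populating the same `child_transformer` attribute, so the old and new names stay interchangeable.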
@eyurtsev, I am working on a better implementation.

A question about the usage of parsers. When I use a parser in a prompt, the `predict_and_parse()` method now throws a warning:

> The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.

Does this mean that it is no longer possible to have a parser in a prompt? However, this is a good idea for complex algorithms requiring several prompts, such as map_reduce, refine, etc.

I don't know how it's possible to pass an output parser directly to `LLMChain`. To do this, I would always have to receive an `LLMChain`, and not a prompt, as a parameter. The `map_reduce_chain` receives a `question_prompt`, a `combine_prompt` and a `collapse_prompt`. Will these be deprecated?

And how is it possible to reuse the `LLMChain` with different parsers?
> I am working on a better implementation.

Sounds good -- I would just suggest discussing the implementation / design before putting in the effort, to make sure that the new code gets merged in.

- `LLMChain` is a legacy object. It's not a good idea to build new functionality on top of it. Use LCEL runnables instead.
- `LLMChain` accepts a parser: https://github.com/langchain-ai/langchain/blob/68599d98c20b1e3fbdabaef1b1fbe54cd06b98a4/libs/langchain/langchain/chains/llm.py#L54C1-L54C1. (But again, we don't want to be using `LLMChain` in new code.)
- The `map_reduce_chain` receives a `question_prompt`, `combine_prompt` and `collapse_prompt` -- there are no plans to deprecate them, as users are using them. We may at some point provide new implementations with LCEL.
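To make the "pass the parser as a pipeline step" point concrete, here is a toy model of LCEL-style composition. This is plain Python, not the real `langchain_core` API; `Runnable`, the fake `llm`, and the fake `parser` here are simplified stand-ins:

```python
# Toy model of LCEL composition: each step is a callable, and `|` chains
# them into a pipeline, so an output parser becomes just another step
# after the prompt and the model call.
class Runnable:
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Compose: feed this step's output into the next step.
        return Runnable(lambda value: other.invoke(self.invoke(value)))


prompt = Runnable(lambda q: f"Answer briefly: {q}")
llm = Runnable(str.upper)            # stand-in for a model call
parser = Runnable(lambda s: s.strip("!"))  # stand-in for an output parser

chain = prompt | llm | parser
```

In real LCEL the shape is the same: `prompt | llm | parser`, with the parser decoupled from both the prompt and the chain object.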
Hello @eyurtsev. My goal is to be able to combine all the advanced features of langchain. That's impossible at the moment. I want to be able to:

My idea is to offer a wrapper around the vectorstore (VS). This wrapper can be used in place of any VS. The code is not published yet, but I will propose it.

I think it's possible to enrich LCEL to introduce these new ideas:

What are your thoughts on this?
I think `BaseDocumentTransformer` should be a `RunnableSerializable[Sequence[Document], Sequence[Document]]` to allow a better syntax. But then all subclasses (`TextSplitter`, etc.) must use Pydantic (I have tried it). I've started to implement this.
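As a langchain-free illustration of that idea, here is a minimal sketch of document transformers composing lazily with `|`. The class names mirror the `UpperTransformer` / `LowerTransformer` example later in this thread; everything else (documents as plain strings, the `_Composed` helper) is hypothetical:

```python
from typing import Iterator


class LazyTransformer:
    """Base class: transforms a stream of documents lazily."""

    def lazy_transform(self, docs: Iterator[str]) -> Iterator[str]:
        raise NotImplementedError

    def __or__(self, other: "LazyTransformer") -> "LazyTransformer":
        # `a | b` feeds a's output stream into b, still lazily.
        return _Composed(self, other)


class _Composed(LazyTransformer):
    def __init__(self, first: LazyTransformer, second: LazyTransformer):
        self.first, self.second = first, second

    def lazy_transform(self, docs: Iterator[str]) -> Iterator[str]:
        return self.second.lazy_transform(self.first.lazy_transform(docs))


class UpperTransformer(LazyTransformer):
    def lazy_transform(self, docs: Iterator[str]) -> Iterator[str]:
        for doc in docs:
            yield doc.upper()


class LowerTransformer(LazyTransformer):
    def lazy_transform(self, docs: Iterator[str]) -> Iterator[str]:
        for doc in docs:
            yield doc.lower()


runnable = UpperTransformer() | LowerTransformer()
result = list(runnable.lazy_transform(iter(["Hello", "World"])))
```

Nothing is materialized until the final `list(...)`, which is the point of the generator-based design for large document sets.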
I don't know that storing all the information in the same vectorstore will be possible with all vectorstores. Some of the vectorstores are limited in terms of which operations they support and what kind of data can be stored in them. Would a document store that keeps track of document content and metadata, allows basic read, write, and list operations based on metadata, and adds a layer that caches intermediate processing results, support this use case? This would allow fetching the relevant data (potentially cached) and indexing it into an arbitrary vectorstore.
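A minimal sketch of the document store described above (all names here are hypothetical): a dict-backed store with get/put plus listing by metadata, which an arbitrary vectorstore indexer could then consume:

```python
# Hypothetical in-memory document store: content + metadata, with
# read / write / list-by-metadata, independent of any vectorstore.
class InMemoryDocStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (content, metadata)

    def put(self, doc_id, content, metadata=None):
        self._docs[doc_id] = (content, dict(metadata or {}))

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def list_ids(self, **metadata_filter):
        # Ids whose metadata matches every key/value in the filter.
        return [
            doc_id
            for doc_id, (_, meta) in self._docs.items()
            if all(meta.get(k) == v for k, v in metadata_filter.items())
        ]


store = InMemoryDocStore()
store.put("parent-1", "full parent text", {"kind": "parent"})
store.put("child-1", "child fragment", {"kind": "child", "parent": "parent-1"})
```

A caching layer for intermediate results could sit in front of `get`/`put` using the same interface, keyed by a hash of the source document and the transformation applied.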
We've considered doing this, but we haven't pushed on it yet, some of the issues that needed to be worked out:
I strongly urge you to look at the notebook. It completely demonstrates all the advantages of full integration of langchain features.

RagVectorStore

Like the ParentDocumentRetriever, I combine a … The new version of the notebook is here. For a complete integration, with all the advanced features, I propose to add some objects:

Why is this important?

LCEL with Transformer

In my implementation, I propose an optional approach to use LCEL with Transformer.
```python
from typing import Any, AsyncIterator, Iterator, Union

from langchain.schema import Document

# RunnableGeneratorDocumentTransformer is the proposed base class
# (not yet part of langchain).
class UpperTransformer(RunnableGeneratorDocumentTransformer):
    def lazy_transform_documents(
        self,
        documents: Iterator[Document],
        **kwargs: Any,
    ) -> Iterator[Document]:
        ...

    async def alazy_transform_documents(
        self,
        documents: Union[AsyncIterator[Document], Iterator[Document]],
        **kwargs: Any,
    ) -> AsyncIterator[Document]:
        ...


runnable = UpperTransformer() | LowerTransformer()
result = list(runnable.invoke(documents))
```

If you accept my future pull-request, you will have to review all the Transformer subclasses.

LCEL with retriever

Maybe it's possible to propose a syntax to chain the retrievers:

```python
vs.as_retriever() | EmbeddingsFilter(...) | ContextualCompressionRetriever(...)
```

Pull-request

At this time, I must add some tests. Then I would like to propose one or multiple pull-requests:
But it's very complicated to maintain this for each new version. I've already suffered a lot with the Google Drive and qa_with_reference integrations. Perhaps you can encourage me in my approach and allow me more direct access, in order to complete this proposal as quickly as possible? @eyurtsev, what would be the most effective communication channel?
Description:

The current implementation of `ParentDocumentRetriever` needs two splitters: `child_splitter` and `parent_splitter`. But a splitter is a kind of `BaseDocumentTransformer`.

This is a limitation. If you want to apply transformations to the documents (for each fragment, generate questions, generate a summary, etc.), it is not possible.

I propose a new version of `ParentDocumentRetriever` (in the file `parent_document_retriever_v2.py` for the moment). This version expects `child_transformer` and `child_parent`. It's possible to use it with splitters. I had to change the names of the attributes because they are now transformers. Maybe it's possible to declare `child_splitter` and `parent_splitter` aliases to stay compatible with the previous version. I can do it if you accept the principle of my pull-request.

The idea behind this is to improve RAGs, by increasing the versions of embeddings for each fragment.
With another pull-request, it may be possible to write:
New version of parent_document_retriever to use "transformer" in place of "splitter"
Tag maintainer:
@baskaryan