core[minor]: New integration of LCEL in langchain 0.1 #16437
base: master
class CopyDocumentTransformer(RunnableGeneratorDocumentTransformer):
For the moment, I've placed this class here, so as not to create a langchain_core.document_transformers (as in langchain). This can be changed.
@hwchase17
Description:
(note, it's a new version compatible with langchain 0.1)
Langchain introduces the `BaseDocumentTransformer` interface for document transformation, featuring two abstract methods:
- `def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]`
- `async def atransform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]`
However, this interface lacks compatibility with LCEL.
Document transformation can vary in speed, from quick processes like text splitting to more extensive processes such as generating document summaries, translation via `DoctranTextTranslator`, and creating questions. Introducing an asynchronous lazy approach for transformations would be advantageous, facilitating use in streams and reducing memory consumption.
Furthermore, making all this compatible with LCEL presents an opportunity. The proposed solution involves an incremental approach. The initial step is to introduce a new, asynchronous, and lazy LCEL-compatible interface. Subsequently, all existing transformations (21 classes) would migrate to this new interface.
The `RunnableGeneratorDocumentTransformer` interface, derived from `RunnableSerializable` and `BaseDocumentTransformer`, offers two new methods:
- `lazy_transform_documents()`
- `alazy_transform_documents()`
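As a rough illustration of how such an interface could be shaped, the sketch below uses only the standard library. `Document` and `LazyDocumentTransformer` are hypothetical stand-ins, not the classes from this PR or from langchain; the point is only that the eager methods can be thin wrappers around the lazy generators.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Dict, Iterable, Iterator, List, Sequence


@dataclass
class Document:
    """Simplified stand-in for langchain's Document (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class LazyDocumentTransformer(ABC):
    """Hypothetical base class: subclasses only write the lazy generator."""

    @abstractmethod
    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        """Yield transformed documents one at a time."""

    async def alazy_transform_documents(
        self, documents: Iterable[Document], **kwargs: Any
    ) -> AsyncIterator[Document]:
        # Naive default: reuse the sync generator. An I/O-bound transformer
        # would override this and await its remote calls instead.
        for doc in self.lazy_transform_documents(iter(documents), **kwargs):
            yield doc
            await asyncio.sleep(0)  # hand control back to the event loop

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Eager method implemented on top of the lazy one.
        return list(self.lazy_transform_documents(iter(documents), **kwargs))

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        return [d async for d in self.alazy_transform_documents(documents, **kwargs)]


class UpperCaseTransformer(LazyDocumentTransformer):
    """Toy transformer used only to exercise the base class."""

    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        for doc in documents:
            yield Document(doc.page_content.upper(), dict(doc.metadata))


docs: List[Document] = [Document("hello"), Document("world")]
eager = UpperCaseTransformer().transform_documents(docs)
eager_async = asyncio.run(UpperCaseTransformer().atransform_documents(docs))
```

With this shape, a subclass that only implements `lazy_transform_documents()` still satisfies callers of the two existing eager methods.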
This class aims to replace `BaseDocumentTransformer`.
These methods are generators, possibly asynchronous (`AsyncIterator[Document]`). The default implementations of `transform_documents()` and `atransform_documents()` use the lazy versions internally to maintain compatibility with streams.
A new version of `TextSplitter` and `CharacterTextSplitter` is proposed to demonstrate modifying the current transformer code to accommodate the new interface. (See `langchain/unit_test/test_new_text_splitter.py`).
The update enables a series of transformations:
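A minimal sketch of what chaining lazy transformations with `|` can look like, using plain Python generators rather than the PR's classes. `LazyStep`, `split_words` and `shout` are hypothetical names introduced for illustration; the `trace` list shows that each item flows through the whole chain before the next one is produced.

```python
from typing import Callable, Iterable, Iterator, List

Step = Callable[[Iterable[str]], Iterator[str]]


class LazyStep:
    """Hypothetical wrapper that gives a generator function a `|` operator."""

    def __init__(self, func: Step) -> None:
        self.func = func

    def __or__(self, other: "LazyStep") -> "LazyStep":
        # Compose the generators: nothing runs until the result is iterated.
        return LazyStep(lambda docs: other.func(self.func(docs)))

    def stream(self, docs: Iterable[str]) -> Iterator[str]:
        return self.func(docs)


trace: List[str] = []  # records the order in which the steps do their work


def split_words(docs: Iterable[str]) -> Iterator[str]:
    for doc in docs:
        for word in doc.split():
            trace.append(f"split:{word}")
            yield word


def shout(docs: Iterable[str]) -> Iterator[str]:
    for doc in docs:
        trace.append(f"shout:{doc}")
        yield doc.upper()


pipeline = LazyStep(split_words) | LazyStep(shout)
result = list(pipeline.stream(["lazy chains", "stream documents"]))
```

The trace alternates between `split:` and `shout:` entries, which is the property that keeps only one item in flight at a time.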
The addition of the `+` operator allows multiple transformations for the same inputs:
Operators can be combined:
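The following sketch shows one possible reading of `+` and of combining it with `|`, again with plain generators. `LazyTransformer` and the toy functions are hypothetical, and the output order (all variations of one input before moving to the next input) is an assumption of this sketch, not necessarily the order the PR produces.

```python
from typing import Callable, Iterable, Iterator

Step = Callable[[Iterable[str]], Iterator[str]]


class LazyTransformer:
    """Hypothetical wrapper supporting `|` (chain) and `+` (fan out)."""

    def __init__(self, func: Step) -> None:
        self.func = func

    def __or__(self, other: "LazyTransformer") -> "LazyTransformer":
        return LazyTransformer(lambda docs: other.func(self.func(docs)))

    def __add__(self, other: "LazyTransformer") -> "LazyTransformer":
        def fan_out(docs: Iterable[str]) -> Iterator[str]:
            # Apply both transformations to each input, one input at a time,
            # so the input stream never has to be buffered.
            for doc in docs:
                yield from self.func([doc])
                yield from other.func([doc])

        return LazyTransformer(fan_out)

    def stream(self, docs: Iterable[str]) -> Iterator[str]:
        return self.func(docs)


def upper(docs: Iterable[str]) -> Iterator[str]:
    for doc in docs:
        yield doc.upper()


def reverse(docs: Iterable[str]) -> Iterator[str]:
    for doc in docs:
        yield doc[::-1]


def split_words(docs: Iterable[str]) -> Iterator[str]:
    for doc in docs:
        yield from doc.split()


# `+`: two transformations of the same inputs.
both = LazyTransformer(upper) + LazyTransformer(reverse)
fan_out_result = list(both.stream(["ab", "cd"]))

# Combined operators: split first, then fan out over each fragment.
combined = LazyTransformer(split_words) | both
combined_result = list(combined.stream(["ab cd"]))
```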
Currently, only the unit tests provide examples of transformations.
If this proposal is accepted, the plan is to refactor all Transformer implementations to make them compatible with a lazy, asynchronous, and LCEL-compatible approach. This will be undertaken post-integration into the master branch. Refer to the tests `test_lcel_transform_documents()` and `test_alcel_transform_documents()`.
Why is the lazy approach important?
Consider an API that receives a URL from a client, fetches all the documents behind it, and imports them into a vector store. With three transformations applied to each fragment (via `ParentDocumentRetriever`), the current code consumes excessive memory. A lazy approach significantly reduces memory requirements, making it feasible to serve multiple users simultaneously without a memory-intensive instance.
To improve the proximity of each piece, the code applies three transformations to each fragment (via `ParentDocumentRetriever`):
- a summary,
- the generation of 3 questions,
- and translation into French.
With the current langchain API, the code will:
The same scenario with a lazy approach translates as follows:
The maximum memory footprint is: 1 document, 1 fragment, 1 variation.
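The difference can be illustrated with a toy model of the scenario, assuming placeholder strings in place of real documents and LLM calls. All function names here are hypothetical. The eager version materializes every stage as a list, while the lazy version has loaded a single document by the time the first variation is ready to be stored.

```python
from typing import Iterable, Iterator, List, Tuple


def load_documents(n: int, loaded: List[int]) -> Iterator[str]:
    """Pretend to fetch n documents, recording each fetch in `loaded`."""
    for i in range(n):
        loaded.append(i)
        yield f"doc{i} part-a part-b"


def split(docs: Iterable[str]) -> Iterator[str]:
    """Split each document into two fragments."""
    for doc in docs:
        name, *parts = doc.split()
        for part in parts:
            yield f"{name}/{part}"


def variations(fragments: Iterable[str]) -> Iterator[str]:
    """Three placeholder variations per fragment (no real LLM calls)."""
    for fragment in fragments:
        yield f"summary({fragment})"
        yield f"questions({fragment})"
        yield f"french({fragment})"


def eager_import(n: int) -> Tuple[int, int]:
    loaded: List[int] = []
    documents = list(load_documents(n, loaded))
    fragments = list(split(documents))
    variants = list(variations(fragments))
    # Every intermediate list is still alive at this point.
    held_at_once = len(documents) + len(fragments) + len(variants)
    return held_at_once, len(variants)


def lazy_import(n: int) -> Tuple[int, int]:
    loaded: List[int] = []
    stream = variations(split(load_documents(n, loaded)))
    next(stream)  # first variation is ready to be stored
    loaded_at_first_result = len(loaded)
    total = 1 + sum(1 for _ in stream)  # consume the rest one by one
    return loaded_at_first_result, total


eager_held, eager_total = eager_import(100)
lazy_loaded, lazy_total = lazy_import(100)
```

Both versions produce the same 600 variations; the eager one holds 900 strings at its peak, whereas the lazy one needs only the current document, fragment and variation.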
Transformers benefiting from the async lazy approach:
- `DoctranQATransformer`
- `DoctranTextTranslator`
- `OpenAIMetadataTagger`
- `DoctranPropertyExtractor`
- `NucliaTextTransformer`
- `GoogleTranslateTransformer`
Transformers less suited at this stage are those primarily working in memory:
- `LongContextReorder`
- `EmbeddingsRedundantFilter`
- `EmbeddingsClusteringFilter`
- `Html2TextTransformer`
- `TextSplitter`
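One reason some in-memory transformers gain little from laziness is that they need the whole input before they can emit anything. The sketch below contrasts a per-item step with a simplified reordering step that places the first and last-ranked items at the two ends; it is an illustration only, not the library's `LongContextReorder` code, and all names are hypothetical.

```python
from typing import Iterable, Iterator, List


def counting_source(items: List[str], pulled: List[str]) -> Iterator[str]:
    """Yield items while recording how many have been pulled so far."""
    for item in items:
        pulled.append(item)
        yield item


def lazy_upper(docs: Iterable[str]) -> Iterator[str]:
    # Per-item work: can emit after reading a single input.
    for doc in docs:
        yield doc.upper()


def lazy_reorder(docs: Iterable[str]) -> Iterator[str]:
    # Reordering depends on every item, so the input must be buffered first.
    buffered = list(docs)
    result: List[str] = []
    for i, doc in enumerate(reversed(buffered)):
        if i % 2 == 1:
            result.append(doc)
        else:
            result.insert(0, doc)
    yield from result


pulled_by_upper: List[str] = []
first_upper = next(lazy_upper(counting_source(list("abcde"), pulled_by_upper)))

pulled_by_reorder: List[str] = []
reorder_iter = lazy_reorder(counting_source(list("abcde"), pulled_by_reorder))
first_reordered = next(reorder_iter)
```

After the first `next()` call, `pulled_by_upper` holds one item while `pulled_by_reorder` already holds all five, so wrapping such a step in a generator does not reduce its memory footprint.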
Preliminary Step for Another Pull-Request:
This proposal serves as a preliminary step for another pull-request, offering a better implementation of `ParentDocumentRetriever`. Refer to langchain-rag.
Issue:
- `BaseDocumentTransformer` is incompatible with a stream.
- `BaseDocumentTransformer` is not LCEL-compatible.
Dependencies:
No dependencies.
Tag maintainer:
@baskaryan
Twitter handle:
Twitter ID: pprados