
core[minor]: New integration of LCEL in langchain 0.1 #16437

Open
wants to merge 16 commits into base: master

Conversation

@pprados pprados (Contributor) commented Jan 23, 2024

Description:

(Note: this is a new version compatible with langchain 0.1.)

LangChain defines the BaseDocumentTransformer interface for document transformation, with two abstract methods:

  • def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]
  • async def atransform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

However, this interface lacks compatibility with LCEL.

Document transformation can vary in speed, from quick processes like text splitting to more extensive processes such as generating document summaries, translation via DoctranTextTranslator, and creating questions. Introducing an asynchronous lazy approach for transformations would be advantageous, facilitating use in streams and reducing memory consumption.

Furthermore, making all this compatible with LCEL presents an opportunity. The proposed solution involves an incremental approach. The initial step is to introduce a new, asynchronous, and lazy LCEL-compatible interface. Subsequently, all existing transformations (21 classes) would migrate to this new interface.

The RunnableGeneratorDocumentTransformer interface, derived from RunnableSerializable and BaseDocumentTransformer, offers two new methods:

  • lazy_transform_documents()
  • alazy_transform_documents()

This class aims to replace BaseDocumentTransformer.

These methods are generators, possibly asynchronous (AsyncIterator[Document]). The default implementations of transform_documents() and atransform_documents() use the lazy versions internally to maintain compatibility with streams.
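
As a rough illustration, the new base class could look like the following sketch. This is not the PR's actual code; the import paths, generic parameters, and default method bodies are assumptions.

```python
from typing import Any, AsyncIterator, Iterator, Optional, Sequence

from langchain_core.documents import BaseDocumentTransformer, Document
from langchain_core.runnables import RunnableConfig, RunnableSerializable


class RunnableGeneratorDocumentTransformer(
    RunnableSerializable[Sequence[Document], Sequence[Document]],
    BaseDocumentTransformer,
):
    """Sketch of the proposed LCEL-compatible, lazy transformer base class."""

    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        """Subclasses yield transformed documents one at a time."""
        raise NotImplementedError

    async def alazy_transform_documents(
        self, documents: AsyncIterator[Document], **kwargs: Any
    ) -> AsyncIterator[Document]:
        """Async counterpart, yielding transformed documents lazily."""
        raise NotImplementedError

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Default eager implementation delegates to the lazy generator,
        # so existing callers of BaseDocumentTransformer keep working.
        return list(self.lazy_transform_documents(iter(documents), **kwargs))

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Simplified default for the sketch: reuse the synchronous eager path.
        return self.transform_documents(documents, **kwargs)

    def invoke(
        self,
        input: Sequence[Document],
        config: Optional[RunnableConfig] = None,
        **kwargs: Any,
    ) -> Sequence[Document]:
        # Minimal LCEL entry point: documents in, transformed documents out.
        return self.transform_documents(input, **kwargs)
```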

A new version of TextSplitter and CharacterTextSplitter is proposed to demonstrate modifying the current transformer code to accommodate the new interface. (See langchain/unit_test/test_new_text_splitter.py).

The update enables chaining a series of transformations:

```python
runnable = (CharacterTextSplitter(...) | TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

The addition of the + operator allows applying multiple transformations to the same inputs:

```python
runnable = (CharacterTextSplitter(...) + TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

Operators can be combined:

```python
runnable = (UpperTransformer(...) + CharacterTextSplitter(...) |
            TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

Currently, only the unit tests provide examples of transformations.
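
For illustration, a minimal lazy transformer such as the UpperTransformer referenced above could be written against the base class sketched earlier. This is an assumed implementation, not code from this PR:

```python
from typing import Any, Iterator

from langchain_core.documents import Document


class UpperTransformer(RunnableGeneratorDocumentTransformer):
    """Toy transformer that upper-cases each document's content, one document at a time."""

    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        for doc in documents:
            yield Document(page_content=doc.page_content.upper(), metadata=doc.metadata)
```

Because the work happens in a generator, chaining it with a splitter never requires holding the full corpus in memory at once.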

If this proposal is accepted, the plan is to refactor all Transformer implementations to make them compatible with a lazy, asynchronous, and LCEL-compatible approach. This will be undertaken post-integration into the master branch. Refer to the tests test_lcel_transform_documents() and test_alcel_transform_documents().

Why is the lazy approach important?

Consider an API that receives a URL from a client and is responsible for fetching all the documents behind it and importing them into a vector store. To improve the retrieval proximity of each piece, the code applies three transformations to each fragment (via ParentDocumentRetriever):
- a summary,
- the generation of 3 questions,
- and a translation into French.

With the current code, this consumes excessive memory. A lazy approach significantly reduces the memory requirements, making it feasible to serve multiple users simultaneously without a memory-hungry instance.

With the current LangChain API, the code will:

  • retrieve 10 PDF files (200 pages in total),
  • split them into 10-page fragments (i.e. 200 + 20*10 = 400 pages in memory),
  • produce 5 versions of each fragment (i.e. 200 + (20*10)*4 = 1000 pages in memory),
  • then save the fragments in the vector store (1000 pages plus 1000 texts to import).

The same scenario with a lazy approach translates as follows:

  • Retrieve the files lazily, one at a time (Loader.lazy_load(); 200 pages in memory)
  • Split each file into fragments lazily (200 pages + 1 fragment in memory)
  • For each fragment, produce the 5 versions lazily (200 pages + 1 fragment + 1 variation)
  • Then save the fragments in the vector store, as sketched below.
    The maximum memory footprint is: 1 document, 1 fragment, 1 variation.

Transformers benefiting from the async lazy approach:

  • DoctranQATransformer
  • DoctranTextTranslator
  • OpenAIMetadataTagger
  • DoctranPropertyExtractor
  • NucliaTextTransformer
  • GoogleTranslateTransformer

Transformers less suited at this stage are those primarily working in memory:

  • LongContextReorder
  • EmbeddingsRedundantFilter
  • EmbeddingsClusteringFilter
  • Html2TextTransformer
  • All TextSplitter

Preliminary step for another pull request:

This proposal serves as a preliminary step for another pull request, offering a better implementation of ParentDocumentRetriever. Refer to langchain-rag.

Issue:

  • BaseDocumentTransformer is incompatible with a stream.
  • BaseDocumentTransformer is not LCEL-compatible.
  • The current API consumes excessive memory in many online scenarios.

Dependencies:

No dependencies.

Tag maintainer:

@baskaryan

Twitter handle:

Twitter ID: pprados


class CopyDocumentTransformer(RunnableGeneratorDocumentTransformer):

@pprados pprados (Contributor Author) commented:

For the moment, I've placed this class here so as not to create a langchain_core.document_transformers (as in langchain). This can be changed.

@pprados pprados (Contributor Author) commented Jan 23, 2024

@hwchase17
I've reorganized the code to be compatible with langchain 0.1.
I think it's time to integrate this evolution before there's too much of a gap with langchain-community.

@pprados pprados marked this pull request as ready for review January 23, 2024 11:58
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), Ɑ: lcel (Related to LangChain Expression Language (LCEL)), and 🤖:enhancement (A large net-new component, integration, or chain) labels Jan 23, 2024
@eyurtsev eyurtsev changed the title New integration of LCEL in langchain 0.1 core[minor]: New integration of LCEL in langchain 0.1 Jan 23, 2024
@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan baskaryan closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024