
core[minor]: New integration of LCEL in langchain 0.1 #16437

Open
wants to merge 16 commits into base: master

Conversation

@pprados pprados (Contributor) commented Jan 23, 2024

Description:

(Note: this is a new version compatible with langchain 0.1.)

LangChain defines the BaseDocumentTransformer interface for document transformation, with two abstract methods:

  • def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]
  • async def atransform_documents(self, documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

However, this interface lacks compatibility with LCEL.

Document transformation can vary in speed, from quick processes like text splitting to more extensive processes such as generating document summaries, translation via DoctranTextTranslator, and creating questions. Introducing an asynchronous lazy approach for transformations would be advantageous, facilitating use in streams and reducing memory consumption.

Furthermore, making all this compatible with LCEL presents an opportunity. The proposed solution involves an incremental approach. The initial step is to introduce a new, asynchronous, and lazy LCEL-compatible interface. Subsequently, all existing transformations (21 classes) would migrate to this new interface.

The RunnableGeneratorDocumentTransformer interface, derived from RunnableSerializable and BaseDocumentTransformer, offers two new methods:

  • lazy_transform_documents()
  • alazy_transform_documents()

This class aims to replace BaseDocumentTransformer.

These methods are generators, possibly asynchronous (AsyncIterator[Document]). The default implementations of transform_documents() and atransform_documents() use the lazy versions internally to maintain compatibility with streams.
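
As a rough illustration, the new base class could look like the following sketch. This is not the PR's actual code; the import paths, generic parameters, and default method bodies are assumptions.

```python
from typing import Any, AsyncIterator, Iterator, Optional, Sequence

from langchain_core.documents import BaseDocumentTransformer, Document
from langchain_core.runnables import RunnableConfig, RunnableSerializable


class RunnableGeneratorDocumentTransformer(
    RunnableSerializable[Sequence[Document], Sequence[Document]],
    BaseDocumentTransformer,
):
    """Sketch of the proposed LCEL-compatible, lazy transformer base class."""

    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        """Subclasses yield transformed documents one at a time."""
        raise NotImplementedError

    async def alazy_transform_documents(
        self, documents: AsyncIterator[Document], **kwargs: Any
    ) -> AsyncIterator[Document]:
        """Async counterpart, yielding transformed documents lazily."""
        raise NotImplementedError

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Default eager implementation delegates to the lazy generator,
        # so existing callers of BaseDocumentTransformer keep working.
        return list(self.lazy_transform_documents(iter(documents), **kwargs))

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Simplified default for the sketch: reuse the synchronous eager path.
        return self.transform_documents(documents, **kwargs)

    def invoke(
        self,
        input: Sequence[Document],
        config: Optional[RunnableConfig] = None,
        **kwargs: Any,
    ) -> Sequence[Document]:
        # Minimal LCEL entry point: documents in, transformed documents out.
        return self.transform_documents(input, **kwargs)
```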

A new version of TextSplitter and CharacterTextSplitter is proposed to demonstrate modifying the current transformer code to accommodate the new interface. (See langchain/unit_test/test_new_text_splitter.py).

The update enables chaining a series of transformations:

```python
runnable = (CharacterTextSplitter(...) | TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

The addition of the + operator allows applying multiple transformations to the same inputs:

```python
runnable = (CharacterTextSplitter(...) + TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

Operators can be combined:

```python
runnable = (UpperTransformer(...) + CharacterTextSplitter(...) |
            TokenTextSplitter(...))
transformed = list(runnable.invoke([doc1, doc2]))
```

Currently, only the unit tests provide examples of transformations.
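
For illustration, a minimal lazy transformer such as the UpperTransformer referenced above could be written against the base class sketched earlier. This is an assumed implementation, not code from this PR:

```python
from typing import Any, Iterator

from langchain_core.documents import Document


class UpperTransformer(RunnableGeneratorDocumentTransformer):
    """Toy transformer that upper-cases each document's content, one document at a time."""

    def lazy_transform_documents(
        self, documents: Iterator[Document], **kwargs: Any
    ) -> Iterator[Document]:
        for doc in documents:
            yield Document(page_content=doc.page_content.upper(), metadata=doc.metadata)
```

Because the work happens in a generator, chaining it with a splitter never requires holding the full corpus in memory at once.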

If this proposal is accepted, the plan is to refactor all Transformer implementations to make them compatible with a lazy, asynchronous, and LCEL-compatible approach. This will be undertaken post-integration into the master branch. Refer to the tests test_lcel_transform_documents() and test_alcel_transform_documents().

Why is the lazy approach important?

Consider an API that receives a URL from a client and is responsible for fetching all the documents behind it and importing them into a vector store. To improve the retrieval proximity of each piece, the code applies three transformations to each fragment (via ParentDocumentRetriever):
- a summary,
- the generation of 3 questions,
- and a translation into French.

With the current code, this consumes excessive memory. A lazy approach significantly reduces the memory requirements, making it feasible to serve multiple users simultaneously without a memory-hungry instance.

With the current LangChain API, the code will:

  • retrieve 10 PDF files (200 pages in total),
  • split them into 10-page fragments (i.e. 200 + 20*10 = 400 pages in memory),
  • produce 5 versions of each fragment (i.e. 200 + (20*10)*4 = 1000 pages in memory),
  • then save the fragments in the vector store (1000 pages plus 1000 texts to import).

The same scenario with a lazy approach translates as follows:

  • Retrieve the files lazily, one at a time (Loader.lazy_load(); 200 pages in memory)
  • Split each file into fragments lazily (200 pages + 1 fragment in memory)
  • For each fragment, produce the 5 versions lazily (200 pages + 1 fragment + 1 variation)
  • Then save the fragments in the vector store, as sketched below.
    The maximum memory footprint is: 1 document, 1 fragment, 1 variation.

Transformers benefiting from the async lazy approach:

  • DoctranQATransformer
  • DoctranTextTranslator
  • OpenAIMetadataTagger
  • DoctranPropertyExtractor
  • NucliaTextTransformer
  • GoogleTranslateTransformer

Transformers less suited at this stage are those primarily working in memory:

  • LongContextReorder
  • EmbeddingsRedundantFilter
  • EmbeddingsClusteringFilter
  • Html2TextTransformer
  • All TextSplitter

Preliminary step for another pull request:

This proposal serves as a preliminary step for another pull request, offering a better implementation of ParentDocumentRetriever. Refer to langchain-rag.

Issue:

  • BaseDocumentTransformer is incompatible with a stream.
  • BaseDocumentTransformer is not LCEL-compatible.
  • The current API consumes excessive memory in many online scenarios.

Dependencies:

No dependencies.

Tag maintainer:

@baskaryan

Twitter handle:

Twitter ID: pprados


class CopyDocumentTransformer(RunnableGeneratorDocumentTransformer):

@pprados pprados (Contributor Author) commented:

For the moment, I've placed this class here so as not to create a langchain_core.document_transformers (as in langchain). This can be changed.

@pprados pprados (Contributor Author) commented Jan 23, 2024

@hwchase17
I've reorganized the code to be compatible with langchain 0.1.
I think it's time to integrate this evolution before there's too much of a gap with langchain-community.

@pprados pprados marked this pull request as ready for review January 23, 2024 11:58
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), Ɑ: lcel (Related to LangChain Expression Language (LCEL)), and 🤖:enhancement (A large net-new component, integration, or chain) labels Jan 23, 2024
@eyurtsev eyurtsev changed the title New integration of LCEL in langchain 0.1 core[minor]: New integration of LCEL in langchain 0.1 Jan 23, 2024
@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan baskaryan closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024