Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add document transformer abstraction #3182

Merged
merged 1 commit into from Apr 19, 2023
Merged

Conversation

dev2049
Copy link
Contributor

@dev2049 dev2049 commented Apr 19, 2023

Add DocumentTransformer abstraction so that in #2915 we don't have to wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses the query) in the contextual doc compression abstractions. with this change, doc filter (doc extractor, whatever we call it) would look something like

class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC):
  
  @abstractmethod
  def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]:
    ...
  
  def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]:
    if query is None:
      raise ValueError("Must pass in non-null query to DocumentFilter")
    return self.filter(documents, query)

@dev2049 dev2049 requested a review from hwchase17 April 19, 2023 22:24
@hwchase17 hwchase17 merged commit 10e4b32 into master Apr 19, 2023
9 checks passed
@hwchase17 hwchase17 deleted the dev2049/doc_transformer branch April 19, 2023 23:05
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
Add DocumentTransformer abstraction so that in langchain-ai#2915 we don't have to
wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses
the query) in the contextual doc compression abstractions. with this
change, doc filter (doc extractor, whatever we call it) would look
something like
```python
class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC):
  
  @AbstractMethod
  def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]:
    ...
  
  def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]:
    if query is None:
      raise ValueError("Must pass in non-null query to DocumentFilter")
    return self.filter(documents, query)
```
baskaryan added a commit that referenced this pull request Jul 13, 2023
- Description: Add two new document transformers that translates
documents into different languages and converts documents into q&a
format to improve vector search results. Uses OpenAI function calling
via the [doctran](https://github.com/psychic-api/doctran/tree/main)
library.
  - Issue: N/A
  - Dependencies: `doctran = "^0.0.5"`
  - Tag maintainer: @rlancemartin @eyurtsev @hwchase17 
  - Twitter handle: @psychicapi or @jfan001

Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in
#3182
- refactored `EmbeddingsRedundantFilter` to put it in a file under a new
`document_transformers` module
- Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as
well as the existing `EmbeddingsRedundantFilter`

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
aerrober pushed a commit to aerrober/langchain-fork that referenced this pull request Jul 24, 2023
- Description: Add two new document transformers that translates
documents into different languages and converts documents into q&a
format to improve vector search results. Uses OpenAI function calling
via the [doctran](https://github.com/psychic-api/doctran/tree/main)
library.
  - Issue: N/A
  - Dependencies: `doctran = "^0.0.5"`
  - Tag maintainer: @rlancemartin @eyurtsev @hwchase17 
  - Twitter handle: @psychicapi or @jfan001

Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in
langchain-ai#3182
- refactored `EmbeddingsRedundantFilter` to put it in a file under a new
`document_transformers` module
- Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as
well as the existing `EmbeddingsRedundantFilter`

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants