
Contextual compression retriever #2915

Merged 33 commits into master from dev2049/contextual_compression on Apr 21, 2023

Conversation

dev2049 (Contributor) commented Apr 14, 2023

No description provided.

dev2049 (Contributor, Author) commented Apr 14, 2023

will add tests and move the prompt to its own file, but wanted to get people's high-level thoughts first

"""Filter down documents."""

@abstractmethod
def afilter(self, docs: List[Document], query: str) -> List[Document]:
eyurtsev (Collaborator) commented Apr 15, 2023:
Should this have an async keyword?
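For context, a minimal sketch of the async variant being asked about, assuming the same BaseDocumentFilter base and argument types as in the surrounding diff (an illustration, not the final signature):

```python
from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    async def afilter(self, docs: List[Document], query: str) -> List[Document]:
        """Asynchronously filter down documents relevant to the query."""
```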


```python
class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    def filter(self, docs: List[Document], query: str) -> List[Document]:
```
eyurtsev (Collaborator) commented Apr 15, 2023:
Is it possible to use Sequence instead of List on function arguments, to denote that the input won't be mutated by the implementation?
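A sketch of the suggestion: typing the input as a read-only Sequence while still returning a List (hypothetical signature, for illustration only):

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    def filter(self, docs: Sequence[Document], query: str) -> List[Document]:
        """Filter down documents; the Sequence type signals the input won't be mutated."""
```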

```python
        return {"question": query, "context": doc.page_content}


class BaseDocumentFilter(BaseModel, ABC):
```
dev2049 (Contributor, Author):
to me, "filter" sounds like it's just choosing which docs to keep. maybe BaseDocumentCompressor/Compression?
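For illustration, the renamed interface might look roughly like this; the method names are assumptions based on the acompress_documents call that appears later in this diff, not a confirmed API:

```python
from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentCompressor(BaseModel, ABC):
    """Compress retrieved documents given the context of a query."""

    @abstractmethod
    def compress_documents(self, documents: List[Document], query: str) -> List[Document]:
        """Compress documents down to the parts relevant to the query."""

    @abstractmethod
    async def acompress_documents(self, documents: List[Document], query: str) -> List[Document]:
        """Async version of compress_documents."""
```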



```python
def _get_default_chain_prompt() -> PromptTemplate:
    template = """Given the following question and context, extract any part of the context *as is* that is relevant to answer the question.
```
Contributor:
I'm not convinced it's important, but a lot of queries aren't "questions" per se

vowelparrot (Contributor) left a comment:
I like the idea. I'd imagine there may be some use of having a "None" option as well
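One way to support a "None" option is to give the prompt an explicit escape hatch and strip it in the output parsing step. A hypothetical sketch (the constant, wording, and parser are made up, not the PR's actual prompt):

```python
from langchain.prompts import PromptTemplate

# Hypothetical escape-hatch token the LLM can return when nothing is relevant.
NO_OUTPUT_STR = "NO_OUTPUT"

template = f"""Given the following question and context, extract any part of the context *as is* that is relevant to answer the question. If none of the context is relevant, return {NO_OUTPUT_STR}.

Question: {{question}}
Context: {{context}}
Relevant parts:"""

PROMPT = PromptTemplate(template=template, input_variables=["question", "context"])


def parse_output(text: str) -> str:
    """Treat the escape-hatch answer as 'drop this document entirely'."""
    cleaned = text.strip()
    return "" if cleaned == NO_OUTPUT_STR else cleaned
```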

dev2049 (Contributor, Author) commented Apr 17, 2023

A lot of things going on here; added all this to one branch to try to flesh out my own thinking, but would split this PR up if/when we agree on which abstractions are actually worth keeping.

Added some more DocumentFilters and the notion of a filter pipeline. Think these are now really more like document "transformers" than "filters", but we can hash out naming later.

There are now filters for:

  • llm-based compression (one doc at a time)
  • llm-based filtering based on relevance to query
  • embedding based filtering based on relevance to query
  • embedding based filtering based on redundancy with other docs
  • text splitter "filtering" (this is just a wrapper to fit splitters into the pipeline framework)

A filter pipeline would look something like this:

```python
pipeline_filter = DocumentFilterPipeline(
    filters=[
        SplitterDocumentFilter(splitter=CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")),
        EmbeddingRedundantDocumentFilter(embeddings=OpenAIEmbeddings()),
        EmbeddingRelevancyDocumentFilter(embeddings=OpenAIEmbeddings(), similarity_threshold=0.8),
    ]
)
```

This pipeline would break down docs into smaller chunks, remove the redundant ones, then keep only the relevant ones, and it'd reuse embeddings between steps 2 and 3.
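The embedding reuse could be done by caching vectors on the documents as they flow through the pipeline. A hypothetical sketch of that idea (the metadata key and helper name are made up, not the PR's code):

```python
from typing import List

from langchain.embeddings.base import Embeddings
from langchain.schema import Document

# Hypothetical metadata key used to stash an embedding on a document so a
# later pipeline step can reuse it instead of re-embedding the same text.
EMBEDDING_KEY = "embedded_doc"


def get_or_embed(doc: Document, embeddings: Embeddings) -> List[float]:
    """Return a cached embedding if an earlier filter stored one, else compute and cache it."""
    if EMBEDDING_KEY not in doc.metadata:
        doc.metadata[EMBEDDING_KEY] = embeddings.embed_query(doc.page_content)
    return doc.metadata[EMBEDDING_KEY]
```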

any and all feedback appreciated. very possible this is over-engineered, won't be the least bit offended if you say as much :)

```python
        Sequence of relevant documents
        """
        docs = self.base_retriever.get_relevant_documents(query)
        retrieved_docs = [RetrievedDocument.from_document(doc) for doc in docs]
```
Contributor:
hm, if we just use RetrievedDocument here only, it seems a bit weird

it would be nice if normal retrievers could also return RetrievedDocuments

could check if the doc is already a RetrievedDocument, convert it if not, and just return RetrievedDocuments?

RetrievedDocument may be a concept worth living in the general schema?
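A sketch of the convert-if-needed idea (hypothetical helper; the import for RetrievedDocument is omitted because its module path wasn't settled at this point in the PR):

```python
from langchain.schema import Document


def _to_retrieved(doc: Document) -> "RetrievedDocument":
    """Pass RetrievedDocuments through as-is; wrap plain Documents otherwise."""
    if isinstance(doc, RetrievedDocument):
        return doc
    return RetrievedDocument.from_document(doc)
```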

Contributor:
actually... if we aren't 100% sure about RetrievedDocument as an abstraction, it could be kinda nice to keep it internal only? in which case maybe make it _RetrievedDocument?

dev2049 (Contributor, Author):
personally not super convinced it's a long term abstraction, so happy to make it _RetrievedDocument

```python
Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(x: Matrix, y: Matrix) -> np.ndarray:
```
dev2049 (Contributor, Author):
if we're ok making sklearn a dependency, we could just import it

dev2049 (Contributor, Author):
otherwise i can add some unit tests

Contributor:
Think it's OK to reproduce it for now
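For reference, reproducing pairwise cosine similarity in plain numpy (instead of depending on sklearn) only takes a few lines. A sketch that matches the Matrix alias above, not necessarily the exact code that was merged:

```python
from typing import List, Union

import numpy as np

Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(x: Matrix, y: Matrix) -> np.ndarray:
    """Row-wise cosine similarity between two matrices with the same number of columns."""
    x = np.array(x)
    y = np.array(y)
    if x.shape[1] != y.shape[1]:
        raise ValueError("x and y must have the same number of columns.")
    # Normalize each row, then take pairwise dot products.
    x_normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_normed = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_normed @ y_normed.T
```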

dev2049 changed the title from "RFC: Contextual compression retriever" to "Contextual compression retriever" on Apr 18, 2023.
dev2049 marked this pull request as ready for review on April 18, 2023 at 18:37.
```python
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Calculate cosine similarity with numpy."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
from langchain.math_utils import cosine_similarity


def maximal_marginal_relevance(
```
dev2049 (Contributor, Author):
should probably add some unit tests for this
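A minimal pytest sketch of the kind of unit test being suggested (the test case is hypothetical; the import path is the one used in the diff above):

```python
import numpy as np

from langchain.math_utils import cosine_similarity


def test_cosine_similarity_identical_and_orthogonal_rows() -> None:
    x = [[1.0, 0.0], [0.0, 1.0]]
    actual = cosine_similarity(x, x)
    # A row compared with itself gives 1.0; orthogonal rows give 0.0.
    expected = np.array([[1.0, 0.0], [0.0, 1.0]])
    assert np.allclose(actual, expected)
```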

hinthornw's comment was marked as outdated.

hwchase17 pushed a commit that referenced this pull request Apr 19, 2023
Add DocumentTransformer abstraction so that in #2915 we don't have to
wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses
the query) in the contextual doc compression abstractions. with this
change, doc filter (doc extractor, whatever we call it) would look
something like
```python
class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC):
  
  @abstractmethod
  def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]:
    ...
  
  def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]:
    if query is None:
      raise ValueError("Must pass in non-null query to DocumentFilter")
    return self.filter(documents, query)
```
"""
docs = await self.base_retriever.aget_relevant_documents(query)
compressed_docs = await self.base_compressor.acompress_documents(docs, query)
return list(compressed_docs)
dev2049 (Contributor, Author):
is this kosher in async? @agola11 @hwchase17

Contributor:
sure

hwchase17 merged commit 46542dc into master on Apr 21, 2023 (9 checks passed).
hwchase17 deleted the dev2049/contextual_compression branch on April 21, 2023 at 00:01.
vowelparrot pushed a commit that referenced this pull request Apr 21, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
vowelparrot pushed a commit that referenced this pull request Apr 28, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023 (the DocumentTransformer commit above)
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
yanghua pushed a commit to yanghua/langchain that referenced this pull request May 9, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>