
Contextual compression retriever #2915

Merged 33 commits into master from dev2049/contextual_compression on Apr 21, 2023

Conversation

dev2049 (Contributor) commented Apr 14, 2023

No description provided.

dev2049 (Contributor, Author) commented Apr 14, 2023

will add tests and move the prompt to its own file, but wanted to get people's high-level thoughts first

"""Filter down documents."""

@abstractmethod
def afilter(self, docs: List[Document], query: str) -> List[Document]:
eyurtsev (Collaborator) commented Apr 15, 2023:
Should this have an async keyword?
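For context, a minimal sketch of the async variant being asked about, assuming the same BaseDocumentFilter base and argument types as in the surrounding diff (an illustration, not the final signature):

```python
from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    async def afilter(self, docs: List[Document], query: str) -> List[Document]:
        """Asynchronously filter down documents relevant to the query."""
```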


```python
class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    def filter(self, docs: List[Document], query: str) -> List[Document]:
```
eyurtsev (Collaborator) commented Apr 15, 2023:
Is it possible to use Sequence instead of List on function arguments, to denote that the input won't be mutated by the implementation?
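A sketch of the suggestion: typing the input as a read-only Sequence while still returning a List (hypothetical signature, for illustration only):

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentFilter(BaseModel, ABC):
    @abstractmethod
    def filter(self, docs: Sequence[Document], query: str) -> List[Document]:
        """Filter down documents; the Sequence type signals the input won't be mutated."""
```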

```python
        return {"question": query, "context": doc.page_content}


class BaseDocumentFilter(BaseModel, ABC):
```
dev2049 (Contributor, Author):
to me, "filter" sounds like it's just choosing which docs to keep. maybe BaseDocumentCompressor/Compression?
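For illustration, the renamed interface might look roughly like this; the method names are assumptions based on the acompress_documents call that appears later in this diff, not a confirmed API:

```python
from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

from langchain.schema import Document


class BaseDocumentCompressor(BaseModel, ABC):
    """Compress retrieved documents given the context of a query."""

    @abstractmethod
    def compress_documents(self, documents: List[Document], query: str) -> List[Document]:
        """Compress documents down to the parts relevant to the query."""

    @abstractmethod
    async def acompress_documents(self, documents: List[Document], query: str) -> List[Document]:
        """Async version of compress_documents."""
```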



```python
def _get_default_chain_prompt() -> PromptTemplate:
    template = """Given the following question and context, extract any part of the context *as is* that is relevant to answer the question.
```
Contributor:
I'm not convinced it's important, but a lot of queries aren't "questions" per se

vowelparrot (Contributor) left a comment:
I like the idea. I'd imagine there may be some use of having a "None" option as well
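One way to support a "None" option is to give the prompt an explicit escape hatch and strip it in the output parsing step. A hypothetical sketch (the constant, wording, and parser are made up, not the PR's actual prompt):

```python
from langchain.prompts import PromptTemplate

# Hypothetical escape-hatch token the LLM can return when nothing is relevant.
NO_OUTPUT_STR = "NO_OUTPUT"

template = f"""Given the following question and context, extract any part of the context *as is* that is relevant to answer the question. If none of the context is relevant, return {NO_OUTPUT_STR}.

Question: {{question}}
Context: {{context}}
Relevant parts:"""

PROMPT = PromptTemplate(template=template, input_variables=["question", "context"])


def parse_output(text: str) -> str:
    """Treat the escape-hatch answer as 'drop this document entirely'."""
    cleaned = text.strip()
    return "" if cleaned == NO_OUTPUT_STR else cleaned
```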

dev2049 (Contributor, Author) commented Apr 17, 2023

A lot of things going on here; added all this to one branch to try to flesh out my own thinking, but would split this PR up if/when we agree on which abstractions are actually worth keeping.

Added some more DocumentFilters and the notion of a filter pipeline. Think these are now really more like document "transformers" than "filters", but we can hash out naming later.

There are now filters for:

  • llm-based compression (one doc at a time)
  • llm-based filtering based on relevance to query
  • embedding based filtering based on relevance to query
  • embedding based filtering based on redundancy with other docs
  • text splitter "filtering" (this is just a wrapper to fit splitters into the pipeline framework)

A filter pipeline would look something like this:

```python
pipeline_filter = DocumentFilterPipeline(
    filters=[
        SplitterDocumentFilter(splitter=CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")),
        EmbeddingRedundantDocumentFilter(embeddings=OpenAIEmbeddings()),
        EmbeddingRelevancyDocumentFilter(embeddings=OpenAIEmbeddings(), similarity_threshold=0.8),
    ]
)
```

This pipeline would break down docs into smaller chunks, remove the redundant ones, then keep only the relevant ones, and it'd reuse embeddings between steps 2 and 3.
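The embedding reuse could be done by caching vectors on the documents as they flow through the pipeline. A hypothetical sketch of that idea (the metadata key and helper name are made up, not the PR's code):

```python
from typing import List

from langchain.embeddings.base import Embeddings
from langchain.schema import Document

# Hypothetical metadata key used to stash an embedding on a document so a
# later pipeline step can reuse it instead of re-embedding the same text.
EMBEDDING_KEY = "embedded_doc"


def get_or_embed(doc: Document, embeddings: Embeddings) -> List[float]:
    """Return a cached embedding if an earlier filter stored one, else compute and cache it."""
    if EMBEDDING_KEY not in doc.metadata:
        doc.metadata[EMBEDDING_KEY] = embeddings.embed_query(doc.page_content)
    return doc.metadata[EMBEDDING_KEY]
```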

any and all feedback appreciated. very possible this is over-engineered, won't be the least bit offended if you say as much :)

```python
        Sequence of relevant documents
        """
        docs = self.base_retriever.get_relevant_documents(query)
        retrieved_docs = [RetrievedDocument.from_document(doc) for doc in docs]
```
Contributor:
hm, if we just use RetrievedDocument here only, it seems a bit weird

it would be nice if normal retrievers could also return RetrievedDocuments

could check if the doc is already a RetrievedDocument, convert it if not, and just return RetrievedDocuments?

RetrievedDocument may be a concept worth living in the general schema?
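A sketch of the convert-if-needed idea (hypothetical helper; the import for RetrievedDocument is omitted because its module path wasn't settled at this point in the PR):

```python
from langchain.schema import Document


def _to_retrieved(doc: Document) -> "RetrievedDocument":
    """Pass RetrievedDocuments through as-is; wrap plain Documents otherwise."""
    if isinstance(doc, RetrievedDocument):
        return doc
    return RetrievedDocument.from_document(doc)
```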

Contributor:
actually... if we aren't 100% sure about RetrievedDocument as an abstraction, it could be kinda nice to keep it internal only? in which case maybe make it _RetrievedDocument?

dev2049 (Contributor, Author):
personally not super convinced it's a long term abstraction, so happy to make it _RetrievedDocument

```python
Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(x: Matrix, y: Matrix) -> np.ndarray:
```
dev2049 (Contributor, Author):
if we're ok making sklearn a dependency, we could just import it

dev2049 (Contributor, Author):
otherwise i can add some unit tests

Contributor:
Think it's OK to reproduce it for now
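For reference, reproducing pairwise cosine similarity in plain numpy (instead of depending on sklearn) only takes a few lines. A sketch that matches the Matrix alias above, not necessarily the exact code that was merged:

```python
from typing import List, Union

import numpy as np

Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(x: Matrix, y: Matrix) -> np.ndarray:
    """Row-wise cosine similarity between two matrices with the same number of columns."""
    x = np.array(x)
    y = np.array(y)
    if x.shape[1] != y.shape[1]:
        raise ValueError("x and y must have the same number of columns.")
    # Normalize each row, then take pairwise dot products.
    x_normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_normed = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_normed @ y_normed.T
```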

dev2049 changed the title from "RFC: Contextual compression retriever" to "Contextual compression retriever" on Apr 18, 2023.
dev2049 marked this pull request as ready for review on April 18, 2023 at 18:37.
```python
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Calculate cosine similarity with numpy."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
from langchain.math_utils import cosine_similarity


def maximal_marginal_relevance(
```
dev2049 (Contributor, Author):
should probably add some unit tests for this
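A minimal pytest sketch of the kind of unit test being suggested (the test case is hypothetical; the import path is the one used in the diff above):

```python
import numpy as np

from langchain.math_utils import cosine_similarity


def test_cosine_similarity_identical_and_orthogonal_rows() -> None:
    x = [[1.0, 0.0], [0.0, 1.0]]
    actual = cosine_similarity(x, x)
    # A row compared with itself gives 1.0; orthogonal rows give 0.0.
    expected = np.array([[1.0, 0.0], [0.0, 1.0]])
    assert np.allclose(actual, expected)
```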

hinthornw's comment was marked as outdated.

hwchase17 pushed a commit that referenced this pull request Apr 19, 2023
Add DocumentTransformer abstraction so that in #2915 we don't have to
wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses
the query) in the contextual doc compression abstractions. with this
change, doc filter (doc extractor, whatever we call it) would look
something like
```python
class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC):
  
  @abstractmethod
  def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]:
    ...
  
  def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]:
    if query is None:
      raise ValueError("Must pass in non-null query to DocumentFilter")
    return self.filter(documents, query)
```
"""
docs = await self.base_retriever.aget_relevant_documents(query)
compressed_docs = await self.base_compressor.acompress_documents(docs, query)
return list(compressed_docs)
dev2049 (Contributor, Author):
is this kosher in async? @agola11 @hwchase17

Contributor:
sure

hwchase17 merged commit 46542dc into master on Apr 21, 2023 (9 checks passed).
hwchase17 deleted the dev2049/contextual_compression branch on April 21, 2023 at 00:01.
vowelparrot pushed a commit that referenced this pull request Apr 21, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
vowelparrot pushed a commit that referenced this pull request Apr 28, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023 (the DocumentTransformer commit above)
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
yanghua pushed a commit to yanghua/langchain that referenced this pull request May 9, 2023
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>