-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Contextual compression retriever #2915
Conversation
will add tests and move prompt to own file but wanted to get ppl's high level thoughts first |
"""Filter down documents.""" | ||
|
||
@abstractmethod | ||
def afilter(self, docs: List[Document], query: str) -> List[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have an async keyword?
|
||
class BaseDocumentFilter(BaseModel, ABC): | ||
@abstractmethod | ||
def filter(self, docs: List[Document], query: str) -> List[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use sequence instead of list on function arguments to denote that the input won't be mutated by the implementation?
return {"question": query, "context": doc.page_content} | ||
|
||
|
||
class BaseDocumentFilter(BaseModel, ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me filter sounds like it's just choosing which docs to keep. maybe BaseDocumentCompressor/Compression?
|
||
|
||
def _get_default_chain_prompt() -> PromptTemplate: | ||
template = """Given the following question and context, extract any part of the context *as is* that is relevant to answer the question. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced it's important, but a lot of queries aren't "questions" per-se
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea. I'd imagine there may be some use of having a "None" option as well
…/langchain into dev2049/contextual_compression
A lot of things going on here, added all this to one branch to try to flesh out my own thinking but would split this pr up if/when we agree on which abstractions are actually worth keeping Added some more DocumentFilters and notion of a filter pipeline. Think these are now really more like document "transformers" than "filters", but can hash out naming later. There's now filters for
A filter pipeline would look something like this pipeline_filter = DocumentFilterPipeline(
filters=[
SplitterDocumentFilter(splitter=CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")),
EmbeddingRedundantDocumentFilter(embeddings=OpenAIEmbeddings()),
EmbeddingRelevancyDocumentFilter(embeddings=OpenAIEmbeddings(), similarity_threshold=0.8),
]
) this pipeline would break down docs into smaller chunks, remove the redundant ones, then keep only the relevant ones. and it'd reuse embeddings between steps 2 and 3 any and all feedback appreciated. very possible this is over-engineered, won't be the least bit offended if you say as much :) |
Sequence of relevant documents | ||
""" | ||
docs = self.base_retriever.get_relevant_documents(query) | ||
retrieved_docs = [RetrievedDocument.from_document(doc) for doc in docs] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm if we just use retrieved document here only seems a bit weird
it would be nice if normal retrievers could also return RetrievedDocuments
could check if doc
already retrieved doc, convert if not. and just return retrieved doc?
retrieved doc may be a concept worth living in general schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually.... if we arent 100% sure about retrieved doc as an abstaction, could be kinda nice to have internal only? in whcih case maybe make it _RetrievedDocument
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally not super convinced it's a long term abstraction, so happy to make it _RetrievedDocument
langchain/math_utils.py
Outdated
Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray] | ||
|
||
|
||
def cosine_similarity(x: Matrix, y: Matrix) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're ok making sklearn a dependency could just import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise i can add some unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it's OK to repro for now
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float: | ||
"""Calculate cosine similarity with numpy.""" | ||
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) | ||
from langchain.math_utils import cosine_similarity | ||
|
||
|
||
def maximal_marginal_relevance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably add some unit tests for this
Add DocumentTransformer abstraction so that in #2915 we don't have to wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses the query) in the contextual doc compression abstractions. with this change, doc filter (doc extractor, whatever we call it) would look something like ```python class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC): @AbstractMethod def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]: ... def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]: if query is None: raise ValueError("Must pass in non-null query to DocumentFilter") return self.filter(documents, query) ```
""" | ||
docs = await self.base_retriever.aget_relevant_documents(query) | ||
compressed_docs = await self.base_compressor.acompress_documents(docs, query) | ||
return list(compressed_docs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this kosher in async? @agola11 @hwchase17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
…/langchain into dev2049/contextual_compression
remove useless classes
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Add DocumentTransformer abstraction so that in langchain-ai#2915 we don't have to wrap TextSplitter and RedundantEmbeddingFilter (neither of which uses the query) in the contextual doc compression abstractions. with this change, doc filter (doc extractor, whatever we call it) would look something like ```python class BaseDocumentFilter(BaseDocumentTransformer[_RetrievedDocument], ABC): @AbstractMethod def filter(self, documents: List[_RetrievedDocument], query: str) -> List[_RetrievedDocument]: ... def transform_documents(self, documents: List[_RetrievedDocument], query: Optional[str] = None, **kwargs: Any) -> List[_RetrievedDocument]: if query is None: raise ValueError("Must pass in non-null query to DocumentFilter") return self.filter(documents, query) ```
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
No description provided.