Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Add guide for implementing custom retriever #20350

Merged
merged 11 commits into from
Apr 12, 2024
309 changes: 309 additions & 0 deletions docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "raw",
"id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6",
"metadata": {},
"source": [
"---\n",
"title: Custom Retriever\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "ff6f3c79-0848-4956-9115-54f6b2134587",
"metadata": {},
"source": [
"# Custom Retriever\n",
"\n",
"## Overview\n",
"\n",
"Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n",
"\n",
"A retriever is responsible for retrieving a list of relevant `Documents` to a given user `query`.\n",
"\n",
"The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n",
"\n",
"## Interface\n",
"\n",
"To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n",
"\n",
"| Method | Description | Required/Optional |\n",
"|--------------------------------|--------------------------------------------------|-------------------|\n",
"| `_get_relevant_documents` | Get documents relevant to a query. | Required |\n",
"| `_aget_relevant_documents` | Implement to provide async native support. | Optional |\n",
"\n",
"\n",
"The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n",
"\n",
":::{.callout-tip}\n",
"By inherting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n",
":::\n",
"\n",
"\n",
":::{.callout-info}\n",
"You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n",
"\n",
"The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/expression_language/primitives/functions)) is that a `BaseRetriever` is a well\n",
"known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference\n",
"is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event\n",
"in `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n",
":::\n"
]
},
{
"cell_type": "markdown",
"id": "2be9fe82-0757-41d1-a647-15bed11fd3bf",
"metadata": {},
"source": [
"## Example\n",
"\n",
"Let's implement a toy retriever that returns all documents whose text contains the text in the user query."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "bdf61902-2984-493b-a002-d4fced6df590",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"\n",
"from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
"from langchain_core.documents import Document\n",
"from langchain_core.retrievers import BaseRetriever\n",
"\n",
"\n",
"class ToyRetriever(BaseRetriever):\n",
" \"\"\"A toy retriever that contains the top k documents that contain the user query.\n",
"\n",
" This retriever only implements the sync method _get_relevant_documents.\n",
"\n",
" If the retriever were to involve file access or network access, it could benefit\n",
" from a native async implementation of `_aget_relevant_documents`.\n",
"\n",
" As usual, with Runnables, there's a default async implementation that's provided\n",
" that delegates to the sync implementation running on another thread.\n",
" \"\"\"\n",
"\n",
" documents: List[Document]\n",
" \"\"\"List of documents to retrieve from.\"\"\"\n",
" k: int\n",
" \"\"\"Number of top results to return\"\"\"\n",
"\n",
" def _get_relevant_documents(\n",
" self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
" ) -> List[Document]:\n",
" \"\"\"Sync implementations for retriever.\"\"\"\n",
" matching_documents = []\n",
" for document in documents:\n",
" if len(matching_documents) > self.k:\n",
" return matching_documents\n",
"\n",
" if query.lower() in document.page_content.lower():\n",
" matching_documents.append(document)\n",
" return matching_documents\n",
"\n",
" # Optional: Provide a more efficient native implementation by overriding\n",
" # _aget_relevant_documents\n",
" # async def _aget_relevant_documents(\n",
" # self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n",
" # ) -> List[Document]:\n",
" # \"\"\"Asynchronously get documents relevant to a query.\n",
"\n",
" # Args:\n",
" # query: String to find relevant documents for\n",
" # run_manager: The callbacks handler to use\n",
"\n",
" # Returns:\n",
" # List of relevant documents\n",
" # \"\"\""
]
},
{
"cell_type": "markdown",
"id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c",
"metadata": {},
"source": [
"## Test it 🧪"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302",
"metadata": {},
"outputs": [],
"source": [
"documents = [\n",
" Document(\n",
" page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
" metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
" metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
" metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n",
" metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
" metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
" ),\n",
"]\n",
"retriever = ToyRetriever(documents=documents, k=3)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.invoke(\"that\")"
]
},
{
"cell_type": "markdown",
"id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe",
"metadata": {},
"source": [
"It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "3672e9fe-4365-4628-9d25-31924cfaf784",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"await retriever.ainvoke(\"that\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "e2c96eed-6813-421c-acf2-6554839840ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n",
" [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.batch([\"dog\", \"cat\"])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "978b6636-bf36-42c2-969c-207718f084cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n",
"{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n",
"{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n"
]
}
],
"source": [
"async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n",
" print(event)"
]
},
{
"cell_type": "markdown",
"id": "7b45c404-37bf-4370-bb7c-26556777ff46",
"metadata": {},
"source": [
"## Contributing\n",
"\n",
"We appreciate contributions of interesting retrievers!\n",
"\n",
"Here's a checklist to help make sure your contribution gets added to LangChain:\n",
"\n",
"Documentation:\n",
"\n",
"* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
"* The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)\n",
"\n",
"Tests:\n",
"\n",
"* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n",
"\n",
"Optimizations:\n",
"\n",
"If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n",
" \n",
"* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
21 changes: 1 addition & 20 deletions docs/docs/modules/data_connection/retrievers/index.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -80,23 +80,4 @@ chain.invoke("What did the president say about technology?")

## Custom Retriever

Since the retriever interface is so simple, it's pretty easy to write a custom one.

```python
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from typing import List


class CustomRetriever(BaseRetriever):

def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
return [Document(page_content=query)]

retriever = CustomRetriever()

retriever.get_relevant_documents("bar")
```
See the [documentation here](/docs/modules/data_connection/retrievers/custom_retriever) to implement a custom retriever.