Docs: Add guide for implementing custom retriever (#20350)
Add longer guide for implementing custom retriever.

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2 people authored and hinthornw committed Apr 26, 2024
1 parent 21a4969 commit ac374a1
Showing 2 changed files with 310 additions and 20 deletions.
309 changes: 309 additions & 0 deletions docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb
@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "raw",
"id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6",
"metadata": {},
"source": [
"---\n",
"title: Custom Retriever\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "ff6f3c79-0848-4956-9115-54f6b2134587",
"metadata": {},
"source": [
"# Custom Retriever\n",
"\n",
"## Overview\n",
"\n",
"Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n",
"\n",
"A retriever is responsible for retrieving a list of relevant `Documents` to a given user `query`.\n",
"\n",
"The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n",
"\n",
"## Interface\n",
"\n",
"To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n",
"\n",
"| Method | Description | Required/Optional |\n",
"|--------------------------------|--------------------------------------------------|-------------------|\n",
"| `_get_relevant_documents` | Get documents relevant to a query. | Required |\n",
"| `_aget_relevant_documents` | Implement to provide async native support. | Optional |\n",
"\n",
"\n",
"The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n",
"\n",
":::{.callout-tip}\n",
"By inherting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n",
":::\n",
"\n",
"\n",
":::{.callout-info}\n",
"You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n",
"\n",
"The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/expression_language/primitives/functions)) is that a `BaseRetriever` is a well\n",
"known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference\n",
"is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event\n",
"in `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n",
":::\n"
]
},
{
"cell_type": "markdown",
"id": "2be9fe82-0757-41d1-a647-15bed11fd3bf",
"metadata": {},
"source": [
"## Example\n",
"\n",
"Let's implement a toy retriever that returns all documents whose text contains the text in the user query."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "bdf61902-2984-493b-a002-d4fced6df590",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"\n",
"from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
"from langchain_core.documents import Document\n",
"from langchain_core.retrievers import BaseRetriever\n",
"\n",
"\n",
"class ToyRetriever(BaseRetriever):\n",
" \"\"\"A toy retriever that contains the top k documents that contain the user query.\n",
"\n",
" This retriever only implements the sync method _get_relevant_documents.\n",
"\n",
" If the retriever were to involve file access or network access, it could benefit\n",
" from a native async implementation of `_aget_relevant_documents`.\n",
"\n",
" As usual, with Runnables, there's a default async implementation that's provided\n",
" that delegates to the sync implementation running on another thread.\n",
" \"\"\"\n",
"\n",
" documents: List[Document]\n",
" \"\"\"List of documents to retrieve from.\"\"\"\n",
" k: int\n",
" \"\"\"Number of top results to return\"\"\"\n",
"\n",
" def _get_relevant_documents(\n",
" self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
" ) -> List[Document]:\n",
" \"\"\"Sync implementations for retriever.\"\"\"\n",
" matching_documents = []\n",
" for document in documents:\n",
" if len(matching_documents) > self.k:\n",
" return matching_documents\n",
"\n",
" if query.lower() in document.page_content.lower():\n",
" matching_documents.append(document)\n",
" return matching_documents\n",
"\n",
" # Optional: Provide a more efficient native implementation by overriding\n",
" # _aget_relevant_documents\n",
" # async def _aget_relevant_documents(\n",
" # self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n",
" # ) -> List[Document]:\n",
" # \"\"\"Asynchronously get documents relevant to a query.\n",
"\n",
" # Args:\n",
" # query: String to find relevant documents for\n",
" # run_manager: The callbacks handler to use\n",
"\n",
" # Returns:\n",
" # List of relevant documents\n",
" # \"\"\""
]
},
{
"cell_type": "markdown",
"id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c",
"metadata": {},
"source": [
"## Test it 🧪"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302",
"metadata": {},
"outputs": [],
"source": [
"documents = [\n",
" Document(\n",
" page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
" metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
" metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
" metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n",
" metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
" metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
" ),\n",
"]\n",
"retriever = ToyRetriever(documents=documents, k=3)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.invoke(\"that\")"
]
},
{
"cell_type": "markdown",
"id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe",
"metadata": {},
"source": [
"It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "3672e9fe-4365-4628-9d25-31924cfaf784",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"await retriever.ainvoke(\"that\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "e2c96eed-6813-421c-acf2-6554839840ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n",
" [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.batch([\"dog\", \"cat\"])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "978b6636-bf36-42c2-969c-207718f084cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n",
"{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n",
"{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n"
]
}
],
"source": [
"async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n",
" print(event)"
]
},
{
"cell_type": "markdown",
"id": "7b45c404-37bf-4370-bb7c-26556777ff46",
"metadata": {},
"source": [
"## Contributing\n",
"\n",
"We appreciate contributions of interesting retrievers!\n",
"\n",
"Here's a checklist to help make sure your contribution gets added to LangChain:\n",
"\n",
"Documentation:\n",
"\n",
"* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
"* The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)\n",
"\n",
"Tests:\n",
"\n",
"* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n",
"\n",
"Optimizations:\n",
"\n",
"If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n",
" \n",
"* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
21 changes: 1 addition & 20 deletions docs/docs/modules/data_connection/retrievers/index.mdx
@@ -80,23 +80,4 @@ chain.invoke("What did the president say about technology?")

## Custom Retriever

Since the retriever interface is so simple, it's pretty easy to write a custom one.

```python
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from typing import List


class CustomRetriever(BaseRetriever):

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        return [Document(page_content=query)]

retriever = CustomRetriever()

retriever.get_relevant_documents("bar")
```
See the [documentation here](/docs/modules/data_connection/retrievers/custom_retriever) to implement a custom retriever.
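
The matching logic used by the toy retriever in the linked guide (case-insensitive substring search that stops after `k` hits) can be sketched without LangChain installed. This is only an illustration under stated assumptions: `SimpleDoc` and `top_k_matches` are hypothetical names, and `SimpleDoc` is a plain stand-in for LangChain's `Document`, not the real class.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain `Document` (not the real class)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def top_k_matches(docs: List[SimpleDoc], query: str, k: int) -> List[SimpleDoc]:
    """Return up to k docs whose text contains the query, ignoring case."""
    needle = query.lower()
    matches: List[SimpleDoc] = []
    for doc in docs:
        if len(matches) >= k:
            break  # stop scanning once k results are collected
        if needle in doc.page_content.lower():
            matches.append(doc)
    return matches


docs = [
    SimpleDoc("Cats are independent pets that often enjoy their own space."),
    SimpleDoc("Parrots are intelligent birds capable of mimicking human speech."),
    SimpleDoc("Rabbits are social animals that need plenty of space to hop around."),
]
results = top_k_matches(docs, "THAT", k=1)
print([d.page_content for d in results])
```

In a real retriever, the body of `top_k_matches` would sit inside `_get_relevant_documents` and read the documents and `k` from fields on the `BaseRetriever` subclass.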
