forked from langchain-ai/langchain
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Docs: Add guide for implementing custom retriever (langchain-ai#20350)
Add longer guide for implementing custom retriever. --------- Co-authored-by: ccurme <chester.curme@gmail.com>
- Loading branch information
1 parent
f110d3c
commit b6a9262
Showing
2 changed files
with
310 additions
and
20 deletions.
There are no files selected for viewing
309 changes: 309 additions & 0 deletions
309
docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,309 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "raw", | ||
"id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6", | ||
"metadata": {}, | ||
"source": [ | ||
"---\n", | ||
"title: Custom Retriever\n", | ||
"---" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ff6f3c79-0848-4956-9115-54f6b2134587", | ||
"metadata": {}, | ||
"source": [ | ||
"# Custom Retriever\n", | ||
"\n", | ||
"## Overview\n", | ||
"\n", | ||
"Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n", | ||
"\n", | ||
"A retriever is responsible for retrieving a list of relevant `Documents` to a given user `query`.\n", | ||
"\n", | ||
"The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n", | ||
"\n", | ||
"## Interface\n", | ||
"\n", | ||
"To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n", | ||
"\n", | ||
"| Method | Description | Required/Optional |\n", | ||
"|--------------------------------|--------------------------------------------------|-------------------|\n", | ||
"| `_get_relevant_documents` | Get documents relevant to a query. | Required |\n", | ||
"| `_aget_relevant_documents` | Implement to provide async native support. | Optional |\n", | ||
"\n", | ||
"\n", | ||
"The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n", | ||
"\n", | ||
":::{.callout-tip}\n", | ||
"By inherting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n", | ||
":::\n", | ||
"\n", | ||
"\n", | ||
":::{.callout-info}\n", | ||
"You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n", | ||
"\n", | ||
"The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/expression_language/primitives/functions)) is that a `BaseRetriever` is a well\n", | ||
"known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference\n", | ||
"is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event\n", | ||
"in `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n", | ||
":::\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2be9fe82-0757-41d1-a647-15bed11fd3bf", | ||
"metadata": {}, | ||
"source": [ | ||
"## Example\n", | ||
"\n", | ||
"Let's implement a toy retriever that returns all documents whose text contains the text in the user query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 26, | ||
"id": "bdf61902-2984-493b-a002-d4fced6df590", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from typing import List\n", | ||
"\n", | ||
"from langchain_core.callbacks import CallbackManagerForRetrieverRun\n", | ||
"from langchain_core.documents import Document\n", | ||
"from langchain_core.retrievers import BaseRetriever\n", | ||
"\n", | ||
"\n", | ||
"class ToyRetriever(BaseRetriever):\n", | ||
" \"\"\"A toy retriever that contains the top k documents that contain the user query.\n", | ||
"\n", | ||
" This retriever only implements the sync method _get_relevant_documents.\n", | ||
"\n", | ||
" If the retriever were to involve file access or network access, it could benefit\n", | ||
" from a native async implementation of `_aget_relevant_documents`.\n", | ||
"\n", | ||
" As usual, with Runnables, there's a default async implementation that's provided\n", | ||
" that delegates to the sync implementation running on another thread.\n", | ||
" \"\"\"\n", | ||
"\n", | ||
" documents: List[Document]\n", | ||
" \"\"\"List of documents to retrieve from.\"\"\"\n", | ||
" k: int\n", | ||
" \"\"\"Number of top results to return\"\"\"\n", | ||
"\n", | ||
" def _get_relevant_documents(\n", | ||
" self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n", | ||
" ) -> List[Document]:\n", | ||
" \"\"\"Sync implementations for retriever.\"\"\"\n", | ||
" matching_documents = []\n", | ||
" for document in documents:\n", | ||
" if len(matching_documents) > self.k:\n", | ||
" return matching_documents\n", | ||
"\n", | ||
" if query.lower() in document.page_content.lower():\n", | ||
" matching_documents.append(document)\n", | ||
" return matching_documents\n", | ||
"\n", | ||
" # Optional: Provide a more efficient native implementation by overriding\n", | ||
" # _aget_relevant_documents\n", | ||
" # async def _aget_relevant_documents(\n", | ||
" # self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n", | ||
" # ) -> List[Document]:\n", | ||
" # \"\"\"Asynchronously get documents relevant to a query.\n", | ||
"\n", | ||
" # Args:\n", | ||
" # query: String to find relevant documents for\n", | ||
" # run_manager: The callbacks handler to use\n", | ||
"\n", | ||
" # Returns:\n", | ||
" # List of relevant documents\n", | ||
" # \"\"\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c", | ||
"metadata": {}, | ||
"source": [ | ||
"## Test it 🧪" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 21, | ||
"id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"documents = [\n", | ||
" Document(\n", | ||
" page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n", | ||
" metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n", | ||
" ),\n", | ||
" Document(\n", | ||
" page_content=\"Cats are independent pets that often enjoy their own space.\",\n", | ||
" metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n", | ||
" ),\n", | ||
" Document(\n", | ||
" page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n", | ||
" metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n", | ||
" ),\n", | ||
" Document(\n", | ||
" page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n", | ||
" metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n", | ||
" ),\n", | ||
" Document(\n", | ||
" page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n", | ||
" metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n", | ||
" ),\n", | ||
"]\n", | ||
"retriever = ToyRetriever(documents=documents, k=3)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 22, | ||
"id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n", | ||
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]" | ||
] | ||
}, | ||
"execution_count": 22, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"retriever.invoke(\"that\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe", | ||
"metadata": {}, | ||
"source": [ | ||
"It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 23, | ||
"id": "3672e9fe-4365-4628-9d25-31924cfaf784", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n", | ||
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]" | ||
] | ||
}, | ||
"execution_count": 23, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"await retriever.ainvoke(\"that\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 24, | ||
"id": "e2c96eed-6813-421c-acf2-6554839840ee", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n", | ||
" [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]" | ||
] | ||
}, | ||
"execution_count": 24, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"retriever.batch([\"dog\", \"cat\"])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 25, | ||
"id": "978b6636-bf36-42c2-969c-207718f084cf", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n", | ||
"{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n", | ||
"{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n", | ||
" print(event)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7b45c404-37bf-4370-bb7c-26556777ff46", | ||
"metadata": {}, | ||
"source": [ | ||
"## Contributing\n", | ||
"\n", | ||
"We appreciate contributions of interesting retrievers!\n", | ||
"\n", | ||
"Here's a checklist to help make sure your contribution gets added to LangChain:\n", | ||
"\n", | ||
"Documentation:\n", | ||
"\n", | ||
"* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n", | ||
"* The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)\n", | ||
"\n", | ||
"Tests:\n", | ||
"\n", | ||
"* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n", | ||
"\n", | ||
"Optimizations:\n", | ||
"\n", | ||
"If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n", | ||
" \n", | ||
"* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters