Docs: Add guide for implementing custom retriever (#20350)

Add longer guide for implementing custom retriever.

Co-authored-by: ccurme <chester.curme@gmail.com>
langchain-ai · Apr 26, 2024 · ac374a1
1 parent 21a4969
commit ac374a1
Showing 2 changed files with 310 additions and 20 deletions.
diff --git a/docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb b/docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb
@@ -0,0 +1,309 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "title: Custom Retriever\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff6f3c79-0848-4956-9115-54f6b2134587",
+   "metadata": {},
+   "source": [
+    "# Custom Retriever\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n",
+    "\n",
+    "A retriever is responsible for returning a list of `Document` objects relevant to a given user `query`.\n",
+    "\n",
+    "The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the documents to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n",
+    "\n",
+    "## Interface\n",
+    "\n",
+    "To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n",
+    "\n",
+    "| Method                         | Description                                      | Required/Optional |\n",
+    "|--------------------------------|--------------------------------------------------|-------------------|\n",
+    "| `_get_relevant_documents`      | Get documents relevant to a query.               | Required          |\n",
+    "| `_aget_relevant_documents`     | Implement to provide async native support.       | Optional          |\n",
+    "\n",
+    "\n",
+    "The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n",
+    "\n",
+    ":::{.callout-tip}\n",
+    "By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n",
+    ":::\n",
+    "\n",
+    "\n",
+    ":::{.callout-info}\n",
+    "You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n",
+    "\n",
+    "The main benefit of implementing a retriever as a `BaseRetriever` rather than a `RunnableLambda` (a custom [runnable function](/docs/expression_language/primitives/functions)) is that a `BaseRetriever` is a well-known\n",
+    "LangChain entity, so some monitoring tooling may implement specialized behavior for retrievers. Another difference\n",
+    "is that a `BaseRetriever` behaves slightly differently from a `RunnableLambda` in some APIs; e.g., the `start` event\n",
+    "in the `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n",
+    ":::\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2be9fe82-0757-41d1-a647-15bed11fd3bf",
+   "metadata": {},
+   "source": [
+    "## Example\n",
+    "\n",
+    "Let's implement a toy retriever that returns all documents whose text contains the text in the user query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "bdf61902-2984-493b-a002-d4fced6df590",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import List\n",
+    "\n",
+    "from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
+    "from langchain_core.documents import Document\n",
+    "from langchain_core.retrievers import BaseRetriever\n",
+    "\n",
+    "\n",
+    "class ToyRetriever(BaseRetriever):\n",
+    "    \"\"\"A toy retriever that returns the top k documents that contain the user query.\n",
+    "\n",
+    "    This retriever only implements the sync method _get_relevant_documents.\n",
+    "\n",
+    "    If the retriever were to involve file access or network access, it could benefit\n",
+    "    from a native async implementation of `_aget_relevant_documents`.\n",
+    "\n",
+    "    As usual, with Runnables, there's a default async implementation that's provided\n",
+    "    that delegates to the sync implementation running on another thread.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    documents: List[Document]\n",
+    "    \"\"\"List of documents to retrieve from.\"\"\"\n",
+    "    k: int\n",
+    "    \"\"\"Number of top results to return\"\"\"\n",
+    "\n",
+    "    def _get_relevant_documents(\n",
+    "        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
+    "    ) -> List[Document]:\n",
+    "        \"\"\"Sync implementations for retriever.\"\"\"\n",
+    "        matching_documents = []\n",
+    "        for document in self.documents:\n",
+    "            if len(matching_documents) >= self.k:\n",
+    "                return matching_documents\n",
+    "\n",
+    "            if query.lower() in document.page_content.lower():\n",
+    "                matching_documents.append(document)\n",
+    "        return matching_documents\n",
+    "\n",
+    "    # Optional: Provide a more efficient native implementation by overriding\n",
+    "    # _aget_relevant_documents\n",
+    "    # async def _aget_relevant_documents(\n",
+    "    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n",
+    "    # ) -> List[Document]:\n",
+    "    #     \"\"\"Asynchronously get documents relevant to a query.\n",
+    "\n",
+    "    #     Args:\n",
+    "    #         query: String to find relevant documents for\n",
+    "    #         run_manager: The callbacks handler to use\n",
+    "\n",
+    "    #     Returns:\n",
+    "    #         List of relevant documents\n",
+    "    #     \"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c",
+   "metadata": {},
+   "source": [
+    "## Test it 🧪"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = [\n",
+    "    Document(\n",
+    "        page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
+    "        metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n",
+    "    ),\n",
+    "    Document(\n",
+    "        page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
+    "        metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
+    "    ),\n",
+    "    Document(\n",
+    "        page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
+    "        metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
+    "    ),\n",
+    "    Document(\n",
+    "        page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n",
+    "        metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n",
+    "    ),\n",
+    "    Document(\n",
+    "        page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
+    "        metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
+    "    ),\n",
+    "]\n",
+    "retriever = ToyRetriever(documents=documents, k=3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
+       " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
+      ]
+     },
+     "execution_count": 22,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "retriever.invoke(\"that\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe",
+   "metadata": {},
+   "source": [
+    "It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "3672e9fe-4365-4628-9d25-31924cfaf784",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
+       " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
+      ]
+     },
+     "execution_count": 23,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "await retriever.ainvoke(\"that\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "e2c96eed-6813-421c-acf2-6554839840ee",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n",
+       " [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "retriever.batch([\"dog\", \"cat\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "978b6636-bf36-42c2-969c-207718f084cf",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n",
+      "{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n",
+      "{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n"
+     ]
+    }
+   ],
+   "source": [
+    "async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n",
+    "    print(event)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b45c404-37bf-4370-bb7c-26556777ff46",
+   "metadata": {},
+   "source": [
+    "## Contributing\n",
+    "\n",
+    "We appreciate contributions of interesting retrievers!\n",
+    "\n",
+    "Here's a checklist to help make sure your contribution gets added to LangChain:\n",
+    "\n",
+    "Documentation:\n",
+    "\n",
+    "* [ ] The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
+    "* [ ] The class doc-string for the retriever contains a link to any relevant APIs used by the retriever (e.g., if the retriever retrieves from Wikipedia, it's good to link to the Wikipedia API!)\n",
+    "\n",
+    "Tests:\n",
+    "\n",
+    "* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n",
+    "\n",
+    "Optimizations:\n",
+    "\n",
+    "If the retriever connects to external data sources (e.g., an API or a file), it'll almost certainly benefit from a native async implementation!\n",
+    " \n",
+    "* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/docs/modules/data_connection/retrievers/index.mdx b/docs/docs/modules/data_connection/retrievers/index.mdx
@@ -80,23 +80,4 @@ chain.invoke("What did the president say about technology?")
 
 ## Custom Retriever
 
-Since the retriever interface is so simple, it's pretty easy to write a custom one.
-
-```python
-from langchain_core.retrievers import BaseRetriever
-from langchain_core.callbacks import CallbackManagerForRetrieverRun
-from langchain_core.documents import Document
-from typing import List
-
-
-class CustomRetriever(BaseRetriever):
-
-    def _get_relevant_documents(
-        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
-    ) -> List[Document]:
-        return [Document(page_content=query)]
-
-retriever = CustomRetriever()
-
-retriever.get_relevant_documents("bar")
-```
+See the [documentation here](/docs/modules/data_connection/retrievers/custom_retriever) to implement a custom retriever.
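
As a rough, dependency-free illustration of the matching logic that the new guide's `ToyRetriever` describes, the sketch below uses a hypothetical `Doc` dataclass in place of LangChain's `Document` and two hypothetical helpers, `top_k_substring_matches` and `atop_k_substring_matches`. The async wrapper uses `asyncio.to_thread` only to mimic the idea of a default async path that delegates to the sync one on another thread; this is an assumption for illustration, not LangChain's actual implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def top_k_substring_matches(query: str, documents: List[Doc], k: int) -> List[Doc]:
    """Return at most k documents whose text contains the query (case-insensitive)."""
    needle = query.lower()
    matches: List[Doc] = []
    for doc in documents:
        # Stop as soon as k matches have been collected.
        if len(matches) >= k:
            break
        if needle in doc.page_content.lower():
            matches.append(doc)
    return matches


async def atop_k_substring_matches(query: str, documents: List[Doc], k: int) -> List[Doc]:
    """Async wrapper that runs the sync function in a worker thread."""
    return await asyncio.to_thread(top_k_substring_matches, query, documents, k)


docs = [
    Doc("Dogs are great companions, known for their loyalty and friendliness."),
    Doc("Cats are independent pets that often enjoy their own space."),
    Doc("Rabbits are social animals that need plenty of space to hop around."),
]

# With k=1 only the first match is kept; with k=3 both matching documents are returned.
sync_result = top_k_substring_matches("that", docs, k=1)
async_result = asyncio.run(atop_k_substring_matches("that", docs, k=3))
```

In a real retriever, this logic would live inside `_get_relevant_documents`, with `documents` and `k` as fields on the `BaseRetriever` subclass, as in the guide's notebook.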