diff --git a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb index e69de29b..ba55bdcf 100644 --- a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb +++ b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb @@ -0,0 +1,794 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Hybrid Search using RRF\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/leemthompo/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb)\n", + "\n", + "In this example we'll use the reciprocal rank fusion algorithm to combine the results of BM25 and kNN semantic search.\n", + "We'll use the same dataset we used in our [quickstart](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) guide.\n", + "You can use RRF for hybrid search out of the box, without any additional configuration.\n", + "\n", + "We also provide a walkthrough of a toy example, which demonstrates how RRF ranking works at a basic level." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01AXpELkygt" + }, + "source": [ + "# 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment with minimum **4GB machine learning node**\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html) model installed on your Elastic deployment\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "N4pI1-eIvWrI" + }, + "source": [ + "# Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Under **Advanced settings**, go to **Machine Learning instances**\n", + " - You'll need at least **4GB** RAM per zone for this tutorial\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages and initialize the Elasticsearch Python client\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the packages we need for this example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "!git clone https://github.com/elastic/elasticsearch-py.git\n", + "%cd elasticsearch-py\n", + "!git checkout v8.8.2\n", + "%pip install .\n", + "!pip install sentence_transformers\n", + "!pip install torch\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gEzq2Z1wBs3M" + }, + "source": [ + "Next we need to import the `elasticsearch` module and the `getpass` module, along with `sentence_transformers` and `torch`.\n", + "`getpass` is part of the Python standard library and is used to securely prompt for credentials.\n", + "We also load the `all-MiniLM-L6-v2` sentence transformer model, which we'll use to generate embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uP_GTVRi-d96" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "from sentence_transformers import SentenceTransformer\n", + "import torch\n", + "\n", + "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", + "\n", + "model = SentenceTransformer('all-MiniLM-L6-v2', device=device)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "AMSePFiZCRqX" + }, + "source": [ + "Now we can instantiate the Python Elasticsearch client.\n", + "First we prompt the user for their password and Cloud ID.\n", + "\n", + "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal.\n", + "\n", + "Then we create a `client` object, which is an instance of the `Elasticsearch` class."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h0MdAZ53CdKL", + "outputId": "96ea6f81-f935-4d51-c4a7-af5a896180f1" + }, + "outputs": [], + "source": [ + "# Found in the 'Manage Deployment' page\n", + "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n", + "\n", + "# Password for the 'elastic' user generated by Elasticsearch\n", + "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "bRHbecNeEDL3" + }, + "source": [ + "Confirm that the client has connected by running this test:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rdiUKqZbEKfF", + "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" + }, + "outputs": [], + "source": [ + "print(client.info())" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "enHQuT57DhD1" + }, + "source": [ + "Refer to the [Python client documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.\n", + "\n", + "The [same page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) also explains how to connect using API keys.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TF_wxIAhD07a" + }, + "source": [ + "# Create Elasticsearch index with required mappings\n", + "\n", + "We need to add a field to support dense vector storage and search.\n", + "Note the `title_vector` field below, which is used to store the dense vector representation of the `title` field."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cvYECABJJs_2", + "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" + }, + "outputs": [], + "source": [ + "# Define the mapping\n", + "mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"title\": {\"type\": \"text\"},\n", + " \"authors\": {\"type\": \"keyword\"},\n", + " \"summary\": {\"type\": \"text\"},\n", + " \"publish_date\": {\"type\": \"date\"},\n", + " \"num_reviews\": {\"type\": \"integer\"},\n", + " \"publisher\": {\"type\": \"keyword\"},\n", + " \"title_vector\": { \n", + " \"type\": \"dense_vector\", \n", + " \"dims\": 384, \n", + " \"index\": \"true\", \n", + " \"similarity\": \"dot_product\" \n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "# Create the index\n", + "client.indices.create(index='rrf_book_index', body=mapping)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset\n", + "\n", + "Let's index some data.\n", + "Note that we are embedding the `title` field using the sentence transformer model.\n", + "Once indexed, you'll see that your documents contain a `title_vector` field (`\"type\": \"dense_vector\"`) which contains a vector of floating point values.\n", + "This is the embedding of the `title` field in vector space.\n", + "We'll use this field to perform semantic search using kNN." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "books = [\n", + " {\n", + " \"title\": \"The Pragmatic Programmer: Your Journey to Mastery\",\n", + " \"authors\": [\"andrew hunt\", \"david thomas\"],\n", + " \"summary\": \"A guide to pragmatic programming for software engineers and developers\",\n", + " \"publish_date\": \"2019-10-29\",\n", + " \"num_reviews\": 30,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"Python Crash Course\",\n", + " \"authors\": [\"eric matthes\"],\n", + " \"summary\": \"A fast-paced, no-nonsense guide to programming in Python\",\n", + " \"publish_date\": \"2019-05-03\",\n", + " \"num_reviews\": 42,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Artificial Intelligence: A Modern Approach\",\n", + " \"authors\": [\"stuart russell\", \"peter norvig\"],\n", + " \"summary\": \"Comprehensive introduction to the theory and practice of artificial intelligence\",\n", + " \"publish_date\": \"2020-04-06\",\n", + " \"num_reviews\": 39,\n", + " \"publisher\": \"pearson\"\n", + " },\n", + " {\n", + " \"title\": \"Clean Code: A Handbook of Agile Software Craftsmanship\",\n", + " \"authors\": [\"robert c. 
martin\"],\n", + " \"summary\": \"A guide to writing code that is easy to read, understand and maintain\",\n", + " \"publish_date\": \"2008-08-11\",\n", + " \"num_reviews\": 55,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"You Don't Know JS: Up & Going\",\n", + " \"authors\": [\"kyle simpson\"],\n", + " \"summary\": \"Introduction to JavaScript and programming as a whole\",\n", + " \"publish_date\": \"2015-03-27\",\n", + " \"num_reviews\": 36,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Eloquent JavaScript\",\n", + " \"authors\": [\"marijn haverbeke\"],\n", + " \"summary\": \"A modern introduction to programming\",\n", + " \"publish_date\": \"2018-12-04\",\n", + " \"num_reviews\": 38,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Design Patterns: Elements of Reusable Object-Oriented Software\",\n", + " \"authors\": [\"erich gamma\", \"richard helm\", \"ralph johnson\", \"john vlissides\"],\n", + " \"summary\": \"Guide to design patterns that can be used in any object-oriented language\",\n", + " \"publish_date\": \"1994-10-31\",\n", + " \"num_reviews\": 45,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"The Clean Coder: A Code of Conduct for Professional Programmers\",\n", + " \"authors\": [\"robert c. 
martin\"],\n", + " \"summary\": \"A guide to professional conduct in the field of software engineering\",\n", + " \"publish_date\": \"2011-05-13\",\n", + " \"num_reviews\": 20,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"JavaScript: The Good Parts\",\n", + " \"authors\": [\"douglas crockford\"],\n", + " \"summary\": \"A deep dive into the parts of JavaScript that are essential to writing maintainable code\",\n", + " \"publish_date\": \"2008-05-15\",\n", + " \"num_reviews\": 51,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Introduction to the Theory of Computation\",\n", + " \"authors\": [\"michael sipser\"],\n", + " \"summary\": \"Introduction to the theory of computation and complexity theory\",\n", + " \"publish_date\": \"2012-06-27\",\n", + " \"num_reviews\": 33,\n", + " \"publisher\": \"cengage learning\"\n", + " },\n", + "]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Index documents\n", + "\n", + "Our dataset is a Python list of dictionaries, each containing a book's title, authors, summary, and other metadata.\n", + "We'll use the client's `bulk` method to index our documents in a single request.\n", + "\n", + "The following code iterates over the list of books and creates a list of actions to be performed.\n", + "Each action is a dictionary containing an \"index\" operation on our Elasticsearch index.\n", + "The book's title is encoded using our selected model, and the encoded vector is added to the book document.\n", + "The book document is then added to the list of actions.\n", + "\n", + "Finally, we call the `bulk` method, specifying the index name and the list of actions."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "actions = []\n", + "for book in books:\n", + "    actions.append({\"index\": {\"_index\": \"rrf_book_index\"}})\n", + "    titleEmbedding = model.encode(book[\"title\"]).tolist()\n", + "    book[\"title_vector\"] = titleEmbedding\n", + "    actions.append(book)\n", + "\n", + "client.bulk(index=\"rrf_book_index\", operations=actions, refresh=True)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "WgWDMgf9NkHL" + }, + "source": [ + "## Pretty printing Elasticsearch responses\n", + "\n", + "This is a helper function to print Elasticsearch responses in a readable format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def pretty_response(response):\n", + "    for hit in response['hits']['hits']:\n", + "        id = hit['_id']\n", + "        publication_date = hit['_source']['publish_date']\n", + "        score = hit['_score']\n", + "        title = hit['_source']['title']\n", + "        summary = hit['_source']['summary']\n", + "        pretty_output = (f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nScore: {score}\")\n", + "        print(pretty_output)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "MrBCHdH1u8Wd" + }, + "source": [ + "# Hybrid search using RRF\n", + "\n", + "## RRF overview\n", + "\n", + "[Reciprocal Rank Fusion (RRF)](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) is a state-of-the-art ranking algorithm for combining results from different information retrieval strategies.\n", + "RRF consistently improves the combined results of different search algorithms.\n", + "It requires no calibration, and the fused ranking often surpasses the best individual result set.\n", + "In brief, it enables best-in-class hybrid search out of the box.\n", + "\n", + "## How RRF works in Elasticsearch\n", +
"\n", + "You can use RRF as part of a search to combine and rank documents using result sets from a combination of query and/or kNN searches.\n", + "At least two result sets from the specified sources are required for ranking.\n", + "Check out the [RRF API reference](https://www.elastic.co/guide/en/elasticsearch/reference/master/rrf.html#rrf-api) for full details.\n", + "\n", + "In the following example, we'll use RRF to combine the results of a `match` query and a kNN semantic search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"size\": 5,\n", + " \"query\": {\n", + " \"match\": {\n", + " \"summary\": \"python programming\"\n", + " },\n", + " \n", + " },\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\" : model.encode(\"python programming\").tolist(), # generate embedding for query so it can be compared to `title_vector`\n", + " \"k\": 5,\n", + " \"num_candidates\": 10},\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 20\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"rrf_book_index\", body=body)\n", + "\n", + "print(response)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example, we first execute the kNN search to get its global top 5 results.\n", + "Then we execute the match query to get its global top 5 results.\n", + "Then we combine the kNN search and match query results and rank them based on the RRF method to get the final top 5 results, as set by `size`.\n", + "\n", + "ℹ️ Note that if `k` from a kNN search is larger than `window_size`, the results are truncated to `window_size`.\n", + "If `k` is smaller than `window_size`, the kNN search contributes only `k` results."
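+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a rough sketch of the RRF method referred to above, each document's score is the sum of its reciprocal ranks across the result sets it appears in.\n", + "The notation here is ours, not the API's: $Q$ is the set of searches (here the `match` query and the kNN search), $R_q$ is the result set of search $q$, $\\mathrm{rank}_q(d)$ is the 1-based position of document $d$ in $R_q$, and $k$ stands for `rank_constant`.\n", + "\n", + "$$\n", + "\\mathrm{score}(d) = \\sum_{q \\in Q,\\, d \\in R_q} \\frac{1}{k + \\mathrm{rank}_q(d)}\n", + "$$\n", + "\n", + "For instance, with `rank_constant` set to 20 as above, a hypothetical document ranked 1st by the `match` query and 2nd by the kNN search would score $1/(20+1) + 1/(20+2) \\approx 0.0931$, while a document that appears in only one result set contributes a single term."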
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RRF toy example\n", + "\n", + "This very simple example demonstrates how RRF ranks documents from different search strategies.\n", + "We begin by creating a mapping for an index with a text field, a vector field, and an integer field along with indexing several documents. For this example we are going to use a vector with only a single dimension to make the ranking easier to explain." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"text\" : {\n", + " \"type\" : \"text\"\n", + " },\n", + " \"vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 1,\n", + " \"similarity\": \"l2_norm\",\n", + " \"index\": \"true\"\n", + "\n", + " },\n", + " \"integer\" : {\n", + " \"type\" : \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "client.indices.create(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next let's index some documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "doc1 = {\n", + " \"text\" : \"rrf\",\n", + " \"vector\" : [5],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc2 = {\n", + " \"text\" : \"rrf rrf\",\n", + " \"vector\" : [4],\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc3 = {\n", + " \"text\" : \"rrf rrf rrf\",\n", + " \"vector\" : [3],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc4 = {\n", + " \"text\" : \"rrf rrf rrf rrf\",\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc5 = {\n", + " \"vector\" : [0],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "docs = [doc1, doc2, doc3, doc4, doc5]\n", + "\n", + "actions = []\n", + "for doc_id, doc in enumerate(docs, start=1):  # explicit IDs 1-5, matching the hits discussed below\n", + "    actions.append({\"index\": {\"_index\": \"example-index\", \"_id\": str(doc_id)}})\n", + "    actions.append(doc)\n", + "\n", + "client.bulk(index=\"example-index\", operations=actions, refresh=True)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now execute a search using RRF with a query, a kNN search, and a terms aggregation."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"query\": {\n", + " \"term\": {\n", + " \"text\": \"rrf\"\n", + " }\n", + " },\n", + " \"knn\": {\n", + " \"field\": \"vector\",\n", + " \"query_vector\": [3],\n", + " \"k\": 5,\n", + " \"num_candidates\": 5\n", + " },\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 1\n", + " }\n", + " },\n", + " \"size\": 3,\n", + " \"aggs\": {\n", + " \"int_count\": {\n", + " \"terms\": {\n", + " \"field\": \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We receive a response with ranked hits and the terms aggregation result.\n", + "Note that _score is null, and we instead use _rank to show our top-ranked documents.\n", + "\n", + "Let’s break down how these hits were ranked.\n", + "We start by running the query and the kNN search separately to collect what their individual hits are.\n", + "\n", + "First, we look at the hits for the query.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"4\",\n", + " \"_score\" : 0.16152832, (1) \n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"text\" : \"rrf rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", (2) \n", + " \"_score\" : 0.15876243,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", (3) \n", + " \"_score\" : 0.15350538,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " 
\"_id\" : \"1\", (4)\n", + " \"_score\" : 0.13963442,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " }\n", + "]\n", + "```\n", + "\n", + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 4\n", + "- **(2)** rank 2, `_id` 3\n", + "- **(3)** rank 3, `_id` 2\n", + "- **(4)** rank 4, `_id` 1\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Note that our first hit doesn’t have a value for the vector field.\n", + "\n", + "Now, we look at the results for the kNN search.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", (1)\n", + " \"_score\" : 1.0,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", (2)\n", + " \"_score\" : 0.5,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"1\", (3)\n", + " \"_score\" : 0.2,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"5\", (4)\n", + " \"_score\" : 0.1,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [0]\n", + " }\n", + " }\n", + "]\n", + "```\n", + "\n", + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 3\n", + "- **(2)** rank 2, `_id` 2\n", + "- **(3)** rank 3, `_id` 1\n", + "- **(4)** rank 4, `_id` 5\n", + "\n", + "\n", + "We can now take the two individually ranked result sets and apply the RRF formula to them to get our final ranking.\n", + "\n", + "```python\n", + "# doc | query | knn | score\n", + 
"_id: 1 = 1.0/(1+4) + 1.0/(1+3) = 0.4500\n", + "_id: 2 = 1.0/(1+3) + 1.0/(1+2) = 0.5833\n", + "_id: 3 = 1.0/(1+2) + 1.0/(1+1) = 0.8333\n", + "_id: 4 = 1.0/(1+1) = 0.5000\n", + "_id: 5 = 1.0/(1+4) = 0.2000\n", + "```\n", + "\n", + "We rank the documents based on the RRF formula with a `window_size` of `5`\n", + "truncating the bottom `2` docs in our RRF result set with a `size` of `3`.\n", + "\n", + "We end up with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and\n", + "`_id: 4` as `_rank: 3`.\n", + "\n", + "This ranking matches the result set from the\n", + "original RRF search as expected." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}