From 8a7d66ba1967e2dbb26bc71baa461fb4348016b9 Mon Sep 17 00:00:00 2001 From: Liam Thompson Date: Fri, 7 Jul 2023 11:27:47 +0200 Subject: [PATCH 1/3] Add RRF example, plus toy example walkthru --- .../search/02-hybrid-search-with-rrf.ipynb | 956 ++++++++++++++++++ 1 file changed, 956 insertions(+) diff --git a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb index e69de29b..fa2c53d1 100644 --- a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb +++ b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb @@ -0,0 +1,956 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Hybrid Search using RRF\n", + "\n", + "In this example we'll use the reciprocal rank fusion algorithm to combine the results of BM25 and kNN semantic search.\n", + "We'll use the same dataset we used in our [quickstart](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) guide." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Y01AXpELkygt" + }, + "source": [ + "# 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment with minimum **4GB machine learning node**\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html) model installed on your Elastic deployment\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "N4pI1-eIvWrI" + }, + "source": [ + "# Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Under **Advanced settings**, go to **Machine Learning instances**\n", + " - You'll need at least **4GB** RAM per zone for this tutorial\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages and initialize the Elasticsearch Python client\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the packages we need for this example." 
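The cell that follows clones the `elasticsearch-py` repository and installs the client from source at tag v8.8.2 (note that the `!{sys.executable}` line in that cell needs `import sys` to have run first). If the released PyPI packages are sufficient for you, here is a minimal alternative sketch using the `%pip` magic:

```python
# A minimal alternative sketch, assuming the released PyPI client is sufficient
# (the notebook itself installs the 8.8.2 client from source instead).
%pip install "elasticsearch>=8.8,<9" sentence_transformers torch
```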
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "fatal: destination path 'elasticsearch-py' already exists and is not an empty directory.\n", + "/Users/liamthompson/notebook-tests/elasticsearch-py\n", + "HEAD is now at 825e642b Bumps 8.8 to 8.8.2\n", + "zsh:1: parse error near `-m'\n", + "Requirement already satisfied: sentence_transformers in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (2.2.2)\n", + "Requirement already satisfied: transformers<5.0.0,>=4.6.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (4.30.2)\n", + "Requirement already satisfied: tqdm in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (4.65.0)\n", + "Requirement already satisfied: torch>=1.6.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (2.0.1)\n", + "Requirement already satisfied: torchvision in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.15.2)\n", + "Requirement already satisfied: numpy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.25.0)\n", + "Requirement already satisfied: scikit-learn in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.3.0)\n", + "Requirement already satisfied: scipy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.11.1)\n", + "Requirement already satisfied: nltk in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (3.8.1)\n", + "Requirement already satisfied: sentencepiece in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.1.99)\n", + "Requirement already satisfied: huggingface-hub>=0.4.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.15.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (6.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (4.6.3)\n", + "Requirement already satisfied: packaging>=20.9 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (23.1)\n", + "Requirement already satisfied: requests in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (2.31.0)\n", + "Requirement already satisfied: fsspec in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (2023.6.0)\n", + "Requirement already satisfied: filelock in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (3.12.2)\n", + "Requirement already satisfied: jinja2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from 
torch>=1.6.0->sentence_transformers) (3.1.2)\n", + "Requirement already satisfied: networkx in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch>=1.6.0->sentence_transformers) (3.1)\n", + "Requirement already satisfied: sympy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch>=1.6.0->sentence_transformers) (1.12)\n", + "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (0.13.3)\n", + "Requirement already satisfied: regex!=2019.12.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (2023.6.3)\n", + "Requirement already satisfied: safetensors>=0.3.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (0.3.1)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from jinja2->torch>=1.6.0->sentence_transformers) (2.1.3)\n", + "Requirement already satisfied: click in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from nltk->sentence_transformers) (8.1.3)\n", + "Requirement already satisfied: joblib in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from nltk->sentence_transformers) (1.3.1)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (1.26.16)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (2023.5.7)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (3.1.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (3.4)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from scikit-learn->sentence_transformers) (3.1.0)\n", + "Requirement already satisfied: mpmath>=0.19 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sympy->torch>=1.6.0->sentence_transformers) (1.3.0)\n", + "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torchvision->sentence_transformers) (10.0.0)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n", + "Requirement already satisfied: torch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (2.0.1)\n", + "Requirement already satisfied: jinja2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (3.1.2)\n", + "Requirement already satisfied: networkx in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (3.1)\n", + "Requirement already satisfied: filelock in 
/Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (3.12.2)\n", + "Requirement already satisfied: typing-extensions in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (4.6.3)\n", + "Requirement already satisfied: sympy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (1.12)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from jinja2->torch) (2.1.3)\n", + "Requirement already satisfied: mpmath>=0.19 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sympy->torch) (1.3.0)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" + ] + } + ], + "source": [ + "!git clone https://github.com/elastic/elasticsearch-py.git\n", + "%cd elasticsearch-py\n", + "!git checkout v8.8.2\n", + "!{sys.executable} -m pip install .\n", + "!pip install sentence_transformers\n", + "!pip install torch\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gEzq2Z1wBs3M" + }, + "source": [ + "[TODO: Update]\n", + "Next we need to import the `elasticsearch` module and the `getpass` module.\n", + "`getpass` is part of the Python standard library and is used to securely prompt for credentials." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "uP_GTVRi-d96" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "from sentence_transformers import SentenceTransformer\n", + "import torch\n", + "\n", + "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", + "\n", + "model = SentenceTransformer('all-MiniLM-L6-v2', device=device)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "AMSePFiZCRqX" + }, + "source": [ + "Now we can instantiate the Python Elasticsearch client.\n", + "First we prompt the user for their password and Cloud ID.\n", + "\n", + "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory.\n", + "\n", + "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class." 
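The next cell authenticates with the `elastic` user's password. If you prefer not to use that password, a hedged sketch of the same connection using an API key instead (the prompted values are placeholders you create in your own deployment; see the Python client documentation linked further below):

```python
# Sketch: connect to Elastic Cloud with an API key instead of the elastic user's password.
from elasticsearch import Elasticsearch
import getpass

client = Elasticsearch(
    cloud_id=getpass.getpass("Enter Elastic Cloud ID: "),
    api_key=getpass.getpass("Enter Elastic API key: "),
)
```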
+ ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h0MdAZ53CdKL", + "outputId": "96ea6f81-f935-4d51-c4a7-af5a896180f1" + }, + "outputs": [], + "source": [ + "# Found in the 'Manage Deployment' page\n", + "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n", + "\n", + "# Password for the 'elastic' user generated by Elasticsearch\n", + "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(\n", + " cloud_id=CLOUD_ID,\n", + " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "bRHbecNeEDL3" + }, + "source": [ + "Confirm that the client has connected with this test" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rdiUKqZbEKfF", + "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'name': 'instance-0000000000', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n" + ] + } + ], + "source": [ + "print(client.info())" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "enHQuT57DhD1" + }, + "source": [ + "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n", + "\n", + "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TF_wxIAhD07a" + }, + "source": [ + "# Create Elasticsearch index with required mappings\n", + "\n", + "We need to add a field to support dense vector storage and search.\n", + "Note the `title_vector` field below, which is used to store the dense vector representation of the `title` field." + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cvYECABJJs_2", + "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/1628078329.py:22: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", + " client.indices.create(index='rrf_book_index', body=mapping)\n" + ] + }, + { + "ename": "BadRequestError", + "evalue": "BadRequestError(400, 'resource_already_exists_exception', 'index [rrf_book_index/Ip8zitwhSMe0OJtEwpuqzQ] already exists')", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mBadRequestError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[59], line 22\u001b[0m\n\u001b[1;32m 2\u001b[0m mapping \u001b[39m=\u001b[39m {\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mmappings\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[1;32m 4\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mproperties\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 18\u001b[0m }\n\u001b[1;32m 19\u001b[0m }\n\u001b[1;32m 21\u001b[0m \u001b[39m# Create the index\u001b[39;00m\n\u001b[0;32m---> 22\u001b[0m client\u001b[39m.\u001b[39;49mindices\u001b[39m.\u001b[39;49mcreate(index\u001b[39m=\u001b[39;49m\u001b[39m'\u001b[39;49m\u001b[39mrrf_book_index\u001b[39;49m\u001b[39m'\u001b[39;49m, body\u001b[39m=\u001b[39;49mmapping)\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/indices.py:517\u001b[0m, in \u001b[0;36mcreate\u001b[0;34m(self, index, aliases, error_trace, filter_path, human, mappings, master_timeout, pretty, settings, timeout, wait_for_active_shards)\u001b[0m\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py:389\u001b[0m, in \u001b[0;36mNamespacedClient.perform_request\u001b[0;34m(self, method, path, params, headers, body)\u001b[0m\n\u001b[1;32m 378\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mperform_request\u001b[39m(\n\u001b[1;32m 379\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 380\u001b[0m method: \u001b[39mstr\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 387\u001b[0m \u001b[39m# Use the internal clients .perform_request() implementation\u001b[39;00m\n\u001b[1;32m 388\u001b[0m \u001b[39m# so we take advantage of their transport options.\u001b[39;00m\n\u001b[0;32m--> 389\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_client\u001b[39m.\u001b[39;49mperform_request(\n\u001b[1;32m 390\u001b[0m method, path, params\u001b[39m=\u001b[39;49mparams, headers\u001b[39m=\u001b[39;49mheaders, body\u001b[39m=\u001b[39;49mbody\n\u001b[1;32m 391\u001b[0m )\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py:320\u001b[0m, in \u001b[0;36mBaseClient.perform_request\u001b[0;34m(self, method, path, params, headers, body)\u001b[0m\n\u001b[1;32m 317\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mValueError\u001b[39;00m, \u001b[39mKeyError\u001b[39;00m, \u001b[39mTypeError\u001b[39;00m):\n\u001b[1;32m 318\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 320\u001b[0m 
\u001b[39mraise\u001b[39;00m HTTP_EXCEPTIONS\u001b[39m.\u001b[39mget(meta\u001b[39m.\u001b[39mstatus, ApiError)(\n\u001b[1;32m 321\u001b[0m message\u001b[39m=\u001b[39mmessage, meta\u001b[39m=\u001b[39mmeta, body\u001b[39m=\u001b[39mresp_body\n\u001b[1;32m 322\u001b[0m )\n\u001b[1;32m 324\u001b[0m \u001b[39m# 'X-Elastic-Product: Elasticsearch' should be on every 2XX response.\u001b[39;00m\n\u001b[1;32m 325\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_verified_elasticsearch:\n\u001b[1;32m 326\u001b[0m \u001b[39m# If the header is set we mark the server as verified.\u001b[39;00m\n", + "\u001b[0;31mBadRequestError\u001b[0m: BadRequestError(400, 'resource_already_exists_exception', 'index [rrf_book_index/Ip8zitwhSMe0OJtEwpuqzQ] already exists')" + ] + } + ], + "source": [ + "# Define the mapping\n", + "mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"title\": {\"type\": \"text\"},\n", + " \"authors\": {\"type\": \"keyword\"},\n", + " \"summary\": {\"type\": \"text\"},\n", + " \"publish_date\": {\"type\": \"date\"},\n", + " \"num_reviews\": {\"type\": \"integer\"},\n", + " \"publisher\": {\"type\": \"keyword\"},\n", + " \"title_vector\": { \n", + " \"type\": \"dense_vector\", \n", + " \"dims\": 384, \n", + " \"index\": \"true\", \n", + " \"similarity\": \"dot_product\" \n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "# Create the index\n", + "client.indices.create(index='rrf_book_index', body=mapping)\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset\n", + "\n", + "Let's index some data.\n", + "Note that we are embedding the `title` field using the sentence transformer model.\n", + "Once indexed, you'll see that your documents contain a `title_vector` field (`\"type\": \"dense_vector\"`) which contains a vector of floating point values.\n", + "This is the embedding of the `title` field in vector space.\n", + "We'll use this field to perform semantic search using kNN." 
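As a quick illustration of what gets stored in `title_vector`, here is a small sketch that encodes a single title with the model loaded earlier. This is not part of the indexing flow, just a way to inspect the embedding:

```python
# Encode one title with the all-MiniLM-L6-v2 model loaded earlier and inspect it.
vector = model.encode("The Pragmatic Programmer: Your Journey to Mastery").tolist()
print(len(vector))   # 384 dimensions, matching `dims` in the mapping above
print(vector[:5])    # the first few floating point values
```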
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'took': 29, 'errors': False, 'items': [{'index': {'_index': 'rrf_book_index', '_id': '7c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 10, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '7s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 11, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '78-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 12, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8M-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 13, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 14, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 15, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '88-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 16, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9M-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 17, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 18, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 19, '_primary_term': 1, 'status': 201}}]})" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "books = [\n", + " {\n", + " \"title\": \"The Pragmatic Programmer: Your Journey to Mastery\",\n", + " \"authors\": [\"andrew hunt\", \"david thomas\"],\n", + " \"summary\": \"A guide to pragmatic programming for software engineers and developers\",\n", + " \"publish_date\": \"2019-10-29\",\n", + " \"num_reviews\": 30,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"Python Crash Course\",\n", + " \"authors\": [\"eric matthes\"],\n", + " \"summary\": \"A fast-paced, no-nonsense guide to programming in Python\",\n", + " \"publish_date\": \"2019-05-03\",\n", + " \"num_reviews\": 42,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Artificial Intelligence: A Modern Approach\",\n", + " \"authors\": [\"stuart russell\", \"peter norvig\"],\n", + " \"summary\": \"Comprehensive introduction to the theory and practice of artificial intelligence\",\n", + " \"publish_date\": \"2020-04-06\",\n", + " \"num_reviews\": 39,\n", + " \"publisher\": \"pearson\"\n", + " },\n", + " {\n", + " \"title\": \"Clean Code: A Handbook of Agile Software 
Craftsmanship\",\n", + " \"authors\": [\"robert c. martin\"],\n", + " \"summary\": \"A guide to writing code that is easy to read, understand and maintain\",\n", + " \"publish_date\": \"2008-08-11\",\n", + " \"num_reviews\": 55,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"You Don't Know JS: Up & Going\",\n", + " \"authors\": [\"kyle simpson\"],\n", + " \"summary\": \"Introduction to JavaScript and programming as a whole\",\n", + " \"publish_date\": \"2015-03-27\",\n", + " \"num_reviews\": 36,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Eloquent JavaScript\",\n", + " \"authors\": [\"marijn haverbeke\"],\n", + " \"summary\": \"A modern introduction to programming\",\n", + " \"publish_date\": \"2018-12-04\",\n", + " \"num_reviews\": 38,\n", + " \"publisher\": \"no starch press\"\n", + " },\n", + " {\n", + " \"title\": \"Design Patterns: Elements of Reusable Object-Oriented Software\",\n", + " \"authors\": [\"erich gamma\", \"richard helm\", \"ralph johnson\", \"john vlissides\"],\n", + " \"summary\": \"Guide to design patterns that can be used in any object-oriented language\",\n", + " \"publish_date\": \"1994-10-31\",\n", + " \"num_reviews\": 45,\n", + " \"publisher\": \"addison-wesley\"\n", + " },\n", + " {\n", + " \"title\": \"The Clean Coder: A Code of Conduct for Professional Programmers\",\n", + " \"authors\": [\"robert c. martin\"],\n", + " \"summary\": \"A guide to professional conduct in the field of software engineering\",\n", + " \"publish_date\": \"2011-05-13\",\n", + " \"num_reviews\": 20,\n", + " \"publisher\": \"prentice hall\"\n", + " },\n", + " {\n", + " \"title\": \"JavaScript: The Good Parts\",\n", + " \"authors\": [\"douglas crockford\"],\n", + " \"summary\": \"A deep dive into the parts of JavaScript that are essential to writing maintainable code\",\n", + " \"publish_date\": \"2008-05-15\",\n", + " \"num_reviews\": 51,\n", + " \"publisher\": \"oreilly\"\n", + " },\n", + " {\n", + " \"title\": \"Introduction to the Theory of Computation\",\n", + " \"authors\": [\"michael sipser\"],\n", + " \"summary\": \"Introduction to the theory of computation and complexity theory\",\n", + " \"publish_date\": \"2012-06-27\",\n", + " \"num_reviews\": 33,\n", + " \"publisher\": \"cengage learning\"\n", + " },\n", + "]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Index documents\n", + "\n", + "Our dataset is a Python list that contains dictionaries of movie titles and descriptions.\n", + "We'll use the `helpers.bulk` method to index our documents in batches.\n", + "\n", + "The following code iterates over the list of books and creates a list of actions to be performed.\n", + "Each action is a dictionary containing an \"index\" operation on our Elasticsearch index.\n", + "The book's title is encoded using our selected model, and the encoded vector is added to the book document.\n", + "The book document is then added to the list of actions.\n", + "\n", + "Finally, we call the `bulk` method, specifying the index name and the list of actions." 
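Note that the cell below calls `client.bulk` with a hand-built action list rather than `helpers.bulk`. For reference, a minimal sketch of the `helpers.bulk` variant mentioned above, assuming the `client`, `model`, and `books` objects defined in this notebook:

```python
# Equivalent sketch using the helpers.bulk wrapper imported earlier.
from elasticsearch import helpers

def generate_book_actions():
    for book in books:
        doc = dict(book)
        doc["title_vector"] = model.encode(book["title"]).tolist()
        yield {"_index": "rrf_book_index", "_source": doc}

helpers.bulk(client, generate_book_actions())
```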
+ ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'took': 25, 'errors': False, 'items': [{'index': {'_index': 'rrf_book_index', '_id': 'KM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 30, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Kc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 31, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Ks-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 32, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'K8-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 33, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'LM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 34, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Lc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 35, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Ls-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 36, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'L8-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 37, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'MM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 38, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Mc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 39, '_primary_term': 1, 'status': 201}}]})" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "actions = []\n", + "for book in books:\n", + " actions.append({\"index\": {\"_index\": \"rrf_book_index\"}})\n", + " titleEmbedding = model.encode(book[\"title\"]).tolist()\n", + " book[\"title_vector\"] = titleEmbedding\n", + " actions.append(book)\n", + "\n", + "client.bulk(index=\"rrf_book_index\", operations=actions)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "WgWDMgf9NkHL" + }, + "source": [ + "## Pretty printing Elasticsearch responses\n", + "\n", + "This is a helper function to print Elasticsearch responses in a readable format." 
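Once the helper in the next cell is defined, it can be applied to any search response against `rrf_book_index`; a usage sketch (the query here is only an illustration):

```python
# Usage sketch: run a simple match query and print the hits with the helper.
response = client.search(index="rrf_book_index", query={"match": {"summary": "programming"}})
pretty_response(response)
```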
+ ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "def pretty_response(response):\n", + " for hit in response['hits']['hits']:\n", + " id = hit['_id']\n", + " publication_date = hit['_source']['publish_date']\n", + " score = hit['_score']\n", + " title = hit['_source']['title']\n", + " summary = hit['_source']['summary']\n", + " pretty_output = (f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nScore: {score}\")\n", + " print(pretty_output)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "MrBCHdH1u8Wd" + }, + "source": [ + "# Hybrid search using RRF\n", + "\n", + "## RRF overview\n", + "\n", + "[Reciprocal Rank Fusion (RRF)](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) is a state-of-the-art ranking algorithm for combining results from different information retrieval strategies.\n", + "RRF consistently improves the combined results of different search algorithms.\n", + "It outperforms all other ranking algorithms, and often surpasses the best individual results, without calibration.\n", + "In brief, it enables best-in-class hybrid search out of the box.\n", + "\n", + "## How RRF works in Elasticsearch\n", + "\n", + "You can use RRF as part of a search to combine and rank documents using result sets from a combination of query and/or knn searches.\n", + "A minimum of 2 results sets is required for ranking from the specified sources.\n", + "Check out the [RRF API reference](https://www.elastic.co/guide/en/elasticsearch/reference/master/rrf.html#rrf-api) for full details information.\n", + "\n", + "In the following example, we'll use RRF to combine the results of a `match` query and a kNN semantic search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/2934485565.py:22: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", + " response = client.search(index=\"rrf_book_index\", body=body)\n" + ] + }, + { + "ename": "TypeError", + "evalue": "search() got an unexpected keyword argument 'rank'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[76], line 22\u001b[0m\n\u001b[1;32m 1\u001b[0m body \u001b[39m=\u001b[39m {\n\u001b[1;32m 2\u001b[0m \u001b[39m\"\u001b[39m\u001b[39msize\u001b[39m\u001b[39m\"\u001b[39m: \u001b[39m5\u001b[39m,\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mquery\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 19\u001b[0m }\n\u001b[1;32m 20\u001b[0m }\n\u001b[0;32m---> 22\u001b[0m response \u001b[39m=\u001b[39m client\u001b[39m.\u001b[39;49msearch(index\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mrrf_book_index\u001b[39;49m\u001b[39m\"\u001b[39;49m, body\u001b[39m=\u001b[39;49mbody)\n\u001b[1;32m 24\u001b[0m \u001b[39mprint\u001b[39m(response)\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", + "\u001b[0;31mTypeError\u001b[0m: search() got an unexpected keyword argument 'rank'" + ] + } + ], + "source": [ + "body = {\n", + " \"size\": 5,\n", + " \"query\": {\n", + " \"match\": {\n", + " \"summary\": \"shoes\"\n", + " },\n", + " \n", + " },\n", + " \"knn\": {\n", + " \"field\": \"title_vector\",\n", + " \"query_vector\" : model.encode(\"python programming\").tolist(), # generate embedding for query so it can be compared to `title_vector`\n", + " \"k\": 5,\n", + " \"num_candidates\": 10},\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 20\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"rrf_book_index\", body=body)\n", + "\n", + "print(response)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example, we first execute the kNN search to get its global top 5 results.\n", + "Then we execute the match query to get its global top 5 results.\n", + "Then we combine the knn search and match query results and rank them based on the RRF method to get the final top 2 results.\n", + "\n", + "ℹ️ Note that if `k` from a knn search is larger than `window_size`, the results are truncated to `window_size`.\n", + "If `k` is smaller than `window_size`, the results will be `k` size." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RRF toy example\n", + "\n", + "This very simple example demonstrates how RRF ranks documents from different search strategies.\n", + "We begin by creating a mapping for an index with a text field, a vector field, and an integer field along with indexing several documents. For this example we are going to use a vector with only a single dimension to make the ranking easier to explain." 
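The walkthrough below applies the RRF scoring rule by hand. As a compact reference, here is a small sketch of that rule (the standard formulation: a document's score is the sum of reciprocal ranks across the result sets that returned it):

```python
# RRF score of a document: sum over result sets of 1 / (rank_constant + rank),
# where `ranks` are the document's 1-based positions in the sets that returned it.
# rank_constant defaults to 1 here only because the toy example below uses 1.
def rrf_score(ranks, rank_constant=1):
    return sum(1.0 / (rank_constant + r) for r in ranks)
```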
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "body = {\n", + " \"mappings\": {\n", + " \"properties\": {\n", + " \"text\" : {\n", + " \"type\" : \"text\"\n", + " },\n", + " \"vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 1,\n", + " \"similarity\": \"l2_norm\",\n", + " \"index\": \"true\"\n", + "\n", + " },\n", + " \"integer\" : {\n", + " \"type\" : \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "client.indices.create(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next let's index some documents." + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'took': 7, 'errors': False, 'items': [{'index': {'_index': 'example-index', '_id': 'UM8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'Uc8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'Us8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'U88cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'VM8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 4, '_primary_term': 1, 'status': 201}}]})" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "doc1 = {\n", + " \"text\" : \"rrf\",\n", + " \"vector\" : [5],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc2 ={\n", + " \"text\" : \"rrf rrf\",\n", + " \"vector\" : [4],\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc3 = {\n", + " \"text\" : \"rrf rrf rrf\",\n", + " \"vector\" : [3],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "doc4 = {\n", + " \"text\" : \"rrf rrf rrf rrf\",\n", + " \"integer\": 2\n", + "}\n", + "\n", + "doc5 ={\n", + " \"vector\" : [0],\n", + " \"integer\": 1\n", + "}\n", + "\n", + "docs = [doc1, doc2, doc3, doc4, doc5]\n", + "\n", + "actions = []\n", + "for doc in docs:\n", + " actions.append({\"index\": {\"_index\": \"example-index\"}})\n", + " actions.append(doc)\n", + "\n", + "client.bulk(index=\"example-index\", operations=actions)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now execute a search using RRF with a query, a kNN search, and a terms aggregation." + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/3671365121.py:29: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", + " response = client.search(index=\"example-index\", body=body)\n" + ] + }, + { + "ename": "TypeError", + "evalue": "search() got an unexpected keyword argument 'rank'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[81], line 29\u001b[0m\n\u001b[1;32m 1\u001b[0m body \u001b[39m=\u001b[39m {\n\u001b[1;32m 2\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mquery\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mterm\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 26\u001b[0m }\n\u001b[1;32m 27\u001b[0m }\n\u001b[0;32m---> 29\u001b[0m response \u001b[39m=\u001b[39m client\u001b[39m.\u001b[39;49msearch(index\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mexample-index\u001b[39;49m\u001b[39m\"\u001b[39;49m, body\u001b[39m=\u001b[39;49mbody)\n", + "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", + "\u001b[0;31mTypeError\u001b[0m: search() got an unexpected keyword argument 'rank'" + ] + } + ], + "source": [ + "body = {\n", + " \"query\": {\n", + " \"term\": {\n", + " \"text\": \"rrf\"\n", + " }\n", + " },\n", + " \"knn\": {\n", + " \"field\": \"vector\",\n", + " \"query_vector\": [3],\n", + " \"k\": 5,\n", + " \"num_candidates\": 5\n", + " },\n", + " \"rank\": {\n", + " \"rrf\": {\n", + " \"window_size\": 5,\n", + " \"rank_constant\": 1\n", + " }\n", + " },\n", + " \"size\": 3,\n", + " \"aggs\": {\n", + " \"int_count\": {\n", + " \"terms\": {\n", + " \"field\": \"integer\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "response = client.search(index=\"example-index\", body=body)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We receive a response with ranked hits and the terms aggregation result.\n", + "Note that _score is null, and we instead use _rank to show our top-ranked documents.\n", + "\n", + "Let’s break down how these hits were ranked.\n", + "We start by running the query and the kNN search separately to collect what their individual hits are.\n", + "\n", + "First, we look at the hits for the query.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"4\",\n", + " \"_score\" : 0.16152832, (1) \n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"text\" : \"rrf rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", (2) \n", + " \"_score\" : 0.15876243,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", (3) \n", + " \"_score\" : 0.15350538,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : 
\"example-index\",\n", + " \"_id\" : \"1\", (4)\n", + " \"_score\" : 0.13963442,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " }\n", + "]\n", + "```\n", + "\n", + "```markdown\n", + "<1> rank 1, `_id` 4\n", + "<2> rank 2, `_id` 3\n", + "<3> rank 3, `_id` 2\n", + "<4> rank 4, `_id` 1\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Note that our first hit doesn’t have a value for the vector field.\n", + "\n", + "Now, we look at the results for the kNN search.\n", + "\n", + "```json\n", + "\"hits\" : [\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"3\", \n", + " \"_score\" : 1.0,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [3],\n", + " \"text\" : \"rrf rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"2\", \n", + " \"_score\" : 0.5,\n", + " \"_source\" : {\n", + " \"integer\" : 2,\n", + " \"vector\" : [4],\n", + " \"text\" : \"rrf rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"1\", \n", + " \"_score\" : 0.2,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [5],\n", + " \"text\" : \"rrf\"\n", + " }\n", + " },\n", + " {\n", + " \"_index\" : \"example-index\",\n", + " \"_id\" : \"5\", \n", + " \"_score\" : 0.1,\n", + " \"_source\" : {\n", + " \"integer\" : 1,\n", + " \"vector\" : [0]\n", + " }\n", + " }\n", + "]```\n", + "\n", + "```markdown\n", + "<1> rank 1, `_id` 3\n", + "<2> rank 2, `_id` 2\n", + "<3> rank 3, `_id` 1\n", + "<4> rank 4, `_id` 5\n", + "```\n", + "\n", + "We can now take the two individually ranked result sets and apply the RRF formula to them to get our final ranking.\n", + "\n", + "```python\n", + "# doc | query | knn | score\n", + "_id: 1 = 1.0/(1+4) + 1.0/(1+3) = 0.4500\n", + "_id: 2 = 1.0/(1+3) + 1.0/(1+2) = 0.5833\n", + "_id: 3 = 1.0/(1+2) + 1.0/(1+1) = 0.8333\n", + "_id: 4 = 1.0/(1+1) = 0.5000\n", + "_id: 5 = 1.0/(1+4) = 0.2000\n", + "```\n", + "\n", + "We rank the documents based on the RRF formula with a `window_size` of `5`\n", + "truncating the bottom `2` docs in our RRF result set with a `size` of `3`.\n", + "We end with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and\n", + "`_id: 4` as `_rank: 3`. This ranking matches the result set from the\n", + "original RRF search as expected." 
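The hand computation above can also be reproduced in a few lines of Python; a small self-contained sketch using the ranks from the two result sets shown earlier and the rank constant of 1 from the search request:

```python
# 1-based ranks of each document in the two individual result sets above.
query_ranks = {"4": 1, "3": 2, "2": 3, "1": 4}
knn_ranks = {"3": 1, "2": 2, "1": 3, "5": 4}

rank_constant = 1
scores = {}
for doc_id in set(query_ranks) | set(knn_ranks):
    score = 0.0
    for ranks in (query_ranks, knn_ranks):
        if doc_id in ranks:
            score += 1.0 / (rank_constant + ranks[doc_id])
    scores[doc_id] = score

# Keep the top `size` (3) documents, as in the search request.
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(f"_id: {doc_id}  score: {score:.4f}")
# Expected: _id 3 (0.8333), _id 2 (0.5833), _id 4 (0.5000)
```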
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 1ff03e4c3957e9f11c45346454c0823f8de70ec0 Mon Sep 17 00:00:00 2001 From: Liam Thompson Date: Fri, 7 Jul 2023 11:40:49 +0200 Subject: [PATCH 2/3] Clear output, add button --- .../search/02-hybrid-search-with-rrf.ipynb | 219 +++--------------- 1 file changed, 26 insertions(+), 193 deletions(-) diff --git a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb index fa2c53d1..312bbdb1 100644 --- a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb +++ b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb @@ -9,8 +9,13 @@ "source": [ "# Hybrid Search using RRF\n", "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/leemthompo/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb)\n", + "\n", "In this example we'll use the reciprocal rank fusion algorithm to combine the results of BM25 and kNN semantic search.\n", - "We'll use the same dataset we used in our [quickstart](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) guide." + "We'll use the same dataset we used in our [quickstart](https://github.com/joemcelroy/elasticsearch-labs/blob/notebooks-guides/colab-notebooks-examples/search/00-quick-start.ipynb) guide.\n", + "You can use RRF for hybrid search out of the box, without any additional configuration.\n", + "\n", + "We also provide a walkthrough of a toy example, which demonstrates how RRF ranking works at a basic level." 
] }, { @@ -65,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -73,63 +78,7 @@ "id": "K9Q1p2C9-wce", "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "fatal: destination path 'elasticsearch-py' already exists and is not an empty directory.\n", - "/Users/liamthompson/notebook-tests/elasticsearch-py\n", - "HEAD is now at 825e642b Bumps 8.8 to 8.8.2\n", - "zsh:1: parse error near `-m'\n", - "Requirement already satisfied: sentence_transformers in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (2.2.2)\n", - "Requirement already satisfied: transformers<5.0.0,>=4.6.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (4.30.2)\n", - "Requirement already satisfied: tqdm in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (4.65.0)\n", - "Requirement already satisfied: torch>=1.6.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (2.0.1)\n", - "Requirement already satisfied: torchvision in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.15.2)\n", - "Requirement already satisfied: numpy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.25.0)\n", - "Requirement already satisfied: scikit-learn in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.3.0)\n", - "Requirement already satisfied: scipy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (1.11.1)\n", - "Requirement already satisfied: nltk in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (3.8.1)\n", - "Requirement already satisfied: sentencepiece in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.1.99)\n", - "Requirement already satisfied: huggingface-hub>=0.4.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sentence_transformers) (0.15.1)\n", - "Requirement already satisfied: pyyaml>=5.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (6.0)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (4.6.3)\n", - "Requirement already satisfied: packaging>=20.9 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (23.1)\n", - "Requirement already satisfied: requests in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (2.31.0)\n", - "Requirement already satisfied: fsspec in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (2023.6.0)\n", - "Requirement already satisfied: filelock in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub>=0.4.0->sentence_transformers) (3.12.2)\n", - "Requirement already satisfied: jinja2 in 
/Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch>=1.6.0->sentence_transformers) (3.1.2)\n", - "Requirement already satisfied: networkx in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch>=1.6.0->sentence_transformers) (3.1)\n", - "Requirement already satisfied: sympy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch>=1.6.0->sentence_transformers) (1.12)\n", - "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (0.13.3)\n", - "Requirement already satisfied: regex!=2019.12.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (2023.6.3)\n", - "Requirement already satisfied: safetensors>=0.3.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from transformers<5.0.0,>=4.6.0->sentence_transformers) (0.3.1)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from jinja2->torch>=1.6.0->sentence_transformers) (2.1.3)\n", - "Requirement already satisfied: click in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from nltk->sentence_transformers) (8.1.3)\n", - "Requirement already satisfied: joblib in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from nltk->sentence_transformers) (1.3.1)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (1.26.16)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (2023.5.7)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (3.1.0)\n", - "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub>=0.4.0->sentence_transformers) (3.4)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from scikit-learn->sentence_transformers) (3.1.0)\n", - "Requirement already satisfied: mpmath>=0.19 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sympy->torch>=1.6.0->sentence_transformers) (1.3.0)\n", - "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torchvision->sentence_transformers) (10.0.0)\n", - "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", - "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n", - "Requirement already satisfied: torch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (2.0.1)\n", - "Requirement already satisfied: jinja2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (3.1.2)\n", - "Requirement already satisfied: networkx in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages 
(from torch) (3.1)\n", - "Requirement already satisfied: filelock in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (3.12.2)\n", - "Requirement already satisfied: typing-extensions in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (4.6.3)\n", - "Requirement already satisfied: sympy in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from torch) (1.12)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from jinja2->torch) (2.1.3)\n", - "Requirement already satisfied: mpmath>=0.19 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from sympy->torch) (1.3.0)\n", - "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", - "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" - ] - } - ], + "outputs": [], "source": [ "!git clone https://github.com/elastic/elasticsearch-py.git\n", "%cd elasticsearch-py\n", @@ -153,20 +102,11 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "id": "uP_GTVRi-d96" }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], + "outputs": [], "source": [ "from elasticsearch import Elasticsearch, helpers\n", "from urllib.request import urlopen\n", @@ -196,7 +136,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -231,7 +171,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -239,15 +179,7 @@ "id": "rdiUKqZbEKfF", "outputId": "43b6f1cd-a43e-4dbe-caa5-7fd170464881" }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'name': 'instance-0000000000', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n" - ] - } - ], + "outputs": [], "source": [ "print(client.info())" ] @@ -279,7 +211,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -287,31 +219,7 @@ "id": "cvYECABJJs_2", "outputId": "18fb51e4-c4f6-4d1b-cb2d-bc6f8ec1aa84" }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/1628078329.py:22: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", - " client.indices.create(index='rrf_book_index', body=mapping)\n" - ] - }, - { - "ename": "BadRequestError", - "evalue": "BadRequestError(400, 'resource_already_exists_exception', 'index [rrf_book_index/Ip8zitwhSMe0OJtEwpuqzQ] already exists')", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mBadRequestError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[59], line 22\u001b[0m\n\u001b[1;32m 2\u001b[0m mapping \u001b[39m=\u001b[39m {\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mmappings\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[1;32m 4\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mproperties\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 18\u001b[0m }\n\u001b[1;32m 19\u001b[0m }\n\u001b[1;32m 21\u001b[0m \u001b[39m# Create the index\u001b[39;00m\n\u001b[0;32m---> 22\u001b[0m client\u001b[39m.\u001b[39;49mindices\u001b[39m.\u001b[39;49mcreate(index\u001b[39m=\u001b[39;49m\u001b[39m'\u001b[39;49m\u001b[39mrrf_book_index\u001b[39;49m\u001b[39m'\u001b[39;49m, body\u001b[39m=\u001b[39;49mmapping)\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/indices.py:517\u001b[0m, in \u001b[0;36mcreate\u001b[0;34m(self, index, aliases, error_trace, filter_path, human, mappings, master_timeout, pretty, settings, timeout, wait_for_active_shards)\u001b[0m\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py:389\u001b[0m, in \u001b[0;36mNamespacedClient.perform_request\u001b[0;34m(self, method, path, params, headers, body)\u001b[0m\n\u001b[1;32m 378\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mperform_request\u001b[39m(\n\u001b[1;32m 379\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 380\u001b[0m method: \u001b[39mstr\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 387\u001b[0m \u001b[39m# Use the internal clients .perform_request() implementation\u001b[39;00m\n\u001b[1;32m 388\u001b[0m \u001b[39m# so we take advantage of their transport options.\u001b[39;00m\n\u001b[0;32m--> 389\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_client\u001b[39m.\u001b[39;49mperform_request(\n\u001b[1;32m 390\u001b[0m method, path, params\u001b[39m=\u001b[39;49mparams, headers\u001b[39m=\u001b[39;49mheaders, body\u001b[39m=\u001b[39;49mbody\n\u001b[1;32m 391\u001b[0m )\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py:320\u001b[0m, in \u001b[0;36mBaseClient.perform_request\u001b[0;34m(self, method, path, params, headers, body)\u001b[0m\n\u001b[1;32m 317\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mValueError\u001b[39;00m, \u001b[39mKeyError\u001b[39;00m, \u001b[39mTypeError\u001b[39;00m):\n\u001b[1;32m 318\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 320\u001b[0m 
\u001b[39mraise\u001b[39;00m HTTP_EXCEPTIONS\u001b[39m.\u001b[39mget(meta\u001b[39m.\u001b[39mstatus, ApiError)(\n\u001b[1;32m 321\u001b[0m message\u001b[39m=\u001b[39mmessage, meta\u001b[39m=\u001b[39mmeta, body\u001b[39m=\u001b[39mresp_body\n\u001b[1;32m 322\u001b[0m )\n\u001b[1;32m 324\u001b[0m \u001b[39m# 'X-Elastic-Product: Elasticsearch' should be on every 2XX response.\u001b[39;00m\n\u001b[1;32m 325\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_verified_elasticsearch:\n\u001b[1;32m 326\u001b[0m \u001b[39m# If the header is set we mark the server as verified.\u001b[39;00m\n", - "\u001b[0;31mBadRequestError\u001b[0m: BadRequestError(400, 'resource_already_exists_exception', 'index [rrf_book_index/Ip8zitwhSMe0OJtEwpuqzQ] already exists')" - ] - } - ], + "outputs": [], "source": [ "# Define the mapping\n", "mapping = {\n", @@ -353,20 +261,9 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "ObjectApiResponse({'took': 29, 'errors': False, 'items': [{'index': {'_index': 'rrf_book_index', '_id': '7c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 10, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '7s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 11, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '78-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 12, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8M-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 13, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 14, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '8s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 15, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '88-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 16, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9M-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 17, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9c-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 18, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': '9s-QKokBaD3r4jKCZkdN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 19, '_primary_term': 1, 'status': 201}}]})" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "books = [\n", " {\n", @@ -472,20 +369,9 @@ }, { "cell_type": "code", - "execution_count": 70, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - 
"text/plain": [ - "ObjectApiResponse({'took': 25, 'errors': False, 'items': [{'index': {'_index': 'rrf_book_index', '_id': 'KM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 30, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Kc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 31, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Ks-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 32, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'K8-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 33, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'LM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 34, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Lc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 35, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Ls-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 36, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'L8-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 37, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'MM-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 38, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'rrf_book_index', '_id': 'Mc-gK4kBaD3r4jKC2Ejk', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 39, '_primary_term': 1, 'status': 201}}]})" - ] - }, - "execution_count": 70, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "actions = []\n", "for book in books:\n", @@ -511,7 +397,7 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -553,30 +439,9 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/2934485565.py:22: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. 
Instead use individual parameters.\n", - " response = client.search(index=\"rrf_book_index\", body=body)\n" - ] - }, - { - "ename": "TypeError", - "evalue": "search() got an unexpected keyword argument 'rank'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[76], line 22\u001b[0m\n\u001b[1;32m 1\u001b[0m body \u001b[39m=\u001b[39m {\n\u001b[1;32m 2\u001b[0m \u001b[39m\"\u001b[39m\u001b[39msize\u001b[39m\u001b[39m\"\u001b[39m: \u001b[39m5\u001b[39m,\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mquery\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 19\u001b[0m }\n\u001b[1;32m 20\u001b[0m }\n\u001b[0;32m---> 22\u001b[0m response \u001b[39m=\u001b[39m client\u001b[39m.\u001b[39;49msearch(index\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mrrf_book_index\u001b[39;49m\u001b[39m\"\u001b[39;49m, body\u001b[39m=\u001b[39;49mbody)\n\u001b[1;32m 24\u001b[0m \u001b[39mprint\u001b[39m(response)\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", - "\u001b[0;31mTypeError\u001b[0m: search() got an unexpected keyword argument 'rank'" - ] - } - ], + "outputs": [], "source": [ "body = {\n", " \"size\": 5,\n", @@ -667,20 +532,9 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "ObjectApiResponse({'took': 7, 'errors': False, 'items': [{'index': {'_index': 'example-index', '_id': 'UM8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'Uc8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'Us8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'U88cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'example-index', '_id': 'VM8cLYkBaD3r4jKCTUjQ', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 4, '_primary_term': 1, 'status': 201}}]})" - ] - }, - "execution_count": 80, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "doc1 = {\n", " \"text\" : \"rrf\",\n", @@ -730,30 +584,9 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_2383/3671365121.py:29: 
DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use individual parameters.\n", - " response = client.search(index=\"example-index\", body=body)\n" - ] - }, - { - "ename": "TypeError", - "evalue": "search() got an unexpected keyword argument 'rank'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[81], line 29\u001b[0m\n\u001b[1;32m 1\u001b[0m body \u001b[39m=\u001b[39m {\n\u001b[1;32m 2\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mquery\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[1;32m 3\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mterm\u001b[39m\u001b[39m\"\u001b[39m: {\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 26\u001b[0m }\n\u001b[1;32m 27\u001b[0m }\n\u001b[0;32m---> 29\u001b[0m response \u001b[39m=\u001b[39m client\u001b[39m.\u001b[39;49msearch(index\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mexample-index\u001b[39;49m\u001b[39m\"\u001b[39;49m, body\u001b[39m=\u001b[39;49mbody)\n", - "File \u001b[0;32m~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py:414\u001b[0m, in \u001b[0;36m_rewrite_parameters..wrapper..wrapped\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mKeyError\u001b[39;00m:\n\u001b[1;32m 412\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[0;32m--> 414\u001b[0m \u001b[39mreturn\u001b[39;00m api(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", - "\u001b[0;31mTypeError\u001b[0m: search() got an unexpected keyword argument 'rank'" - ] - } - ], + "outputs": [], "source": [ "body = {\n", " \"query\": {\n", From 866227e01085fad648d1d4f56923fc8e2d383f78 Mon Sep 17 00:00:00 2001 From: Liam Thompson Date: Fri, 7 Jul 2023 12:00:13 +0200 Subject: [PATCH 3/3] Cleanup --- .../search/02-hybrid-search-with-rrf.ipynb | 43 +++++++++++-------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb index 312bbdb1..ba55bdcf 100644 --- a/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb +++ b/colab-notebooks-examples/search/02-hybrid-search-with-rrf.ipynb @@ -676,12 +676,12 @@ "]\n", "```\n", "\n", - "```markdown\n", - "<1> rank 1, `_id` 4\n", - "<2> rank 2, `_id` 3\n", - "<3> rank 3, `_id` 2\n", - "<4> rank 4, `_id` 1\n", - "```" + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 4\n", + "- **(2)** rank 2, `_id` 3\n", + "- **(3)** rank 3, `_id` 2\n", + "- **(4)** rank 4, `_id` 1\n" ] }, { @@ -698,7 +698,7 @@ "\"hits\" : [\n", " {\n", " \"_index\" : \"example-index\",\n", - " \"_id\" : \"3\", \n", + " \"_id\" : \"3\", (1)\n", " \"_score\" : 1.0,\n", " \"_source\" : {\n", " \"integer\" : 1,\n", @@ -708,7 +708,7 @@ " },\n", " {\n", " \"_index\" : \"example-index\",\n", - " \"_id\" : \"2\", \n", + " \"_id\" : \"2\", (2)\n", " \"_score\" : 0.5,\n", " \"_source\" : {\n", " \"integer\" : 2,\n", @@ -718,7 +718,7 @@ " },\n", " {\n", " \"_index\" : \"example-index\",\n", - " \"_id\" : \"1\", \n", + " \"_id\" : \"1\", (3)\n", " \"_score\" : 0.2,\n", " \"_source\" : {\n", " \"integer\" : 1,\n", @@ -728,22 +728,24 @@ " },\n", " {\n", " \"_index\" : \"example-index\",\n", - " \"_id\" : \"5\", \n", + " \"_id\" : \"5\", (4)\n", " \"_score\" : 
0.1,\n", " \"_source\" : {\n", " \"integer\" : 1,\n", " \"vector\" : [0]\n", " }\n", " }\n", - "]```\n", - "\n", - "```markdown\n", - "<1> rank 1, `_id` 3\n", - "<2> rank 2, `_id` 2\n", - "<3> rank 3, `_id` 1\n", - "<4> rank 4, `_id` 5\n", + "]\n", "```\n", "\n", + "Note the following information about the hits:\n", + "\n", + "- **(1)** rank 1, `_id` 3\n", + "- **(2)** rank 2, `_id` 2\n", + "- **(3)** rank 3, `_id` 1\n", + "- **(4)** rank 4, `_id` 5\n", + "\n", + "\n", "We can now take the two individually ranked result sets and apply the RRF formula to them to get our final ranking.\n", "\n", "```python\n", @@ -757,8 +759,11 @@ "\n", "We rank the documents based on the RRF formula with a `window_size` of `5`\n", "truncating the bottom `2` docs in our RRF result set with a `size` of `3`.\n", - "We end with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and\n", - "`_id: 4` as `_rank: 3`. This ranking matches the result set from the\n", + "\n", + "We end up with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and\n", + "`_id: 4` as `_rank: 3`.\n", + "\n", + "This ranking matches the result set from the\n", "original RRF search as expected." ] }
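To make the arithmetic in this walkthrough easy to verify, here is a small standalone sketch that recomputes the fused ranking from the two result lists above. It assumes a `rank_constant` of `1` (so each list contributes `1 / (1 + rank)` per document), which is the value that reproduces the final ordering stated above; the function and variable names are illustrative and not part of the notebook's API.

```python
# Minimal sketch of the RRF fusion step for the toy example above.
# Assumption: rank_constant = 1, so each ranking contributes 1 / (1 + rank).
# The two ranked _id lists mirror the query hits and kNN hits shown earlier.

query_ranking = ["4", "3", "2", "1"]  # ranks 1..4 from the term query
knn_ranking = ["3", "2", "1", "5"]    # ranks 1..4 from the kNN search


def rrf_fuse(rankings, rank_constant=1, window_size=5, size=3):
    """Combine several ranked _id lists with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        # Only the top `window_size` hits of each list take part in fusion.
        for rank, doc_id in enumerate(ranking[:window_size], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Sort by fused score and keep the top `size` documents.
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:size]


for rank, (doc_id, score) in enumerate(rrf_fuse([query_ranking, knn_ranking]), start=1):
    print(f"_rank: {rank}  _id: {doc_id}  rrf score: {score:.4f}")

# Expected output:
# _rank: 1  _id: 3  rrf score: 0.8333
# _rank: 2  _id: 2  rrf score: 0.5833
# _rank: 3  _id: 4  rrf score: 0.5000
```

Under these assumptions the sketch yields `_id: 3`, `_id: 2`, `_id: 4` as the top three documents, matching the ordering described in the walkthrough.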