<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/qdrant_arize.png" width="500"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tuning a RAG Pipeline using Qdrant and Arize Phoenix</h1>

ℹ️ This notebook requires an OpenAI API key.

### **1. Import Relevant Packages**

In [421]:
import os

# Setup projects
SIMPLE_RAG_PROJECT = "simple-rag"
HYBRID_RAG_PROJECT = "hybrid-rag"
os.environ["PHOENIX_PROJECT_NAME"] = SIMPLE_RAG_PROJECT

In [422]:
import datetime
import json
import os
import pickle
import ssl
import time
import urllib
from getpass import getpass
from urllib.request import urlopen

import certifi
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
import requests
from bs4 import BeautifulSoup
from llama_index.core import (
    ServiceContext, StorageContext, download_loader,
    load_index_from_storage, set_global_handler
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.graph_stores.simple import SimpleGraphStore
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator, OpenAIModel, QAEvaluator,
    RelevanceEvaluator, run_evals
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import PointStruct

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", 1000)

### **2. Launch Phoenix**
You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the OpenInferenceTraceCallbackHandler. Phoenix supports LlamaIndex's one-click observability which will automatically instrument your LlamaIndex application! You can consult our integration guide for a more detailed explanation of how to instrument your LlamaIndex application.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run the LlamaIndex application).

In [423]:
session = px.launch_app()

Existing running Phoenix instance detected! Shutting it down and starting a new instance...


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Be sure to enable Phoenix as your global handler for tracing!

In [424]:
set_global_handler("arize_phoenix")

### **3. Set up your OpenAI API key**

In [425]:
from dotenv import load_dotenv
load_dotenv()

True

In [426]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

### **4. Retrieve the documents / dataset to be used**

In [238]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("atitaarora/qdrant_doc", split="train")

In [239]:
dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='csv', dataset_name='qdrant_doc', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1767967, num_examples=240, shard_lengths=None, dataset_name='qdrant_doc')}, download_checksums={'hf://datasets/atitaarora/qdrant_doc@8d859890840f65337c38e96d660b81b1441bbecd/documents.csv': {'num_bytes': 1777260, 'checksum': None}}, download_size=1777260, post_processing_size=None, dataset_size=1767967, size_in_bytes=3545227)

### **5. Definition of global chunk properties and chunk processing**
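Each document is split with the chosen text-splitting algorithm using the global **CHUNK_SIZE** and **CHUNK_OVERLAP** settings below. With the character-based splitter used in the next step, both values are measured in characters.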

In [240]:
## Global config for chunk processing
CHUNK_SIZE = 512 #1000
CHUNK_OVERLAP = 50
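
To make the two settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the helper `sliding_window_chunks` is hypothetical and is not what the notebook uses (the next step relies on LangChain's `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph breaks). The sample string and the small sizes are assumptions chosen so the overlap is easy to see.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is the role `CHUNK_OVERLAP` plays at a larger scale: context near a chunk boundary is available in both neighbouring chunks.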

### **6. Process the dataset as LangChain (or LlamaIndex) documents for further processing**

In [241]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
from llama_index.core import Document

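## Split the dataset into chunks and convert them to LlamaIndex documents

def process_document_chunks(dataset, chunk_size, chunk_overlap):
    """Split each dataset row into overlapping chunks and return them as LlamaIndex Documents."""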
    langchain_docs = [
        LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
        for doc in tqdm(dataset)
    ]

    # could showcase another variation of processed documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in langchain_docs:
        docs_processed += text_splitter.split_documents([doc])

    ## Converting Langchain document chunks above into Llamaindex Document for ingestion
    llama_documents = [
        Document.from_langchain_format(doc)
        for doc in docs_processed
    ]
    return llama_documents

In [242]:
documents = process_document_chunks(dataset, CHUNK_SIZE, CHUNK_OVERLAP)
len(documents)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 240/240 [00:00<00:00, 14744.66it/s]


4431
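
When tuning chunk settings, it can help to look at how long the resulting chunks actually are, since the splitter only treats `CHUNK_SIZE` as an upper bound. The sketch below is one way to do that with the standard library; `summarize_chunk_lengths` is a hypothetical helper and the example strings are made up.

```python
from statistics import mean, median
from typing import Dict, Iterable


def summarize_chunk_lengths(texts: Iterable[str]) -> Dict[str, float]:
    """Return basic length statistics (in characters) for a collection of chunks."""
    lengths = [len(text) for text in texts]
    if not lengths:
        raise ValueError("no chunks to summarize")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Made-up chunks standing in for the notebook's documents
example_chunks = ["a" * 512, "b" * 300, "c" * 40]
stats = summarize_chunk_lengths(example_chunks)
print(stats)
```

In this notebook the same helper could be called as `summarize_chunk_lengths(doc.text for doc in documents)`. Many very short chunks may suggest that the separators are splitting the text too aggressively.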

### **7. Setting up Qdrant and Collection**

We first set up the Qdrant client and then choose a collection in which to store our data.

In [427]:
## Uncomment to initialise the Qdrant client in memory
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)

## Connect to Qdrant Cloud (reads QDRANT_URL and QDRANT_API_KEY from the environment)
client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)

## Uncomment below to connect to a local Qdrant instance instead
#client = qdrant_client.QdrantClient("http://localhost:6333")
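
Instead of commenting and uncommenting the three options, the connection mode can be chosen from the environment. The following is a minimal sketch under that assumption: `qdrant_client_kwargs` is a hypothetical helper, and falling back to an in-memory instance when `QDRANT_URL` is unset is a design choice made here for illustration, not something the notebook prescribes.

```python
from typing import Dict, Mapping, Optional


def qdrant_client_kwargs(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Build keyword arguments for QdrantClient from environment-style settings."""
    url = env.get("QDRANT_URL")
    if url:
        # Remote instance (Qdrant Cloud or a local server); api_key may be None locally
        return {"url": url, "api_key": env.get("QDRANT_API_KEY")}
    # Assumption: with no URL configured, use an ephemeral in-memory instance
    return {"location": ":memory:"}


cloud_kwargs = qdrant_client_kwargs({"QDRANT_URL": "http://localhost:6333"})
memory_kwargs = qdrant_client_kwargs({})
print(cloud_kwargs)
print(memory_kwargs)
```

The result can then be passed along as `client = QdrantClient(**qdrant_client_kwargs(os.environ))`.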

In [428]:
## Collection Name 
COLLECTION_NAME = "qdrant_docs_arize_dense"

In [429]:
## General Collection level operations

## Get information about existing collections 
client.get_collections()

## Get information about specific collection
#collection_info = client.get_collection(COLLECTION_NAME)
#print(collection_info)

## Deleting collection, if need be
#client.delete_collection(COLLECTION_NAME)

CollectionsResponse(collections=[CollectionDescription(name='qdrant_docs_arize_dense'), CollectionDescription(name='qdrant_docs_arize_hybrid')])

In [None]:
## Declaring the intended Embedding Model with Fastembed
from fastembed.embedding import TextEmbedding

pd.DataFrame(TextEmbedding.list_supported_models())

### **8. Document Embedding processing and Ingestion**

This example uses a `QdrantVectorStore` backed by a new Qdrant collection, but you can use whatever LlamaIndex application you like.

In [None]:
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from phoenix.trace import suppress_tracing
## FastEmbed is used for embeddings in this example.
## For the complete list of supported models,
## please check https://qdrant.github.io/fastembed/examples/Supported_Models/
from llama_index.embeddings.fastembed import FastEmbedEmbedding

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

## Use FastEmbed as the embedding model
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Uncomment to use OpenAI embeddings instead of FastEmbed
#Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

Settings.llm = OpenAI(model="gpt-4-1106-preview", temperature=0.0)

with suppress_tracing():
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )

### **8a. Connecting to existing Collection**

This example uses a `QdrantVectorStore` that points at the previously populated Qdrant collection, so the documents do not need to be ingested again.

In [430]:
## Connect to the existing collection
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

In [431]:
client.count(collection_name=COLLECTION_NAME)

CountResult(count=4431)

### **9. Running an example query and printing out the response**

In [432]:
##Initialise retriever to interact with the Qdrant collection
retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=5
)
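
Conceptually, a dense retriever with `similarity_top_k=5` embeds the query and returns the five stored chunks whose vectors are most similar to it. The sketch below illustrates that idea with a brute-force cosine-similarity search over toy vectors. It is not how Qdrant works internally (Qdrant uses an approximate HNSW index), cosine similarity is assumed as the distance metric, and the three-dimensional vectors and document names are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k highest scores."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real embeddings have hundreds of dimensions
toy_docs = {
    "quantization": [0.9, 0.1, 0.0],
    "replication": [0.1, 0.9, 0.0],
    "snapshots": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print(results)
```

Raising `similarity_top_k` gives the LLM more context to work with, at the cost of more tokens and a higher chance of including loosely related chunks.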

In [None]:
response = retriever.retrieve("What is quantization?")
for i, node in enumerate(response):
    print(i + 1, node.text, end="\n\n")

In [None]:
response

In [272]:
# We can view the above data in the UI
px.active_session().view()

📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/


### **10. Run Your Query Engine and View Your Traces in Phoenix**

We've compiled a list of baseline questions about Qdrant. Let's download the sample queries and take a look.

In [433]:
## Loading the Eval dataset
from datasets import load_dataset
qdrant_qa = load_dataset("atitaarora/qdrant_doc_qna", split="train")
qdrant_qa_question = qdrant_qa.select_columns(['question'])

Downloading readme:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 85.8k/85.8k [00:00<00:00, 278kB/s]


Generating train split: 0 examples [00:00, ? examples/s]

In [434]:
qdrant_qa_question['question'][:10]

['What is vaccum optimizer ?',
 'Tell me about ‘always_ram’ parameter?',
 'What is difference between scalar and product quantization?',
 'What is ‘best_score’ strategy?',
 'How does oversampling helps?',
 'What is the purpose of ‘CreatePayloadIndexAsync’?',
 'What is the purpose of ef_construct in HNSW ?',
 'How do you use ‘ordering’ parameter?',
 'What is significance of ‘on_disk_payload’ setting?',
 'What is the impact of ‘write_consistency_factor’ ?']

In [435]:
query_engine = index.as_query_engine()
for query in tqdm(qdrant_qa_question['question'][:10]):
    try:
        query_engine.query(query)
    except Exception as e:
        # Report the failure instead of silently skipping the query
        print(f"Query failed: {query!r}: {e}")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:59<00:00,  5.91s/it]
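
If a query fails (for example because of rate limiting), it is useful to know which one, because a missing trace would otherwise skew the later evaluation. Below is a minimal sketch of a loop that keeps answers and failures separately. `run_queries` and `fake_query` are hypothetical names; `fake_query` merely stands in for `query_engine.query` so the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple


def run_queries(
    query_fn: Callable[[str], object], queries: List[str]
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run each query, returning successful answers and error messages separately."""
    answers: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for question in queries:
        try:
            answers[question] = query_fn(question)
        except Exception as exc:  # keep going, but remember what went wrong
            failures[question] = f"{type(exc).__name__}: {exc}"
    return answers, failures


def fake_query(question: str) -> str:
    """Hypothetical stand-in for query_engine.query."""
    if "fail" in question:
        raise RuntimeError("rate limited")
    return f"answer to: {question}"


answers, failures = run_queries(fake_query, ["What is HNSW?", "please fail"])
print(answers)
print(failures)
```

In the notebook, the call would look like `run_queries(query_engine.query, qdrant_qa_question['question'][:10])`.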


Check the Phoenix UI as your queries run. Your traces should appear in real time.

Open the Phoenix UI with the link below if you haven't already and click through the queries to better understand how the query engine is performing. For each trace you will see a breakdown of the spans it contains.

Phoenix can be used to understand and troubleshoot your application by surfacing:
 - **Application latency** - highlights slow invocations of LLMs, retrievers, etc.
 - **Token Usage** - displays the breakdown of token usage with LLMs to surface your most expensive LLM calls
 - **Runtime Exceptions** - critical runtime exceptions such as rate-limiting are captured as exception events
 - **Retrieved Documents** - view all the documents retrieved during a retriever call and the score and order in which they were returned
 - **Embeddings** - view the embedding text used for retrieval and the underlying embedding model
 - **LLM Parameters** - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
 - **Prompt Templates** - figure out what prompt template is used during the prompting step and what variables were used
 - **Tool Descriptions** - view the description and function signature of the tools your LLM has been given access to
 - **LLM Function Calls** - if using OpenAI or another model with function calls, you can view the function selection and function messages in the input messages to the LLM

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/RAG_trace_details.png" alt="Trace Details View on Phoenix" style="width:100%; height:auto;">

In [276]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


### **11. Export and Evaluate Your Trace Data**
You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export our retriever spans into two separate dataframes:

 - **queries_df** - the retrieved documents for each query are concatenated into a single column
 - **retrieved_documents_df** - each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation

This will enable us to compute multiple kinds of evaluations, including:

 - **Relevance** - Are the retrieved documents relevant to the query?
 - **Q&A correctness** - Are your application's responses grounded in the retrieved context?
 - **Hallucinations** - Is your application making up false information?
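
The difference between the two shapes can be sketched with plain Python data. This is an illustration under stated assumptions rather than Phoenix's implementation: the records, span ids, and scores are made up, the helper names are hypothetical, and joining the documents with a blank line is assumed from how the `reference` column appears in the output further down.

```python
from typing import Dict, List

# Made-up retriever results: each query has a list of (document text, score) pairs
retrievals = [
    {
        "span_id": "span-a",
        "input": "What is quantization?",
        "documents": [
            ("Quantization compresses vectors.", 0.83),
            ("Scalar quantization uses int8.", 0.79),
        ],
    },
    {
        "span_id": "span-b",
        "input": "What is HNSW?",
        "documents": [("HNSW is a graph index.", 0.81)],
    },
]


def to_query_rows(records: List[Dict]) -> List[Dict]:
    """One row per query, with all retrieved documents concatenated into 'reference'."""
    return [
        {
            "span_id": r["span_id"],
            "input": r["input"],
            "reference": "\n\n".join(text for text, _ in r["documents"]),
        }
        for r in records
    ]


def to_document_rows(records: List[Dict]) -> List[Dict]:
    """One row per (query, document) pair, keeping the rank and score of each document."""
    rows = []
    for r in records:
        for position, (text, score) in enumerate(r["documents"]):
            rows.append(
                {
                    "span_id": r["span_id"],
                    "document_position": position,
                    "input": r["input"],
                    "reference": text,
                    "document_score": score,
                }
            )
    return rows


query_rows = to_query_rows(retrievals)
document_rows = to_document_rows(retrievals)
print(len(query_rows), len(document_rows))
```

The per-query shape suits evaluations that judge the whole answer against its context (Q&A correctness, hallucinations), while the per-document shape suits relevance, which is judged one query-document pair at a time.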

In [436]:
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())

In [437]:
queries_df

context.span_id,input,output,reference
af10580fc044f34b,What is the impact of ‘write_consistency_factor’ ?,"The impact of the `write_consistency_factor` is that it determines the number of replicas that are required to acknowledge a write operation before the system responds to the client. If this value is increased, write operations will be more resilient to network partitions within the cluster, as they will be able to tolerate more failures. However, this also means that a greater number of active replicas are needed to successfully perform write operations.","- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.\n\n### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python"
637b8cb7885e45e8,What is significance of ‘on_disk_payload’ setting?,"The `on_disk_payload` setting is significant because it determines where payload data is stored. When set to `true`, it ensures that payload data is stored on disk only, rather than in RAM. This can be particularly useful for managing memory usage effectively when dealing with large payloads, as it helps to limit the amount of RAM required by the service.","* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).\n\nThe payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage."
cd0006eaeda9dbcf,How do you use ‘ordering’ parameter?,"The `ordering` parameter can be used with update and delete operations to ensure that these operations are executed in the same order on all replicas. To use this parameter, you would include it in your request to Qdrant. For example, when making an HTTP PUT request to update points in a collection, you would append `?ordering=strong` to the endpoint URL, like this:\n\n```\nPUT /collections/{collection_name}/points?ordering=strong\n```\n\nIn the body of the request, you would then include the details of the batch update, specifying the ids, payloads, and vectors for the points you want to update.\n\nWhen using the `ordering` parameter, Qdrant will route the operation to the leader replica of the shard and will not respond to the client until the operation has been completed. This ensures that all replicas process the update or delete operations in the same order, which is particularly useful for avoiding data inconsistencies when there are concurrent updates to the same documents.","- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents\n\n```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],"
80b7acbd5d1a7bb0,What is the purpose of ef_construct in HNSW ?,The `ef_construct` parameter in HNSW is the number of neighbors to consider during the index building. A larger value for `ef_construct` leads to higher precision in the search but also results in longer indexing time and more space required.,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),\n\nThe larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and"
7a620831aa8c580d,What is the purpose of ‘CreatePayloadIndexAsync’?,"The purpose of `CreatePayloadIndexAsync` is to create an index for a specific field within a collection in a database or search engine. This index is designed to optimize the search and retrieval of data based on the values of the field being indexed. The parameters within the function suggest that it is used to index a text field, with options to specify the tokenizer, minimum and maximum token length, and whether to convert the text to lowercase, which are all relevant for text processing and search optimization.","client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};\n\n},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```"
9b89ae147db1aa43,How does oversampling helps?,"Oversampling helps in two distinct ways:\n\n1. It improves the accuracy and performance of similarity search algorithms by allowing for significant compression of high-dimensional vectors in memory, while compensating for accuracy loss by re-scoring additional points with the original vectors.\n\n2. It equalizes the representation of classes in the training dataset, which enables more fair and accurate modeling of real-world scenarios.","### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors.\n\noversampling helps equalize the representation of classes in the training dataset, thus enabling more fair and accurate modeling of real-world scenarios."
6e2dca3f21076026,What is ‘best_score’ strategy?,"The 'best_score' strategy is a method used to find similar vectors by comparing each candidate against every example to identify the ones that are closer to a positive example and further from a negative one. The strategy selects the best positive and best negative scores for each candidate, and the final score is calculated using a step formula. If the best positive score is greater than the best negative score, the final score is the best positive score. Otherwise, the final score is the negative of the best negative score squared. This strategy was introduced in version 1.6.0 and its performance scales linearly with the number of examples used.","This is the default strategy that's going to be set implicitly, but you can explicitly define it by setting `""strategy"": ""average_vector""` in the recommendation request.\n\n\n\n### Best score strategy\n\n\n\n*Available as of v1.6.0*\n\n\n\nA new strategy introduced in v1.6, is called `best_score`. It is based on the idea that the best way to find similar vectors is to find the ones that are closer to a positive example, while avoiding the ones that are closer to a negative one.\n\nThe way it works is that each candidate is measured against every example, then we select the best positive and best negative scores. The final score is chosen with this step formula:\n\n\n\n```rust\n\nlet score = if best_positive_score > best_negative_score {\n\n best_positive_score;\n\n} else {\n\n -(best_negative_score * best_negative_score);\n\n};\n\n```\n\n\n\n<aside role=""alert"">\n\nThe performance of <code>best_score</code> strategy will be linearly impacted by the amount of examples.\n\n</as..."
01d4da38ee636076,What is difference between scalar and product quantization?,"Scalar quantization is a compression technique that reduces the number of bits used to represent each component of a vector. For example, it can convert 32-bit floating-point numbers into 8-bit unsigned integers for each vector component. This method is SIMD-friendly, which can make it faster in computations.\n\nProduct quantization, on the other hand, is also used for compressing high-dimensional vectors but involves dividing the vector into smaller sub-vectors and quantizing each of them separately. This method is not SIMD-friendly, which can result in slower distance calculations compared to scalar quantization. Additionally, product quantization typically incurs a loss of accuracy and is recommended for use with high-dimensional vectors.","But there are some tradeoffs. Product quantization distance calculations are not SIMD-friendly, so it is slower than scalar quantization.\n\nAlso, product quantization has a loss of accuracy, so it is recommended to use it only for high-dimensional vectors.\n\n\n\nPlease refer to the [Quantization Tips](#quantization-tips) section for more information on how to optimize the quantization parameters for your use case.\n\n\n\n## How to choose the right quantization method\n\n*Available as of v1.1.0*\n\n\n\nScalar quantization, in the context of vector search engines, is a compression technique that compresses vectors by reducing the number of bits used to represent each vector component.\n\n\n\n\n\nFor instance, Qdrant uses 32-bit floating numbers to represent the original vector components. Scalar quantization allows you to reduce the number of bits used to 8.\n\nIn other words, Qdrant performs `float32 -> uint8` conversion for each vector component."
6f5f2fe36b1f6b0f,Tell me about ‘always_ram’ parameter?,"The `always_ram` parameter is a configuration option that determines whether quantized vectors should be always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. However, in certain setups, keeping quantized vectors in RAM can speed up the search process. If you want to store quantized vectors in RAM, you can set `always_ram` to `true`. This parameter can be particularly useful if you experience a significant decrease in search quality and need to ensure faster access to the quantized vectors.","""compression"": ""x32"",\n\n ""always_ram"": true\n\n }\n\n },\n\n ""on_disk"": true\n\n }\n\n },\n\n ""hnsw_config"": {\n\n ""ef_construct"": 123\n\n },\n\n ""quantization_config"": {\n\n ""scalar"": {\n\n ""type"": ""int8"",\n\n ""quantile"": 0.8,\n\n ""always_ram"": false\n\n }\n\n }\n\n}\n\n```\n\n\n\n```bash\n\ncurl -X PATCH http://localhost:6333/collections/test_collection1 \\n\nIt might be worth tuning this parameter if you experience a significant decrease in search quality.\n\n\n\n`always_ram` - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors.\n\nHowever, in some setups you might want to keep quantized vectors in RAM to speed up the search process.\n\n\n\nIn this case, you can set `always_ram` to `true` to store quantized vectors in RAM.\n\n\n\n### Setting up Binary Quant..."
4c72533de8333fbd,What is vaccum optimizer ?,"The term ""vacuum optimizer"" does not appear directly in the provided context. However, the context does mention an ""optimizer_config"" with various parameters, one of which is ""vacuum_min_vector_number."" This parameter suggests that there is a vacuuming process involved in the optimization, which likely refers to a mechanism for managing and optimizing the storage of vectors. The ""vacuum_min_vector_number"" parameter indicates a threshold for the minimum number of vectors required before the vacuuming process can be triggered. Vacuuming in this context might be related to the process of cleaning up and compacting the data storage to improve performance and efficiency, although the exact nature of the ""vacuum optimizer"" cannot be determined from the given excerpt.","return optimizer\n\n```\n\n\n\nCaching in Quaterion is used for avoiding calculation of outputs of a frozen pretrained `Encoder` in every epoch.\n\nWhen it is configured, outputs will be computed once and cached in the preferred device for direct usage later on.\n\nIt provides both a considerable speedup and less memory footprint.\n\nHowever, it is quite a bit versatile and has several knobs to tune.\n\n},\n\n ""optimizer_config"": {\n\n ""deleted_threshold"": 0.2,\n\n ""vacuum_min_vector_number"": 1000,\n\n ""default_segment_number"": 0,\n\n ""max_segment_size"": null,\n\n ""memmap_threshold"": null,\n\n ""indexing_threshold"": 20000,\n\n ""flush_interval_sec"": 5,\n\n ""max_optimization_threads"": 1\n\n },\n\n ""wal_config"": {\n\n ""wal_capacity_mb"": 32,"


In [438]:
retrieved_documents_df

context.span_id,document_position,context.trace_id,input,reference,document_score
9b921d98f4d72421,0,7553aa25374ccc3b9ce2042985cb264f,What is the impact of ‘write_consistency_factor’ ?,"- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.",0.832443
9b921d98f4d72421,1,7553aa25374ccc3b9ce2042985cb264f,What is the impact of ‘write_consistency_factor’ ?,"### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python",0.825855
9b007f22503fb16c,0,98f3367385a8661f1fb9cb41f67a8397,What is significance of ‘on_disk_payload’ setting?,* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).,0.821666
9b007f22503fb16c,1,98f3367385a8661f1fb9cb41f67a8397,What is significance of ‘on_disk_payload’ setting?,"The payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage.",0.787703
2e0bfc0318b338ce,0,9e6e7f4ee9534522b259dbe57b7c242b,How do you use ‘ordering’ parameter?,"- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents",0.770652
2e0bfc0318b338ce,1,9e6e7f4ee9534522b259dbe57b7c242b,How do you use ‘ordering’ parameter?,"```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],",0.740061
8d75d3042708b6c8,0,3fefcfcbf9d25bae6d695ffbd9b1eecc,What is the purpose of ef_construct in HNSW ?,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),",0.78715
8d75d3042708b6c8,1,3fefcfcbf9d25bae6d695ffbd9b1eecc,What is the purpose of ef_construct in HNSW ?,"The larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and",0.767757
b388e45d23bbb452,0,28c99c7776f28de672def56a56604dc6,What is the purpose of ‘CreatePayloadIndexAsync’?,"client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};",0.717572
b388e45d23bbb452,1,28c99c7776f28de672def56a56604dc6,What is the purpose of ‘CreatePayloadIndexAsync’?,"},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```",0.698218


### **12. Define your evaluation model and your evaluators**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models: they prompt the LLM to assess the quality of responses, the relevance of retrieved documents, and so on, which provides a quality signal even when no human-labeled data is available. Pick an evaluator type and instantiate it with the language model you want to use for the evaluations; each evaluator relies on Phoenix's built-in evaluation templates.

In [439]:
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df))

run_evals |          | 0/20 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/20 (0.0%) | ⏳ 00:00<? | ?it/s

Your evaluations should now appear as annotations on the appropriate spans in Phoenix.

![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)
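
Beyond the UI, the evaluation results can also be summarized numerically. The sketch below is a minimal, library-free illustration: `label_rates` is a hypothetical helper, and the label lists are made-up stand-ins for the `label` column that the evaluation dataframes are assumed to carry (for example `hallucination_eval_df["label"].tolist()`).

```python
from collections import Counter
from typing import Dict, List


def label_rates(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of rows carrying each evaluation label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only; these are not results from this notebook.
example_hallucination_labels = ["factual", "factual", "hallucinated", "factual"]
example_qa_labels = ["correct", "incorrect", "correct", "correct"]

hallucination_rates = label_rates(example_hallucination_labels)
qa_rates = label_rates(example_qa_labels)
print(hallucination_rates)
print(qa_rates)
```

Computing the same rates for a second project later on gives a simple side-by-side comparison of two retrieval setups.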

### **13. Let's try Hybrid search now**

In [440]:
## Define a new collection to store your hybrid embeddings
COLLECTION_NAME_HYBRID = "qdrant_docs_arize_hybrid"

In [441]:
##Reprocess documents with different settings if needed 
#documents = process_document_chunks(dataset , CHUNK_SIZE , CHUNK_OVERLAP)

In [442]:
#len(documents)

In [247]:
##List of supported sparse vector models
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding
SparseTextEmbedding.list_supported_models()

[{'model': 'prithvida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Misspelled version of the model. Retained for backward compatibility. Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}},
 {'model': 'prithivida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}}]

### **14. Ingest Sparse and Dense vectors into Qdrant**

Ingest sparse and dense vectors into a Qdrant collection.
We use the SPLADE++ model for the sparse vectors and the default FastEmbed model, `BAAI/bge-small-en-v1.5`, for the dense embeddings.
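
To make the distinction concrete: a dense embedding stores a value for every dimension, while a sparse embedding stores only its non-zero dimensions as parallel `indices` and `values` lists. The sketch below uses hand-written toy vectors (not real SPLADE++ output), and `sparse_dot` is a hypothetical helper showing how two sparse vectors can be scored with a dot product over their shared indices.

```python
from typing import List


def sparse_dot(
    indices_a: List[int],
    values_a: List[float],
    indices_b: List[int],
    values_b: List[float],
) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    # Indices missing from the second vector contribute nothing to the score.
    return sum(v * weights_b.get(i, 0.0) for i, v in zip(indices_a, values_a))


# Toy vectors: indices are vocabulary positions, values are term weights.
doc_indices, doc_values = [7, 42, 1033], [0.5, 2.0, 1.0]
query_indices, query_values = [42, 1033, 2500], [1.0, 0.5, 3.0]

score = sparse_dot(query_indices, query_values, doc_indices, doc_values)
print(score)  # only indices 42 and 1033 overlap: 1.0 * 2.0 + 0.5 * 1.0 = 2.5
```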

In [248]:
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding, SparseEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from typing import List, Tuple

sparse_model_name = "prithivida/Splade_PP_en_v1"

# This triggers the model download
sparse_model = SparseTextEmbedding(model_name=sparse_model_name, batch_size=32)

batch_size = 10
parallel = 0

## Compute sparse vectors for a batch of texts
def compute_sparse_vectors(
    texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
    """Return parallel lists of non-zero indices and their weights, one entry per text."""
    indices, values = [], []
    for embedding in sparse_model.embed(texts):
        indices.append(embedding.indices.tolist())
        values.append(embedding.values.tolist())
    return indices, values

## Creating a vector store with Hybrid search enabled
vector_store = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME_HYBRID,
    enable_hybrid=True,
    sparse_doc_fn=compute_sparse_vectors,
    sparse_query_fn=compute_sparse_vectors)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Ingesting sparse and dense vectors into Qdrant collection
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True
)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/4431 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/335 [00:00<?, ?it/s]

In [443]:
## collection level operations
client.get_collection(COLLECTION_NAME_HYBRID)
#client.delete_collection(COLLECTION_NAME_HYBRID)

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=8862, indexed_vectors_count=4429, points_count=4431, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors={'text-dense': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors={'text-sparse': SparseVectorParams(index=SparseIndexParams(full_scan_threshold=None, on_disk=None))}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_thread

In [444]:
## Check that the number of points matches the expected number of document chunks
client.count(collection_name=COLLECTION_NAME_HYBRID)

CountResult(count=4431)

### **15. Hybrid Search with Qdrant**

In [447]:
## Before trying hybrid search, let's try a sparse-vector-only retriever.
## Note: `hybrid_index` is created in cell In [452] further below; run that cell first.
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever

sparse_retriever = VectorIndexRetriever(
    index=hybrid_index,
    vector_store_query_mode=VectorStoreQueryMode.SPARSE,
    sparse_top_k=2,
)

## Pure sparse vector search
nodes = sparse_retriever.retrieve("What is a Vacuum Optimizer?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Like many other databases, Qdrant does not delete entries immediately after a query.

Instead, it marks records as deleted and ignores them for future queries.



This strategy allows us to minimize disk access - one of the slowest operations.

However, a side effect of this strategy is that, over time, deleted records accumulate, occupy memory and slow down the system.



To avoid these adverse effects, Vacuum Optimizer is used.

It is used if the segment has accumulated too many deleted records.

2 In this case, the segment to be optimized remains readable for the time of the rebuild.



![Segment optimization](/docs/optimization.svg)



The availability is achieved by wrapping the segment into a proxy that transparently handles data changes.

Changed data is placed in the copy-on-write segment, which has priority for retrieval and subsequent updates.



## Vacuum Optimizer



The simplest example of a case where you need to rebuild a segment repository is to remove points.



In [448]:
## Now let's try a hybrid search retriever. The `alpha` parameter controls the weighting between
## the dense and sparse vector search scores (0 = sparse only, 1 = dense only).

hybrid_retriever = VectorIndexRetriever(
    index=hybrid_index,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    sparse_top_k=2,
    similarity_top_k=5,
    alpha=1,
)
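
As a rough illustration of what `alpha` does, the sketch below blends already-normalized dense and sparse scores for two made-up documents. It assumes the convention noted in the fusion function used in this notebook (0 means sparse only, 1 means dense only); `blend`, `rank`, and the scores are hypothetical.

```python
from typing import List


def blend(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Weighted sum of normalized scores: alpha weights dense, (1 - alpha) weights sparse."""
    return (1 - alpha) * sparse_score + alpha * dense_score


# Made-up normalized scores: doc_a is favored by dense search, doc_b by sparse search.
scores = {
    "doc_a": {"dense": 1.0, "sparse": 0.2},
    "doc_b": {"dense": 0.4, "sparse": 1.0},
}


def rank(alpha: float) -> List[str]:
    """Order the documents by their blended score, best first."""
    return sorted(
        scores,
        key=lambda doc: blend(scores[doc]["dense"], scores[doc]["sparse"], alpha),
        reverse=True,
    )


for alpha in (0.0, 0.5, 1.0):
    print(alpha, rank(alpha))
```

Under this convention, `alpha=1` ranks by the dense scores alone, so lower values are needed before the sparse scores influence the ordering.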

In [449]:
## Let's try hybrid retriever 
nodes = hybrid_retriever.retrieve("What is merge optimizer?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n")

1 Such segments, for example, are created as copy-on-write segments during optimization itself.



It is also essential to have at least one small segment that Qdrant will use to store frequently updated data.

On the other hand, too many small segments lead to suboptimal search performance.



There is the Merge Optimizer, which combines the smallest segments into one large segment. It is used if too many segments are created.
2 ---

title: Optimizer

weight: 70

aliases:

  - ../optimizer

---



# Optimizer



It is much more efficient to apply changes in batches than perform each change individually, as many other databases do. Qdrant here is no exception. Since Qdrant operates with data structures that are not always easy to change, it is sometimes necessary to rebuild those structures completely.



Storage optimization in Qdrant occurs at the segment level (see [storage](../storage)).
3 The criteria for starting the optimizer are defined in the configuration file.



Here is an 

In [451]:
# We shouldn't be modifying the alpha parameter after the retriever has been created
# but that's the easiest way to show the effect of the parameter
#hybrid_retriever._alpha = 0.1
#hybrid_retriever._alpha = 0.9

#nodes = hybrid_retriever.retrieve("What are the advantages of quantization?")
#for i, node in enumerate(nodes):
#    print(i + 1, node.text, end="\n\n")

In [450]:
## Taken from https://github.com/run-llama/llama_index/blob/6af5b1377a652f3e1de2f8523e0cbab2378ebb33/llama-index-integrations/vector_stores/llama-index-vector-stores-qdrant/llama_index/vector_stores/qdrant/utils.py#L96-L102
## For the purpose of showing how to customize the score distribution in Hybrid search

from llama_index.core.vector_stores.types import VectorStoreQueryResult

def relative_score_custom_fusion(
    dense_result: VectorStoreQueryResult,
    sparse_result: VectorStoreQueryResult,
    # NOTE: only for hybrid search (0 for sparse search, 1 for dense search)
    alpha: float = 0.1,
    top_k: int = 2,
) -> VectorStoreQueryResult:
    """
    Fuse dense and sparse results using relative score fusion.
    """
    # check if dense or sparse results is empty
    if (dense_result.nodes is None or len(dense_result.nodes) == 0) and (
        sparse_result.nodes is None or len(sparse_result.nodes) == 0
    ):
        return VectorStoreQueryResult(nodes=None, similarities=None, ids=None)
    elif sparse_result.nodes is None or len(sparse_result.nodes) == 0:
        return dense_result
    elif dense_result.nodes is None or len(dense_result.nodes) == 0:
        return sparse_result

    assert dense_result.nodes is not None
    assert dense_result.similarities is not None
    assert sparse_result.nodes is not None
    assert sparse_result.similarities is not None

    # deconstruct results
    sparse_result_tuples = list(zip(sparse_result.similarities, sparse_result.nodes))
    sparse_result_tuples.sort(key=lambda x: x[0], reverse=True)

    dense_result_tuples = list(zip(dense_result.similarities, dense_result.nodes))
    dense_result_tuples.sort(key=lambda x: x[0], reverse=True)

    # track nodes in both results
    all_nodes_dict = {x.node_id: x for x in dense_result.nodes}
    for node in sparse_result.nodes:
        if node.node_id not in all_nodes_dict:
            all_nodes_dict[node.node_id] = node

    # normalize sparse similarities from 0 to 1
    sparse_similarities = [x[0] for x in sparse_result_tuples]

    sparse_per_node = {}
    if len(sparse_similarities) > 0:
        max_sparse_sim = max(sparse_similarities)
        min_sparse_sim = min(sparse_similarities)

        # avoid division by zero
        if max_sparse_sim == min_sparse_sim:
            sparse_similarities = [max_sparse_sim] * len(sparse_similarities)
        else:
            sparse_similarities = [
                (x - min_sparse_sim) / (max_sparse_sim - min_sparse_sim)
                for x in sparse_similarities
            ]

        sparse_per_node = {
            sparse_result_tuples[i][1].node_id: x
            for i, x in enumerate(sparse_similarities)
        }

    # normalize dense similarities from 0 to 1
    dense_similarities = [x[0] for x in dense_result_tuples]

    dense_per_node = {}
    if len(dense_similarities) > 0:
        max_dense_sim = max(dense_similarities)
        min_dense_sim = min(dense_similarities)

        # avoid division by zero
        if max_dense_sim == min_dense_sim:
            dense_similarities = [max_dense_sim] * len(dense_similarities)
        else:
            dense_similarities = [
                (x - min_dense_sim) / (max_dense_sim - min_dense_sim)
                for x in dense_similarities
            ]

        dense_per_node = {
            dense_result_tuples[i][1].node_id: x
            for i, x in enumerate(dense_similarities)
        }

    # fuse the scores
    fused_similarities = []
    for node_id in all_nodes_dict:
        sparse_sim = sparse_per_node.get(node_id, 0)
        dense_sim = dense_per_node.get(node_id, 0)
        fused_sim = (1 - alpha) * sparse_sim + alpha * dense_sim
        fused_similarities.append((fused_sim, all_nodes_dict[node_id]))

    fused_similarities.sort(key=lambda x: x[0], reverse=True)
    fused_similarities = fused_similarities[:top_k]

    # create final response object
    return VectorStoreQueryResult(
        nodes=[x[1] for x in fused_similarities],
        similarities=[x[0] for x in fused_similarities],
        ids=[x[1].node_id for x in fused_similarities],
    )
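
To follow the arithmetic without LlamaIndex objects, the sketch below applies the same steps (min-max normalization, a score of 0 for nodes missing from one result list, then the alpha-weighted sum) to plain dictionaries. The node ids and raw scores are made up, and `min_max` and `fuse` are hypothetical helpers, not part of any library.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, keep them unchanged."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return dict(scores)
    return {node_id: (s - low) / (high - low) for node_id, s in scores.items()}


def fuse(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.1,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Fuse two score maps; a node absent from one map scores 0 on that side."""
    dense_norm, sparse_norm = min_max(dense), min_max(sparse)
    fused = {
        node_id: (1 - alpha) * sparse_norm.get(node_id, 0.0)
        + alpha * dense_norm.get(node_id, 0.0)
        for node_id in {**dense, **sparse}
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Made-up raw scores: dense similarities and sparse scores live on different scales.
dense_scores = {"n1": 0.75, "n2": 0.5, "n3": 0.25}
sparse_scores = {"n2": 12.0, "n4": 4.0}

result = fuse(dense_scores, sparse_scores, alpha=0.1, top_k=2)
print([node_id for node_id, _ in result])  # ['n2', 'n1']
```

Here `n2` wins because it is the top sparse hit and `alpha=0.1` puts most of the weight on the sparse side, while `n1` only earns its small dense contribution.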

In [452]:

## Initialise Hybrid Vector Store 
vector_store_hybrid = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME_HYBRID,
    enable_hybrid=True,
    batch_size=20,  # This is important for the ingestion
    hybrid_fusion_fn = relative_score_custom_fusion,
)

## Followed by initializing index for interacting with the Hybrid Collection in Qdrant

hybrid_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store_hybrid,
    storage_context=storage_context,
)

### **16. Re-Run Your Query Engine and View Your Traces in Phoenix**

Let's rerun the baseline questions about Qdrant, this time against the hybrid index.

In [453]:
## Switching phoenix project space
from phoenix.trace import using_project

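# Switch project so that all spans created within this context
# are associated with the `HYBRID_RAG_PROJECT` project.
with using_project(HYBRID_RAG_PROJECT):
    ## Reuse the previously loaded dataset `qdrant_qa_question`
    # Note: as_query_engine() queries in the default (dense) mode unless
    # vector_store_query_mode="hybrid" is passed explicitly.
    query_engine_hybrid = hybrid_index.as_query_engine()
    for query in tqdm(qdrant_qa_question['question'][:10]):
        try:
            query_engine_hybrid.query(query)
        except Exception as e:
            # Report the failure and continue with the next question
            print(f"Query failed: {query!r}: {e}")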




In [454]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


In [455]:
## Export the query and retrieval spans recorded in the hybrid project


queries_df_hybrid = get_qa_with_reference(px.Client(), project_name=HYBRID_RAG_PROJECT)
retrieved_documents_df_hybrid = get_retrieved_documents(px.Client(), project_name=HYBRID_RAG_PROJECT)

In [456]:
queries_df_hybrid

context.span_id,input,output,reference
212d2c6b46f2474b,What is the impact of ‘write_consistency_factor’ ?,"The impact of the `write_consistency_factor` is that it determines the number of replicas in a distributed system that must acknowledge a write operation before the system responds to the client. If the `write_consistency_factor` is increased, write operations will be more tolerant to network partitions, as more replicas are required to acknowledge the write. However, this also means that a higher number of active replicas are needed to successfully perform write operations. If not enough replicas are available to meet the `write_consistency_factor`, write operations will not be completed, potentially affecting the availability of the system for writing data.","- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.\n\n### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python"
5b680cfae0d4b4f1,What is significance of ‘on_disk_payload’ setting?,"The `on_disk_payload` setting is significant because it determines where payload data is stored. When set to `true`, it ensures that payload data is stored only on disk, rather than in RAM. This can be particularly useful for managing memory usage effectively when dealing with large payloads, as it helps to limit the amount of RAM required.","* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).\n\nThe payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage."
0aa7f26ef2ea7615,How do you use ‘ordering’ parameter?,"The `ordering` parameter can be used with update and delete operations to ensure that these operations are executed in the same order on all replicas. To use this parameter, you would include it in your request to Qdrant. For example, when sending an HTTP PUT request to update points in a collection, you would append `?ordering=strong` to the URL, like this:\n\n```\nPUT /collections/{collection_name}/points?ordering=strong\n```\n\nIn the body of the request, you would include the details of the points you want to update, such as their IDs, payloads, and vectors.\n\nWhen using the `ordering` parameter, Qdrant will route the operation to the leader replica of the shard and will not respond to the client until it has received a response from the leader replica. This ensures that all replicas process the operations in the same order, which is particularly useful for avoiding data inconsistencies when there are concurrent updates to the same documents.","- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents\n\n```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],"
e3d1118a55429921,What is the purpose of ef_construct in HNSW ?,The `ef_construct` parameter in HNSW is the number of neighbors to consider during the index building. A larger value for `ef_construct` leads to higher precision in the search but also results in longer indexing time.,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),\n\nThe larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and"
acad6907c9d13ff9,What is the purpose of ‘CreatePayloadIndexAsync’?,"The purpose of the `CreatePayloadIndexAsync` method is to create an index for a specific field within a collection in a database. This index is designed to optimize the search and retrieval of data based on the values of that field. The parameters within the method specify the name of the field to be indexed, the schema of the field including its type (e.g., text), the tokenizer to be used (e.g., word), the minimum and maximum token length, and whether the text should be converted to lowercase. This asynchronous operation facilitates efficient querying of text fields by structuring the index to quickly locate entries based on the indexed field's criteria.","client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};\n\n},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```"
048a6afc49aa9229,How does oversampling helps?,"Oversampling helps in two distinct ways. In the context of quantization for similarity search algorithms, it allows for significant compression of high-dimensional vectors in memory while compensating for accuracy loss by re-scoring additional points with the original vectors. In the context of training datasets, oversampling helps to equalize the representation of classes, enabling more fair and accurate modeling of real-world scenarios.","### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors.\n\noversampling helps equalize the representation of classes in the training dataset, thus enabling more fair and accurate modeling of real-world scenarios."
3861ecba1ca97b94,What is ‘best_score’ strategy?,"The 'best_score' strategy is a method used to find similar vectors by comparing each candidate against every example. It selects the best positive and best negative scores from these comparisons. The final score for a candidate is determined using a step formula: if the best positive score is greater than the best negative score, the final score is the best positive score; otherwise, the final score is the negative of the best negative score squared. This strategy was introduced in version 1.6.0 and its performance scales linearly with the number of examples used.","This is the default strategy that's going to be set implicitly, but you can explicitly define it by setting `""strategy"": ""average_vector""` in the recommendation request.\n\n\n\n### Best score strategy\n\n\n\n*Available as of v1.6.0*\n\n\n\nA new strategy introduced in v1.6, is called `best_score`. It is based on the idea that the best way to find similar vectors is to find the ones that are closer to a positive example, while avoiding the ones that are closer to a negative one.\n\nThe way it works is that each candidate is measured against every example, then we select the best positive and best negative scores. The final score is chosen with this step formula:\n\n\n\n```rust\n\nlet score = if best_positive_score > best_negative_score {\n\n best_positive_score;\n\n} else {\n\n -(best_negative_score * best_negative_score);\n\n};\n\n```\n\n\n\n<aside role=""alert"">\n\nThe performance of <code>best_score</code> strategy will be linearly impacted by the amount of examples.\n\n</as..."
a865b3eee64e9806,What is difference between scalar and product quantization?,"Scalar quantization is a compression technique that reduces the number of bits used to represent each component of a vector. It involves converting the original floating-point representations of vector components into a lower bit representation, such as from 32-bit floating numbers to 8-bit unsigned integers.\n\nProduct quantization, on the other hand, is a different approach to compressing high-dimensional vectors. It involves dividing the vector into smaller sub-vectors and quantizing each sub-vector separately. This method is not as friendly to SIMD (Single Instruction, Multiple Data) operations, which can make distance calculations slower compared to scalar quantization. Additionally, product quantization typically results in a loss of accuracy, and therefore, it is recommended for use with high-dimensional vectors where this trade-off is acceptable.","But there are some tradeoffs. Product quantization distance calculations are not SIMD-friendly, so it is slower than scalar quantization.\n\nAlso, product quantization has a loss of accuracy, so it is recommended to use it only for high-dimensional vectors.\n\n\n\nPlease refer to the [Quantization Tips](#quantization-tips) section for more information on how to optimize the quantization parameters for your use case.\n\n\n\n## How to choose the right quantization method\n\n*Available as of v1.1.0*\n\n\n\nScalar quantization, in the context of vector search engines, is a compression technique that compresses vectors by reducing the number of bits used to represent each vector component.\n\n\n\n\n\nFor instance, Qdrant uses 32-bit floating numbers to represent the original vector components. Scalar quantization allows you to reduce the number of bits used to 8.\n\nIn other words, Qdrant performs `float32 -> uint8` conversion for each vector component."
0d4fc07d2f4b8680,Tell me about ‘always_ram’ parameter?,"The `always_ram` parameter is a configuration option that determines whether quantized vectors should be kept cached in RAM at all times. By default, quantized vectors are loaded in the same way as the original vectors. However, if you want to speed up the search process by avoiding the need to load quantized vectors from disk, you can set `always_ram` to `true`. This will ensure that the quantized vectors are stored in RAM, potentially improving search performance at the cost of increased memory usage.","""compression"": ""x32"",\n\n ""always_ram"": true\n\n }\n\n },\n\n ""on_disk"": true\n\n }\n\n },\n\n ""hnsw_config"": {\n\n ""ef_construct"": 123\n\n },\n\n ""quantization_config"": {\n\n ""scalar"": {\n\n ""type"": ""int8"",\n\n ""quantile"": 0.8,\n\n ""always_ram"": false\n\n }\n\n }\n\n}\n\n```\n\n\n\n```bash\n\ncurl -X PATCH http://localhost:6333/collections/test_collection1 \\n\nIt might be worth tuning this parameter if you experience a significant decrease in search quality.\n\n\n\n`always_ram` - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors.\n\nHowever, in some setups you might want to keep quantized vectors in RAM to speed up the search process.\n\n\n\nIn this case, you can set `always_ram` to `true` to store quantized vectors in RAM.\n\n\n\n### Setting up Binary Quant..."
c730cea5ff8e9206,What is vaccum optimizer ?,"The ""vacuum optimizer"" appears to be a configuration setting within a larger system, possibly related to managing and optimizing data storage or indexing. The specific parameters mentioned, such as ""deleted_threshold,"" ""vacuum_min_vector_number,"" ""default_segment_number,"" ""max_segment_size,"" ""memmap_threshold,"" ""indexing_threshold,"" ""flush_interval_sec,"" and ""max_optimization_threads,"" suggest that it deals with the maintenance and optimization of data segments, possibly by removing deleted or obsolete data entries when certain conditions are met, such as the number of vectors falling below a threshold. It may also be responsible for optimizing the performance of the system by managing how data is indexed and stored in memory. The ""flush_interval_sec"" parameter suggests that it performs some operations at regular intervals.","return optimizer\n\n```\n\n\n\nCaching in Quaterion is used for avoiding calculation of outputs of a frozen pretrained `Encoder` in every epoch.\n\nWhen it is configured, outputs will be computed once and cached in the preferred device for direct usage later on.\n\nIt provides both a considerable speedup and less memory footprint.\n\nHowever, it is quite a bit versatile and has several knobs to tune.\n\n},\n\n ""optimizer_config"": {\n\n ""deleted_threshold"": 0.2,\n\n ""vacuum_min_vector_number"": 1000,\n\n ""default_segment_number"": 0,\n\n ""max_segment_size"": null,\n\n ""memmap_threshold"": null,\n\n ""indexing_threshold"": 20000,\n\n ""flush_interval_sec"": 5,\n\n ""max_optimization_threads"": 1\n\n },\n\n ""wal_config"": {\n\n ""wal_capacity_mb"": 32,"


In [457]:
retrieved_documents_df_hybrid

context.span_id,document_position,context.trace_id,input,reference,document_score
0ebc273b9cb63812,0,f84efadffd297edc786accdfedbbf1ee,What is the impact of ‘write_consistency_factor’ ?,"- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.",0.832443
0ebc273b9cb63812,1,f84efadffd297edc786accdfedbbf1ee,What is the impact of ‘write_consistency_factor’ ?,"### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python",0.825855
5762473bc1710405,0,9c6987aff35894e4e124ab2060bcfe40,What is significance of ‘on_disk_payload’ setting?,* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).,0.821666
5762473bc1710405,1,9c6987aff35894e4e124ab2060bcfe40,What is significance of ‘on_disk_payload’ setting?,"The payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage.",0.787703
6a389c9f00bc1faa,0,45d3a1028595a7d95fcfce1b16ec4251,How do you use ‘ordering’ parameter?,"- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents",0.770652
6a389c9f00bc1faa,1,45d3a1028595a7d95fcfce1b16ec4251,How do you use ‘ordering’ parameter?,"```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],",0.740061
331a47badf78052a,0,3bf9cc3bc3fb4169328e7c7ab4470445,What is the purpose of ef_construct in HNSW ?,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),",0.78715
331a47badf78052a,1,3bf9cc3bc3fb4169328e7c7ab4470445,What is the purpose of ef_construct in HNSW ?,"The larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and",0.767757
2355564201a39480,0,e5c8605ba444188fe9d1639dfecf49de,What is the purpose of ‘CreatePayloadIndexAsync’?,"client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};",0.717572
2355564201a39480,1,e5c8605ba444188fe9d1639dfecf49de,What is the purpose of ‘CreatePayloadIndexAsync’?,"},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```",0.698218


### **17. Define your evaluation model and your evaluators for Hybrid Search**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models: they prompt the LLM to assess the quality of responses, the relevance of retrieved documents, and so on, which provides a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use for evaluation; each evaluator then applies one of our battle-tested evaluation templates.
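
To make the idea concrete, here is a minimal sketch of how an LLM-based evaluator can work in principle: fill a prompt template, call a model, and snap the free-form answer onto a fixed set of labels. Everything in it is an assumption for illustration only: `RELEVANCE_TEMPLATE`, `snap_to_rail`, `evaluate`, and `fake_model` are hypothetical names, the template is not the one Phoenix ships, and the model is a stub rather than a real LLM call.

```python
import re
from typing import Callable, Dict, Optional, Sequence

# Hypothetical template, NOT the one Phoenix uses internally.
RELEVANCE_TEMPLATE = (
    "Question: {input}\n"
    "Reference text: {reference}\n"
    "Does the reference help answer the question? "
    "Answer with one word: relevant or unrelated."
)
RAILS = ("relevant", "unrelated")  # the only labels we accept


def snap_to_rail(raw: str, rails: Sequence[str]) -> Optional[str]:
    """Map free-form model output onto exactly one allowed label, else None."""
    tokens = set(re.findall(r"[a-z]+", raw.lower()))
    matches = [rail for rail in rails if rail in tokens]
    return matches[0] if len(matches) == 1 else None


def evaluate(
    record: Dict[str, str],
    model: Callable[[str], str],
    template: str = RELEVANCE_TEMPLATE,
    rails: Sequence[str] = RAILS,
) -> Dict[str, object]:
    """Prompt the model for one record and return a label plus a 0/1 score."""
    label = snap_to_rail(model(template.format(**record)), rails)
    # The first rail counts as the positive class; unparseable output gets no score.
    score = None if label is None else int(label == rails[0])
    return {"label": label, "score": score}


def fake_model(prompt: str) -> str:
    """Stub standing in for a real LLM call (assumption for this sketch)."""
    reference = prompt.split("Reference text:")[1]
    return "Relevant." if "ef_construct" in reference else "Unrelated."


result = evaluate(
    {
        "input": "What is the purpose of ef_construct in HNSW ?",
        "reference": "The `ef_construct` parameter is the number of neighbours "
        "to consider during the index building.",
    },
    fake_model,
)
print(result)
```

Matching on whole words rather than substrings keeps an answer such as "irrelevant" from being counted as "relevant"; output that matches no label, or more than one, is left unscored instead of being guessed.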

In [458]:


# Evaluation results are attached to the `HYBRID_RAG_PROJECT` project via the `project_name` argument below.
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df_hybrid, qa_correctness_eval_df_hybrid = run_evals(
    dataframe=queries_df_hybrid,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df_hybrid = run_evals(
    dataframe=retrieved_documents_df_hybrid,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df_hybrid),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df_hybrid),
    project_name=HYBRID_RAG_PROJECT,
)
px.Client().log_evaluations(
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df_hybrid),
    project_name=HYBRID_RAG_PROJECT,
)

run_evals |          | 0/20 (0.0%) | ⏳ 00:00<? | ?it/s
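
Once per-document relevance scores are available, they can be rolled up into a retrieval metric such as precision@k. The sketch below is a minimal illustration that assumes each retrieved document received a score of 1 (relevant) or 0 (unrelated). The rows, span IDs, and the `precision_at_k` helper are made up for the example and are not results from this notebook.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical rows: (span_id, document_position, relevance score).
# These values are invented for illustration, not taken from the evals above.
rows = [
    ("span-a", 0, 1),
    ("span-a", 1, 1),
    ("span-b", 0, 1),
    ("span-b", 1, 0),
]


def precision_at_k(rows: Iterable[Tuple[str, int, int]], k: int) -> Dict[str, float]:
    """Fraction of relevant documents among the top-k retrieved per query span."""
    by_span = defaultdict(list)
    for span_id, position, score in rows:
        if position < k:  # keep only the k highest-ranked documents
            by_span[span_id].append(score)
    # Divides by the number of documents actually retrieved (at most k).
    return {span_id: sum(scores) / len(scores) for span_id, scores in by_span.items()}


per_query = precision_at_k(rows, k=2)
mean_precision = sum(per_query.values()) / len(per_query)
print(per_query, mean_precision)
```

Computing the same figure for both the simple and the hybrid project gives a single number per pipeline to compare, alongside the per-span evaluations shown in Phoenix.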