<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/qdrant_arize.png" width="500"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tuning a RAG Pipeline using Qdrant and Arize Phoenix</h1>

ℹ️ This notebook requires an OpenAI API key.

### **1. Import Relevant Packages**

In [2]:
import os

# Setup projects
SIMPLE_RAG_PROJECT = "simple-rag"
HYBRID_RAG_PROJECT = "hybrid-rag"
os.environ["PHOENIX_PROJECT_NAME"] = SIMPLE_RAG_PROJECT

In [3]:
import datetime
import json
import os
import pickle
import ssl
import time
import urllib
from getpass import getpass
from urllib.request import urlopen

import certifi
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
import requests
from bs4 import BeautifulSoup
from llama_index.core import (
    ServiceContext, StorageContext, download_loader,
    load_index_from_storage, set_global_handler
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.graph_stores.simple import SimpleGraphStore
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator, OpenAIModel, QAEvaluator,
    RelevanceEvaluator, run_evals
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import PointStruct

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", 1000)

### **2. Launch Phoenix**
You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the OpenInferenceTraceCallbackHandler. Phoenix supports LlamaIndex's one-click observability which will automatically instrument your LlamaIndex application! You can consult our integration guide for a more detailed explanation of how to instrument your LlamaIndex application.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run the LlamaIndex application).

In [4]:
session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Be sure to enable phoenix as your global handler for tracing!

In [5]:
set_global_handler("arize_phoenix")

### **3. Setup your openai key and retrieve the documents to be used**

In [6]:
from dotenv import load_dotenv
load_dotenv()

True

In [7]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

### **4. Retrieve the documents / dataset to be used**

In [8]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("atitaarora/qdrant_doc", split="train")

In [9]:
dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='csv', dataset_name='qdrant_doc', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1767967, num_examples=240, shard_lengths=None, dataset_name='qdrant_doc')}, download_checksums={'hf://datasets/atitaarora/qdrant_doc@8d859890840f65337c38e96d660b81b1441bbecd/documents.csv': {'num_bytes': 1777260, 'checksum': None}}, download_size=1777260, post_processing_size=None, dataset_size=1767967, size_in_bytes=3545227)

### **5. Definition of global chunk properties and chunk processing**
Processing each document with desired **TEXT_SPLITTER_ALGO , CHUNK_SIZE , CHUNK_OVERLAP** etc

In [10]:
## Global config for chunk processing
CHUNK_SIZE = 512 #1000
CHUNK_OVERLAP = 50

### **6. Process dataset as langchain (or llamaindex) document for further processing**

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
from llama_index.core import Document

## Split and process the document chunks from the given dataset

def process_document_chunks(dataset,chunk_size,chunk_overlap):
    langchain_docs = [
        LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
        for doc in tqdm(dataset)
    ]

    # could showcase another variation of processed documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in langchain_docs:
        docs_processed += text_splitter.split_documents([doc])

    ## Converting Langchain document chunks above into Llamaindex Document for ingestion
    llama_documents = [
        Document.from_langchain_format(doc)
        for doc in docs_processed
    ]
    return llama_documents

In [12]:
documents = process_document_chunks(dataset, CHUNK_SIZE, CHUNK_OVERLAP)
len(documents)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 240/240 [00:00<00:00, 14011.96it/s]


4431

### **7. Setting up Qdrant and Collection**

We first set up the qdrant client and then create a collection so that our data may be stored.

In [13]:
##Uncomment to initialise qdrant client in memory
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)

##Uncomment below to connect to Qdrant Cloud
client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)

## Uncomment below to connect to local Qdrant
#client = qdrant_client.QdrantClient("http://localhost:6333")

In [14]:
## Collection Name 
COLLECTION_NAME = "qdrant_docs_arize_dense"

In [23]:
## General Collection level operations

## Get information about existing collections 
client.get_collections()

## Get information about specific collection
#collection_info = client.get_collection(COLLECTION_NAME)
#print(collection_info)

## Deleting collection, if need be
#client.delete_collection(COLLECTION_NAME)

CollectionsResponse(collections=[])

In [24]:
## Declaring the intended Embedding Model with Fastembed
from fastembed.embedding import TextEmbedding

pd.DataFrame(TextEmbedding.list_supported_models())

Unnamed: 0,model,dim,description,size_in_GB,sources
0,BAAI/bge-base-en,768,Base English model,0.42,{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz'}
1,BAAI/bge-base-en-v1.5,768,"Base English model, v1.5",0.21,"{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz', 'hf': 'qdrant/bge-base-en-v1.5-onnx-q'}"
2,BAAI/bge-large-en-v1.5,1024,"Large English model, v1.5",1.2,{'hf': 'qdrant/bge-large-en-v1.5-onnx'}
3,BAAI/bge-small-en,384,Fast English model,0.13,{'url': 'https://storage.googleapis.com/qdrant-fastembed/BAAI-bge-small-en.tar.gz'}
4,BAAI/bge-small-en-v1.5,384,Fast and Default English model,0.067,{'hf': 'qdrant/bge-small-en-v1.5-onnx-q'}
5,BAAI/bge-small-zh-v1.5,512,Fast and recommended Chinese model,0.09,{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz'}
6,sentence-transformers/all-MiniLM-L6-v2,384,"Sentence Transformer model, MiniLM-L6-v2",0.09,"{'url': 'https://storage.googleapis.com/qdrant-fastembed/sentence-transformers-all-MiniLM-L6-v2.tar.gz', 'hf': 'qdrant/all-MiniLM-L6-v2-onnx'}"
7,sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2,384,"Sentence Transformer model, paraphrase-multilingual-MiniLM-L12-v2",0.22,{'hf': 'qdrant/paraphrase-multilingual-MiniLM-L12-v2-onnx-Q'}
8,nomic-ai/nomic-embed-text-v1,768,8192 context length english model,0.52,{'hf': 'nomic-ai/nomic-embed-text-v1'}
9,nomic-ai/nomic-embed-text-v1.5,768,8192 context length english model,0.52,{'hf': 'nomic-ai/nomic-embed-text-v1.5'}


### **8. Document Embedding processing and Ingestion**

This example uses a `QdrantVectorStore` and creates a new collection to work fully connected with Qdrant but you can use whatever LlamaIndex application you like.

In [25]:
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from phoenix.trace import suppress_tracing
## Uncomment it if you'd like to use FastEmbed instead of OpenAI
## For the complete list of supported models,
##please check https://qdrant.github.io/fastembed/examples/Supported_Models/
from llama_index.embeddings.fastembed import FastEmbedEmbedding

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

##Uncomment if using FastEmbed
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Uncomment it if you'd like to use OpenAI Embeddings instead of FastEmbed
#Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

Settings.llm = OpenAI(model="gpt-4-1106-preview", temperature=0.0)

with suppress_tracing():
  index = VectorStoreIndex.from_documents(
      documents,
      storage_context=storage_context,
      show_progress=True
  )

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/4431 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/335 [00:00<?, ?it/s]

### **8a. Connecting to existing Collection**

This example uses a `QdrantVectorStore` and uses the previously generated collection to work fully connected with Qdrant.

In [18]:
## Uncomment it if using an existing collection
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

In [26]:
client.count(collection_name=COLLECTION_NAME)

CountResult(count=4431)

### **9.Running an example query and printing out the response.**

In [27]:
##Initialise retriever to interact with the Qdrant collection
retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=5
)

In [28]:
response = retriever.retrieve("What is quantization?")
for i, node in enumerate(response):
    print(i + 1, node.text, end="\n\n")

1 ---

title: Quantization

weight: 120

aliases:

  - ../quantization

---



# Quantization



Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.

By transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors.

Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.

2 Quantum quantization is a novel approach that leverages the power of quantum computing to speed up the search process in ANNs. By converting traditional float32 vectors into qbit vectors, we can create quantum entanglement between the qbits. Quantum entanglement is a unique phenomenon in which the states of two or more particles become interdependent, regardless of the distance between them. This property of quantum systems can be harnessed to create highly efficient vector search algorithms.

3 Quantization

In [29]:
response

[NodeWithScore(node=TextNode(id_='df39c370-ba20-4e50-8353-6e58202253ca', embedding=None, metadata={'source': 'documentation/guides/quantization.md', 'start_index': 0}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='351ec4ed-96a0-4ede-98ea-8dba39aaa0c4', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'source': 'documentation/guides/quantization.md', 'start_index': 0}, hash='240f864edcd69917078ac2bc6629b1b03ad8c8601ccee13a0e901fab43f94a63'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='8bcb0b87-08eb-454d-b3b6-b17d808b31e7', node_type=<ObjectType.TEXT: '1'>, metadata={'source': 'documentation/guides/installation.md', 'start_index': 7919}, hash='0b833f3c854a8ab5b29cffcf4a534546f07575e77c51742cc5a47fc20c551961'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='04fb059d-26f2-4f5d-97e2-36159f0745be', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='fe388f2da43760b1a146b3d0ca323ff

In [30]:
# We can view the above data in the UI
px.active_session().view()

📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/


### **10. Run Your Query Engine and View Your Traces in Phoenix**

We've compiled a list of the baseline questions about Qdrant. Let's download the sample queries and take a look.

In [31]:
## Loading the Eval dataset
from datasets import load_dataset
qdrant_qa = load_dataset("atitaarora/qdrant_doc_qna", split="train")
qdrant_qa_question = qdrant_qa.select_columns(['question'])

In [32]:
qdrant_qa_question['question'][:10]

['What is the purpose of oversampling in Qdrant search process?',
 'How does Qdrant address the search accuracy problem in comparison to other search engines using HNSW?',
 'What is the difference between regular and neural search?',
 'How can I use Qdrant as a vector store in Langchain Go?',
 'How did Dust leverage compression features in Qdrant to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively?',
 'Why do we still need keyword search?',
 'What principles did Qdrant follow while designing benchmarks for vector search engines?',
 'What models does Qdrant support for embedding generation?',
 'How can you parallelize the upload of a large dataset using shards in Qdrant?',
 'What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?']

In [33]:
query_engine = index.as_query_engine()
for query in tqdm(qdrant_qa_question['question'][:10]):
    try:
      query_engine.query(query)
    except Exception as e:
      pass

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:16<00:00,  7.64s/it]


Check the Phoenix UI as your queries run. Your traces should appear in real time.

Open the Phoenix UI with the link below if you haven't already and click through the queries to better understand how the query engine is performing. For each trace you will see a break

Phoenix can be used to understand and troubleshoot your by surfacing:
 - **Application latency** - highlighting slow invocations of LLMs, Retrievers, etc.
 - **Token Usage** - Displays the breakdown of token usage with LLMs to surface up your most expensive LLM calls
 - **Runtime Exceptions** - Critical runtime exceptions such as rate-limiting are captured as exception events.
 - **Retrieved Documents** - view all the documents retrieved during a retriever call and the score and order in which they were returned
 - **Embeddings** - view the embedding text used for retrieval and the underlying embedding model
LLM Parameters - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
 - **Prompt Templates** - Figure out what prompt template is used during the prompting step and what variables were used.
 - **Tool Descriptions** - view the description and function signature of the tools your LLM has been given access to
 - **LLM Function Calls** - if using OpenAI or other a model with function calls, you can view the function selection and function messages in the input messages to the LLM.

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/RAG_trace_details.png" alt="Trace Details View on Phoenix" style="width:100%; height:auto;">

In [34]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


### **11. Export and Evaluate Your Trace Data**
You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export our retriever spans into two separate dataframes:

queries_df, in which the retrieved documents for each query are concatenated into a single column, retrieved_documents_df, in which each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation. This will enable us to compute multiple kinds of evaluations, including:

relevance: Are the retrieved documents grounded in the response? Q&A correctness: Are your application's responses grounded in the retrieved context? hallucinations: Is your application making up false information?

In [35]:
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())

In [36]:
queries_df

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6af1f0cf3b4b91ea,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"Maximizing the distance between all points in the response when utilizing vector similarity for diversity search is significant because it ensures that the algorithm outputs a set of results that are as dissimilar from each other as possible. This approach is useful for creating a diverse selection from a collection, which can be beneficial in various applications where variety is desired, such as in recommendation systems or when trying to cover a wide range of topics or features within a dataset. By focusing on maximizing distances, the need for manual labeling or categorization is reduced, as the diversity is achieved through the inherent differences in the data points' vector representations.","{{< figure width=80% src=/articles_data/vector-similarity-beyond-search/diversity-force.png caption=""Example of similarity-based sampling"" >}}\n\n\n\n\n\nThe power of vector similarity, in the context of being able to compare any two points, allows making a diverse selection of the collection possible without any labeling efforts.\n\nBy maximizing the distance between all points in the response, we can have an algorithm that will sequentially output dissimilar results.\n\nDiversity search is a method for finding the most distinctive examples in the data.\n\nAs similarity search, it also operates on embeddings and measures the distances between them.\n\nThe difference lies in deciding which point should be extracted next.\n\n\n\nLet's imagine how to get 3 points with similarity search and then with diversity search.\n\n\n\nSimilarity:\n\n1. Calculate distance matrix\n\n2. Choose your anchor\n\n3. Get a vector corresponding to the distances from the selected anchor from the distance ..."
16bc404cafe31afb,How can you parallelize the upload of a large dataset using shards in Qdrant?,"You can parallelize the upload of a large dataset in Qdrant by creating multiple shards for each collection. Each shard operates with its own Write-Ahead-Log (WAL), which is responsible for ordering operations. By having multiple shards, you can distribute the upload process across them, which allows for parallel data ingestion. A reasonable number of shards to create per machine ranges from 2 to 4. To set up a collection with multiple shards, you would specify the ""shard_number"" in your collection creation request. For example, to create a collection with 2 shards, you would include `""shard_number"": 2` in the PUT request to the `/collections/{collection_name}` endpoint.","## Parallel upload into multiple shards\n\n\n\nIn Qdrant, each collection is split into shards. Each shard has a separate Write-Ahead-Log (WAL), which is responsible for ordering operations.\n\nBy creating multiple shards, you can parallelize upload of a large dataset. From 2 to 4 shards per one machine is a reasonable number.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 768,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 2\n\n}\n\n```\n\n\n\n```python\n\nsetup with distributed deployment out of the box. This, combined with sharding, enables you to horizontally scale \n\nboth the size of your collections and the throughput of your cluster. This means that you can use Qdrant to handle \n\nlarge amounts of data without sacrificing performance or reliability.\n\n\n\n## Administration API\n\n\n\nAnother new feature is the administration API, which allows you to disable write operations to the service. This is"
861c06c418766e7a,What models does Qdrant support for embedding generation?,"Qdrant does not support embedding generation by itself; it requires external models to generate embeddings before they can be stored in the vector database. The models mentioned for embedding generation are the Mpnet model, various multilingual models, and the All-MiniLM-L6-V2 model from the SentenceTransformers package.",". So we did a lot of experiments. We used, I think, Mpnet model and a lot of multilingual models as well. But after doing those experiments, we realized that this is the best model that offers the best balance between speed and accuracy cool of the Embeddings. So we have deployed it in a serverless inference endpoint in SageMaker. And once we generate the Embeddings in a glue job, we then store them into the vector database Qdrant.\n\nSince Qdrant doesn't embed by itself, I had to decide on an embedding model. The prior version used the [SentenceTransformers](https://www.sbert.net/) package, which in turn employs Bert-based [All-MiniLM-L6-V2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main) model. This model is battle-tested and delivers fair results at speed, so not experimenting on this front I took an [ONNX version](https://huggingface.co/optimum/all-MiniLM-L6-v2/tree/main) and ran that within the service."
8896e9be994a49f8,What principles did Qdrant follow while designing benchmarks for vector search engines?,"The principles that Qdrant followed while designing benchmarks for vector search engines are not explicitly stated in the provided context information. The context only indicates that Qdrant has created the first comparative benchmark and benchmarking framework for vector search engines, and that an article will compare Qdrant's performance against other vector search engines. To understand the specific principles Qdrant followed, one would need to refer to additional information or documentation that details their benchmarking methodology and design considerations.","preview_image: /benchmarks/benchmark-1.png\n\nseo_schema: { ""@context"": ""https://schema.org"", ""@type"": ""Article"", ""headline"": ""Vector Search Comparative Benchmarks"", ""image"": [ ""https://qdrant.tech/benchmarks/benchmark-1.png"" ], ""abstract"": ""The first comparative benchmark and benchmarking framework for vector search engines"", ""datePublished"": ""2022-08-23"", ""dateModified"": ""2022-08-23"", ""author"": [{ ""@type"": ""Organization"", ""name"": ""Qdrant"", ""url"": ""https://qdrant.tech"" }] }\n\n \n\n---\n\n. In this article, we will compare how Qdrant performs against the other vector search engines."
8969248e1b4c6205,Why do we still need keyword search?,"Keyword-based search was the standard for search engines historically because it was the only option available. Despite facing common issues, efforts were made to improve its effectiveness by implementing strategies such as converting words into their root forms and removing stopwords. These enhancements aimed to make search engines more user-friendly. The fundamental concept of keyword search has been around since the Middle Ages and has been refined over time to better meet user needs.","2. Vector search with keyword-based search. This one is covered in this article.\n\n3. A mix of dense and sparse vectors. That strategy will be covered in the upcoming article.\n\n\n\n## Why do we still need keyword search?\n\n\n\nA keyword-based search was the obvious choice for search engines in the past. It struggled with some\n\ncommon issues, but since we didn't have any alternatives, we had to overcome them with additional\n\n. We also started converting words into their root forms to cover more cases, removing stopwords, etc. Effectively we were becoming more and more user-friendly. Still, the idea behind the whole process is derived from the most straightforward keyword-based search known since the Middle Ages, with some tweaks."
9fe19aa1d622b9a6,How did Dust leverage compression features in Qdrant to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively?,Dust leveraged the control of the MMAP payload threshold and Scalar Quantization to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively. This approach allowed them to scale smoothly and maintain good performance even when RAM was fully utilized.,"compression features](https://qdrant.tech/documentation/guides/quantization/). In particular, Dust leveraged the control of the [MMAP\n\npayload threshold](https://qdrant.tech/documentation/concepts/storage/#configuring-memmap-storage) as well as [Scalar Quantization](https://qdrant.tech/articles/scalar-quantization/), which enabled Dust to manage\n\nthe balance between storing vectors on disk and keeping quantized vectors in RAM,\n\nmore effectively. “This allowed us to scale smoothly from there,” Polu says.\n\n![“We were able to reduce the footprint of vectors in memory, which led to a significant cost reduction as\n\nwe don’t have to run lots of nodes in parallel. While being memory-bound, we were\n\nable to push the same instances further with the help of quantization. While you\n\nget pressure on MMAP in this case you maintain very good performance even if the\n\nRAM is fully used. With this we were able to reduce our cost by 2x.” - Stanislas Polu, Co-Founder of Dust](/case-st..."
7838606c9ae3bc39,How can I use Qdrant as a vector store in Langchain Go?,"To use Qdrant as a vector store in Langchain Go, you need to install the `langchain-go` project dependency by running the following command in your terminal:\n\n```bash\ngo get -u github.com/tmc/langchaingo\n```\n\nAfter installing the dependency, you can start integrating Qdrant into your Langchain Go applications. While the context does not provide a specific code example for Langchain Go, you can refer to the documentation at `https://tmc.github.io/langchaingo/docs/` for detailed instructions and usage examples.","---\n\ntitle: Langchain Go\n\nweight: 120\n\n---\n\n\n\n# Langchain Go\n\n\n\n[Langchain Go](https://tmc.github.io/langchaingo/docs/) is a framework for developing data-aware applications powered by language models in Go.\n\n\n\nYou can use Qdrant as a vector store in Langchain Go.\n\n\n\n## Setup\n\n\n\nInstall the `langchain-go` project dependency\n\n\n\n```bash\n\ngo get -u github.com/tmc/langchaingo\n\n```\n\n\n\n## Usage\n\n\n\nBefore you use the following code sample, customize the following values for your configuration:\n\n```bash\n\npip install langchain\n\n```\n\n\n\nQdrant acts as a vector index that may store the embeddings with the documents used to generate them. There are various ways \n\nhow to use it, but calling `Qdrant.from_texts` is probably the most straightforward way how to get started:\n\n\n\n```python\n\nfrom langchain.vectorstores import Qdrant\n\nfrom langchain.embeddings import HuggingFaceEmbeddings\n\n\n\nembeddings = HuggingFaceEmbeddings(\n\n model..."
2869ccb19423efe8,What is the difference between regular and neural search?,"The difference between regular and neural search lies in the underlying technology used to perform the search. Regular search typically uses keyword matching and traditional algorithms to find results that contain the same or similar words to the query. Neural search, on the other hand, leverages neural networks to understand the context and meaning behind the query, allowing it to find more relevant results even if the exact keywords are not present in the content. This can lead to improved accuracy and a more intuitive search experience, as neural search can interpret the intent and semantic relationships within the data it searches through.","In this tutorial we are going to find answers to these questions:\n\n\n\n* What is the difference between regular and neural search?\n\n* What neural networks could be used for search?\n\n* In what tasks is neural network search useful?\n\n* How to build and deploy own neural search service step-by-step?\n\n\n\n**What is neural search?**\n\nFrom web-pages search to product recommendations.\n\nFor many years, this technology didn't get much change until neural networks came into play.\n\n\n\nIn this tutorial we are going to find answers to these questions:\n\n\n\n* What is the difference between regular and neural search?\n\n* What neural networks could be used for search?\n\n* In what tasks is neural network search useful?\n\n* How to build and deploy own neural search service step-by-step?\n\n\n\n## What is neural search?"
e24aea5dff264cce,How does Qdrant address the search accuracy problem in comparison to other search engines using HNSW?,"Qdrant addresses the search accuracy problem by using a modified version of the HNSW algorithm that is compatible with the use of filters during a search. This approach eliminates the need for pre- or post-filtering, which can lead to issues such as a disconnected HNSW graph when too many vectors are filtered out. This modification allows Qdrant to maintain high accuracy and speed in search results. Additionally, Qdrant provides the ability to configure HNSW parameters on a collection and named vector level, which can be used to fine-tune search performance.","On top of it, there is also a problem with search accuracy.\n\nIt appears if too many vectors are filtered out, so the HNSW graph becomes disconnected.\n\n\n\nQdrant uses a different approach, not requiring pre- or post-filtering while addressing the accuracy problem.\n\nRead more about the Qdrant approach in our [Filtrable HNSW](/articles/filtrable-hnsw/) article.\n\nHNSW is chosen for several reasons.\n\nFirst, HNSW is well-compatible with the modification that allows Qdrant to use filters during a search.\n\nSecond, it is one of the most accurate and fastest algorithms, according to [public benchmarks](https://github.com/erikbern/ann-benchmarks).\n\n\n\n*Available as of v1.1.1*\n\n\n\nThe HNSW parameters can also be configured on a collection and named vector\n\nlevel by setting [`hnsw_config`](../indexing/#vector-index) to fine-tune search\n\nperformance."
99311e9c03c71f6d,What is the purpose of oversampling in Qdrant search process?,"The purpose of oversampling in the Qdrant search process is to improve the accuracy and performance of similarity search algorithms. It allows for the compression of high-dimensional vectors in memory while compensating for the accuracy loss by re-scoring additional points with the original vectors. This technique is used to control the precision of the search in real time by retrieving more vectors than needed from quantized storage and then assigning a more precise score upon re-scoring with the original vectors. From this overselection, only the vectors that are most relevant to the user's query are chosen.","### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors.\n\nYeah, so oversampling is a special technique we use to control precision of the search in real time, in query time. And the thing is, we can internally retrieve from quantized storage a bit more vectors than we actually need. And when we do rescoring with original vectors, we assign more precise score. And therefore from this overselection, we can pick only those vectors which are actually good for the user"


In [37]:
retrieved_documents_df

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ab6af76a07854549,0,7228c47d158f56ba49d4b3d9f72069ef,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"{{< figure width=80% src=/articles_data/vector-similarity-beyond-search/diversity-force.png caption=""Example of similarity-based sampling"" >}}\n\n\n\n\n\nThe power of vector similarity, in the context of being able to compare any two points, allows making a diverse selection of the collection possible without any labeling efforts.\n\nBy maximizing the distance between all points in the response, we can have an algorithm that will sequentially output dissimilar results.",0.884858
ab6af76a07854549,1,7228c47d158f56ba49d4b3d9f72069ef,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"Diversity search is a method for finding the most distinctive examples in the data.\n\nAs similarity search, it also operates on embeddings and measures the distances between them.\n\nThe difference lies in deciding which point should be extracted next.\n\n\n\nLet's imagine how to get 3 points with similarity search and then with diversity search.\n\n\n\nSimilarity:\n\n1. Calculate distance matrix\n\n2. Choose your anchor\n\n3. Get a vector corresponding to the distances from the selected anchor from the distance matrix",0.856284
bf5a72b20f7ff529,0,15df98dd17e8ccfdff644291dddb8a71,How can you parallelize the upload of a large dataset using shards in Qdrant?,"## Parallel upload into multiple shards\n\n\n\nIn Qdrant, each collection is split into shards. Each shard has a separate Write-Ahead-Log (WAL), which is responsible for ordering operations.\n\nBy creating multiple shards, you can parallelize upload of a large dataset. From 2 to 4 shards per one machine is a reasonable number.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 768,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 2\n\n}\n\n```\n\n\n\n```python",0.881565
bf5a72b20f7ff529,1,15df98dd17e8ccfdff644291dddb8a71,How can you parallelize the upload of a large dataset using shards in Qdrant?,"setup with distributed deployment out of the box. This, combined with sharding, enables you to horizontally scale \n\nboth the size of your collections and the throughput of your cluster. This means that you can use Qdrant to handle \n\nlarge amounts of data without sacrificing performance or reliability.\n\n\n\n## Administration API\n\n\n\nAnother new feature is the administration API, which allows you to disable write operations to the service. This is",0.796063
35f2880c26e61280,0,ed4015186f097820b266bff8f2d15d30,What models does Qdrant support for embedding generation?,". So we did a lot of experiments. We used, I think, Mpnet model and a lot of multilingual models as well. But after doing those experiments, we realized that this is the best model that offers the best balance between speed and accuracy cool of the Embeddings. So we have deployed it in a serverless inference endpoint in SageMaker. And once we generate the Embeddings in a glue job, we then store them into the vector database Qdrant.",0.845401
35f2880c26e61280,1,ed4015186f097820b266bff8f2d15d30,What models does Qdrant support for embedding generation?,"Since Qdrant doesn't embed by itself, I had to decide on an embedding model. The prior version used the [SentenceTransformers](https://www.sbert.net/) package, which in turn employs Bert-based [All-MiniLM-L6-V2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main) model. This model is battle-tested and delivers fair results at speed, so not experimenting on this front I took an [ONNX version](https://huggingface.co/optimum/all-MiniLM-L6-v2/tree/main) and ran that within the service.",0.832906
8d1fe45a23c35db7,0,447a7f7dd1599ca09ffd51825d2ad83c,What principles did Qdrant follow while designing benchmarks for vector search engines?,"preview_image: /benchmarks/benchmark-1.png\n\nseo_schema: { ""@context"": ""https://schema.org"", ""@type"": ""Article"", ""headline"": ""Vector Search Comparative Benchmarks"", ""image"": [ ""https://qdrant.tech/benchmarks/benchmark-1.png"" ], ""abstract"": ""The first comparative benchmark and benchmarking framework for vector search engines"", ""datePublished"": ""2022-08-23"", ""dateModified"": ""2022-08-23"", ""author"": [{ ""@type"": ""Organization"", ""name"": ""Qdrant"", ""url"": ""https://qdrant.tech"" }] }\n\n \n\n---",0.857022
8d1fe45a23c35db7,1,447a7f7dd1599ca09ffd51825d2ad83c,What principles did Qdrant follow while designing benchmarks for vector search engines?,". In this article, we will compare how Qdrant performs against the other vector search engines.",0.853702
c2e8c0ed7ba9d913,0,a7fb45e4eaa6c26721257639531a9393,Why do we still need keyword search?,"2. Vector search with keyword-based search. This one is covered in this article.\n\n3. A mix of dense and sparse vectors. That strategy will be covered in the upcoming article.\n\n\n\n## Why do we still need keyword search?\n\n\n\nA keyword-based search was the obvious choice for search engines in the past. It struggled with some\n\ncommon issues, but since we didn't have any alternatives, we had to overcome them with additional",0.795507
c2e8c0ed7ba9d913,1,a7fb45e4eaa6c26721257639531a9393,Why do we still need keyword search?,". We also started converting words into their root forms to cover more cases, removing stopwords, etc. Effectively we were becoming more and more user-friendly. Still, the idea behind the whole process is derived from the most straightforward keyword-based search known since the Middle Ages, with some tweaks.",0.762174


### **12. Define your evaluation model and your evaluators**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.

In [38]:
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df))

run_evals |          | 0/22 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/25 (0.0%) | ⏳ 00:00<? | ?it/s

Your evaluations should now appear as annotations on the appropriate spans in Phoenix.

![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)

### **13. Let's try Hybrid search now**

In [39]:
## Define a new collection to store your hybrid emebeddings
COLLECTION_NAME_HYBRID = "qdrant_docs_arize_hybrid"

In [40]:
##Reprocess documents with different settings if needed 
#documents = process_document_chunks(dataset , CHUNK_SIZE , CHUNK_OVERLAP)

In [41]:
##List of supported sparse vector models
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding
SparseTextEmbedding.list_supported_models()

[{'model': 'prithvida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Misspelled version of the model. Retained for backward compatibility. Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}},
 {'model': 'prithivida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}}]

### **14. Ingest Sparse and Dense vectors into Qdrant**

Ingest sparse and dense vectors into Qdrant Collection.
We are using Splade++ model for Sparse Vector Model and default Fastembed model - bge-small-en-1.5 for dense embeddings. 

In [45]:
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding, SparseEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from typing import List, Tuple

sparse_model_name = "prithivida/Splade_PP_en_v1"

# This triggers the model download
sparse_model = SparseTextEmbedding(model_name=sparse_model_name, batch_size=32)

batch_size = 10
parallel = 0

## Computing sparse vectors
def compute_sparse_vectors(
    texts: List[str],
    ) -> Tuple[List[List[int]], List[List[float]]]:
    indices, values = [], []
    for embedding in sparse_model.embed(texts):
        indices.append(embedding.indices.tolist())
        values.append(embedding.values.tolist())
    return indices, values

## Creating a vector store with Hybrid search enabled
vector_store = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME_HYBRID,
    enable_hybrid=True,
    sparse_doc_fn=compute_sparse_vectors,
    sparse_query_fn=compute_sparse_vectors)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Ingesting sparse and dense vectors into Qdrant collection
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True
)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/133 [00:00<?, ?B/s]

model.onnx:   0%|          | 0.00/532M [00:00<?, ?B/s]

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/4431 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/335 [00:00<?, ?it/s]

In [46]:
## collection level operations
client.get_collection(COLLECTION_NAME_HYBRID)
#client.delete_collection(COLLECTION_NAME_HYBRID)

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=8862, indexed_vectors_count=4429, points_count=4431, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors={'text-dense': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors={'text-sparse': SparseVectorParams(index=SparseIndexParams(full_scan_threshold=None, on_disk=None))}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_thread

In [47]:
## Check the number of documents matches the expected number of document chunks 
client.count(collection_name=COLLECTION_NAME_HYBRID)

CountResult(count=4431)

### **15. Hybrid Search with Qdrant**

In [59]:
## Initialise Hybrid Vector Store 
vector_store_hybrid = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME_HYBRID,
    enable_hybrid=True,
    batch_size=20,  # this is important for the ingestion
)

## Followed by initialising index for interacting with the Hybrid Collection in Qdrant

hybrid_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store_hybrid,
    storage_context=storage_context,
)

In [53]:
!pip freeze | grep transformers

/Users/atitaarora/.zshenv:2: no such file or directory: /Users/atitaarora/qdrant/workspace/qdrant-rag-eval/workshop-rag-eval-qdrant-arize/arize-eval/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Users/atitaarora/.cargo/bin=/Users/atitaarora/qdrant/workspace/qdrant-rag-eval/workshop-rag-eval-qdrant-arize/arize-eval/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Us

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


transformers==4.39.3


In [50]:
##TODO add this to poetry
#!pip install "transformers[torch]"

In [60]:
## Before moving further , lets try Sparse Vector Search Retreiver 
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever

sparse_retriever = VectorIndexRetriever(
    index=hybrid_index,
    vector_store_query_mode=VectorStoreQueryMode.SPARSE,
    sparse_top_k=2,
)

## Pure sparse vector search
nodes = sparse_retriever.retrieve("What is quantization?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces.

In the context of the Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.



There are tradeoffs associated with quantization.

On the one hand, quantization allows for significant reductions in storage requirements and faster search times.

2 ---

title: Quantization

weight: 120

aliases:

  - ../quantization

---



# Quantization



Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.

By transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors.

Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.



In [61]:
## Let's try Hybrid Search Retriever now
hybrid_retriever = VectorIndexRetriever(
    index=hybrid_index,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    sparse_top_k=2,
    similarity_top_k=5,
    alpha=0.1,
)

nodes = hybrid_retriever.retrieve("What is quantization?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces.

In the context of the Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.



There are tradeoffs associated with quantization.

On the one hand, quantization allows for significant reductions in storage requirements and faster search times.

2 ---

title: Quantization

weight: 120

aliases:

  - ../quantization

---



# Quantization



Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.

By transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors.

Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.

3 Quantum quantization is a no

In [58]:
# We shouldn't be modifying the alpha parameter after the retriever has been created
# but that's the easiest way to show the effect of the parameter
#hybrid_retriever._alpha = 0.1
hybrid_retriever._alpha = 0.9

nodes = hybrid_retriever.retrieve("What is quantization?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 ---

title: Quantization

weight: 120

aliases:

  - ../quantization

---



# Quantization



Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.

By transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors.

Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.

2 Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces.

In the context of the Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.



There are tradeoffs associated with quantization.

On the one hand, quantization allows for significant reductions in storage requirements and faster search times.

3 Quantum quantization is a no

### **16. Re-Run Your Query Engine and View Your Traces in Phoenix**

Let's rerun the list of the baseline questions about Qdrant on the Hybrid Retriever. 

In [75]:
## Switching phoenix project space
from phoenix.trace import using_project

# Switch project to run evals
with using_project(HYBRID_RAG_PROJECT):
# all spans created within this context will be associated with the `HYBRID_RAG_PROJECT` project.

    ##Reuse the previously loaded dataset `qdrant_qa_question`

    query_engine_hybrid = hybrid_index.as_query_engine()
    for query in tqdm(qdrant_qa_question['question'][:10]):
        try:
          query_engine_hybrid.query(query)
        except Exception as e:
          pass





  0%|                                                                                                                                              | 0/10 [00:00<?, ?it/s][A[A[A[A



 10%|█████████████▍                                                                                                                        | 1/10 [00:06<01:02,  6.92s/it][A[A[A[A



 20%|██████████████████████████▊                                                                                                           | 2/10 [00:10<00:41,  5.19s/it][A[A[A[A



 30%|████████████████████████████████████████▏                                                                                             | 3/10 [00:17<00:39,  5.61s/it][A[A[A[A



 40%|█████████████████████████████████████████████████████▌                                                                                | 4/10 [00:22<00:33,  5.65s/it][A[A[A[A



 50%|█████████████████████████████████████████████████████████████

In [76]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


In [77]:
## Switching phoenix project space
from phoenix.trace import using_project

# Switch project to run evals
with using_project(HYBRID_RAG_PROJECT):
    queries_df_hybrid = get_qa_with_reference(px.Client())
    retrieved_documents_df_hybrid = get_retrieved_documents(px.Client())

In [78]:
queries_df_hybrid

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
15e565515ffd3b11,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"Maximizing the distance between all points in the response when utilizing vector similarity for diversity search is significant because it ensures that the algorithm selects a set of results that are as dissimilar from each other as possible. This approach is used to create a diverse selection from a collection, which is particularly useful when the goal is to represent the broadest possible range of data without redundancy. By focusing on maximizing distances, the algorithm avoids clustering around similar points and instead provides a more varied and representative sample of the dataset. This is beneficial in scenarios where diversity is desired, such as in recommendation systems, content discovery, or when trying to understand the full scope of a dataset without the need for manual labeling.","{{< figure width=80% src=/articles_data/vector-similarity-beyond-search/diversity-force.png caption=""Example of similarity-based sampling"" >}}\n\n\n\n\n\nThe power of vector similarity, in the context of being able to compare any two points, allows making a diverse selection of the collection possible without any labeling efforts.\n\nBy maximizing the distance between all points in the response, we can have an algorithm that will sequentially output dissimilar results.\n\nDiversity search is a method for finding the most distinctive examples in the data.\n\nAs similarity search, it also operates on embeddings and measures the distances between them.\n\nThe difference lies in deciding which point should be extracted next.\n\n\n\nLet's imagine how to get 3 points with similarity search and then with diversity search.\n\n\n\nSimilarity:\n\n1. Calculate distance matrix\n\n2. Choose your anchor\n\n3. Get a vector corresponding to the distances from the selected anchor from the distance ..."
d6033f6352905182,How can you parallelize the upload of a large dataset using shards in Qdrant?,"To parallelize the upload of a large dataset using shards in Qdrant, you can create multiple shards within a collection. Each shard operates with its own Write-Ahead-Log (WAL), which is responsible for ordering operations. By having multiple shards, you can distribute the upload process across these shards, allowing for concurrent data ingestion. A reasonable number of shards to create per machine ranges from 2 to 4. When setting up your collection, you can specify the number of shards using the ""shard_number"" parameter in your collection creation request. Here is an example of how to set the shard number to 2 in a collection creation request:\n\n```http\nPUT /collections/{collection_name}\n{\n ""vectors"": {\n ""size"": 768,\n ""distance"": ""Cosine""\n },\n ""shard_number"": 2\n}\n```","## Parallel upload into multiple shards\n\n\n\nIn Qdrant, each collection is split into shards. Each shard has a separate Write-Ahead-Log (WAL), which is responsible for ordering operations.\n\nBy creating multiple shards, you can parallelize upload of a large dataset. From 2 to 4 shards per one machine is a reasonable number.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 768,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 2\n\n}\n\n```\n\n\n\n```python\n\nsetup with distributed deployment out of the box. This, combined with sharding, enables you to horizontally scale \n\nboth the size of your collections and the throughput of your cluster. This means that you can use Qdrant to handle \n\nlarge amounts of data without sacrificing performance or reliability.\n\n\n\n## Administration API\n\n\n\nAnother new feature is the administration API, which allows you to disable write operations to the service. This is"
1091e771fd80e9b7,What models does Qdrant support for embedding generation?,"Qdrant does not support embedding generation by itself; it requires external models to generate embeddings before they can be stored in the vector database. The models mentioned for embedding generation are the Mpnet model, various multilingual models, and the All-MiniLM-L6-V2 model from the SentenceTransformers package. An ONNX version of the All-MiniLM-L6-V2 model is also mentioned as being used within the service.",". So we did a lot of experiments. We used, I think, Mpnet model and a lot of multilingual models as well. But after doing those experiments, we realized that this is the best model that offers the best balance between speed and accuracy cool of the Embeddings. So we have deployed it in a serverless inference endpoint in SageMaker. And once we generate the Embeddings in a glue job, we then store them into the vector database Qdrant.\n\nSince Qdrant doesn't embed by itself, I had to decide on an embedding model. The prior version used the [SentenceTransformers](https://www.sbert.net/) package, which in turn employs Bert-based [All-MiniLM-L6-V2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main) model. This model is battle-tested and delivers fair results at speed, so not experimenting on this front I took an [ONNX version](https://huggingface.co/optimum/all-MiniLM-L6-v2/tree/main) and ran that within the service."
be59a382b2fbca25,What principles did Qdrant follow while designing benchmarks for vector search engines?,"The principles that Qdrant followed while designing benchmarks for vector search engines are not explicitly stated in the provided context information. The context only indicates that Qdrant has created the first comparative benchmark and benchmarking framework for vector search engines and that an article will compare Qdrant's performance against other vector search engines. To understand the specific principles Qdrant followed, one would need to refer to additional information or documentation that details their benchmarking methodology and design considerations.","preview_image: /benchmarks/benchmark-1.png\n\nseo_schema: { ""@context"": ""https://schema.org"", ""@type"": ""Article"", ""headline"": ""Vector Search Comparative Benchmarks"", ""image"": [ ""https://qdrant.tech/benchmarks/benchmark-1.png"" ], ""abstract"": ""The first comparative benchmark and benchmarking framework for vector search engines"", ""datePublished"": ""2022-08-23"", ""dateModified"": ""2022-08-23"", ""author"": [{ ""@type"": ""Organization"", ""name"": ""Qdrant"", ""url"": ""https://qdrant.tech"" }] }\n\n \n\n---\n\n. In this article, we will compare how Qdrant performs against the other vector search engines."
5072852478af9880,Why do we still need keyword search?,"Keyword-based search was the standard for search engines historically because it was the only option available at the time. Despite facing common issues, efforts were made to improve its effectiveness by implementing strategies such as converting words into their root forms and removing stopwords to make the search process more user-friendly. The fundamental concept of keyword search has been around since the Middle Ages and has been refined over time to better meet user needs.","2. Vector search with keyword-based search. This one is covered in this article.\n\n3. A mix of dense and sparse vectors. That strategy will be covered in the upcoming article.\n\n\n\n## Why do we still need keyword search?\n\n\n\nA keyword-based search was the obvious choice for search engines in the past. It struggled with some\n\ncommon issues, but since we didn't have any alternatives, we had to overcome them with additional\n\n. We also started converting words into their root forms to cover more cases, removing stopwords, etc. Effectively we were becoming more and more user-friendly. Still, the idea behind the whole process is derived from the most straightforward keyword-based search known since the Middle Ages, with some tweaks."
15428eaf64fa2be9,How did Dust leverage compression features in Qdrant to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively?,Dust leveraged the control of the MMAP payload threshold and Scalar Quantization to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively. This approach allowed them to scale smoothly and maintain good performance even when RAM was fully utilized.,"compression features](https://qdrant.tech/documentation/guides/quantization/). In particular, Dust leveraged the control of the [MMAP\n\npayload threshold](https://qdrant.tech/documentation/concepts/storage/#configuring-memmap-storage) as well as [Scalar Quantization](https://qdrant.tech/articles/scalar-quantization/), which enabled Dust to manage\n\nthe balance between storing vectors on disk and keeping quantized vectors in RAM,\n\nmore effectively. “This allowed us to scale smoothly from there,” Polu says.\n\n![“We were able to reduce the footprint of vectors in memory, which led to a significant cost reduction as\n\nwe don’t have to run lots of nodes in parallel. While being memory-bound, we were\n\nable to push the same instances further with the help of quantization. While you\n\nget pressure on MMAP in this case you maintain very good performance even if the\n\nRAM is fully used. With this we were able to reduce our cost by 2x.” - Stanislas Polu, Co-Founder of Dust](/case-st..."
2ec4818782c0c9b2,How can I use Qdrant as a vector store in Langchain Go?,"To use Qdrant as a vector store in Langchain Go, you need to install the `langchain-go` project dependency by running the following command in your terminal:\n\n```bash\ngo get -u github.com/tmc/langchaingo\n```\n\nAfter installing the dependency, you can integrate Qdrant into your Langchain Go application. While the context does not provide a specific code example for using Qdrant with Langchain Go, you would typically need to import the necessary packages and configure Qdrant as your vector store within your Go application.","---\n\ntitle: Langchain Go\n\nweight: 120\n\n---\n\n\n\n# Langchain Go\n\n\n\n[Langchain Go](https://tmc.github.io/langchaingo/docs/) is a framework for developing data-aware applications powered by language models in Go.\n\n\n\nYou can use Qdrant as a vector store in Langchain Go.\n\n\n\n## Setup\n\n\n\nInstall the `langchain-go` project dependency\n\n\n\n```bash\n\ngo get -u github.com/tmc/langchaingo\n\n```\n\n\n\n## Usage\n\n\n\nBefore you use the following code sample, customize the following values for your configuration:\n\n```bash\n\npip install langchain\n\n```\n\n\n\nQdrant acts as a vector index that may store the embeddings with the documents used to generate them. There are various ways \n\nhow to use it, but calling `Qdrant.from_texts` is probably the most straightforward way how to get started:\n\n\n\n```python\n\nfrom langchain.vectorstores import Qdrant\n\nfrom langchain.embeddings import HuggingFaceEmbeddings\n\n\n\nembeddings = HuggingFaceEmbeddings(\n\n model..."
fabee3aeeb75ea50,What is the difference between regular and neural search?,"The difference between regular and neural search lies in the underlying technology used to perform the search. Regular search typically uses keyword matching and traditional algorithms to find results that contain the same or similar words to those in the query. Neural search, on the other hand, leverages neural networks to understand the context and semantics of the query, providing more relevant results by interpreting the intent behind the search terms rather than relying solely on keyword matching.","In this tutorial we are going to find answers to these questions:\n\n\n\n* What is the difference between regular and neural search?\n\n* What neural networks could be used for search?\n\n* In what tasks is neural network search useful?\n\n* How to build and deploy own neural search service step-by-step?\n\n\n\n**What is neural search?**\n\nFrom web-pages search to product recommendations.\n\nFor many years, this technology didn't get much change until neural networks came into play.\n\n\n\nIn this tutorial we are going to find answers to these questions:\n\n\n\n* What is the difference between regular and neural search?\n\n* What neural networks could be used for search?\n\n* In what tasks is neural network search useful?\n\n* How to build and deploy own neural search service step-by-step?\n\n\n\n## What is neural search?"
82f0ed3dffe39306,How does Qdrant address the search accuracy problem in comparison to other search engines using HNSW?,"Qdrant addresses the search accuracy problem by using a modified version of the HNSW algorithm that is compatible with the use of filters during a search without requiring pre- or post-filtering. This approach allows Qdrant to maintain high accuracy and speed, as evidenced by public benchmarks. Additionally, Qdrant offers the ability to configure HNSW parameters on a collection and named vector level, which can be used to fine-tune search performance.","On top of it, there is also a problem with search accuracy.\n\nIt appears if too many vectors are filtered out, so the HNSW graph becomes disconnected.\n\n\n\nQdrant uses a different approach, not requiring pre- or post-filtering while addressing the accuracy problem.\n\nRead more about the Qdrant approach in our [Filtrable HNSW](/articles/filtrable-hnsw/) article.\n\nHNSW is chosen for several reasons.\n\nFirst, HNSW is well-compatible with the modification that allows Qdrant to use filters during a search.\n\nSecond, it is one of the most accurate and fastest algorithms, according to [public benchmarks](https://github.com/erikbern/ann-benchmarks).\n\n\n\n*Available as of v1.1.1*\n\n\n\nThe HNSW parameters can also be configured on a collection and named vector\n\nlevel by setting [`hnsw_config`](../indexing/#vector-index) to fine-tune search\n\nperformance."
cbf315dd14863caf,What is the purpose of oversampling in Qdrant search process?,"The purpose of oversampling in the Qdrant search process is to improve the accuracy and performance of similarity search algorithms. It allows for significant compression of high-dimensional vectors in memory while compensating for the accuracy loss by re-scoring additional points with the original vectors. This technique is used to control the precision of the search in real-time, by retrieving more vectors than needed from quantized storage and then assigning a more precise score during re-scoring. From this overselection, only the vectors that are most relevant to the user are chosen.","### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors.\n\nYeah, so oversampling is a special technique we use to control precision of the search in real time, in query time. And the thing is, we can internally retrieve from quantized storage a bit more vectors than we actually need. And when we do rescoring with original vectors, we assign more precise score. And therefore from this overselection, we can pick only those vectors which are actually good for the user"


In [79]:
retrieved_documents_df_hybrid

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
acac1350b30419a7,0,d696bfe6c1d5c59504a3d36db03cebfa,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"{{< figure width=80% src=/articles_data/vector-similarity-beyond-search/diversity-force.png caption=""Example of similarity-based sampling"" >}}\n\n\n\n\n\nThe power of vector similarity, in the context of being able to compare any two points, allows making a diverse selection of the collection possible without any labeling efforts.\n\nBy maximizing the distance between all points in the response, we can have an algorithm that will sequentially output dissimilar results.",0.884858
acac1350b30419a7,1,d696bfe6c1d5c59504a3d36db03cebfa,What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?,"Diversity search is a method for finding the most distinctive examples in the data.\n\nAs similarity search, it also operates on embeddings and measures the distances between them.\n\nThe difference lies in deciding which point should be extracted next.\n\n\n\nLet's imagine how to get 3 points with similarity search and then with diversity search.\n\n\n\nSimilarity:\n\n1. Calculate distance matrix\n\n2. Choose your anchor\n\n3. Get a vector corresponding to the distances from the selected anchor from the distance matrix",0.856284
1e618386a651dd3d,0,1ced6331fee751b540b0bc1b4185f566,How can you parallelize the upload of a large dataset using shards in Qdrant?,"## Parallel upload into multiple shards\n\n\n\nIn Qdrant, each collection is split into shards. Each shard has a separate Write-Ahead-Log (WAL), which is responsible for ordering operations.\n\nBy creating multiple shards, you can parallelize upload of a large dataset. From 2 to 4 shards per one machine is a reasonable number.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 768,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 2\n\n}\n\n```\n\n\n\n```python",0.881565
1e618386a651dd3d,1,1ced6331fee751b540b0bc1b4185f566,How can you parallelize the upload of a large dataset using shards in Qdrant?,"setup with distributed deployment out of the box. This, combined with sharding, enables you to horizontally scale \n\nboth the size of your collections and the throughput of your cluster. This means that you can use Qdrant to handle \n\nlarge amounts of data without sacrificing performance or reliability.\n\n\n\n## Administration API\n\n\n\nAnother new feature is the administration API, which allows you to disable write operations to the service. This is",0.796063
cd4b03b790d138f7,0,0f79ae1cdb4939de42c2a640c80c1dd1,What models does Qdrant support for embedding generation?,". So we did a lot of experiments. We used, I think, Mpnet model and a lot of multilingual models as well. But after doing those experiments, we realized that this is the best model that offers the best balance between speed and accuracy cool of the Embeddings. So we have deployed it in a serverless inference endpoint in SageMaker. And once we generate the Embeddings in a glue job, we then store them into the vector database Qdrant.",0.845401
...,...,...,...,...,...
b5fa4a67b747eca0,0,bce3b2cf798ea5f3f5c7568e4a25f08f,What is quantization?,"---\n\ntitle: Quantization\n\nweight: 120\n\naliases:\n\n - ../quantization\n\n---\n\n\n\n# Quantization\n\n\n\nQuantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors.\n\nBy transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors.\n\nDifferent quantization methods have different mechanics and tradeoffs. We will cover them in this section.",0.858141
b5fa4a67b747eca0,1,bce3b2cf798ea5f3f5c7568e4a25f08f,What is quantization?,"Quantum quantization is a novel approach that leverages the power of quantum computing to speed up the search process in ANNs. By converting traditional float32 vectors into qbit vectors, we can create quantum entanglement between the qbits. Quantum entanglement is a unique phenomenon in which the states of two or more particles become interdependent, regardless of the distance between them. This property of quantum systems can be harnessed to create highly efficient vector search algorithms.",0.814146
b5fa4a67b747eca0,2,bce3b2cf798ea5f3f5c7568e4a25f08f,What is quantization?,"Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces.\n\nIn the context of the Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.\n\n\n\nThere are tradeoffs associated with quantization.\n\nOn the one hand, quantization allows for significant reductions in storage requirements and faster search times.",0.810805
b5fa4a67b747eca0,3,bce3b2cf798ea5f3f5c7568e4a25f08f,What is quantization?,"*Available as of v1.1.0*\n\n\n\nScalar quantization, in the context of vector search engines, is a compression technique that compresses vectors by reducing the number of bits used to represent each vector component.\n\n\n\n\n\nFor instance, Qdrant uses 32-bit floating numbers to represent the original vector components. Scalar quantization allows you to reduce the number of bits used to 8.\n\nIn other words, Qdrant performs `float32 -> uint8` conversion for each vector component.",0.789160


### **17. Define your evaluation model and your evaluators for Hybrid Search**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.

In [80]:
## Switching phoenix project space
from phoenix.trace import using_project

# Switch project to run evals
with using_project(HYBRID_RAG_PROJECT):
# all spans created within this context will be associated with the `HYBRID_RAG_PROJECT` project.
    eval_model = OpenAIModel(
        model="gpt-4-turbo-preview",
    )
    hallucination_evaluator = HallucinationEvaluator(eval_model)
    qa_correctness_evaluator = QAEvaluator(eval_model)
    relevance_evaluator = RelevanceEvaluator(eval_model)
    
    hallucination_eval_df_hybrid, qa_correctness_eval_df_hybrid = run_evals(
        dataframe=queries_df_hybrid,
        evaluators=[hallucination_evaluator, qa_correctness_evaluator],
        provide_explanation=True,
    )
    relevance_eval_df_hybrid = run_evals(
        dataframe=retrieved_documents_df_hybrid,
        evaluators=[relevance_evaluator],
        provide_explanation=True,
    )[0]
    
    px.Client().log_evaluations(
        SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df_hybrid),
        SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df_hybrid),
        HYBRID_RAG_PROJECT,
    )
    px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df_hybrid),
                                HYBRID_RAG_PROJECT)

run_evals |          | 0/56 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/74 (0.0%) | ⏳ 00:00<? | ?it/s

AttributeError: 'str' object has no attribute 'to_pyarrow_table'

In [None]:
Your evaluations should now appear as annotations on the appropriate spans in Phoenix.

![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)