<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/qdrant_arize.png" width="500"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tuning a RAG Pipeline using Qdrant and Arize Phoenix</h1>

ℹ️ This notebook requires an OpenAI API key.

### **1. Import Relevant Packages**

In [1]:
import os

# Setup projects
SIMPLE_RAG_PROJECT = "simple-rag"
os.environ["PHOENIX_PROJECT_NAME"] = SIMPLE_RAG_PROJECT

In [4]:
import datetime
import json
import os
import pickle
import ssl
import time
import urllib
from getpass import getpass
from urllib.request import urlopen

import certifi
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
import requests
from bs4 import BeautifulSoup
#from gcsfs import GCSFileSystem
from llama_index.core import (
    ServiceContext, StorageContext, download_loader,
    load_index_from_storage, set_global_handler
)
from phoenix.trace.llama_index import OpenInferenceTraceCallbackHandler
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.graph_stores.simple import SimpleGraphStore
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator, OpenAIModel, QAEvaluator,
    RelevanceEvaluator, run_evals
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import PointStruct

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", 1000)

### **2. Launch Phoenix**
You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the OpenInferenceTraceCallbackHandler. Phoenix supports LlamaIndex's one-click observability which will automatically instrument your LlamaIndex application! You can consult our integration guide for a more detailed explanation of how to instrument your LlamaIndex application.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run the LlamaIndex application).

In [5]:
session = px.launch_app()
callback_handler = OpenInferenceTraceCallbackHandler()

Existing running Phoenix instance detected! Shutting it down and starting a new instance...


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Be sure to enable Phoenix as your global handler for tracing!

In [6]:
set_global_handler("arize_phoenix")

### **3. Set up your OpenAI API key**

In [7]:
from dotenv import load_dotenv
load_dotenv()

True

In [8]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

### **4. Retrieve the documents / dataset to be used**

In [22]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("atitaarora/qdrant_doc", split="train")

In [23]:
dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='csv', dataset_name='qdrant_doc', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1767967, num_examples=240, shard_lengths=None, dataset_name='qdrant_doc')}, download_checksums={'hf://datasets/atitaarora/qdrant_doc@8d859890840f65337c38e96d660b81b1441bbecd/documents.csv': {'num_bytes': 1777260, 'checksum': None}}, download_size=1777260, post_processing_size=None, dataset_size=1767967, size_in_bytes=3545227)

### **5. Definition of global chunk properties and chunk processing**
Each document is split into chunks using the chosen text splitter together with the global **CHUNK_SIZE** and **CHUNK_OVERLAP** settings defined below.

In [24]:
## Global config for chunk processing
CHUNK_SIZE = 512 #1000
CHUNK_OVERLAP = 50

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(dataset)
]

# could showcase another variation of processed documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 240/240 [00:00<00:00, 9829.25it/s]


In [26]:
len(docs_processed)

4431
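
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter first tries to break on the listed separators). The function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny values so the overlap is easy to see
chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

In this simplified scheme, `CHUNK_SIZE = 512` and `CHUNK_OVERLAP = 50` would advance each window by 462 characters, so neighbouring chunks share 50 characters of context.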

### **6. Convert the LangChain chunks into LlamaIndex documents**

In [27]:
## Converting Langchain document chunks above into Llamaindex Document for ingestion
from llama_index.core import Document
documents = [
        Document.from_langchain_format(doc)
        for doc in docs_processed
    ]

In [28]:
len(documents)

4431

### **7. Setting up Qdrant and Collection**

We first set up the Qdrant client and choose a collection name. The collection itself is created when the documents are ingested in step 8.

In [29]:
## Option 1: initialise the Qdrant client in memory (uncomment to use)
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)

## Option 2 (active): connect to Qdrant Cloud using QDRANT_URL and QDRANT_API_KEY
client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)

## Option 3: connect to a local Qdrant instance (uncomment to use)
#client = qdrant_client.QdrantClient("http://localhost:6333")

In [30]:
## Collection Name 
COLLECTION_NAME = "qdrant_docs_arize_dense"

In [33]:
## General Collection level operations

## Get information about existing collections 
client.get_collections()

## Get information about specific collection
#collection_info = client.get_collection(COLLECTION_NAME)
#print(collection_info)

## Deleting collection, if need be
#client.delete_collection(COLLECTION_NAME)

CollectionsResponse(collections=[])

In [34]:
## Declaring the intended Embedding Model with Fastembed
from fastembed.embedding import TextEmbedding

pd.DataFrame(TextEmbedding.list_supported_models())

| | model | dim | description | size_in_GB | sources |
|---|---|---|---|---|---|
| 0 | BAAI/bge-base-en | 768 | Base English model | 0.42 | {'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz'} |
| 1 | BAAI/bge-base-en-v1.5 | 768 | Base English model, v1.5 | 0.21 | {'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz', 'hf': 'qdrant/bge-base-en-v1.5-onnx-q'} |
| 2 | BAAI/bge-large-en-v1.5 | 1024 | Large English model, v1.5 | 1.2 | {'hf': 'qdrant/bge-large-en-v1.5-onnx'} |
| 3 | BAAI/bge-small-en | 384 | Fast English model | 0.13 | {'url': 'https://storage.googleapis.com/qdrant-fastembed/BAAI-bge-small-en.tar.gz'} |
| 4 | BAAI/bge-small-en-v1.5 | 384 | Fast and Default English model | 0.067 | {'hf': 'qdrant/bge-small-en-v1.5-onnx-q'} |
| 5 | BAAI/bge-small-zh-v1.5 | 512 | Fast and recommended Chinese model | 0.09 | {'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz'} |
| 6 | sentence-transformers/all-MiniLM-L6-v2 | 384 | Sentence Transformer model, MiniLM-L6-v2 | 0.09 | {'url': 'https://storage.googleapis.com/qdrant-fastembed/sentence-transformers-all-MiniLM-L6-v2.tar.gz', 'hf': 'qdrant/all-MiniLM-L6-v2-onnx'} |
| 7 | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Sentence Transformer model, paraphrase-multilingual-MiniLM-L12-v2 | 0.22 | {'hf': 'qdrant/paraphrase-multilingual-MiniLM-L12-v2-onnx-Q'} |
| 8 | nomic-ai/nomic-embed-text-v1 | 768 | 8192 context length english model | 0.52 | {'hf': 'nomic-ai/nomic-embed-text-v1'} |
| 9 | nomic-ai/nomic-embed-text-v1.5 | 768 | 8192 context length english model | 0.52 | {'hf': 'nomic-ai/nomic-embed-text-v1.5'} |


### **8. Document Embedding processing and Ingestion**

This example uses a `QdrantVectorStore` backed by a new Qdrant collection, but you can use whatever LlamaIndex application you like.

In [35]:
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from phoenix.trace import suppress_tracing
## FastEmbed is used for embeddings here. For the complete list of supported models,
## see https://qdrant.github.io/fastembed/examples/Supported_Models/
from llama_index.embeddings.fastembed import FastEmbedEmbedding

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

## Embedding model (active): FastEmbed
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Uncomment to use OpenAI embeddings instead of FastEmbed
#Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

Settings.llm = OpenAI(model="gpt-4-1106-preview", temperature=0.0)

with suppress_tracing():
  index = VectorStoreIndex.from_documents(
      documents,
      storage_context=storage_context,
      show_progress=True
  )

### **8a. Connecting to existing Collection**

If the collection has already been populated, you can rebuild the index directly from the `QdrantVectorStore` instead of re-ingesting the documents.

In [38]:
## Use this cell when connecting to an existing collection
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)
## load_index_from_storage expects a persisted index store and raises
## "ValueError: No index in storage context" here, so build the index
## from the vector store instead. Settings.embed_model must match the
## embedding model that was used at ingestion time.
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

In [36]:
client.count(collection_name=COLLECTION_NAME)

CountResult(count=4431)

### **9. Running an example query and printing out the response**

In [37]:

retriever = VectorIndexRetriever(
    index=index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=5
)
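
As a rough illustration of what `similarity_top_k=5` asks for, the sketch below ranks toy vectors by cosine similarity and keeps the best `k`. This is an assumption-laden simplification: Qdrant performs approximate nearest-neighbour search over an index rather than a brute-force scan, the distance metric depends on how the collection is configured, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every document against the query and return the k highest-scoring ones."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = top_k([1.0, 0.0], toy_docs, k=2)
print(results)
```

Here `doc_a` points in the same direction as the query and ranks first, `doc_c` is partially aligned and ranks second, and `doc_b` is orthogonal and is dropped.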

In [None]:
response = retriever.retrieve("What is quantization?")
for i, node in enumerate(response):
    print(i + 1, node.text, end="\n\n")

1 Quantum quantization is a novel approach that leverages the power of quantum computing to speed up the search process in ANNs. By converting traditional float32 vectors into qbit vectors, we can create quantum entanglement between the qbits. Quantum entanglement is a unique phenomenon in which the states of two or more particles become interdependent, regardless of the distance between them. This property of quantum systems can be harnessed to create highly efficient vector search algorithms.

2 Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces.

In the context of the Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.



There are tradeoffs associated with quantization.

On the one hand, quantization allows for significant reductions in storage requirements and faster search times.

3 *Available a

In [None]:
response

[NodeWithScore(node=TextNode(id_='864902ad-5065-411e-a092-647f9b265ffb', embedding=None, metadata={'source': 'articles/quantum-quantization.md', 'start_index': 1259}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='00c7358d-e94c-4902-998e-c5a26d580c03', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'source': 'articles/quantum-quantization.md', 'start_index': 1259}, hash='6995d4c68367c5e2ddc6c6e93e46af14fc2bce5da24c9515aad3a31f9121fdff'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='c57bec6a-73d7-41b9-9bb0-f64fb77ed848', node_type=<ObjectType.TEXT: '1'>, metadata={'source': 'articles/quantum-quantization.md', 'start_index': 1215}, hash='8138e512d7d721a38698c779c5ba62cec91f390a0d8d1a96d6403fb9827296a2'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='85287b67-9c60-4296-b75c-f816964cd46b', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='5019a68c10071f1a30e21328ba697e1ff8dac

In [None]:
# We can view the above data in the UI
px.active_session().view()

📺 Opening a view to the Phoenix app. The app is running at https://dea13fsaqp42-496ff2e9c6d22116-6006-colab.googleusercontent.com/


### **10. Run Your Query Engine and View Your Traces in Phoenix**

We've compiled a list of commonly asked questions about Qdrant. Let's download the sample queries and take a look.

In [None]:
## Loading the Eval dataset
from datasets import load_dataset
qdrant_qa = load_dataset("atitaarora/qdrant_docs_qna_ragas", split="train")
qdrant_qa_question = qdrant_qa.select_columns(['question'])

In [None]:
qdrant_qa_question['question'][:10]

['What is the purpose of oversampling in Qdrant search process?',
 'How does Qdrant address the search accuracy problem in comparison to other search engines using HNSW?',
 'What is the difference between regular and neural search?',
 'How can I use Qdrant as a vector store in Langchain Go?',
 'How did Dust leverage compression features in Qdrant to manage the balance between storing vectors on disk and keeping quantized vectors in RAM effectively?',
 'Why do we still need keyword search?',
 'What principles did Qdrant follow while designing benchmarks for vector search engines?',
 'What models does Qdrant support for embedding generation?',
 'How can you parallelize the upload of a large dataset using shards in Qdrant?',
 'What is the significance of maximizing the distance between all points in the response when utilizing vector similarity for diversity search?']

In [None]:
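query_engine = index.as_query_engine()
for query in tqdm(qdrant_qa_question['question'][:10]):
    try:
        # Retrieval only; swap in query_engine.query(query) to also generate LLM answers
        retriever.retrieve(query)
    except Exception as e:
        # Report failures instead of silently swallowing them
        print(f"Query failed: {query!r} ({e})")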

100%|██████████| 10/10 [00:04<00:00,  2.49it/s]


Check the Phoenix UI as your queries run. Your traces should appear in real time.

Open the Phoenix UI with the link below if you haven't already and click through the queries to better understand how the query engine is performing. For each trace you will see a breakdown of the steps involved.

Phoenix can be used to understand and troubleshoot your application by surfacing:
 - **Application latency** - highlights slow invocations of LLMs, Retrievers, etc.
 - **Token Usage** - displays the breakdown of token usage with LLMs to surface your most expensive LLM calls
 - **Runtime Exceptions** - critical runtime exceptions such as rate-limiting are captured as exception events
 - **Retrieved Documents** - view all the documents retrieved during a retriever call and the score and order in which they were returned
 - **Embeddings** - view the embedding text used for retrieval and the underlying embedding model
 - **LLM Parameters** - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
 - **Prompt Templates** - figure out what prompt template is used during the prompting step and what variables were used
 - **Tool Descriptions** - view the description and function signature of the tools your LLM has been given access to
 - **LLM Function Calls** - if using OpenAI or another model with function calls, you can view the function selection and function messages in the input messages to the LLM
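
To give a feel for the kind of data behind the latency and exception views, here is a minimal sketch of timing a step and capturing any error it raises. It is not how Phoenix or OpenInference instrument an application; `timed_span` and `recorded_spans` are hypothetical names, and the in-memory list stands in for a real trace collector.

```python
import time
from contextlib import contextmanager

recorded_spans = []  # hypothetical in-memory stand-in for a trace collector


@contextmanager
def timed_span(name: str):
    """Record how long the wrapped block took and whether it raised."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)  # keep the exception as an event on the span
        raise
    finally:
        recorded_spans.append(
            {"name": name, "latency_s": time.perf_counter() - start, "error": error}
        )


with timed_span("retrieve"):
    time.sleep(0.01)  # pretend to call the retriever

try:
    with timed_span("llm"):
        raise RuntimeError("rate limited")  # pretend the LLM call failed
except RuntimeError:
    pass

for span in recorded_spans:
    print(span)
```

Sorting such records by `latency_s` or filtering on `error` gives a crude version of the latency and exception views listed above.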

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/RAG_trace_details.png" alt="Trace Details View on Phoenix" style="width:100%; height:auto;">

In [None]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: https://5ajrh2kq8qt1-496ff2e9c6d22116-6006-colab.googleusercontent.com/


### **11. Export and Evaluate Your Trace Data**
You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export our retriever spans into two separate dataframes:

 - **queries_df** - the retrieved documents for each query are concatenated into a single column
 - **retrieved_documents_df** - each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation

This will enable us to compute multiple kinds of evaluations, including:

 - **Relevance** - Are the retrieved documents relevant to the query?
 - **Q&A correctness** - Does your application's response correctly answer the question given the retrieved context?
 - **Hallucinations** - Is your application making up false information?
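
The difference between the two shapes can be sketched with plain Python structures. The data and the column names (`query`, `reference`, `document_position`, `document`) are made up for illustration and are not guaranteed to match the columns Phoenix returns.

```python
# Hypothetical retrieval results: each query maps to its retrieved chunks, in rank order
retrievals = {
    "What is quantization?": ["chunk about scalar quantization", "chunk about memory usage"],
    "Why do we still need keyword search?": ["chunk about hybrid search"],
}

# One row per query, with all retrieved documents concatenated (the queries_df shape)
per_query_rows = [
    {"query": query, "reference": "\n\n".join(docs)}
    for query, docs in retrievals.items()
]

# One row per (query, document) pair (the "exploded" retrieved_documents_df shape)
per_document_rows = [
    {"query": query, "document_position": position, "document": doc}
    for query, docs in retrievals.items()
    for position, doc in enumerate(docs)
]

print(len(per_query_rows), "query rows")
print(len(per_document_rows), "query-document rows")
```

The per-query shape suits evaluations that judge a whole response against its context, while the per-document shape lets each retrieved chunk be judged on its own.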

In [None]:
queries_df = get_qa_with_reference(session)
retrieved_documents_df = get_retrieved_documents(session)

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.

In [None]:
eval_model = OpenAIModel(
    model="gpt-4-1106-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df))

Your evaluations should now appear as annotations on the appropriate spans in Phoenix.

![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)
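
When comparing pipeline settings such as chunk size or `similarity_top_k`, it can help to roll the per-document relevance labels up into one number per query. The sketch below computes precision@k from made-up rows. The label strings `"relevant"` and `"unrelated"`, the row layout, and the function name `precision_at_k` are assumptions for illustration, so check them against your actual evaluation dataframe before reusing the idea.

```python
from collections import defaultdict

# Hypothetical per-document relevance labels, one row per (query, retrieved document)
relevance_rows = [
    {"query": "q1", "document_position": 0, "label": "relevant"},
    {"query": "q1", "document_position": 1, "label": "unrelated"},
    {"query": "q1", "document_position": 2, "label": "relevant"},
    {"query": "q2", "document_position": 0, "label": "unrelated"},
    {"query": "q2", "document_position": 1, "label": "unrelated"},
]


def precision_at_k(rows: list[dict], k: int) -> dict[str, float]:
    """Fraction of the top-k retrieved documents labelled relevant, per query.

    If a query has fewer than k documents, the fraction is taken over the
    documents that were actually retrieved.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        if row["document_position"] < k:
            totals[row["query"]] += 1
            hits[row["query"]] += int(row["label"] == "relevant")
    return {query: hits[query] / totals[query] for query in totals}


scores = precision_at_k(relevance_rows, k=2)
print(scores)
```

Averaging these per-query scores across runs gives a simple way to check whether a configuration change improved retrieval.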