# ReadtheDocs Retrieval Augmented Generation (RAG) using Zilliz Free Tier

In this notebook, we are going to use Milvus documentation pages to create a chatbot about our product.  The chatbot follows the RAG steps: it retrieves chunks of data using Semantic Vector Search, then feeds the Question + Context as a Prompt to an LLM to generate an answer.

Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model.  **In this notebook, we will demo a fully open source RAG stack.**

Using open-source retrieval for Q&A saves money because retrieval, evaluation, and development iterations all run against our own data for free.  The only paid call is to OpenAI, once, for the final chat generation step.

<div>
<img src="../../pics/rag_image.png" width="80%"/>
</div>

Let's get started!

In [1]:
# For colab install these libraries in this order:
# !pip install pymilvus langchain torch transformers python-dotenv

# Import common libraries.
import sys, os, time, pprint
import numpy as np

# Import custom functions for splitting and search.
sys.path.append("..")  # Adds higher directory to python modules path.
import milvus_utilities as _utils

## Start up a Zilliz free tier cluster.

Code in this notebook uses fully-managed Milvus on [Zilliz Cloud free trial](https://cloud.zilliz.com/login).  
  1. Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.  
  2. On the Cluster main page, copy your `API Key` and store it locally in a .env variable.  See note below how to do that.
  3. Also on the Cluster main page, copy the `Public Endpoint URI`.

💡 Note: To keep your tokens private, best practice is to use an **env variable**.  See [how to save api key in env variable](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). <br>

In Jupyter, you also need a .env file (in the same directory as the notebook) containing lines like this:
- VARIABLE_NAME=value
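
Once the variables are loaded, it helps to fail early with a readable message when a key is missing. Below is a minimal sketch using only the standard library. The helper name `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical and are not part of this notebook's utilities.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add a line like NAME=value to your .env file and reload it."
        )
    return value

# Example with a placeholder value (never hard-code a real key like this).
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the cells below, the same check could be applied to `ZILLIZ_API_KEY` and `OPENAI_API_KEY` right after the .env file is loaded.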


In [2]:
# STEP 1. CONNECT TO MILVUS

# !pip install pymilvus #python sdk for milvus
from pymilvus import connections, utility

# Jupyter notebooks:
# from dotenv import load_dotenv
# load_dotenv()
# TOKEN = os.getenv("ZILLIZ_API_KEY")

# Usual way:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
TOKEN = os.environ["ZILLIZ_API_KEY"]

# Connect to Zilliz cloud using endpoint URI and API key TOKEN.
# TODO change this.
CLUSTER_ENDPOINT="https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
  alias='default',
  #  Public endpoint obtained from Zilliz Cloud
  uri=CLUSTER_ENDPOINT,
  # API key or a colon-separated cluster username and password
  token=TOKEN,
)

# Check that the server is ready by printing its version.
print(f"Type of server: {utility.get_server_version()}")

Type of server: Zilliz Cloud Vector Database(Compatible with Milvus 2.3)


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally. 

Two model parameters of note below:
1. EMBEDDING_DIM refers to the dimensionality or length of the embedding vector. In this case, each input text is encoded into a single vector of length 1024, regardless of how long the text is: the model produces a 1024-dimensional embedding for each token and mean-pools them into one sentence embedding. This size of embedding is often associated with BERT-based models, where the embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>
2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.
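
To make the truncation issue concrete, here is a minimal sketch of overlapping chunking compared with plain truncation. It treats whitespace-separated words as "tokens", which is an assumption for illustration only: the real encoder uses a subword tokenizer, so actual token counts will differ. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, max_len, overlap):
    """Split `tokens` into windows of at most `max_len` tokens, overlapping by `overlap`."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once a window has reached the end of the input.
        if start + max_len >= len(tokens):
            break
    return chunks

# A synthetic 1200-"token" document (whitespace tokens, for illustration only).
tokens = [f"w{i}" for i in range(1200)]

truncated = tokens[:512]  # what silent truncation would keep
chunks = chunk_tokens(tokens, max_len=512, overlap=50)

print(f"tokens lost to truncation: {len(tokens) - len(truncated)}")
print(f"chunk sizes with overlap:  {[len(c) for c in chunks]}")
```

With chunking, every token ends up in at least one chunk, and the overlap keeps some shared context between neighboring chunks.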

In [3]:
# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.

# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
# Use the default CUDA device if one is available, otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
# python -m pip install -U angle-emb
model_name = "WhereIsAI/UAE-Large-V1"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() 
# # Assume tokens are 3 characters long.
# MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3
# HF_EOS_TOKEN_LENGTH = 1 * 3
# Test with 512 sequence length.
MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS
HF_EOS_TOKEN_LENGTH = 1

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu


No sentence-transformers model found with name /Users/christybergman/.cache/torch/sentence_transformers/WhereIsAI_UAE-Large-V1. Creating a new one with MEAN pooling.


<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: WhereIsAI/UAE-Large-V1
EMBEDDING_DIM: 1024
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the following:
- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  
💡 You'll need the vector `EMBEDDING_DIM` parameter from your embedding model.
Typical values are:
   - 1024 for large sentence-transformers models such as the `UAE-Large-V1` model used here
   - 1536 for ada-002 OpenAI embedding models
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), stronger consistency comes at the cost of some latency. 💡 Searching documentation and Wikipedia text is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  

Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Cluster-based inverted file index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.
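
As a point of reference for what the approximate indexes are speeding up, here is a minimal sketch of the exhaustive search that a FLAT index performs conceptually: score every stored vector against the query and keep the best `top_k`. This is plain Python for illustration, not Milvus code, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_search(query, vectors, top_k=3):
    """Score every stored vector against the query; return the top_k (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Tiny 2-dimensional toy data set.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
results = flat_search([1.0, 0.1], vectors, top_k=2)
print(results)
```

The cost grows linearly with the number of stored vectors, which is why approximate indexes such as IVF or HNSW trade a little recall for much faster lookups on large collections.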

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance

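💡 Most use cases work better with normalized embeddings (every vector has length 1).  In that case IP and COSINE give identical scores, and L2 distance produces the same ranking, because the squared L2 distance between two unit vectors equals 2 - 2 × IP.  L2 only behaves differently if you plan to keep your embeddings unnormalized.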
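
The relationship between the three metrics on normalized vectors can be checked numerically. The sketch below uses plain Python and two arbitrary example vectors. It is only an illustration of the arithmetic, not of how Milvus computes distances internally.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

ip = dot(a, b)
print(f"IP:              {ip:.6f}")
print(f"COSINE:          {cosine(a, b):.6f}")
print(f"L2 squared:      {l2_squared(a, b):.6f}")
print(f"2 - 2 * IP:      {2 - 2 * ip:.6f}")
```

On unit vectors the first two lines match, and so do the last two, so a larger inner product always corresponds to a smaller L2 distance.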

In [4]:
# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

from pymilvus import MilvusClient

# Set the Milvus collection name.
COLLECTION_NAME = "wikipedia"

# Add custom HNSW search index to the collection.
# M = max number graph connections per layer. Large M = denser graph.
# Choice of M: 4~64, larger M for larger data and larger embedding lengths.
M = 16
# efConstruction = num_candidate_nearest_neighbors per layer. 
# Use Rule of thumb: int. 8~512, efConstruction = M * 2.
efConstruction = M * 2
# Create the search index for local Milvus server.
INDEX_PARAMS = dict({
    'M': M,               
    "efConstruction": efConstruction })
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# The no-schema Milvus client uses a flexible JSON key:value format.
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

# Check if collection already exists, if so drop it.
has = utility.has_collection(COLLECTION_NAME)
if has:
    drop_result = utility.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

# Create the collection.
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_DIM,
                     consistency_level="Eventually", 
                     auto_id=True,
                     # skip setting params below, if using AUTOINDEX
                     params=index_params
                    )

print(f"Successfully created collection: `{COLLECTION_NAME}`")
# pprint.pprint(mc.describe_collection(COLLECTION_NAME))

Successfully dropped collection: `wikipedia`
Successfully created collection: `wikipedia`


## Insert data into Milvus

For each original text chunk, we'll write the quadruplet (`vector, chunk, source, h1`) into the database.

<div>
<img src="../../pics/db_insert.png" width="80%"/>
</div>

**The Milvus Client wrapper can only handle loading data from a list of dictionaries.**

Otherwise, in general, Milvus supports loading data from:
- pandas dataframes 
- list of dictionaries
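
As a rough sketch of the list-of-dictionaries shape, each row is one dictionary whose keys become the stored fields. The field names `vector`, `chunk`, `source`, and `h1` match the ones used in this notebook. The helper `make_row`, the 4-dimensional toy vector, and the sample text are assumptions for illustration only.

```python
def make_row(vector, chunk, source, title, dim):
    """Build one row dictionary, checking the vector length against the collection dimension."""
    if len(vector) != dim:
        raise ValueError(f"expected a vector of length {dim}, got {len(vector)}")
    return {
        "vector": [float(x) for x in vector],
        "chunk": chunk,
        "source": source,
        "h1": title[:50],  # keep the title short, as in the insert cell
    }

# Toy example with a 4-dimensional placeholder vector (real vectors have EMBEDDING_DIM entries).
rows = [
    make_row(
        vector=[0.1, 0.2, 0.3, 0.4],
        chunk="Example chunk text about New York City.",
        source="https://en.wikipedia.org/wiki/New_York_City",
        title="New York City - Wikipedia",
        dim=4,
    )
]
print(rows[0].keys())
```

Validating the vector length before inserting catches a mismatch between the embedding model and the collection dimension early.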

Below, we use the embedding model provided by HuggingFace, download its checkpoint, and run it locally as the encoder.  

In [5]:
# INSERT WIKIPEDIA CHUNKS INTO THE COLLECTION.
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load the Wikipedia page and create index
loader = WebBaseLoader("https://en.wikipedia.org/wiki/New_York_City")
docs = loader.load()

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
print(f"Num docs: {len(docs)}")
chunks = text_splitter.split_documents(docs)
print(f"Num chunks: {len(chunks)}")

# Convert chunks to a list of dictionaries.
chunk_list = []
for chunk in chunks:
    # pprint.pprint(chunk)
    # Generate embeddings using encoder from HuggingFace.
    embedding = encoder.encode([chunk.page_content])[0]
    # Normalize to unit length and store as a float32 numpy vector.
    converted_values = (embedding / np.linalg.norm(embedding)).astype(np.float32)
    
    # Assemble embedding vector, original text chunk, metadata.
    chunk_dict = {
        'vector': converted_values,
        'chunk': chunk.page_content,
        'source': chunk.metadata['source'],
        'h1': chunk.metadata['title'][:50],
    }
    chunk_list.append(chunk_dict)

# Insert data into the Milvus collection.
print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=chunk_list,
    append=True,
    progress_bar=True)
end_time = time.time()
print(f"Milvus Client insert time for {len(chunk_list)} vectors: {end_time - start_time} seconds")
# Milvus Client insert time for 646 vectors: 4.732278823852539 seconds

# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)

Num docs: 1
Num chunks: 704
Start inserting entities


100%|██████████| 1/1 [00:03<00:00,  3.95s/it]


Milvus Client insert time for 704 vectors: 3.9572505950927734 seconds


## Define Evaluation Metrics

In [6]:
import openai, pprint
from openai import OpenAI

# Define the generation llm model to use.
LLM_NAME = "gpt-3.5-turbo-1106"
TEMPERATURE = 0.1
RANDOM_SEED = 415

# Reasonable values for the penalty coefficients are around 0.1 to 1 if the aim is to just reduce repetition 
# somewhat. To strongly suppress repetition, set coefficients = 2.
FREQUENCY_PENALTY = 2

# See how to save api key in env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

In [7]:
# Ragas default uses HuggingFace Datasets.
# https://docs.ragas.io/en/latest/getstarted/evaluation.html
from datasets import Dataset
# Ragas default uses OpenAI through LangChain.
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM
from ragas import evaluate

# Choose the metrics you want to see.
from ragas.metrics import (
    # Question -> Context metrics
    context_recall, 
    context_precision, 
    # Context -> Answer metrics
    faithfulness, 
    # Question -> Answer metrics
    answer_similarity,
    answer_relevancy, 
    answer_correctness
    )
metrics = ['context_recall', 'context_precision', 'answer_relevancy', 'faithfulness', 'answer_similarity', 'answer_correctness']

# Customize LLM used by Ragas (uses LangChain OpenAI `gpt-3.5-turbo-16k` by default).
# Possible to switch in a HuggingFace open LLM here if you want.
# https://docs.ragas.io/en/latest/howtos/customisations/llms.html
llm_langchain = ChatOpenAI(model_name=LLM_NAME, temperature=TEMPERATURE)
gpt3_wrapper = LangchainLLM(llm=llm_langchain)
# Change the default llm for each metric.
for metric in metrics:
    globals()[metric].llm = gpt3_wrapper



In [8]:
def assemble_ragas_dataset(input_df, answer_col_name="OpenAI_RAG_answer", context_exists=False, row_number=-9999):
    """Assemble a RAGAS HuggingFace Dataset from lists of values."""

    # Subset input_df to the row number.
    if row_number >= 0:
        subset_df = input_df.iloc[row_number:row_number+1, :]
    else:
        subset_df = input_df.copy()

    question_list = subset_df.Question.to_list()
    answer_list = subset_df[answer_col_name].to_list()

    # contexts: list[list[str]] - The contexts which were passed into the LLM to answer the question.
    if context_exists:
        context_list = subset_df.Custom_RAG_context.to_list()
        context_list = [[context] for context in context_list]
    else:
        context_list = [ [""] for _ in question_list]

    # ground_truths: list[list[str]] - The ground truth answer to the questions. 
    truth_list = subset_df.ground_truth_answer.to_list()
    truth_list = [[truth] for truth in truth_list]

    # Create a HuggingFace Dataset from the ground truth lists.
    ragas_ds = Dataset.from_dict({"question": question_list,
                            "contexts": context_list,
                            "answer": answer_list,
                            "ground_truths": truth_list})
    
    return ragas_ds

def evaluate_ragas(input_df, answer_col_name="OpenAI_RAG_answer", context_exists=False, row_number=-9999, metrics="final_only"):

    # Create a ragas dataset.
    ragas_input_ds = assemble_ragas_dataset(input_df, answer_col_name, context_exists, row_number)

    # Evaluate the dataset.
    if metrics == "final_only":
        ragas_result = evaluate(
            ragas_input_ds,
            metrics=[
                answer_similarity,
                answer_relevancy,
                answer_correctness,])
    else:
        # calculate all metrics
        ragas_result = evaluate(
            ragas_input_ds,
            metrics=[
                # Question -> Context metrics
                context_recall, 
                context_precision, 
                # Context -> Answer metrics
                faithfulness, 
                # Question -> Answer metrics
                answer_similarity,
                answer_relevancy,
                answer_correctness,])
        
    return ragas_result

In [9]:
# Read questions and ground truth answers into a pandas dataframe.
import pandas as pd

# Read ground truth answers from file.
eval_df = pd.read_csv("../../../christy_coding_scratch/data/milvus_ground_truth.csv", 
                      header=0, skip_blank_lines=True)
display(eval_df.head())

# Get all the questions.
question_list = eval_df.Question.to_list()

# Get all the ground truth answers.
truth_list = eval_df.ground_truth_answer.to_list()

# Get all the ground truth sources.
uri_list = eval_df.Uri.to_list()

# Get all the OpenAI Answers.
openai_answer_list = eval_df.OpenAI_RAG_answer.to_list()

Unnamed: 0,Question,ground_truth_answer,OpenAI_RAG_answer,Custom_RAG_answer,Custom_RAG_context,Uri,H1,H2,Score,Reason
0,What do the parameters for HNSW mean?\n,- M: maximum degree of nodes in a layer of the...,The HNSW parameters include the “nlist” which ...,The parameters for HNSW have the following mea...,"performance, HNSW limits the maximum degree of...",https://pymilvus.readthedocs.io/en/latest/para...,Index,Milvus support to create index to accelerate v...,,
1,What are HNSW good default parameters when dat...,"M=16, efConstruction=32, ef=32",The default HNSW parameters for data size of 2...,For a data size of 25K vectors with a dimensio...,Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 I...,https://pymilvus.readthedocs.io/en/latest/para...,,,,
2,what is the default distance metric used in AU...,"Trick answer: IP inner product, not yet updat...",The default distance metric used in AUTOINDEX ...,The default distance metric used in AUTOINDEX ...,The attributes of collection can be extracted ...,https://pymilvus.readthedocs.io/en/latest/tuto...,,,,
3,How did New York City get its name?,"In the 1600’s, the Dutch planted a trading pos...","I'm sorry, but I couldn't find any information...",New York City was originally named New Amsterd...,Etymology\nSee also: Nicknames of New York Cit...,https://en.wikipedia.org/wiki/New_York_City,,,,


## Define a Custom Execution Loop for RAG.

In [10]:
import requests, json, pprint

# Milvus search, define how many retrieval results to return.
# Milvus automatically sorts results descending by distance score.
TOP_K = 3

# Search a collection containing Milvus Documentation.
def zilliz_pipeline_collection_search(token, question):
    # Define the URL, headers, and data
    url = "https://controller.api.gcp-us-west1.zillizcloud.com/v1/pipelines/pipe-3de3fb4a9bc3c2a64a786b/run"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    data = {
        "data": {
            "query_text": question
        },
        "params": {
            "limit": 3,
            "offset": 0,
            "outputFields": ["chunk_text", "chunk_id", "doc_name", "source"],
            "filter": "chunk_id >= 0 && doc_name == 'param.html'",
        }
    }

    # Send the POST request
    response = requests.post(url, headers=headers, json=data)

    # # Print the response
    # pprint.pprint(response.json())
    return response.json()

# Search a collection containing Wikipedia articles about New York City.
def wikipedia_search(mc, collection_name, collection_encoder, question, output_fields=None, top_k=3):
    # Embed the query
    query_embeddings = _utils.embed_query(collection_encoder, [question])

    # Define search parameters
    INDEX_PARAMS = dict({
        'M': M,               
        "efConstruction": efConstruction })
    SEARCH_PARAMS = dict({
        "ef": INDEX_PARAMS['efConstruction']
    })

    # Default output fields to return when the caller does not specify any.
    if output_fields is None:
        output_fields = ["h1", "source", "chunk"]

    # Perform the search
    answers = mc.search(
        collection_name,
        data=query_embeddings, 
        search_params=SEARCH_PARAMS,
        output_fields=output_fields, 
        filter="(source like 'https://en.wikipedia.org%')",
        limit=top_k,
        consistency_level="Eventually"
    )

    return answers

In [11]:
# Function to get OpenAI response and token usage.
def get_openai_chat(llm_name, user_prompt, retrieval_context, retrieval_source, message_history,
                     temperature=0.0, random_seed=415, frequency_penalty=2):
    """ 
    Returns 2 pandas dataframes: response, token_use.
    """
    
    # Note: JSON mode (response_format "json_object") requires the word "JSON" to appear in the messages.
    system_message = f"""
    Use the Context to answer the user's question. Be clear, factual, complete, concise.
    If the answer is not in the Context, say "I don't know".  Otherwise answer as a JSON object
    with three fields, using this format:
    Context: {retrieval_context}
    Answer: The answer to the question.
    Grounding source: {retrieval_source}
    """
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': f"{user_prompt}"},
        {'role': 'assistant', 'content': f"Relevant context:\n{retrieval_context}"}
    ]

    # Call the OpenAI chat completions API.
    responses = openai_client.chat.completions.create(
        response_format={
            "type": "json_object", 
            # "schema": Result.schema_json()
        },
        messages=message_history + messages,
        model=llm_name,
        temperature=temperature, # the degree of randomness of the model's output
        seed=random_seed,  # for reproducibility
        frequency_penalty=frequency_penalty, # penalizes repeated tokens in the model's output
        # max_tokens=max_tokens # maximum number of tokens the model can output
    )
    message_history = message_history + messages[1:]

    # Make sure total_tokens < 4096.
    token_dict = {
        'prompt_tokens':responses.usage.prompt_tokens,
        'completion_tokens':responses.usage.completion_tokens,
        'total_tokens':responses.usage.total_tokens,
    }

    # Parse the answer, which is returned as a single JSON object with 3 fields.
    openai_response = responses.choices[0].message.content
    json_response = json.loads(openai_response)

    # Create a DataFrame from a list of dictionaries.
    response_df = pd.DataFrame([json_response])
    token_use_df = pd.DataFrame([token_dict])

    return response_df, token_use_df

def get_answer_from_openai_chat_response(chat_response):
    # Extract the answer from the 0th choice's message content
    answer = chat_response.choices[0].message.content
    return answer

In [12]:
# STEP1: Moderation check of user question.  If pass, continue.
# STEP2: Retrieve closest chunk to question from default collection.
#        Check distance score of the retrieved chunk.  
#   STEP3:  If score is too low, get the intent from the question.
#   STEP4:  Based on question intent, retrieve from a different collection containing that data.
# STEP5: Generate answer to the user's question, using context in the ASSISTANT PROMPT.
# STEP6: Moderation check of generated answer.  If pass, continue.
# STEP7: Return final answer to user.

# Define a custom execution loop for RAG.
def process_user_message(user_input, question_number, message_history, top_k=3, debug=False):
    delimiter = "```"
    retrieval_done = False
    threshold_retrieval_score = 0.6
    ragas_metrics= ['answer_relevancy', 'faithfulness']

    # # Step 1: Check input to see if it flags the Moderation API or is a prompt injection
    # if debug:
    #    print()
    #    print("STEP 1: Check input to see if it flags the Moderation API or is a prompt injection")
    # response = openai_client.moderations.create(input=user_input)
    # moderation_output = response.results[0]
    # print(moderation_output.flagged) # False

    # if moderation_output.flagged:
    #     print("Step 1: Input flagged by Moderation API.")
    #     return "Sorry, we cannot process this request.", message_history

    # Step 2: Retrieval from collection #1.
    if debug:
        print()
        print("STEP 2: Retrieval from collection #1 MilvusDocs.")
    response = zilliz_pipeline_collection_search(TOKEN, user_input)
    distance_score = response['data']['result'][0]['distance']

    # Branching logic based on distance score.
    if distance_score >= threshold_retrieval_score: 
        # Extract the retrieval context.
        retrieval_context = response['data']['result'][0]['chunk_text']
        retrieval_source = response['data']['result'][0]['source']
        if debug:
            print(f"DISTANCE SCORE: {distance_score} branch logic.")
            print(f"chunk_answer: {retrieval_context[:150]}")
        retrieval_done = True

    if not retrieval_done and distance_score < threshold_retrieval_score:
        # Step 3: If score is too low, get the intent from the prompt.
        if debug:
            print(f"DISTANCE SCORE: {distance_score} branching logic...")
            print()
            print("STEP 3: Score is too low, GET INTENT from the user's question.")
        intent = None  # Default when no known intent is detected.
        if "New York City" in user_input:
            intent = "new_york"
            print(f"intent = {intent}")
        # elif could check for other intents here...
            
        # Step 4: Based on question intent, retrieve from collection containing that data.
        if intent == "new_york":
            if debug:
                print()
                print("STEP 4: Based on question intent, retrieve from collection #2 Wikipedia.")
            OUTPUT_FIELDS = ["h1", "source", "chunk"]
            response = wikipedia_search(mc, COLLECTION_NAME, encoder, user_input, OUTPUT_FIELDS, top_k)
            # Extract the retrieval score, context, source citation.
            distance_score = response[0][0]['distance']
            retrieval_context = response[0][0]['entity']['chunk']
            retrieval_source = response[0][0]['entity']['source']
            if debug:
                print(f"chunk_answer: {retrieval_context[:150]}")
        else:
            print(f"STEP 4: No matching collection for intent {intent}.")
            return "Sorry, we cannot process this request.", message_history

    # Branching logic based on distance score.
    if debug:
        print(f"DISTANCE SCORE: {distance_score} branch logic...")
    if distance_score < threshold_retrieval_score: 
        print("UNABLE TO MATCH INTENT WITH ANY INTERNAL DOC STORE.")
        return "Sorry, we cannot process this request.", message_history
    else: 
        print()
        print(f"Score from custom RAG Retrieval is above threshold, proceed to answer generation step.")
        # STEP 5: Generating GPT3.5 answer from the custom execution loop for RAG in the ASSISTANT PROMPT.
        if debug:
            print()
            print("STEP 5: Generating GPT3.5 answer from the custom execution loop for RAG in the ASSISTANT PROMPT.")
        system_message = f"""
        Use the Context below to answer the user's question. Be clear, factual, complete, concise.
        If the answer is not in the Context, say "I don't know".  Otherwise answer using this format:
        Context: {retrieval_context}
        Answer: The answer to the question.
        Grounding source: {retrieval_source}
        """
        messages = [
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},
            {'role': 'assistant', 'content': f"Relevant context:\n{retrieval_context}"}
        ]
        final_response = openai_client.chat.completions.create(
            messages=message_history + messages,
            model=LLM_NAME,
            temperature=TEMPERATURE,
            seed=RANDOM_SEED,
        )
        message_history = message_history + messages[1:]
        answer = get_answer_from_openai_chat_response(final_response)

        # STEP 6: Evaluate whether the chatbot response answers the initial user query well.
        # Default to passing so that `evaluation_response` is defined when debug=False.
        evaluation_response = "Y"
        if debug:
            print()
            print("STEP 6: Evaluate whether the chatbot response answers the initial user query well.")
            ragas_result = evaluate_ragas(eval_df, "Custom_RAG_answer", True, question_number, "final_only")
            ragas_df = ragas_result.to_pandas()
            print(f"Ragas evaluation: answer similarity: {ragas_df.answer_similarity[0]}, answer relevancy: {np.round(ragas_df.answer_relevancy[0],3)}, answer correctness: {np.round(ragas_df.answer_correctness[0],3)}")
            # could also check for other metrics here...
            evaluation_response = "Y"

        # STEP 7: If LLM answer passed Evaluation, return it to the user.
        if evaluation_response == "Y":
            if debug:
                print()
                print("STEP 7: LLM answer passed Evaluation, return it to the user.")
            return answer, message_history
        else:
            if debug:
                print()
                print(f"STEP 7: The LLM answer does not pass Evaluation.")
            return answer, message_history


In [13]:
# Test the custom RAG execution loop using a question.

QUESTION_NUMBER = 3 #2 or 3
SAMPLE_QUESTION = question_list[QUESTION_NUMBER]
print(f"question = {SAMPLE_QUESTION}")

truth_answer = truth_list[QUESTION_NUMBER]

# Test the OpenAI answer.
all_messages = []
answer_history = []
openai_answer, messages = process_user_message(SAMPLE_QUESTION, QUESTION_NUMBER, all_messages, debug=True)
# The function returns the full updated history, so replace the list rather than nesting it.
all_messages = messages
answer_history.append(openai_answer)
pprint.pprint(f"Answer: {openai_answer}")

question = How did New York City get its name?

STEP 2: Retrieval from collection #1 MilvusDocs.
DISTANCE SCORE: 0.39108937978744507 branching logic...

STEP 3: Score is too low, GET INTENT from the user's question.
intent = new_york

STEP 4: Based on question intent, retrieve from collection #2 Wikipedia.
chunk_answer: New York City traces its origins to Fort Amsterdam and a trading post founded on the southern tip of Manhattan Island by Dutch colonists in approximat
DISTANCE SCORE: 0.7961502075195312 branch logic...

Score from custom RAG Retrieval is above threshold, proceed to answer generation step.

STEP 5: Generating GPT3.5 answer from the custom execution loop for RAG in the ASSISTANT PROMPT.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



STEP 6: Evaluate whether the chatbot response answers the initial user query well.
evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.49it/s]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:01<00:00,  1.73s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:05<00:00,  5.98s/it]


Ragas evaluation: answer similarity: 0.9421961714808575, answer relevancy: 0.894, answer correctness: 0.664

STEP 7: LLM answer passed Evaluation, return it to the user.
('Answer: New York City was originally named New Amsterdam by Dutch colonists '
 'in 1626. However, it was renamed New York in 1664 after King Charles II '
 'granted the lands to his brother, the Duke of York, when the city came under '
 'British control.')


## Final Eval Comparisons Custom RAG vs OpenAI RAG

In [14]:
# Run Ragas Eval for all Questions, all Custom RAG Answers.

# def evaluate_ragas(input_df, answer_col_name="OpenAI_RAG_answer", context_exists=False, row_number=-9999, metrics="final_only"):
ragas_result = evaluate_ragas(eval_df, "Custom_RAG_answer", True, -9999, "all")
ragas_df_Custom_RAG = ragas_result.to_pandas()

# Rename the columns.
rename_dict = {
    "contexts": "contexts_Custom_RAG",
    "answer": "answer_Custom_RAG",
    "answer_similarity": "answer_similarity_Custom_RAG",
    "answer_relevancy": "answer_relevancy_Custom_RAG",
    "answer_correctness": "answer_correctness_Custom_RAG"
}
ragas_df_Custom_RAG.rename(columns=rename_dict, inplace=True)
# Reorder the columns: move ground_truths next to question, keep the rest in order.
ragas_df_Custom_RAG = ragas_df_Custom_RAG.iloc[:, [0, 3, 1, 2, 4, 5, 6, 7, 8, 9]]
display(ragas_df_Custom_RAG.head())

evaluating with [context_recall]


100%|██████████| 1/1 [00:14<00:00, 14.62s/it]


evaluating with [context_precision]


100%|██████████| 1/1 [00:07<00:00,  7.86s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [00:29<00:00, 29.35s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:01<00:00,  1.20s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:07<00:00,  7.96s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:20<00:00, 20.12s/it]


Unnamed: 0,question,ground_truths,contexts_Custom_RAG,answer_Custom_RAG,context_recall,context_precision,faithfulness,answer_similarity_Custom_RAG,answer_relevancy_Custom_RAG,answer_correctness_Custom_RAG
0,What do the parameters for HNSW mean?\n,[- M: maximum degree of nodes in a layer of th...,"[performance, HNSW limits the maximum degree o...",The parameters for HNSW have the following mea...,1.0,1.0,0.8,0.844867,0.979217,0.620304
1,What are HNSW good default parameters when dat...,"[M=16, efConstruction=32, ef=32]",[Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 ...,For a data size of 25K vectors with a dimensio...,0.0,0.0,0.0,0.776006,0.977902,0.62255
2,what is the default distance metric used in AU...,"[Trick answer: IP inner product, not yet upda...",[The attributes of collection can be extracted...,The default distance metric used in AUTOINDEX ...,0.0,0.0,0.0,0.73806,0.990814,0.484557
3,How did New York City get its name?,"[In the 1600’s, the Dutch planted a trading po...",[Etymology\nSee also: Nicknames of New York Ci...,New York City was originally named New Amsterd...,1.0,1.0,0.5,0.942196,0.894259,0.66412


In [15]:
# Run Ragas Eval for all Questions, all OpenAI RAG Answers.

ragas_result = evaluate_ragas(eval_df, answer_col_name="OpenAI_RAG_answer", context_exists=False, row_number=-9999)
ragas_df_OpenAI_RAG = ragas_result.to_pandas()

# Rename the columns.
rename_dict = {
    "contexts": "contexts_OpenAI_RAG",
    "answer": "answer_OpenAI_RAG",
    "answer_similarity": "answer_similarity_OpenAI_RAG",
    "answer_relevancy": "answer_relevancy_OpenAI_RAG",
    "answer_correctness": "answer_correctness_OpenAI_RAG"
}
ragas_df_OpenAI_RAG.rename(columns=rename_dict, inplace=True)
# Reorder the columns: move ground_truths next to question, keep the rest in order.
ragas_df_OpenAI_RAG = ragas_df_OpenAI_RAG.iloc[:, [0, 3, 1, 2, 4, 5, 6]]
display(ragas_df_OpenAI_RAG)

evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  2.01it/s]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:07<00:00,  7.85s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:14<00:00, 14.49s/it]


Unnamed: 0,question,ground_truths,contexts_OpenAI_RAG,answer_OpenAI_RAG,answer_similarity_OpenAI_RAG,answer_relevancy_OpenAI_RAG,answer_correctness_OpenAI_RAG
0,What do the parameters for HNSW mean?\n,[- M: maximum degree of nodes in a layer of th...,[],The HNSW parameters include the “nlist” which ...,0.747939,0.936005,0.186985
1,What are HNSW good default parameters when dat...,"[M=16, efConstruction=32, ef=32]",[],The default HNSW parameters for data size of 2...,0.824929,0.981672,0.206232
2,what is the default distance metric used in AU...,"[Trick answer: IP inner product, not yet upda...",[],The default distance metric used in AUTOINDEX ...,0.77059,0.990814,0.692648
3,How did New York City get its name?,"[In the 1600’s, the Dutch planted a trading po...",[],"I'm sorry, but I couldn't find any information...",0.777967,0.0,0.194492
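
The two cells above repeat the same rename-then-reorder steps with positional `iloc` indices, which silently select the wrong columns if the column order ever changes. A name-based helper is one way to avoid that. The sketch below is an illustration only: the function name `rename_and_reorder` and the toy DataFrame are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd

# Columns that differ per RAG pipeline and therefore get a suffix.
ANSWER_COLS = ["contexts", "answer", "answer_similarity",
               "answer_relevancy", "answer_correctness"]


def rename_and_reorder(df: pd.DataFrame, suffix: str, leading: list) -> pd.DataFrame:
    """Suffix the pipeline-specific columns, then place `leading` columns first."""
    rename_dict = {c: f"{c}_{suffix}" for c in ANSWER_COLS if c in df.columns}
    out = df.rename(columns=rename_dict)
    # Keep the requested leading columns, then all others in their existing order.
    rest = [c for c in out.columns if c not in leading]
    return out[leading + rest]


# Toy data with the same column layout as a Ragas result (values are made up).
toy = pd.DataFrame({
    "question": ["q"],
    "contexts": [["c"]],
    "answer": ["a"],
    "ground_truths": [["g"]],
    "answer_similarity": [0.5],
})
result = rename_and_reorder(toy, "Custom_RAG", ["question", "ground_truths"])
print(list(result.columns))
```

Because the helper selects by name, the same call works for both the Custom RAG and OpenAI RAG result frames even though they carry different numbers of metric columns.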


In [16]:
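# Merge the 2 ragas dfs so they are easier to compare.
# Left: question, ground_truths, contexts, answer, and the 3 answer metrics for Custom RAG.
# Right: contexts, answer, and the 3 answer metrics for OpenAI RAG (duplicate question and ground_truths dropped).
ragas_merged_df = ragas_df_Custom_RAG.iloc[:, [0, 1, 2, 3, 7, 8, 9]].merge(
    ragas_df_OpenAI_RAG.iloc[:, 2:], how="inner", left_index=True, right_index=True
)
# Reorder columns: both contexts/answer pairs first, then Custom RAG scores, then OpenAI RAG scores.
ragas_merged_df = ragas_merged_df.iloc[:, [0, 1, 2, 3, 7, 8, 4, 5, 6, 9, 10, 11]]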
display(ragas_merged_df.head())

print()
print("####### FINAL SCORES OPENAI RAG vs MILVUS CUSTOM RAG #########")
print(f"LLM as judge model: {LLM_NAME} with temperature: {TEMPERATURE} scores:")
print(f"# Truth vs RAG answers: {len(ragas_merged_df)}")
print()
print(f"avg_similarity_Custom_RAG: {np.round(ragas_merged_df.answer_similarity_Custom_RAG.mean(), 2)}")
print(f"avg_similarity_OpenAI_RAG: {np.round(ragas_merged_df.answer_similarity_OpenAI_RAG.mean(), 2)}")
print()
print(f"answer_relevancy_Custom_RAG: {np.round(ragas_merged_df.answer_relevancy_Custom_RAG.mean(), 2)}")
print(f"avg_relevancy_OpenAI_RAG: {np.round(ragas_merged_df.answer_relevancy_OpenAI_RAG.mean(), 2)}")
print()
print(f"avg_correctness_Custom_RAG: {np.round(ragas_merged_df.answer_correctness_Custom_RAG.mean(), 2)}")
print(f"avg_correctness_OpenAI_RAG: {np.round(ragas_merged_df.answer_correctness_OpenAI_RAG.mean(), 2)}")

Unnamed: 0,question,ground_truths,contexts_Custom_RAG,answer_Custom_RAG,contexts_OpenAI_RAG,answer_OpenAI_RAG,answer_similarity_Custom_RAG,answer_relevancy_Custom_RAG,answer_correctness_Custom_RAG,answer_similarity_OpenAI_RAG,answer_relevancy_OpenAI_RAG,answer_correctness_OpenAI_RAG
0,What do the parameters for HNSW mean?\n,[- M: maximum degree of nodes in a layer of th...,"[performance, HNSW limits the maximum degree o...",The parameters for HNSW have the following mea...,[],The HNSW parameters include the “nlist” which ...,0.844867,0.979217,0.620304,0.747939,0.936005,0.186985
1,What are HNSW good default parameters when dat...,"[M=16, efConstruction=32, ef=32]",[Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 ...,For a data size of 25K vectors with a dimensio...,[],The default HNSW parameters for data size of 2...,0.776006,0.977902,0.62255,0.824929,0.981672,0.206232
2,what is the default distance metric used in AU...,"[Trick answer: IP inner product, not yet upda...",[The attributes of collection can be extracted...,The default distance metric used in AUTOINDEX ...,[],The default distance metric used in AUTOINDEX ...,0.73806,0.990814,0.484557,0.77059,0.990814,0.692648
3,How did New York City get its name?,"[In the 1600’s, the Dutch planted a trading po...",[Etymology\nSee also: Nicknames of New York Ci...,New York City was originally named New Amsterd...,[],"I'm sorry, but I couldn't find any information...",0.942196,0.894259,0.66412,0.777967,0.0,0.194492



####### FINAL SCORES OPENAI RAG vs MILVUS CUSTOM RAG #########
LLM as judge model: gpt-3.5-turbo-1106 with temperature: 0.1 scores:
# Truth vs RAG answers: 4

avg_similarity_Custom_RAG: 0.83
avg_similarity_OpenAI_RAG: 0.78

answer_relevancy_Custom_RAG: 0.96
avg_relevancy_OpenAI_RAG: 0.73

avg_correctness_Custom_RAG: 0.6
avg_correctness_OpenAI_RAG: 0.32
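
The averages above can be reproduced directly from the per-question scores in the merged table. The sketch below copies those twelve scores per pipeline by hand and recomputes the means. The dictionary layout and the `summarize` helper are illustrative choices, not part of the notebook.

```python
from statistics import mean

# Per-question scores copied from the merged table above (questions 0 to 3).
scores = {
    "answer_similarity": {
        "Custom_RAG": [0.844867, 0.776006, 0.73806, 0.942196],
        "OpenAI_RAG": [0.747939, 0.824929, 0.77059, 0.777967],
    },
    "answer_relevancy": {
        "Custom_RAG": [0.979217, 0.977902, 0.990814, 0.894259],
        "OpenAI_RAG": [0.936005, 0.981672, 0.990814, 0.0],
    },
    "answer_correctness": {
        "Custom_RAG": [0.620304, 0.62255, 0.484557, 0.66412],
        "OpenAI_RAG": [0.186985, 0.206232, 0.692648, 0.194492],
    },
}


def summarize(all_scores):
    """Return {metric: {pipeline: mean rounded to 2 decimal places}}."""
    return {
        metric: {name: round(mean(vals), 2) for name, vals in by_pipeline.items()}
        for metric, by_pipeline in all_scores.items()
    }


summary = summarize(scores)
for metric, by_pipeline in summary.items():
    print(metric, by_pipeline)
```

With only 4 questions, individual rows carry a lot of weight: the single 0.0 relevancy score on question 3 accounts for most of the OpenAI RAG relevancy gap, and on question 2 the OpenAI RAG answer actually scores higher on correctness (0.69 vs 0.48).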


In [17]:
# Drop collection
utility.drop_collection(COLLECTION_NAME)
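
The cell above drops the collection unconditionally. A guarded variant that checks for the collection first can make a cleanup cell easier to re-run. The sketch below assumes the object passed in exposes `has_collection` and `drop_collection`, as `pymilvus.utility` does. The function name `drop_collection_if_exists` is hypothetical, and the module is passed as an argument only to keep the example self-contained.

```python
def drop_collection_if_exists(utility_module, collection_name: str) -> bool:
    """Drop the collection only if it exists; return True if a drop happened."""
    if utility_module.has_collection(collection_name):
        utility_module.drop_collection(collection_name)
        return True
    return False


# Intended usage in this notebook (assumes an open Milvus connection):
# drop_collection_if_exists(utility, COLLECTION_NAME)
```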

In [18]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,sentence_transformers,pymilvus,langchain,openai --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.11.6
IPython version      : 8.18.1

torch                : 2.1.1
transformers         : 4.35.2
sentence_transformers: 2.2.2
pymilvus             : 2.3.4
langchain            : 0.1.0
openai               : 1.7.2

conda environment: py311new
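
If the `watermark` extension is not available, similar version information can be gathered with the standard library. The sketch below uses `importlib.metadata`. The helper name `package_versions` is hypothetical, and note that it expects distribution names as published on PyPI, which can differ from import names (for example `sentence-transformers` rather than `sentence_transformers`).

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in package_versions(
    ["torch", "transformers", "sentence-transformers", "pymilvus", "langchain", "openai"]
).items():
    print(f"{name:22}: {ver}")
```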

