# ReadtheDocs Retrieval Augmented Generation (RAG)

In this notebook, we use Milvus documentation pages to build a chatbot about our product.  The chatbot follows the RAG steps: it retrieves chunks of data using Semantic Vector Search, then feeds the Question + Context as a Prompt to an LLM to generate an answer.

Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model.  **In this notebook, we will demo a fully open source RAG stack.**

Using an open-source stack for Q&A with retrieval saves money, because almost every call (retrieval, evaluation, and development iterations) runs for free against our own data.

<div>
<img src="../../../images/rag_image.png" width="80%"/>
</div>
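
To make the "Question + Context as a Prompt" step concrete, here is a minimal sketch of how retrieved chunks and a question might be combined. The function name `build_prompt`, the prompt wording, and the placeholder chunks are assumptions for illustration, not the prompt used later in this notebook.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt string."""
    # Number each chunk so the model (and the reader) can tell them apart.
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for what the vector search would return.
retrieved = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
prompt = build_prompt("What do the parameters for HNSW mean?", retrieved)
print(prompt)
```

The retrieval step decides what goes into `contexts`; the LLM only ever sees the assembled string.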

Let's get started!

In [1]:
# For colab install these libraries in this order:
# !python -m pip install torch transformers sentence-transformers langchain
# !python -m pip install -U pymilvus
# !python -m pip install unstructured openai tqdm numpy ipykernel 
# !python -m pip install ragas datasets

In [2]:
# Import common libraries.
import sys, os, time, pprint

# Import custom functions for splitting and search.
sys.path.append("../..")  # Adds higher directory to python modules path.
import milvus_utilities as _utils

## Download Data

The data used in this notebook is Milvus documentation web pages.

The code block below downloads all the web pages into a local directory called `rtdocs`.  

I've already uploaded the `rtdocs` data folder to GitHub, so you should see it if you cloned my repo.

In [3]:
# # UNCOMMENT TO DOWNLOAD THE DOCS.

# # !pip install -U langchain
# from langchain_community.document_loaders import RecursiveUrlLoader

# DOCS_PAGE="https://milvus.io/docs/"

# loader = RecursiveUrlLoader(DOCS_PAGE)
# docs = loader.load()

# num_documents = len(docs)
# print(f"loaded {num_documents} documents")

In [4]:
# # Save Langchain docs to a local directory.
# OUTPUT_DIR = "../../RAG/rtdocs_new/"
# os.makedirs(OUTPUT_DIR, exist_ok=True)

# # Convert each doc to HTML and save to the specified directory
# for doc in docs:
#     # Extract file name
#     filename = doc.metadata['source'].split('/')[-1].replace(".md", ".html")
    
#     # Check that filename is not empty
#     if filename:
#         with open(os.path.join(OUTPUT_DIR, filename), "w") as f:
#             f.write(doc.page_content)
#     else:
#         print("Filename is empty. Skipping this doc.")
#         pprint.pprint(doc.metadata)
#         pprint.pprint(doc.page_content[:500])

In [5]:
# READ THE DOCS FROM A LOCAL DIRECTORY.

# Read docs into LangChain
# !pip install -U langchain
# !pip install unstructured
from langchain.document_loaders import DirectoryLoader

# Load HTML files from a local directory
path = "../../RAG/rtdocs_new/"
loader = DirectoryLoader(path, glob='*.html')
docs = loader.load()

num_documents = len(docs)
print(f"loaded {num_documents} documents")

# # Subset docs for faster testing
# docs = docs[5:7].copy()
# num_documents = len(docs)
# print(f"testing with {num_documents} documents")

# Print the type of the docs.
print(type(docs))
print(type(docs[0]))

loaded 22 documents
<class 'list'>
<class 'langchain_core.documents.base.Document'>


# Connect to Milvus Lite

Milvus Lite is a local Python server intended for quick prototyping and local testing.  <br>
It can run in Jupyter notebooks, Colab, or locally.  Requires pymilvus>=2.4.3.

⛔️ Milvus Lite is not meant for production workloads.

In [6]:
# !python -m pip install -U pymilvus

In [7]:
# STEP 1. CONNECT A CLIENT TO THE MILVUS LITE PYTHON SERVER.

# !python -m pip install -U pymilvus
import pymilvus
print(f"pymilvus:{pymilvus.__version__}")

# Connect a client to the Milvus Lite server.
from pymilvus import MilvusClient
mc = MilvusClient("milvus_demo.db")

pymilvus:2.4.4


# Optional - Connect to Zilliz Cloud free tier cluster
To use fully-managed Milvus, sign up for the [Zilliz Cloud free trial](https://cloud.zilliz.com/login).  
  1. Choose the default "Starter" option and accept the default Cloud Provider and Region when you create a cluster. 
  2. On the Cluster main page, copy your `API Key` and store it locally in a .env variable.  See [this note](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) for how to do that.
  3. Also on the Cluster main page, copy the `Public Endpoint URI` and store it somewhere convenient.
  4. Jupyter also needs these values in a local .env file. <br>
Create a .env file anywhere in the bootcamp directory. <br>
Insert lines like these, substituting your actual API keys for the sample text: <br>
ZILLIZ_API_KEY=f370c <br>
OPENAI_API_KEY=sk-H <br>
ANYSCALE_ENPOINT_KEY=es <br>
ANTHROPIC_API_KEY=sk-an <br>
VARIABLE_NAME=value <br>
Save the .env file <br>
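
As a rough illustration of what "reading a .env file" means, the sketch below parses `KEY=value` lines with the standard library only. In practice a package such as `python-dotenv` is commonly used instead; the helper name `load_env_file` and the key `DEMO_API_KEY` are made up for this example.

```python
import os
import tempfile

def load_env_file(path: str) -> dict:
    """Parse KEY=value lines (skipping blanks and comments) and export them to os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded

# Demo with a throwaway file and a fake key, so no real secret is touched.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# sample file\nDEMO_API_KEY=not-a-real-key\n")
env = load_env_file(tmp.name)
os.remove(tmp.name)
print(sorted(env))
```

After loading, the values are available through `os.getenv(...)`, which is how the connection cell below reads `ZILLIZ_API_KEY`.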

In [8]:
# # STEP 1. CONNECT TO ZILLIZ CLOUD
# import os
# import pymilvus
# print(f"pymilvus version: {pymilvus.__version__}")
# from pymilvus import connections, utility, MilvusClient
# TOKEN = os.getenv("ZILLIZ_API_KEY")

# # Connect to Zilliz cloud using endpoint URI and API key TOKEN.
# # TODO change this.
# CLUSTER_ENDPOINT="https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
# CLUSTER_ENDPOINT="https://in03-48a5b11fae525c9.api.gcp-us-west1.zillizcloud.com:443"
# connections.connect(
#   alias='default',
#   #  Public endpoint obtained from Zilliz Cloud
#   uri=CLUSTER_ENDPOINT,
#   # API key or a colon-separated cluster username and password
#   token=TOKEN,
# )

# # The no-schema Milvus client uses a flexible JSON key:value format.
# # https://milvus.io/docs/using_milvusclient.md
# mc = MilvusClient(
#     uri=CLUSTER_ENDPOINT,
#     # API key or a colon-separated cluster username and password
#     token=TOKEN)

# # Check that the server is ready and print its version.
# print(f"Type of server: {utility.get_server_version()}")

## Optional - Start up Milvus running in local Docker

>⛔️ Make sure you pip install the correct version of pymilvus and server yml file.  **Versions (major and minor) should all match**.

1. [Install Docker](https://docs.docker.com/get-docker/)
2. Start your Docker Desktop
3. Download the latest [docker-compose.yml](https://milvus.io/docs/install_standalone-docker.md#Download-the-YAML-file) (or run the wget command, replacing the version with the one you are using)
> wget https://github.com/milvus-io/milvus/releases/download/v2.4.0-rc.1/milvus-standalone-docker-compose.yml -O docker-compose.yml
4. From your terminal:  
   - cd into the directory where you saved the .yml file (usually the same dir as this notebook)
   - docker compose up -d
   - verify (either in terminal or on Docker Desktop) the containers are running
5. From your code (see notebook code below):
   - Import pymilvus
   - Connect to the local milvus server

In [9]:
# # CONNECT TO MILVUS STANDALONE DOCKER.

# import pymilvus, time
# from pymilvus import (connections, MilvusClient, utility)
# print(f"Pymilvus: {pymilvus.__version__}")

# # ####################################################################################################
# # # Connect to local server running in Docker container.
# # # Download the latest .yaml file: https://milvus.io/docs/install_standalone-docker.md
# # # Or, download directly from milvus github (replace with desired version):
# !wget https://github.com/milvus-io/milvus/releases/download/v2.4.4/milvus-standalone-docker-compose.yml -O docker-compose.yml
# # ####################################################################################################

# # Start Milvus standalone on docker, running quietly in the background.
# !docker compose up -d

# # Verify which local port the Milvus server is listening on
# !docker ps -a #19530/tcp

# # Connect to the local server.
# connection = connections.connect(
#   alias="default", 
#   host='localhost', # or '0.0.0.0'
#   port='19530'
# )

# # Get server version.
# print(utility.get_server_version())

# # The no-schema Milvus client uses a flexible JSON key:value format.
# # connections.connect() returns None, so point the client at the server URI instead.
# mc = MilvusClient(uri="http://localhost:19530")

## Load the Embedding Model checkpoint and use it to create vector embeddings

#### What are Embeddings?

Check out [this blog](https://zilliz.com/glossary/vector-embeddings) for an introduction to embeddings.  

An excellent place to start is by selecting an embedding model from the [HuggingFace MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), sorted descending by the "Retrieval Average" column, since this task is most relevant to RAG. Then, choose the smallest, highest-ranking embedding model. But beware: some models listed are overfit to the training data, so they won't perform on your data as promised.  

Milvus (and Zilliz) only supports tested embedding models that are **not overfit**!

In [10]:
# !python -m pip install -U sentence-transformers transformers

In [11]:
# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.

# Import torch.
import torch
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
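# Use the default GPU if one is available (a hard-coded index such as 'cuda:3' fails on single-GPU machines).
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')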

# Load the model from huggingface model hub.
model_name = "BAAI/bge-large-en-v1.5"
# model_name = "BAAI/bge-m3"
encoder = SentenceTransformer(model_name, device=DEVICE)
# print(encoder)

# Get the model parameters and save for later.
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() 
# Assume tokens are 3 characters long.
MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3
EOS_TOKEN_LENGTH = 1 * 3

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")



model_name: BAAI/bge-large-en-v1.5
EMBEDDING_DIM: 1024
MAX_SEQ_LENGTH: 1536


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the following:
- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  
💡 You'll need the vector `EMBEDDING_DIM` parameter from your embedding model.
Typical values are:
   - 1024 for large sbert-style embedding models such as `BAAI/bge-large-en-v1.5`
   - 1536 for ada-002 OpenAI embedding models
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, in line with the trade-offs described by the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), stronger consistency costs some latency. 💡 Searching documentation pages is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  

Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted-file (cluster-based) index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - OSS or [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained) automatic index based on type of GPU, size of data.
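
To illustrate what "deterministic exhaustive search" means for FLAT, here is a small pure-Python sketch that scores every stored vector against the query and keeps the best `top_k`. It is only a conceptual model with made-up 2-dimensional vectors, not how Milvus implements the index; approximate indexes such as IVF or HNSW exist to avoid scoring every vector like this.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def flat_search(query, vectors, top_k=2):
    """Score every stored vector against the query; return the top_k (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy 2-dimensional "embeddings" (invented for illustration).
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = flat_search([1.0, 0.1], vectors, top_k=2)
print(hits)
```

The cost grows linearly with the number of stored vectors, which is why exhaustive search is usually reserved for small collections or for checking the recall of an approximate index.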

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - Euclidean distance
- IP - Inner (dot) product
- COSINE - Cosine similarity (angle between vectors)

💡 Most use cases work better with normalized embeddings (every vector has length=1). In that case IP and COSINE are the same, and L2 produces the same ranking, so it adds nothing.  Only choose L2 if you plan to keep your embeddings unnormalized.
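
The relationship between the three metrics on unit-length vectors can be checked numerically. The sketch below uses two made-up vectors and plain Python; the helper names are illustrative only. For unit vectors, the inner product equals the cosine similarity, and the squared L2 distance equals `2 - 2 * cosine`.

```python
import math

def normalize(v):
    """Scale a vector to unit length using its own L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary example vectors, normalized individually.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

ip = dot(a, b)
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(f"IP={ip:.6f}  COSINE={cosine:.6f}")
print(f"L2^2={l2_squared(a, b):.6f}  2-2*cos={2 - 2 * cosine:.6f}")
```

Because squared L2 distance is a decreasing function of cosine similarity here, sorting by smallest L2 distance or by largest IP/COSINE returns neighbors in the same order.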

### Exercise #1 (2 min):
Create a collection named "MilvusDocs".  Use the default AUTOINDEX.
> 💡 AUTOINDEX works on both Milvus and Zilliz Cloud (where it is the fastest!)

In [12]:
# STEP 3. USE MILVUS LITE: CREATE A MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

COLLECTION_NAME = "MilvusDocs"

# Check if collection already exists, if so drop it.
if mc.has_collection(COLLECTION_NAME):
    mc.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

# Create a collection with flexible schema and AUTOINDEX.
# Uses Milvus AUTOINDEX, which defaults to HNSW.
mc.create_collection(
    COLLECTION_NAME, 
    EMBEDDING_DIM,
    consistency_level="Eventually", 
    auto_id=True,  
    overwrite=True,
    )
print(f"Successfully created collection: `{COLLECTION_NAME}`")

Successfully dropped collection: `MilvusDocs`
Successfully created collection: `MilvusDocs`


In [13]:
# # STEP 3. DOCKER OR ZILLIZ CLOUD: CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

# # Set the Milvus collection name.
# COLLECTION_NAME = "MilvusDocs"

# # Add custom HNSW search index to the collection.
# # M = max number graph connections per layer. Large M = denser graph.
# # Choice of M: 4~64, larger M for larger data and larger embedding lengths.
# M = 16
# # efConstruction = num_candidate_nearest_neighbors per layer. 
# # Use Rule of thumb: int. 8~512, efConstruction = M * 2.
# efConstruction = M * 2
# # Create the search index for local Milvus server.
# INDEX_PARAMS = dict({
#     'M': M,               
#     "efConstruction": efConstruction })
# index_params = {
#     "index_type": "HNSW", 
#     "metric_type": "COSINE", 
#     "params": INDEX_PARAMS
#     }

# # Check if collection already exists, if so drop it.
# has = utility.has_collection(COLLECTION_NAME)
# if has:
#     drop_result = utility.drop_collection(COLLECTION_NAME)
#     print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

# # Create the collection.
# mc.create_collection(
#     COLLECTION_NAME, 
#     EMBEDDING_DIM,
#     consistency_level="Eventually", 
#     auto_id=True,  
#     overwrite=True,
#     # skip setting params below, if using AUTOINDEX
#     params=index_params
#     )
# print(f"Successfully created collection: `{COLLECTION_NAME}`")

## Simple Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  This section uses:
- **Strategy** = Simple fixed chunk lengths.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = 
  - Langchain's `RecursiveCharacterTextSplitter` to split up long documents recursively.
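
As a rough mental model of fixed-size chunking with overlap, the sketch below slices a string into windows by character count. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and sentences, so its chunk counts will differ. The function name `chunk_text` and the 10,000-character stand-in text are assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample_text = "x" * 10_000  # stand-in for one document's text
for size in (1536, 512):
    overlap = int(size * 0.10)  # 10% overlap rule of thumb
    pieces = chunk_text(sample_text, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}, chunks={len(pieces)}")
```

Smaller chunks mean more vectors to embed and store, roughly in proportion to how much the chunk size shrinks, with the overlap adding a little on top.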

### Exercise #2 (2 min):
Change the chunk_size and see what happens.  Model default is 1536.

- What do your observations imply about changing the chunk_size and the number of vectors?
- How many vectors are there with chunk_size=512?

In [14]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# import numpy as np
# import pprint

# ###############
# ## EXERCISE #2: Change chunk_size to 512 below.  How many chunks (vectors) does this create?
# ## ANSWER:  427
# ## BONUS:   Can you explain why the number of vectors changed from 134 to 427?  
# ##          Hint:  What is the default chunk overlap?  134 * (3 + 0.10) approx. equals 415.
# ###############
# chunk_size = #(exercise): code here
# chunk_overlap = np.round(chunk_size * 0.10, 0)
# print(f"chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap}")

# # Create an instance of the RecursiveCharacterTextSplitter
# child_splitter = RecursiveCharacterTextSplitter(
#     chunk_size = chunk_size,
#     chunk_overlap = chunk_overlap,
#     length_function = len,  # using built-in Python len function
# )

# # Split the documents further into smaller, recursive chunks.
# chunks = child_splitter.split_documents(docs)
# print(f"docs: {len(docs)}, split into: {len(chunks)}")

In [15]:
# STEP 4. PREPARE DATA: CHUNK AND EMBED

# !python -m pip install lxml
from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter
import numpy as np
import pprint

# Define chunk size and overlap 10% chunk_size.
CHUNK_SIZE = 512
chunk_overlap = np.round(CHUNK_SIZE * 0.10, 0)
print(f"chunk_size: {CHUNK_SIZE}, chunk_overlap: {chunk_overlap}")

# Create an instance of the RecursiveCharacterTextSplitter
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CHUNK_SIZE,
    chunk_overlap = chunk_overlap,
    length_function = len,  # use built-in Python len function
    # separators=["\n\n"],
)

# Split the documents further into smaller, recursive chunks.
smaller_chunks = child_splitter.split_documents(docs)
print(f"docs: {len(docs)}, split into chunks: {len(smaller_chunks)}")
print(f"type: list of {type(smaller_chunks[0])}") 

# Clean up newlines in the chunks.
for chunk in smaller_chunks:
    chunk.page_content = chunk.page_content.replace("\n", " ")
    
# Clean up the metadata urls
for chunk in smaller_chunks:
    new_url = chunk.metadata["source"]
    new_url = new_url.replace("../../RAG/rtdocs_new", "https://milvus.io/docs")
    new_url = new_url.replace(".html", ".md")
    chunk.metadata.update({"source": new_url})

# Inspect a chunk.
print()
print("Looking at a sample chunk...")
pprint.pprint(smaller_chunks[0].page_content[:100])
pprint.pprint(smaller_chunks[0].metadata)

chunk_size: 512, chunk_overlap: 51.0
docs: 22, split into chunks: 427
type: list of <class 'langchain_core.documents.base.Document'>

Looking at a sample chunk...
('Why Milvus  Docs  Tutorials  Tools  Blog  Community  Stars0  Try Managed '
 'Milvus FREE  Search  Home  ')
{'source': 'https://milvus.io/docs/quickstart.md'}


## HTML Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  This section uses:
- **Strategy** = Use HTML header hierarchies.  Keep header sections together unless they are too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH` for the larger parent chunks, and 512 for the smaller child chunks used in search
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = 
  - Langchain's `HTMLHeaderTextSplitter` to split the HTML pages into header sections.
  - Langchain's `RecursiveCharacterTextSplitter` to split up long sections recursively.


Notice below, each chunk is grounded with the document source page.  <br>
In addition, header titles are kept together with the chunk of text.
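
The parent/child idea can be sketched in a few lines of plain Python: cut a parent section into smaller child pieces, record where each piece starts, and prefix the section header so every child carries its context. The helper names, the made-up section text, and the dictionary layout are assumptions for illustration; the actual cell below uses LangChain `Document` objects.

```python
def split_with_start_index(text: str, size: int) -> list[dict]:
    """Cut text into consecutive pieces of at most `size` characters and
    remember where each piece starts in the parent text."""
    return [
        {"page_content": text[i:i + size], "start_index": i}
        for i in range(0, len(text), size)
    ]

def make_child_chunks(parent_text: str, header: str, source: str, child_size: int) -> list[dict]:
    """Create child chunks that carry the parent's header and source page."""
    children = split_with_start_index(parent_text, child_size)
    for child in children:
        child["source"] = source
        child["h1"] = header
        # Prefix the header unless the chunk already begins with it.
        if not child["page_content"].startswith(header):
            child["page_content"] = f"{header} {child['page_content']}"
    return children

parent = "Install Milvus " + "lorem ipsum " * 20  # made-up section text
children = make_child_chunks(parent, "Install Milvus", "https://milvus.io/docs/quickstart.md", 100)
for c in children:
    print(c["start_index"], repr(c["page_content"][:30]))
```

Keeping `start_index` and `source` on every child makes it possible to look up the surrounding parent text later, once a child chunk has been retrieved.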

In [16]:
# STEP 4. PREPARE DATA: CHUNK AND EMBED

# !python -m pip install lxml
from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

# Define chunk size 512 and overlap 10% chunk_size.
# These will be ANN search vectors.
CHUNK_SIZE = 512
chunk_overlap = round(CHUNK_SIZE * 0.10, 0)
print(f"chunk_size: {CHUNK_SIZE}, chunk_overlap: {chunk_overlap}")

# Define chunk size for "larger" chunks.
# These will be parent chunks retrieved to stuff in the Prompt.
PARENT_CHUNK_SIZE = MAX_SEQ_LENGTH #2000

# Splitter is used to create the parent "larger" chunks.
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=PARENT_CHUNK_SIZE,
    length_function = len,
    # add_start_index=True
    )

# Splitter is used to create the child "smaller" chunks for ANN search.
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CHUNK_SIZE,
    # chunk_overlap = chunk_overlap,
    length_function = len,
    add_start_index=True
    )

# Define the headers to split on for the HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
# Create an instance of the HTMLHeaderTextSplitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the HTML text using the HTMLHeaderTextSplitter.
html_header_splits = []
doc_index = 0
for doc in docs:
    splits = html_splitter.split_text(doc.page_content)
    for split in splits:
        # Add the source URL and header values to the metadata
        metadata = {}
        new_text = split.page_content
        for header_name, metadata_header_name in headers_to_split_on:
            # Handle exception if h1 does not exist.
            try:
                header_value = new_text.split("¶ ")[0].strip()[:50]
                metadata[header_name] = header_value
            except:
                break
            # Handle exception if h2 does not exist.
            try:
                new_text = new_text.split("¶ ")[1].strip()[:50]
                metadata[header_name] = new_text
            except:
                break
        split.metadata = {
            **metadata,
            "source": doc.metadata["source"],
            'doc_index': doc_index
        }
    html_header_splits.extend(splits)
    doc_index += 1

# Split the HTML chunks into parent chunks.
chunks = parent_splitter.split_documents(html_header_splits)
print(f"docs: {len(docs)}, split into parent chunks: {len(chunks)}")

# Split HTML header chunks into smaller chunks for ANN search.
smaller_chunks = []
for chunk in chunks:
    smaller_chunks.extend(child_splitter.split_documents([chunk]))
print(f"parent_chunks: {len(chunks)}, split into smaller chunks: {len(smaller_chunks)}")
print(f"type: list of {type(smaller_chunks[0])}") 

# Insert HTML headers into smaller chunks (extends their "context").
for chunk in smaller_chunks:
    if chunk.page_content.startswith(chunk.metadata['h1'][:20]):
        continue
    metadata_str = ' '.join(str(v) for k, v in chunk.metadata.items() if k in ['h1', 'h2'])
    chunk.page_content = f'{metadata_str} {chunk.page_content}'

# Inspect a parent chunk.
print()
print("Looking at a sample chunk...")
pprint.pprint(chunks[4].page_content[:200])
pprint.pprint(chunks[4].metadata)

chunk_size: 512, chunk_overlap: 51.0
docs: 22, split into parent chunks: 129
parent_chunks: 129, split into smaller chunks: 586
type: list of <class 'langchain_core.documents.base.Document'>

Looking at a sample chunk...
('immediately after data insertions may result in empty result set. To avoid '
 'this, you are advised to wait for a few seconds. Single-vector search The '
 'value of the query_vectors variable is a list conta')
{'doc_index': 0,
 'h1': 'Why Milvus Docs Tutorials Tools Blog Community Sta',
 'source': '../../RAG/rtdocs_new/quickstart.html'}


In [17]:
# Function to replace double newlines and double <br> tags with a single space.
def clean_text(text):
    clean_text = text.replace("\n\n", " ")\
                     .replace("<br><br>", " ")\
                     .replace("<br /><br />", " ")
    return clean_text

# Clean up the metadata urls and chunk texts.
for doc in smaller_chunks:
    doc.metadata["source"] = \
        doc.metadata["source"]\
            .replace("../../RAG/rtdocs_new", "https://milvus.io/docs")\
            .replace(".html", ".md")
    doc.page_content = clean_text(doc.page_content)

# Inspect a child chunk.
print()
print("Looking at a sample chunk...")
pprint.pprint(smaller_chunks[15].page_content[:200])
pprint.pprint(smaller_chunks[15].metadata)


Looking at a sample chunk...
('Why Milvus Docs Tutorials Tools Blog Community Sta loaded collection, refer '
 'to Manage Collections. Collections created using the RESTful API are always '
 'automatically loaded. Insert Data Collections cr')
{'doc_index': 0,
 'h1': 'Why Milvus Docs Tutorials Tools Blog Community Sta',
 'source': 'https://milvus.io/docs/quickstart.md',
 'start_index': 0}


In [18]:
# # Double-check if parent chunks are correctly indexed.
# temp_doc_index = smaller_chunks[15].metadata['doc_index']
# temp_start_index = smaller_chunks[15].metadata['start_index']
# temp_start_index = temp_start_index - 512
# temp_end_index = temp_start_index + 1536

# temp = docs[temp_doc_index].page_content
# print(len(temp))
# pprint.pprint(temp[temp_start_index:temp_end_index].replace("\n\n", " "))

### Transform chunks into vectors using the embedding model

In [19]:
# STEP 5. TRANSFORM CHUNKS INTO VECTORS USING EMBEDDING MODEL INFERENCE.

# Encoder input is docs as a list of strings.
list_of_strings = [doc.page_content for doc in smaller_chunks if hasattr(doc, 'page_content')]

# Embedding inference using the open source SentenceTransformer encoder loaded above.
start_time = time.time()
embeddings = torch.tensor(encoder.encode(list_of_strings))
end_time = time.time()

print(f"Embedding time for {len(list_of_strings)} chunks: ", end="")
print(f"{round(end_time - start_time, 2)} seconds")

# Inference Embeddings: 100%|██████████| 19/19 [00:35<00:00,  1.86s/it]
# Embedding time for 127 chunks: 57.92 seconds

Embedding time for 586 chunks: 59.12 seconds


In [20]:
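# Normalize each embedding vector to unit length (row-wise L2 norm).
# Without axis=1, np.linalg.norm returns one norm for the whole matrix.
embeddings = np.array(embeddings)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)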

# Convert embeddings to list of `numpy.ndarray`, each containing `numpy.float32` numbers.
converted_values = list(map(np.float32, embeddings))

# Inspect the embeddings.
# assert len(chunks[0].page_content) <= MAX_SEQ_LENGTH_IN_TOKENS
assert len(converted_values[0]) == EMBEDDING_DIM
print(f"type embeddings: {type(converted_values)} of {type(converted_values[0])}")
print(f"of numbers: {type(converted_values[0][0])}")

type embeddings: <class 'list'> of <class 'numpy.ndarray'>
of numbers: <class 'numpy.float32'>


## Insert data into Milvus

For each original text chunk, we'll write the fields (`chunk, h1, h2, source, doc_index, start_index, vector`) into the database.

<div>
<img src="../../../images/db_insert.png" width="80%"/>
</div>

**The Milvus Client wrapper can only handle loading data from a list of dictionaries.**

Otherwise, in general, Milvus supports loading data from:
- pandas dataframes 
- list of dictionaries 

In [21]:
# STEP 6. INSERT CHUNK LIST INTO MILVUS OR ZILLIZ.

# Create chunk_list and dict_list in a single loop
dict_list = []
for chunk, vector in zip(smaller_chunks, converted_values):
    # Assemble embedding vector, original text chunk, metadata.
    chunk_dict = {
        'chunk': chunk.page_content,
        'h1': chunk.metadata.get('h1', "")[:50],
        'h2': chunk.metadata.get('h2', "")[:50],
        'source': chunk.metadata.get('source', ""),
        'doc_index': chunk.metadata.get('doc_index', 0),
        'start_index': chunk.metadata.get('start_index', 0),
        'vector': vector,
    }
    dict_list.append(chunk_dict)

# # TODO - Uncomment to inspect a chunk and its metadata.
# print(len(dict_list))
# print(type(dict_list[1]), len(dict_list[1]))
# pprint.pprint(dict_list[1])

In [22]:
# Insert data into the Milvus collection.
print("Start inserting entities")

start_time = time.time()
mc.insert(
    COLLECTION_NAME,
    data=dict_list,
    progress_bar=True)

end_time = time.time()
print(f"Milvus insert time for {len(dict_list)} vectors: ", end="")
print(f"{np.round(end_time - start_time, 2)} seconds")

Start inserting entities
Milvus insert time for 586 vectors: 0.23 seconds


## Aside - example Milvus collection API calls
https://milvus.io/docs/manage-collections.md#View-Collections

Below are some common API calls for checking a collection.
- `.num_entities`, flushes data and executes row count.
- `.describe_collection()`, gives details about the schema, index, collection.
- `.query()`, gives back selected data from the collection.

In [23]:
# Example Milvus Collection utility API calls.
# https://milvus.io/docs/manage-collections.md#View-Collections

# View collection info, incurs a call to .flush() first.
start_time = time.time()
pprint.pprint(mc.describe_collection(COLLECTION_NAME))
end_time = time.time()
print(f"timing: {round(end_time - start_time, 4)} seconds")
print()

# Milvus Lite - notice a delay, so wait 15 seconds.
time.sleep(15)

# Count rows, incurs a call to .flush() first.
start_time = time.time()
res = mc.query( collection_name=COLLECTION_NAME, 
               filter="", 
               output_fields = ["count(*)"], )
pprint.pprint(res)
end_time = time.time()
print(f"timing: {round(end_time - start_time, 4)} seconds")

{'aliases': [],
 'auto_id': True,
 'collection_id': 0,
 'collection_name': 'MilvusDocs',
 'consistency_level': 0,
 'description': '',
 'enable_dynamic_field': True,
 'fields': [{'auto_id': True,
             'description': '',
             'field_id': 100,
             'is_primary': True,
             'name': 'id',
             'params': {},
             'type': <DataType.INT64: 5>},
            {'description': '',
             'field_id': 101,
             'name': 'vector',
             'params': {'dim': 1024},
             'type': <DataType.FLOAT_VECTOR: 101>}],
 'num_partitions': 0,
 'num_shards': 0,
 'properties': {}}
timing: 0.0014 seconds

data: ["{'count(*)': 586}"] , extra_info: {'cost': 0}
timing: 0.0023 seconds


## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 In LLM vocabulary:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "What do the parameters for HNSW mean?"

> **Semantic Search** = very fast search of the entire knowledge base to find the `TOP_K` documentation chunks with the closest embeddings to the user's query.

💡 Always embed the query with the same model that was used to embed the data, so that both live in the same vector space.

In [24]:
# Define a sample question about your data.
QUESTION1 = "What do the parameters for HNSW mean?"
QUESTION2 = "What are good default values for HNSW parameters with 25K vectors dim 1024?"
QUESTION3 = "What does nlist vs nprobe mean in ivf_flat?"
QUESTION4 = "What is the default AUTOINDEX index and vector field distance metric in Milvus?"

# In case you want to ask all the questions at once.
QUERY = [QUESTION1, QUESTION2, QUESTION3, QUESTION4]

# Inspect the length of one question.
QUERY_LENGTH = len(QUESTION2)
print(f"example query length: {QUERY_LENGTH}")

example query length: 75


In [25]:
# SELECT A PARTICULAR QUESTION TO ASK.

SAMPLE_QUESTION = QUESTION1

## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional Keyword Search** - either or both words "leaky", "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing words "drippy" "taps" would be returned as well because these words mean the same thing even though they are different words.
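
The keyword half of this comparison is easy to demonstrate. The sketch below is a deliberately naive keyword matcher (the function name and the two sample pages are invented); it finds the page that literally contains "leaky" or "faucet" and misses the one about "drippy taps", which is the gap that embedding-based search is meant to close.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True if any query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())

# Two made-up pages that mean roughly the same thing.
sample_pages = ["how to fix a leaky faucet", "repairing drippy taps at home"]
for page in sample_pages:
    print(f"{page!r} -> {keyword_match('leaky faucet', page)}")
```

A semantic search would compare the embedding of "leaky faucet" with the embeddings of both pages instead of comparing words, so the second page could still rank highly.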

### Exercise #3 (2 min):
Search Milvus using the default search index.

In [26]:
# query_embeddings = _utils.embed_query(encoder, [SAMPLE_QUESTION])
# TOP_K = 2

# results = mc.search(
#     #(exercise): code here # Answer: COLLECTION_NAME,
#     data=query_embeddings,
#     limit=TOP_K,
#     consistency_level="Eventually"
# )
# print(f"Found top {len(results[0])} results for question: {SAMPLE_QUESTION}")

In [27]:
# Define metadata fields you can filter on.
OUTPUT_FIELDS = list(dict_list[0].keys())
OUTPUT_FIELDS.remove('vector')
print(f"output fields: {OUTPUT_FIELDS}")

query_embeddings = _utils.embed_query(encoder, [SAMPLE_QUESTION])
TOP_K = 2

results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    # search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    # Milvus can utilize metadata in boolean expressions to filter search.
    # filter=filter_expression,
    limit=TOP_K,
    consistency_level="Eventually"
)
print(f"Found top {len(results[0])} results for question: {SAMPLE_QUESTION}")

# Define a convenience function for searching.
def mc_run_search(question, filter_expression, top_k=3):
    # Embed the question using the same encoder.
    query_embeddings = _utils.embed_query(encoder, [question])

    # # Return top k results with HNSW index.
    # SEARCH_PARAMS = dict({
    #     # Re-use index param for num_candidate_nearest_neighbors.
    #     "ef": INDEX_PARAMS['efConstruction']
    # })

    # Run semantic vector search using your query and the vector database.
    results = mc.search(
        COLLECTION_NAME,
        data=query_embeddings, 
        # search_params=SEARCH_PARAMS,
        output_fields=OUTPUT_FIELDS, 
        # Milvus can utilize metadata in boolean expressions to filter search.
        filter=filter_expression,
        limit=top_k,
        consistency_level="Eventually"
    )

    # Assemble retrieved context and context metadata.
    # The search result is in the variable `results[0]`, which is type 
    # 'pymilvus.orm.search.SearchResult'. 
    METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
    formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
        results, metadata_fields=METADATA_FIELDS, num_shot_answers=top_k)
    
    return formatted_results, context, context_metadata

output fields: ['chunk', 'h1', 'h2', 'source', 'doc_index', 'start_index']
Found top 2 results for question: What do the parameters for HNSW mean?


In [28]:
# STEP 7. RETRIEVE ANSWERS FROM YOUR DOCUMENTS STORED IN MILVUS OR ZILLIZ.

# Metadata filters for CSV dataset.
# expression = 'film_year >= 2019'
expression = ""
print(f"filter: {expression}")
TOP_K = 2

start_time = time.time()
formatted_results, contexts, context_metadata = \
    mc_run_search(SAMPLE_QUESTION, expression, TOP_K)
elapsed_time = time.time() - start_time
print(f"Milvus Client search time for {len(dict_list)} vectors: {elapsed_time} seconds")

# Inspect search result.
print(f"type: {type(formatted_results)}, count: {len(formatted_results)}")

filter: 
Milvus Client search time for 586 vectors: 0.12221956253051758 seconds
type: <class 'list'>, count: 2


## Assemble and inspect the search result

The search result is in the variable `results[0]`, which consists of `top_k` objects of type `'pymilvus.client.abstract.Hits'`.
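
As a rough sketch of what assembling the result involves, the snippet below assumes each hit behaves like a dict with `id`, `distance`, and `entity` keys. The mock data and the `assemble_context` helper are hypothetical; this is not the implementation of `_utils.client_assemble_retrieved_context`.

```python
# Mock of the hits returned for one query; all values are invented for illustration.
mock_results = [[
    {"id": 1, "distance": 0.72,
     "entity": {"chunk": "HNSW is a graph-based index.",
                "source": "https://milvus.io/docs/index.md"}},
    {"id": 2, "distance": 0.71,
     "entity": {"chunk": "ef controls the search range.",
                "source": "https://milvus.io/docs/index.md"}},
]]

def assemble_context(results, text_field="chunk"):
    """Split the hits for the first query into distances, texts, and metadata."""
    distances, texts, metadata = [], [], []
    for hit in results[0]:
        entity = dict(hit["entity"])           # copy so the original hit is untouched
        texts.append(entity.pop(text_field))   # the chunk text becomes the context
        distances.append(hit["distance"])
        metadata.append(entity)                # remaining fields are metadata
    return distances, texts, metadata

demo_distances, demo_texts, demo_metadata = assemble_context(mock_results)
for distance, text, meta in zip(demo_distances, demo_texts, demo_metadata):
    print(distance, text, meta)
```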



In [29]:
# Loop through search results, print metadata.
sources = []
for i in range(len(contexts)):
    print(f"Retrieved result #{i+1}")
    print(f"distance = {formatted_results[i][0]}")
    pprint.pprint(f"Chunk text: {contexts[i]}")
    for key, value in context_metadata[i].items():
        if key == "source":
            sources.append(value)
        print(f"{key}: {value}")
    print()

Retrieved result #1
distance = 0.7174309492111206
('Chunk text: Why Milvus Docs Tutorials Tools Blog Community Sta is a '
 'range-search parameter and terminates the search process whilst the number '
 'of consecutive empty buckets reaches the specified value.Increasing this '
 'value can improve recall rate at the cost of increased search time. [1, '
 '65535] 2 HNSW HNSW (Hierarchical Navigable Small World Graph) is a '
 'graph-based indexing algorithm. It builds a multi-layer navigation structure '
 'for an image according to certain rules. In this structure, the upper layers '
 'are more sparse and the distances between nodes are farther; the')
h1: Why Milvus Docs Tutorials Tools Blog Community Sta
h2: 
source: https://milvus.io/docs/index.md
doc_index: 3
start_index: 625

Retrieved result #2
distance = 0.7135186195373535
('Chunk text: Why Milvus Docs Tutorials Tools Blog Community Sta begin another '
 'search. After multiple iterations, it can quickly approach the target '
 'positi

In [30]:
unique_tuples = []
parent_chunks = []

# Loop through the search results and keep only unique parent chunks.
i = 0
for context, item in zip(contexts, context_metadata):
    # Extract doc_index and start_index from each item.
    doc_index = item['doc_index']
    start_index = item['start_index']
    
    # Create a tuple of (doc_index, start_index).
    current_tuple = (doc_index, start_index)
    
    # Assume the current tuple is unique until a near-duplicate is found.
    is_unique = True
    
    # Check if the start_index is within MAX_SEQ_LENGTH of any start_index in unique_tuples.
    for unique_tuple in unique_tuples:
        if unique_tuple[0] == current_tuple[0] \
            and abs(unique_tuple[1]-current_tuple[1])<=MAX_SEQ_LENGTH:
            is_unique = False
            print("Duplicate parent chunk text found.")
            break
    
    # Process unique tuples.
    if is_unique:
        # Append it to the list of unique tuples
        unique_tuples.append(current_tuple)

        # Get and clean parent chunk text.
        match_text = context
        temp_index = len(match_text) // 2
        match_text = match_text[temp_index:temp_index+40]
        print(f"match_text: {match_text}")
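        # NOTE: hard-coded override of the computed match_text for SAMPLE_QUESTION.
        # Remove this line to use the snippet computed above for other questions.
        match_text = "2 HNSW HNSW (Hierarchical Navigable"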

        parent_text = docs[current_tuple[0]].page_content
        parent_text = clean_text(parent_text)
        temp_index = parent_text.find(match_text)

        if temp_index != -1:
            start_index = max(0, temp_index-200)
            end_index = min(len(parent_text), temp_index+MAX_SEQ_LENGTH-236)
            parent_chunk_text = parent_text[start_index:end_index]
            parent_chunks.append(parent_chunk_text)
        else:
            print("Text not found.")

        # # TODO: comment out debugging check if parents contain retrieved chunks.
        # print(f"Unique tuple: {current_tuple}")
        # print("Parent Chunk text: ")
        # pprint.pprint(parent_chunks[i])
        # print()

        i += 1

match_text: 5] 2 HNSW HNSW (Hierarchical Navigable S
Duplicate parent chunk text found.
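
The de-duplication rule used in the cell above can be isolated into a small function. This is a minimal sketch: the helper names, the sample tuples, and the gap of 512 are illustrative assumptions, and the notebook itself uses `MAX_SEQ_LENGTH` as the gap.

```python
def is_near_duplicate(candidate, seen, max_gap):
    """True if candidate (doc_index, start_index) is within max_gap
    characters of an already-seen tuple from the same document."""
    doc_index, start_index = candidate
    return any(
        seen_doc == doc_index and abs(seen_start - start_index) <= max_gap
        for seen_doc, seen_start in seen
    )

def unique_parents(candidates, max_gap):
    """Keep the first tuple of every group of near-duplicates, in order."""
    kept = []
    for candidate in candidates:
        if not is_near_duplicate(candidate, kept, max_gap):
            kept.append(candidate)
    return kept

# (doc_index, start_index) pairs; the values and the gap are illustrative.
candidates = [(3, 625), (3, 900), (7, 625)]
print(unique_parents(candidates, max_gap=512))
```

Two chunks from the same document whose start offsets are closer than the gap would map to overlapping parent text, so only the first is kept.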


## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.

Many different generative LLMs exist these days.  Check out the lmsys [leaderboard](https://chat.lmsys.org/?leaderboard).

In this notebook, we'll try these LLMs:
- The newly released open-source Llama 3 from Meta.
- Claude 3 Haiku, the cheapest paid model from Anthropic.
- gpt-3.5-turbo from OpenAI, the standard in its price category.

In [31]:
# STEP 8. LLM-GENERATED ANSWER TO THE QUESTION, GROUNDED BY RETRIEVED CONTEXT.

# Join all the contexts together, separated by a space.
# Lance Martin, LangChain, says put best contexts at end.
# contexts_combined = ' '.join(reversed(contexts))
contexts_combined = ' '.join(reversed(parent_chunks))

# Join all the sources together, separated by a space.
source_combined = ' '.join(reversed(sources))
print(f"Length long text to summarize: {len(contexts_combined)}")

# Define temperature for the LLM and random seed.
TEMPERATURE = 0.1
TOP_P = 0.9
RANDOM_SEED = 415
MAX_TOKENS = 512
FREQUENCY_PENALTY = 2

Length long text to summarize: 1500


In [32]:
SYSTEM_PROMPT = f"""First, check if the Context below is relevant to 
the user's question.  Second, only if the context is strongly relevant, 
answer the question using the context.  Otherwise, if the context is not 
strongly relevant, answer the question without using the context.  
Be clear, concise, relevant.  Answer with fewer than 2 sentences and cite unique sources.
Grounding sources: {source_combined}
Context: {contexts_combined}
"""
print(f"Length prompt: {len(SYSTEM_PROMPT)}")

Length prompt: 1948


In [33]:
# # Inspect the prompt.
# pprint.pprint(SYSTEM_PROMPT)

## Try Meta Llama 3 with Ollama to generate a human-like chat response to the user's question

Follow the instructions to install ollama and pull a model.<br>
https://github.com/ollama/ollama

View details about which models are supported by ollama. <br>
https://ollama.com/library/llama3

That page says `ollama run llama3` will by default pull the latest "instruct" model, which is fine-tuned for chat/dialogue use cases.

The other kind of llama3 model is the "pre-trained" base model. <br>
Examples: `ollama run llama3:text`, `ollama run llama3:70b-text`

**Format** `gguf` is a model file format that lets the model run on CPU.  "gg" stands for Georgi Gerganov, creator of the C library ggml, whose model format (also called ggml) was recently replaced by gguf.

**Quantization** (think of it like vector compaction) can lead to higher throughput at the expense of lower accuracy.  For the curious, quantization meanings can be found on: <br>
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main.  

The main quantization types are listed below.
- **q4_0**: Original quant method, 4-bit.
- **q4_k_m**: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
- **q5_0**: Higher accuracy, higher resource usage and slower inference.
- **q5_k_m**: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
- **q6_k**: Uses Q8_K for all tensors
- **q8_0**: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
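
To make the accuracy trade-off concrete, here is a deliberately simplified sketch of symmetric 4-bit quantization for one block of weights. It is not the exact storage layout of any of the quantization types listed above, and the weight values are made up.

```python
def quantize_4bit(values):
    """Symmetric 4-bit quantization of one block:
    one float scale plus integer codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes

def dequantize(scale, codes):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]  # made-up values
scale, codes = quantize_4bit(weights)
restored = dequantize(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.4f}, codes={codes}, max abs error={max_error:.4f}")
```

Each weight is stored in 4 bits instead of 16 or 32, and the rounding error per weight is bounded by half the scale. Using more bits shrinks the scale and therefore the error, at the cost of more memory.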

In [34]:
# !python -m pip install ollama
import ollama

# Verify the details of the model you are running.
ollama_llama3 = ollama.list()['models'][0]

# Print the model details.
keys = ['format', 'parameter_size', 'quantization_level']
print(f"MODEL:{ollama_llama3['name']}", end=", ")
for key in keys:
    print(f"{key.upper()}:{ollama_llama3['details'].get(key, 'Key not found in dictionary')}", end=", ")
print(end="\n\n")

MODEL:llama3:latest, FORMAT:gguf, PARAMETER_SIZE:8B, QUANTIZATION_LEVEL:Q4_0, 



In [35]:
# Send the question to llama 3 chat.
start_time = time.time()
response = ollama.chat(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model='llama3',
    stream=False,
    options={"temperature": TEMPERATURE, "seed": RANDOM_SEED,
             "top_p": TOP_P, 
            #  "max_tokens": MAX_TOKENS,
             "frequency_penalty": FREQUENCY_PENALTY}
)

ollama_llama3_time = time.time() - start_time
pprint.pprint(response['message']['content'].replace('\n', ' '))
print(f"ollama_llama3_time: {format(ollama_llama3_time, '.2f')} seconds")

("According to the context and documentation [1], here's what the parameters "
 'for Hierarchical Navigable Small World Graph (HNSW) mean:  **M**: Maximum '
 'number of outgoing connections in each layer of the graph. A higher M leads '
 'to a better accuracy at the cost of increased search time.  Range: (2, '
 '2048)  **efConstruction**: Controls the trade-off between index build speed '
 'and quality. Increasing this parameter may improve index quality but also '
 'increases indexing time.  Range: (1, int_max)  **ef**: Parameter controlling '
 'query time vs. recall rate during searching targets in HNSW indexes.  No '
 "specific range is mentioned for ef, as it depends on the application's "
 'requirements and constraints.  These parameters allow you to fine-tune your '
 'HNSW index for optimal performance based on your use case needs [1].  '
 'References: [1] https://milvus.io/docs/index.md (Milvus documentation)')
ollama_llama3_time: 10.43 seconds


## Now try Anyscale Endpoints


In [36]:
# # List all the anyscale endpoint models.
# !llm models list

In [37]:
# Call the Anyscale endpoint using the OpenAI API.
import openai

LLM_NAME = "meta-llama/Llama-3-8b-chat-hf"

# 1. Get your Anyscale Endpoints API key (not an OpenAI key).
# 2. Save your API key in an env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
anyscale_client = openai.OpenAI(
    base_url = "https://api.endpoints.anyscale.com/v1",
    api_key=os.environ.get("ANYSCALE_ENPOINT_KEY"),
)

# 3. Generate response using the OpenAI API.
start_time = time.time()
response = anyscale_client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    seed=RANDOM_SEED,
    frequency_penalty=FREQUENCY_PENALTY,
    top_p=TOP_P, 
    max_tokens=MAX_TOKENS,
)
llama3_anyscale_endpoints_time = time.time() - start_time

# Print the response.
pprint.pprint(response.choices[0].message.content.replace('\n', ' '))
print(f"llama3_anyscale_endpoints_time: {format(llama3_anyscale_endpoints_time, '.2f')} seconds")

('According to the provided context, the parameters for HNSW (Hierarchical '
 'Navigable Small World Graph) are:  * M: defines the maximum number of '
 'outgoing connections in the graph. Higher M leads to higher accuracy and '
 'longer run time at fixed ef/efConstruction. * efConstruction: controls index '
 'search speed/build speed tradeoff. Increasing this parameter may enhance '
 'index quality, but it also tends to lengthen indexing time. * ef: Parameter '
 'controlling query time/recall rate tradeoff.  These parameters help balance '
 'between search efficiency and recall rate in HNSW indexing algorithm.')
llama3_anyscale_endpoints_time: 2.54 seconds


In [38]:
# Also try OctoAI
# !python -m pip install octoai
from octoai.text_gen import ChatMessage
from octoai.client import OctoAI

LLM_NAME = "meta-llama-3-8b-instruct"

octoai_client = OctoAI(
    api_key=os.environ.get("OCTOAI_TOKEN"),
)

# Generate response using the OctoAI SDK.
start_time = time.time()
response = octoai_client.text_gen.create_chat_completion(
    messages=[
        ChatMessage(
            content=SYSTEM_PROMPT,
            role="system"
        ),
        ChatMessage(
            content=SAMPLE_QUESTION,
            role="user"
        )
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    # seed=RANDOM_SEED,
    frequency_penalty=FREQUENCY_PENALTY,
    top_p=TOP_P, 
    max_tokens=MAX_TOKENS,
)
llama3_octai_endpoints_time = time.time() - start_time

# Print the response.
pprint.pprint(response.choices[0].message.content.replace('\n', ' '))
print(f"llama3_octai_endpoints_time: {format(llama3_octai_endpoints_time, '.2f')} seconds")

('According to the context, the parameters for HNSW (Hierarchical Navigable '
 'Small World Graph) are:  * M: defines the maximum number of outgoing '
 'connections in the graph. Higher M leads to higher accuracy and longer run '
 'time at fixed ef/efConstruction. * efConstruction: controls index search '
 'speed/build speed tradeoff. Increasing this parameter may enhance index '
 'quality, but it also tends to lengthen indexing time. * ef: Parameter '
 'controlling query time.  These parameters can be adjusted to balance between '
 'accuracy and efficiency in HNSW searches.')
llama3_octai_endpoints_time: 1.70 seconds


In [39]:
# Also try Groq endpoints
# !python -m pip install groq
from groq import Groq

LLM_NAME = "llama3-8b-8192"

groq_client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

# Generate response using the Groq client.
start_time = time.time()
response = groq_client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    seed=RANDOM_SEED,
    frequency_penalty=FREQUENCY_PENALTY,
    top_p=TOP_P, 
    max_tokens=MAX_TOKENS,
)
llama3_groq_endpoints_time = time.time() - start_time

# Print the response.
pprint.pprint(response.choices[0].message.content.replace('\n', ' '))
print(f"llama3_groq_endpoints_time: {format(llama3_groq_endpoints_time, '.2f')} seconds")

('According to the context, the parameters for HNSW (Hierarchical Navigable '
 'Small World Graph) are:  * M: defines the maximum number of outgoing '
 'connections in the graph, with higher M leading to higher accuracy and '
 'longer run time at fixed ef/efConstruction. * efConstruction: controls the '
 'index search speed/build speed tradeoff, with increasing efConstruction '
 'enhancing index quality but lengthening indexing time. * ef: controls query '
 'time/search time, with higher ef leading to faster search but potentially '
 'lower accuracy.  These parameters can be adjusted to balance accuracy, '
 'search speed, and indexing time.')
llama3_groq_endpoints_time: 0.48 seconds


## Also try Anthropic Claude 3

We've practiced retrieval for free on our own data using open-source LLMs.  <br>

Now let's make a call to the paid Claude 3. [List of models](https://docs.anthropic.com/claude/docs/models-overview)
- Opus - most expensive
- Sonnet
- Haiku - least expensive!

Prompt engineering tutorials
- [Interactive](https://docs.google.com/spreadsheets/d/19jzLgRruG9kjUQNKtCg1ZjdD6l6weA6qRXG5zLIAhC8/edit#gid=150872633)
- [Static](https://docs.google.com/spreadsheets/d/1jIxjzUWG-6xBVIa2ay6yDpLyeuOh_hR_ZB75a47KX_E/edit#gid=869808629)

In [40]:
# SYSTEM_PROMPT = f"""Use the Context below to answer the user's question. 
# Be clear, factual, complete, concise.
# If the answer is not in the Context, say "I don't know". 
# Otherwise answer with fewer than 4 sentences and cite the unique sources.
# Context: {contexts_combined}
# Sources: {source_combined}

# Answer with 2 parts: the answer and the source citations.
# Answer: The answer to the question.
# Sources: unique url sources
# """

In [41]:
# # !python -m pip install anthropic
# import anthropic

# ANTHROPIC_API_KEY=os.environ.get("ANTHROPIC_API_KEY")

# # # Model names
# # claude-3-opus-20240229
# # claude-3-sonnet-20240229
# # claude-3-haiku-20240307
# CLAUDE_MODEL = "claude-3-haiku-20240307"
# print(f"Model: {CLAUDE_MODEL}")
# print()

# client = anthropic.Anthropic(
#     # defaults to os.environ.get("ANTHROPIC_API_KEY")
#     api_key=ANTHROPIC_API_KEY,
# )

# # Print the question and answer along with grounding sources and citations.
# print(f"Question: {SAMPLE_QUESTION}")

# # CAREFUL!! THIS COSTS MONEY!!
# message = client.messages.create(
#     model=CLAUDE_MODEL,
#     max_tokens=1000,
#     temperature=0.0,
#     system=SYSTEM_PROMPT,
#     messages=[
#         {"role": "user", "content": SAMPLE_QUESTION}
#     ]
# )
# print("Answer:")
# pprint.pprint(message.content[0].text.replace('\n', ' '))

<div>
<img src="../../../images/anthropic_claude3.png" width="80%"/>
</div>

## Also try MistralAI's Mixtral 8x7B-Instruct-v0.1

This time ollama's version requires 48GB RAM. If you have big enough compute, run the command:
> ollama run mixtral

Since my laptop is an M2 with only 16GB RAM, I decided to **run Mixtral using Anyscale Endpoints**. Instructions to install. <br>
> https://github.com/simonw/llm-anyscale-endpoints

To get back to **Anyscale Endpoints** anytime, open the playground.<br>
https://console.anyscale.com/v2/playground

In [42]:
# Call the Anyscale endpoint using the OpenAI API.
import openai

LLM_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 1. Get your Anyscale Endpoints API key (not an OpenAI key).
# 2. Save your API key in an env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_client = openai.OpenAI(
    base_url = "https://api.endpoints.anyscale.com/v1",
    api_key=os.environ.get("ANYSCALE_ENPOINT_KEY"),
)

# 3. Generate response using the OpenAI API.
start_time = time.time()
response = openai_client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    seed=RANDOM_SEED,
    frequency_penalty=FREQUENCY_PENALTY,
)
mixtral_anyscale_endpoints_time = time.time() - start_time

# Print the response.
pprint.pprint(response.choices[0].message.content.replace('\n', ' '))
print(f"mixtral_anyscale_endpoints_time: {format(mixtral_anyscale_endpoints_time, '.2f')} seconds")

(' The parameters for HNSW, a graph-based indexing algorithm, are as follows:  '
 '1. M: Defines the maximum number of outgoing connections in the graph. A '
 'higher M leads to higher accuracy and longer run time, with a suggested '
 'range of (2, 2048). 2. efConstruction: Controls the index search speed/build '
 'speed tradeoff during indexing. Increasing this parameter may enhance index '
 'quality but also tends to lengthen the indexing time; it accepts values from '
 '1 to int\\_max. 3. ef: Controls query time; it is not specified what '
 'int\\_max is in this context but can be any positive integer value '
 '(including zero). This parameter affects query time/accuracy tradeoff during '
 'target searches within an existing HNSW structure '
 '([1](https://milvus.io/docs/v0.7.0/parameters_overview_HNSW%20index%20type_(standalone).md), '
 '[2](https://www.pinecone-database.com/docs/parameters-and-properties/#hnsw)).')
mixtral_anyscale_endpoints_time: 4.01 seconds


<div>
<img src="../../../images/mistral_mixtral.png" width="80%"/>
</div>

## Also try OpenAI

💡 Note: For use cases that need to always be factually grounded, use very low temperature values while more creative tasks can benefit from higher temperatures.
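
The effect of temperature can be illustrated with a softmax over a few candidate-token scores. This is a generic sketch of temperature scaling, not the internals of any particular provider, and the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature nearly all probability mass lands on the top-scoring token, so sampling becomes close to deterministic. Higher temperatures flatten the distribution and make less likely tokens more probable.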

In [43]:
SYSTEM_PROMPT = f"""First, check if the Context below is relevant to 
the user's question.  Second, only if the context is strongly relevant, 
answer the question using the context.  Otherwise, if the context is not 
strongly relevant, answer the question without using the context.
Be clear, concise, relevant.  Answer with fewer than 4 sentences 
and cite unique grounding sources.
Grounding sources: {source_combined}
Context: {contexts_combined}
"""

In [44]:
import openai, pprint
from openai import OpenAI

# 1. Define the generation llm model to use.
# https://openai.com/blog/new-embedding-models-and-api-updates
# Customers using the pinned gpt-3.5-turbo model alias will be automatically upgraded to gpt-3.5-turbo-0125 two weeks after this model launches.
LLM_NAME = "gpt-3.5-turbo"

# 2. Get your API key: https://platform.openai.com/api-keys
# 3. Save your api key in env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# 4. Generate response using the OpenAI API.
start_time = time.time()
response = openai_client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    seed=RANDOM_SEED,
    frequency_penalty=FREQUENCY_PENALTY,
)
chatgpt_35turbo_time = time.time() - start_time

# Print the question and answer along with grounding sources and citations.
print(f"Question: {SAMPLE_QUESTION}")

# 5. Print all answers in the response.
for i, choice in enumerate(response.choices, 1):
    pprint.pprint(f"Answer: {choice.message.content}")
    print("\n")
print(f"chatgpt_3.5_turbo_time: {format(chatgpt_35turbo_time, '.5f')}")

# Question1: What do the parameters for HNSW mean?
# Answer:  Looks perfect!
# Best answer:  M: maximum degree of nodes in a layer of the graph. 
# efConstruction: number of nearest neighbors to consider when connecting nodes in the graph.
# ef: number of nearest neighbors to consider when searching for similar vectors. 

# Question2: What are good default values for HNSW parameters with 25K vectors dim 1024?
# Answer: M=16, efConstruction=500, and ef=64
# Best answer:  M=16, efConstruction=32, ef=32

# Question3: what is the default distance metric used in AUTOINDEX in Milvus?
# Answer: L2 
# Best answer:  IP inner product, not yet updated in documentation still says L2.

# Question4: What does nlist mean in ivf_flat?
# 'Answer: In IVF_FLAT, nlist refers to the number of cluster units that divide '
#  'a vector space. When using the default value of 16384 for nlist in Milvus, '
#  "distances between the target vector and all 16384 clusters' centers are "
#  'compared to find the nearest clusters for further comparison with vectors '
#  'within those selected clusters. This parameter influences how clustering is '
#  'performed and affects search efficiency in Milvus.\n'
#  'Sources: https://milvus.io/docs/index.md')

Question: What do the parameters for HNSW mean?
('Answer: The parameters for HNSW (Hierarchical Navigable Small World Graph) '
 'are M and efConstruction for index building, and ef for searching targets. \n'
 '- M defines the maximum number of outgoing connections in the graph, '
 'affecting accuracy and runtime.\n'
 '- efConstruction controls search speed/build speed tradeoff during index '
 'construction.\n'
 '- ef is a parameter controlling query time/search range. \n'
 'These parameters help optimize performance by balancing accuracy, search '
 'time, and recall rate. [Source: https://milvus.io/docs/index.md]')


chatgpt_3.5_turbo_time: 2.45253


## Use Ragas to evaluate RAG pipeline

Ragas is an open source project for evaluating RAG components.  [Paper](https://arxiv.org/abs/2309.15217), [Code](https://github.com/explodinggradients/ragas), [Docs](https://docs.ragas.io/en/stable/getstarted/index.html), [Intro blog](https://medium.com/towards-data-science/rag-evaluation-using-ragas-4645a4c6c477).

<div>
<img src="../../../images/ragas_eval_image.png" width="80%"/>
</div>

**Please note that Ragas can consume a large number of OpenAI API tokens.** <br> 

Read through this notebook carefully and pay attention to the number of questions and metrics you want to evaluate.



In [45]:
# !python -m pip install -U ragas datasets

In [46]:
import os, sys
import pandas as pd
import numpy as np
import ragas, datasets
from langchain_huggingface import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Import custom functions for evaluation.
sys.path.append("../../Integration")  
import eval_ragas as _eval_ragas

# Import the evaluation metrics.
from ragas.metrics import (
    context_recall, 
    context_precision, 
    faithfulness, 
    answer_relevancy, 
    answer_similarity,
    answer_correctness
    )

# Get the current working directory.
cwd = os.getcwd()
relative_path = '/../../Evaluation/data/blog_eval_answers.csv'
file_path = cwd + relative_path
# print(f"file_path: {file_path}")

# Read ground truth answers from file.
eval_df = pd.read_csv(file_path, header=0, skip_blank_lines=True)
display(eval_df.head())

Unnamed: 0,Question,ground_truth_answer,Custom_RAG_context,simple_context,Custom_RAG_answer,llama3_ollama_answer,llama3_anyscale_answer,llama3_octoai_answer,llama3_groq_answer,mixtral_8x7b_anyscale_answer
0,What do the parameters for HNSW mean?,"* M: maximum degree, or number of connections ...",HNSW (Hierarchical Navigable Small World Graph...,"In order to improve performance, HNSW limits t...",- M defines the maximum number of outgoing con...,1. **M**: Maximum degree of nodes on each laye...,* M: defines the maximum number of outgoing co...,* M: defines the maximum number of outgoing co...,* M: defines the maximum number of outgoing co...,* The `M` parameter in `MM` defines the maximu...
1,What are good default values for HNSW paramete...,"M=16, efConstruction=32, ef=32",HNSW (Hierarchical Navigable Small World Graph...,Why Milvus Docs Tutorials Tools Blog Community...,- M = 32 - efConstruction = 100,M=16\nefConstruction=128 ef=64,* M: 16 * efConstruction: 100 * ef: top_k,* M: 16 * efConstruction: 100 * ef: top_k,* M: 16 * efConstruction: 100 * ef: 100,- M: 16 - efConstruction: 100 - ef: 50
2,What does nlist vs nprobe mean in ivf_flat?,# nlist: controls how the vector data is part...,IVF_FLAT divides vector data into nlist cluste...,`nlist` in IVF-Flat represents the number of c...,- nlist in IVF_FLAT refers to the number of cl...,- `nlist` refers to the number of cluster unit...,- `nlist` refers to the number of cluster unit...,- `nlist` refers to the number of cluster unit...,- `nlist` refers to the number of cluster unit...,- `nlist` refers to the number of cluster unit...
3,What is the default AUTOINDEX index and vector...,Index type = HNSW and distance metric=IP Inner...,"""AUTOINDEX"", metric_type: ""COSINE"", i...",Index parameters Index parameters dictate how ...,The default AUTOINDEX index in Milvus is IVF_S...,The default `AUTOINDEX` index uses a combinati...,"According to the Milvus documentation, the def...","According to the Milvus documentation, the def...","According to the Milvus documentation, the def...",The default AUTOINDEX index in Milvus is an An...


In [47]:
##########################################
# Set the evaluation type. The last assignment wins,
# so comment out the line you do not want.
EVALUATE_WHAT = 'ANSWERS'
EVALUATE_WHAT = 'CONTEXTS'
##########################################

# Set the columns to evaluate.
if EVALUATE_WHAT == 'CONTEXTS':
    cols_to_evaluate=['Custom_RAG_context', 'simple_context']
elif EVALUATE_WHAT == 'ANSWERS':
    cols_to_evaluate=['Custom_RAG_answer', 'llama3_ollama_answer', 
                      'llama3_anyscale_answer', 'llama3_octoai_answer',
                      'llama3_groq_answer', 'mixtral_8x7b_anyscale_answer']

# Set the metrics to evaluate.
if EVALUATE_WHAT == 'ANSWERS':
    eval_metrics=[
        answer_relevancy,
        answer_similarity,
        answer_correctness,
        faithfulness,
        ]
    metrics = ['answer_relevancy', 'answer_similarity', 'answer_correctness', 'faithfulness']
elif EVALUATE_WHAT == 'CONTEXTS':
    eval_metrics=[
        context_recall, 
        context_precision,
        ]
    metrics = ['context_recall', 'context_precision']
    
# Change the default LLM-as-critic model.
LLM_NAME = "gpt-3.5-turbo"
ragas_llm = ragas.llms.llm_factory(model=LLM_NAME)

# Change the default embedding model to a HuggingFace model.
EMB_NAME = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
lc_embed_model = HuggingFaceEmbeddings(
    model_name=EMB_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
ragas_emb = LangchainEmbeddingsWrapper(embeddings=lc_embed_model)

# Set the critic LLM and the embedding model on each metric.
for metric in eval_metrics:
    metric.llm = ragas_llm
    metric.embeddings = ragas_emb

# Execute the evaluation.
print(f"Evaluating {EVALUATE_WHAT} using {eval_df.shape[0]} eval questions:")
ragas_result, scores = _eval_ragas.evaluate_ragas_model(
    eval_df, 
    eval_metrics, 
    what_to_evaluate=EVALUATE_WHAT,
    cols_to_evaluate=cols_to_evaluate)



Evaluating CONTEXTS using 4 eval questions:


Evaluating:   0%|          | 0/8 [00:00<?, ?it/s]

Evaluate chunking: Custom_RAG_context, avg_score: 0.67


Evaluating:   0%|          | 0/8 [00:00<?, ?it/s]

Evaluate chunking: simple_context, avg_score: 0.42


In [48]:
# Calculate and print the percent improvements.
if EVALUATE_WHAT == 'ANSWERS':
    # Sort scores from highest to lowest
    sorted_scores = sorted(scores, key=lambda item: sum(item.values()), reverse=True)
    pprint.pprint(sorted_scores)
    # Calculate the percent improvement of the best LLM over the worst LLM.
    highest_score = list(sorted_scores[0].values())[0]
    lowest_score = list(sorted_scores[-1].values())[0]
    best_llm = list(sorted_scores[0].keys())[0]
    worst_llm = list(sorted_scores[-1].keys())[0]
    percent_better = (highest_score - lowest_score) / lowest_score * 100
    print(f"{best_llm} {np.round(percent_better,0)}% improvement over {worst_llm}.")

elif EVALUATE_WHAT == 'CONTEXTS':
    pprint.pprint(scores)
    percent_better = (scores[0]['Custom_RAG_context'] - scores[1]['simple_context']) \
                     / scores[1]['simple_context'] * 100
    print(f"HTML chunking {np.round(percent_better,0)}% improvement over Simple chunking.")

# Display the evaluation details.
display(ragas_result)

[{'Custom_RAG_context': 0.67}, {'simple_context': 0.42}]
HTML chunking 60.0% improvement over Simple chunking.


| | question | contexts | answer | ground_truth | context_recall | context_precision | context_f1 | evaluated |
|---|---|---|---|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean? | [HNSW (Hierarchical Navigable Small World Grap... | - M defines the maximum number of outgoing con... | * M: maximum degree, or number of connections ... | 1.0 | 1.0 | 1.0 | Custom_RAG_context |
| 1 | What are good default values for HNSW paramete... | [HNSW (Hierarchical Navigable Small World Grap... | - M = 32 - efConstruction = 100 | M=16, efConstruction=32, ef=32 | 1.0 | 1.0 | 1.0 | Custom_RAG_context |
| 2 | What does nlist vs nprobe mean in ivf_flat? | [IVF_FLAT divides vector data into nlist clust... | - nlist in IVF_FLAT refers to the number of cl... | # nlist: controls how the vector data is part... | 0.5 | 1.0 | 0.666667 | Custom_RAG_context |
| 3 | What is the default AUTOINDEX index and vector... | ["AUTOINDEX", metric_type: "COSINE", ... | The default AUTOINDEX index in Milvus is IVF_S... | Index type = HNSW and distance metric=IP Inner... | 0.0 | 0.0 | 0.0 | Custom_RAG_context |
| 4 | What do the parameters for HNSW mean? | [In order to improve performance, HNSW limits ... | - M defines the maximum number of outgoing con... | * M: maximum degree, or number of connections ... | 1.0 | 1.0 | 1.0 | simple_context |
| 5 | What are good default values for HNSW paramete... | [Why Milvus Docs Tutorials Tools Blog Communit... | - M = 32 - efConstruction = 100 | M=16, efConstruction=32, ef=32 | 0.0 | 1.0 | 0.0 | simple_context |
| 6 | What does nlist vs nprobe mean in ivf_flat? | [`nlist` in IVF-Flat represents the number of ... | - nlist in IVF_FLAT refers to the number of cl... | # nlist: controls how the vector data is part... | 0.5 | 1.0 | 0.666667 | simple_context |
| 7 | What is the default AUTOINDEX index and vector... | [Index parameters Index parameters dictate how... | The default AUTOINDEX index in Milvus is IVF_S... | Index type = HNSW and distance metric=IP Inner... | 0.0 | 1.0 | 0.0 | simple_context |
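
The per-group averages printed above (0.67 and 0.42) can be reproduced from the `context_recall` and `context_precision` columns of this table. The sketch below assumes `context_f1` is the harmonic mean of recall and precision, which is consistent with every row shown (for example, 0.5 and 1.0 give 0.666667); the notebook's own scoring code lives outside this cell, so treat the helper names `f1` and `average_f1_by_group` as illustrative only.

```python
# Rows copied from the evaluation table:
# (context_recall, context_precision, evaluated).
rows = [
    (1.0, 1.0, "Custom_RAG_context"),
    (1.0, 1.0, "Custom_RAG_context"),
    (0.5, 1.0, "Custom_RAG_context"),
    (0.0, 0.0, "Custom_RAG_context"),
    (1.0, 1.0, "simple_context"),
    (0.0, 1.0, "simple_context"),
    (0.5, 1.0, "simple_context"),
    (0.0, 1.0, "simple_context"),
]


def f1(recall, precision):
    """Harmonic mean of recall and precision (assumed definition of context_f1)."""
    if recall + precision == 0:
        return 0.0  # Avoid dividing by zero when both scores are 0.
    return 2 * recall * precision / (recall + precision)


def average_f1_by_group(rows):
    """Average the per-question F1 for each chunking strategy."""
    per_group = {}
    for recall, precision, group in rows:
        per_group.setdefault(group, []).append(f1(recall, precision))
    return {g: round(sum(v) / len(v), 2) for g, v in per_group.items()}


avg = average_f1_by_group(rows)
print(avg)
```

A single question with zero recall pulls its F1 to zero even when precision is 1.0, which is why the simple chunking average falls well below its precision alone.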


In [49]:
####################################################
# Avg context F1, HTML chunking (Custom_RAG_context) score = 0.67 (60% improvement)
# Avg context F1, simple chunking (simple_context) score = 0.42
####################################################

####################################################
# Avg mistralai mixtral_8x7b_instruct score = 0.7031 (6% improvement over gpt-3.5-turbo)
# Avg llama3_70b_anyscale_chat score = 0.6888
# Avg llama3_70b_groq_instruct score = 0.6867
# Avg llama_3_70b_octoai_instruct score = 0.6863
# Avg llama_3_8b_ollama_instruct score = 0.6783
# Avg openai gpt-3.5-turbo score = 0.665 
####################################################
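
The "6% improvement" figure in the LLM summary follows from the same best-versus-worst calculation used in the evaluation cell. A minimal sketch, using only the averages recorded in the comments above; the helper name `percent_improvement` is hypothetical and not part of the notebook.

```python
# Average scores copied from the summary comments (one single-key dict per LLM,
# mirroring the shape of `scores` in the evaluation cell).
llm_scores = [
    {"mistralai mixtral_8x7b_instruct": 0.7031},
    {"llama3_70b_anyscale_chat": 0.6888},
    {"llama3_70b_groq_instruct": 0.6867},
    {"llama_3_70b_octoai_instruct": 0.6863},
    {"llama_3_8b_ollama_instruct": 0.6783},
    {"openai gpt-3.5-turbo": 0.665},
]


def percent_improvement(best, baseline):
    """Relative gain of `best` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (best - baseline) / baseline * 100


# Sort from highest to lowest score, as the evaluation cell does.
ranked = sorted(llm_scores, key=lambda item: sum(item.values()), reverse=True)
best_llm, best_score = next(iter(ranked[0].items()))
worst_llm, worst_score = next(iter(ranked[-1].items()))

gain = percent_improvement(best_score, worst_score)
print(f"{best_llm} {round(gain)}% improvement over {worst_llm}.")
```

The guard on a zero baseline matters in practice: a strategy that scores 0 on every question would otherwise raise a `ZeroDivisionError` in the comparison.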

In [50]:
# Drop collection
# utility.drop_collection(COLLECTION_NAME)
mc.drop_collection(COLLECTION_NAME)

In [51]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p unstructured,lxml,torch,pymilvus,langchain,ollama,octoai,groq,openai --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.2

unstructured: 0.14.4
lxml        : 5.1.0
torch       : 2.3.0
pymilvus    : 2.4.4
langchain   : 0.2.2
ollama      : 0.1.8
octoai      : 1.0.2
groq        : 0.8.0
openai      : 1.35.0

conda environment: py311-unum
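
Outside IPython, where the `%watermark` magic is unavailable, a similar version listing can be produced with the standard library. This is a rough sketch rather than a replacement for watermark; the helper name `installed_versions` is hypothetical, and packages that are not installed are reported as such instead of raising.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Same package list as the watermark call above.
packages = ["unstructured", "lxml", "torch", "pymilvus", "langchain",
            "ollama", "octoai", "groq", "openai"]

print(f"Python version: {platform.python_version()}")
for name, ver in installed_versions(packages).items():
    print(f"{name:<12}: {ver}")
```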

