<p style="font-size:18px; color:#3F51B5">Import Modules</p>

In [27]:
import pymupdf4llm
import chromadb
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI
import re
from rapidfuzz import process, fuzz 
import pandas as pd
from pathlib import Path
import csv
import json
import os
from dotenv import load_dotenv

load_dotenv()

True

<p style="font-size:18px; color:#3F51B5">Define document source directory</p>

This is where I stored the municipal asset management plan PDF files, which are parsed into multiple segments of text. I refer to these text segments as documents.

In [28]:
pdffolder = Path("./sourcePDF/") 

<p style="font-size:18px; color:#3F51B5">Metadata Functions</p>

I created functions to add metadata to each document to help relate the document back to the municipal government structure.

In [30]:
# Function to look up metadata from source file
def lookupMetaData(municipality):
    # Load the metadata CSV file
    df = pd.read_csv('./sourceData/Municipality_MetaData.csv')
    name_list = df['Name'].dropna().tolist()

    # Find the best match using fuzzy string matching (scores range from 0 to 100)
    best_match, score, index = process.extractOne(
        municipality, name_list, scorer=fuzz.token_sort_ratio
    )
    if score < 50:
        # Raise instead of returning a string: createMetaData indexes the result like a dict
        raise ValueError(
            f"No good match found for '{municipality}' (best guess: '{best_match}', score: {score})"
        )
    matched_row = df[df['Name'] == best_match].iloc[0]
    return matched_row.to_dict()


# Function to create metadata from lookup in source file
def createMetaData(fname):
    # The regex expects the file name to contain "<digits>_<word>", e.g. a year and a municipality
    match = re.search(r'(\d+)_(\w+)', fname)
    if match is None:
        raise ValueError(f"Could not parse year and municipality from '{fname}'")
    year, municipality = match.groups()
    metadata = lookupMetaData(municipality)
    docData = {
        "File Name": fname,
        "Name": metadata["Name"],
        "year": year,
        "Municipal status": metadata["Municipal status"],
        "Geographic Area": metadata["Geographic Area"],
        "Upper Tier": metadata["Upper Tier"],
        "website": metadata["website"],
    }
    return docData
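
To make the file-name convention concrete, here is a minimal sketch of the parsing step on its own. It assumes the PDFs are named `<year>_<municipality>.pdf`; the helper `parse_plan_filename` and the sample path are hypothetical and only illustrate what the regex in `createMetaData` captures.

```python
import re
from pathlib import Path


def parse_plan_filename(fname):
    """Split a name like '2024_Waterloo.pdf' into (year, municipality).

    Assumption: files are named '<year>_<municipality>.pdf'.
    """
    stem = Path(fname).stem  # drop the folder and the '.pdf' extension
    match = re.fullmatch(r"(\d+)_(\w+)", stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {fname!r}")
    year, municipality = match.groups()
    return year, municipality


# Hypothetical path, used only for illustration
print(parse_plan_filename("sourcePDF/2024_Waterloo.pdf"))
```

Failing loudly on an unexpected name makes a misnamed file visible at load time rather than later, when the metadata is used.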

<p style="font-size:18px; color:#3F51B5">Initialize Chromadb Client and Functions to Create Collections</p>

I abstracted the creation of a chromadb collection and the addition of documents to that collection. This makes it easy to create a new collection for each embedding model and compare the results.


In [54]:
chromaClient = chromadb.PersistentClient(path="./chromaDB")

# Function to create a Chroma Collection 
def initializeChromaCollection(collection_name):
    try:
        collection = chromaClient.get_collection(name=collection_name)
        print(f"Collection {collection_name} found with {collection.count()} documents")
    except Exception as e:
        print(f"{e} Creating new collection...")
        collection = chromaClient.create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        print(f"Created collection {collection_name} with {collection.count()} documents")

# Function to add documents to a Chroma Collection
def addDocumentsToChromaCollection(collection_name, documents, embedding_function):
    initializeChromaCollection(collection_name)
    collection = chromaClient.get_collection(name=collection_name)
    added = 0
    for doc in documents:
        text = doc.text
        doc_id = doc.doc_id  # avoid shadowing the built-in id()
        metadata = doc.metadata
        embedding = embedding_function(text)  # vector for this text segment

        collection.add(
            documents=text,
            embeddings=embedding,
            ids=doc_id,
            metadatas=[metadata]
        )
        added += 1
    print(f"There were {added} documents added to {collection_name} collection.")
    print(f"The number of documents in {collection_name} is {collection.count()}")
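
Adding one segment per call is simple, but it means one `collection.add` call per document. Chroma's `collection.add` also accepts lists, so a batched variant is possible and may be faster. The sketch below only builds the batched keyword arguments; `batch_add_kwargs` and the batch size are hypothetical, and it assumes each document exposes `.text`, `.doc_id`, and `.metadata` as in the loop above.

```python
from types import SimpleNamespace


def batch_add_kwargs(documents, embedding_function, batch_size=100):
    """Yield keyword arguments for collection.add, one dict per batch."""
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        yield {
            "ids": [doc.doc_id for doc in batch],
            "documents": [doc.text for doc in batch],
            "metadatas": [doc.metadata for doc in batch],
            # list(...) turns array-like embeddings into plain Python lists
            "embeddings": [list(embedding_function(doc.text)) for doc in batch],
        }


# Stand-in documents and a toy embedding function, for illustration only
docs = [SimpleNamespace(doc_id=str(i), text=f"segment {i}", metadata={"year": "2024"}) for i in range(5)]
toy_embed = lambda text: [float(len(text)), 1.0]

batches = list(batch_add_kwargs(docs, toy_embed, batch_size=2))
print([len(b["ids"]) for b in batches])  # three batches: 2, 2 and 1 items

# With a real collection, each batch would be passed on like this:
# for kwargs in batch_add_kwargs(documents, embedding_function):
#     collection.add(**kwargs)
```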

<p style="font-size:18px; color:#3F51B5">Define Function To Compare Expected and Received Response</p>

The test model uses cosine similarity to measure how close the expected and received responses are. The score ranges from -1 (opposite) to 1 (exactly the same), with 0 indicating no similarity.

In [32]:
def embedding_similarity_score(expected, response, embedding_function):
    expected_emb = embedding_function(expected)
    response_emb = embedding_function(response)
    return util.cos_sim(expected_emb, response_emb).item()
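
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version on toy vectors; `cosine_similarity` is a hypothetical helper for illustration, not a replacement for `util.cos_sim`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

Because only the direction matters, scaling a vector does not change the score, which is why `[1, 2]` and `[2, 4]` count as identical.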

<p style="font-size:18px; color:#3F51B5">OpenAI: LLM Client and API Call</p>

The test model passes the retrieved context to the LLM to generate a response. It evaluates both the context retrieved from the vector database on its own and the response the LLM generates with the help of that context.

In general, the LLM response scores higher than the documents retrieved from the vector database alone.

In [87]:
# Create OpenAI Client
openAIClient = OpenAI(api_key=os.environ['OPENAI_API_KEY'], organization=os.environ['ORGANIZATION'], project=os.environ['PROJECT'])

def generate_response(question, context, chat_id=None, chat_mgr=None):
    # Build base message list: system context first, user question last
    messages = [{"role": "system", "content": context}]

    # Add history if enabled
    if chat_mgr and chat_id:
        history = chat_mgr.get_history(chat_id)
        messages.append({"role": "system", "content": str(history)})
        chat_mgr.add_message(chat_id, "user", question)

    messages.append({"role": "user", "content": question})

    # Call OpenAI
    response = openAIClient.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True)

    # Collect response
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content

    # Record the model's reply under the "assistant" role and summarize if needed
    if chat_mgr and chat_id:
        chat_mgr.add_message(chat_id, "assistant", full_response)
        chat_mgr.summarize_history(chat_id)

    return full_response
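
A common convention with chat-style APIs is to put the system message first and the current user question last, with earlier turns in between. The sketch below shows that ordering in isolation. `build_messages` is a hypothetical helper and the sample messages are made up; unlike `generate_response`, it replays history as individual messages rather than one stringified system message.

```python
def build_messages(question, context, history=None):
    """Assemble chat messages: system context, optional history, then the user question."""
    messages = [{"role": "system", "content": context}]
    if history:
        # Earlier turns are replayed as individual messages, oldest first
        messages.extend(history)
    messages.append({"role": "user", "content": question})
    return messages


# Made-up history, for illustration only
history = [
    {"role": "user", "content": "Which plan are we discussing?"},
    {"role": "assistant", "content": "The City of Waterloo asset management plan."},
]
messages = build_messages("What is the road PQI target?", "Retrieved context goes here.", history)
print([m["role"] for m in messages])
```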

<p style="font-size:18px; color:#3F51B5">Chat History Manager</p>

The chat manager class manages chat history and passes it to the LLM as context when available.

Chat history isn't shown in this version of the notebook, but I did run some tests on a smaller vector database and was surprised to find that adding chat history made the results worse!

My explanation is that the test questions are unrelated to one another, so the chat history adds unrelated context that dilutes the responses.

In [85]:
class ChatHistoryManager:
    def __init__(self):
        self.histories = {}  # key = chat_id, value = list of messages
        self.save_file = "chat_histories.json"

    def add_message(self, chat_id, role, content):
        if chat_id not in self.histories:
            self.histories[chat_id] = []
        self.histories[chat_id].append({"role": role, "content": content})

    def get_history(self, chat_id):
        return self.histories.get(chat_id, [])

    def summarize_history(self, chat_id, max_history=4):
        history = self.get_history(chat_id)
        if len(history) > max_history:
            summarized_text = "\n".join([f"{msg['role']}: {msg['content']}" for msg in history[:max_history]])
            context = "You are creating a succinct summary of a past conversation for reference."
            summary = generate_response(question=summarized_text, context=context)
            self.histories[chat_id] = [{"role": "system", "content": summary}] + history[max_history:]

    def save_histories(self):
        with open(self.save_file, "w") as f:
            json.dump(self.histories, f, indent=2)
        print("save success")

    def load_histories(self):
        try:
            with open(self.save_file, "r") as f:
                self.histories = json.load(f)
        except FileNotFoundError:
            self.histories = {}
        print("load success")
        
    def print_history(self, chat_id):
        for msg in self.get_history(chat_id):
            print(f"{msg['role']}: {msg['content']}")
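
The summarization step can be tried without calling the API by passing the summarizer in as a function. This is a minimal sketch of the same idea as `summarize_history`; `compact_history` and `fake_summarize` are hypothetical names, and the stub stands in for the LLM call.

```python
def compact_history(history, summarize, max_history=4):
    """Replace the oldest max_history messages with a single summary message."""
    if len(history) <= max_history:
        return history
    oldest, recent = history[:max_history], history[max_history:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in oldest)
    return [{"role": "system", "content": summarize(transcript)}] + recent


# Stub summarizer so the example runs without an API call
def fake_summarize(text):
    return f"summary of {len(text.splitlines())} messages"


history = [{"role": "user", "content": f"message {i}"} for i in range(6)]
compacted = compact_history(history, fake_summarize)
print(len(compacted))           # 3: one summary plus the two newest messages
print(compacted[0]["content"])  # summary of 4 messages
```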

<p style="font-size:18px; color:#3F51B5">Test Model</p>

This class manages the test runs for each model.  It has the ability to print summarized results and save the full results to a file.

NOTE: result['documents'][0] is a list holding the n_results segments retrieved for the query, so len(result['documents'][0]) equals n_results (3 in the run this note describes; the TestRunner default is 5). Converting that list to a string lets you evaluate the segments together, so there is no need to loop through them and combine the results.
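
To illustrate the shape described in the note, the sketch below uses a mock dictionary with the same nested-list layout as a Chroma query result: the outer list has one entry per query embedding and the inner list one entry per retrieved segment. The segment texts and file names are made up.

```python
# Mock of the nested-list shape returned by a Chroma query for ONE query embedding
result = {
    "documents": [["segment A", "segment B", "segment C"]],
    "metadatas": [[{"File Name": "a.pdf"}, {"File Name": "a.pdf"}, {"File Name": "b.pdf"}]],
}

segments = result["documents"][0]                   # retrieved segments for the first query
top_file = result["metadatas"][0][0]["File Name"]   # metadata of the best match

as_repr = str(segments)            # keeps the list brackets and quotes
as_text = "\n\n".join(segments)    # plain text, one segment per paragraph

print(len(segments))
print(as_repr)
print(as_text)
```

`str(...)` is the quickest way to combine the segments; joining them is an alternative when the brackets and quotes should not end up in the text that gets embedded or sent to the LLM.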

In [68]:
class TestRunner:
    def __init__(self, collection_name, embedding_function, chat_id=None, chat_mgr=None, n_results=5):
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        self.chat_id=chat_id
        self.chat_mgr = chat_mgr
        self.test_results = []
        self.count = 0
        self.sum_vec = 0
        self.sum_file = 0
        self.sum_llm = 0
        self.n_results = n_results

    def run_test_case(self, case):
        self.count += 1
        query = case["Query"]
        expected = case["Expected Result"]
        source_file = case["Source File"]

        print(f"Running query: {query}")
        collection = chromaClient.get_collection(name=self.collection_name)
        vec = self.embedding_function(query)
        result = collection.query(query_embeddings=[vec], n_results=self.n_results)

        vec_response = str(result['documents'][0])
        file_name = str(result['metadatas'][0][0]['File Name'])

        llm_response = generate_response(query, vec_response, chat_id=self.chat_id, chat_mgr=self.chat_mgr)

        vec_score = embedding_similarity_score(expected, vec_response, self.embedding_function)
        file_score = embedding_similarity_score(source_file, file_name, self.embedding_function)
        llm_score = embedding_similarity_score(expected, llm_response, self.embedding_function)

        self.sum_vec += vec_score
        self.sum_file += file_score
        self.sum_llm += llm_score

        self.test_results.append({
            "Query": query,
            "Expected": expected,
            "Response": vec_response,
            "Embedding Similarity": vec_score,
            "Source File Similarity": file_score,
            "LLM Response": llm_response,
            "LLM Embedding Similarity": llm_score
        })

    def run_all(self, test_cases):
        for case in test_cases:
            self.run_test_case(case)

    def save_results(self, fname):
        with open(fname, "w", newline='') as f:
            writer = csv.DictWriter(f, fieldnames=self.test_results[0].keys())
            writer.writeheader()
            for r in self.test_results:
                writer.writerow(r)
        print(f"Results saved to {fname}")

    def summarize(self):
        print(f"Total cases: {self.count}")
        print(f"Avg Embedding Similarity: {self.sum_vec / self.count:.3f}")
        print(f"Avg Source File Similarity: {self.sum_file / self.count:.3f}")
        print(f"Avg LLM Similarity: {self.sum_llm / self.count:.3f}")
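
The Source File Similarity score embeds two file names and compares them, so two different file names may still receive a fairly high score. A stricter alternative is an exact-match hit rate. The sketch below is one possible variant, not what the notebook computes; `file_hit_rate`, the row keys, and the sample rows are hypothetical.

```python
def file_hit_rate(rows):
    """Fraction of test cases whose top retrieved file equals the expected file."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row["expected_file"] == row["retrieved_file"])
    return hits / len(rows)


# Hypothetical rows, for illustration only
rows = [
    {"expected_file": "2024_Waterloo.pdf", "retrieved_file": "2024_Waterloo.pdf"},
    {"expected_file": "2024_Waterloo.pdf", "retrieved_file": "2022_Kitchener.pdf"},
    {"expected_file": "2024_Waterloo.pdf", "retrieved_file": "2024_Waterloo.pdf"},
    {"expected_file": "2024_Waterloo.pdf", "retrieved_file": "2024_Waterloo.pdf"},
]
print(file_hit_rate(rows))  # 0.75
```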

<p style="font-size:18px; color:#3F51B5">Create Llama Documents from the Source PDF Files</p>

Llama documents are a convenient construct for use with vector databases. The reader extracts text with the document structure in mind, such as section headings and paragraphs, which keeps related text together. It also automatically generates and attaches metadata such as document name, page number, and keywords.

Unfortunately, you still need a combination of PDF extraction tools if you want to extract images and tables in a more embedding-friendly format.

This is likely where I could most easily improve my results.

In [39]:
llama_reader = pymupdf4llm.LlamaMarkdownReader()
all_llama_docs = []
remove_list = ['format', 'author', 'creator', 'producer', 'creationDate', 'modDate', 'trapped', 'encryption', 'file_path']

for pdf in pdffolder.iterdir():
    fname = str(pdf)
    if fname.lower().endswith(".pdf"):
        docData = createMetaData(fname)

        llama_docs = llama_reader.load_data(pdf)
        for doc in llama_docs:
            # Drop unneeded PDF properties and empty values, then attach the municipal metadata
            doc.metadata = {k: v for k, v in doc.metadata.items() if k not in remove_list and v is not None}
            doc.metadata.update(docData)

        #print(doc.metadata)
        all_llama_docs.extend(llama_docs)
    
print(f"Created {len(all_llama_docs)} documents (text segments) from the source PDFs.")

Successfully imported LlamaIndex
Created 2925 documents (text segments) from the source PDFs.
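
The metadata clean-up in the loop above can be tried on a plain dictionary. This is a small sketch with made-up values; `clean_metadata` is a hypothetical helper and the `remove_list` here is shortened for the example.

```python
remove_list = ["format", "author", "creator", "file_path"]  # shortened list for the example


def clean_metadata(raw, extra, remove_list):
    """Drop unwanted or empty keys from raw, then overlay the extra fields."""
    cleaned = {k: v for k, v in raw.items() if k not in remove_list and v is not None}
    cleaned.update(extra)
    return cleaned


# Hypothetical values, for illustration only
raw = {"format": "PDF 1.7", "author": None, "page": 3, "title": "Asset Management Plan", "subject": None}
extra = {"Name": "Waterloo", "year": "2024"}
print(clean_metadata(raw, extra, remove_list))
```

Keys in the remove list and keys with a `None` value are dropped, and the municipal fields are added last so they win if a key appears in both dictionaries.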


<p style="font-size:18px; color:#3F51B5">Load Test Cases</p>

I created a number of questions and expected responses from the source PDF documents. The test model applies the embedding model to the questions and expected responses, then compares the embedding of the expected response with the embedding of the retrieved response using the embedding similarity score.

In [65]:
#test_case_file = "TestCases.csv"
test_case_file = "RAG_tests.csv"
test_cases = []


with open(test_case_file, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        test_cases.append(row)
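
The test runner reads the `Query`, `Expected Result`, and `Source File` columns from each row, so it can help to check for them when loading. The sketch below does that on an in-memory file; `load_test_cases` is a hypothetical helper and the sample row is made up, since the real `RAG_tests.csv` is not shown here.

```python
import csv
import io

REQUIRED_COLUMNS = {"Query", "Expected Result", "Source File"}


def load_test_cases(csvfile):
    """Read test cases from an open CSV file and check the expected columns exist."""
    reader = csv.DictReader(csvfile)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return list(reader)


# Made-up one-row file, for illustration only
sample = io.StringIO(
    "Query,Expected Result,Source File\n"
    "What is the road PQI target?,An example answer,2024_Waterloo.pdf\n"
)
cases = load_test_cases(sample)
print(cases[0]["Query"])
```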

<p style="font-size:18px; color:#3F51B5;">Model Comparison</p>

I compared retrievals using four different embedding models:
<ol style="line-height: 1.5;">
<li>Sentence Transformer all-MiniLM-L6-v2 Embedding Model</li>
<li>HuggingFace Sentence Transformers all-mpnet-base-v2</li>
<li>OpenAI text-embedding-3-small</li>
<li>OpenAI text-embedding-3-large</li>
</ol>
<br>

For each test case, I followed a 3-step process:
<ol style="line-height: 1.5;">
<li style="font-weight: bold; color: #555;">Load the Embedding Model and Set the Embedding Function:</li> 
    <ul><li>For best results, the generally accepted practice is to use the same model to embed the document segments in the vector database, the question, the expected results, the retrieved context, and the LLM response.</li>
    <li>Load the model once in advance so that it is not reloaded every time the embedding function is called.</li>
    <li>Each model library exposes a different call for creating an embedding.</li>
</ul>
<li style="font-weight: bold; color: #555;">Create Chroma Database Collection</li>
    <ul><li>I created a chroma collection for each embedding model in the same database.  This enables you to easily query the databases and compare the embedding models.</li></ul>
<li style="font-weight: bold; color: #555;">Run Test Model</li>
    <ul><li>I then use the test cases to evaluate and compare the performance of the embedding models.</li></ul>
</ol>
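
Since each library uses a different method name, one way to keep the rest of the pipeline uniform is to wrap every model in a function that takes text and returns an embedding. The sketch below uses toy stand-in classes; `make_embedder` is a hypothetical helper, and the method names `encode` and `get_text_embedding` mirror the calls used in the test cases below.

```python
def make_embedder(model, method_name):
    """Return a function text -> embedding that is bound to this specific model."""
    method = getattr(model, method_name)

    def embed(text):
        return method(text)

    return embed


# Toy stand-ins for two libraries with different method names (illustration only)
class ToyEncoderA:
    def encode(self, text):
        return [float(len(text))]


class ToyEncoderB:
    def get_text_embedding(self, text):
        return [float(text.count(" ") + 1)]


embed_a = make_embedder(ToyEncoderA(), "encode")
embed_b = make_embedder(ToyEncoderB(), "get_text_embedding")
print(embed_a("asset plan"), embed_b("asset plan"))
```

Binding the model inside the returned function means the model is loaded once, and each embedder keeps using its own model even if a shared variable name is later reassigned in another cell.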


<p style="font-size:18px; color:#3F51B5">TEST CASE 1: Sentence Transformer all-MiniLM-L6-v2 Embedding Model</p>

<p style="font-size:14px; color:#3F51B5">1: Load Embedding Model and Set Function</p>

In [None]:
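# A model-specific variable name keeps this function tied to MiniLM even if other cells load different models
minilm_model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_function(text):
    return minilm_model.encode(text)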

The number of documents in all-MiniLM-L6-v2: 0


<p style="font-size:14px; color:#3F51B5">2: Create Collection</p>

In [73]:
collection_name = "all-MiniLM-L6-v2"
documents = all_llama_docs
addDocumentsToChromaCollection(collection_name, documents, embedding_function)

collection = chromaClient.get_collection(name=collection_name)
print(f"The number of documents in {collection_name}: {collection.count()}")

Collection all-MiniLM-L6-v2 found with 0 documents
There were 2925 documents added to all-MiniLM-L6-v2 collection.
The number of documents in all-MiniLM-L6-v2: 2925


<p style="font-size:14px; color:#3F51B5">3: Run Test Model</p>

In [74]:
collection_name = "all-MiniLM-L6-v2"

st_runner = TestRunner(collection_name=collection_name, embedding_function=embedding_function)
st_runner.run_all(test_cases)
st_runner.summarize()
st_runner.save_results(fname="results/all-MiniLM-L6-v2")


Running query: What is the estimated total value of the City of Waterloo's infrastructure?
Running query: Which regulation does the City of Waterloo's Asset Management Plan comply with?
Running query: Who contributed to the development of the Asset Management Plan?
Running query: What is the forecasted decline in performance of tax-base funded assets over 25 years?
Running query: What is the primary funding source mentioned for infrastructure renewal?
Running query: What is the Waterloo Decision Support System (DSS)?
Running query: What is the city's current target for overall road PQI?
Running query: What is the estimated cost of replacing the city's transportation assets?
Running query: What funding increase was approved by the council for 2024-2026?
Running query: What are some of the key asset classes identified in the plan?
Total cases: 10
Avg Embedding Similarity: 0.290
Avg Source File Similarity: 0.719
Avg LLM Similarity: 0.483
Results saved to all-MiniLM-L6-v2


<p style="font-size:18px; color:#3F51B5">TEST CASE 2: HuggingFace Sentence Transformers all-mpnet-base-v2</p>

<p style="font-size:14px; color:#3F51B5">1: Load Embedding Model and Set Function</p>

In [78]:
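# A model-specific variable name keeps this function tied to mpnet even if other cells load different models
mpnet_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2")

def embedding_function_mpnet(text):
    return mpnet_model.get_text_embedding(text)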


<p style="font-size:14px; color:#3F51B5">2: Create Collection</p>

In [56]:
collection_name = "all-mpnet-base-v2"
documents = all_llama_docs
addDocumentsToChromaCollection(collection_name, documents, embedding_function_mpnet)

Collection all-mpnet-base-v2 found with 0 documents
There were 2925 documents added to all-mpnet-base-v2 collection.


In [76]:
collection_name = "all-mpnet-base-v2"
collection = chromaClient.get_collection(name=collection_name)
print(f"The number of documents in {collection_name }: {collection.count()}")

The number of documents in all-mpnet-base-v2: 2925


<p style="font-size:14px; color:#3F51B5">3: Run Test Model</p>

In [79]:
collection_name = "all-mpnet-base-v2"

mpnet_runner = TestRunner(collection_name=collection_name, embedding_function=embedding_function_mpnet)
mpnet_runner.run_all(test_cases)
mpnet_runner.summarize()
mpnet_runner.save_results(fname="results/all-mpnet-base-v2")

Running query: What is the estimated total value of the City of Waterloo's infrastructure?
Running query: Which regulation does the City of Waterloo's Asset Management Plan comply with?
Running query: Who contributed to the development of the Asset Management Plan?
Running query: What is the forecasted decline in performance of tax-base funded assets over 25 years?
Running query: What is the primary funding source mentioned for infrastructure renewal?
Running query: What is the Waterloo Decision Support System (DSS)?
Running query: What is the city's current target for overall road PQI?
Running query: What is the estimated cost of replacing the city's transportation assets?
Running query: What funding increase was approved by the council for 2024-2026?
Running query: What are some of the key asset classes identified in the plan?
Total cases: 10
Avg Embedding Similarity: 0.480
Avg Source File Similarity: 0.745
Avg LLM Similarity: 0.553
Results saved to results/all-mpnet-base-v2


<p style="font-size:18px; color:#3F51B5">TEST CASE 3: OpenAI text-embedding-3-small</p>

<p style="font-size:14px; color:#3F51B5">1: Set Embedding Model and Function</p>

In [None]:
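# A model-specific constant keeps this function on the small model even after other cells run
OPENAI_SMALL_MODEL = "text-embedding-3-small"

def embedding_function_openai_sm(text):
    embedding = openAIClient.embeddings.create(input=text, model=OPENAI_SMALL_MODEL)
    return embedding.data[0].embedding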

<p style="font-size:14px; color:#3F51B5">2: Create Collection</p>

In [94]:
collection_name = "openai-text-embedding-3-small"
documents = all_llama_docs
addDocumentsToChromaCollection(collection_name, documents, embedding_function_openai_sm)

Collection openai-text-embedding-3-small found with 0 documents
There were 2925 documents added to openai-text-embedding-3-small collection.


<p style="font-size:14px; color:#3F51B5">3: Run Test Model</p>

In [95]:
collection_name = "openai-text-embedding-3-small"

openai_sm_runner = TestRunner(collection_name=collection_name, embedding_function=embedding_function_openai_sm)
openai_sm_runner.run_all(test_cases)
openai_sm_runner.summarize()
openai_sm_runner.save_results(fname="results/openai-text-embedding-3-small")

Running query: What is the estimated total value of the City of Waterloo's infrastructure?
Running query: Which regulation does the City of Waterloo's Asset Management Plan comply with?
Running query: Who contributed to the development of the Asset Management Plan?
Running query: What is the forecasted decline in performance of tax-base funded assets over 25 years?
Running query: What is the primary funding source mentioned for infrastructure renewal?
Running query: What is the Waterloo Decision Support System (DSS)?
Running query: What is the city's current target for overall road PQI?
Running query: What is the estimated cost of replacing the city's transportation assets?
Running query: What funding increase was approved by the council for 2024-2026?
Running query: What are some of the key asset classes identified in the plan?
Total cases: 10
Avg Embedding Similarity: 0.428
Avg Source File Similarity: 0.758
Avg LLM Similarity: 0.530
Results saved to results/openai-text-embedding-3-sm

<p style="font-size:18px; color:#3F51B5">TEST CASE 4: OpenAI text-embedding-3-large</p>

<p style="font-size:14px; color:#3F51B5">1: Set Embedding Model and Function</p>

In [98]:
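# A model-specific constant keeps this function on the large model even after other cells run
OPENAI_LARGE_MODEL = "text-embedding-3-large"

def embedding_function_openai_lg(text):
    embedding = openAIClient.embeddings.create(input=text, model=OPENAI_LARGE_MODEL)
    return embedding.data[0].embedding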


<p style="font-size:14px; color:#3F51B5">2: Create Collection</p>

In [99]:
collection_name = "openai-text-embedding-3-large"
documents = all_llama_docs
addDocumentsToChromaCollection(collection_name, documents, embedding_function_openai_lg)

Collection [openai-text-embedding-3-large] does not exists Creating new collection...
Created collection openai-text-embedding-3-large with 0 documents
There were 2925 documents added to openai-text-embedding-3-large collection.


<p style="font-size:14px; color:#3F51B5">3: Run Test Model</p>

In [100]:
collection_name = "openai-text-embedding-3-large"

openai_lg_runner = TestRunner(collection_name=collection_name, embedding_function=embedding_function_openai_lg)
openai_lg_runner.run_all(test_cases)
openai_lg_runner.summarize()
openai_lg_runner.save_results(fname="results/openai-text-embedding-3-large")

Running query: What is the estimated total value of the City of Waterloo's infrastructure?
Running query: Which regulation does the City of Waterloo's Asset Management Plan comply with?
Running query: Who contributed to the development of the Asset Management Plan?
Running query: What is the forecasted decline in performance of tax-base funded assets over 25 years?
Running query: What is the primary funding source mentioned for infrastructure renewal?
Running query: What is the Waterloo Decision Support System (DSS)?
Running query: What is the city's current target for overall road PQI?
Running query: What is the estimated cost of replacing the city's transportation assets?
Running query: What funding increase was approved by the council for 2024-2026?
Running query: What are some of the key asset classes identified in the plan?
Total cases: 10
Avg Embedding Similarity: 0.432
Avg Source File Similarity: 0.769
Avg LLM Similarity: 0.524
Results saved to results/openai-text-embedding-3-la
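
To compare the four runs side by side, the averages printed in the summaries above can be collected into one table. The sketch below copies those numbers by hand; the dictionary layout and metric keys are my own naming. With only 10 test cases, small differences between models should be read with caution.

```python
# Averages copied from the four run summaries in this notebook
results = {
    "all-MiniLM-L6-v2":       {"embedding": 0.290, "source_file": 0.719, "llm": 0.483},
    "all-mpnet-base-v2":      {"embedding": 0.480, "source_file": 0.745, "llm": 0.553},
    "text-embedding-3-small": {"embedding": 0.428, "source_file": 0.758, "llm": 0.530},
    "text-embedding-3-large": {"embedding": 0.432, "source_file": 0.769, "llm": 0.524},
}

print(f"{'Model':<26}{'Embedding':>10}{'Source file':>13}{'LLM':>8}")
for model, scores in results.items():
    print(f"{model:<26}{scores['embedding']:>10.3f}{scores['source_file']:>13.3f}{scores['llm']:>8.3f}")

# Best model per metric
best = {
    metric: max(results, key=lambda m: results[m][metric])
    for metric in ("embedding", "source_file", "llm")
}
print(best)
```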