# 🏗️ Activity #1: Enhanced RAG App

Enhance your RAG application in some way! 

Suggestions are: 

- Allow it to work with PDF files
- Implement a new distance metric
- Add metadata support to the vector database

While these are suggestions, you should feel free to make whatever augmentations you desire! 

> NOTE: These additions might require you to work within the `aimakerspace` library - that's expected!

> NOTE: If you're not sure where to start - ask Cursor (CMD/CTRL+L) to guide you through the changes!

## ✅ Add PDF Files

**Plan:**

1. Copy the original RAG application, since the goal is to **enhance** the original version.
    - NOTE: I am only including the code steps in this notebook for brevity; see `Pythonic_RAG_Assignment.ipynb` for the detailed walkthrough.
2. Create a new class that loads multiple PDF files so they can be chunked, embedded, and inserted into the `VectorDatabase`.
    - NOTE: This class is added to the `aimakerspace` library. It is best practice in data engineering to keep like file formats together, so all PDFs live together in one directory.
3. Run the standard workflow for the PDFs: collect, chunk, add to the vector database.
    - NOTE: I leave the prompts as-is to assess initial performance with the PDF data addition only.
4. Test retrieval.
5. Assess the results and draw conclusions.

--- 

For simplicity, we will merge some cells and provide quick comments on the steps.

In [1]:
# import dependencies
from aimakerspace.text_utils import TextFileLoader, CharacterTextSplitter
from aimakerspace.vectordatabase import VectorDatabase
import asyncio
import nest_asyncio

nest_asyncio.apply()

In [2]:
# load text document
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
documents = text_loader.load_documents()

# sanity check
print(f"Loaded {len(documents)} document.")
print(f"\nHere's an excerpt: {documents[0][:100]}")

Loaded 1 document.

Here's an excerpt: ﻿
The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horow


In [3]:
# chunk the loaded text
text_splitter = CharacterTextSplitter()
split_documents = text_splitter.split_texts(documents)

# sanity check
print(f"Split into {len(split_documents)} chunks.")
split_documents[0:1]

Split into 373 chunks.


['\ufeff\nThe Pmarca Blog Archives\n(select posts from 2007-2009)\nMarc Andreessen\ncopyright: Andreessen Horowitz\ncover design: Jessica Hagy\nproduced using: Pressbooks\nContents\nTHE PMARCA GUIDE TO STARTUPS\nPart 1: Why not to do a startup 2\nPart 2: When the VCs say "no" 10\nPart 3: "But I don\'t know any VCs!" 18\nPart 4: The only thing that matters 25\nPart 5: The Moby Dick theory of big companies 33\nPart 6: How much funding is too little? Too much? 41\nPart 7: Why a startup\'s initial business plan doesn\'t\nmatter that much\n49\nTHE PMARCA GUIDE TO HIRING\nPart 8: Hiring, managing, promoting, and Dring\nexecutives\n54\nPart 9: How to hire a professional CEO 68\nHow to hire the best people you\'ve ever worked\nwith\n69\nTHE PMARCA GUIDE TO BIG COMPANIES\nPart 1: Turnaround! 82\nPart 2: Retaining great people 86\nTHE PMARCA GUIDE TO CAREER, PRODUCTIVITY,\nAND SOME OTHER THINGS\nIntroduction 97\nPart 1: Opportunity 99\nPart 2: Skills and education 107\nPart 3: Where to go and wh

In [4]:
# input our OpenAI API key to do the API calls
import os
import openai

from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [5]:
# instantiate the vector database and insert embeddings
vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))

In [6]:
# sanity check
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat you can continue to b

In [7]:
# set up for communication with OpenAI chat
from aimakerspace.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
)

from aimakerspace.openai_utils.chatmodel import ChatOpenAI

# instantiate OpenAI client
chat_openai = ChatOpenAI()


In [8]:
RAG_SYSTEM_TEMPLATE = """You are a knowledgeable assistant that answers questions based strictly on provided context.

Instructions:
- Only answer questions using information from the provided context
- If the context doesn't contain relevant information, respond with "I don't know"
- Be accurate and cite specific parts of the context when possible
- Keep responses {response_style} and {response_length}
- Only use the provided context. Do not use external knowledge.
- Only provide answers when you are confident the context supports your response."""

RAG_USER_TEMPLATE = """Context Information:
{context}

Number of relevant sources found: {context_count}
{similarity_scores}

Question: {user_query}

Please provide your answer based solely on the context above."""

rag_system_prompt = SystemRolePrompt(
    RAG_SYSTEM_TEMPLATE,
    strict=True,
    defaults={
        "response_style": "concise",
        "response_length": "brief"
    }
)

rag_user_prompt = UserRolePrompt(
    RAG_USER_TEMPLATE,
    strict=True,
    defaults={
        "context_count": "",
        "similarity_scores": ""
    }
)

In [9]:
# RAG pipeline
class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase,
                 response_style: str = "detailed", include_scores: bool = False) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever
        self.response_style = response_style
        self.include_scores = include_scores

    def run_pipeline(self, user_query: str, k: int = 4, **system_kwargs) -> dict:
        # Retrieve relevant contexts
        context_list = self.vector_db_retriever.search_by_text(user_query, k=k)

        context_prompt = ""
        similarity_scores = []

        for i, (context, score) in enumerate(context_list, 1):
            context_prompt += f"[Source {i}]: {context}\n\n"
            similarity_scores.append(f"Source {i}: {score:.3f}")

        # Create system message with parameters
        system_params = {
            "response_style": self.response_style,
            "response_length": system_kwargs.get("response_length", "detailed")
        }

        formatted_system_prompt = rag_system_prompt.create_message(**system_params)

        user_params = {
            "user_query": user_query,
            "context": context_prompt.strip(),
            "context_count": len(context_list),
            "similarity_scores": f"Relevance scores: {', '.join(similarity_scores)}" if self.include_scores else ""
        }

        formatted_user_prompt = rag_user_prompt.create_message(**user_params)

        return {
            "response": self.llm.run([formatted_system_prompt, formatted_user_prompt]),
            "context": context_list,
            "context_count": len(context_list),
            "similarity_scores": similarity_scores if self.include_scores else None,
            "prompts_used": {
                "system": formatted_system_prompt,
                "user": formatted_user_prompt
            }
        }

In [10]:
# define it
rag_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

In [11]:
# run it with a query
result = rag_pipeline.run_pipeline(
    "What is the 'Michael Eisner Memorial Weak Executive Problem'?",
    k=3,
    response_length="comprehensive",
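    include_warnings=True,     # accepted via **system_kwargs but not used by run_pipeline
    confidence_required=True   # accepted via **system_kwargs but not used by run_pipeline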
)

print(f"Response: {result['response']}")
print(f"\nContext Count: {result['context_count']}")
print(f"Similarity Scores: {result['similarity_scores']}")

Response: The 'Michael Eisner Memorial Weak Executive Problem' refers to the phenomenon where a CEO or startup founder, due to their background in a specific functional area (like product management, sales, or marketing), hires a weak executive for that area on purpose. This often occurs because the CEO has difficulty letting go of the function that initially contributed to their success. As a result, the CEO may opt for a less competent executive in order to maintain their role as the dominant figure in that area. The term is named after Michael Eisner, the former CEO of Disney, who faced challenges when he bought ABC and saw it fall to fourth place, despite believing he could turn it around if he had more time (Source 1).

Context Count: 3
Similarity Scores: ['Source 1: 0.658', 'Source 2: 0.509', 'Source 3: 0.479']


### 🏗️ Activity #1: Enhance RAG Application

**Add PDFs to App**

I found a [GitHub repo](https://github.com/tpn/pdfs) with a bunch of old PDFs. I selected the 10 with the smallest file size to keep things simple, and saved them into the `data/pdfs` directory. In data engineering it's best practice to keep like data types together if we can.

Next, I used `ask` mode with Cursor to create the `PDFFileLoader` class in the `text_utils.py` file. To handle PDF processing, I added a new library, `PyPDF2`, which is listed as a dependency in `pyproject.toml`. I also bumped the project to version `0.2.0`, since the ability to ingest PDFs is a minor enhancement.

Because the assignment is an exercise in augmenting the existing app, we add the PDF embeddings to the same vector database so they are available for retrieval alongside the original text.
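To make the loader idea concrete, here is a minimal sketch of what a directory-based PDF loader could look like. It is not the actual `PDFFileLoader` from my copy of `aimakerspace`; the names `DirectoryPDFLoader` and `extract_pdf_text` are hypothetical, and I am assuming `PyPDF2` version 2 or later, which provides `PdfReader`. The extractor is injectable so the directory logic can be tried without any real PDFs.

```python
from pathlib import Path
from typing import Callable, List


def extract_pdf_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in one PDF using PyPDF2."""
    # Imported lazily so the rest of the sketch works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may return None or "" for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


class DirectoryPDFLoader:
    """Hypothetical stand-in for PDFFileLoader: returns one string per PDF."""

    def __init__(
        self, path: str, extractor: Callable[[Path], str] = extract_pdf_text
    ) -> None:
        self.path = Path(path)
        self.extractor = extractor
        self.documents: List[str] = []

    def load_documents(self) -> List[str]:
        if self.path.is_dir():
            # Sorted for a stable order; the pattern is case-sensitive on Linux.
            pdf_paths = sorted(self.path.glob("*.pdf"))
        elif self.path.suffix.lower() == ".pdf":
            pdf_paths = [self.path]
        else:
            raise ValueError(f"{self.path} is neither a directory nor a .pdf file")
        self.documents = [self.extractor(p) for p in pdf_paths]
        return self.documents
```

Returning one string per file mirrors how `TextFileLoader.load_documents()` is used earlier in this notebook, which is why the same splitter can be reused on the result.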

In [12]:
# import my new class and instantiate it then load the pdfs
from aimakerspace.text_utils import PDFFileLoader

pdf_loader = PDFFileLoader("data/pdfs")  # directory containing PDF files
pdf_documents = pdf_loader.load_documents()

# sanity check
print(f"Loaded {len(pdf_documents)} documents.") # Number of PDFs loaded
print(f"\nHere's an excerpt:\n{pdf_documents[2][:100]}") # Show first hundred chars of pdf 3

Loaded 10 documents.

Here's an excerpt:
COMP 423 lecture 6 Jan. 16, 2008
Codes for the positive integers
There are many situations in which 


In [13]:
# chunk the pdfs
pdf_text_splitter = CharacterTextSplitter()
split_pdfs = pdf_text_splitter.split_texts(pdf_documents)

# how many chunks?
print(f"Split into {len(split_pdfs)} chunks.")

# add chunks to the existing vector_db (this augments it; the original text chunks stay)
vector_db = asyncio.run(vector_db.abuild_from_list(split_pdfs))

Split into 289 chunks.


In [14]:
# sanity check with direct db query
vector_db.search_by_text("What do http status codes starting with 4 mean?", k=3)

[('st time the client has cached it\n305 Use Proxy The resource should be accessed through a speci\x0ced proxy\n307 Temporary Redirect The request should be repeated with the same request method at the given address. Added in HTTP/1.1 to clarify the ambiguity in the behavior of status\n302. See 302 and 303\n4XX Client Error\nStatus Code Description\n400 Bad Request The request can not be ful\x0clled because the request contained bad syntax\n401 Unauthorized The client needs to authenticate in order to access this resource\n402 Payment Required This code is intended to be used for a micropayment system, but the speci\x0ccs for this system are unspeci\x0ced and this code is rarely used\n403 Forbidden The client is not allowed to access this resource. Generally, the client is authenticated and does not have su\x0ecient permission\n404 Not Found The resource was not found, though its existence in the future is possible\n405 Method Not Allowed The method used in the request is not supported

In [15]:
# sanity check
# ensure we are augmenting and did not remove the existing text!
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat you can continue to b

In [16]:
# run enhanced RAG pipeline with a query
result = rag_pipeline.run_pipeline(
    "What do http status codes starting with 4 mean?",
    k=3,
    response_length="comprehensive",
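    include_warnings=True,     # accepted via **system_kwargs but not used by run_pipeline
    confidence_required=True   # accepted via **system_kwargs but not used by run_pipeline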
)

print(f"Response: {result['response']}")
print(f"\nContext Count: {result['context_count']}")
print(f"Similarity Scores: {result['similarity_scores']}")

Response: HTTP status codes starting with 4 indicate a "Client Error." These codes represent situations where the client's request contains an error or the client does not have the required permissions for the resource. The relevant codes from the provided context are:

- **400 Bad Request**: The request cannot be fulfilled because it contained bad syntax.
- **401 Unauthorized**: The client needs to authenticate to access this resource.
- **402 Payment Required**: This code is intended for a micropayment system, but the specifics are unspecified, and it is rarely used.
- **403 Forbidden**: The client is not allowed to access this resource, typically because they are authenticated but do not have sufficient permission.
- **404 Not Found**: The resource was not found, though its existence in the future is possible.
- **405 Method Not Allowed**: The method used in the request is not supported by the resource.

This classification is specifically outlined in Source 1 of the provided contex

## Wrap Up and Learnings

With just a few functions, we have a powerful application that is pulling in useful results and respecting the context provided. The structure and annotations on the original notebook made it easy to understand and extend the functionality.

### 3 Lessons Learned

1. Make informed choices: Choices made at the level of data ingestion and embedding have real effects on the output and should be made with informed knowledge of how your app is to be used. For example: what types of queries do you expect and what are the characteristics of optimal response output?

2. Chunk size: Chunking directly affects the quality, accuracy, and relevance of the app's responses. A good rule of thumb is to match chunk size to the queries you expect: short, specific queries tend to do well with smaller chunks, while more complex queries benefit from larger ones. We want to strike a balance between capturing the meaning of individual sentences and that of whole paragraphs. In our application, chunk size is left at the `CharacterTextSplitter` default, which counts characters rather than tokens (1000 characters), and should be appropriate for a few paragraphs of content.
    - More thoughts here.[^1]

3. Similarity scores: We can use other similarity measures besides cosine similarity to determine which chunks to retrieve from our vector database for the model to consider. For the purpose of this example app, cosine similarity seems like a good choice.
    - Claude gave me lots to read up on in this area! In general, the advice seems to be to start with cosine similarity and experiment with other measures if you have an appropriate use case.[^2]
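To get a feel for the chunk size lesson above, here is a small sketch of fixed-size chunking with overlap. It is my own illustration rather than the library's code: `split_fixed` is a hypothetical name, and the 1000/200 character defaults are assumptions chosen for the example.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# See how the number of chunks changes with chunk size on a 2500-character text.
sample = "x" * 2500
for size in (500, 1000, 2000):
    chunks = split_fixed(sample, chunk_size=size, chunk_overlap=size // 5)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Smaller chunks mean more embeddings to store and search, but each one is more focused; larger chunks carry more context per retrieved source. Running the pipeline at a few sizes and comparing the responses is probably the most honest way to pick one.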

### 4 Lessons Not Yet Learned

1. Adding and using metadata: I can see how useful it would be to capture data such as author name, document name, date published, category, etc. Once that info is added, I'd want to update the prompt to include this information in a structured way in the response. Can I "weight" metadata more heavily to boost scores, like one does in search? I bet it's possible!

2. Evaluating and improving: The similarity scores are all about the same in this example. How do I know if that is good? What is the process and some metrics to benchmark and continue to improve the results?

3. Repeated data: What if I mess up and re-insert data that is already there? I need to read the docs to see whether the vector database implementation handles this or whether my code needs to deduplicate the data itself.

4. Code modularity: I piggybacked off of Chris' awesome library. I need to gain more experience in the right way to structure RAG projects for reuse in Python. If I were to containerize this, how would I split out the functionality? For example, one change I think I'd make is to move the prompts to their own directory and add several more that could be swapped in and out for experimentation.
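As a starting point for the metadata and repeated-data questions above, here is a rough sketch of a side registry that stores metadata per chunk and refuses to add the same text twice. `ChunkRegistry` is a hypothetical helper, not part of `aimakerspace`, and I have not checked how `VectorDatabase` itself treats duplicate inserts.

```python
import hashlib
from typing import Any, Dict, List, Optional


class ChunkRegistry:
    """Hypothetical helper: tracks chunk metadata and skips repeated text."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the whitespace-trimmed text so trivial differences still match.
        return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

    def add(self, text: str, metadata: Optional[Dict[str, Any]] = None) -> bool:
        """Return True if the chunk is new, False if it was already stored."""
        key = self._key(text)
        if key in self._records:
            return False
        self._records[key] = {"text": text, "metadata": dict(metadata or {})}
        return True

    def filter(self, **criteria: Any) -> List[str]:
        """Return chunk texts whose metadata matches every key=value pair."""
        return [
            record["text"]
            for record in self._records.values()
            if all(record["metadata"].get(k) == v for k, v in criteria.items())
        ]
```

One way to use it would be to call `add()` for each chunk before embedding and only send the chunks that return `True` to the vector database, which also avoids paying for duplicate embedding calls.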

[^1]: More about chunking:

- From @Hee-Meng in Discord: [Chunking Strategies](https://www.youtube.com/watch?v=ZTOtxiWb2bE) was so helpful! There are at least 5 different strategies for chunking (fixed size, context aware, recursive, specialized, semantic). The advice given is to first clean your data, select the technique most appropriate for your use case, experiment with chunks of various sizes, and then assess the quality of responses before deciding on the optimal size.
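As a contrast to fixed-size chunking, here is a minimal sketch of one context-aware approach: packing whole paragraphs into chunks up to a size limit. This is my own simplified reading of the idea, not code from the video or from `aimakerspace`; `split_by_paragraph` is a hypothetical name, and it assumes paragraphs are separated by blank lines.

```python
from typing import List


def split_by_paragraph(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole in this sketch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # The paragraph still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because chunks end on paragraph boundaries, a retrieved source is less likely to start or stop mid-sentence, at the cost of uneven chunk lengths.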

[^2]: More similarity metrics:
- **Cosine Similarity**
    Measures the angle between vectors, ignoring magnitude. Values range from -1 to 1.
    _Best for:_ Text embeddings, semantic search, and cases where document length varies significantly. It's the most common choice for RAG because it compares the direction of the vectors rather than their magnitude, so documents on similar topics score as similar regardless of their length.

- **Euclidean Distance (L2)**
    Measures straight-line distance between points in vector space. Smaller distances indicate higher similarity.
    _Best for:_ When both direction and magnitude matter, such as image embeddings or structured data. Less common in text RAG but useful when the embedding model produces vectors where magnitude carries semantic meaning.

- **Dot Product**
    Multiplies corresponding vector components and sums them. Higher values indicate greater similarity.
    _Best for:_ When vectors are normalized (in which case it gives the same value as cosine similarity) or when you want magnitude to count alongside direction. Often used in neural networks and when working with embeddings that have been specifically designed for dot product similarity.

- **Manhattan Distance (L1)**
    Sums absolute differences between corresponding vector components.
    _Best for:_ High-dimensional sparse vectors or when you want to reduce the impact of outliers. Less sensitive to extreme values than Euclidean distance. Sometimes used with count-based embeddings.

- **Jaccard Similarity**
    Measures overlap between sets, calculated as intersection over union.
    _Best for:_ Sparse binary vectors, keyword matching, or when working with bag-of-words representations. Useful for document similarity based on shared terms rather than semantic meaning.

- **Hamming Distance**
    Counts differing positions between binary vectors.
    _Best for:_ Binary embeddings, locality-sensitive hashing, or when memory/speed constraints require binary representations. Often used in similarity search optimizations.
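
To make the definitions above concrete, here is a small pure-Python sketch of each measure. The function names are my own and are not part of `aimakerspace`; a real implementation would more likely use NumPy on the embedding vectors.

```python
import math
from typing import Sequence, Set


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    # Treat a zero vector as having no similarity to anything.
    return dot_product(a, b) / norm if norm else 0.0


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    union = a | b
    # Two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Same direction, different magnitude: cosine sees these as identical,
# while the distance-based measures do not.
short_doc = [1.0, 2.0, 3.0]
long_doc = [2.0, 4.0, 6.0]
print("cosine:   ", cosine_similarity(short_doc, long_doc))
print("dot:      ", dot_product(short_doc, long_doc))
print("euclidean:", euclidean_distance(short_doc, long_doc))
print("manhattan:", manhattan_distance(short_doc, long_doc))
```

The two example vectors point in the same direction but differ in length, which is the situation where cosine similarity and the distance-based measures disagree the most.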