# 🏗️ Activity #1: Enhanced RAG App

Enhance your RAG application in some way! 

Suggestions are: 

- Allow it to work with PDF files
- Implement a new distance metric
- Add metadata support to the vector database

While these are suggestions, you should feel free to make whatever augmentations you desire! 

> NOTE: These additions might require you to work within the `aimakerspace` library - that's expected!

> NOTE: If you're not sure where to start - ask Cursor (CMD/CTRL+L) to guide you through the changes!
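
As one illustration of the "new distance metric" suggestion, here is a minimal sketch in plain Python that puts Euclidean distance next to cosine similarity. The function names are hypothetical and are not part of `aimakerspace`; how a metric would plug into `VectorDatabase` depends on the library's search signature, which is not shown in this notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the distance into (0, 1] so that higher still means more similar."""
    return 1.0 / (1.0 + euclidean_distance(a, b))
```

A search routine that ranks results by descending score expects "higher is better", which is why the sketch wraps the raw distance in `euclidean_similarity` instead of returning it directly.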

## ✅ Add PDF Files

**Plan:**

1. Copy the original RAG application, because this activity is intended to **enhance** the original version.
    - NOTE: I am only including the code steps in this notebook for brevity; see `Pythonic_RAG_Assignment.ipynb` for the detailed walkthrough.
2. Create a new class that handles loading, embedding, and inserting multiple PDF files into the `VectorDatabase`.
    - NOTE: This class is added to the `aimakerspace` library. It is best practice in data engineering to keep like file formats together, so all PDFs live together in one directory.
3. Run the standard workflow for the PDFs: ingest, chunk, add to the vector database.
    - NOTE: I leave the prompts as-is to assess initial performance with the PDF data addition only.
4. Test retrieval.
5. Assess results and conclusions.
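
Step 2 of the plan can be sketched roughly as follows. This is not the `PDFFileLoader` that was added to `aimakerspace` (its implementation is not reproduced in this notebook); `DirectoryPDFLoader` and `extract_text_with_pypdf` are hypothetical names, and the default extractor assumes the third-party `pypdf` package is installed.

```python
from pathlib import Path
from typing import Callable, List, Optional


def extract_text_with_pypdf(path: Path) -> str:
    """Default extractor; assumes the third-party `pypdf` package is available."""
    from pypdf import PdfReader  # lazy import so the loader itself has no hard dependency

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)


class DirectoryPDFLoader:
    """Hypothetical loader that returns one text string per PDF in a directory."""

    def __init__(
        self,
        directory: str,
        extract_text: Optional[Callable[[Path], str]] = None,
    ) -> None:
        self.directory = Path(directory)
        self.extract_text = extract_text or extract_text_with_pypdf
        self.documents: List[str] = []

    def load_documents(self) -> List[str]:
        if not self.directory.is_dir():
            raise ValueError(f"{self.directory} is not a directory")
        self.documents = []
        # sorted() keeps the document order stable from run to run
        for pdf_path in sorted(self.directory.glob("*.pdf")):
            self.documents.append(self.extract_text(pdf_path))
        return self.documents
```

Passing the extractor in as a callable keeps the directory-walking logic testable without real PDFs and makes it easy to swap in a different PDF library later.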

--- 

For simplicity, we will merge some cells and provide quick comments on the steps.

In [1]:
# import dependencies
from aimakerspace.text_utils import TextFileLoader, CharacterTextSplitter
from aimakerspace.vectordatabase import VectorDatabase
import asyncio
import nest_asyncio

nest_asyncio.apply()

In [2]:
# load text document
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
documents = text_loader.load_documents()

# sanity check
print(f"Loaded {len(documents)} document.")
print(f"\nHere's an excerpt: {documents[0][:100]}")

Loaded 1 document.

Here's an excerpt: 
The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horow


In [3]:
# chunk the loaded text
text_splitter = CharacterTextSplitter()
split_documents = text_splitter.split_texts(documents)

# sanity check
print(f"Split into {len(split_documents)} chunks.")
split_documents[0:1]

Split into 373 chunks.


['\ufeff\nThe Pmarca Blog Archives\n(select posts from 2007-2009)\nMarc Andreessen\ncopyright: Andreessen Horowitz\ncover design: Jessica Hagy\nproduced using: Pressbooks\nContents\nTHE PMARCA GUIDE TO STARTUPS\nPart 1: Why not to do a startup 2\nPart 2: When the VCs say "no" 10\nPart 3: "But I don\'t know any VCs!" 18\nPart 4: The only thing that matters 25\nPart 5: The Moby Dick theory of big companies 33\nPart 6: How much funding is too little? Too much? 41\nPart 7: Why a startup\'s initial business plan doesn\'t\nmatter that much\n49\nTHE PMARCA GUIDE TO HIRING\nPart 8: Hiring, managing, promoting, and Dring\nexecutives\n54\nPart 9: How to hire a professional CEO 68\nHow to hire the best people you\'ve ever worked\nwith\n69\nTHE PMARCA GUIDE TO BIG COMPANIES\nPart 1: Turnaround! 82\nPart 2: Retaining great people 86\nTHE PMARCA GUIDE TO CAREER, PRODUCTIVITY,\nAND SOME OTHER THINGS\nIntroduction 97\nPart 1: Opportunity 99\nPart 2: Skills and education 107\nPart 3: Where to go and wh
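
The jump from 1 document to 373 chunks comes from fixed-size character chunking. Below is a minimal sketch of that idea, assuming a 1000-character window that overlaps its neighbor by 200 characters. The function names and both numbers are illustrative assumptions, since the internals and defaults of `CharacterTextSplitter` are not shown in this notebook.

```python
from typing import Iterable, List


def split_by_characters(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Slide a fixed-size window over the text, stepping by size minus overlap."""
    if chunk_overlap < 0 or chunk_size <= chunk_overlap:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_texts(texts: Iterable[str], **kwargs) -> List[str]:
    """Chunk several documents and return all chunks in one flat list."""
    chunks: List[str] = []
    for text in texts:
        chunks.extend(split_by_characters(text, **kwargs))
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, at the cost of embedding some text twice.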

In [4]:
# input our OpenAI API key to do the API calls
import os
import openai

from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [5]:
# instantiate the vector database and insert embeddings
vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))

In [6]:
# sanity check
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat y

In [7]:
from aimakerspace.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
)

from aimakerspace.openai_utils.chatmodel import ChatOpenAI

# instantiate OpenAI client
chat_openai = ChatOpenAI()


In [8]:
RAG_SYSTEM_TEMPLATE = """You are a knowledgeable assistant that answers questions based strictly on provided context.

Instructions:
- Only answer questions using information from the provided context
- If the context doesn't contain relevant information, respond with "I don't know"
- Be accurate and cite specific parts of the context when possible
- Keep responses {response_style} and {response_length}
- Only use the provided context. Do not use external knowledge.
- Only provide answers when you are confident the context supports your response."""

RAG_USER_TEMPLATE = """Context Information:
{context}

Number of relevant sources found: {context_count}
{similarity_scores}

Question: {user_query}

Please provide your answer based solely on the context above."""

rag_system_prompt = SystemRolePrompt(
    RAG_SYSTEM_TEMPLATE,
    strict=True,
    defaults={
        "response_style": "concise",
        "response_length": "brief"
    }
)

rag_user_prompt = UserRolePrompt(
    RAG_USER_TEMPLATE,
    strict=True,
    defaults={
        "context_count": "",
        "similarity_scores": ""
    }
)

In [9]:
# RAG pipeline
class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase,
                 response_style: str = "detailed", include_scores: bool = False) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever
        self.response_style = response_style
        self.include_scores = include_scores

    def run_pipeline(self, user_query: str, k: int = 4, **system_kwargs) -> dict:
        # Retrieve relevant contexts
        context_list = self.vector_db_retriever.search_by_text(user_query, k=k)

        context_prompt = ""
        similarity_scores = []

        for i, (context, score) in enumerate(context_list, 1):
            context_prompt += f"[Source {i}]: {context}\n\n"
            similarity_scores.append(f"Source {i}: {score:.3f}")

        # Create system message with parameters
        system_params = {
            "response_style": self.response_style,
            "response_length": system_kwargs.get("response_length", "detailed")
        }

        formatted_system_prompt = rag_system_prompt.create_message(**system_params)

        user_params = {
            "user_query": user_query,
            "context": context_prompt.strip(),
            "context_count": len(context_list),
            "similarity_scores": f"Relevance scores: {', '.join(similarity_scores)}" if self.include_scores else ""
        }

        formatted_user_prompt = rag_user_prompt.create_message(**user_params)

        return {
            "response": self.llm.run([formatted_system_prompt, formatted_user_prompt]),
            "context": context_list,
            "context_count": len(context_list),
            "similarity_scores": similarity_scores if self.include_scores else None,
            "prompts_used": {
                "system": formatted_system_prompt,
                "user": formatted_user_prompt
            }
        }

In [10]:
# define it
rag_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

In [11]:
# run it with a query
result = rag_pipeline.run_pipeline(
    "What is the 'Michael Eisner Memorial Weak Executive Problem'?",
    k=3,
    response_length="comprehensive",
    include_warnings=True,
    confidence_required=True
)

print(f"Response: {result['response']}")
print(f"\nContext Count: {result['context_count']}")
print(f"Similarity Scores: {result['similarity_scores']}")


Response: The 'Michael Eisner Memorial Weak Executive Problem' refers to the tendency of a CEO or startup founder to hire an executive who is weak in the function that corresponds to their own strengths. This often occurs when the CEO has difficulty letting go of the role or function that originally brought them into a position of leadership. The context highlights an example of this problem with former Disney CEO Michael Eisner, who had been a successful TV network executive but struggled when he hired a weak executive for ABC after acquiring it. Instead of successfully turning around the network, Eisner claimed that he would be able to do so if he had more time, which illustrates the issue of relying on subpar talent to maintain a sense of control or superiority within a specific operational domain.

Context Count: 3
Similarity Scores: ['Source 1: 0.658', 'Source 2: 0.509', 'Source 3: 0.479']


### 🏗️ Activity #1: Enhance RAG Application

In [12]:
from aimakerspace.text_utils import PDFFileLoader

pdf_loader = PDFFileLoader("data/pdfs")  # directory containing PDF files
pdf_documents = pdf_loader.load_documents()

# sanity check
print(f"Loaded {len(pdf_documents)} documents.") # Number of PDFs loaded
print(f"\nHere's an excerpt: {pdf_documents[2][:100]}") # Show first hundred chars of pdf 3

Loaded 10 documents.

Here's an excerpt: COMP 423 lecture 6 Jan. 16, 2008
Codes for the positive integers
There are many situations in which 


In [13]:
# split the PDF documents into chunks
pdf_text_splitter = CharacterTextSplitter()
split_pdfs = pdf_text_splitter.split_texts(pdf_documents)  # use the splitter created for the PDFs
len(split_pdfs)

# add to the vector_db
vector_db = asyncio.run(vector_db.abuild_from_list(split_pdfs))

In [15]:
# sanity check with direct db query
vector_db.search_by_text("What do http status codes starting with 4 mean?", k=3)

[('st time the client has cached it\n305 Use Proxy The resource should be accessed through a speci\x0ced proxy\n307 Temporary Redirect The request should be repeated with the same request method at the given address. Added in HTTP/1.1 to clarify the ambiguity in the behavior of status\n302. See 302 and 303\n4XX Client Error\nStatus Code Description\n400 Bad Request The request can not be ful\x0clled because the request contained bad syntax\n401 Unauthorized The client needs to authenticate in order to access this resource\n402 Payment Required This code is intended to be used for a micropayment system, but the speci\x0ccs for this system are unspeci\x0ced and this code is rarely used\n403 Forbidden The client is not allowed to access this resource. Generally, the client is authenticated and does not have su\x0ecient permission\n404 Not Found The resource was not found, though its existence in the future is possible\n405 Method Not Allowed The method used in the request is not supported

In [16]:
# sanity check
# ensure we are augmenting and did not remove the existing text!
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)


[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat y
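
The check above relies on recognizing known text by eye. With metadata support (one of the suggestions at the top of this notebook), each chunk could carry a `source` tag so the mix of sources can be counted or filtered directly. The sketch below is a hypothetical in-memory store, not the `aimakerspace` `VectorDatabase`; the class name, method names, and toy 2-D vectors are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class TaggedVectorStore:
    """Hypothetical store that keeps a metadata dict next to every vector."""

    def __init__(self) -> None:
        # key -> (vector, metadata); new keys are added, existing keys are kept
        self._items: Dict[str, Tuple[List[float], Dict[str, str]]] = {}

    def insert(self, key: str, vector: Sequence[float], metadata: Dict[str, str]) -> None:
        self._items[key] = (list(vector), dict(metadata))

    def source_counts(self) -> Counter:
        """How many chunks came from each source."""
        return Counter(meta.get("source", "unknown") for _, meta in self._items.values())

    def search(
        self, query_vector: Sequence[float], k: int, source: Optional[str] = None
    ) -> List[Tuple[str, float]]:
        """Top-k keys by cosine similarity, optionally restricted to one source."""
        scored = [
            (key, _cosine(query_vector, vector))
            for key, (vector, meta) in self._items.items()
            if source is None or meta.get("source") == source
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

With tags like these, "did the PDF ingest wipe out the blog text?" becomes a call to `source_counts()` rather than a manual query per source.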

In [14]:
# run enhanced RAG pipeline with a query
result = rag_pipeline.run_pipeline(
    "What do http status codes starting with 4 mean?",
    k=3,
    response_length="comprehensive",
    include_warnings=True,
    confidence_required=True
)

print(f"Response: {result['response']}")
print(f"\nContext Count: {result['context_count']}")
print(f"Similarity Scores: {result['similarity_scores']}")

Response: HTTP status codes starting with 4 signify "Client Error." This category includes the following specific status codes:

- **400 Bad Request**: The request cannot be fulfilled because it contained bad syntax.
- **401 Unauthorized**: The client needs to authenticate in order to access this resource.
- **402 Payment Required**: This code is intended for a micropayment system, but specifics are unspecified and it's rarely used.
- **403 Forbidden**: The client is not allowed to access this resource, generally meaning the client is authenticated but lacks sufficient permission.
- **404 Not Found**: The resource was not found, although its existence in the future is possible.
- **405 Method Not Allowed**: The method used in the request is not supported by the resource.

These codes indicate that there was an error from the client's side, typically due to invalid requests or permissions issues.

Context Count: 3
Similarity Scores: ['Source 1: 0.556', 'Source 2: 0.551', 'Source 3: 0.52


