


## **What is RAG?**

### **Introduction to RAG (Retrieval-Augmented Generation)**

**Retrieval-Augmented Generation (RAG)** is a framework that combines **information retrieval** with **language model generation**. It enables Large Language Models (LLMs) to retrieve relevant content from external sources (e.g., documents, databases, or APIs) and use that information to generate accurate and context-aware responses.

Unlike standard LLMs, which rely solely on their pre-trained knowledge (limited by a knowledge cutoff date), RAG enhances their ability to provide real-time, domain-specific, or custom responses by dynamically accessing external knowledge.

---

#### **Why is RAG Important?**

1. **Overcomes Knowledge Limitations**:
   - Pre-trained LLMs lack knowledge of events or information beyond their training data. RAG solves this by allowing the model to query external sources for up-to-date or domain-specific information.

2. **Improves Accuracy**:
   - By retrieving relevant content, RAG helps generate more factual and precise responses, reducing hallucinations (when an LLM generates incorrect or made-up information).

3. **Enables Domain-Specific Applications**:
   - RAG can be applied to specialized domains (e.g., legal, healthcare, or research) by retrieving from custom knowledge bases or document sets.

4. **Efficient Use of Resources**:
   - Rather than fine-tuning an LLM for specific tasks, RAG allows you to **retrieve and utilize knowledge dynamically**, making it more efficient and cost-effective.

---

#### **How RAG Works**:
RAG typically involves the following steps:
1. **Document Indexing**:
   - Convert a collection of documents into vector representations (embeddings) and store them in a **vector database** (e.g., FAISS, Pinecone).

2. **Query Embedding and Retrieval**:
   - Convert the user’s query into an embedding and perform a **similarity search** to find the most relevant documents from the vector store.

3. **Generation**:
   - Combine the retrieved documents with the query and feed them into an LLM to generate a response.
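
The three steps above can be sketched without any external service. This is a toy illustration only: the bag-of-words `embed` function is a stand-in assumption for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical helpers, not LangChain APIs.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# 1. "Index" the documents (here: just keep them in a list)
docs = [
    "novice collectors are new to art",
    "faiss stores vectors for similarity search",
    "pdf loaders extract text from files",
]

# 2. Embed the query and retrieve the closest document
query = "how are vectors stored for search"
context = retrieve(query, docs, k=1)

# 3. Combine the retrieved text with the query into a prompt for an LLM
prompt = f"Context:\n{context[0]}\n\nQuestion: {query}\nAnswer:"
```

A real pipeline replaces `embed` with a learned embedding model and the list scan with a vector index, but the flow stays the same.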

---

#### **Use Cases of RAG**:
- **Document-Based Question Answering**: Answer questions based on a knowledge base (e.g., company manuals, research papers).
- **Chatbots with External Knowledge**: Build chatbots that use real-time or custom data.
- **Search-Enhanced Applications**: Improve search engines by retrieving and summarizing information.
- **Content Summarization and Insight Extraction**: Summarize large documents or extract key insights.

---

This introduction sets the stage for building RAG-based applications using LangChain. In the next sections, we’ll explore how to:
1. **Load and Index Documents**.
2. **Retrieve Relevant Content**.
3. **Combine Retrieval with Language Models**.


In [19]:
# !pip install langchain
# !pip install openai
# !pip install python-dotenv
# !pip install langchain-openai
# !pip install langchain langchainhub langchain_openai faiss-cpu tiktoken pypdf langchain-community
# !pip install PyMuPDF

In [22]:
import os
import re  # Used by the text-cleaning functions in later cells
from dotenv import load_dotenv
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv('/content/API_KEYS.env')
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check /content/API_KEYS.env")
# Set the environment variable globally for libraries like LangChain
os.environ["OPENAI_API_KEY"] = api_key
# Confirm the key loaded without printing the secret itself
print("API Key loaded from .env:", api_key[:7] + "...")

API Key loaded from .env: sk-proj...


### RAG using Langchain

In [23]:
# Step 1: Load the PDF Document
pdf_loader = PyPDFLoader("/content/Art Collector personality types_ finish.pdf")
documents = pdf_loader.load()

# Step 2: Split Text into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# Step 3: Generate Embeddings and Store in FAISS
embeddings = OpenAIEmbeddings(openai_api_key=api_key)  # Correct usage
vectorstore = FAISS.from_documents(texts, embeddings)

print("PDF document loaded, split, and indexed successfully!")

# Step 4: Perform Retrieval
retriever = vectorstore.as_retriever()

# Step 5: Query the Knowledge Base
query = "What is the main topic discussed in the document?"
retrieved_docs = retriever.invoke(query)

# Step 6: Display Retrieved Results
print("\nTop Retrieved Document:")
print(retrieved_docs[0].page_content)

PDF document loaded, split, and indexed successfully!

Top Retrieved Document:
Build the bridge between each artwork and it’s 
historical context. Talk to people from the academic 
field about those historical references and build 
connections out with them.
9
10



### **How to Fix the Output**
The low-quality output comes from how the **PDF text extraction** and subsequent **text chunking** behave. When extracting text from a PDF, some documents can have:
1. **Page Numbers or Hidden Artifacts**: Numbers, line breaks, or extra whitespace might get included as artifacts.
2. **Text Alignment Issues**: PDF formats are often complex, and tools like `PyPDFLoader` might extract unintended content (e.g., headers, footers, or numbers).

To clean up the retrieved results and remove unwanted numbers or artifacts, you can use one of the following approaches:

---

### **Solution 1: Post-Process Retrieved Text**
You can add a **post-processing step** to clean the extracted text before displaying it.

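1. **Remove Numbers**:
   - `re.sub(r'\s*\d+\s*', '', text)` removes every run of digits together with the whitespace around it. This strips page numbers, but it also deletes digits inside sentences and can fuse the words on either side.
2. **Remove Excess Newlines**:
   - `re.sub(r'\n+', ' ', text)` replaces each run of line breaks with a single space.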
3. **Strip Extra Spaces**:
   - `.strip()` ensures no extra whitespace at the start or end.
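
To see what the number-removal pattern does in practice, here is a small comparison on a made-up sample string. The narrower pattern is an assumption offered as an alternative, not the notebook's method: it drops only lines that consist solely of digits.

```python
import re

# Made-up sample: a title, page numbers on their own lines, and a digit inside a sentence
sample = "types\n1\nThe Novice Collector has 3 works.\n9\n10\n"

# The pattern used in this section: deletes every digit run plus the whitespace around it
broad = re.sub(r'\s*\d+\s*', '', sample)

# A narrower alternative (assumption): delete only digit-only lines, e.g. page numbers
narrow = re.sub(r'(?m)^[ \t]*\d+[ \t]*$\n?', '', sample)

print(repr(broad))   # 'typesThe Novice Collector hasworks.'
print(repr(narrow))  # 'types\nThe Novice Collector has 3 works.\n'
```

The broad pattern fuses `types` with `The` and removes the `3` from the sentence, while the narrow one leaves sentence content untouched.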



In [24]:
# Step 4: Perform Retrieval
retriever = vectorstore.as_retriever()

# Step 5: Query the Knowledge Base
query = "What is the main topic discussed in the document?"
retrieved_docs = retriever.invoke(query)

# Step 6: Post-process the Retrieved Results
def clean_text(text):
    import re
    cleaned_text = re.sub(r'\s*\d+\s*', '', text)  # Removes every digit run and the whitespace around it
    cleaned_text = re.sub(r'\n+', ' ', cleaned_text).strip()  # Replaces runs of newlines with a space
    return cleaned_text

# Clean the first retrieved document
cleaned_result = clean_text(retrieved_docs[0].page_content)

# Step 7: Display Cleaned Results
print("\nTop Retrieved Document (Cleaned):")
print(cleaned_result)


Top Retrieved Document (Cleaned):
Build the bridge between each artwork and it’s  historical context. Talk to people from the academic  field about those historical references and build  connections out with them.


### **Solution 2: Pre-Process Text During Chunking**
You can clean the text **before indexing** it in the vector store, so that:
- Text is cleaned **before** it gets split into chunks.
- Only clean text is indexed in the FAISS database.

---

### **Solution 3: Tune the Chunking Strategy**
Increasing **`chunk_overlap`** does not remove header or footer numbers, but it repeats more of the surrounding text in neighboring chunks, so a sentence interrupted by an artifact is more likely to appear whole in at least one chunk.

Try:
```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
```
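
The effect of overlap is easiest to see with a simplified fixed-window splitter. This sketch is an assumption for illustration: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on separators rather than at fixed character offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each starts (chunk_size - chunk_overlap) after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"
no_overlap = sliding_chunks(text, chunk_size=10, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_size=10, chunk_overlap=4)

print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']
```

With overlap, the last four characters of each window reappear at the start of the next one, at the cost of producing more chunks to embed.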

---



In [26]:
# Clean each document page before splitting
def preprocess_document(doc):
    import re
    content = re.sub(r'\s*\d+\s*', '', doc.page_content)  # Remove digit runs and surrounding whitespace
    content = re.sub(r'\n+', ' ', content).strip()       # Replace newlines with spaces and strip the ends
    doc.page_content = content
    return doc

# Clean the text before splitting
cleaned_documents = [preprocess_document(doc) for doc in documents]

# Split Text into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(cleaned_documents)
print(texts)

[Document(metadata={'source': '/content/Art Collector personality types_ finish.pdf', 'page': 0}, page_content='rt  Collector  personality  types'), Document(metadata={'source': '/content/Art Collector personality types_ finish.pdf', 'page': 3}, page_content="The Novice Collector: These are individuals just starting in the art collection  world. They are eager, curious, and often rely on others' opinions or visible  trends. For artists, understanding their tastes and guiding them can lead to  a long-term patron relationship. Their inexperience can be an advantage  because they might be more open to various art styles. The Niche Enthusiast: This collector looks for very specific art forms, be it  abstract, surrealism, or any niche genre. They’re"), Document(metadata={'source': '/content/Art Collector personality types_ finish.pdf', 'page': 3}, page_content='abstract, surrealism, or any niche genre. They’re well-informed about their  chosen niche and appreciate artists who excel within i

### **Best Approach**

### PDF Loading, Pre-Processing, Indexing, and Post-Processing Retrieval
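Combining **Solution 2 (pre-processing)** and **Solution 1 (post-processing)** gives clean text at both storage and retrieval time. The cell below is the baseline for that comparison: it indexes the raw `PyPDFLoader` output without either cleaning step and adds the LLM generation step.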



In [38]:
# Step 1: Load the PDF Document
pdf_loader = PyPDFLoader("/content/Art Collector personality types_ finish.pdf")  # Replace with your PDF file path
documents = pdf_loader.load()

# Step 2: Split Text into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# Step 3: Generate Embeddings and Store in FAISS
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
vectorstore = FAISS.from_documents(texts, embeddings)

# Step 4: Initialize the Retriever and LLM
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini", temperature=0.5)

# Step 5: Define a Function for RAG
def rag_query(query):
    # Retrieve relevant documents
    retrieved_docs = retriever.invoke(query)
    context = "\n\n".join([doc.page_content for doc in retrieved_docs[:2]])  # Combine top 2 results for context

    # Pass retrieved context and query to the LLM
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    response = llm.invoke(prompt)

    return response.content

# Step 6: Ask a Question
query = "What are the personality types described in the document?"
response = rag_query(query)

# Step 7: Display the Response
print("\nQuestion:", query)
print("\nAnswer:", response)



Question: What are the personality types described in the document?

Answer: The document specifically describes the "Futurist collector" personality type. Additionally, it mentions that most collectors are a combination of several personality types, indicating that there are other personality types not explicitly detailed in the provided context. However, the document does not list or define these other personality types.


The current output is very poor. The PDF is all about the different art investor personality types, but the response says "the document does not list or define these other personality types."

The quality of the output depends on how well each step of the **RAG pipeline** works: **loading the document**, **splitting it into meaningful chunks**, **embedding the text**, **retrieving relevant context**, and **generating accurate responses**.

To systematically **debug and improve** the code, we’ll break the workflow into steps and verify each one. This way, we can pinpoint where the breakdown happens and ensure the pipeline works as expected.

---

### **Step-by-Step RAG Debugging Plan**
1. **Verify PDF Loading**: Ensure the text is being extracted correctly from the PDF.
2. **Inspect Text Splitting**: Check that the text is split into coherent chunks.
3. **Check Embeddings**: Verify embeddings are created and stored successfully.
4. **Test Retrieval**: Confirm the retriever returns relevant chunks.
5. **Improve Generation**: Optimize how retrieved context is passed to the LLM.

---

### **Step 1: Verify PDF Loading**
We’ll start by loading the PDF and printing the raw text to ensure the content is being extracted correctly.

### **What to Look For**:
- Is the extracted text accurate, without weird artifacts (numbers, broken sentences, etc.)?
- Does the content include the descriptions of art collector personality types?

If the output contains page numbers, extra whitespace, or formatting artifacts, we’ll clean it up before moving to the next step.
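
A quick way to run this check is to flag lines that look like extraction debris. The sketch below is a heuristic under stated assumptions: `find_artifact_lines` and its `min_length` threshold are hypothetical, and the sample string only mimics the kind of output seen from the PDF.

```python
def find_artifact_lines(text, min_length=3):
    """Return (line_number, line) pairs that look like extraction debris.

    Heuristic (an assumption): a line is suspicious if it is digits only
    or shorter than `min_length` characters after stripping.
    """
    suspicious = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.isdigit() or len(stripped) < min_length:
            suspicious.append((number, stripped))
    return suspicious

# Sample shaped like the first lines of the extracted text
page = "rt \nCollector \npersonality \ntypes\n1\nThe Novice Collector: eager and curious.\n"
flags = find_artifact_lines(page)
print(flags)  # [(1, 'rt'), (5, '1')]
```

Running it on `documents[0].page_content` would give a short list of lines to inspect by eye before deciding what to clean.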




In [39]:
from langchain_community.document_loaders import PyPDFLoader

# Step 1: Load the PDF Document
pdf_loader = PyPDFLoader("/content/Art Collector personality types_ finish.pdf")
documents = pdf_loader.load()

# Step 2: Print Raw Content of the PDF
print("Raw PDF Content (First Page):\n")
print(documents[0].page_content[:2000])  # Print first 2000 characters of the first page

Raw PDF Content (First Page):

rt 
Collector 
personality 
types



### PyMuPDF (fitz) – Best for Text-Based PDFs
The first page returned by `PyPDFLoader` contains only a truncated title, so we try a second extractor. PyMuPDF is fast and lightweight, and it generally handles text-based PDFs well.

In [4]:
import fitz  # PyMuPDF

# Load PDF and extract text
def extract_text_pymupdf(file_path):
    # Open the PDF and close it automatically when done
    with fitz.open(file_path) as doc:
        # Extract text from each page and join the pages into one string
        return "".join(page.get_text() for page in doc)

# Use the function
pdf_path = "/content/Art Collector personality types_ finish.pdf"
pdf_text = extract_text_pymupdf(pdf_path)

# Print extracted text
print("Extracted Text (First 2000 characters):\n")
print(pdf_text[:2000])

Extracted Text (First 2000 characters):

rt 
Collector 
personality 
types
The Novice Collector: These are individuals just starting in the art collection 
world. They are eager, curious, and often rely on others' opinions or visible 
trends. For artists, understanding their tastes and guiding them can lead to 
a long-term patron relationship. Their inexperience can be an advantage 
because they might be more open to various art styles.
The Niche Enthusiast: This collector looks for very specific art forms, be it 
abstract, surrealism, or any niche genre. They’re well-informed about their 
chosen niche and appreciate artists who excel within it. For artists, aligning 
with their niche preference is key to gain their attention.
a. Offer educational content about art, perhaps 
through workshops. Assist them in understanding the 
intricacies of the art world.
a. Showcase expertise within their preferred niche 
and become a go to resource to expand their 
understanding of that niche. Creat

### **Step 2: Clean and Split Text into Chunks**
If Step 1 shows issues, we’ll clean the text before splitting. We’ll also print the chunks to ensure they make sense.

### **What to Look For**:
- Are the text chunks clean and coherent?
- Do they retain enough context about art collector personality types?



In [6]:
# Step 1: Split Extracted Text into Chunks
def split_text_to_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = text_splitter.create_documents([text])
    return chunks

# Use the function to split extracted text
texts = split_text_to_chunks(pdf_text)

# Step 2: Print Example Chunks
print("Example Text Chunks:\n")
for i, chunk in enumerate(texts[:2]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")


Example Text Chunks:

Chunk 1:
rt 
Collector 
personality 
types

1

The Novice Collector: These are individuals just starting in the art collection 
world. They are eager, curious, and often rely on others' opinions or visible 

trends. For artists, understanding their tastes and guiding them can lead to 

a long-term patron relationship. Their inexperience can be an advantage 

because they might be more open to various art styles.

Chunk 2:
a. Offer educational content about art, perhaps 
through workshops. Assist them in understanding the 

intricacies of the art world.

The Niche Enthusiast: This collector looks for very specific art forms, be it 
abstract, surrealism, or any niche genre. They’re well-informed about their 

chosen niche and appreciate artists who excel within it. For artists, aligning 

with their niche preference is key to gain their attention.

2



The output from **text chunking** shows some issues, but it’s close to being usable. Here's a detailed breakdown:

---

### **Issues with the Chunks**
1. **Artifacts**:
   - The first chunk contains stray text such as **`rt`** (most likely the title word "Art" with its first letter lost in extraction) and a bare `1` that don’t belong to the actual content.
   - These artifacts are likely remnants of formatting, headers, or footers extracted during PDF parsing.

2. **Improper Separation**:
   - The start of **Chunk 2** ("a. Offer educational content...") appears disconnected from the preceding content, making it unclear if it belongs to "The Novice Collector" or a separate section.

3. **Loss of Context**:
   - Descriptions of collector types like **"The Novice Collector"** and **"The Niche Enthusiast"** are split across chunks, which could impact retrieval and final answer generation.

---

### **How to Fix It**

We’ll improve the chunking process by:
1. **Pre-cleaning the Text**: Remove stray characters, unnecessary whitespace, and any page artifacts.
2. **Improving Chunk Boundaries**: Use paragraph-based splitting instead of character splitting to preserve context better.

---

### **Improved Text Splitting: Paragraph-Based Chunks**
We'll clean the extracted text and split it into logical chunks based on paragraphs, ensuring better context preservation.


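### **What’s Improved**
1. **Artifacts Removed**: Bare page numbers such as `1` are cleaned out. The truncated title fragment `rt` is not a number, so it remains.
2. **Logical Chunks**: The goal is **paragraph-based chunks** instead of arbitrary cuts at character limits. Note that `clean_text` collapses all whitespace, including newlines, so the text reaches the splitter as a single paragraph and is still cut by character count.
3. **Context Retained**: Each description now reads as continuous text, although a long description (like "The Novice Collector") can still spill into the next chunk.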
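
One detail worth checking in a cleaning step like this: if every whitespace run, newlines included, is collapsed to a space, a later `split("\n")` has nothing left to split on. Below is a minimal sketch, assuming page numbers sit on their own lines, of a cleaner that keeps line breaks. `clean_keep_newlines` is a hypothetical helper, not the function used in this notebook.

```python
import re

def clean_keep_newlines(text):
    """Remove digit-only lines and tidy spaces, but keep one newline per line break."""
    text = re.sub(r'(?m)^[ \t]*\d+[ \t]*$', '', text)  # blank out page-number lines
    text = re.sub(r'[ \t]+', ' ', text)                 # collapse spaces and tabs only
    text = re.sub(r' ?\n ?', '\n', text)                # trim spaces around newlines
    text = re.sub(r'\n{2,}', '\n', text)                # squeeze runs of blank lines
    return text.strip()

# Made-up sample with page numbers on their own lines
raw = "types\n\n1\n\nThe Novice Collector: eager. \nThey rely on trends.\n\n2\n"
lines = clean_keep_newlines(raw).split("\n")
print(lines)  # ['types', 'The Novice Collector: eager.', 'They rely on trends.']
```

Because PDF extraction usually inserts a newline at every visual line wrap, these pieces are lines rather than true paragraphs; grouping them into paragraphs would need an extra rule, such as splitting on headings.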



In [12]:
import re

# Step 1: Pre-cleaning Function
def clean_text(text):
    text = re.sub(r'\s*\d+\s*', '', text)  # Remove digit runs and the whitespace around them
    text = re.sub(r'\n+', '\n', text)  # Replace multiple newlines with a single newline
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse all whitespace, newlines included, to single spaces
    return text

# Step 2: Split Text into Paragraphs
def split_text_by_paragraphs(text):
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return paragraphs

# Step 3: Split Cleaned Paragraphs into Chunks
def split_paragraphs_to_chunks(paragraphs, chunk_size=500, chunk_overlap=50):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.create_documents(paragraphs)
    return chunks

# Step 4: Apply Cleaning and Chunking
cleaned_text = clean_text(pdf_text)  # Clean the extracted text
paragraphs = split_text_by_paragraphs(cleaned_text)  # Split text into paragraphs
chunks = split_paragraphs_to_chunks(paragraphs)  # Create chunks

# Step 5: Print Example Chunks
print("Improved Text Chunks:\n")
for i, chunk in enumerate(chunks[:2]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Improved Text Chunks:

Chunk 1:
rt Collector personality typesThe Novice Collector: These are individuals just starting in the art collection world. They are eager, curious, and often rely on others' opinions or visible trends. For artists, understanding their tastes and guiding them can lead to a long-term patron relationship. Their inexperience can be an advantage because they might be more open to various art styles. a. Offer educational content about art, perhaps through workshops. Assist them in understanding the

Chunk 2:
workshops. Assist them in understanding the intricacies of the art world. The Niche Enthusiast: This collector looks for very specific art forms, be it abstract, surrealism, or any niche genre. They’re well-informed about their chosen niche and appreciate artists who excel within it. For artists, aligning with their niche preference is key to gain their attention.a. Showcase expertise within their preferred niche and become a go to resource to expand their under



### **Improved Approach: Identify and Split Personality Types**

1. **Define Patterns**: Identify where each personality type starts using keywords like "The [Personality Type]".
2. **Chunk by Patterns**: Split the text logically based on these patterns so that each chunk corresponds to one personality type.

---

### **What This Code Does**
1. **Clean the Text**:
   - Removes unnecessary artifacts like numbers, extra whitespace, and multiple newlines.

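2. **Split by Pattern**:
   - Uses a regex pattern `r"(?=The [A-Z][a-z]+ Collector:)"` to split the text before each heading of the form "The [Word] Collector:".

3. **Output Logical Chunks**:
   - Each chunk should correspond to one personality type (e.g., "The Novice Collector"). Headings that do not end in "Collector:" (e.g., "The Niche Enthusiast:"), or that have more than one word before "Collector", do not match, so those sections stay attached to the preceding chunk.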
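
Heading names in the document do not all end in "Collector" (for example, "The Niche Enthusiast"), so a broader pattern may be worth trying. The sketch below is an assumption, not the notebook's pattern: it treats "The" followed by one to four capitalized words and a colon as a heading, which could also match ordinary sentences of that shape.

```python
import re

# Hypothetical broader pattern: "The" + one to four capitalized words + colon
HEADING = re.compile(r"(?=The (?:[A-Z][A-Za-z-]+ ){0,3}[A-Z][A-Za-z-]+:)")

def split_on_headings(text):
    """Split text before each heading-like phrase, dropping empty pieces."""
    return [part.strip() for part in HEADING.split(text) if part.strip()]

# Made-up sample shaped like the cleaned PDF text
sample = (
    "Collector personality types "
    "The Novice Collector: eager and curious. "
    "The Niche Enthusiast: focused on one genre. "
    "The Art Fair Regular: visits major fairs."
)

parts = split_on_headings(sample)

# For comparison: the narrower pattern used in this notebook
narrow_parts = [
    p.strip()
    for p in re.split(r"(?=The [A-Z][a-z]+ Collector:)", sample)
    if p.strip()
]

print(len(parts), len(narrow_parts))  # 4 2
```

On this sample the broader pattern yields one chunk per heading, while the narrower one leaves every section after "The Novice Collector:" in a single chunk.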



In [13]:
# Step 1: Pre-cleaning Function
def clean_text(text):
    text = re.sub(r'\s*\d+\s*', '', text)  # Remove isolated numbers
    text = re.sub(r'\n+', '\n', text)  # Replace multiple newlines with a single newline
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Step 2: Split Text into Personality Types
def split_by_personality_types(text):
    # Use regex to find each personality type starting with "The"
    pattern = r"(?=The [A-Z][a-z]+ Collector:)"  # Look for "The [Name] Collector:"
    chunks = re.split(pattern, text)

    # The lookahead keeps each heading inside its chunk; drop empty pieces and trim whitespace
    formatted_chunks = [chunk.strip() for chunk in chunks if chunk.strip()]
    return formatted_chunks

# Step 3: Apply Cleaning and Chunking
cleaned_text = clean_text(pdf_text)  # Clean the extracted text
personality_chunks = split_by_personality_types(cleaned_text)  # Split by personality types

# Step 4: Print Each Personality Type Chunk
print("Personality Type Chunks:\n")
for i, chunk in enumerate(personality_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Personality Type Chunks:

Chunk 1:
rt Collector personality types

Chunk 2:
The Novice Collector: These are individuals just starting in the art collection world. They are eager, curious, and often rely on others' opinions or visible trends. For artists, understanding their tastes and guiding them can lead to a long-term patron relationship. Their inexperience can be an advantage because they might be more open to various art styles. a. Offer educational content about art, perhaps through workshops. Assist them in understanding the intricacies of the art world. The Niche Enthusiast: This collector looks for very specific art forms, be it abstract, surrealism, or any niche genre. They’re well-informed about their chosen niche and appreciate artists who excel within it. For artists, aligning with their niche preference is key to gain their attention.a. Showcase expertise within their preferred niche and become a go to resource to expand their understanding of that niche. Create niche spe



### **Step 3: Embed and Store Chunks**
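Once the text is clean and chunked, we’ll generate embeddings and confirm the vector database is built successfully. Note that the cell below indexes `texts`, the character-based chunks from Step 2, not the cleaned or pattern-based chunks.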



In [15]:
# Generate Embeddings and Store in FAISS
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
vectorstore = FAISS.from_documents(texts, embeddings)

print("Embeddings created and stored successfully!")
print(f"Number of Chunks Indexed: {len(texts)}")

Embeddings created and stored successfully!
Number of Chunks Indexed: 40




### **Step 4: Test Retrieval**
We’ll test the retriever to confirm it returns relevant chunks when queried.


### **What to Look For**:
- Do the retrieved chunks contain relevant content about the **personality types** of art collectors?
- If not, the embedding or chunking strategy may need further tuning.
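
A simple smoke test for this step is to check which expected terms appear in the retrieved chunks. This is a rough sketch under assumptions: `missing_terms` is a hypothetical helper, the chunk strings are shortened examples, and substring matching is only a coarse proxy for relevance.

```python
def missing_terms(chunks, expected_terms):
    """Return the expected terms found in none of the chunks (case-insensitive)."""
    combined = " ".join(chunks).lower()
    return [term for term in expected_terms if term.lower() not in combined]

# Shortened stand-ins for retrieved chunk text
retrieved = [
    "The Novice Collector: These are individuals just starting in the art collection world.",
    "The Niche Enthusiast: This collector looks for very specific art forms.",
]
expected = ["Novice Collector", "Niche Enthusiast", "Futurist Collector"]

gaps = missing_terms(retrieved, expected)
print(gaps)  # ['Futurist Collector']
```

With the real retriever, `retrieved` would be `[doc.page_content for doc in retrieved_docs]`; a long list of gaps suggests the chunking or the number of retrieved chunks needs tuning.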



In [16]:
# Initialize Retriever
retriever = vectorstore.as_retriever()

# Query the Knowledge Base
query = "What are the personality types of art collectors described in the document?"
retrieved_docs = retriever.invoke(query)

# Display Retrieved Chunks
print("Top Retrieved Chunks:\n")
for i, doc in enumerate(retrieved_docs[:2]):
    print(f"Chunk {i+1}:\n{doc.page_content}\n")

Top Retrieved Chunks:

Chunk 1:
rt 
Collector 
personality 
types

1

The Novice Collector: These are individuals just starting in the art collection 
world. They are eager, curious, and often rely on others' opinions or visible 

trends. For artists, understanding their tastes and guiding them can lead to 

a long-term patron relationship. Their inexperience can be an advantage 

because they might be more open to various art styles.

Chunk 2:
a. Offer educational content about art, perhaps 
through workshops. Assist them in understanding the 

intricacies of the art world.

The Niche Enthusiast: This collector looks for very specific art forms, be it 
abstract, surrealism, or any niche genre. They’re well-informed about their 

chosen niche and appreciate artists who excel within it. For artists, aligning 

with their niche preference is key to gain their attention.

2




### **Step 5: Improve Generation**
Once we verify retrieval is working correctly, we’ll pass the retrieved chunks as context to the LLM in a clean and structured prompt.

### **What This Does**:
1. Combines the **retrieved chunks** into a clear, structured context.
2. Passes the context and query to the LLM, ensuring it focuses on the relevant information.
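
One possible shape for such a structured prompt is sketched below. The instruction wording and the `build_prompt` helper are assumptions for illustration, not the prompt used in this notebook.

```python
def build_prompt(chunks, question):
    """Number each chunk and ask the model to answer only from the context."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    ["The Novice Collector: eager and curious. ", "The Niche Enthusiast: focused on one genre."],
    "What are the personality types of art collectors described in the document?",
)
print(prompt)
```

Numbering the chunks makes it easier to see which passage an answer drew on, and the explicit instruction gives the model a way to report missing context instead of guessing.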

---

### **Summary of Steps**
1. **Load and Verify PDF**: Ensure text is extracted accurately.
2. **Clean and Split**: Clean up artifacts and split text into coherent chunks.
3. **Embed and Store**: Generate embeddings and confirm vector storage.
4. **Test Retrieval**: Verify that relevant chunks are being retrieved.
5. **Generate Answer**: Pass clean, structured context to the LLM for accurate responses.



In [18]:
# Initialize LLM
llm = ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini", temperature=0.5)

# Combine Retrieved Chunks as Context
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
# `query` was defined in the retrieval cell above
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Generate Answer
response = llm.invoke(prompt)

# Display the Answer
print("\nGenerated Answer:\n")
print(response.content)


Generated Answer:

The document describes the following personality types of art collectors:

1. **The Novice Collector**: Individuals just starting in the art collection world, eager and curious, often relying on others' opinions or trends.

2. **The Niche Enthusiast**: Collectors who seek specific art forms or genres, well-informed about their chosen niche and appreciate artists excelling within it.

3. **The Investment Collector**: Collectors primarily driven by the potential return on investment, looking for artworks that will appreciate in value.

4. **The Aesthete**: Collectors who focus on beauty and seek artworks that appeal to their sense of aesthetics.



The pattern-based chunking splits the PDF into sections aligned with the personality types of art collectors. Headings that match the pattern (e.g., *The Novice Collector*) start a new chunk, which preserves context and readability; headings that do not end in "Collector:" (e.g., *The Niche Enthusiast*) stay attached to the preceding chunk.

---

### **Next Steps: Enhancing RAG**

Now that we have clean and meaningful chunks, let’s move forward to build the **retrieval-augmented generation (RAG)** pipeline. Here's what we’ll do:

1. **Embed the Chunks**: Convert each chunk into vector embeddings and store them in a **FAISS** vector database.
2. **Query the Chunks**: Use a retriever to fetch the most relevant personality type based on a question.
3. **Generate an Answer**: Combine the retrieved content with a natural-language query, and use an LLM to generate a clear, concise answer.

---

### **Explanation of Steps**

1. **Text Cleaning**:
   - Removes artifacts (numbers, line breaks) for clean processing.

2. **Regex-Based Chunking**:
   - Each chunk corresponds to a specific **art collector personality type**.
   - Regex: `(?=The [A-Z][a-z]+ Collector:)` matches the position just before each heading of the form "The [Word] Collector:".

3. **Embedding Chunks**:
   - Converts the personality type chunks into **embeddings** using `OpenAIEmbeddings`.
   - Stores the embeddings in a **FAISS** vector database for fast semantic retrieval.

4. **Retrieval and Generation**:
   - The retriever returns the chunks most similar to the query, and the code keeps the top 2.
   - The retrieved chunks are passed as **context** to the LLM to generate an answer.



### **Why This is Effective for RAG**
- **Accurate Retrieval**: Each chunk is logically aligned with a personality type, making retrieval precise.
- **Improved Generation**: By using the retrieved context, the LLM focuses on generating a factually accurate and concise answer.


In [14]:
# Step 1: Pre-cleaning Function
def clean_text(text):
    text = re.sub(r'\s*\d+\s*', '', text)  # Remove digit runs and the whitespace around them
    text = re.sub(r'\n+', '\n', text)  # Replace multiple newlines with a single newline
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse all whitespace, newlines included, to single spaces
    return text

# Step 2: Embed and Store the Chunks
def create_vectorstore(text):
    # Split by personality types
    personality_chunks = re.split(r"(?=The [A-Z][a-z]+ Collector:)", text)
    formatted_chunks = [chunk.strip() for chunk in personality_chunks if chunk.strip()]

    # Convert into LangChain documents
    from langchain_core.documents import Document
    documents = [Document(page_content=chunk) for chunk in formatted_chunks]

    # Create embeddings and store in FAISS
    embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
    vectorstore = FAISS.from_documents(documents, embeddings)
    return vectorstore

# Step 3: Query the Vectorstore and Generate Answers
def rag_pipeline(vectorstore, query):
    # Initialize Retriever
    retriever = vectorstore.as_retriever()
    retrieved_docs = retriever.invoke(query)

    # Combine Retrieved Content
    context = "\n\n".join([doc.page_content for doc in retrieved_docs[:2]])

    # Pass to LLM
    llm = ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini", temperature=0.5)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    response = llm.invoke(prompt)

    return response.content

# Step 4: Run the RAG Pipeline
if __name__ == "__main__":
    # Load Cleaned Text
    cleaned_text = clean_text(pdf_text)

    # Create Vectorstore
    vectorstore = create_vectorstore(cleaned_text)
    print("Chunks successfully embedded and stored in FAISS!\n")

    # Query the RAG System
    query = "What are the personality types of art collectors?"
    response = rag_pipeline(vectorstore, query)

    # Print the Generated Answer
    print("Query:\n", query)
    print("\nGenerated Answer:\n", response)


Chunks successfully embedded and stored in FAISS!

Query:
 What are the personality types of art collectors?

Generated Answer:
 The personality types of art collectors include:

1. **The Traveler Collector**: Collects art that encapsulates memories of their journeys, often seeking works that capture the essence of popular destinations.

2. **The Art Fair Regular**: Frequent visitors to art fairs who look for new talents and pieces to add to their collections. They value presence at major art fairs.

3. **The Family Legacy Collector**: Inherits and continues family art collections, seeking artworks that complement existing pieces, often aligned with historic or traditional styles.

4. **The Bargain Hunter**: Always on the lookout for undervalued pieces or emerging artists, they take pride in finding hidden gems at lower price points.

5. **The Futurist Collector**: Market mavens with a specific vision for the future of art, they collect pieces that align with their futuristic aesthetic

### **Notebook Summary: Building a Robust RAG (Retrieval-Augmented Generation) Pipeline**

---

#### **Overview of the Notebook**
In this notebook, we built a **Retrieval-Augmented Generation (RAG)** pipeline that extracts text from a PDF document, retrieves the passages most relevant to a query, and generates an answer from them. The workflow focused on ensuring **high-quality input and output** by addressing the extraction, cleaning, and chunking problems encountered along the way.

---

### **Key Steps Taken**

1. **PDF Text Extraction**:
   - We started by extracting text from a PDF using **PyMuPDF (fitz)** after identifying that LangChain's `PyPDFLoader` struggled with mixed text and image content.
   - PyMuPDF provided clean and accurate extraction of the text.

2. **Cleaning the Extracted Text**:
   - The extracted text contained artifacts like isolated numbers, unnecessary whitespace, and formatting issues.
   - We implemented a **pre-cleaning function** using regex to:
     - Remove page numbers or random artifacts.
     - Normalize newlines and spaces.

3. **Logical Text Chunking**:
   - To ensure retrieval accuracy, we split the text into meaningful **chunks**.
   - Instead of arbitrary chunking, we used a **pattern-based approach** with regex to split the text at **"The [Personality Type] Collector:"**.
   - This method ensured that each chunk corresponded to a distinct personality type, preserving context.

4. **Vectorization and Storage**:
   - Each cleaned chunk was converted into embeddings using **`OpenAIEmbeddings`** and stored in a **FAISS vector store**.
   - FAISS allowed us to efficiently retrieve the most relevant chunks based on a user query.

5. **Retrieval and Generation**:
   - We queried the vector store to retrieve the most relevant personality type descriptions.
   - The retrieved context was passed to an **LLM** (`gpt-4o-mini`) with a structured prompt to generate a concise and accurate answer.
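
The cleaning and chunking steps can be sketched with the standard `re` module alone. This is a minimal illustration, not the notebook's exact implementation: the function names `clean_text_sketch` and `split_by_collector_sketch`, the regex patterns, and the sample text are all assumptions made for this example. It assumes page numbers appear on lines of their own and that every section starts with a heading of the form "The [Personality Type] Collector:".

```python
import re

def clean_text_sketch(text: str) -> str:
    """Hypothetical pre-cleaning step: drop digit-only lines, normalize whitespace."""
    # Remove lines that contain only digits (likely page numbers)
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse blank lines and whitespace around newlines into one newline
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()

def split_by_collector_sketch(text: str) -> list[str]:
    """Hypothetical chunker: split before each 'The ... Collector:' heading."""
    # The zero-width lookahead splits *before* the heading,
    # so each heading stays attached to the text that follows it
    parts = re.split(r"(?=The [A-Z][A-Za-z ]*Collector:)", text)
    return [p.strip() for p in parts if p.strip()]

# Made-up sample text with stray page numbers and irregular spacing
sample = (
    "Sample guide\n12\n"
    "The Traveler Collector:   Collects art from trips.\n\n\n13\n"
    "The Futurist Collector:\tCollects forward-looking work.\n"
)
chunks = split_by_collector_sketch(clean_text_sketch(sample))
for chunk in chunks:
    print(repr(chunk))
```

Headings that do not end in "Collector:" would not be matched by this pattern, so a real document may need a broader expression.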

---

### **Challenges Faced and Solutions**

| **Challenge**                             | **Solution**                                                                                     |
|-------------------------------------------|--------------------------------------------------------------------------------------------------|
| 1. **PDF Text Extraction Issues**          | Switched to **PyMuPDF** for reliable extraction. PyMuPDF handled complex PDFs better than `PyPDFLoader`. |
| 2. **Artifacts in Text (Numbers/Headers)** | Added a **pre-cleaning function** to remove stray characters, page numbers, and formatting artifacts. |
| 3. **Chunking Misalignment**               | Used a **regex-based pattern** to split text logically at "The [Personality Type] Collector:". This preserved semantic context. |
| 4. **Retrieval Quality**                   | Retrieved the top **2 most relevant chunks** to ensure the query had sufficient context.          |
| 5. **LLM Output Quality**                  | Combined the retrieved chunks into a clear, structured **context + question** prompt for the LLM. |
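
To make "top 2 most relevant chunks" concrete, here is a toy sketch of similarity-based retrieval in plain Python. It is an illustration only: the three-dimensional vectors are invented stand-ins for real embeddings, `cosine_similarity` and `top_k` are hypothetical helper names, and the distance metric used by the actual vector store may differ from cosine similarity.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Invented toy "embeddings": chunks 0 and 1 point roughly the same way as the query
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.0, 0.0]

best = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(best)
```

Keeping `k` small limits the prompt to the closest matches, which is the trade-off the table describes: enough context to answer, without padding the prompt with weakly related chunks.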

---

### **What to Watch Out For to Avoid Similar Issues**

1. **PDF Format**:
   - **Issue**: PDFs can contain a mix of text, images, and hidden formatting artifacts.
   - **Solution**: Use a robust library like **PyMuPDF** for text-based PDFs and OCR-based tools (e.g., `UnstructuredPDFLoader` with Tesseract) for image-based PDFs.

2. **Text Artifacts**:
   - **Issue**: Extracted text may contain page numbers, headers, footers, or random symbols.
   - **Solution**: Implement a **pre-cleaning step** using regex to remove unwanted content.

3. **Chunking Strategy**:
   - **Issue**: Arbitrary chunking (e.g., fixed character chunks) can split related content, reducing retrieval relevance.
   - **Solution**: Use **pattern-based splitting** for structured content (e.g., splitting at section headers or keywords).

4. **Retrieval Accuracy**:
   - **Issue**: Irrelevant or incomplete chunks can be retrieved, affecting the LLM’s output quality.
   - **Solution**:
     - Retrieve the top **2-3 relevant chunks** to provide sufficient context.
     - Verify the retrieved content before passing it to the LLM.

5. **Prompt Design for LLMs**:
   - **Issue**: Poorly structured prompts can confuse the model, leading to inaccurate responses.
   - **Solution**: Use a clear and consistent **"Context + Question + Answer"** prompt format to guide the LLM.
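
The last two points, verifying retrieved content and using a consistent prompt format, can be sketched as two small helpers. This is a hedged example rather than the notebook's code: `preview_chunks` and `build_prompt` are hypothetical names, the sample strings are made up, and the prompt layout simply mirrors the "Context + Question + Answer" format described above.

```python
def preview_chunks(chunks: list[str], width: int = 60) -> list[str]:
    """Return a one-line preview of each retrieved chunk for manual inspection."""
    return [f"[{i}] {' '.join(c.split())[:width]}" for i, c in enumerate(chunks)]

def build_prompt(chunks: list[str], query: str) -> str:
    """Build a 'Context + Question + Answer' prompt from non-empty chunks."""
    usable = [c.strip() for c in chunks if c.strip()]
    if not usable:
        # Fail loudly instead of sending the LLM a prompt with no context
        raise ValueError("No usable context was retrieved.")
    context = "\n\n".join(usable)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Made-up retrieval result, including one empty chunk that should be dropped
retrieved = [
    "The Traveler Collector: Collects art tied to journeys.",
    "   ",
    "The Futurist Collector: Collects forward-looking work.",
]
query = "What are the personality types of art collectors?"

for line in preview_chunks(retrieved):
    print(line)

prompt = build_prompt(retrieved, query)
print(prompt)
```

Printing the previews before calling the LLM makes it easier to tell whether a poor answer comes from retrieval or from generation.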

---

### **Lessons Learned**

1. **Quality of Input Determines Quality of Output**:
   - Garbage in, garbage out. Cleaning and preparing the input text (e.g., pre-processing and logical chunking) is **critical** for high-quality retrieval and generation.

2. **Choosing the Right Tools Matters**:
   - Switching from `PyPDFLoader` to **PyMuPDF** greatly improved text extraction reliability.
   - Always choose tools suited to your data type (e.g., OCR for image-based PDFs).

3. **Semantic Chunking is Key**:
   - Breaking text into logical sections, rather than arbitrary chunks, ensures better retrieval accuracy.

4. **Iterative Debugging**:
   - Breaking the RAG pipeline into smaller steps (e.g., loading, cleaning, chunking, embedding) allowed us to identify and fix issues incrementally.

5. **Structured Prompts Drive Better Results**:
   - Providing clear context to the LLM leads to more relevant and concise answers.

---

### **Final Pipeline Workflow**

1. **Extract**: Load the PDF text with PyMuPDF, then clean it with the regex-based pre-cleaning function.
2. **Chunk**: Split text into logical sections based on patterns (e.g., "The [Personality Type] Collector").
3. **Embed**: Convert chunks into embeddings using `OpenAIEmbeddings`.
4. **Retrieve**: Query the vector store to fetch the most relevant chunks.
5. **Generate**: Pass the retrieved context and query to the LLM for a high-quality answer.

