# Understanding Retrievers in RAG Systems

### Introduction
Welcome to our exploration of retrievers in Retrieval Augmented Generation (RAG) systems. This notebook demonstrates how to build a simple yet powerful retrieval system that can search through documents and find relevant information based on user queries.

### What are Retrievers?
Retrievers are components in RAG systems that search through a document collection to find information relevant to a query. They serve as the "memory" for large language models, allowing them to access and reference specific information beyond their training data.

### Step 1: Setting Up Our Environment

In [1]:
from langchain_community.document_loaders import PyMuPDFLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

These imports give us access to:

- Document loaders for different file types
- FAISS for efficient similarity search
- OpenAI's embedding model to convert text to vectors
- Text splitters to break documents into manageable chunks

### Step 2: The RAG Pipeline Overview

In [2]:
# LOAD DOCUMENT --> SPLIT CHUNKS

# EMBEDDING --> EMBED CHUNKS --> VECTORS

# VECTOR CHUNKS -- SAVE DB

# "query" --> similarity search faiss db

This represents the typical workflow of a RAG system:

1. Load documents and split into chunks
2. Convert text chunks into vector embeddings
3. Store these vectors in a database
4. Perform similarity search when given a query

### Step 3: Loading and Chunking Documents

In [3]:
loader = TextLoader('../test.txt', encoding = 'UTF-8')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", " "]
    )
docs = text_splitter.split_documents(documents)

Here we:

- Load a text document
- Split it into smaller chunks of 500 characters
- Use a 100-character overlap to maintain context between chunks
- Define separators to ensure chunks break at natural boundaries

In [4]:
# docs

### Step 4: Setting Up the Embedding Model

In [5]:
embedding_model = OpenAIEmbeddings()

The embedding model transforms text into numerical vectors that capture semantic meaning, allowing our system to understand and compare text based on meaning rather than just keywords.

### Step 5: Creating the Vector Database

In [6]:
vector_db = FAISS.from_documents(docs, embedding_model)

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. This code:

- Takes our document chunks
- Uses the embedding model to convert them to vectors
- Creates a searchable database

### Step 6: Saving Our Vector Database

In [7]:
# https://python.langchain.com/docs/integrations/vectorstores/faiss/#saving-and-loading
vector_db.save_local("faiss_index")

This allows us to save our work and reuse the vector database without having to recreate it from scratch each time.

### Step 7: Loading a Previously Created Vector Database

In [8]:
new_vector_store = FAISS.load_local(
    "faiss_index", embedding_model, allow_dangerous_deserialization=True
)

This is where we'd typically pick up if we already have a prepared database.

### Step 8: Creating a Retriever

In [27]:
retriever = new_vector_store.as_retriever(
    search_type="similarity", 
    search_kwargs = {"k": 3}
)

In [16]:
new_vector_store.similarity_search_with_score(query)

[(Document(id='dbbfd0f3-a413-4d32-9f5f-a43cd8c7ff0f', metadata={'source': '../test.txt'}, page_content='Decorative Styles:\nGeometric Style (900-700 BCE): Features abstract patterns and motifs.\nBlack-Figure Technique (700-500 BCE): Figures are painted in black silhouette against the natural red clay.\nRed-Figure Technique (530-300 BCE): The reverse of black-figure, allowing for greater detail and expression.\nPainting\nWhile few examples survive, Greek painting was highly esteemed, with influences seen in vase paintings and frescoes.\nTechniques: Included fresco, encaustic, and tempera.'),
  np.float32(0.26161647)),
 (Document(id='b3967691-beec-495d-bc51-276068f62ab5', metadata={'source': '../test.txt'}, page_content='Sculpture and Pottery\nGreek sculptors excelled in creating lifelike statues that captured the human form with remarkable realism and beauty. Works such as the Venus de Milo and the Discobolus exemplify the Greek pursuit of idealized proportions and expressive detail. Po

In [23]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    docs, scores = zip(*new_vector_store.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score

    return docs

The retriever performs the actual search:

- We're using similarity search (finding semantically similar content)
- We're retrieving the top 3 most relevant chunks for each query

### Step 9: Testing Our Retriever

In [28]:
queries = [
    "Can you give some decorative styles in ancient Greek life?",
    "Can you give the last United States election results?"
]

query = queries

retrievals = retriever.invoke(query)

TypeError: argument 'text': 'list' object cannot be converted to 'PyString'

Here we test our retriever with a query about ancient Greek decorative styles.

### Step 10: Examining the Results

In [25]:
retrievals

(Document(id='dbbfd0f3-a413-4d32-9f5f-a43cd8c7ff0f', metadata={'source': '../test.txt', 'score': np.float32(0.26161647)}, page_content='Decorative Styles:\nGeometric Style (900-700 BCE): Features abstract patterns and motifs.\nBlack-Figure Technique (700-500 BCE): Figures are painted in black silhouette against the natural red clay.\nRed-Figure Technique (530-300 BCE): The reverse of black-figure, allowing for greater detail and expression.\nPainting\nWhile few examples survive, Greek painting was highly esteemed, with influences seen in vase paintings and frescoes.\nTechniques: Included fresco, encaustic, and tempera.'),
 Document(id='b3967691-beec-495d-bc51-276068f62ab5', metadata={'source': '../test.txt', 'score': np.float32(0.27237347)}, page_content='Sculpture and Pottery\nGreek sculptors excelled in creating lifelike statues that captured the human form with remarkable realism and beauty. Works such as the Venus de Milo and the Discobolus exemplify the Greek pursuit of idealized 

This displays the top 3 most relevant document chunks from our database. Notice how they all relate to Greek decorative styles, even though they might not contain the exact wording of our query.

### Advanced Implementation: Complete RAG Pipeline

In [None]:
response_schema = [
    ResponseSchema(name="summary", description="A concise summary of the wikipedia page"),
    ResponseSchema(name="key_points", description="The key points relevant to the query"),
    ResponseSchema(name="wikipedia_reference", description="Relevant information retrieved from Wikipedia")
]

parser = StructuredOutputParser.from_response_schemas(response_schema)
format_instructions = parser.get_format_instructions()

context = "\n".join([doc.page_content for doc in retrievals])

prompt = PromptTemplate.from_template("""
You are an expert assistant. Based on the following context, generate a structured response:

Context: {context}
Wikipedia: {wiki_data}
{format_instructions}
""")

In [None]:
def load_and_chunk(file_path, chunk_size=500, chunk_overlap=100):
    print("Loading and Splitting the PDF Document...")

    loader = PyMuPDFLoader(file_path)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " "]
    )

    chunk = text_splitter.split_documents(documents)

    print(f"Number of chunks: {len(chunk)}")
    return chunk

In [None]:
def create_vector_database(chunks):
    print("Creating FAISS Index...")
    vector_db = FAISS.from_documents(chunks, embedding_model)
    print("FAISS Index Created")
    return vector_db

In [None]:
print(" Starting Complete RAG Demo ")

file_path = "../../ai-report.pdf"
chunks = load_and_chunk(file_path)
vector_db = create_vector_database(chunks)

queries = [
    "What examples of AI-driven solutions in tutoring are given?"
]

query = queries[0]

retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs = {"k": 3}
)

retrievals = retriever.get_relevant_documents(query)