# Toy Experiment 2: Build a RAG Application

This notebook differs slightly from `single-webpage-rag-application.ipynb`: it takes multiple offline PDF documents as input.

### Installation

In [None]:
%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
%pip install openai
%pip install -qU "langchain[openai]"
%pip install -qU langchain-openai
%pip install -qU langchain-core

In [None]:
%pip install -qU pdfplumber
%pip install semantic-text-splitter

### Indexing

#### Loading

In [None]:
from pathlib import Path
import pdfplumber

# locate the documents directory
try:
    base_dir = Path(__file__).resolve().parent
except NameError:
    base_dir = Path.cwd()

docs_dir = (base_dir / "documents").resolve()
assert docs_dir.exists(), f"Folder not found: {docs_dir}"

pdf_paths = sorted(docs_dir.glob("*.pdf"))
assert pdf_paths, f"No PDFs found in: {docs_dir}"

all_documents = []  # list of dicts: {"source": str, "page": int, "text": str}

for pdf_path in pdf_paths:
    print(f"Processing {pdf_path.name}...")
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            text = text.strip()
            if text:
                all_documents.append({
                    "source": pdf_path.name,
                    "page": i,
                    "text": text
                })

total_chars = sum(len(d["text"]) for d in all_documents)
print(f"Loaded {len(all_documents)} page-snippets from {len(pdf_paths)} PDFs.")
print(f"Total characters: {total_chars}")
print(all_documents[0]["text"][:500] if all_documents else "No text extracted.")

Processing amsterdam-facts.pdf...
Processing amsterdam-wikipedia.pdf...


Could get FontBBox from font descriptor because None cannot be parsed as 4 floats


Loaded 49 page-snippets from 2 PDFs.
Total characters: 118625
Amsterdam, city and port, western Netherlands, located on the IJsselmeer
and connected to the North Sea. It is the capital and the principal commercial
and financial centre of the Netherlands.
To the scores of tourists who visit each year, Amsterdam is known for its
historical attractions, for its collections of great art, and for the distinctive
colour and flavour of its old sections, which have been so well preserved.
However, visitors to the city also see a crowded metropolis beset by
environ


#### Splitting

In [4]:
from semantic_text_splitter import TextSplitter

max_characters = 1000
splitter = TextSplitter(max_characters, trim=False)

chunks = []
metadatas = []

for doc in all_documents:
    sub_chunks = splitter.chunks(doc["text"])
    chunks.extend(sub_chunks)
    # Build a separate metadata dict per chunk; `[d] * n` would repeat one shared dict object
    metadatas.extend({"source": doc["source"], "page": doc["page"]} for _ in sub_chunks)

print(f"Split {len(all_documents)} page-snippets into {len(chunks)} chunks.")

Split 49 page-snippets into 145 chunks.


#### Storing

In [5]:
import getpass
import os

from langchain_openai import OpenAIEmbeddings

# Prompt for the key only if it is not already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [6]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
document_ids = vector_store.add_texts(chunks, metadatas=metadatas)
print(f"Added {len(document_ids)} chunks to the vector store.")

Added 145 chunks to the vector store.
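
Conceptually, the vector store keeps one embedding vector per chunk and later ranks chunks by how similar their vectors are to the query vector. The sketch below illustrates that idea with cosine similarity on hand-written 3-dimensional vectors. It is not LangChain's implementation: `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names, and the vectors are made up for illustration rather than produced by an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the texts of the k stored (text, vector) pairs closest to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy data: made-up vectors standing in for real embeddings (assumption)
toy_store = [
    ("canals", [1.0, 0.0, 0.0]),
    ("stock exchange", [0.0, 1.0, 0.0]),
    ("canal houses", [0.9, 0.1, 0.0]),
]
toy_hits = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(toy_hits)
```

With real embeddings the vectors have many more dimensions, but the ranking step works the same way.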


### Retrieval and Generation

In [7]:
question = "What is an interesting fact about Amsterdam?"

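# Retrieve the k most similar chunks; k=4 is only a starting point and can be tuned
retrieved = vector_store.similarity_search(question, k=4)

# Number each retrieved chunk so the model can cite it as [1], [2], ...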
docs_content = "\n\n".join(
    f"[{i+1}] ({d.metadata.get('source')} p.{d.metadata.get('page')}):\n{d.page_content}"
    for i, d in enumerate(retrieved)
)

In [10]:
from langchain_core.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks.
Use ONLY the reference numbers provided in the context (e.g., [1], [2]) 
when citing information. Do not invent new references.
Answer in three sentences maximum.

Question: {question}

Context:
{context}

Answer:"""
prompt = PromptTemplate.from_template(template)
rendered = prompt.format(question=question, context=docs_content)

from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
answer = llm.invoke(rendered).content
print(answer)

An interesting fact about Amsterdam is that it is often referred to as the "Venice of the North" due to its extensive and well-preserved canal system, which dates back to the city's 17th-century Golden Age [1][4]. Additionally, the Amsterdam Stock Exchange, founded in 1602, is considered the oldest "modern" securities market in the world [4]. The city's combination of historical significance and modern financial prowess makes it a unique destination [2].


Let's check that the model is not hallucinating references, i.e. that every cited number maps back to one of the retrieved chunks.

In [11]:
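import re

# Map each cited [n] back to the retrieved chunk it refers to
refs = re.findall(r"\[(\d+)\]", answer)
for r in refs:
    idx = int(r) - 1
    if 0 <= idx < len(retrieved):
        meta = retrieved[idx].metadata
        print(f"[{r}] -> {meta['source']} (page {meta['page']})")
    else:
        # A number outside the retrieved range is a hallucinated reference
        print(f"[{r}] -> no matching retrieved chunk")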

[1] -> amsterdam-wikipedia.pdf (page 1)
[4] -> amsterdam-wikipedia.pdf (page 1)
[4] -> amsterdam-wikipedia.pdf (page 1)
[2] -> amsterdam-facts.pdf (page 1)
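
A minimal sketch of the same check as a reusable function, returning the valid and the out-of-range reference numbers separately. The name `check_citations` and the sample string are hypothetical and used only for illustration. Note that this only verifies that a cited number exists among the retrieved chunks, not that the chunk actually supports the claim.

```python
import re


def check_citations(answer_text, num_sources):
    """Split cited reference numbers into in-range and out-of-range lists."""
    # Collect distinct numbers appearing as [n] in the answer
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer_text)})
    valid = [n for n in cited if 1 <= n <= num_sources]
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    return valid, invalid


# Made-up answer text citing one reference that was never provided
sample = "First claim [1][4]. Second claim [7]. Repeated claim [4]."
valid_refs, invalid_refs = check_citations(sample, num_sources=4)
print("valid:", valid_refs)
print("out of range:", invalid_refs)
```

In the notebook, this could be called as `check_citations(answer, len(retrieved))`.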
