The objective of this capstone project is to design and implement a fully local Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM) served on the desktop by a local runtime such as Ollama, LM Studio, or GPT4All.

The system ingests private documents (PDF or text), converts them into vector embeddings using Sentence Transformers, stores them in a local ChromaDB vector database, and lets users query the documents through a LangChain-based RAG pipeline.

Because every component runs on the local machine, private data never leaves it and inference works offline. Responses are grounded in the source documents by combining semantic retrieval with LLM generation.

1. Load private documents (PDF / TXT)
2. Split documents into overlapping chunks
3. Generate embeddings using Sentence Transformers
4. Store embeddings in local ChromaDB
5. Configure a local LLM (Ollama / LM Studio)
6. Build a retriever over the vector store
7. Inject retrieved context into a RAG prompt
8. Generate grounded answers using the LLM

"""
This section installs and imports all required libraries for building
a fully local Retrieval-Augmented Generation (RAG) system.

Key components:
- LangChain: Orchestration framework
- Sentence Transformers: Embedding generation
- ChromaDB: Local vector database
- Ollama / LM Studio: Desktop LLM inference
"""

Step 1: Install Ollama (One-Time Setup)

Note: This step is performed outside the notebook.

Download and install Ollama from the official site:

https://ollama.com


After installation, verify:

ollama --version

Step 2: Pull the Mistral Model
ollama pull mistral

Step 3: Run the Model Locally
ollama run mistral
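
Before moving into the notebook, it can help to confirm from Python that the Ollama server is reachable and that the model has been pulled. The following is a minimal sketch, assuming Ollama's default local address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists locally available models. The helper names `has_model` and `fetch_local_models` are hypothetical and not part of any library.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def has_model(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    # Ollama reports names with a tag suffix such as "mistral:latest".
    return any(name == model or name.split(":")[0] == model for name in names)


def fetch_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return the parsed model list, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    payload = fetch_local_models()
    if payload is None:
        print("Ollama server not reachable; is it running?")
    else:
        print("mistral available:", has_model(payload, "mistral"))
```

If the check reports that the server is unreachable, starting the Ollama application (or running `ollama run mistral` in a terminal) is usually the first thing to try.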

In [1]:
"""
This module initializes a locally hosted LLM through Ollama
(LM Studio or another local runtime could be substituted).

Why local LLM?
- Ensures data privacy
- Avoids external API costs
- Enables offline inference

The LLM is later used only for generation,
while retrieval is handled by ChromaDB.
"""
from langchain_community.llms import Ollama

llm = Ollama(
    model="mistral",
    temperature=0.2
)




In [2]:
"""
Loads private documents (PDF or TXT) from disk.

The loader abstracts file format differences and converts
each document into LangChain Document objects, preserving metadata
such as page numbers for traceability.
"""


from langchain_community.document_loaders import PyPDFLoader, TextLoader

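def load_documents(path: str):
    # Choose a loader from the file extension (case-insensitive).
    if path.lower().endswith(".pdf"):
        loader = PyPDFLoader(path)   # yields one Document per PDF page
    else:
        loader = TextLoader(path)    # yields the whole file as one Document
    return loader.load()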

docs = load_documents("attention.pdf")
print(f"Loaded {len(docs)} pages")


Loaded 15 pages
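
To make the loader's output concrete, here is a minimal sketch of what a text loader produces: a list of objects that carry the text plus a metadata dictionary. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for illustration, not LangChain classes; the real `Document` has the same two fields (`page_content`, `metadata`) but more behavior.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list:
    """Read a UTF-8 text file into a one-element list with source metadata."""
    text = Path(path).read_text(encoding="utf-8")
    # Keeping the source path lets an answer be traced back to its file.
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

A PDF loader follows the same idea but typically returns one such object per page, with the page number added to the metadata.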


In [3]:
"""
Splits documents into overlapping chunks to balance:
- Context preservation
- Embedding quality
- Retrieval accuracy

Chunk overlap ensures that important information
is not lost across chunk boundaries.
"""


from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " "]
)

chunks = splitter.split_documents(docs)
print(f"Total Chunks: {len(chunks)}")


Total Chunks: 66
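
The effect of `chunk_size` and `chunk_overlap` can be seen in a much simpler setting. The sketch below is a plain fixed-width sliding window, not the recursive separator-based algorithm that `RecursiveCharacterTextSplitter` uses; the function name `chunk_text` is hypothetical. It only illustrates how overlap makes neighbouring chunks share text at their boundary.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

With a window of 4 and an overlap of 1, the last character of each chunk is repeated as the first character of the next, so a sentence that straddles a boundary is still partly visible in both chunks.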


In [4]:
"""
Generates dense vector embeddings using a Sentence Transformer model.

Why Sentence Transformers?
- Lightweight and fast
- High semantic similarity performance
- Suitable for local execution

Each chunk is converted into a numerical vector
for similarity-based retrieval.
"""


from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
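
Once text is represented as vectors, "semantic similarity" reduces to a geometric comparison. The sketch below computes cosine similarity on small hand-written vectors using only the standard library. The vectors are toy values chosen for illustration, not real model output (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 1.0]        # toy "embedding" of a query
    close = [1.0, 0.0]        # partly aligned with the query
    unrelated = [-1.0, 1.0]   # perpendicular to the query
    print(round(cosine_similarity(query, close), 4))
    print(round(cosine_similarity(query, unrelated), 4))
```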




In [5]:


"""
Stores embeddings in a local ChromaDB vector database.

Advantages:
- Persistent local storage
- Fast similarity search
- No cloud dependency

This database acts as the long-term memory
for the RAG system.
"""
from langchain_community.vectorstores import Chroma

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_store"
)

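# Chroma 0.4+ writes to persist_directory automatically; this explicit call
# is only needed on older versions and triggers a deprecation warning on newer ones.
vector_db.persist()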




In [6]:
"""
Creates a semantic retriever over the vector database.

The retriever fetches the top-k closest chunks by embedding
distance (Chroma defaults to L2 distance unless the collection
is configured for cosine). These chunks are later injected
into the LLM prompt as context.
"""

retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
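
Conceptually, top-k retrieval ranks every stored chunk by its distance to the query vector and keeps the k closest. The sketch below does this by brute force over a tiny in-memory list. It assumes squared Euclidean distance and toy 2-dimensional vectors, and the names `squared_l2` and `top_k` are hypothetical. A real vector store uses an approximate index rather than scanning every item.

```python
import heapq


def squared_l2(a, b) -> float:
    """Squared Euclidean distance; smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, indexed_chunks, k=4, distance=squared_l2):
    """Return the texts of the k chunks closest to query_vec.

    indexed_chunks is a list of (text, vector) pairs.
    """
    nearest = heapq.nsmallest(
        k, indexed_chunks, key=lambda item: distance(query_vec, item[1])
    )
    return [text for text, _ in nearest]


if __name__ == "__main__":
    index = [
        ("chunk about encoders", [0.0, 0.0]),
        ("chunk about attention", [1.0, 1.0]),
        ("chunk about training", [5.0, 5.0]),
    ]
    print(top_k([0.9, 0.9], index, k=2))
```

Passing a different `distance` function is enough to switch metrics, which mirrors how the choice of metric is a configuration detail of the store rather than of the retriever.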


In [7]:
"""
Defines a Retrieval-Augmented Generation prompt.

The prompt explicitly instructs the LLM to:
- Answer strictly using retrieved context
- Avoid hallucinations
- Return 'Not found in the document' if context is missing

This is critical for trustworthy RAG behavior.
"""

from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
You are a highly accurate AI assistant.
Answer the question strictly using the context provided.
If the answer is not in the context, say "Not found in the document".

Context:
{context}

Question:
{question}

Answer:
""")

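To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` on a copy of the template text rather than `ChatPromptTemplate`, and `build_prompt` is a hypothetical helper written for illustration. The placeholder shown for an empty retrieval result is also an assumption, not something the notebook does.

```python
RAG_TEMPLATE = """You are a highly accurate AI assistant.
Answer the question strictly using the context provided.
If the answer is not in the context, say "Not found in the document".

Context:
{context}

Question:
{question}

Answer:
"""


def build_prompt(context_chunks, question: str) -> str:
    """Fill the two template slots with retrieved text and the user question."""
    # Assumption: make an empty retrieval result explicit instead of leaving a blank.
    context = "\n\n".join(context_chunks) if context_chunks else "(no context retrieved)"
    return RAG_TEMPLATE.format(context=context, question=question.strip())


if __name__ == "__main__":
    print(build_prompt(["First chunk.", "Second chunk."], "What is stated first?"))
```

Printing the filled prompt is a quick way to check whether a wrong answer comes from poor retrieval (the needed passage is missing from the context) or from the model ignoring the instructions.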

In [8]:
"""
Builds a modern LangChain Expression Language (LCEL) pipeline.

Pipeline steps:
1. User query
2. Context retrieval from ChromaDB
3. Context formatting
4. Prompt injection
5. LLM generation

LCEL composes each stage with the | operator, so individual
stages can be swapped or extended without rewriting the rest.
"""

from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
)
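
The data flow of this chain can be traced with ordinary functions. The sketch below is not LCEL; it uses hypothetical stubs (`fake_retrieve`, `fake_prompt`, `fake_llm`) in place of the retriever, prompt template, and model to show the order in which values move through the pipeline: the question fans out into a context branch and a pass-through branch, both feed the prompt, and the prompt feeds the model.

```python
def format_docs(chunks) -> str:
    """Join retrieved chunk texts into one context string."""
    return "\n\n".join(chunks)


def run_pipeline(question, retrieve, build_prompt, generate):
    """Plain-function equivalent of: {context, question} -> prompt -> model."""
    # Fan-out: the same question is used for retrieval and passed through unchanged.
    inputs = {
        "context": format_docs(retrieve(question)),
        "question": question,
    }
    prompt = build_prompt(inputs)
    return generate(prompt)


# Hypothetical stubs standing in for the real retriever, prompt, and LLM.
def fake_retrieve(question):
    return ["chunk one", "chunk two"]


def fake_prompt(inputs):
    return f"Context:\n{inputs['context']}\n\nQuestion:\n{inputs['question']}"


def fake_llm(prompt):
    return "ANSWER based on: " + prompt


if __name__ == "__main__":
    print(run_pipeline("What is X?", fake_retrieve, fake_prompt, fake_llm))
```

Replacing any one stub with a real component leaves the others untouched, which is the same property the `|` composition provides.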


In [9]:
"""
Executes the RAG pipeline for user queries.

This function acts as the user-facing interface
for querying private documents and receiving
grounded, context-aware answers.
"""

def ask_rag(question: str):
    return rag_chain.invoke(question)




In [10]:
response = ask_rag(
    " Explain the role of the encoder and decoder stacks."
)
print(response)

 The encoder and decoder stacks in the Transformer model architecture play crucial roles. The encoder, composed of a stack of N = 6 identical layers, takes an input sequence and transforms it into a contextualized representation that can be understood by the model. Each layer has two sub-layers: the first is a multi-head self-attention mechanism that allows the model to understand the relationships between different parts of the input sequence, and the second is a simple, position-wise fully connected feed-forward network that helps the model learn more complex patterns in the data. A residual connection and layer normalization are employed around each sub-layer to help with training and improve the model's performance.

On the other hand, the decoder also has a stack of identical layers, similar to the encoder. Its role is to generate an output sequence based on the contextualized representation provided by the encoder. The decoder uses the same sub-layers as the encoder but processes

In [11]:
response = ask_rag(
    " How does multi-head attention differ from single-head attention?"
)
print(response)

 The context provided does not detail the difference between multi-head attention and single-head attention. However, in general, multi-head attention allows a model to attend to information from different positions within an input sequence simultaneously, while single-head attention only attends to one position at a time. This allows multi-head attention to capture more complex patterns and relationships within the data.


In [12]:
response = ask_rag(
    " Why does removing recurrence improve scalability in sequence modeling?"
)
print(response)

 The provided context does not explicitly state why removing recurrence improves scalability in sequence modeling. However, it implies that the model is auto-regressive [10], which means at each step, the model consumes the previously generated symbols as additional input when generating the next. This structure might contribute to improved scalability compared to models with recurrence, as it potentially reduces the number of computations required for long sequences by not having to maintain a hidden state over time. However, this is an inference based on the context and not a direct statement.
