### Building a Local Knowledge Assistant with LangChain & OpenAI

In this project, I implemented a `Retrieval-Augmented Generation (RAG)` pipeline using `LangChain`, `Chroma`, and `OpenAI embeddings`. The system retrieves relevant chunks of information from my document collection to provide more accurate and context-aware answers. By combining retrieval with generation, I can leverage a smaller, pre-trained LLM while still achieving detailed and precise responses, without the need for costly fine-tuning. This setup allows me to experiment with AI-driven Q&A in a practical, hands-on way.

This project was inspired by [RAG + Langchain Python Project: Easy AI/Chat For Your Docs](https://www.youtube.com/watch?v=tcqEUSNCn8I) and adapted for personal learning and experimentation.

### Why RAG Matters

`Fine-tuning` an LLM can be extremely **expensive** and **resource-heavy** — it requires access to large compute clusters, massive datasets, and careful optimization. Beyond computational costs, fine-tuning introduces **deployment complexity** (maintaining multiple model versions), **knowledge staleness** (requires retraining to update information), and **catastrophic forgetting** risks where the model loses general capabilities while adapting to specific domains.

While techniques like `LoRA (Low-Rank Adaptation)` make fine-tuning more **efficient** by training only low-rank decomposition matrices rather than full weight updates, they still **demand GPU resources**, **model-specific expertise**, and **careful hyperparameter tuning**. More importantly, fine-tuning bakes knowledge into model weights, making it **opaque and non-auditable** — you can't easily trace which training examples influenced a specific output or update individual facts without full retraining.

`RAG (Retrieval-Augmented Generation)`, on the other hand, is a **lightweight yet powerful alternative** that addresses these fundamental limitations. Instead of changing the model itself, it **augments the prompt dynamically** by retrieving relevant **external knowledge** from vector databases at inference time. This architectural choice provides several critical advantages: **knowledge remains external and auditable** (you can inspect exactly what context was retrieved), **updates are instantaneous** (add new documents without retraining), **citations are traceable** (ground responses in source material), and **domain adaptation requires no GPU compute** (just embed and index your documents).

RAG also offers **better separation of concerns**: the LLM handles reasoning and language generation while the retrieval system manages domain knowledge. This makes systems more **maintainable**, **debuggable**, and **cost-effective** — you pay only for embedding and inference costs rather than full training runs. For production applications requiring up-to-date information, compliance with data lineage requirements, or rapid iteration on domain knowledge, RAG often proves more practical than fine-tuning. The trade-off is handling **retrieval quality** (chunking strategies, embedding models, similarity metrics) and **context window management**, but these challenges are generally more tractable than the complexities of fine-tuning at scale.
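The core idea of augmenting the prompt at inference time can be illustrated with a toy, dependency-free sketch. It is not the pipeline built in this project: retrieval here ranks documents by simple word overlap instead of embeddings, and the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most tokens with the query.

    Word overlap is a stand-in for embedding similarity (an assumption
    made to keep the sketch self-contained).
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Place the retrieved context ahead of the question; the model itself is unchanged."""
    context = "\n\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


docs = [
    "Chroma stores embeddings on disk.",
    "LoRA trains low-rank matrices.",
    "The White Rabbit has pink eyes.",
]
prompt = build_prompt("What color are the White Rabbit's eyes?", docs, k=1)
print(prompt)
```

Updating what the system "knows" amounts to editing `docs`; no model weights are touched, which is the property the comparison with fine-tuning relies on.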

### Why LangChain?
`LangChain` is an orchestration framework that dramatically **simplifies building LLM applications** by providing **high-level abstractions for common patterns**. Instead of manually handling API calls, text chunking, embeddings, and vector database operations, LangChain condenses what would typically take **100+ lines** of integration code into **10-15 lines** of declarative operations. It offers **unified interfaces** across different LLM providers and vector databases, allowing you to swap components without rewriting your application, while including production-ready features like retry logic, error handling, and observability.

For a RAG exploration project, LangChain **accelerates development from days to hours**, letting you focus on understanding core concepts like chunking strategies and retrieval methods rather than API plumbing. It **provides battle-tested components for document loading, text splitting, embeddings, and retrieval chains with integrations for major vector stores and LLM providers**, making it ideal for rapid prototyping and experimentation.

However, these abstractions come with **trade-offs**. The framework can **obscure what's happening under the hood**, which makes debugging harder, and its conventions take **time to learn**. The abstraction layers can add **performance overhead** that becomes problematic for high-throughput production systems, and the framework has gone through **breaking changes across versions**. For production at scale, teams often replace LangChain components with leaner implementations where they need finer control.

### Why Chroma?

`Vector databases` are essential for RAG systems, but many solutions like `Pinecone` or `Weaviate` require **external infrastructure**, **API dependencies**, and **ongoing costs** that can complicate development and deployment workflows.

While cloud-based vector databases offer **scalability** and **managed services**, they introduce **network latency**, **vendor lock-in**, and **data privacy concerns** — your embeddings and documents live on third-party servers, which may not be acceptable for sensitive applications or offline use cases.

`Chroma`, on the other hand, is a **lightweight yet powerful embedded database** designed specifically for AI applications. It runs **locally in-process** with your Python application, requiring **no separate server** or external services. Chroma can keep embeddings and metadata in memory or persist them to disk, making it well suited to development, prototyping, and production deployments where you need full control over your data.

Chroma also offers **developer-friendly simplicity**: minimal setup (just `pip install chromadb`), intuitive APIs for adding and querying documents, built-in support for metadata filtering, and seamless integration with LangChain and other frameworks. The architecture provides **flexible deployment options**: start with local embedded mode for development, then scale to client-server mode for production if needed. For exploration projects and applications requiring **fast iteration**, **data locality**, and **zero infrastructure overhead**, Chroma eliminates the operational complexity of managed vector databases while maintaining production-ready performance. The trade-off is handling **horizontal scaling** and **high-availability** yourself if you outgrow single-node deployments, but for most RAG applications, Chroma's simplicity and local-first design make it the ideal starting point.

### Implementation


In [12]:
# Adapted from "RAG + Langchain Python Project: Easy AI/Chat For Your Docs"
# https://www.youtube.com/watch?v=tcqEUSNCn8I

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
import openai 
from dotenv import load_dotenv
import os
import shutil

The project begins by **converting the target documents into LangChain** `Document` **objects** using the `DirectoryLoader` class. This preserves both the **text** and **metadata** such as the source path, which LangChain's downstream components rely on. The start index that appears in the chunk metadata later is added by the text splitter (via `add_start_index=True`), not by the loader.

In [13]:
# Convert the raw files into Document
def load_documents(DATA_PATH):
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents

Next, the documents are **split into smaller chunks**. This is necessary primarily for **retrieval precision in RAG systems**. Smaller chunks allow the system to retrieve **only the most relevant information** rather than entire documents, improving semantic similarity matching. **Large chunks** can **dilute semantic meaning** and introduce **irrelevant context**, while **smaller chunks** enable more **focused retrieval**. Additionally, chunking helps manage **LLM context window limitations**, though retrieval quality is the primary consideration.

The `RecursiveCharacterTextSplitter` controls how text is divided using two key parameters: `chunk_size` and `chunk_overlap`. `chunk_size` defines the *maximum length* of each chunk, while `chunk_overlap` *repeats a portion* of the previous chunk in the next one to **reduce the chance of splitting related information** across chunk boundaries. Choosing appropriate values for these parameters directly impacts retrieval quality, and thus the overall performance of the RAG system.
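To make the interaction between the two parameters concrete, here is a minimal sliding-window sketch. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter tries a list of separators such as paragraph and line breaks before falling back to characters); `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Each new chunk starts (chunk_size - chunk_overlap) characters after
    the previous one, so the last chunk_overlap characters are repeated.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_size=4` and `chunk_overlap=2`, every chunk shares its last two characters with the start of the next one. A larger overlap produces more chunks (and more embedding calls) for the same text, which is the cost side of the trade-off.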

In [14]:
# Split the text into small chunks
def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,         # maximum number of characters per chunk
        chunk_overlap=100,      # number of characters shared between consecutive chunks
        length_function=len,    # how chunk length is measured (characters here; could be tokens)
        add_start_index=True,   # record each chunk's start offset in its metadata
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.\n")

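    # Preview one chunk as a sanity check (guarded in case there are fewer than 11 chunks).
    if len(chunks) > 10:
        print("Print chunk 10 content: ")
        document = chunks[10]
        print(f"Content: \"{document.page_content}\"")
        print(f"Metadata: {document.metadata}\n")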

    return chunks

Once the documents are split, each chunk is converted into a **vector embedding** using OpenAI's `text-embedding-ada-002` model. These **embeddings** are then stored in a `Chroma` vector database along with the **original chunk text and metadata**. The `Chroma` database enables efficient **similarity search** by **comparing query embeddings against stored chunk embeddings**.

It is crucial to use the **same embedding model** for both **indexing chunks** and **embedding queries** because different models produce **incompatible vector representations** that exist in different semantic spaces, making retrieval unreliable or impossible. For this project, I use the default embedding model `text-embedding-ada-002`.
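One way to guard against mixing models is to record which model built the index and check it at query time. The sketch below is an assumption about how such a guard could look, not something Chroma or LangChain does for you; `IndexManifest` and `check_query_embedding` are hypothetical names. Note that a dimension check alone is not enough: `text-embedding-ada-002` and `text-embedding-3-small` both return 1536-dimensional vectors by default, so a mismatch between them would not raise any shape error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical record saved next to the vector store at indexing time."""
    embedding_model: str
    dimension: int


def check_query_embedding(manifest: IndexManifest, query_model: str, query_vector: list[float]) -> None:
    """Raise ValueError if the query was embedded incompatibly with the index."""
    if query_model != manifest.embedding_model:
        raise ValueError(
            f"Index was built with {manifest.embedding_model!r}, "
            f"but the query was embedded with {query_model!r}."
        )
    if len(query_vector) != manifest.dimension:
        raise ValueError(
            f"Expected a {manifest.dimension}-dimensional vector, got {len(query_vector)}."
        )


manifest = IndexManifest(embedding_model="text-embedding-ada-002", dimension=1536)

# Same model and dimension: the check passes without raising.
check_query_embedding(manifest, "text-embedding-ada-002", [0.0] * 1536)
```

Comparing the model name catches the case where the vectors have the same shape but live in different semantic spaces.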

In [15]:
# Apply vector embedding to chunks and save the embedding vector along with the content and metadata to database
def save_to_chroma(chunks: list[Document], CHROMA_PATH):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    db = Chroma.from_documents(
        documents=chunks, embedding=OpenAIEmbeddings(model="text-embedding-ada-002"), persist_directory=CHROMA_PATH
    )
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")
    return db

Main

In [None]:
# Load environment variables from .env, including the OpenAI API key; LangChain reads OPENAI_API_KEY automatically
load_dotenv()

# Set OpenAI API key 
# openai.api_key = os.environ['OPENAI_API_KEY']
client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Set the paths for the source files and for the persisted Chroma database
CHROMA_PATH = "chroma"
DATA_PATH = "data/books"

# Load the documents, split them into chunks, and save the chunks with their embeddings to the database
documents = load_documents(DATA_PATH)
chunks = split_text(documents)
db = save_to_chroma(chunks, CHROMA_PATH)

Split 1 documents into 818 chunks.

Print chunk 10 content: 
Content: "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
Metadata: {'source': 'data\\books\\alice_in_wonderland.md', 'start_index': 1653}

Saved 818 chunks to chroma.


The **prompt template** determines how **retrieved context** and the **user query** are **structured** for the **language model**. While prompt quality is difficult to quantify precisely, adhering to **established prompting principles significantly improves model responses**. Effective LLM prompting requires **clear, specific instructions**, much like giving detailed directions to a new team member. Well-defined prompts guide the model toward desired outputs, while vague or ambiguous instructions make outputs less predictable and often lead to irrelevant or inaccurate responses.

In [18]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

Once the Chroma database and prompt template are configured, the system processes a user query in three steps. First, the query is embedded using the **same embedding model** applied to the document chunks, ensuring vector space consistency. Next, the system retrieves the **top-k most similar chunks** from the vector database, which uses L2 (Euclidean) distance by default. Finally, the retrieved chunks are combined with the original query according to the prompt template and passed to the LLM (`gpt-3.5-turbo`, the default) for response generation.
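The retrieval step can be sketched as a brute-force nearest-neighbour search. This is only an illustration under simplifying assumptions: the vectors are tiny hand-written examples, and real vector stores typically use approximate indexes instead of scanning every entry. `l2_distance` and `top_k` are hypothetical helpers.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int) -> list[tuple[float, str]]:
    """Return the k (distance, text) pairs closest to the query, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Toy 2-dimensional "embeddings" (made up for the example).
index = [
    ("rabbit", [1.0, 0.0]),
    ("hatter", [0.0, 1.0]),
    ("queen", [0.9, 0.1]),
]
nearest = top_k([1.0, 0.0], index, k=2)
print([text for _, text in nearest])  # ['rabbit', 'queen']
```

With a distance metric, smaller values mean closer matches; the relevance scores used later in the notebook follow the opposite convention, where higher means more similar.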

Implementing **quality safeguards** is essential for production systems. This includes **rejecting empty** or **malformed queries** and filtering results when similarity scores fall **below a confidence threshold**, as low-similarity retrievals typically indicate insufficient relevant context and lead to unreliable outputs. With these components in place, the RAG pipeline can effectively retrieve pertinent information and generate well-informed responses.
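A minimal sketch of these safeguards, assuming results arrive as `(item, score)` pairs with scores in the range 0 to 1 where higher means more relevant. The helper names, the 500-character limit, and the default threshold of 0.7 are illustrative choices, not recommended values.

```python
from typing import Optional

MAX_QUERY_CHARS = 500  # hypothetical upper bound on query length


def clean_query(raw: object) -> Optional[str]:
    """Return a trimmed query string, or None if the input is unusable."""
    if not isinstance(raw, str):
        return None
    query = raw.strip()
    if not query or len(query) > MAX_QUERY_CHARS:
        return None
    return query


def filter_by_score(results: list[tuple[object, float]], min_score: float = 0.7) -> list[tuple[object, float]]:
    """Keep only the results whose relevance score meets the threshold."""
    return [(item, score) for item, score in results if score >= min_score]


print(clean_query("   How does Alice meet the Mad Hatter?  "))
print(clean_query("    "))  # None: nothing left after trimming
print(filter_by_score([("chunk A", 0.81), ("chunk B", 0.65)]))  # only chunk A survives
```

If `clean_query` returns `None` or `filter_by_score` returns an empty list, the pipeline can respond with a fallback message instead of calling the LLM with weak context.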

In [19]:
# query_text = input("Enter your query: ")
query_text = "How does Alice meet the Mad Hatter?"

# Search the DB. Scores are relevance scores normalized to [0, 1]; higher means more similar.
results = db.similarity_search_with_relevance_scores(query_text, k=3)
if len(results) == 0 or results[0][1] < 0.7:
    # Stop here: without sufficiently relevant context the answer would be unreliable.
    print("Unable to find matching results.")
else:
    print("Top k similarity:")
    for i, (doc, score) in enumerate(results):
        # The "L2 similarity" label is kept to match the recorded output below;
        # the value printed is the normalized relevance score, not a raw L2 distance.
        print(f"Top {i + 1}:\nContent: {doc.page_content}\nL2 similarity: {score}\n")

    # Generate the prompt template with context and query
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(prompt)

    # Implement the LLM and feed it with the prompt
    model = ChatOpenAI(model="gpt-3.5-turbo")
    response_text = model.invoke(prompt)

    # Print the formatted response
    sources = [doc.metadata.get("source", None) for doc, _score in results]
    formatted_response = f"Response: {response_text.content}\nSources: {sources}"
    print(formatted_response)


Top k similarity:
Top 1:
Content: So Alice began telling them her adventures from the time when she first saw the White Rabbit. She was a little nervous about it just at first, the two creatures got so close to her, one on each side, and opened their eyes and mouths so very wide, but she gained courage as she went on. Her listeners
L2 similarity: 0.8063262538137091

Top 2:
Content: “In that direction,” the Cat said, waving its right paw round, “lives a Hatter: and in that direction,” waving the other paw, “lives a March Hare. Visit either you like: they’re both mad.”

“But I don’t want to go among mad people,” Alice remarked.
L2 similarity: 0.8054134795155087

Top 3:
Content: “Is that the way you manage?” Alice asked.

The Hatter shook his head mournfully. “Not I!” he replied. “We quarrelled last March—just before he went mad, you know—” (pointing with his tea spoon at the March Hare,) “—it was at the great concert given by the Queen of Hearts, and I had to sing
L2 similarity: 0.790282

### Evaluation

In [None]:
from ragas.llms import llm_factory
# Aliased so it does not shadow the langchain_openai OpenAIEmbeddings imported earlier
from ragas.embeddings import OpenAIEmbeddings as RagasOpenAIEmbeddings

# llm_factory() needs an OpenAI client instance. Passing the LangChain ChatOpenAI
# object, as in llm_factory(model), raised:
#   ValueError: llm_factory() requires a client instance. Text-only mode has been removed.
# The call below follows the migration hint printed with that error
# (for more details: https://docs.ragas.io/en/latest/llm-factory).
client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
generator_llm = llm_factory('gpt-4o-mini', client=client)

# Pass the client instance here as well, not the `openai` module
generator_embeddings = RagasOpenAIEmbeddings(client=client)