# 🔍 Document Question Answering with Local LLMs and Embeddings

This notebook demonstrates how to build a **local, privacy-friendly document question answering (QA) system** using [LangChain](https://www.langchain.com/), [Ollama](https://ollama.com/), and [Chroma](https://www.trychroma.com/). It loads a PDF document, splits it into overlapping chunks, generates embeddings locally with the `nomic-embed-text` model, and stores them in a persistent Chroma vector database. A retriever fetches the chunks most relevant to a query, and a locally running LLM (e.g., `gemma3:1b`) generates an answer from them, returned together with the source documents it was given. Because the workflow calls no cloud APIs, it suits secure and offline environments.
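
To make the split, embed, and retrieve steps concrete, here is a toy sketch in plain Python. It is only an illustration under loud assumptions: `split_text`, `embed`, `cosine`, and `retrieve` are hypothetical helpers, the "embedding" is a simple word-count vector rather than `nomic-embed-text`, and the ranking is an in-memory sort rather than a Chroma query. The real splitter also prefers to break on separators such as paragraphs, which this fixed-window version ignores.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "ReAct interleaves reasoning traces with actions.",
    "Chroma stores embeddings on disk.",
    "Actions let the model query external tools.",
]
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
print(retrieve("reasoning and actions", chunks, k=2))
```

The pipeline in this notebook follows the same shape, with a learned embedding model and a persistent vector store in place of these toy pieces.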


In [6]:
# === Imports ===
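# Note: recent LangChain releases ship the loader and splitter in separate packages
# (langchain-community and langchain-text-splitters); the old langchain.* paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter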
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain.chains import RetrievalQA

# Import the necessary modules for working with documents, splitting text,
# creating embeddings, managing vector storage, and performing question-answering.

# === Load a PDF file ===
loader = PyPDFLoader("./data/react-paper.pdf")  # Load the PDF file specified in the path
docs = loader.load()  # Read the content of the PDF and prepare it for processing

# === Split the document into manageable chunks ===
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Set the maximum size of each chunk to 1000 characters
    chunk_overlap=200  # Ensure 200 characters overlap between chunks for context
)
splits = text_splitter.split_documents(docs)  # Split the loaded document into smaller chunks

# === Use Ollama's nomic-embed-text to create embeddings ===
embeddings = OllamaEmbeddings(model="nomic-embed-text:latest")  # Convert text into numerical representations (embeddings)

# === Create a Chroma vectorstore with persistence ===
persist_directory = './data/db/chroma/'  # Set the directory to save the vectorstore for later use

# Embed the chunks and store them, with their text, in a Chroma vectorstore on disk.
# Note: re-running this step against the same directory adds the same chunks again.
vectorstore = Chroma.from_documents(
    documents=splits,  # The document chunks to store
    embedding=embeddings,  # The embedding function to use for processing text
    persist_directory=persist_directory  # Directory for storing data persistently
)

# === Load the persisted Chroma vectorstore ===
vector_store = Chroma(
    persist_directory=persist_directory,  # Load the saved vectorstore from the specified directory
    embedding_function=embeddings  # Use the same embedding function as before
)

# === Set up a retriever from the vectorstore ===
retriever = vector_store.as_retriever(search_kwargs={"k": 2})  # Configure the retriever to return the top 2 most relevant results

# === Use Ollama's local chat model ===
llm = ChatOllama(model="gemma3:1b", temperature=0.0)  # Initialize a local chat model; temperature 0.0 keeps responses as repeatable as possible

# === Set up a RetrievalQA chain for Q&A over documents ===
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # Use the language model for generating answers
    chain_type="stuff",  # "stuff" places all retrieved chunks into a single prompt for the LLM
    retriever=retriever,  # The retriever to fetch relevant document parts
    verbose=True,  # Enable detailed logging for debugging
    return_source_documents=True  # Include source documents in the response for transparency
)
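
# --- Illustration only: roughly what chain_type="stuff" amounts to ---
# Assumption: the "stuff" strategy concatenates every retrieved chunk into one
# context block ahead of the question. `build_stuffed_prompt` is a hypothetical
# helper, not a LangChain API, and the prompt wording RetrievalQA actually uses
# may differ from this sketch.
def build_stuffed_prompt(question, chunk_texts):
    context = "\n\n".join(chunk_texts)  # join all chunks into a single context string
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )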

# === Pretty-printing helper to show results and sources ===
def process_llm_response(llm_response):
    # Print the main result (answer) from the language model
    print(llm_response['result'])
    print('\n\nSources:')
    # Print the sources used to generate the response
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
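
# Optional variant (a sketch, not part of the original helper): list each source
# once, together with its page number. Assumption: each source document carries a
# 'page' entry in its metadata, as PyPDFLoader typically provides (0-based);
# `unique_sources` is a hypothetical helper name.
def unique_sources(llm_response):
    labels = []
    for source in llm_response["source_documents"]:
        name = source.metadata.get("source", "unknown")
        page = source.metadata.get("page", "?")
        label = f"{name} (page {page})"
        if label not in labels:  # skip repeats while keeping first-seen order
            labels.append(label)
    return labels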

# === Ask a question! ===
query = "tell me more about ReAct prompting"  # Define a question to query the chain
llm_response = qa_chain.invoke(query)  # Send the question to the RetrievalQA chain and get the response

# === Display the result ===
process_llm_response(llm_response)  # Format and print the answer along with its sources



> Entering new RetrievalQA chain...

> Finished chain.
Okay, let’s dive deeper into the ReAct prompting approach. Here’s a breakdown of what it is and why it’s significant:

**What is ReAct?**

ReAct is a prompting technique designed to improve language models’ ability to solve complex tasks by combining reasoning and acting. It’s built around a core idea: **“Reasoning-Acting”**. Instead of just asking the model to *generate* an answer, ReAct encourages it to *first reason* about the problem, then *act* to try different solutions.

**Here’s a breakdown of the key components:**

1. **Reasoning Phase:** The model is given a task and a starting state. It then engages in a "reasoning loop" – it generates a series of intermediate steps, thoughts, and hypotheses.  This is crucial because it moves beyond simply providing the answer and forces the model to *think* through the problem.

2. **Acting Phase:** Based on the reasoning loop, the model *acts* – it tries out different