<a href="https://colab.research.google.com/github/micah-shull/RAG-LangChain/blob/main/Copy_of_LC_017_RAG_RetrieverEVAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pip Installs


In [None]:
!pip install --upgrade --quiet \
    langchain \
    langchain-huggingface \
    langchain-openai \
    langchain-community \
    chromadb \
    python-dotenv \
    transformers \
    accelerate \
    sentencepiece \
    ragas datasets

## Load Libraries

In [None]:
# 🌿 Environment setup
import os                                 # File paths and OS interaction
from dotenv import load_dotenv            # Load environment variables from .env file
import langchain; print(langchain.__version__)  # Check LangChain version
import itertools

# 📄 Document loading and preprocessing
from langchain_core.documents import Document                   # Base document type
from langchain_community.document_loaders import TextLoader     # Loads plain text files
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splits long docs into smaller chunks

# 🔢 Embeddings + vector storage
from langchain_huggingface import HuggingFaceEmbeddings         # HuggingFace embedding model
from langchain_community.vectorstores import Chroma             # Persistent vector DB (Chroma)
import chromadb

# 💬 Prompting + output
from langchain_openai import ChatOpenAI                         # OpenAI chat model wrapper
from langchain_core.prompts import ChatPromptTemplate           # Chat-style prompt templates
from langchain_core.output_parsers import StrOutputParser       # Converts model output to string

# 🔗 Chains / pipelines
from langchain_core.runnables import Runnable, RunnableLambda   # Compose custom pipelines

# 🧠 (Optional) Hugging Face LLM client setup
# from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace  # For HF inference API

# 🧾 Pretty printing
import textwrap                         # Format long strings for printing
from pprint import pprint               # Nicely format nested data structures

# Pydantic
from pydantic import BaseModel, ValidationError

0.3.26


## Model Params

In [None]:
# Load API key
from openai import OpenAI

# Load token from .env.
load_dotenv("/content/API_KEYS.env", override=True)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# SET MODEL PARAMS
EMBED_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
K = 2

LLM_MODEL = ChatOpenAI(
    model_name="gpt-4o-mini",
    temperature=0.4  # Moderate creativity; adjust as needed
)

## Load & Clean Docs

## ✅ Why We Tag Chunks with `source` Metadata

In a Retrieval-Augmented Generation (RAG) system, it’s critical to know **where each chunk comes from** so you can measure how well your retriever works.  
By adding a `source` tag (the filename) to each document’s metadata before splitting it into chunks, we:

- ✅ Preserve the connection between each chunk and its original document
- ✅ Make it possible to check if the retriever returned the correct chunks for a given question
- ✅ Enable meaningful metrics like **Recall@k**, **Precision@k**, and **F1@k**

---

## ✅ Why We Calculate Retrieval Metrics

Good retrieval is the backbone of a good RAG system.  
**Precision and Recall** help us understand:
- **Recall@k:** *Did the retriever return all the relevant chunks?*  
- **Precision@k:** *Are the retrieved chunks actually relevant, or is there extra noise?*  
- **F1@k:** *Balances precision and recall into a single score.*

High recall means we rarely miss relevant info; high precision means we avoid irrelevant context that might confuse the LLM.  
Without these metrics, you can’t tell if poor answers come from bad retrieval or bad generation.
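
For reference, the three metrics can be written out as below. The symbols are introduced here for illustration only: $R$ is the set of relevant chunks for a question and $T_k$ is the set of the top *k* retrieved chunks. F1@k is usually taken as 0 when precision and recall are both 0.

$$
\text{Recall@}k = \frac{|R \cap T_k|}{|R|}, \qquad
\text{Precision@}k = \frac{|R \cap T_k|}{k}, \qquad
\text{F1@}k = \frac{2 \cdot \text{Precision@}k \cdot \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k}
$$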

---

## ✅ How This Works in Our Pipeline

- When we chunk each document, we add its filename to `metadata["source"]`.
- When we evaluate, we compare the **retrieved chunks’ sources** to the **expected sources** we define in our gold-standard QA dataset.
- This lets us calculate retrieval metrics programmatically, track weak spots, and improve both chunking and retriever settings over time.

**Result:** Retrieval quality becomes measurable, so you can tell whether a change to chunking or retriever settings actually helps before it reaches the generated answers.
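
The source comparison described above might look like the sketch below. It assumes the gold-standard entry stores bare filenames, while the retrieved `source` values may be either filenames or full paths depending on how the vector store was built, so both sides are reduced to a filename first. The helper names, the `expected_sources` key, and the example question are hypothetical.

```python
import os

def normalize_source(source: str) -> str:
    """Reduce a source value to its bare filename so paths and filenames compare equal."""
    return os.path.basename(source)

def matched_sources(retrieved_sources, expected_sources):
    """Return the expected sources that also appear among the retrieved sources."""
    retrieved = {normalize_source(s) for s in retrieved_sources}
    expected = {normalize_source(s) for s in expected_sources}
    return retrieved & expected

# Hypothetical gold-standard entry (the question text is illustrative only).
gold_example = {
    "question": "Which federal economic indicators are covered?",
    "expected_sources": ["CFFC_EconomicIndicators_Federal.txt"],
}

# Retrieved sources as they might come back from the retriever (full paths here).
retrieved_sources = [
    "/content/CFFC/CFFC_EconomicIndicators_Federal.txt",
    "/content/CFFC/CFFC_EconomicIndicators_Federal.txt",
]

matched = matched_sources(retrieved_sources, gold_example["expected_sources"])
print(matched)
```

Normalizing first avoids scoring a correct retrieval as a miss just because one side carries a directory prefix.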


In [None]:
# Path to documents
docs_path = "/content/CFFC"

# Step 1: Load all .txt files in the folder
raw_documents = []

for filename in os.listdir(docs_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(docs_path, filename)
        loader = TextLoader(file_path, encoding="utf-8")
        docs = loader.load()

        for doc in docs:
            doc.metadata["source"] = filename   # <<--- add file name!
            raw_documents.append(doc)


print(f"Loaded {len(raw_documents)} documents.")

# Step 2 (optional): Clean up newlines and extra whitespace
def clean_doc(doc: Document) -> Document:
    cleaned = " ".join(doc.page_content.split())  # Removes newlines & extra spaces
    return Document(page_content=cleaned, metadata=doc.metadata)

cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# Step 3: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_documents = splitter.split_documents(cleaned_documents)

print(f"Split into {len(chunked_documents)} total chunks.")

# Preview the first 5 chunks
print(f"Showing first 5 of {len(chunked_documents)} chunks:\n")

for i, doc in enumerate(chunked_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
    print("\n")

Loaded 5 documents.
Split into 83 total chunks.
Showing first 5 of 83 chunks:

--- Chunk 1 ---
Source: CFFC_Proof_MultiStoreAccuracy.txt

# 📊 Consistency That Builds Confidence Forecasting accuracy isn’t just about one-time wins — it’s
about **reliable, repeatable performance** across your business. Cashflow4cast delivers consistent


--- Chunk 2 ---
Source: CFFC_Proof_MultiStoreAccuracy.txt

your business. Cashflow4cast delivers consistent forecasting accuracy, store after store. --- ## 🏪
Proven Across 20+ Locations We tested our machine learning model across **20+ different store


--- Chunk 3 ---
Source: CFFC_Proof_MultiStoreAccuracy.txt

learning model across **20+ different store locations**, each with unique demand patterns.
**Result:** In every case, the ML model **reduced forecasting errors by at least 50%**,
outperforming


--- Chunk 4 ---
Source: CFFC_Proof_MultiStoreAccuracy.txt

errors by at least 50%**, outperforming traditional Excel-style models (including Prophet). --- 

## ✅ What Does the `@k` Mean in Retrieval Metrics?

When you see metrics like **Recall@k** or **Precision@k**, the **`@k`** (read as “at k”) means we’re calculating the metric based on the **top *k* chunks** returned by the retriever.

---

### 🔍 Why This Matters

In a Retrieval-Augmented Generation (RAG) system:
- The retriever pulls the *k* most relevant chunks for a given question.
- The LLM uses only these chunks to generate its final answer.
- If the right information isn’t in the top *k*, the answer quality suffers — so we need to know how well the retriever does *within that limit*.

---

### ✅ What Recall@k Means

- **Recall@k** measures how many of the *relevant* chunks were successfully retrieved in the top *k*.
- Example: If 2 chunks are truly relevant and both are in the top 5 → Recall@5 = 1.0 (100%).
- If only 1 is found → Recall@5 = 0.5 (50%).

---

### ✅ What Precision@k Means

- **Precision@k** measures how many of the top *k* retrieved chunks are actually relevant.
- Example: If the retriever returns 5 chunks and only 2 are correct → Precision@5 = 0.4 (40%).
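
The worked examples above can be checked with a small sketch. It assumes each chunk has a unique ID and that `ranked_ids` is ordered best-first; the function names and chunk IDs are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # undefined when there is nothing relevant to find
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant)
    return hits / k

ranked = ["c1", "c7", "c3", "c9", "c2"]  # hypothetical retriever output, best first

print(recall_at_k(ranked, ["c1", "c3"], 5))     # both relevant chunks found -> 1.0
print(recall_at_k(ranked, ["c1", "c8"], 5))     # only one of two found -> 0.5
print(precision_at_k(ranked, ["c1", "c3"], 5))  # 2 of 5 retrieved are relevant -> 0.4
```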

---

### 🎯 Why `@k` Is Critical for RAG

- It reflects how well your retriever performs given the constraints your LLM works with.
- Tuning *k* helps you balance **recall** (finding all useful info) vs **precision** (avoiding irrelevant info that might confuse the model).

---

**Key takeaway:**  
Adding `@k` makes your retrieval evaluation realistic and reproducible — you know exactly how deep your retriever had to go to find the right context!


## Embed & Persist

In [None]:
# Step 1: Set up Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

# Step 2: Create your Chroma client with telemetry OFF
persist_dir = "chroma_db"

# Named chroma_client so it does not overwrite the OpenAI `client` created earlier
chroma_client = chromadb.PersistentClient(
    path=persist_dir,
    settings=chromadb.config.Settings(
        anonymized_telemetry=False
    )
)

# Step 3: Build vector store using your client
vectorstore = Chroma.from_documents(
    documents=chunked_documents,
    embedding=embedding_model,
    client=chroma_client,
    persist_directory=persist_dir
)

print(f"✅ Stored {len(chunked_documents)} chunks in Chroma at '{persist_dir}'")



✅ Stored 83 chunks in Chroma at 'chroma_db'


## Retriever Prompt

### ✅ How `k` Affects Answer Quality

- **Higher `k`** → More context chunks → Higher chance all relevant info is included → But can make answers longer, repetitive, or off-topic if irrelevant chunks are pulled.
- **Lower `k`** → Fewer chunks → Shorter, more focused input → But you risk missing important details if the retriever doesn't find everything.
- This is why tuning `k` and measuring **Recall@k** and **Precision@k** are critical: they help you balance *coverage* vs *focus* to get accurate, concise answers.
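
One way to see the trade-off is to score the same ranked result list at several values of `k`. The sketch below uses made-up chunk IDs and assumes at least one relevant chunk exists; `sweep_k` is a hypothetical helper, not part of the notebook.

```python
def sweep_k(ranked_ids, relevant_ids, k_values):
    """Compute recall and precision at each k for one ranked result list."""
    relevant = set(relevant_ids)  # assumed non-empty
    rows = []
    for k in k_values:
        hits = len(set(ranked_ids[:k]) & relevant)
        rows.append({
            "k": k,
            "recall": hits / len(relevant),
            "precision": hits / k,
        })
    return rows

ranked = ["c4", "c1", "c9", "c3", "c6"]  # hypothetical retriever output, best first
relevant = ["c1", "c3"]                  # hypothetical ground truth

sweep = sweep_k(ranked, relevant, [1, 2, 4])
for row in sweep:
    print(row)
```

In this toy case recall climbs from 0.0 to 0.5 to 1.0 as `k` grows, while precision stays at 0.5 from `k=2` to `k=4`: deeper retrieval found the second relevant chunk but also brought in an extra irrelevant one.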


In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": K})

prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that uses business documents to answer questions.
Use the following context to answer the question as accurately as possible.

Context:
{context}

Question:
{question}

Answer:
""")

def generate_answer(question: str):
    """
    1. Retrieve K chunks
    2. Concatenate into a single context string
    3. Feed context + question to the generator LLM
    4. Return (answer, context)   <-- key!
    """
    # Step-1 ➜ retrieve top K document chunks
    docs = retriever.invoke(question)
    retrieved_ids = [doc.metadata["source"] for doc in docs]  # source filename per chunk (not a unique chunk ID); currently unused


    # Step-2 ➜ combine those into a context string
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step-3 ➜ fill the prompt template with context and question
    llm_input = prompt_template.format(context=context, question=question)

    # Step-4 ➜ call the LLM and get its plain text output
    answer = StrOutputParser().invoke(LLM_MODEL.invoke(llm_input))

    # Step-5 ➜ return the answer AND the context used
    return answer, context


## Define RAG chain

In [None]:
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([doc.page_content for doc in d["docs"]]),
        "sources": [doc.metadata["source"] for doc in d["docs"]],
        "question": d["question"]
    })
    | RunnableLambda(lambda d: {
        "answer": LLM_MODEL.invoke(prompt_template.format(
            context=d["context"], question=d["question"]
        )).content,
        "sources": d["sources"]
    })
)

# Invoke RAG
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Print response nicely
print("\n" + textwrap.fill(response["answer"], width=100))
print("\nRetrieved sources:", response["sources"])


The provided context does not include specific recent economic indicators for Gainesville or any
local data. To accurately answer your question about recent economic indicators in Gainesville that
affect local businesses, you would need to refer to local economic reports, government publications,
or business surveys that detail the current economic conditions in that area. Typically, indicators
such as unemployment rates, consumer spending, housing market trends, and local business growth
rates are relevant.

Retrieved sources: ['/content/CFFC/CFFC_EconomicIndicators_Federal.txt', '/content/CFFC/CFFC_EconomicIndicators_Federal.txt']


## ✅ Why the LLM Sometimes Says “No Data in Context”

In a Retrieval-Augmented Generation (RAG) pipeline:
- The LLM can only answer questions using the **retrieved context**.
- If the retriever does not return chunks containing the specific fact needed, the LLM has nothing in its context to ground that fact in.

---

### 🔍 Why This Happens

When the knowledge base does **not contain** the information a user is asking for (e.g., local economic indicators for Gainesville):
- The retriever pulls the closest matching chunks it can find (in this case, federal indicators).
- The LLM reads the context and realizes the specific detail is missing.
- The LLM correctly returns an answer like:
  > “The provided context does not include this information…”

This is exactly how a well-designed RAG system should behave — it avoids **hallucinating** or making up facts.

---

### 🎯 What This Tells Us

- **Recall@k = 0:** None of the top *k* retrieved chunks contain the relevant information, because it was never added to the knowledge base. (Strictly, with no relevant chunk in the corpus the recall denominator is zero; it is treated as 0 here.)
- **Precision@k:** Also low, since the retrieved chunks are at best loosely related to the question.
- This signals a **content gap**, not a technical failure. The pipeline worked properly!

---

### ✅ What To Do About It

- Keep these “no result” queries in your test set — they highlight real **knowledge gaps**.
- If these topics matter to your users (e.g., local Gainesville stats), add trusted local data to your corpus.
- This improves your retrieval recall in the future — because the fact now exists to be found.
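
To keep track of these "no result" queries automatically, a simple phrase check can flag likely fallback answers. This is a rough heuristic sketch: the phrase list is an assumption based on the answer shown earlier and would need tuning for your own model's wording.

```python
# Hypothetical phrases that tend to signal "the context did not cover this".
FALLBACK_PHRASES = (
    "does not include",
    "not provided in the context",
    "no information",
)

def is_fallback_answer(answer: str) -> bool:
    """Return True if the answer looks like a 'not in context' fallback."""
    text = answer.lower()
    return any(phrase in text for phrase in FALLBACK_PHRASES)

example = (
    "The provided context does not include specific recent economic "
    "indicators for Gainesville or any local data."
)
print(is_fallback_answer(example))
print(is_fallback_answer("Forecast errors were reduced by at least 50%."))
```

Flagged questions can then be reviewed by hand to decide whether the topic deserves new content in the corpus.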

---

**Key takeaway:**  
A fallback answer shows the LLM stayed within its retrieved context instead of guessing.  
You should fill content gaps — not force the LLM to guess.


## ✅ Adding Missing Local Data: What We Learned

### 🔍 How We Discovered the Gap

During testing, we asked our RAG pipeline:
> "What are the recent economic indicators in Gainesville that affect local businesses?"

The system returned:
> "The provided context does not include specific recent economic indicators for Gainesville..."

This meant:
- The retriever did not return relevant local chunks.
- Checking our vector database showed that we never embedded our Gainesville-specific content — even though this info *did exist* on our website.

---

### 🗝️ Lesson Learned

**A RAG pipeline can only retrieve what’s in the knowledge base.**  
If a user asks about a topic not in your corpus, your LLM will either:
- Return an empty or fallback answer (✅ better than hallucinating!).
- Or hallucinate — if you haven’t prompted it to stick to context only.
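
A stricter prompt is one way to push the model toward the fallback behaviour. The wording below is an illustrative assumption, not the prompt used elsewhere in this notebook, and it is built with plain string formatting so it runs on its own; the same template text could be passed to `ChatPromptTemplate.from_template`.

```python
# Hypothetical "context only" prompt wording; adjust to taste.
STRICT_PROMPT = (
    "You are a helpful assistant that answers questions using business documents.\n"
    "Answer using ONLY the context below. If the context does not contain the answer,\n"
    "say that the provided context does not include that information.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:\n"
)

def build_prompt(context: str, question: str) -> str:
    """Fill the strict template with the retrieved context and the user question."""
    return STRICT_PROMPT.format(context=context, question=question)

prompt_text = build_prompt(
    context="(retrieved chunks go here)",
    question="What are the recent economic indicators in Gainesville?",
)
print(prompt_text)
```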

---

### ✅ How We Closed the Gap

1️⃣ We located the missing Gainesville-specific page on our website.

2️⃣ We added clear **Markdown headings** for each local indicator.

3️⃣ We expanded bullet points with short, clear explanations to help our embedding model capture more semantic meaning.

4️⃣ We saved the updated version as a **Markdown file** in our `/content` folder.
- Using Markdown headings ensures each chunk has a clear semantic anchor when we split it for embeddings.
- This makes the retriever more likely to match user queries to the right chunks.
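
To illustrate why headings help, the sketch below splits a Markdown string into one section per heading using only the standard library. It is a simplified stand-in (it does not handle `#` characters inside code fences), and the sample text is placeholder content. LangChain also ships a `MarkdownHeaderTextSplitter` that serves a similar purpose.

```python
import re

def split_by_headings(markdown_text: str):
    """Split Markdown into (heading, body) pairs at lines that start with 1-6 '#'."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the previous section before starting a new one.
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections

# Placeholder document, not real data.
sample = """# Local Indicators
## Unemployment
Placeholder text about unemployment.
## Housing
Placeholder text about housing.
"""

sections = split_by_headings(sample)
for heading, body in sections:
    print(heading, "->", body)
```

Each resulting section carries its own heading, so a chunk about housing is never embedded without the word "Housing" attached to it.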

---

### ✨ Why This Matters

- Filling the gap improves **Recall@k** — the retriever can now find and return relevant local data.
- Using clear headings and explanations improves **Precision@k** — chunks are more focused and less noisy.
- Our pipeline stays trustworthy: no hallucinations, and better coverage of real customer questions.

---

**Key takeaway:**  
Regularly test real questions, look for fallback answers, and add missing high-value content — *optimized with good structure* — to strengthen your RAG system over time!





## ✅ **Key Next Steps**

### **1️⃣ Add Retrieval Metrics**

You want to test: **Did the retriever return the “right” chunks for the question?**

🔑 **What you need:**

* A small **ground truth mapping**: For each test question, note what chunk(s) should be retrieved.
* Evaluate metrics like **Recall@k**, **Precision@k**, and **F1@k** for retrieval.

📌 **How to do it:**

* Add an `expected_chunks` field to your `qa_test` list (or store in a CSV).
* When you retrieve `docs` for a question, compare the returned chunk IDs to the expected ones.
* Count matches:

  * **Recall@k** = (# relevant chunks retrieved) / (# relevant chunks in ground truth)
  * **Precision@k** = (# relevant chunks retrieved) / k

✅ Example:

```python
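def evaluate_retrieval(retrieved_docs, expected_chunk_ids):
    """Return (recall, precision), using each chunk's source filename as its ID."""
    retrieved_ids = [doc.metadata["source"] for doc in retrieved_docs]
    relevant = set(expected_chunk_ids)
    retrieved = set(retrieved_ids)  # duplicates collapse: several chunks from one file count once

    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid ZeroDivisionError
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0

    return recall, precision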
```

---

### **2️⃣ Automate End-to-End QA Loop**

You already loop through test questions. Enhance this by:

* Storing, for each question: the retrieved docs, generated answer, context, retriever metrics, generation acceptability, and feedback.
* Saving the results to a CSV or pandas DataFrame for easy review.
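
A minimal sketch of such a loop, using only the standard library. It assumes an answer function that returns `(answer, context, sources)`; the notebook's `generate_answer` returns only `(answer, context)`, so the third value, the `expected_sources` key, and the stub below are assumptions for illustration.

```python
import csv
import io

def run_qa_loop(qa_test, answer_fn):
    """Run each test question and collect one result row per question."""
    rows = []
    for item in qa_test:
        answer, context, sources = answer_fn(item["question"])
        expected = set(item["expected_sources"])
        retrieved = set(sources)
        hits = len(expected & retrieved)
        rows.append({
            "question": item["question"],
            "answer": answer,
            "context": context,
            "retrieved_sources": ";".join(sorted(retrieved)),
            "recall": hits / len(expected) if expected else "",
            "precision": hits / len(retrieved) if retrieved else "",
        })
    return rows

def rows_to_csv(rows):
    """Serialize result rows to CSV text (write this to a file to save it)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def fake_answer_fn(question):
    """Stub standing in for the real RAG call."""
    return "stub answer", "stub context", ["doc_a.txt"]

qa_test = [
    {"question": "placeholder question?", "expected_sources": ["doc_a.txt", "doc_b.txt"]},
]

rows = run_qa_loop(qa_test, fake_answer_fn)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Columns for generation acceptability and reviewer feedback can be added to each row in the same way once those judgments exist.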

---

### **3️⃣ Add `RAGAS` or `LlamaIndex` for Reference**

You could replicate this manually (which you’re halfway doing), or use **RAGAS**:

* It uses LLMs to judge **Faithfulness**, **Answer Relevancy**, and **Context Recall**.
* It covers the same checks as a hand-written evaluator loop, with less manual work.

---

### **4️⃣ Track Failure Cases**

When a question scores low on retrieval or generation, log:

* Which ground truth chunks were missed?
* Why did the LLM hallucinate? Was the context incomplete?
* Use this to refine chunk size, overlap, or even the prompt.
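
A small helper can record which stage a failure points to. This is a sketch under assumptions: `answer_ok` is a hypothetical boolean coming from manual review or an LLM judge, and sources are compared by filename.

```python
def diagnose(expected_sources, retrieved_sources, answer_ok):
    """Label a test case as a retrieval failure, a generation failure, or ok."""
    expected = set(expected_sources)
    retrieved = set(retrieved_sources)
    missed = sorted(expected - retrieved)
    if missed:
        stage = "retrieval"    # needed context never reached the LLM
    elif not answer_ok:
        stage = "generation"   # context was there, but the answer was still poor
    else:
        stage = "ok"
    return {"stage": stage, "missed_sources": missed}

print(diagnose(["a.txt", "b.txt"], ["a.txt"], answer_ok=False))
print(diagnose(["a.txt"], ["a.txt"], answer_ok=False))
print(diagnose(["a.txt"], ["a.txt"], answer_ok=True))
```

Retrieval failures suggest revisiting chunk size, overlap, or `k`; generation failures point toward the prompt or the model settings.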

---

### **5️⃣ Continuous Improvement**

* Keep a growing test set of real user questions.
* Automate logging and store failed cases.
* Tune your chunking and retriever settings based on low Recall@k or low faithfulness.

---

## ✅ Here’s What We’ll Do with RAGAS

RAGAS can automatically score:

* **Context Recall** (similar to Recall@k)
* **Faithfulness** (how closely the generated answer sticks to the retrieved context)
* **Answer Relevancy** (how well the answer addresses the question)
* **Answer Correctness** (requires ground truth answers; optional but recommended)

You already have:

* A retriever that returns chunks.
* A generator that returns the answer + context.
* An evaluator loop — which we’ll replace or extend with `ragas`.

---

## ✅ Questions for You (to plan this right)

1️⃣ **Do you have ground truth answers?**
RAGAS’s `Answer Correctness` metric compares the generated answer to a known good answer.

* If you don’t have these yet, we can use only `Context Recall` + `Faithfulness` + `Answer Relevancy`.

2️⃣ **Do you want to evaluate on a small test set for now?**
Let’s start with 5–10 examples so you don’t burn credits or wait too long.

3️⃣ **Are you okay with using OpenAI as the backend LLM for the RAGAS eval?**
It needs an LLM to do the judgment, same as your custom evaluator — you just need your `OPENAI_API_KEY` set up.

---

## ✅ Next Steps (High-Level)

Here’s the flow for your Colab:

```plaintext
1. Install or update RAGAS and dependencies.
2. Create a Dataset:
   Each row = { question, answer, contexts, (optional) ground_truth }
3. Run RAGAS pipeline:
   - ragas.evaluate(dataset, metrics=[...])
4. Inspect results: see scores for each metric.
5. Save to CSV or visualize.
```
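
Step 2 can be sketched as plain Python before any RAGAS call. The helper below turns per-question records into a column-oriented dict, which is the shape `datasets.Dataset.from_dict` accepts. The column names follow the row layout listed above; the names RAGAS expects differ between versions, so check the documentation for the version you installed. The record contents are placeholders.

```python
def build_eval_columns(records):
    """Convert a list of per-question records into a dict of equal-length columns."""
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for rec in records:
        columns["question"].append(rec["question"])
        columns["answer"].append(rec["answer"])
        # contexts holds a list of strings: one entry per retrieved chunk
        columns["contexts"].append(list(rec["contexts"]))
        # ground_truth is optional; an empty string marks it as missing
        columns["ground_truth"].append(rec.get("ground_truth", ""))
    return columns

# Placeholder records, not real results.
records = [
    {
        "question": "placeholder question 1?",
        "answer": "placeholder answer 1",
        "contexts": ["chunk text A", "chunk text B"],
        "ground_truth": "placeholder reference answer",
    },
    {
        "question": "placeholder question 2?",
        "answer": "placeholder answer 2",
        "contexts": ["chunk text C"],
    },
]

columns = build_eval_columns(records)
print({name: len(values) for name, values in columns.items()})

# With the `datasets` package installed, the columns can then be wrapped:
# from datasets import Dataset
# dataset = Dataset.from_dict(columns)
```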



