<a href="https://colab.research.google.com/github/micah-shull/RAG-LangChain/blob/main/LC_015_RAG_EVAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 🧠 RAG Pipeline Evaluation

### 🎯 Objective

This notebook sets out to evaluate the performance of a **Retrieval-Augmented Generation (RAG)** pipeline designed to answer questions using a curated set of business and economic documents. Our goal was to ensure that the system not only retrieves relevant content but also generates factually accurate and helpful answers grounded in that content.

---

### ⚠️ Key Challenges

1. **Knowledge Cutoff in LLM Evaluation**
   Our evaluator LLM (GPT-4o-mini) has a knowledge cutoff in 2023, while our documents include facts and data from 2024–2025. This caused the evaluator to incorrectly reject some answers as “factually incorrect” even though they were accurate and drawn from trusted sources.

2. **Ensuring Context-Aware Judging**
   We needed to ensure the evaluator judged answers based **only on the retrieved document chunks**, not on the LLM’s prior training knowledge or assumptions.

---

### ✅ Solutions Implemented

* Built a clean RAG chain that retrieves the top-`k` chunks and generates answers using GPT-4o-mini.
* Captured both the **generated answer** and the **retrieved context** for each question.
* Designed an evaluation function that passes the **same context** to an LLM-based evaluator.
* Updated the evaluator prompt to trust the document context and ignore its own outdated knowledge.

---

### 📈 Results

* **All test questions produced factually grounded, well-written answers.**
* Evaluator marked all answers **"acceptable"** once the context-aware judgment fix was applied.
* The RAG system successfully explained economic indicators and their impacts on local businesses in a clear, actionable manner.

---

### 🔍 Conclusion

This notebook demonstrates a full, working example of how to evaluate a modern RAG system with:

* Document-grounded generation
* Faithfulness-aware evaluation
* Scalable automation using OpenAI models

The pipeline is now well-suited for expansion, logging, QA review, or integration into production systems.




## Pip Install Packages

In [None]:
!pip install --upgrade --quiet \
    langchain \
    langchain-huggingface \
    langchain-openai \
    langchain-community \
    chromadb \
    python-dotenv \
    transformers \
    accelerate \
    sentencepiece

## Load Libraries

In [None]:
# 🌿 Environment setup
import os                                 # File paths and OS interaction
from dotenv import load_dotenv            # Load environment variables from .env file
import langchain; print(langchain.__version__)  # Check LangChain version
import itertools

# 📄 Document loading and preprocessing
from langchain_core.documents import Document                   # Base document type
from langchain_community.document_loaders import TextLoader     # Loads plain text files
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splits long docs into smaller chunks

# 🔢 Embeddings + vector storage
from langchain_huggingface import HuggingFaceEmbeddings         # HuggingFace embedding model
from langchain_community.vectorstores import Chroma             # Persistent vector DB (Chroma)

# 💬 Prompting + output
from langchain_openai import ChatOpenAI                         # OpenAI chat model wrapper
from langchain_core.prompts import ChatPromptTemplate           # Chat-style prompt templates
from langchain_core.output_parsers import StrOutputParser       # Converts model output to string

# 🔗 Chains / pipelines
from langchain_core.runnables import Runnable, RunnableLambda   # Compose custom pipelines

# 🧠 (Optional) Hugging Face LLM client setup
# from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace  # For HF inference API

# 🧾 Pretty printing
import textwrap                         # Format long strings for printing
from pprint import pprint               # Nicely format nested data structures

0.3.26


## SET MODEL PARAMS

In [None]:
# Load API key
from openai import OpenAI

# Load token from .env.
load_dotenv("/content/API_KEYS.env", override=True)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# SET MODEL PARAMS
EMBED_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
K = 2

LLM_MODEL = ChatOpenAI(
    model_name="gpt-4o-mini",
    temperature=0.4  # Moderate creativity; adjust as needed
)



## 🧾 Document Cleaning

### 🧾 1. **Load the `.txt` files**

We’ll loop through all files in the folder using `TextLoader`.

### 🧹 2. **Cleaning**

Basic cleaning (e.g. stripping newlines, extra whitespace) is often helpful **before splitting**, especially if the files came from exports or copy-paste.

### ✂️ 3. **Split into chunks**

We’ll use `RecursiveCharacterTextSplitter` to chunk documents. Chunks of 500–1000 characters with a slight overlap for context continuity are typical; this notebook uses smaller chunks (`CHUNK_SIZE = 200`, `CHUNK_OVERLAP = 50`).

---

### 🧼 Why Basic Cleaning Helps

* Removes line breaks and blank lines that can confuse LLMs
* Avoids splitting chunks in awkward places
* Standardizes the format before embedding

Later you can add more advanced cleaning (e.g., remove boilerplate, normalize headers), but this is a solid default.
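
To make the cleaning and chunking steps concrete, here is a minimal, dependency-free sketch of whitespace normalization followed by fixed-width chunking with overlap. The helper names `normalize_whitespace` and `chunk_with_overlap` are illustrative and not part of LangChain. The sliding window is also a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks and spaces, so its chunk boundaries will differ.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(text.split())


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters.

    Simplified illustration only; it does not look for natural separators.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(normalize_whitespace("a\n\n b\t c "))     # prints: a b c
print(chunk_with_overlap("abcdefghij", 4, 2))   # prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_size=200` and `chunk_overlap=50`, each chunk in this sketch would start 150 characters after the previous one, so a sentence cut at one boundary is likely to appear whole in a neighbouring chunk.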





In [None]:
# Path to documents
docs_path = "/content/CFFC"

# Step 1: Load all .txt files in the folder
raw_documents = []
for filename in os.listdir(docs_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(docs_path, filename)
        loader = TextLoader(file_path, encoding="utf-8")
        docs = loader.load()
        raw_documents.extend(docs)

print(f"Loaded {len(raw_documents)} documents.")

# Step 2 (optional): Clean up newlines and extra whitespace
def clean_doc(doc: Document) -> Document:
    cleaned = " ".join(doc.page_content.split())  # Removes newlines & extra spaces
    return Document(page_content=cleaned, metadata=doc.metadata)

cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# Step 3: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_documents = splitter.split_documents(cleaned_documents)

print(f"Split into {len(chunked_documents)} total chunks.")

# Preview the first 5 chunks
print(f"Showing first 5 of {len(chunked_documents)} chunks:\n")

for i, doc in enumerate(chunked_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
    print("\n")

Loaded 5 documents.
Split into 83 total chunks.
Showing first 5 of 83 chunks:

--- Chunk 1 ---
Source: /content/CFFC/CFFC_Features_UseCases.txt

# 📊 Forecasting You Can Trust in Uncertain Times We help mid-sized businesses stay ahead of sales
volatility and protect their bottom line by cutting forecast errors in half – using powerful machine


--- Chunk 2 ---
Source: /content/CFFC/CFFC_Features_UseCases.txt

forecast errors in half – using powerful machine learning models. --- ## 🚀 Key Benefits at a Glance
- Cut forecasting errors by **50% or more** - Detect changes in demand **before they impact cash


--- Chunk 3 ---
Source: /content/CFFC/CFFC_Features_UseCases.txt

changes in demand **before they impact cash flow** - Respond to opportunities with **confidence and
speed** - Replace spreadsheets with **smarter, adaptive forecasting** --- ## ⚠️ 1. Risk: Can I


--- Chunk 4 ---
Source: /content/CFFC/CFFC_Features_UseCases.txt

adaptive forecasting** --- ## ⚠️ 1. Risk: 

## ✅ Embed + Persist in Chroma




In [None]:
# Step 1: Set up Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

# Step 2: Set up Chroma with persistence
persist_dir = "chroma_db"

vectorstore = Chroma.from_documents(
    documents=chunked_documents,
    embedding=embedding_model,
    persist_directory=persist_dir
)

print(f"✅ Stored {len(chunked_documents)} chunks in Chroma at '{persist_dir}'")

✅ Stored 83 chunks in Chroma at 'chroma_db'


## ✅ Create the Retriever & Prompt Template

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": K})

# prompt template
prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that uses business documents to answer questions.
Use the following context to answer the question as accurately as possible.

Context:
{context}

Question:
{question}

Answer:
""")


## ✅ Step 3: Create the RAG Chain & Run a Query!

In [None]:
# Define RAG chain
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([doc.page_content for doc in d["docs"]]),
        "question": d["question"]
    })
    | prompt_template
    | LLM_MODEL
    | StrOutputParser()
)

# Invoke RAG
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Print response nicely
import textwrap
print("\n" + textwrap.fill(response, width=100))



Recent economic indicators in Gainesville that affect local businesses include declining consumer
confidence, softening retail sales, and rising unemployment. These factors contribute to revenue
uncertainty for even steady businesses, making it crucial for business owners to improve their
forecasting to navigate these challenges effectively.


## EVALUATION

In [None]:
# !pip install pydantic



In [None]:
from pydantic import BaseModel, ValidationError

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

evaluator_system_prompt = """
You are an evaluator. Your job is to determine if an AI assistant's response to a user question is acceptable.

You must check:
- ✅ Is it factually correct?
- ✅ Is it clear and well-written?
- ✅ Is it relevant to the user question?

If the response is unclear, incorrect, or unhelpful, mark it unacceptable.

Respond in **JSON only**:
{
  "is_acceptable": true or false,
  "feedback": "explanation of your reasoning"
}
"""

In [None]:
def evaluate_response(user_question, agent_reply):
    user_prompt = f"""
User Question:
{user_question}

Agent Response:
{agent_reply}

Please evaluate the agent's response.
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0
    )

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to parse response:", e)
        print("Raw response:\n", response.choices[0].message.content)
        return Evaluation(is_acceptable=False, feedback="Parsing failed.")

In [None]:
question = "What is the capital of France?"
correct_reply = "The capital of France is Paris."
bad_reply = "France is in Europe, so it might be Berlin or Paris or Rome."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

# Run and print both examples
result1 = evaluate_response(question, correct_reply)
print_evaluation("✅ Good Reply Test:", result1)

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)

In [None]:
qa_test = [
    "What are the recent economic indicators in Gainesville that affect local businesses?",
    "What is the recent trend in the Consumer Price Index (CPI) and how might it affect small businesses?",
    "How has the Consumer Confidence Index changed since December 2024?",
    # "What challenges might a business face if the Consumer Confidence Index drops?",
    # "How can inflation affect staffing or wage budgets for small businesses?",
    # "How are CPI and retail sales different in their impact on business operations?",
    # "Why should a local business owner monitor federal economic indicators like CPI, retail sales, and consumer confidence?",
    # "What are three key actions a business owner can take in response to recent economic trends?",
]

for i, q in enumerate(qa_test, 1):
    answer = rag_chain.invoke({"question": q})
    result = evaluate_response(q, answer)

    print("\n" + "="*100)
    print(f"🔢 Question {i}")
    print(f"📌 Question:\n{textwrap.fill(q, width=100)}\n")
    print(f"📝 Answer:\n{textwrap.fill(answer, width=100)}\n")
    print(f"✅ Acceptable: {result.is_acceptable}")
    print(f"💬 Feedback:\n{textwrap.fill(result.feedback, width=100)}")
    print("="*100 + "\n")


🔢 Question 1
📌 Question:
What are the recent economic indicators in Gainesville that affect local businesses?

📝 Answer:
The recent economic indicators in Gainesville that affect local businesses include declining
consumer confidence, softening retail sales, and rising unemployment. These factors contribute to
revenue uncertainty for even steady businesses, making it challenging for them to forecast their
financial performance accurately.

✅ Acceptable: True
💬 Feedback:
The response is factually correct, as it mentions relevant economic indicators that can affect local
businesses. It is clear and well-written, providing a concise explanation of how these indicators
impact financial performance. Additionally, it is relevant to the user's question about recent
economic indicators in Gainesville.


🔢 Question 2
📌 Question:
What is the recent trend in the Consumer Price Index (CPI) and how might it affect small businesses?

📝 Answer:
The recent trend in the Consumer



## 🧪 Evaluating RAG Answers: Why We Include the Retrieved Context

### 🛑 **Problem: Misjudged Answers Due to LLM Cutoff**

Our Retrieval-Augmented Generation (RAG) pipeline uses *up-to-date documents* (e.g., 2024–2025 economic data) to generate answers. However, when we evaluate these answers using a language model (LLM) as a judge (here, GPT-4o-mini), we hit a key issue:

> 🔍 The evaluator LLM has a knowledge cutoff (e.g., October 2023) and cannot verify newer facts retrieved from our documents.

As a result:

* **Accurate, document-based answers may be wrongly flagged as hallucinations.**
* This leads to **false negatives** in our quality assessments.

---

### ✅ **Solution: Include the Retrieved Context in Evaluation**

To address this, we modify our evaluation process so the evaluator LLM has the **same retrieved context** the answering LLM used.

**Specifically, we:**

* Pass the retrieved context as part of the evaluation prompt.
* Instruct the evaluator to judge the answer **based only on that context.**

This makes the evaluation **fair**, **faithfulness-focused**, and **aligned with real-world RAG behavior**.

---

### ✍️ Example Evaluation Prompt (Simplified)

```yaml
System prompt:
You are an evaluator. Judge if the AI's answer is acceptable based ONLY on the provided context.

User input:
Question: What is the recent trend in the Consumer Price Index?
Context: CPI rose from 315.56 to 319.77 since October 4, 2024...
Answer: The CPI increased by 1.33% since October 2024, leading to higher costs...
```

By giving the model the relevant source information, we let it **accurately assess** whether the answer is grounded and relevant, even when the facts lie beyond the model’s original knowledge.
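
When reviewing a verdict by hand, it can help to recompute any figure that the answer derives from the context. The sketch below checks the percentage quoted in the example prompt above. `percent_change` is a hypothetical helper for illustration, not part of this notebook's pipeline.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# CPI figures taken from the example context above.
cpi_change = percent_change(315.56, 319.77)
print(f"{cpi_change:.2f}%")  # prints: 1.33%
```

The recomputed value matches the 1.33% stated in the example answer, which is the kind of agreement a context-grounded evaluator should be confirming.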

---

### 🎯 Outcome

This adjustment:

* Prevents unfair rejections of factually correct answers
* Ensures the evaluation focuses on **faithfulness to retrieved content**
* Aligns with best practices used in tools like **RAGAS**, **OpenAI evals**, and **LangChain eval chains**



In [None]:
#-------------------
# prompt template
#-------------------

retriever = vectorstore.as_retriever(search_kwargs={"k": K})

prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that uses business documents to answer questions.
Use the following context to answer the question as accurately as possible.

Context:
{context}

Question:
{question}

Answer:
""")

def generate_answer(question: str):
    """
    1. Retrieve K chunks
    2. Concatenate into a single context string
    3. Feed context + question to the generator LLM
    4. Return (answer, context)   <-- key!
    """
    # Step-1 ➜ retrieve top K document chunks
    docs = retriever.invoke(question)

    # Step-2 ➜ combine those into a context string
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step-3 ➜ fill the prompt template with context and question
    llm_input = prompt_template.format(context=context, question=question)

    # Step-4 ➜ call the LLM and get its plain text output
    answer = StrOutputParser().invoke(LLM_MODEL.invoke(llm_input))

    # Step-5 ➜ return the answer AND the context used
    return answer, context

# ---------------------
# Evaluator
# ---------------------

# Evaluator: accept context as an argument

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

evaluator_system_prompt = """
You are an evaluator. Judge whether the AI's answer is acceptable **based ONLY on the provided context**.

Check:
- factual correctness w.r.t. context
- clarity
- relevance

Respond in JSON:
{
  "is_acceptable": true/false,
  "feedback": "short rationale"
}
"""

def evaluate_response(question: str, answer: str, context: str) -> Evaluation:
    user_block = f"""
User Question:
{question}

Retrieved Context:
{context}

Agent Answer:
{answer}

Please evaluate the answer.
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user",   "content": user_block},
    ]

    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0,
    ).choices[0].message.content

    return Evaluation.model_validate_json(raw)

# ---------------------
# Questions
# ---------------------

qa_test = [
    "What are the recent economic indicators in Gainesville that affect local businesses?",
    "What is the recent trend in the Consumer Price Index (CPI) and how might it affect small businesses?",
    "How has the Consumer Confidence Index changed since December 2024?",
    # "What challenges might a business face if the Consumer Confidence Index drops?",
    # "How can inflation affect staffing or wage budgets for small businesses?",
    # "How are CPI and retail sales different in their impact on business operations?",
    # "Why should a local business owner monitor federal economic indicators like CPI, retail sales, and consumer confidence?",
    # "What are three key actions a business owner can take in response to recent economic trends?",
]

# ---------------------
# Evaluation
# ---------------------

for i, q in enumerate(qa_test, 1):
    answer, context = generate_answer(q)          # <- same context for both models
    result          = evaluate_response(q, answer, context)

    print("\n" + "="*120)
    print(f"🔢 Question {i}")
    print("📌", textwrap.fill(q, width=100), "\n")
    print("📝 Answer:\n", textwrap.fill(answer, 100), "\n")
    print("✅ Acceptable:", result.is_acceptable)
    print("💬 Feedback:\n", textwrap.fill(result.feedback, 100))
    print("="*120)



🔢 Question 1
📌 What are the recent economic indicators in Gainesville that affect local businesses? 

📝 Answer:
 Recent economic indicators in Gainesville that affect local businesses include declining consumer
confidence, softening retail sales, and rising unemployment. These factors contribute to revenue
uncertainty for even steady businesses, making better forecasting essential for navigating the
current economic landscape. 

✅ Acceptable: True
💬 Feedback:
 The answer accurately reflects the economic indicators mentioned in the context, is clear, and
directly addresses the user's question about recent economic indicators affecting local businesses.

🔢 Question 2
📌 What is the recent trend in the Consumer Price Index (CPI) and how might it affect small businesses? 

📝 Answer:
 The recent trend in the Consumer Price Index (CPI) shows a rise from **315.56 to 319.77**,
representing a **1.33% increase** since October 4, 2024. This upward trend in CPI indicates tha

Your pipeline is **working exactly as it should**. What you're seeing now is the **last remaining artifact** of a common RAG evaluation mistake, and you're ready to fix it fully.

---

## 🔍 Let’s break down what’s happening in Question 3

### ❌ Problem

> **Evaluator rejected a factually accurate answer** because:

* It saw **"December 2024"** as a future date (relative to its own cutoff),
* It then assumed the answering LLM must be hallucinating.

Even though:

* ✅ The answer was retrieved from your documents (so it's **grounded**),
* ✅ The RAG model did the right thing.

---

## ✅ Solution

To avoid this exact error, we need to **make the evaluator explicitly trust the context** rather than its own internal memory.

The updated evaluator prompt in the next cell addresses the root cause with two instructions:

* 🔒 “based ONLY on context” ensures it stops judging by its own memory
* 📅 “Assume the context is reliable even if it contains future dates” handles your 2024–2025 data






In [None]:
#-------------------
# prompt template
#-------------------

retriever = vectorstore.as_retriever(search_kwargs={"k": K})

prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that uses business documents to answer questions.
Use the following context to answer the question as accurately as possible.

Context:
{context}

Question:
{question}

Answer:
""")

def generate_answer(question: str):
    """
    1. Retrieve K chunks
    2. Concatenate into a single context string
    3. Feed context + question to the generator LLM
    4. Return (answer, context)   <-- key!
    """
    # Step-1 ➜ retrieve top K document chunks
    docs = retriever.invoke(question)

    # Step-2 ➜ combine those into a context string
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step-3 ➜ fill the prompt template with context and question
    llm_input = prompt_template.format(context=context, question=question)

    # Step-4 ➜ call the LLM and get its plain text output
    answer = StrOutputParser().invoke(LLM_MODEL.invoke(llm_input))

    # Step-5 ➜ return the answer AND the context used
    return answer, context

# ---------------------
# Evaluator
# ---------------------

# Evaluator: accept context as an argument

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

evaluator_system_prompt = """
You are an evaluator. Judge whether the AI's answer is acceptable based ONLY on the retrieved context provided.

You are not limited by your own knowledge. Assume the context is reliable, even if it contains facts or dates beyond your training cutoff.

Evaluate based on:
- ✅ Factual correctness based on context
- ✅ Clarity and coherence
- ✅ Relevance to the question

Respond in strict JSON format:
{
  "is_acceptable": true or false,
  "feedback": "short explanation"
}
"""


def evaluate_response(question: str, answer: str, context: str) -> Evaluation:
    user_block = f"""
User Question:
{question}

Retrieved Context:
{context}

Agent Answer:
{answer}

Please evaluate the answer.
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user",   "content": user_block},
    ]

    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0,
    ).choices[0].message.content

    return Evaluation.model_validate_json(raw)

# ---------------------
# Questions
# ---------------------

qa_test = [
#     "What are the recent economic indicators in Gainesville that affect local businesses?",
#     "What is the recent trend in the Consumer Price Index (CPI) and how might it affect small businesses?",
    "How has the Consumer Confidence Index changed since December 2024?",
    # "What challenges might a business face if the Consumer Confidence Index drops?",
    # "How can inflation affect staffing or wage budgets for small businesses?",
    # "How are CPI and retail sales different in their impact on business operations?",
    # "Why should a local business owner monitor federal economic indicators like CPI, retail sales, and consumer confidence?",
    # "What are three key actions a business owner can take in response to recent economic trends?",
]

# ---------------------
# Evaluation
# ---------------------

for i, q in enumerate(qa_test, 1):
    answer, context = generate_answer(q)          # <- same context for both models
    result          = evaluate_response(q, answer, context)

    print("\n" + "="*120)
    print(f"🔢 Question {i}")
    print("📌", textwrap.fill(q, width=100), "\n")
    print("📝 Answer:\n", textwrap.fill(answer, 100), "\n")
    print("✅ Acceptable:", result.is_acceptable)
    print("💬 Feedback:\n", textwrap.fill(result.feedback, 100))
    print("="*120)



🔢 Question 1
📌 How has the Consumer Confidence Index changed since December 2024? 

📝 Answer:
 The Consumer Confidence Index has dropped from **74.0 to 64.7** since December 1, 2024, reflecting a
**12.6% decline**. 

✅ Acceptable: True
💬 Feedback:
 The answer accurately reflects the information from the context regarding the change in the Consumer
Confidence Index since December 2024, providing both the numerical values and the percentage
decline.


In [None]:
# ---------------------
# Questions
# ---------------------

qa_test = [
    "What challenges might a business face if the Consumer Confidence Index drops?",
    "How can inflation affect staffing or wage budgets for small businesses?",
    "How are CPI and retail sales different in their impact on business operations?",
    "Why should a local business owner monitor federal economic indicators like CPI, retail sales, and consumer confidence?",
    "What are three key actions a business owner can take in response to recent economic trends?"
]

# ---------------------
# Evaluation
# ---------------------

for i, q in enumerate(qa_test, 1):
    answer, context = generate_answer(q)          # <- same context for both models
    result          = evaluate_response(q, answer, context)

    print("\n" + "="*120)
    print(f"🔢 Question {i}")
    print("📌", textwrap.fill(q, width=100), "\n")
    print("📝 Answer:\n", textwrap.fill(answer, 100), "\n")
    print("✅ Acceptable:", result.is_acceptable)
    print("💬 Feedback:\n", textwrap.fill(result.feedback, 100))
    print("="*120)



🔢 Question 1
📌 What challenges might a business face if the Consumer Confidence Index drops? 

📝 Answer:
 If the Consumer Confidence Index drops, a business might face several challenges, including:  1.
**Delayed Purchases**: Customers may postpone buying decisions, leading to a decrease in immediate
sales.  2. **Price Sensitivity**: Consumers may start shopping for lower prices, which can pressure
businesses to reduce prices or offer discounts, potentially impacting profit margins.  3. **Slower
Sales**: A general decline in consumer confidence often results in slower overall sales as people
become more cautious about spending.  4. **Hesitation to Spend**: Customers may exhibit hesitation
in making purchases, which can lead to longer sales cycles and reduced conversion rates.  5.
**Smaller Average Transactions**: Consumers may spend less per transaction, resulting in lower
revenue per sale.  6. **Reduced Store Visits**: There may be a decrease in the frequency of visits
to st

💥 That’s an excellent run: these results show your RAG pipeline is **not just functioning**, but actually **producing informative, grounded, high-quality responses**.

Let’s take a moment to reflect on what you’ve achieved and what your output tells us.

---

## ✅ Summary of Evaluation Results (Questions 1–3)

| Question | Focus                                        | Evaluator Verdict | Notes                                                             |
| -------- | -------------------------------------------- | ----------------- | ----------------------------------------------------------------- |
| **Q1**   | Challenges from dropping Consumer Confidence | ✅ True            | Answer listed clear, specific effects directly from the document  |
| **Q2**   | Inflation's effect on staffing/wages         | ✅ True            | Comprehensive explanation with correct logic and business framing |
| **Q3**   | Difference between CPI and Retail Sales      | ✅ True            | Nuanced comparison; very strong response                          |

These are not trivial questions. They require:

* Understanding **economic indicators**
* Linking macro-level trends to **local business outcomes**
* Using **retrieved source material** without hallucination

Your system passed all three with flying colors. 🎯

---

## 🚀 What This Means

You now have a pipeline that:

* Retrieves relevant, modern content from your own data
* Uses that context to answer nuanced business questions
* Evaluates the answer with another LLM *that understands the limits of its own training data*
* Outputs structured, human-readable QA logs

That mirrors the core retrieve, generate, and evaluate loop behind many **real-world RAG applications** like:

* Search assistants
* Enterprise chatbots
* Knowledge assistants for decision-making

---

## 🔧 Optional Next Steps

If you're curious, here are a few things we could add next:

### 1. **Log + Save Results**

Capture everything (question, answer, context, evaluation) into a CSV or JSONL so you can:

* Review failed examples
* Train new models
* Share results with stakeholders
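
One possible way to log results is to append one JSON object per line (JSONL). The sketch below uses only the standard library. `EvalRecord`, `append_record`, `load_records`, and `failed_records` are hypothetical names introduced here for illustration, with fields mirroring what the evaluation loop already has on hand.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical schema for illustration)."""
    question: str
    answer: str
    context: str
    is_acceptable: bool
    feedback: str


def append_record(path: Path, record: EvalRecord) -> None:
    """Append one evaluation record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def load_records(path: Path) -> list[dict]:
    """Read all records back as plain dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def failed_records(records: list[dict]) -> list[dict]:
    """Keep only the records the evaluator rejected, for manual review."""
    return [r for r in records if not r["is_acceptable"]]
```

Inside the evaluation loop, a record could be built from `q`, `answer`, `context`, `result.is_acceptable`, and `result.feedback`, then passed to `append_record`. Appending line by line means a crash partway through a run does not lose earlier results.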

### 2. **Retry/Improve**

For any future `"is_acceptable": false` evaluations:

* Automatically re-prompt or retry with a revised question
* Or increase `k` in your retriever to see if better context helps
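
The second option could be sketched as a small retry loop that widens retrieval until the evaluator accepts the answer. This is an assumption-laden outline: `answer_with_retry` and its `generate(question, k)` and `evaluate(question, answer, context)` callables are hypothetical. The notebook's `generate_answer` takes only a question, so it would need a `k` parameter (for example by building the retriever with `vectorstore.as_retriever(search_kwargs={"k": k})` inside it), and `evaluate` would wrap `evaluate_response(...).is_acceptable`.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical callable shapes, not functions defined in this notebook:
#   generate(question, k)               -> (answer, context)
#   evaluate(question, answer, context) -> bool
GenerateFn = Callable[[str, int], Tuple[str, str]]
EvaluateFn = Callable[[str, str, str], bool]


def answer_with_retry(
    question: str,
    generate: GenerateFn,
    evaluate: EvaluateFn,
    k_values: Sequence[int] = (2, 4, 8),
) -> Tuple[str, str, int, bool]:
    """Retry with a larger k until the evaluator accepts or k_values run out.

    Returns (answer, context, k_used, accepted) for the last attempt made.
    """
    if not k_values:
        raise ValueError("k_values must contain at least one value")

    for k in k_values:
        answer, context = generate(question, k)
        accepted = evaluate(question, answer, context)
        if accepted:
            break  # stop as soon as the evaluator is satisfied
    return answer, context, k, accepted
```

Each extra attempt costs one generation call and one evaluation call, so a short `k_values` sequence keeps the worst case bounded.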

### 3. **Score + Track Over Time**

Add metrics:

* % of answers marked acceptable
* Track by topic (e.g., inflation vs. retail sales)
* Build dashboards in pandas or Streamlit
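
The first two metrics could be computed with a few lines of standard-library Python. In this sketch each record is a dict with an `is_acceptable` flag and a `topic` label. The `topic` field is an assumption: the notebook does not currently tag questions by topic, so it would have to be added when logging.

```python
from collections import defaultdict


def acceptance_rate(records: list[dict]) -> float:
    """Percentage of records marked acceptable (0.0 when there are no records)."""
    if not records:
        return 0.0
    accepted = sum(1 for r in records if r["is_acceptable"])
    return 100.0 * accepted / len(records)


def acceptance_by_topic(records: list[dict]) -> dict[str, float]:
    """Acceptance percentage per topic, using the hypothetical 'topic' field."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        groups[r["topic"]].append(r)
    return {topic: acceptance_rate(rs) for topic, rs in groups.items()}
```

The resulting dictionary maps directly onto a pandas `Series` or a bar chart if a dashboard is added later.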




#### Remove Widgets

In [1]:
import json
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

notebook_path ="/content/drive/My Drive/AI | LANGCHAIN | RAG/LC_015_RAG_EVAL.ipynb"

# Load the notebook JSON
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = json.load(f)

# 1. Remove widgets from notebook-level metadata
if "widgets" in nb.get("metadata", {}):
    del nb["metadata"]["widgets"]
    print("✅ Removed notebook-level 'widgets' metadata.")

# 2. Remove widgets from each cell's metadata
for i, cell in enumerate(nb.get("cells", [])):
    if "metadata" in cell and "widgets" in cell["metadata"]:
        del cell["metadata"]["widgets"]
        print(f"✅ Removed 'widgets' from cell {i}")

# Save the cleaned notebook
with open(notebook_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f, indent=2)

print("✅ Notebook deeply cleaned. Try uploading to GitHub again.")

Mounted at /content/drive
✅ Notebook deeply cleaned. Try uploading to GitHub again.
