<a href="https://colab.research.google.com/github/micah-shull/LangChain/blob/main/LC_006_RAG_MetaData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## 🧪 Experiment Agenda: Using Metadata in RAG

### 🎯 Objective

To evaluate how adding **structured metadata** to our document loader affects the **quality, precision, and trustworthiness** of responses generated by our Retrieval-Augmented Generation (RAG) pipeline.

---

### 🧱 Key Steps

1. **Modify Document Loader**

   * Add metadata to each `Document` object (e.g., filename, date, region, category).
   * Example:

     ```python
     Document(page_content=text, metadata={"source": filename, "date": "2024-05-01"})
     ```

2. **Store Metadata in Vector Store**

   * Ensure metadata is preserved during embedding and storage with Chroma.

3. **Optional: Use Metadata in Prompt**

   * Inject source/date into context passed to the LLM (e.g., `[report1.txt - May 1, 2024]`).

4. **Compare Responses**

   * Run the same question with and without metadata.
   * Evaluate based on specificity, grounding, and perceived trust.

---

### 🧪 Hypotheses

* **H1:** Adding metadata will lead to more contextual and reliable answers.
* **H2:** Including metadata in the prompt will increase trustworthiness and reduce hallucinations.
* **H3:** Explicit filtering by metadata can improve response relevance.


## Pip Install Packages

In [1]:
!pip install --upgrade --quiet langchain langchain-huggingface chromadb python-dotenv transformers accelerate sentencepiece bitsandbytes langchain-community


## SET PARAMS

In [20]:
# SET MODEL PARAMS
EMBED_MODEL = "all-MiniLM-L6-v2"
LLM_MODEL = "gpt-3.5-turbo"
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
K = 2



## Load Libraries 🧾 Document Cleaning






In [3]:
# 🌿 Environment
import os
from dotenv import load_dotenv
import langchain; print(langchain.__version__)

# 📄 Document loading + text splitting
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 🔢 Embeddings + vector store
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# 🧠 Open-source LLM (Zephyr via Hugging Face Inference)
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

# 💬 Prompt & output
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 🔗 Chaining
from langchain_core.runnables import Runnable, RunnableLambda

# 🧾 Pretty output
import textwrap
from pprint import pprint

# Load token from .env.
load_dotenv("/content/API_KEYS.env", override=True)

# # Path to your documents
# docs_path = "/content/sample_data/CFFC_docs"

# # Step 1: Load all .txt files in the folder
# raw_documents = []
# for filename in os.listdir(docs_path):
#     if filename.endswith(".txt"):
#         file_path = os.path.join(docs_path, filename)
#         loader = TextLoader(file_path, encoding="utf-8")
#         docs = loader.load()
#         raw_documents.extend(docs)

# print(f"Loaded {len(raw_documents)} documents.")

# # Step 2 (optional): Clean up newlines and extra whitespace
# def clean_doc(doc: Document) -> Document:
#     cleaned = " ".join(doc.page_content.split())  # Removes newlines & extra spaces
#     return Document(page_content=cleaned, metadata=doc.metadata)

# cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# for i, doc in enumerate(cleaned_documents[:5]):
#     print(f"--- Chunk {i+1} ---")
#     print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
#     print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
#     print("\n")

0.3.25


True



## 🧪 **Experiment Summary: Metadata and RAG Output Quality**

### 🧭 Objective:

Evaluate how enriching documents with metadata affects the quality, specificity, and grounding of responses from a Retrieval-Augmented Generation (RAG) system.

---

## 🔧 What We’re Changing:

Instead of just loading plain documents, we’ll **add custom metadata** to each `Document` object. This metadata may include:

* `title`: Headline or theme of the document
* `date`: When it was written or published
* `author`: If known
* `doc_type`: Blog, Report, Testimonial, White Paper, etc.
* `tags`: Keywords or business domains (e.g., forecasting, machine learning, retail)

---

## 🎯 Evaluation Plan:

1. Add metadata to `Document` objects during load time.
2. Inject some of that metadata into the RAG prompt (or let the retriever use it for filtering, later).
3. Compare:

   * Output **specificity** (Do answers cite details from the document?)
   * Output **grounding** (Can each claim be traced back to the retrieved text?)
   * Output **variation** (Does metadata help differentiate doc types?)


## Doc MetaData

In [4]:
# Metadata mapping for each file
DOC_METADATA = {
    "CFFC_Consistency That Builds Confidence.txt": {
        "title": "Consistency That Builds Confidence",
        "date": "2025-03-25",
        "doc_type": "case_study",
        "tags": ["forecasting", "machine learning", "retail", "cashflow"]
    },
    "CFFC_Federal Economic Indicators That Impact Gainesville Businesses.txt": {
        "title": "Federal Economic Indicators That Impact Gainesville Businesses",
        "date": "2025-04-02",
        "doc_type": "economic_report",
        "tags": ["federal", "economic indicators", "macroeconomics", "small business"]
    },
    "CFFC_Forecasting You Can Trust in Uncertain Times.txt": {
        "title": "Forecasting You Can Trust in Uncertain Times",
        "date": "2025-03-25",
        "doc_type": "product_page",
        "tags": ["forecasting", "machine learning", "sales", "uncertainty", "30-day forecast"]
    },
    "CFFC_Gainesville Economic Indicators That Matter to Local Businesses.txt": {
        "title": "Gainesville Economic Indicators That Matter to Local Businesses",
        "date": "2025-04-02",
        "doc_type": "economic_report",
        "tags": ["gainesville", "employment", "unemployment", "retail", "weekly earnings", "consumer spending"]
    },
    "CFFC_Pricing.txt": {
        "title": "Cashflow 4Cast Pricing & Service Tiers",
        "date": "2025-03-24",
        "doc_type": "pricing_sheet",
        "tags": ["forecasting", "pricing", "services", "cashflow", "plans", "machine learning"]
    },
    "CFFC_Store Forecasting Accuracy - Store Summary Results.txt": {
        "title": "Store Forecasting Accuracy - Summary Results",
        "date": "2025-03-24",
        "doc_type": "benchmark_report",
        "tags": [
            "forecasting", "store_performance", "ML_comparison", "metrics",
            "cashflow", "model_evaluation", "accuracy", "error_reduction"]
    },
    "CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt": {
        "title": "Cut Cash Flow Forecasting Errors by 50%",
        "date": "2025-03-28",
        "doc_type": "marketing_material",
        "tags": ["forecasting", "cashflow", "machine learning", "retail", "business planning"]
    }
}
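
Because the loader falls back to an empty dict for any filename missing from the mapping, a typo in a key silently produces a document with no custom metadata. A minimal sketch of a sanity check follows, assuming the four keys used above are the intended schema; `REQUIRED_KEYS`, `validate_doc_metadata`, and the `sample` mapping are illustrative names, not part of the notebook.

```python
from datetime import date

# Assumed schema: the four keys used in the DOC_METADATA mapping above.
REQUIRED_KEYS = {"title", "date", "doc_type", "tags"}

def validate_doc_metadata(mapping: dict) -> list:
    """Return human-readable problems; an empty list means the mapping looks consistent."""
    problems = []
    for filename, meta in mapping.items():
        missing = REQUIRED_KEYS - meta.keys()
        if missing:
            problems.append(f"{filename}: missing keys {sorted(missing)}")
        try:
            date.fromisoformat(str(meta.get("date", "")))
        except ValueError:
            problems.append(f"{filename}: date {meta.get('date')!r} is not an ISO date")
        if not isinstance(meta.get("tags", []), list):
            problems.append(f"{filename}: tags should be a list of strings")
    return problems

# Tiny stand-in mapping so the sketch runs on its own; in the notebook, pass DOC_METADATA.
sample = {
    "good.txt": {"title": "T", "date": "2025-04-02", "doc_type": "economic_report", "tags": ["retail"]},
    "bad.txt": {"title": "T", "date": "April 2", "doc_type": "economic_report"},
}
problems = validate_doc_metadata(sample)
for problem in problems:
    print(problem)
```

Comparing `os.listdir(docs_path)` against `DOC_METADATA.keys()` would catch the opposite problem: files on disk with no entry in the mapping.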


## Load Docs & Metadata

In [5]:
# 📂 Path to your documents
docs_path = "/content/CFFC_docs"

# 📄 Load documents and apply metadata
raw_documents = []
for filename in os.listdir(docs_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(docs_path, filename)
        loader = TextLoader(file_path, encoding="utf-8")
        loaded_docs = loader.load()

        # Apply metadata
        metadata = DOC_METADATA.get(filename, {})
        for doc in loaded_docs:
            doc.metadata.update(metadata)

        raw_documents.extend(loaded_docs)

print(f"✅ Loaded {len(raw_documents)} documents with metadata.")

# 🧼 Clean up text (optional)
def clean_doc(doc: Document) -> Document:
    cleaned = " ".join(doc.page_content.split())
    return Document(page_content=cleaned, metadata=doc.metadata)

cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# 🔍 Preview a few documents
for i, doc in enumerate(cleaned_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {doc.metadata}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))
    print("\n")

✅ Loaded 7 documents with metadata.
--- Chunk 1 ---
Metadata: {'source': '/content/CFFC_docs/CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt', 'title': 'Cut Cash Flow Forecasting Errors by 50%', 'date': '2025-03-28', 'doc_type': 'marketing_material', 'tags': ['forecasting', 'cashflow', 'machine learning', 'retail', 'business planning']}

Cashflow 4Cast What If You Could Cut Cash Flow Forecasting Errors by 50%? on March 28, 2025 What If
You Could Cut Cash Flow Forecasting Errors by 50%? Every business lives or dies by its ability to
manage cash flow. Whether it’s covering payroll, restocking inventory, or preparing for a seasonal
dip — having reliable numbers makes all the difference. And yet, most small business owners are
flying blind with clunky spreadsheets or outdated tools that leave them guessing. That’s where
CashFlow4Cas


--- Chunk 2 ---
Metadata: {'source': '/content/CFFC_docs/CFFC_Consistency That Builds Confidence.txt', 'title': 'Consistency That Builds 

## Chunk Data

In [6]:
# Step 3: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_documents = splitter.split_documents(cleaned_documents)

print(f"Split into {len(chunked_documents)} total chunks.")

# Preview the first 5 chunks
print(f"Showing first 5 of {len(chunked_documents)} chunks:\n")

for i, doc in enumerate(chunked_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
    print("\n")

Split into 174 total chunks.
Showing first 5 of 174 chunks:

--- Chunk 1 ---
Source: /content/CFFC_docs/CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt

Cashflow 4Cast What If You Could Cut Cash Flow Forecasting Errors by 50%? on March 28, 2025 What If
You Could Cut Cash Flow Forecasting Errors by 50%? Every business lives or dies by its ability to


--- Chunk 2 ---
Source: /content/CFFC_docs/CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt

Every business lives or dies by its ability to manage cash flow. Whether it’s covering payroll,
restocking inventory, or preparing for a seasonal dip — having reliable numbers makes all the


--- Chunk 3 ---
Source: /content/CFFC_docs/CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt

dip — having reliable numbers makes all the difference. And yet, most small business owners are
flying blind with clunky spreadsheets or outdated tools that leave them guessing. That’s where


--- Chunk 4 ---
So

You're now *one step away* from building a powerful **metadata-aware agent** that dynamically adapts to user input. Here's how it could work:

---

### 🧠 Metadata-Aware Agent Flow

#### 🔹 1. **User Input**

User asks:

> “Can you summarize recent Gainesville economic trends from April 2025?”

#### 🔹 2. **Agent Interprets Query**

You parse the query to extract **intent + filters**:

```python
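# Chroma expects a single top-level operator, so combine conditions with "$and".
metadata_filter = {
    "$and": [
        {"doc_type": "economic_report"},
        {"date": "2025-04-02"},  # exact string match; ranges need a numeric date field
    ]
}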
```

#### 🔹 3. **Pass to Retriever**

Dynamically set the search like this:

```python
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3, "filter": metadata_filter}
)
```

#### 🔹 4. **RAG Response**

Only relevant, filtered documents are retrieved and fed into the LLM.

---

### ✅ Benefits

* 🔍 **Targeted Retrieval** — Users get only what they care about.
* 🧩 **Smarter Interaction** — Adapts to date ranges, topics, and formats.
* 🔒 **Context Compression** — Reduces noise, improves LLM performance.

---

### 🛠️ You Can Extend This With:

* **LangChain Agents** with tools like `search_docs`, `get_report_summary`, `filter_by_tags`.
* **Form-based or natural language query parsing** (with regex or LLM).
* **Metadata-aware routing**: route financial vs. product questions differently.
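
The query-interpretation step can be sketched with plain keyword and regex matching. This is a rough illustration under stated assumptions, not a full parser: the keyword map `DOC_TYPE_KEYWORDS` and the function `parse_query_to_filter` are hypothetical, and the date clauses assume a numeric `date_int` field (`YYYYMMDD`) has been stored in the metadata, which the notebook does not do.

```python
import re
from typing import Optional

# Hypothetical keyword -> doc_type map; the doc_type values come from DOC_METADATA.
DOC_TYPE_KEYWORDS = {
    "economic": "economic_report",
    "pricing": "pricing_sheet",
    "accuracy": "benchmark_report",
}
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def parse_query_to_filter(query: str) -> Optional[dict]:
    """Build a Chroma-style filter from keyword and 'Month YYYY' matches, or None."""
    q = query.lower()
    clauses = []
    for keyword, doc_type in DOC_TYPE_KEYWORDS.items():
        if keyword in q:
            clauses.append({"doc_type": doc_type})
            break
    match = re.search(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b", q)
    if match:
        month, year = MONTHS[match.group(1)], int(match.group(2))
        # Assumes a numeric `date_int` metadata field (YYYYMMDD); day 31 is a loose upper bound.
        base = year * 10000 + month * 100
        clauses.append({"date_int": {"$gte": base + 1}})
        clauses.append({"date_int": {"$lte": base + 31}})
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

print(parse_query_to_filter(
    "Can you summarize recent Gainesville economic trends from April 2025?"
))
```

Returning `None` when nothing matches lets the caller skip the `filter` key entirely instead of passing an empty dict.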




One catch with **Chroma**: it **does not support lists** (like your `"tags"` field) as metadata values.

---

### ❌ The Problem:

You're passing metadata like:

```python
"tags": ["forecasting", "machine learning", "sales"]
```

But Chroma only accepts metadata values that are:

> `str`, `int`, `float`, `bool`, or `None` — **not lists**

---

### ✅ Solution: Convert the list to a string



In [8]:
def flatten_metadata(doc: Document) -> Document:
    flat_metadata = {}
    for k, v in doc.metadata.items():
        if isinstance(v, list):
            flat_metadata[k] = ", ".join(map(str, v))  # Convert list to comma-separated string
        else:
            flat_metadata[k] = v
    return Document(page_content=doc.page_content, metadata=flat_metadata)

flat_chunked_documents = [flatten_metadata(doc) for doc in chunked_documents]
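
One limitation of the comma-separated string is that an equality filter compares the whole value, so a single tag cannot be matched on its own. A possible workaround, sketched here on plain metadata dicts rather than `Document` objects, is to also emit one boolean key per tag. The `tag_<slug>` naming and the function `flatten_with_tag_flags` are assumptions made for illustration.

```python
import re

def flatten_with_tag_flags(metadata: dict) -> dict:
    """Flatten list values to strings and add one boolean key per tag for exact filtering."""
    flat = {}
    for key, value in metadata.items():
        if isinstance(value, list):
            flat[key] = ", ".join(map(str, value))
            if key == "tags":
                for tag in value:
                    # "machine learning" -> "tag_machine_learning"
                    slug = re.sub(r"[^a-z0-9]+", "_", str(tag).lower()).strip("_")
                    flat[f"tag_{slug}"] = True
        else:
            flat[key] = value
    return flat

example = flatten_with_tag_flags(
    {"doc_type": "case_study", "tags": ["forecasting", "machine learning"]}
)
print(example)
```

With metadata shaped like this, a filter such as `{"tag_forecasting": True}` would select every chunk carrying that tag, regardless of which other tags it has.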

## ✅ Embed + Persist in Chroma




In [9]:
# Step 1: Set up Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

# Step 2: Set up Chroma with persistence
persist_dir = "chroma_db"

vectorstore = Chroma.from_documents(
    documents=flat_chunked_documents,  # ← Use the flattened version
    embedding=embedding_model,
    persist_directory=persist_dir
)

print(f"✅ Stored {len(flat_chunked_documents)} chunks in Chroma at '{persist_dir}'")


✅ Stored 174 chunks in Chroma at 'chroma_db'


**Metadata filtering** is a powerful and often underused parameter you can (and should) test. It directly affects what documents your retriever surfaces, which ultimately shapes the LLM’s output.

---

### 🔍 **How Metadata Filtering Becomes a Tunable Parameter**

Just like `chunk_size`, `overlap`, or `k`, you can experiment with:

#### 1. **Filtering by `doc_type`**

```python
{"doc_type": "economic_report"}
{"doc_type": {"$in": ["economic_report", "benchmark_report"]}}
```

#### 2. **Filtering by `tags`**

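```python
{"tags": "forecasting, cashflow"}  # Exact match against the whole flattened string
```

> *Note:* Chroma requires flattened metadata values, with no lists, and its equality filters compare the whole value. Once tags are stored as a comma-separated string (e.g., `"forecasting, cashflow"`), a filter like `{"tags": "forecasting"}` matches only documents whose entire tag string is `"forecasting"`, and no file in `DOC_METADATA` has that value. For true per-tag filtering, store each tag under its own key (for example, a boolean per tag).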

#### 3. **Filtering by `date` (if supported)**

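Exact-match filtering on the ISO date string works like any other string field (e.g., `{"date": "2025-04-02"}`). Range queries are different: Chroma's comparison operators (`$gt`, `$gte`, `$lt`, `$lte`) are meant for numeric values, so filtering by date range means storing the date as a number (a timestamp or a `YYYYMMDD` integer) alongside the string. Without that, you **can simulate** date prioritization using: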

* Pre-filtering before vector search
* Adding date bias in a reranker
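
If date ranges matter, one option is to store a numeric copy of each date at load time and filter on that. The sketch below assumes a hypothetical `date_int` metadata field holding `YYYYMMDD` integers; `date_to_int` and `date_range_filter` are illustrative helpers, not part of the notebook.

```python
from datetime import date

def date_to_int(iso_date: str) -> int:
    """Convert 'YYYY-MM-DD' to the integer YYYYMMDD, which sorts the same way as the date."""
    d = date.fromisoformat(iso_date)
    return d.year * 10000 + d.month * 100 + d.day

def date_range_filter(start_iso: str, end_iso: str, field: str = "date_int") -> dict:
    """Chroma-style filter for start <= field <= end (inclusive)."""
    return {"$and": [
        {field: {"$gte": date_to_int(start_iso)}},
        {field: {"$lte": date_to_int(end_iso)}},
    ]}

april_2025 = date_range_filter("2025-04-01", "2025-04-30")
print(april_2025)
```

At load time this would amount to something like `doc.metadata["date_int"] = date_to_int(doc.metadata["date"])`, after which the filter can be passed through `search_kwargs` like any other.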

---

### 🧪 **Testing Plan**

| Test Name         | Filter                               | Purpose                                           |
| ----------------- | ------------------------------------ | ------------------------------------------------- |
| `no_filter`       | None                                 | Baseline for retrieval                            |
| `economic_only`   | `{"doc_type": "economic_report"}`    | See if narrower focus improves grounding          |
| `marketing_only`  | `{"doc_type": "marketing_material"}` | Test how LLM reacts to softer business insights   |
| `forecasting_tag` | tag match                            | Test if semantic similarity can be guided by tags |
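
The test matrix can be kept as data so each run differs only in its filter. This is a minimal sketch: `TEST_FILTERS` and `build_search_kwargs` are illustrative names, and the `forecasting_tag` entry assumes a per-tag boolean key (`tag_forecasting`) exists in the stored metadata, which the notebook's `flatten_metadata` does not create.

```python
from typing import Optional

K = 2  # same value as in SET PARAMS

# Named filters mirroring the table; None means "no filter".
TEST_FILTERS = {
    "no_filter": None,
    "economic_only": {"doc_type": "economic_report"},
    "marketing_only": {"doc_type": "marketing_material"},
    "forecasting_tag": {"tag_forecasting": True},  # assumes a per-tag boolean key exists
}

def build_search_kwargs(k: int, metadata_filter: Optional[dict]) -> dict:
    """Return search_kwargs for as_retriever, omitting `filter` when it is None."""
    kwargs = {"k": k}
    if metadata_filter is not None:
        kwargs["filter"] = metadata_filter
    return kwargs

for name, metadata_filter in TEST_FILTERS.items():
    search_kwargs = build_search_kwargs(K, metadata_filter)
    print(f"{name:16s} -> {search_kwargs}")
    # In the notebook: retriever = vectorstore.as_retriever(search_kwargs=search_kwargs)
```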

---

### ✅ Why It’s Worth Testing

Filtering can:

* Improve **response quality** by limiting irrelevant info
* Improve **speed** by reducing search set
* Help **tailor tone** or domain specificity




## ✅ Create the Retriever

In [10]:
# No Filter (Baseline)
retriever = vectorstore.as_retriever(search_kwargs={"k": K})

# # Filter by Document Type
# retriever = vectorstore.as_retriever(
#     search_kwargs={"k": K, "filter": {"doc_type": "economic_report"}}
# )

# # Filter by Single Tag (exact match on the whole flattened `tags` string)
# retriever = vectorstore.as_retriever(
#     search_kwargs={"k": K, "filter": {"tags": "forecasting"}}
# )

## ✅ Create the Prompt Template

In [11]:
# prompt template
prompt_template = ChatPromptTemplate.from_template("""
You are a Goldman Sachs economist tasked with briefing Gainesville business owners.
Use the following economic documents to assess the local business climate.

Each document includes metadata such as title, date, type, and tags.

Instructions:
- Analyze the content and consider the metadata to understand the context and reliability.
- Present the key trends and their implications for local businesses.
- Be clear, concise, and strategic in tone — as if presenting to a room of professionals.

Context:
{context}

Question:
{question}

Answer:
""")


## ✅ Step 3: Create the RAG Chain & Run a Query!

In [12]:
!pip install -U langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.3.19-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.19-py3-none-any.whl (64 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.3.19


In [13]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(
    model=LLM_MODEL,  # "gpt-3.5-turbo", set in SET PARAMS
    temperature=0.4  # Moderate creativity; adjust as needed
)

## RETRIEVER TESTING

### TEST 1: No Metadata Filter

In [14]:
from langchain_core.runnables import RunnableLambda
from langchain_core.output_parsers import StrOutputParser
import textwrap

# Build the RAG chain
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([
            f"[{doc.metadata.get('title')}, {doc.metadata.get('date')}] {doc.page_content}"
            for doc in d["docs"]
        ]),
        "question": d["question"]
    })
    | prompt_template
    | chat_model
    | StrOutputParser()
)

# Run the chain
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Pretty print
print("\n" + textwrap.fill(response, width=100))



Recent economic indicators in Gainesville that affect local businesses include a shift in how people
feel about the economy, as well as the local job market and unemployment rate. These indicators can
create ripple effects throughout the community and put businesses under pressure. It is crucial for
Gainesville business owners to closely monitor these indicators to make informed decisions and adapt
their strategies accordingly to navigate the challenging business climate.


### TEST 2: Only retrieve economic reports

In [15]:
# Filtered Retriever: Only retrieve economic reports
retriever = vectorstore.as_retriever(
    search_kwargs={"k": K, "filter": {"doc_type": "economic_report"}}
)

# Build the RAG chain
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([
            f"[{doc.metadata.get('title')}, {doc.metadata.get('date')}] {doc.page_content}"
            for doc in d["docs"]
        ]),
        "question": d["question"]
    })
    | prompt_template
    | chat_model
    | StrOutputParser()
)

# Run the chain
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Pretty print
print("\n" + textwrap.fill(response, width=100))



Recent economic indicators in Gainesville that affect local businesses include a shift in how people
feel about the economy on a federal level, as well as local indicators such as the Gainesville
Unemployment Rate. These indicators suggest a community under pressure, which can impact consumer
spending, business investment, and overall economic activity in Gainesville. Business owners should
closely monitor these indicators to make informed decisions regarding their operations, marketing
strategies, and financial planning in response to the changing economic climate.


### TEST 1 vs TEST 2

### 🔍 **TEST 1: No Metadata Filtering**

* **Content:** Generic, high-level.
* **Tone:** Safe but vague.
* **Drawback:** Lacks specific reference points — likely because the retriever pulled from a wider variety of sources (e.g., marketing copy, product descriptions, etc.), diluting economic signal.
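
The "wider variety of sources" explanation is a guess until the retrieved chunks are inspected. A small sketch of that check follows, assuming only that each retrieved object exposes a `.metadata` dict; `summarize_retrieval` and the stand-in `fake_docs` are illustrative.

```python
from collections import Counter
from types import SimpleNamespace

def summarize_retrieval(docs) -> Counter:
    """Count retrieved chunks per doc_type; `docs` is any iterable of objects with `.metadata`."""
    return Counter(doc.metadata.get("doc_type", "unknown") for doc in docs)

# Stand-in objects so the sketch runs without a vector store.
fake_docs = [
    SimpleNamespace(metadata={"doc_type": "marketing_material"}),
    SimpleNamespace(metadata={"doc_type": "economic_report"}),
]
summary = summarize_retrieval(fake_docs)
print(summary)
```

In the notebook this would be called as `summarize_retrieval(retriever.invoke(question))`. An empty counter means the retriever returned nothing and the model answered without context.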

---

### 📁 **TEST 2: Filtered by `doc_type: economic_report`**

* **Content:** Still concise, but noticeably **more grounded**.
* **Mentions:** Specific concepts like the *Gainesville Unemployment Rate* and *federal-level trends*.
* **Upside:** Clearer tie to actual economic indicators — which is what your question targeted.

---

### ✅ **Conclusion**

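For this one question, filtering the retriever with `{"doc_type": "economic_report"}` **improves topical precision** without additional engineering, which makes a good first case for metadata-aware retrieval. Repeating the comparison across several questions would show whether the gain holds.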



### **TEST 3: Filter by Tag `"forecasting"`**

* **Content Quality:** Rich, highly detailed.
* **Structure:** Professional briefing with titles, dates, bullet points, and implications — a format that mirrors how analysts communicate.
* **Tone:** Strategic and proactive, fitting the “Goldman Sachs economist” persona perfectly.
* **Specificity:** Pulls in **GDP growth**, **employment rate changes**, and **business confidence survey data** — all highly relevant to the question.

---

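### ✅ **Why This Looked So Good**

1. **Tag Filtering → Intended Topical Focus**
   The intent was to restrict the context to docs tagged `"forecasting"`. Note, however, that `{"tags": "forecasting"}` is an exact match against the flattened tag string (e.g., `"forecasting, machine learning, retail, cashflow"`), so confirm that the retriever returned any chunks at all before crediting the filter.

2. **Prompt Persona**
   The economist persona pushes the model toward a structured, analytical briefing even when the context is thin, so polish alone is not evidence of grounding. The "Q2 2021" report cited in the output below does not appear anywhere in `DOC_METADATA`.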

---

### ⚖️ **Comparison to Earlier Results**

| Test | Filter               | Quality   | Specificity | Style                   |
| ---- | -------------------- | --------- | ----------- | ----------------------- |
| 1    | None                 | Basic     | Low         | Generic summary         |
| 2    | `doc_type`           | Focused   | Medium      | Better economic content |
| 3    | `tags="forecasting"` | Excellent | High        | Structured & strategic  |



In [16]:
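# Filter by Single Tag
# NOTE: equality filters compare the whole value, so this matches only chunks whose
# flattened `tags` string is exactly "forecasting".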
retriever = vectorstore.as_retriever(
    search_kwargs={"k": K, "filter": {"tags": "forecasting"}}
)

# Build the RAG chain
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([
            f"[{doc.metadata.get('title')}, {doc.metadata.get('date')}] {doc.page_content}"
            for doc in d["docs"]
        ]),
        "question": d["question"]
    })
    | prompt_template
    | chat_model
    | StrOutputParser()
)

# Run the chain
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Pretty print
print("\n" + textwrap.fill(response, width=100))



Good afternoon Gainesville business owners,  I am here to brief you on the recent economic
indicators in Gainesville that may impact your businesses. I have analyzed several economic
documents to provide you with a comprehensive overview of the local business climate.  Document 1:
Title: "Gainesville Economic Report - Q2 2021" Date: July 15, 2021 Type: Report Tags: GDP,
employment, consumer spending  Key Trends: - The GDP in Gainesville saw a 3% growth in the second
quarter of 2021, indicating a recovering economy. - Employment rates have increased by 2% compared
to the previous quarter, suggesting a growing job market. - Consumer spending has also shown a
slight uptick, particularly in the retail and hospitality sectors.  Implications: - With a growing
GDP and employment rates, local businesses can expect an increase in consumer demand. - Businesses
in the retail and hospitality sectors should prepare for higher foot traffic and adjust their
operations accordingly. - This economic gr


#### 🔍 Metadata Dilution

When you **inject metadata into the chunk’s content**, it **uses up part of your available token space**. If your `chunk_size` is too small, a significant portion of the chunk becomes metadata rather than **substantive content**, like this:

```text
Title: Forecasting You Can Trust in Uncertain Times  
Date: 2025-03-25  
Type: product_page  
Tags: forecasting, machine learning, sales, uncertainty, 30-day forecast  

[Actual content here... but only 100–150 tokens]
```

So the model ends up seeing mostly **labels**, not **information-rich text**.

---

## 🎯 Consequences of Too-Small Chunks with Metadata

| Problem                      | Why It Happens                                  |
| ---------------------------- | ----------------------------------------------- |
| ✅ Metadata is retrieved      | Good — but too much of it                       |
| ❌ Very little real context   | Metadata dominates the chunk                    |
| ❌ Hallucinations             | Model lacks enough content to ground its answer |
| ❌ Poor use of context window | You're underutilizing available capacity        |

---

## ✅ Recommended Fix

Increase both:

* **`chunk_size`** to something like **500–800 characters** (`RecursiveCharacterTextSplitter` counts characters by default, not tokens)
* **`chunk_overlap`** to **50–100** characters (to preserve continuity)

This ensures:

* Each chunk contains **enough metadata to orient the model**
* But also includes **a meaningful slice of real document content**

---

## 🛠 Example

```python
CHUNK_SIZE = 600
CHUNK_OVERLAP = 100
```

This gives:

* Room for \~75–100 characters of metadata
* Plus \~500 characters of meaningful document text
* With enough overlap to maintain cohesion across boundaries
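
To put rough numbers on the trade-off, the share of each context entry taken up by the `[title, date]` header can be computed directly. This sketch assumes the header format used in this notebook's chains; `header_share` is an illustrative helper.

```python
def header_share(title: str, date: str, chunk_chars: int) -> float:
    """Fraction of one context entry taken up by the '[title, date] ' header."""
    header = f"[{title}, {date}] "
    return len(header) / (len(header) + chunk_chars)

title = "Gainesville Economic Indicators That Matter to Local Businesses"
shares = {}
for chunk_chars in (200, 500, 1000):
    shares[chunk_chars] = header_share(title, "2025-04-02", chunk_chars)
    print(f"chunk={chunk_chars:4d} chars -> header is {shares[chunk_chars]:.0%} of the entry")
```

The share falls as chunks grow, which is the quantitative version of the argument above. A header that also carried type and tags would shift every figure upward.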




### TEST 4: Increase Chunk & Overlap Size

In [19]:
# SET MODEL PARAMS
EMBED_MODEL = "all-MiniLM-L6-v2"
LLM_MODEL = "gpt-3.5-turbo"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
K = 2

# 📂 Path to your documents
docs_path = "/content/CFFC_docs"

# 📄 Load documents and apply metadata
raw_documents = []
for filename in os.listdir(docs_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(docs_path, filename)
        loader = TextLoader(file_path, encoding="utf-8")
        loaded_docs = loader.load()

        # Apply metadata
        metadata = DOC_METADATA.get(filename, {})
        for doc in loaded_docs:
            doc.metadata.update(metadata)

        raw_documents.extend(loaded_docs)

print(f"✅ Loaded {len(raw_documents)} documents with metadata.")

# 🧼 Clean up text (optional)
def clean_doc(doc: Document) -> Document:
    cleaned = " ".join(doc.page_content.split())
    return Document(page_content=cleaned, metadata=doc.metadata)

cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# 🔍 Preview a few documents
for i, doc in enumerate(cleaned_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {doc.metadata}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))
    print("\n")

# Step 3: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_documents = splitter.split_documents(cleaned_documents)

print(f"Split into {len(chunked_documents)} total chunks.")

# Preview the first 5 chunks
print(f"Showing first 5 of {len(chunked_documents)} chunks:\n")

for i, doc in enumerate(chunked_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
    print("\n")



def flatten_metadata(doc: Document) -> Document:
    flat_metadata = {}
    for k, v in doc.metadata.items():
        if isinstance(v, list):
            flat_metadata[k] = ", ".join(map(str, v))  # Convert list to comma-separated string
        else:
            flat_metadata[k] = v
    return Document(page_content=doc.page_content, metadata=flat_metadata)

flat_chunked_documents = [flatten_metadata(doc) for doc in chunked_documents]

# Step 1: Set up Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

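# Step 2: Set up Chroma with persistence
# Use a separate directory from the first run: re-using "chroma_db" would likely add these
# chunks to the existing collection and mix 200- and 500-character chunks.
persist_dir = "chroma_db_chunk500"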

vectorstore = Chroma.from_documents(
    documents=flat_chunked_documents,  # ← Use the flattened version
    embedding=embedding_model,
    persist_directory=persist_dir
)

print(f"✅ Stored {len(flat_chunked_documents)} chunks in Chroma at '{persist_dir}'")

#=========   RETRIEVER TESTING ===========

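# No Filter (Baseline)
# retriever = vectorstore.as_retriever(search_kwargs={"k": K})

# Filter by Single Tag
# NOTE: equality filters compare the whole value, so this matches only chunks whose
# flattened `tags` string is exactly "forecasting".
retriever = vectorstore.as_retriever(
    search_kwargs={"k": K, "filter": {"tags": "forecasting"}}
)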

# prompt template
prompt_template = ChatPromptTemplate.from_template("""
You are a Goldman Sachs economist tasked with briefing Gainesville business owners.
Use the following economic documents to assess the local business climate.

Each document includes metadata such as title, date, type, and tags.

Instructions:
- Analyze the content and consider the metadata to understand the context and reliability.
- Present the key trends and their implications for local businesses.
- Be clear, concise, and strategic in tone — as if presenting to a room of professionals.

Context:
{context}

Question:
{question}

Answer:
""")


chat_model = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.4  # Moderate creativity; adjust as needed
)

# Build the RAG chain
rag_chain = (
    RunnableLambda(lambda d: {
        "question": d["question"],
        "docs": retriever.invoke(d["question"])
    })
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([
            f"[{doc.metadata.get('title')}, {doc.metadata.get('date')}] {doc.page_content}"
            for doc in d["docs"]
        ]),
        "question": d["question"]
    })
    | prompt_template
    | chat_model  # pipe the ChatOpenAI instance, not the model-name string
    | StrOutputParser()
)

# Run the chain
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

# Pretty print
print("\n" + textwrap.fill(response, width=100))

✅ Loaded 7 documents with metadata.
--- Chunk 1 ---
Metadata: {'source': '/content/CFFC_docs/CFFC_What If You Could Cut Cash Flow Forecasting Errors by 50%?.txt', 'title': 'Cut Cash Flow Forecasting Errors by 50%', 'date': '2025-03-28', 'doc_type': 'marketing_material', 'tags': ['forecasting', 'cashflow', 'machine learning', 'retail', 'business planning']}

Cashflow 4Cast What If You Could Cut Cash Flow Forecasting Errors by 50%? on March 28, 2025 What If
You Could Cut Cash Flow Forecasting Errors by 50%? Every business lives or dies by its ability to
manage cash flow. Whether it’s covering payroll, restocking inventory, or preparing for a seasonal
dip — having reliable numbers makes all the difference. And yet, most small business owners are
flying blind with clunky spreadsheets or outdated tools that leave them guessing. That’s where
CashFlow4Cas


--- Chunk 2 ---
Metadata: {'source': '/content/CFFC_docs/CFFC_Consistency That Builds Confidence.txt', 'title': 'Consistency That Builds 



You're feeding in **real documents with 2025 dates** via metadata, and yet the model is inventing **fictional documents from 2021** with detailed contents that don’t exist in your corpus. This is a classic case of **hallucination**, and here’s exactly why it's happening — and how to fix or mitigate it:

---

### 🔍 Why It's Still Hallucinating Dates Like "2021"

Despite your metadata being correct (`"date": "2025-04-02"` etc.), the model is:

1. **Generating fictional documents**
   It's not summarizing what’s in your chunks — it's fabricating reports that **sound plausible**, but **don’t exist** in the vectorstore. This means:

   * Either the chunks are too small or generic.
   * Or the metadata is present but **not emphasized enough** in the prompt or context.
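   * Or your prompt nudges the model to act like it's reading a report even if it's just guessing.
   * Or the retriever returned nothing: `{"tags": "forecasting"}` is an exact match on the flattened tag string, which no document in `DOC_METADATA` has, so `{context}` may have been empty.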

2. **Too little true data per chunk**
   In this pipeline the metadata lives in `doc.metadata`, and the `[title, date]` header is prepended only when the context string is built, so it does not consume the splitter's `chunk_size`. Even so, with `K = 2` and 500-character chunks the model sees only about 1,000 characters of document text, and short or generic chunks encourage it to guess.

---

### ✅ Solutions

Here’s what you should do next:

#### 1. **Boost chunk size again** (e.g. 750 or even 1000 characters)

* Keep overlap at 20%–25% (e.g. 150–200) to preserve continuity.
* This will allow full metadata **and** substantial document content to co-exist in a single chunk.

#### 2. **Tweak your context formatter to enforce metadata visibility**

Right now, the model might skim or ignore the metadata header. Instead, wrap it with clarity:

```python
"context": "\n\n".join([
    f"[TITLE: {doc.metadata.get('title')}, DATE: {doc.metadata.get('date')}, TYPE: {doc.metadata.get('doc_type')}] \n{doc.page_content}"
    for doc in d["docs"]
])
```

This makes the metadata stand out and reinforces that it’s meaningful.
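
A standalone version of that formatter can also guard against an empty retrieval, which otherwise leaves the prompt's context blank and invites invention. This is a sketch: `format_context` and the `NO_CONTEXT` sentinel are hypothetical names, and the stand-in document only mimics the `.metadata` and `.page_content` attributes of a LangChain `Document`.

```python
from types import SimpleNamespace

NO_CONTEXT = "NO DOCUMENTS RETRIEVED"  # hypothetical sentinel string

def format_context(docs) -> str:
    """Join retrieved chunks under labelled headers; return a sentinel when nothing was retrieved."""
    if not docs:
        return NO_CONTEXT
    return "\n\n".join(
        f"[TITLE: {d.metadata.get('title')}, DATE: {d.metadata.get('date')}, "
        f"TYPE: {d.metadata.get('doc_type')}]\n{d.page_content}"
        for d in docs
    )

# Stand-in document so the sketch runs without a vector store.
doc = SimpleNamespace(
    metadata={"title": "T", "date": "2025-04-02", "doc_type": "economic_report"},
    page_content="Unemployment edged up in March.",
)
print(format_context([doc]))
print(format_context([]))
```

Pairing the sentinel with a prompt instruction such as "if the context says no documents were retrieved, say so" gives the model an explicit way out.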

#### 3. **Anchor the prompt to real documents**

In your prompt, you might add a constraint like:

```
- Do not fabricate new reports; only summarize the information provided in the context.
- Refer to documents by title and date as shown.
```

This tells the model: **"Don't invent. Use what's in context."**

#### 4. **Lower temperature slightly**

If you're using `temperature=0.4`, you might want to try `0.2` for more grounded, conservative completions.

---

### Optional Bonus: Add a `source` field to each summary

If you want, you can ask the model to **cite which document** each point came from — since you gave it `title`, `date`, and `doc_type`, it could include something like:

> "According to *Gainesville Economic Indicators That Matter to Local Businesses* (April 2, 2025)..."
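
A cheap automated check for the specific failure seen here (invented 2021 dates) is to compare the years mentioned in the answer with those present in the context. This is only a heuristic sketch, not a general hallucination detector; `unsupported_years` is an illustrative helper, and the sample strings are abbreviated from the output and metadata shown earlier.

```python
import re

def unsupported_years(answer: str, context: str) -> set:
    """Return four-digit years that appear in the answer but nowhere in the context."""
    year_pattern = r"\b(?:19|20)\d{2}\b"
    return set(re.findall(year_pattern, answer)) - set(re.findall(year_pattern, context))

answer = 'Title: "Gainesville Economic Report - Q2 2021" Date: July 15, 2021'
context = "[Gainesville Economic Indicators That Matter to Local Businesses, 2025-04-02] ..."
flagged = unsupported_years(answer, context)
print(flagged)
```

Any non-empty result is a prompt to read the answer against the retrieved chunks before trusting it.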

