<a href="https://colab.research.google.com/github/micah-shull/LangChain/blob/main/LC_002_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🧭 Step 1: What Is RAG (Retrieval-Augmented Generation)?

RAG is a system design that enhances LLM outputs by:

1. **Retrieving** relevant information from an external knowledge base (such as a PDF or Word document)
2. **Feeding that information into the prompt** sent to the LLM
3. Letting the LLM generate a response **grounded** in retrieved facts

### 🔁 Pipeline Summary:

```
User Query → Retrieve Relevant Docs → Inject into Prompt → Generate Answer
```

This solves a key problem:

> LLMs don’t know about *your* business — RAG injects your domain knowledge at runtime.

---

## 💼 How You’ll Use It for Business

With your business documents (like manuals, reports, contracts, etc.), a RAG pipeline will:

* 🔎 Search your document corpus for relevant snippets
* 💬 Feed that context into a prompt
* 🧠 Let the LLM generate a **grounded, relevant answer**
* 🔄 Repeat for any question about your internal knowledge

For example:

> ❓ *"What is our refund policy on international orders?"*
> → Retrieves the relevant passages from your company’s policy PDF
> → Generates a clear, policy-grounded answer

---

## 🧰 How We'll Build It in LangChain

Here’s the **LangChain-based architecture** we’ll start with:

### 🧱 RAG Components in LangChain

| Component          | What it does                    | LangChain Tool                                    |
| ------------------ | ------------------------------- | ------------------------------------------------- |
| 📄 Document Loader | Load your business files        | `DirectoryLoader`, `PyPDFLoader`, etc.            |
| 🧹 Text Splitter   | Breaks long docs into chunks    | `RecursiveCharacterTextSplitter`                  |
| 📚 Embeddings      | Turns text into vectors         | `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, etc. |
| 🏭 Vector Store    | Stores and searches text chunks | `FAISS`, `Chroma`, `Weaviate`, etc.               |
| 🔎 Retriever       | Finds relevant docs per query   | `vectorstore.as_retriever()`                      |
| 💬 Prompt          | Wraps context + question        | `ChatPromptTemplate`                              |
| 🧠 LLM             | Generates answer                | `ChatOpenAI`, `HuggingFaceEndpoint`, etc.         |
| 🔗 Chain           | Glues it all together           | `Runnable` chain via the pipe operator (`\|`)     |

---

## 🔨 Example Flow We'll Build

```text
📄 Load + chunk documents
   ↓
🔢 Create embeddings
   ↓
🧠 Store in vector database (Chroma or FAISS)
   ↓
🔍 At runtime:
   → User query
   → Retrieve relevant chunks
   → Feed into prompt template
   → LLM generates grounded answer
```

---

## 🏁 Next Steps

Now that you have the “what and why,” here's how we’ll start building:

1. **Set up the doc loader and splitter**
2. **Embed and store in vector DB**
3. **Create retriever + prompt template**
4. **Pipe: retriever → prompt → LLM → parser**
5. **Query the chain with a real question**






## 🧱 Plan for RAG Notebook (Step-by-Step Build)

We'll build the following stages:

1. **📂 Load your `.txt` documents**
2. **🧹 Split text into chunks** (for better retrieval)
3. **🔢 Embed chunks using Hugging Face embeddings**
4. **📚 Store them in Chroma vector DB**
5. **🔎 Set up retriever**
6. **💬 Build prompt with retrieved docs + user query**
7. **🧠 Generate response using open-source model**
8. **✅ Pretty-print the result**



## Pip Install Packages

In [1]:
# !pip install -q langchain langchain-openai python-dotenv
# !pip install --upgrade --quiet  langchain-huggingface text-generation transformers google-search-results numexpr langchainhub sentencepiece jinja2 bitsandbytes accelerate

!pip install --upgrade --quiet langchain langchain-huggingface chromadb python-dotenv transformers accelerate sentencepiece bitsandbytes langchain-community




## 🔍 LangChain Package Breakdown for RAG

---

### 📁 `from langchain_community.document_loaders import TextLoader`

* **What it does:**
  Reads `.txt` files from disk and converts them into `Document` objects — LangChain’s internal format for chunks of text with metadata.

* **Example:**
  A single `.txt` file becomes:

  ```python
  Document(page_content="This is your text...", metadata={"source": "filename.txt"})
  ```

* **Why it matters:**
  RAG needs a structured format to store, retrieve, and inject chunks into LLM prompts.
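
To make the `Document` shape concrete, here is a minimal standard-library sketch of what loading a folder of `.txt` files amounts to. The `Doc` dataclass and `load_txt_folder` helper are hypothetical stand-ins written for illustration; they are not LangChain's `Document` or `TextLoader`, and the file contents are made-up sample text.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_txt_folder(folder: Path) -> list[Doc]:
    """Read every .txt file in a folder and record where it came from."""
    docs = []
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        docs.append(Doc(page_content=text, metadata={"source": str(path)}))
    return docs


# Build a throwaway folder with one .txt file and one file that should be skipped
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "policy.txt").write_text("Sample policy text.", encoding="utf-8")
    (folder / "notes.md").write_text("Not a .txt file.", encoding="utf-8")
    loaded = load_txt_folder(folder)

print(loaded[0].page_content)
```

Keeping the source path in `metadata` is what later lets a preview or an answer point back to the file a chunk came from.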

---

### 🔤 `from langchain.text_splitter import RecursiveCharacterTextSplitter`

* **What it does:**
  Breaks long texts into smaller chunks (like 500-character segments) while trying to split intelligently (at sentence or paragraph boundaries if possible).

* **Why it matters:**
  Most LLMs have a token limit. Breaking text into smaller overlapping chunks improves:

  * Retrieval accuracy
  * Context relevance
  * Overall LLM output quality
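
To show what "overlapping chunks" means, here is a simplified fixed-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph breaks); the `chunk_text` helper is a hypothetical illustration of the `chunk_size` and overlap idea only.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.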

---

### 🧠 `from langchain_huggingface import HuggingFaceEmbeddings`

* **What it does:**
  Converts each chunk of text into a **vector representation** (a list of numbers that capture semantic meaning).

* **Why it matters:**
  These embeddings are what allow you to later “search” for relevant chunks using cosine similarity or other vector math.
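
As a small illustration of the vector math involved, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are invented toy values; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" for three text chunks
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
parking_info = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, return_rules))
print(cosine_similarity(refund_policy, parking_info))
```

The two refund-related vectors point in nearly the same direction and score close to 1, while the unrelated one scores close to 0.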

---

### 🧠 `from langchain_huggingface import HuggingFaceEndpoint`

* **What it does:**
  A wrapper that lets LangChain send prompts to a hosted model on Hugging Face (like Zephyr or Mistral) via the Hugging Face Inference API.

* **Why it matters:**
  This is your LLM — the component that actually generates the response based on the prompt + retrieved context.

---

### 📚 `from langchain_community.vectorstores import Chroma`

* **What it does:**
  Stores the embeddings and allows similarity search (retrieval) later.

* **Why it matters:**
  This is your **knowledge base**. When a user asks a question, LangChain will:

  1. Embed the question
  2. Compare it to stored vectors
  3. Retrieve the most relevant text chunks
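
The three steps above can be sketched without any vector database. This toy version uses word counts as a stand-in for a real embedding model and a linear scan instead of an index, so it only illustrates the flow; `embed`, `cosine`, and `top_k` are hypothetical helpers, and the chunks are made-up sample text.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    query_vec = embed(query)  # 1. embed the question
    # 2. compare it to each chunk vector (a real store embeds chunks once, at index time)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]  # 3. keep the k most similar chunks


chunks = [
    "refunds on international orders take ten business days",
    "the office is closed on public holidays",
    "international shipping rates depend on package weight",
]
results = top_k("refunds for international orders", chunks, k=2)
print(results)
```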

---

### 💬 `from langchain_core.prompts import ChatPromptTemplate`

* **What it does:**
  Lets you define reusable, fill-in-the-blank-style prompt templates.

* **Why it matters:**
  This controls what the LLM sees — and how the retrieved documents and user question are formatted into a coherent prompt.
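
At its core, a prompt template is string substitution. The sketch below uses plain `str.format` rather than `ChatPromptTemplate`, with a hypothetical `build_prompt` helper and made-up sample chunks, to show how retrieved text and the question end up in one prompt.

```python
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Made-up sample chunks, for illustration only
prompt = build_prompt(
    ["Refunds are issued within 30 days.", "International orders ship from one warehouse."],
    "What is the refund window?",
)
print(prompt)
```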

---

### 📤 `from langchain_core.output_parsers import StrOutputParser`

* **What it does:**
  After the LLM returns a result, this extracts the text content of the model's reply and returns it as a plain string.

* **Why it matters:**
  The chain returns a **plain string** instead of a chat message object, which is easier to print or pass to later steps.

---

### 🔗 `from langchain_core.runnables import Runnable`

* **What it does:**
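  `Runnable` is the shared interface that LangChain components (retrievers, prompts, models, parsers) implement. It gives each one an `invoke()` method and lets you chain steps together using the `|` (pipe) syntax.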

* **Why it matters:**
  This is how you connect:

  ```
  retriever → prompt → LLM → parser
  ```

  Into a clean, readable, and reusable pipeline.
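
To show why the `|` syntax can chain steps at all, here is a toy class that overloads Python's `__or__` operator. It is an illustration of the idea only, not LangChain's implementation; `Step` and the three lambda steps are hypothetical.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Placeholder steps: a fake retriever, a prompt formatter, and a fake model call
retrieve = Step(lambda question: {"question": question, "context": "Refunds take 30 days."})
to_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: prompt.upper())

chain = retrieve | to_prompt | fake_llm
result = chain.invoke("How long do refunds take?")
print(result)
```

Because each `|` returns another `Step`, the finished chain has the same `invoke()` interface as its parts, which is what makes pipelines composable.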

---

### 🧾 `import textwrap`

* **What it does:**
  Nicely formats long text outputs in the notebook by wrapping lines to fit the screen.

---

## ✅ Summary

| Category        | Class                            | Role                                |
| --------------- | -------------------------------- | ----------------------------------- |
| 📂 Load Text    | `TextLoader`                     | Loads `.txt` files                  |
| 🧹 Preprocess   | `RecursiveCharacterTextSplitter` | Chunks text                         |
| 🔢 Vectorize    | `HuggingFaceEmbeddings`          | Turns chunks into vectors           |
| 🧠 LLM          | `HuggingFaceEndpoint`            | Calls an open-source LLM            |
| 📚 Store/Search | `Chroma`                         | Stores and retrieves chunks         |
| 💬 Prompt       | `ChatPromptTemplate`             | Formats the final LLM prompt        |
| 📤 Parse        | `StrOutputParser`                | Cleans up LLM output                |
| 🔗 Chain        | `Runnable`                       | Combines components into a pipeline |



## Load Libraries

In [7]:
# 🌿 Environment
import os
from dotenv import load_dotenv
import langchain; print(langchain.__version__)

# 📄 Document loading + text splitting
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 🔢 Embeddings + vector store
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# 🧠 Open-source LLM (Zephyr via Hugging Face Inference)
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

# 💬 Prompt & output
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 🔗 Chaining
from langchain_core.runnables import Runnable

# 🧾 Pretty output
import textwrap

0.3.25


In [None]:
# import getpass
# import os

# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()



## ✅ Step-by-Step Plan for Document Loading & Cleaning

### 🧾 1. **Load the `.txt` files**

We’ll loop through all files in the folder using `TextLoader`.

### 🧹 2. **Optional Cleaning**

Basic cleaning (e.g. stripping newlines, extra whitespace) is often helpful **before splitting**, especially if the files came from exports or copy-paste.

### ✂️ 3. **Split into chunks**

We’ll use `RecursiveCharacterTextSplitter` to chunk documents (typically 500–1000 characters with slight overlap for context continuity).

---

### 🧼 Why Basic Cleaning Helps

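* Removes stray line breaks and blank lines left over from exports or copy-paste
* Keeps chunks from being padded with whitespace, so more of each chunk is real content
* Standardizes format before embedding

One tradeoff: collapsing every newline also removes paragraph breaks, which `RecursiveCharacterTextSplitter` would otherwise prefer as split points, so splits fall back to spaces.

Later you can add more advanced cleaning (e.g., remove boilerplate, normalize headers), but this is a solid default.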
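
As a quick illustration of the whitespace cleaning described here, the snippet below applies the `" ".join(text.split())` idiom to a made-up messy string.

```python
# Made-up sample with blank lines, tabs, and repeated spaces
raw = "Average Weekly Earnings\n\n\n  What It Is:   This tracks\tweekly pay.\n"

# split() with no arguments breaks on any run of whitespace; join() rebuilds with single spaces
cleaned = " ".join(raw.split())
print(cleaned)
```

Every run of spaces, tabs, and newlines becomes a single space, which also means headings and paragraphs end up on one line.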





In [11]:
import os
import textwrap

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Path to your documents
docs_path = "/content/sample_data/CFFC_docs"

# Step 1: Load all .txt files in the folder
raw_documents = []
for filename in os.listdir(docs_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(docs_path, filename)
        loader = TextLoader(file_path, encoding="utf-8")
        docs = loader.load()
        raw_documents.extend(docs)

print(f"Loaded {len(raw_documents)} documents.")

# Step 2 (optional): Clean up newlines and extra whitespace
def clean_doc(doc: Document) -> Document:
    cleaned = " ".join(doc.page_content.split())  # Removes newlines & extra spaces
    return Document(page_content=cleaned, metadata=doc.metadata)

cleaned_documents = [clean_doc(doc) for doc in raw_documents]

# Step 3: Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=75
)

chunked_documents = splitter.split_documents(cleaned_documents)

print(f"Split into {len(chunked_documents)} total chunks.")

# Preview the first 5 chunks
print(f"Showing first 5 of {len(chunked_documents)} chunks:\n")

for i, doc in enumerate(chunked_documents[:5]):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(textwrap.fill(doc.page_content[:500], width=100))  # limit preview to 500 characters
    print("\n")


Loaded 7 documents.
Split into 64 total chunks.
Showing first 5 of 64 chunks:

--- Chunk 1 ---
Source: /content/sample_data/CFFC_docs/CFFC_Gainesville Economic Indicators That Matter to Local Businesses.txt

Cashflow 4Cast Gainesville Economic Indicators That Matter to Local Businesses on April 02, 2025 📗
Gainesville Economic Indicators That Matter to Local Businesses 1. Average Weekly Earnings
(Gainesville) What It Is: This tracks the average amount workers in Gainesville earn per week —
across all private sector jobs. It’s one of the clearest measures of take-home pay and gives insight
into what people can realistically afford. Why It Matters for Gainesville: When earnings drop,
households tighten


--- Chunk 2 ---
Source: /content/sample_data/CFFC_docs/CFFC_Gainesville Economic Indicators That Matter to Local Businesses.txt

Why It Matters for Gainesville: When earnings drop, households tighten their budgets. That means
fewer nights out, postponed purchases, and more cautious spendi


## 🏢 Where Businesses Store Chunked Data in Production

In production, storing your chunked + embedded data is **crucial** for:

* 🔄 Avoiding repeated computation
* 🔍 Fast retrieval
* 🧩 Scaling to large document sets
* 🧠 Supporting updates or reindexing

### ✅ Common Storage Options

| Option                            | Description                                       | Used When                                         |
| --------------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| **Chroma (on-disk)**              | Simple embedded DB, fast, file-based              | Local or lightweight deployments                  |
| **FAISS (with disk persistence)** | High-performance vector index                     | When doing custom similarity search, scaling up   |
| **Qdrant**                        | Open-source vector DB with REST API               | When you need a full DB server or remote access   |
| **Weaviate**                      | Scalable DB with hybrid search (vector + keyword) | For enterprise-grade RAG APIs                     |
| **Pinecone**                      | Hosted vector DB (SaaS)                           | Fully managed, production-grade retrieval         |
| **PostgreSQL + pgvector**         | SQL DB with vector support                        | When combining metadata + retrieval in one system |

---

## 🔐 Key Business Considerations

| Factor                 | Why it matters                                                 |
| ---------------------- | -------------------------------------------------------------- |
| **Persistence**        | You don’t want to re-chunk & re-embed every time your app runs |
| **Scalability**        | If your corpus grows, can the DB handle 100K+ chunks?          |
| **Search performance** | Sub-second vector search time is crucial for fast answers      |
| **Security**           | Internal document corpora need access control                  |
| **Deployment**         | On-prem vs cloud, latency, compliance, etc.                    |
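
The persistence row can be sketched as a "load if present, otherwise build and save" pattern. This standard-library version writes a JSON file and uses a hypothetical `toy_embed` function; a real vector store such as Chroma manages its own on-disk format, so treat this only as an illustration of the idea.

```python
import json
import tempfile
from pathlib import Path


def load_or_build_index(chunks, embed, index_path: Path) -> dict:
    """Return the saved index if it exists; otherwise embed the chunks and save them."""
    if index_path.exists():
        return json.loads(index_path.read_text(encoding="utf-8"))
    index = {"chunks": chunks, "vectors": [embed(chunk) for chunk in chunks]}
    index_path.write_text(json.dumps(index), encoding="utf-8")
    return index


calls = []  # records every embedding call so we can see when work is skipped


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an expensive embedding model."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    first = load_or_build_index(["alpha beta", "gamma"], toy_embed, path)
    second = load_or_build_index(["alpha beta", "gamma"], toy_embed, path)

print(len(calls))
```

The second call reads the file instead of embedding again. This sketch does not notice when the source documents change; a production setup would also need a way to detect stale entries and reindex.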






## ✅ Step: Embed + Persist in Chroma

We’ll use:

* `HuggingFaceEmbeddings` to convert text to vectors
* `Chroma` to store those vectors on disk so a later session can reload them instead of recomputing




In [13]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Step 1: Set up Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Step 2: Set up Chroma with persistence
persist_dir = "chroma_db"

vectorstore = Chroma.from_documents(
    documents=chunked_documents,
    embedding=embedding_model,
    persist_directory=persist_dir
)

print(f"✅ Stored {len(chunked_documents)} chunks in Chroma at '{persist_dir}'")

✅ Stored 64 chunks in Chroma at 'chroma_db'


## 🧠 Next Step: Retrieval + Generation (the "R" and "G" in RAG)

We’ll now:

1. 🔍 **Create a retriever** from your Chroma vectorstore
2. 🧾 **Build a prompt** that includes:

   * Retrieved chunks (context)
   * A user query
3. 🧠 **Send it to your LLM (Zephyr)**
4. ✅ **Get grounded answers** based on your document corpus

## ✅ Step 1: Create the Retriever


## 🔍 What `k` Controls

```python
retriever.search_kwargs = {"k": 4}
```

This tells your vector store to return the **top `k` most relevant document chunks** when you ask a question.

For example:

* If `k = 1`: You get only the single most relevant chunk.
* If `k = 4`: You get the top 4 chunks (in descending order of similarity to your query).

---

## ✅ Benefits of **Increasing** `k`

| Benefit                         | Explanation                                                                |
| ------------------------------- | -------------------------------------------------------------------------- |
| 🧠 **More context**             | Gives the LLM more information to base its answer on.                      |
| 🧩 **Better coverage**          | If relevant info is spread across several chunks, this helps catch it all. |
| 🚫 **Can reduce hallucination** | When the relevant facts are in the prompt, the model has less need to guess. |
| 📚 **Supports complex queries** | Especially useful when answering multi-part or nuanced questions.          |

---

## ⚠️ Tradeoffs of **Too Large** `k`

| Issue                     | Explanation                                                      |
| ------------------------- | ---------------------------------------------------------------- |
| 🌀 **Longer prompts**     | LLMs have a context-window limit. Too much context can overflow it. |
| 😕 **Diluted relevance**  | Later chunks may be less relevant, and can “distract” the model. |
| 🐢 **Slower performance** | More tokens = more cost + longer generation time.                |
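
One way to guard against over-long prompts is to cap the retrieved context before building the prompt. The sketch below is a hypothetical helper, not a LangChain feature, and it counts characters as a rough proxy for tokens, which is an assumption; a real budget would use the model's tokenizer.

```python
def fit_to_budget(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # the next chunk would exceed the budget, so stop here
        kept.append(chunk)
        used += len(chunk)
    return kept


# Four placeholder chunks of 400 characters each, already sorted by relevance
ranked = ["a" * 400, "b" * 400, "c" * 400, "d" * 400]
kept = fit_to_budget(ranked, max_chars=1000)
print(len(kept))
```

Because the retriever returns chunks in descending order of similarity, stopping early drops the least relevant ones first.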

---

## 🔧 Good Starting Defaults

| Use Case                | Suggested `k`                                       |
| ----------------------- | --------------------------------------------------- |
| 🔍 Simple Q\&A          | `k = 2–4`                                           |
| 📚 Summary or synthesis | `k = 4–6`                                           |
| 🧪 Experimental tuning  | Try `k = 1`, `3`, `5`, `10` and compare LLM answers |

You can even dynamically tune it based on query type — e.g., use higher `k` for vague or multi-part questions.
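
As a rough sketch of dynamic tuning, the hypothetical `choose_k` heuristic below widens retrieval for multi-part or long questions. The thresholds are arbitrary assumptions chosen for illustration and would need testing against real queries.

```python
def choose_k(question: str, base_k: int = 3, max_k: int = 6) -> int:
    """Hypothetical heuristic: retrieve more chunks for multi-part or long questions."""
    k = base_k
    # Count question marks and the word "and" as a crude signal of multiple parts
    parts = question.count("?") + question.lower().count(" and ")
    if parts > 1:
        k += 2
    if len(question.split()) > 20:
        k += 1
    return min(k, max_k)


simple_k = choose_k("What is our refund policy?")
multi_k = choose_k("What is the refund policy and how long does shipping take?")
print(simple_k, multi_k)
```

The result could then be passed as the `k` value in the retriever's `search_kwargs` before each query.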



In [15]:
retriever = vectorstore.as_retriever()

# Customize search depth:
retriever.search_kwargs = {"k": 3}  # retrieve the top 3 relevant chunks

## ✅ Step 2: Create the Prompt Template

We’ll create a structured prompt like this:

> "Use the following context to answer the question.
>
> Context:
> {context}
>
> Question: {question}
>
> Helpful Answer:"






In [16]:
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that uses business documents to answer questions.
Use the following context to answer the question as accurately as possible.

Context:
{context}

Question:
{question}

Answer:
""")


## ✅ Step 3: Create the RAG Chain

In [26]:
from langchain_core.runnables import RunnableLambda

# Load token from .env.
load_dotenv("/content/API_KEYS.env", override=True)

# Set up LLM
llm_HF = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.03,
    huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN")  # or HF_TOKEN if you renamed it
)

chat_model = ChatHuggingFace(llm=llm_HF)

rag_chain = (
    RunnableLambda(lambda d: {"question": d["question"], "docs": retriever.invoke(d["question"])})
    | RunnableLambda(lambda d: {
        "context": "\n\n".join([doc.page_content for doc in d["docs"]]),
        "question": d["question"]
    })
    | prompt_template
    | chat_model
    | StrOutputParser()
)
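
The chain above joins chunk text with blank lines. A possible variation, sketched here with a hypothetical `Doc` stand-in and `format_docs` helper rather than LangChain classes, labels each chunk with its source so the model (and the reader) can see where a passage came from. The chunk text is placeholder content.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join retrieved chunks, labelling each one with its source file."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[{i}] Source: {source}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    Doc("First chunk text.", {"source": "indicators.txt"}),
    Doc("Second chunk text."),
]
context = format_docs(docs)
print(context)
```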

## ✅ Step 4: Run a Query!

In [27]:
response = rag_chain.invoke({
    "question": "What are the recent economic indicators in Gainesville that affect local businesses?"
})

import textwrap
print("\n" + textwrap.fill(response, width=100))


According to Cashflow 4Cast's Federal Economic Indicators report for April 2, 2025, there are two
economic indicators that are impacting local businesses in Gainesville: inflation and unemployment
rate. Inflation, as measured by the Consumer Price Index (CPI), has been on the rise, with prices
for everyday goods and services increasing. This may lead to decreased consumer demand and
businesses rethinking hiring plans. The recent increase in the unemployment rate in Gainesville,
from 3.3% in December 2024 to 4.1% as of April 2, 2025, is also affecting local businesses. This
higher unemployment rate may result in fewer walk-in customers or clients, increased price
sensitivity among shoppers, and less confidence in hiring or expansion plans. These economic
indicators indicate that local businesses may need to adjust their strategies to adapt to the
changing economic climate.
