<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_054_RAG_CahsFlow4Cast_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ✅ The Goal:  
Run a chatbot that uses your **FAISS + blog chunks** to answer questions 24/7, likely via your website or app.

---

### 🧱 Key Components of a 24/7 Chatbot

| Component         | Purpose |
|------------------|---------|
| 🔍 **Vector Index (FAISS)** | Fast document retrieval |
| 📚 **Blog Chunks**          | Your knowledge base |
| 🤖 **LLM** (like Falcon or OpenAI) | Generates the response |
| 💬 **Chat UI / API**        | User-facing interface |
| ⚙️ **Backend App** (e.g., FastAPI) | Runs RAG logic and serves responses |
| ☁️ **Hosting Platform**     | Keeps your app running 24/7 |

---

### 🧱 Step-by-Step Plan

#### 1. **Start with a Simple Chatbot (Gradio + LLM)**
- Use `Gradio` to set up a basic chatbot interface
- Use a **helpful, instruction-tuned model** (like `tiiuae/falcon-7b-instruct`, `mistralai/Mistral-7B-Instruct`, or even `google/flan-t5-base`)
- Let the chatbot answer general questions naturally
- Test the tone, quality, and response clarity

---

#### 2. **Add Your Brand Voice or Domain Focus**
- Adjust prompts: "You are a helpful assistant for small businesses trying to forecast revenue."
- Add example questions to guide response style
- Validate that it feels like *your* bot

---

#### 3. **Integrate RAG for Accuracy + Depth**
- Once basic chat is solid:
  - Load your FAISS index
  - On user query, retrieve top chunks
  - Add those chunks as **context** to the prompt
  - Let the LLM answer with that grounded knowledge
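
The context-injection step above can be sketched as a small prompt builder. This is a minimal illustration, not the notebook's final implementation: `build_rag_prompt`, `max_chunks`, and the sample chunks are hypothetical, and the chunks stand in for whatever the FAISS index returns.

```python
def build_rag_prompt(question, chunks, max_chunks=3):
    """Assemble a prompt that places retrieved chunks ahead of the question."""
    # Keep only the first few chunks so the prompt stays short.
    selected = chunks[:max_chunks]
    # Number each chunk so the injected context is easy to inspect when debugging.
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(selected))
    return (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for results returned by the FAISS index.
retrieved = [
    "CPI tracks price changes for everyday goods.",
    "Rising CPI often leads customers to cut back on spending.",
]

# Start with a single chunk, then raise max_chunks once the output looks right.
rag_prompt = build_rag_prompt("What is CPI?", retrieved, max_chunks=1)
print(rag_prompt)
```

Starting with `max_chunks=1` keeps the prompt close to what the plain chatbot already handles, so any change in answer quality can be traced to the added context.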

---

### 🔄 Why This Works Better

| 🚫 All-at-once RAG Bot       | ✅ Build-Then-Integrate |
|-----------------------------|-------------------------|
| Hard to debug               | Easy to test components |
| Can feel robotic or brittle | Lets you iterate naturally |
| Context injection can break | You know what "normal" output looks like first |
| Slower to deploy            | Can get to a working version in minutes |

---

Planning ahead saves frustration later. Here's a list of **common issues** we might encounter when setting up the Gradio chatbot, along with **proactive solutions**:

---

### 🚧 Potential Problems & ✅ Proactive Fixes

| 🛑 Issue | 🧠 What It Means | ✅ How to Prevent or Fix |
|--------|----------------|-------------------------|
| **Model loading fails** | Large models like `falcon-7b` might be too heavy for free-tier Spaces or Colab | ✅ Start with a lightweight instruction-tuned model like `google/flan-t5-base`; use 7B models such as `mistralai/Mistral-7B-Instruct` or `tiiuae/falcon-7b-instruct` with caution and only with enough GPU memory (test locally first) |
| **Long load times** | First-time model loading can take minutes | ✅ Print a "loading model..." message, and cache in Colab to test performance |
| **Memory crashes (OOM)** | Large models on limited RAM environments crash | ✅ Try smaller models (like `flan-t5-base`), or test in Colab before deploying |
| **Chatbot doesn't respond naturally** | Prompting may not guide the model well | ✅ Use clear system prompts like: `"You are a helpful assistant for small business owners..."` |
| **API rate limits / Hugging Face issues** | Too many requests or expired token | ✅ Log in once with `huggingface_hub.login()` and monitor API usage. Use `.env` for token safety. |
| **Gradio UI issues (formatting, text overflow)** | Output is too long or hard to read | ✅ Use `Textbox`, add a line-wrap or max length, and validate layout in Colab |
| **Chatbot resets every time** | No memory between turns | ✅ This is expected unless we add memory — which we can layer in later |
| **Hugging Face Spaces deployment errors** | Missing files or bad requirements.txt | ✅ Include all files: `app.py`, `requirements.txt`, index files, metadata, and test the app locally before uploading |
| **RAG integration breaks** | When we add chunked context, prompt gets too long | ✅ Start simple, test with 1 chunk, and truncate carefully with `tokenizer.encode(..., truncation=True)` when we get there |

---

### 🔒 Best Practices (Now)

- ✅ **Use Colab to test locally** before pushing to Hugging Face Spaces
- ✅ **Pick a small model first** (`flan-t5-base`, `mistral-7b` if GPU available)
- ✅ **Start with a single-turn chatbot**
- ✅ **Keep your Hugging Face token in a `.env` file** in Colab instead of hard-coding it in the notebook
- ✅ **Install dependencies cleanly** (`!pip install gradio transformers`)




## Install and Import Dependencies

In [4]:
!pip install -q gradio transformers python-dotenv huggingface_hub

##  Load the Model and Set Up Chat Function

In [6]:
from huggingface_hub import login
from dotenv import load_dotenv
import os
from transformers import pipeline
import gradio as gr
import warnings
warnings.filterwarnings("ignore", message=".*The secret.*")

# Load the .env file
load_dotenv("/content/HUGGINGFACE_HUB_TOKEN.env")
# Login using the token
login(token=os.environ["HUGGINGFACE_HUB_TOKEN"])

# Load a small, instruction-tuned model
chatbot = pipeline("text2text-generation", model="google/flan-t5-base")

chat_history = []

def respond(message):
    prompt = f"Answer the following question in a helpful and clear way:\n\n{message}"
    result = chatbot(prompt, max_new_tokens=150)[0]["generated_text"]

    chat_history.append({
        "user": message,
        "bot": result
    })

    return result


Device set to use cpu


In [7]:
with gr.Blocks() as demo:
    gr.Markdown("## 💬 Small Business Chatbot")

    with gr.Row():
        chatbot_output = gr.Textbox(lines=6, label="Assistant", interactive=False)

    user_input = gr.Textbox(placeholder="Ask a question...", label="You")

    def handle_message(message):
        return respond(message)

    user_input.submit(fn=handle_message, inputs=user_input, outputs=chatbot_output)

demo.launch(debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://a6ebd29c827383c3c3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://a6ebd29c827383c3c3.gradio.live




## View Chat Logs

In [None]:
for i, turn in enumerate(chat_history):
    print(f"🔹 Q{i+1}: {turn['user']}")
    print(f"💬 A{i+1}: {turn['bot']}\n{'-'*60}")


# Next Steps

### 🧠 Option 1: **Refine the Chat Prompt First**
This gives you **stronger answers** *immediately* and helps validate your use case before adding complexity.

#### Why it’s valuable:
- Clarifies your assistant’s **tone, role, and style**
- Prevents vague or generic answers
- Easier to test how well the base model works *before* adding RAG

#### What we can do:
- Add a **clear system-style prompt** (like: "You are a helpful forecasting assistant for small business owners.")
- Preload common topics (economic indicators, forecasting accuracy, etc.)
- Add examples of ideal responses as in-context few-shot learning

✅ **Recommended next if you want better answers right now.**

---

### 🔍 Option 2: **Add RAG (Retrieval-Augmented Generation)**
This helps you **pull in real knowledge** from your blog and business content.

#### Why it’s valuable:
- Gives the model *real substance* from your blog
- Helps it answer questions that aren't in the base model's training
- Makes your chatbot **truly yours**

#### What we can do:
- Embed the question
- Retrieve top-k chunks from your blog using FAISS
- Concatenate the chunks and send them as context with the question
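
The three retrieval steps above can be sketched without any external libraries. This is a toy illustration under stated assumptions: the 3-dimensional vectors are made up, `cosine` and `top_k` are hypothetical helpers, and the brute-force sort stands in for the nearest-neighbour search a FAISS index would perform over real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


# Toy "embeddings"; real ones would come from an embedding model.
chunks = ["CPI basics", "Hiring tips", "Inflation and pricing"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user question

top_chunks = top_k(query_vec, chunk_vecs, chunks, k=2)
# Concatenate the retrieved chunks into one context string for the prompt.
context = "\n\n".join(top_chunks)
print(context)
```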

✅ **Recommended next if you're ready to scale to real business Q&A.**

---

### 🚀 Best Practice Order:
1. ✅ **Refine your prompt** to get the tone, role, and helpfulness right  
2. ✅ Then **add RAG** to feed it real data for more grounded answers  
3. ✅ Later: add multi-turn memory, logging, and deployment (e.g., Hugging Face Spaces)
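
For the multi-turn memory mentioned in the last step, one simple approach is to fold the most recent turns back into the prompt. The sketch below assumes the same `{"user": ..., "bot": ...}` record shape that `chat_history` uses earlier in this notebook; `build_prompt_with_history` and `max_turns` are hypothetical names.

```python
def build_prompt_with_history(message, history, max_turns=3):
    """Prepend the most recent turns so the model sees prior context."""
    # Guard against max_turns=0, where history[-0:] would return everything.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for turn in recent:
        lines.append(f"Q: {turn['user']}")
        lines.append(f"A: {turn['bot']}")
    # End with the new question and an open answer slot for the model.
    lines.append(f"Q: {message}")
    lines.append("A:")
    return "\n".join(lines)


history = [
    {"user": "What is CPI?", "bot": "It tracks consumer prices."},
    {"user": "Why does it matter?", "bot": "It signals inflation."},
]
prompt = build_prompt_with_history("How often is it published?", history, max_turns=1)
print(prompt)
```

Capping the number of turns matters with small models, because every remembered turn uses up part of a short input window.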


## ✅ Step 1: Create a Strong System Prompt

Adding **guardrails through a strong system prompt** is **one of the most effective ways** to:

✅ Keep the chatbot focused  
✅ Avoid wandering into unsafe or off-topic territory  
✅ Increase user trust and experience  

---

### 🔒 Why Guardrails Matter

Even small models like `flan-t5-base` will try to answer **anything** you throw at them unless the prompt tells them to decline off-topic questions and redirect, for example:

> ❌ "That's not within my expertise."  
> ✅ "Let me help you with something related to forecasting or business planning."

In public-facing tools, this helps prevent **confusion**, **content liability**, and **weird answers** that can harm credibility.

---

### ✅ How to Add These Guardrails (Prompt Snippet)

Here's how you might extend your system prompt:

```txt
You are a helpful assistant for small business owners.

Your expertise includes:
- Cash flow forecasting
- Economic indicators
- Business decision-making
- Data-driven growth strategies

Only answer questions related to these topics. If a question is outside your domain (e.g. politics, health, personal advice), politely respond:

"I'm trained to assist with business forecasting and strategy. Let me know how I can help in that area!"

Here are a few examples of questions you can help with:
Q: What economic indicators matter most for retail?
A: ...
...
```

---

### 🛡️ Bonus Techniques You Can Add Later

- **Classify user questions** before answering (zero-shot or intent classifier)
- **Reject or redirect** with: `"I'm only able to assist with..."`  
- **Log off-topic queries** to improve your examples over time
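
The three techniques above could be wired together roughly as follows. This is a minimal sketch, assuming a plain keyword check stands in for a real zero-shot or intent classifier; `IN_SCOPE_KEYWORDS`, `is_in_scope`, `guarded_respond`, and `off_topic_log` are hypothetical names, and `answer_fn` would be the notebook's `respond` function.

```python
FALLBACK = (
    "I'm trained to assist with business forecasting and strategy. "
    "Let me know how I can help in that area!"
)

# Hypothetical keyword list; a zero-shot or intent classifier could replace it.
IN_SCOPE_KEYWORDS = {"forecast", "cash flow", "cpi", "inflation", "sales", "revenue"}

# Off-topic questions are collected here for later review.
off_topic_log = []


def is_in_scope(message):
    """Return True if the message mentions any in-scope keyword."""
    text = message.lower()
    return any(keyword in text for keyword in IN_SCOPE_KEYWORDS)


def guarded_respond(message, answer_fn):
    """Answer in-scope questions; log and redirect everything else."""
    if not is_in_scope(message):
        off_topic_log.append(message)
        return FALLBACK
    return answer_fn(message)


# A lambda stands in for the model call so the sketch runs on its own.
reply_ok = guarded_respond("How do I forecast revenue?", lambda m: "model answer")
reply_off = guarded_respond("Who should I vote for?", lambda m: "model answer")
print(reply_ok)
print(reply_off)
```

Checking scope in code before calling the model is more dependable than relying on a small model to follow a refusal instruction on its own.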

---

The prompt defined below sets the assistant's role and tone. It does not yet include the out-of-scope restriction or the fallback message from the snippet above; both can be appended to it.

In [None]:
SYSTEM_PROMPT = """You are a helpful, friendly assistant who helps small business owners improve their forecasting and understand key economic indicators.

You specialize in:
- Explaining economic indicators like CPI, consumer confidence, and retail sales
- Helping owners interpret forecasting accuracy
- Recommending how to adjust based on business conditions
- Using real-world examples and simple language

Respond clearly and conversationally. Avoid overly technical jargon."""


## ✅ Step 2: Add In-Context Few-Shot Examples


Adding examples to the prompt is a form of **prompt engineering** known as **in-context learning** (few-shot prompting), which is a powerful way to *steer the model* without needing to retrain it.

---

### 🧠 What’s Happening Behind the Scenes

- When you send a message like  
  > “What is CPI and why does it matter?”  
  you're not just sending that alone.

- You're sending the entire crafted prompt:

```plaintext
You are a helpful, friendly assistant who helps small business owners...
[+ examples of great Q&A]
Now answer this new question:
Q: What is CPI and why does it matter?
A:
```

- The model uses the **patterns** in your examples to generate a consistent, on-brand, and topic-aware response.

---

### 🔍 Why This Works So Well

✅ No fine-tuning required  
✅ Super flexible — just update your examples  
✅ You can A/B test prompts easily  
✅ It's cheaper and faster than model training  
✅ Perfect for early-stage prototypes or small business chatbots

---

If you're up for it, we can also:
- Add **topic-aware branching** (e.g. different examples depending on whether it's about inflation or sales)
- Store previous prompts to experiment with **prompt refinement**
- Add a slider for *response length or helpfulness*
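
Topic-aware branching, the first idea above, might look like the sketch below. It assumes a simple keyword match decides the topic; `EXAMPLES_BY_TOPIC`, `TOPIC_KEYWORDS`, `pick_topic`, and `examples_for` are hypothetical names, and the example texts are shortened placeholders rather than the notebook's real examples.

```python
# Hypothetical few-shot examples grouped by topic.
EXAMPLES_BY_TOPIC = {
    "inflation": "Q: What is CPI?\nA: CPI tracks how prices change for everyday goods.",
    "sales": "Q: How can I tell if my forecasts are accurate?\nA: Compare your forecast with actual sales.",
}
DEFAULT_TOPIC = "sales"

# Keywords checked in insertion order, so "inflation" is tested first.
TOPIC_KEYWORDS = {
    "inflation": ("cpi", "inflation", "prices"),
    "sales": ("sales", "forecast", "revenue"),
}


def pick_topic(message):
    """Return the first topic whose keywords appear in the message."""
    text = message.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return DEFAULT_TOPIC


def examples_for(message):
    """Select the few-shot examples that match the question's topic."""
    return EXAMPLES_BY_TOPIC[pick_topic(message)]


print(examples_for("Why are prices rising?"))
```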

There’s a **spectrum** between **in-context learning** and **RAG**: if you keep feeding the model more and more context manually, you’re slowly building a *manual RAG system* without calling it that.

---

## 🧠 In-Context Learning vs RAG

| Feature | **In-Context Learning** | **RAG (Retrieval-Augmented Generation)** |
|--------|-------------------------|-----------------------------------------|
| 🔧 Setup | Handcrafted prompt with examples | Dynamic retrieval of relevant info |
| 📦 Data | Static and hardcoded | Stored in an index (like FAISS) |
| 📈 Scaling | Manual — hits token limit fast | Automatic — fetches only what’s needed |
| 🔁 Adaptability | Rigid, needs manual updates | Flexible, can grow as your docs grow |
| 💡 Use Case | Small fixed topics | Large, evolving knowledge bases |

---

## ✅ So What Are the Limits of In-Context Learning?

1. **Token Limits**  
   Small models like `flan-t5-base` have short input limits, typically around 512–1024 tokens. You’ll run out of room fast if you add too many examples.

2. **Relevance Weakens**  
   The more you pack in, the harder it is for the model to focus. Unlike RAG, it won’t *retrieve* the best examples — it just reads all of them equally.

3. **Hard to Update**  
   If your blog changes or you want to update one fact, you have to manually rewrite your prompt examples.
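
The token-limit constraint can be made concrete with a rough budget check. This sketch counts whitespace-separated words as a crude stand-in for tokens (a real tokenizer will usually report a higher count); `rough_token_count`, `fit_examples`, and the budget value are hypothetical.

```python
def rough_token_count(text):
    """Crude estimate: whitespace-separated words, not real tokens."""
    return len(text.split())


def fit_examples(examples, budget):
    """Keep examples in order until adding another would exceed the budget."""
    kept, used = [], 0
    for example in examples:
        cost = rough_token_count(example)
        if used + cost > budget:
            break
        kept.append(example)
        used += cost
    return kept, used


examples = [
    "Q: What is CPI? A: It tracks consumer prices.",
    "Q: What is inflation? A: It means prices are rising.",
    "Q: Which indicators matter? A: Unemployment and inflation.",
]

# With a budget of 20 "tokens", only the first two examples fit.
kept, used = fit_examples(examples, budget=20)
print(len(kept), used)
```

In practice the count would come from the model's own tokenizer, and the budget would leave room for the system prompt, the question, and the generated answer.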

---

## 🎯 Where RAG Begins to Shine

RAG steps in **exactly when**:
- You have too many examples to fit in context
- Your content changes often
- You want to scale without prompt clutter

So yes — in-context learning is great for:
> “Small, focused, handcrafted expertise.”

RAG is great for:
> “Scalable, flexible knowledge bases.”



In [None]:
EXAMPLES = """
Q: What is CPI and why does it matter for my store?
A: CPI stands for Consumer Price Index. It tracks how much prices are rising for everyday goods. If CPI is going up, it means inflation is rising — and your customers may start cutting back on spending. Watching CPI helps you plan for slower sales and adjust your pricing.

Q: How can I tell if my forecasts are accurate?
A: One good way is to look at your forecasting error. If your forecast says you’ll sell $10,000 but you only sell $8,000, you missed by $2,000, which is 20% of your forecast. A well-tuned model should get you under 10% error most of the time.

Q: What economic indicators are most useful?
A: Start with local unemployment, consumer spending, and inflation. These show how confident your customers are and how much they can afford. If those numbers shift, your sales probably will too.
"""
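
The error figure in the second example depends on what the miss is divided by. A small worked sketch, using a hypothetical `percent_error` helper, shows both conventions for the same $10,000 forecast and $8,000 actual:

```python
def percent_error(forecast, actual, relative_to="forecast"):
    """Absolute miss as a percentage of either the forecast or the actual value."""
    base = forecast if relative_to == "forecast" else actual
    return abs(forecast - actual) * 100 / base


vs_forecast = percent_error(10_000, 8_000, relative_to="forecast")
vs_actual = percent_error(10_000, 8_000, relative_to="actual")
print(f"{vs_forecast:.0f}% of forecast, {vs_actual:.0f}% of actual")
```

Dividing by the forecast gives 20%, while dividing by actual sales gives 25%, so it helps to state which convention the chatbot's examples use.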


## ✅ Step 3: Combine Into a Prompt Template

In [None]:
def build_prompt(message):
    return f"""{SYSTEM_PROMPT}

Here are some examples of helpful answers:

{EXAMPLES}

Now answer this new question:

Q: {message}
A:"""


## ✅ Final Updated Chatbot Function

In [None]:
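def respond(message):
    # Wrap the user message in the system prompt and few-shot examples
    prompt = build_prompt(message)
    # Generate up to 200 new tokens and keep only the generated text
    result = chatbot(prompt, max_new_tokens=200)[0]["generated_text"]
    return result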


# Memory Cleanup & Remove Widgets from Notebook to Save to GitHub

In [None]:
import torch
torch.cuda.empty_cache()

import json
from google.colab import drive
drive.mount('/content/drive')

# Path to the notebook file to clean (adjust so it points at the notebook you are saving)
notebook_path = "/content/drive/My Drive/LLM/LLM_053_RAG_CahsFlow4Cast_Embeddings.ipynb"


# Load the notebook JSON
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = json.load(f)

# Remove the widget metadata if it exists
if 'widgets' in nb.get('metadata', {}):
    del nb['metadata']['widgets']

# Save the cleaned notebook
with open(notebook_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f, indent=2)

print("Notebook metadata cleaned. Try saving to GitHub again.")
