<a href="https://colab.research.google.com/github/mdehghani86/AppliedGenAI/blob/main/M8_Lab1_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background: linear-gradient(90deg, #eef4fc 0%, #ddeafe 100%); border-radius: 16px; box-shadow: 0 6px 24px rgba(0,85,212,0.08); padding: 34px 36px 32px 36px; max-width: 820px; margin: 30px auto 36px auto; font-family: 'Segoe UI', Arial, sans-serif; color: #152033;">

  <h1 style="color: #0055d4; font-size: 2.3rem; margin-bottom: 10px; letter-spacing: -1px;">
    🔗 LangChain Lab 5: Retrieval-Augmented Generation (RAG)
  </h1>
  <div style="font-size: 1.07rem; color: #0055d4; margin-bottom: 22px;">
    Prof. Dehghani <span style="color:#2d3c66; font-size:1rem;">(m.dehghani@northeastern.edu)</span>
  </div>

  <h2 style="margin-top:18px; color:#1b2e5b;">📖 Introduction to RAG</h2>
  <h3 style="font-weight:600; color:#284ab6; margin-top:8px; font-size:1.12rem;">
    🔹 What is Retrieval-Augmented Generation (RAG)?
  </h3>
  <div style="margin-bottom:12px;">
    <b>Retrieval-Augmented Generation (RAG)</b> is a technique that enhances Large Language Models (LLMs) by retrieving external knowledge before generating a response. Instead of relying solely on a model's pretrained knowledge, RAG fetches relevant <b>documents, database entries, or structured data</b> to improve accuracy.
  </div>

  <h3 style="font-weight:600; color:#284ab6; font-size:1.12rem;">
    🚀 Why Use RAG?
  </h3>
  <ul style="margin: 0 0 8px 18px; padding: 0;">
    <li><b>✅ Limited Knowledge</b> – An LLM's knowledge is frozen at training time; it cannot pick up new facts without retraining.</li>
    <li><b>✅ Hallucinations</b> – Models sometimes generate incorrect or fabricated information.</li>
    <li><b>✅ Domain-Specific Needs</b> – In specialized fields like <b>finance, law, or medicine</b>, retrieval grounds answers in authoritative sources.</li>
  </ul>
  <div style="margin-bottom:16px;">
    <b>RAG mitigates these issues by combining retrieval and generation,</b> allowing models to fetch relevant knowledge on demand.
  </div>

  <h3 style="font-weight:600; color:#284ab6; font-size:1.12rem;">
    🛠️ How Does RAG Work?
  </h3>
  <ol style="margin: 0 0 12px 25px;">
    <li><b>Retrieval:</b> The system <b>searches for relevant information</b> in a knowledge source (e.g., vector database, documents).</li>
    <li><b>Generation:</b> The retrieved information is <b>passed as context</b> to an LLM, which generates a response based on both its knowledge and the retrieved data.</li>
  </ol>
  <div style="margin-bottom:18px;">
    <span style="background: #e5f0ff; border-radius: 7px; padding: 4px 10px;">
      📌 <b>Example Use Case:</b> A chatbot answering questions about <b>company policies</b> can use RAG to pull up official policy documents instead of relying only on pre-trained responses.
    </span>
  </div>

  <h3 style="font-weight:600; color:#284ab6; font-size:1.12rem;">
    🔬 Comparison: Traditional LLM vs. RAG
  </h3>
  <table style="width: 97%; background: #f8fbff; border-collapse: collapse; font-size:1rem; box-shadow: 0 1px 6px rgba(44,70,169,0.07); margin-bottom:22px;">
    <tr style="background:#d9e6fb;">
      <th style="padding:7px 10px; border:1px solid #cee2fa; text-align:left;">Feature</th>
      <th style="padding:7px 10px; border:1px solid #cee2fa; text-align:left;">Traditional LLM</th>
      <th style="padding:7px 10px; border:1px solid #cee2fa; text-align:left;">RAG-Enhanced LLM</th>
    </tr>
    <tr>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Knowledge Source</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Fixed (Training Data)</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Dynamic (Retrieval + LLM)</td>
    </tr>
    <tr>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Updates</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Requires Retraining</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Can Fetch New Information</td>
    </tr>
    <tr>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Risk of Hallucinations</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">High</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Reduced</td>
    </tr>
    <tr>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Domain-Specific Adaptability</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Limited</td>
      <td style="padding:7px 10px; border:1px solid #cee2fa;">Highly Adaptable</td>
    </tr>
  </table>

  <h3 style="font-weight:600; color:#284ab6; font-size:1.12rem;">
    🏗️ Next Step: Setting Up the Environment
  </h3>
  <div>
    In the next section, we will <b>install required libraries</b> and set up our workspace for building a <b>RAG pipeline in Google Colab</b>.
  </div>
</div>
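
The two steps above (retrieval, then generation) can be pictured with a toy example. This is only a sketch: the word-overlap scoring, the `policy_docs` list, and the helper names `retrieve` and `build_prompt` are all hypothetical stand-ins, and a real pipeline would use embeddings and send the final prompt to an LLM.

```python
# Toy illustration of the two RAG steps: retrieve, then build the prompt for generation.
# The "knowledge base" and the scoring rule are invented for this sketch.

def tokenize(text):
    """Lowercase the text and return its words as a set (very rough tokenization)."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def retrieve(question, passages, k=1):
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

policy_docs = [
    "Employees accrue 20 vacation days per year.",
    "Remote work requires manager approval.",
    "Expense reports are due within 30 days.",
]

question = "How many vacation days do employees get?"
top_passages = retrieve(question, policy_docs, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)  # In a real pipeline, this augmented prompt is what the LLM receives
```

The model never sees the whole knowledge base, only the passages the retriever selected, which is why retrieval quality largely determines answer quality.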


In [None]:
# ==================================================
# 📌 Installing Required Libraries
# ==================================================
!pip install langchain langchain-community  # Core LangChain framework & community package
!pip install openai==0.28  # OpenAI API package (version 0.28) for GPT models

# Additional libraries for RAG (FAISS, ChromaDB, Tokenization, and Unstructured Data Processing)
!pip install faiss-cpu chromadb tiktoken unstructured

!pip install "unstructured[pdf]" pypdf pdfminer.six



In [None]:
# ==================================================
# 📌 Importing Required Libraries for LangChain RAG Lab
# ==================================================

# ✅ System & Environment Setup
import os  # For setting environment variables, such as API keys

# ✅ Jupyter & Colab Utilities
import ipywidgets as widgets  # For creating interactive input widgets
from IPython.display import clear_output, display  # For managing notebook outputs

# ✅ OpenAI API
import openai  # Direct interaction with OpenAI API (useful for API-based calls)

# ✅ LangChain Core Components
from langchain.chat_models import ChatOpenAI  # OpenAI chat models (GPT)
from langchain.llms import OpenAI  # OpenAI LLM wrapper
from langchain.prompts import PromptTemplate  # Structured prompt templates
from langchain.memory import ConversationBufferMemory  # Maintaining conversation history

# ✅ RAG-Specific LangChain Imports
from langchain.vectorstores import FAISS  # FAISS for fast retrieval
from langchain.embeddings.openai import OpenAIEmbeddings  # OpenAI embeddings for vector search
from langchain.chains import RetrievalQA  # Prebuilt RAG pipeline in LangChain
from langchain.document_loaders import TextLoader  # Loading documents
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splitting text into chunks

# ✅ Confirmation message
print("✅ All required libraries imported successfully!")


In [None]:
# ==================================================
# 🔑 OpenAI API Key Setup from Colab Secrets
# ==================================================

# ✅ Retrieve OpenAI API Key from Colab's Secret Storage
try:
    from google.colab import userdata  # Import Colab's secret storage
    openai_key = userdata.get('OpenAI_Key')  # Retrieve key from Colab Secrets

    if openai_key:
        os.environ["OPENAI_API_KEY"] = openai_key
        print("✅ OpenAI API Key has been set successfully from Colab Secrets!")
    else:
        print("❌ OpenAI API Key not found in Colab Secrets. Please add it.")

except Exception as e:
    print(f"❌ Error retrieving OpenAI API Key: {e}")


# 📌 Case Study: OpenAI's Marketing Strategy – RAG vs. Non-RAG

## 🎯 Objective
This case study evaluates how **Retrieval-Augmented Generation (RAG)** improves AI-generated responses for a business use case. We analyze OpenAI's **marketing strategy**, first using a standard LLM (**without RAG**) and then incorporating **retrieved external data (RAG)** to enhance the answer.

---

## 🔍 Approach
We structured our experiment in **four key steps**:

1. **Non-RAG Query:**  
   - Asked OpenAI's GPT model: *"What is OpenAI's marketing strategy?"*  
   - The model relied **only on its pretrained knowledge**, potentially outdated.  

2. **Loading External Knowledge:**  
   - Downloaded a **marketing-related PDF from Dropbox** to provide fresh, structured data.  
   - Split the document into **smaller retrievable text chunks** using LangChain.  

3. **Embedding & Retrieval:**  
   - Converted document chunks into **vector embeddings** with OpenAI's embedding model and indexed them in **FAISS**.  
   - Set up a **retriever** to fetch relevant context dynamically.  

4. **RAG-Based Query:**  
   - Asked the same question, but now the model **retrieved** relevant document excerpts.  
   - AI generated a **more informed and factually grounded** response.  

---

## 📊 Comparison: Non-RAG vs. RAG Responses

| Feature              | ❌ **Without RAG** (LLM Only) | ✅ **With RAG** (Retrieved Data) |
|----------------------|-----------------------------|----------------------------------|
| **Knowledge Source** | Trained model (static)      | External documents (dynamic)    |
| **Response Quality** | General & vague            | Specific & data-backed          |
| **Up-to-date Info**  | Limited                     | Can retrieve recent data        |
| **Risk of Hallucination** | Higher                | Reduced                         |

---

In [None]:
# ==================================================
# ❌ No RAG: Ask OpenAI About Its Marketing Strategy
# ==================================================

from langchain.chat_models import ChatOpenAI

# ✅ Initialize OpenAI Model
llm = ChatOpenAI(model="gpt-4", temperature=0.7)

# ✅ Test Query Without RAG
query = "What is OpenAI's marketing strategy in 3 bullets?"
response = llm.invoke(query)  # Returns a message object, not a plain string

# ✅ Display Result
print("🤖 OpenAI's Marketing Strategy (No RAG):")
print(response.content)  # The generated text is in the .content attribute


"""
# 📌 Comparing `RecursiveCharacterTextSplitter` vs. `CharacterTextSplitter`

## 🔍 Why Use a Text Splitter?
Large documents must be broken into smaller, manageable chunks for efficient **retrieval in RAG pipelines**. LangChain provides different text splitters for this purpose.

---

## ⚡ `CharacterTextSplitter`
**Basic, fast, but limited control.**  
✅ Splits text on a **single separator** (by default a blank line, `"\n\n"`), then merges the pieces into chunks of up to `chunk_size` characters (e.g., 500).  
⚠️ Has **no fallback**: a piece with no separator inside it stays whole, so some chunks can exceed `chunk_size`.  
✅ **Good for simple text division** without deep structure.

---

## 🔄 `RecursiveCharacterTextSplitter`
**More advanced & structure-aware.**  
✅ Attempts to **split text at logical breakpoints** (e.g., paragraphs, lines).  
✅ Uses a **fallback mechanism**: tries **paragraphs → lines → words → characters**, moving to a finer separator only when a piece is still too long.  
✅ **Better for structured documents** like PDFs, articles, or books.

---

## 📊 Summary Table

| Feature                        | `CharacterTextSplitter` | `RecursiveCharacterTextSplitter` |
|--------------------------------|-------------------------|----------------------------------|
| **Splitting Logic**             | Single separator (default `"\n\n"`) | Tries paragraphs → lines → words → characters |
| **Maintains Logical Flow?**     | ⚠️ Only at the chosen separator | ✅ Yes |
| **Best for PDFs & Long Texts?** | ❌ No                   | ✅ Yes |
| **Computational Efficiency**    | ✅ Faster               | ⚠️ Slightly slower |

---

## ✅ **Which One Should You Use?**
- **For simple splitting** (e.g., short plain text) → Use `CharacterTextSplitter`.  
- **For structured documents** (PDFs, articles, books) → Use `RecursiveCharacterTextSplitter`.  
- **If unsure** → Default to `RecursiveCharacterTextSplitter` for better retrieval quality.
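
To make the fallback idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is ignored, and the separator order is assumed to be paragraphs, lines, words, then single characters.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, trying
    coarse separators first and finer ones only when a piece is too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # Keep growing the current chunk
            continue
        if current:
            chunks.append(current)  # Current chunk is full; start a new one
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "RAG retrieves context.\n\n"
    "It then generates an answer from that context and the question."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the long second paragraph is broken at word boundaries rather than in the middle of a word.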

"""


In [None]:
# ==================================================
# 📂 Download & Load PDF from Dropbox for RAG
# ==================================================
# This cell loads a PDF that is hosted on Dropbox.
# Instead of using `UnstructuredURLLoader` (which requires extra dependencies),
# we:
# ✅ Step 1: Download the PDF from Dropbox and save it locally.
# ✅ Step 2: Use `PyPDFLoader` to extract text from the PDF.
# ✅ Step 3: Split the extracted text into small chunks for retrieval.
# ✅ Step 4: Preview the first few chunks to verify the content.

import requests  # Library for downloading files
from langchain.document_loaders import PyPDFLoader  # Stable PDF loader for LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splits text into smaller parts

# ==================================================
# ✅ Step 1: Download the PDF from Dropbox and Save Locally
# ==================================================
dropbox_url = "https://www.dropbox.com/scl/fi/wvvef7qxrq36czo4poquc/pdf.pdf?rlkey=yp0sn2f60bjumn7m943hh3o4u&dl=1"
pdf_path = "/content/document.pdf"  # Local path to save the downloaded file

# Download the file from Dropbox; fail early on an HTTP error (e.g., an expired link)
response = requests.get(dropbox_url, timeout=60)
response.raise_for_status()
with open(pdf_path, "wb") as file:
    file.write(response.content)

print("✅ PDF Downloaded Successfully!")

# ==================================================
# ✅ Step 2: Load the PDF Using `PyPDFLoader`
# ==================================================
# PyPDFLoader extracts the text and returns one Document per PDF page.
loader = PyPDFLoader(pdf_path)
documents = loader.load()  # List of LangChain Document objects (one per page)

# ==================================================
# ✅ Step 3: Split Text into Chunks for Better Retrieval
# ==================================================
# Large text blocks make retrieval imprecise, so we split the text into smaller pieces.
# chunk_size and chunk_overlap are both measured in characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
docs = text_splitter.split_documents(documents)  # Splits the document into multiple parts

# ==================================================
# ✅ Step 4: Preview the First Few Chunks to Ensure Proper Splitting
# ==================================================
print(f"✅ Total Chunks: {len(docs)}")  # Shows how many text chunks were created

# Display the first few chunks to verify extraction
for i in range(min(3, len(docs))):  # Avoid index error if the document is too short
    print(f"\n📜 Chunk {i+1}: {docs[i].page_content[:300]}...")  # Display first 300 chars of each chunk
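
The effect of `chunk_overlap` is easiest to see on a tiny string. The sketch below uses a plain sliding window, which is a simplification of what `RecursiveCharacterTextSplitter` does (it prefers natural separators); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows. Each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text
        start += step
    return chunks


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)
```

Each window repeats the last two characters of the previous one. With an overlap of half the chunk size, as in the cell above (100 and 50), most of the text appears in two chunks, so the number of chunks roughly doubles compared with no overlap.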


# 📌 Vector Databases & FAISS: Efficient Retrieval in RAG

"""
# 🔍 What Are Vector Databases?
Vector databases store and search **high-dimensional embeddings**, allowing AI to find **similar text chunks efficiently**. They are essential for **Retrieval-Augmented Generation (RAG)**, where AI retrieves relevant context before generating responses.

## 🚀 Why Use a Vector Database?
- 🔎 **Fast similarity search** for large datasets.
- 📖 **Improves accuracy** in AI-generated responses.
- ⚡ **Optimized for large-scale AI applications** (chatbots, search engines, etc.).

---

## 🔹 FAISS: A Popular Choice
FAISS (**Facebook AI Similarity Search**) is an **open-source, fast, and efficient** library for similarity search over dense vectors. It runs **locally, inside your own process**, and LangChain wraps it as a vector store. It’s widely used for:

✅ **Low-latency text retrieval**  
✅ **Handling millions of vectors efficiently**  
✅ **Offline or on-device AI applications**  

---

## 🔄 Other Vector Database Options
| Vector DB  | Best For | Key Features |
|------------|---------|--------------|
| **FAISS**  | Local, Fast Search | ✅ No server required, efficient indexing |
| **ChromaDB** | Simple RAG Pipelines | ✅ Lightweight, native LangChain support |
| **Pinecone** | Scalable Cloud Search | ✅ Fully managed, real-time retrieval |
| **Weaviate** | Hybrid Search (Vector + Keyword) | ✅ Open-source, combines vector search with keyword search and metadata filters |

### ✅ Choosing the Right One:
- **For local, fast retrieval** → Use **FAISS**.  
- **For easy cloud-based search** → Use **Pinecone**.  
- **For AI-driven search & metadata filtering** → Use **Weaviate**.  
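
At its core, a vector store answers one question: which stored vectors are closest to the query vector? The brute-force sketch below shows that idea with cosine similarity. The three-dimensional "embeddings" and the helper name `top_k` are made up for illustration; real embeddings have hundreds or thousands of dimensions, and libraries such as FAISS use optimized indexes (typically with L2 distance or inner product) instead of a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """index is a list of (text, vector) pairs; return the k most similar texts with scores."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)  # Highest similarity first
    return [(text, round(score, 3)) for score, text in scored[:k]]

# Hypothetical toy embeddings, chosen by hand for this sketch.
index = [
    ("solar panels", [0.9, 0.1, 0.0]),
    ("wind turbines", [0.8, 0.3, 0.1]),
    ("stock markets", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # Pretend this is the embedding of an energy question

results = top_k(query_vec, index, k=2)
print(results)
```

The two energy-related entries rank above the unrelated one because their vectors point in nearly the same direction as the query.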

"""


In [None]:
# ==================================================
# 🔍 Step 3: Convert Text Chunks to Embeddings (FAISS Vector Database)
# ==================================================
# This step:
# ✅ Converts each text chunk into vector embeddings using OpenAI Embeddings.
# ✅ Stores the embeddings in FAISS, a fast and efficient vector search database.
# ✅ Confirms successful embedding of document chunks.

from langchain.vectorstores import FAISS  # FAISS for fast similarity search
from langchain.embeddings.openai import OpenAIEmbeddings  # OpenAI's embedding model

# ✅ Step 1: Initialize OpenAI Embeddings
embedding_model = OpenAIEmbeddings()

# ✅ Step 2: Convert Text Chunks into Vector Embeddings and Store in FAISS
vector_db = FAISS.from_documents(docs, embedding_model)


# ✅ Step 3: Confirm database is ready
print(f"✅ {len(docs)} document chunks embedded successfully!")


In [None]:
# ==================================================
# 🔍 Step 4: Querying the Vector Database (Retrieval + Generation)
# ==================================================
# This step:
# ✅ Creates a retriever to fetch relevant document chunks.
# ✅ Uses OpenAI's LLM to generate an answer based on retrieved content.
# ✅ Compares RAG-based response with the non-RAG response.

from langchain.chains import RetrievalQA

# ✅ Step 1: Create a Retriever (Finds Relevant Chunks)
retriever = vector_db.as_retriever()

# ✅ Step 2: Create a Retrieval-Augmented Generation (RAG) Chain
rag_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# ✅ Step 3: Ask the same question, but now with RAG retrieval
query = "What is OpenAI's marketing strategy in 3 bullets?"
response_rag = rag_chain.invoke({"query": query})["result"]  # RetrievalQA returns a dict; the answer is under "result"

# ✅ Step 4: Display RAG-Based Response
print("\n🔍 RAG-Based Response (With Retrieval):")
print(response_rag)
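
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows the general shape of such a prompt. The template wording, the sample chunks, and the helper name `stuff_prompt` are hypothetical; LangChain's built-in prompt text differs.

```python
# Hypothetical template; it only illustrates the structure of a "stuff" prompt.
STUFF_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "If the context does not contain the answer, say you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def stuff_prompt(question, retrieved_chunks):
    """Concatenate the retrieved chunks and insert them ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for what a retriever might return.
example_chunks = [
    "Chunk A: text about partnerships.",
    "Chunk B: text about developer outreach.",
]
example_prompt = stuff_prompt("What is the marketing strategy?", example_chunks)
print(example_prompt)
```

Because every retrieved chunk is pasted into one prompt, the number and size of chunks must fit within the model's context window.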


# 📌 Loading External Data: HTML & CSV in LangChain

"""
# 🔍 Loading External Data in LangChain
LangChain allows us to **ingest external data** from sources like **HTML (web pages)** and **CSV files (structured data)** for retrieval-based AI applications.
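
As a rough picture of what an HTML loader has to do before any text reaches the splitter, the sketch below strips tags with Python's standard library. It is an illustration only, not how LangChain's loaders are implemented; the class name `TextExtractor`, the helper `html_to_text`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Renewable energy</h1><p>Solar and <b>wind</b> power.</p>"
    "<script>console.log('ignored');</script></body></html>"
)
page_text = html_to_text(sample_html)
print(page_text)
```

Only the heading and paragraph text survive; markup, styles, and scripts are dropped, which keeps noise out of the embeddings.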

---

## ✅ Loading HTML Data (Web Scraping)
We can extract text from web pages using `WebBaseLoader`, which fetches a URL and parses the HTML with BeautifulSoup:

```python
from langchain.document_loaders import WebBaseLoader

# Load a Wikipedia page (example)
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Renewable_energy")
documents = loader.load()
```
"""

# ✋ **Hands-On: RAG with HTML Data**


In [None]:

# ==================================================
# ✋ **Hands-On: Load & Retrieve Renewable Energy Info from Wikipedia**
# ==================================================
# 📌 **Task Instructions:**
# 1️⃣ Fill in the missing placeholders (`-----`) to complete the process.
# 2️⃣ Use `WebBaseLoader` to load Wikipedia data.
# 3️⃣ Split text into retrievable chunks.
# 4️⃣ Convert chunks into vector embeddings using FAISS.
# 5️⃣ Use retrieval to answer a question about renewable energy.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# ==================================================
# ✅ Step 1: Load Wikipedia Page on Renewable Energy
# ==================================================
wiki_url = "https://en.wikipedia.org/wiki/Renewable_energy"
loader = -----  # Load HTML from Wikipedia
documents = -----  # Extract text from the page

# ==================================================
# ✅ Step 2: Split Text into Chunks
# ==================================================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = -----  # Split extracted text into smaller chunks

# ==================================================
# ✅ Step 3: Convert Chunks to Embeddings & Store in FAISS
# ==================================================
embedding_model = -----  # Use OpenAIEmbeddings or another model
vector_db = -----  # Convert docs into vector embeddings and store in FAISS

# ==================================================
# ✅ Step 4: Create a Retriever to Fetch Relevant Information
# ==================================================
retriever = -----  # Convert FAISS vector store into a retriever

# ==================================================
# ✅ Step 5: Ask AI a Question About Renewable Energy
# ==================================================
rag_chain = RetrievalQA.from_chain_type(-----, retriever=retriever)  # Define the RAG pipeline

query = "What are the main types of renewable energy sources?"
response_rag = rag_chain.invoke({"query": query})["result"]  # The answer text is under "result"

# ✅ Step 6: Display Retrieved Answer
print("\n🌍 🔋 AI Answer on Renewable Energy:")
print(response_rag)



# 📌 Loading CSV Data in LangChain for AI Retrieval


"""
# 🔍 Why Load CSV Data in LangChain?
CSV files store **structured data** such as financial reports, survey responses, or product catalogs. With LangChain, we can:
- 📊 **Extract relevant data** for AI analysis.
- 🔍 **Retrieve context-specific insights** using vector embeddings.
- 🤖 **Improve AI-generated responses** by grounding answers in real data.
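
A row-per-document loader turns each CSV row into a small block of text that can be embedded like any other chunk. The sketch below mimics that idea with the standard `csv` module. The function name `csv_to_documents`, the dictionary layout, and the sample data are assumptions for illustration, not `CSVLoader`'s actual return type.

```python
import csv
import io

def csv_to_documents(csv_text):
    """Turn each CSV row into a dict holding 'column: value' lines plus the row number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({"page_content": content, "metadata": {"row": row_number}})
    return documents

# Made-up sample data for the sketch.
sample_csv = "product,region,revenue\nWidget,EMEA,1200\nGadget,APAC,950\n"
docs_from_csv = csv_to_documents(sample_csv)

for doc in docs_from_csv:
    print(doc["metadata"], doc["page_content"].replace("\n", " | "))
```

Keeping the column name next to each value matters: the embedding then captures that `1200` is a revenue figure rather than an isolated number.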

---

## ✅ How to Load CSV Data
LangChain provides the `CSVLoader`, which turns each row of a CSV file into a separate document.

### 🔹 Example:
```python
from langchain.document_loaders import CSVLoader

# Load a sample CSV file
loader = CSVLoader(file_path="data.csv")
documents = loader.load()
```
"""

# ✋ **Hands-On: No RAG vs. RAG with CSV Data**



In [None]:
# ==================================================
# ✋ **Hands-On: AI Insights from CSV Data (No RAG vs. RAG)**
# ==================================================
# 📌 **Task Instructions:**
# 1️⃣ First, run the **No RAG version**, where AI generates a response **without external data**.
# 2️⃣ Then, fill in the **placeholders (`-----`)** to complete the **RAG version**, which retrieves information from a CSV dataset.
# 3️⃣ Use any **structured dataset (e.g., market trends, financial reports, healthcare statistics, etc.)**.
# 4️⃣ Compare AI’s responses **before and after retrieval**.

import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# ==================================================
# ✅ Step 1: Upload CSV Data in Google Colab
# ==================================================
# 🔹 Instructions:
# 1. Click the **folder icon** 📂 in the left sidebar.
# 2. Click the **Upload** button.
# 3. Upload your **CSV file** (e.g., `your_dataset.csv`).
# 4. Make sure the filename below matches your uploaded file.

csv_path = "/content/your_dataset.csv"  # Update with your actual file name

# ==================================================
# ✅ Step 2: Define the Question (Both Versions)
# ==================================================
query = "What are the key insights from the dataset?"  # Generalized question

# ==================================================
# ❌ No RAG: Ask AI Without External Data
# ==================================================
llm = ChatOpenAI(model="gpt-4", temperature=0.7)
response_no_rag = llm.invoke(query)

print("\n🚀 **AI Response Without RAG:**")
print(response_no_rag.content)  # Model response without retrieval

# ==================================================
# ✅ Step 3: Load CSV Data for RAG
# ==================================================
loader = -----  # Use `CSVLoader` to load data
documents = -----  # Extract documents from CSV

# ==================================================
# ✅ Step 4: Split CSV Data into Chunks for Retrieval
# ==================================================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = -----  # Split CSV data into smaller retrievable chunks

# ==================================================
# ✅ Step 5: Convert Chunks to Embeddings & Store in FAISS
# ==================================================
embedding_model = -----  # Use OpenAIEmbeddings or an alternative
vector_db = -----  # Convert docs into vector embeddings and store in FAISS

# ==================================================
# ✅ Step 6: Create a Retriever and AI Pipeline for RAG
# ==================================================
retriever = -----  # Convert FAISS vector store into a retriever
rag_chain = RetrievalQA.from_chain_type(-----, retriever=retriever)  # Define the RAG pipeline

# ==================================================
# ✅ Step 7: Ask the Same Question Using RAG
# ==================================================
response_rag = rag_chain.invoke({"query": query})["result"]  # The answer text is under "result"

# ==================================================
# ✅ Step 8: Print Model Responses (Compare No RAG vs. RAG)
# ==================================================
print("\n🚀 **AI Response Without RAG:**")
print(response_no_rag.content)  # Response without retrieval

print("\n📊 **AI Response With RAG (CSV Data Used):**")
print(response_rag)  # Response with retrieved CSV insights
