<a href="https://colab.research.google.com/github/mdehghani86/AppliedGenAI/blob/main/LangChain_Lab5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔗 **LangChain Lab 4: Retrieval-Augmented Generation (RAG)**  
- **Prof. Dehghani (m.dehghani@northeastern.edu)**  

## 📖 Introduction to RAG

### 🔹 What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (**RAG**) is a technique that enhances Large Language Models (LLMs) by retrieving external knowledge before generating a response. Instead of relying solely on a model's pretrained knowledge, RAG fetches relevant **documents, database entries, or structured data** to improve accuracy.

### 🚀 **Why Use RAG?**
Traditional LLMs have limitations:
✅ **Limited Knowledge** – An LLM's knowledge is frozen at training time; it cannot pick up new facts without retraining.  
✅ **Hallucinations** – Models sometimes generate incorrect or fabricated information.  
✅ **Domain-Specific Needs** – For specialized fields like **finance, law, or medicine**, retrieval grounds answers in domain documents and improves accuracy.

**RAG solves these issues by combining retrieval and generation**, allowing models to fetch relevant knowledge on demand.

---

### 🛠️ **How Does RAG Work?**
RAG consists of two main steps:

1️⃣ **Retrieval:** The system **searches for relevant information** in a knowledge source (e.g., vector database, documents).  
2️⃣ **Generation:** The retrieved information is **passed as context** to an LLM, which generates a response based on both its knowledge and the retrieved data.

📌 **Example Use Case**: A chatbot answering questions about **company policies** can use RAG to pull up official policy documents instead of relying only on its pretrained knowledge.
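
The two steps can be sketched without any external library. This is a minimal illustration, not how LangChain implements retrieval: it ranks documents by simple word overlap instead of embeddings, and the policy snippets, function names, and prompt wording are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1 (retrieval): rank documents by how many query words they share."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Step 2 (generation input): place the retrieved text ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Hypothetical policy snippets standing in for a real knowledge source.
policies = [
    "Employees accrue 20 days of paid vacation per year.",
    "Remote work requires manager approval.",
    "Expense reports are due within 30 days of purchase.",
]

question = "How many vacation days do employees get?"
top_docs = retrieve(question, policies, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real pipeline the `prompt` string would be sent to the LLM; the rest of this lab replaces the word-overlap ranking with embedding-based search.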

---

### 🔬 **Comparison: Traditional LLM vs. RAG**
| Feature | Traditional LLM | RAG-Enhanced LLM |
|---------|----------------|------------------|
| **Knowledge Source** | Fixed (Training Data) | Dynamic (Retrieval + LLM) |
| **Updates** | Requires Retraining | Can Fetch New Information |
| **Risk of Hallucinations** | High | Reduced |
| **Domain-Specific Adaptability** | Limited | Highly Adaptable |

---
### 🏗️ **Next Step: Setting Up the Environment**
In the next section, we will **install required libraries** and set up our workspace for building a **RAG pipeline in Google Colab**.



In [None]:
# ==================================================
# 📌 Installing Required Libraries
# ==================================================
!pip install langchain langchain-community  # Core LangChain framework & community package
!pip install openai==0.28  # OpenAI API package (version 0.28) for GPT models

# Additional libraries for RAG (FAISS, ChromaDB, Tokenization, and Unstructured Data Processing)
!pip install faiss-cpu chromadb tiktoken unstructured

!pip install "unstructured[pdf]" pypdf pdfminer.six



In [None]:
# ==================================================
# 📌 Importing Required Libraries for LangChain RAG Lab
# ==================================================

# ✅ System & Environment Setup
import os  # For setting environment variables, such as API keys

# ✅ Jupyter & Colab Utilities
import ipywidgets as widgets  # For creating interactive input widgets
from IPython.display import clear_output, display  # For managing notebook outputs

# ✅ OpenAI API
import openai  # Direct interaction with OpenAI API (useful for API-based calls)

# ✅ LangChain Core Components
from langchain.chat_models import ChatOpenAI  # OpenAI chat models (GPT)
from langchain.llms import OpenAI  # OpenAI LLM wrapper
from langchain.prompts import PromptTemplate  # Structured prompt templates
from langchain.memory import ConversationBufferMemory  # Maintaining conversation history

# ✅ RAG-Specific LangChain Imports
from langchain.vectorstores import FAISS  # FAISS for fast retrieval
from langchain.embeddings.openai import OpenAIEmbeddings  # OpenAI embeddings for vector search
from langchain.chains import RetrievalQA  # Prebuilt RAG pipeline in LangChain
from langchain.document_loaders import TextLoader  # Loading documents
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splitting text into chunks

# ✅ Alternative Embeddings (Hugging Face)
from langchain.embeddings import HuggingFaceEmbeddings  # HF embeddings for local models
import transformers  # Hugging Face Transformers

# ✅ Confirmation message
print("✅ All required libraries imported successfully!")


In [None]:
# ==================================================
# 🔑 OpenAI API Key Setup from Colab Secrets
# ==================================================

# ✅ Retrieve OpenAI API Key from Colab's Secret Storage
try:
    from google.colab import userdata  # Import Colab's secret storage
    openai_key = userdata.get('OpenAI_Key')  # Retrieve key from Colab Secrets

    if openai_key:
        os.environ["OPENAI_API_KEY"] = openai_key
        print("✅ OpenAI API Key has been set successfully from Colab Secrets!")
    else:
        print("❌ OpenAI API Key not found in Colab Secrets. Please add it.")

except Exception as e:
    print(f"❌ Error retrieving OpenAI API Key: {e}")


# 📌 Case Study: OpenAI's Marketing Strategy – RAG vs. Non-RAG

## 🎯 Objective
This case study evaluates how **Retrieval-Augmented Generation (RAG)** improves AI-generated responses for a business use case. We analyze OpenAI's **marketing strategy**, first using a standard LLM (**without RAG**) and then incorporating **retrieved external data (RAG)** to enhance the answer.

---

## 🔍 Approach
We structured our experiment in **four key steps**:

1. **Non-RAG Query:**  
   - Asked OpenAI's GPT model: *"What is OpenAI's marketing strategy?"*  
   - The model relied **only on its pretrained knowledge**, potentially outdated.  

2. **Loading External Knowledge:**  
   - Uploaded a **marketing-related PDF from Dropbox** to provide fresh, structured data.  
   - Split the document into **smaller retrievable text chunks** using LangChain.  

3. **Embedding & Retrieval:**  
   - Converted document chunks into **vector embeddings** and indexed them in **FAISS**.  
   - Set up a **retriever** to fetch relevant context dynamically.  

4. **RAG-Based Query:**  
   - Asked the same question, but now the model **retrieved** relevant document excerpts.  
   - AI generated a **more informed and factually grounded** response.  

---

## 📊 Comparison: Non-RAG vs. RAG Responses

| Feature              | ❌ **Without RAG** (LLM Only) | ✅ **With RAG** (Retrieved Data) |
|----------------------|-----------------------------|----------------------------------|
| **Knowledge Source** | Trained model (static)      | External documents (dynamic)    |
| **Response Quality** | General & vague            | Specific & data-backed          |
| **Up-to-date Info**  | Limited                     | Can retrieve recent data        |
| **Risk of Hallucination** | Higher                | Reduced                         |

---

In [None]:
# ==================================================
# ❌ No RAG: Ask OpenAI About Its Marketing Strategy
# ==================================================

from langchain.chat_models import ChatOpenAI

# ✅ Initialize OpenAI Model
llm = ChatOpenAI(model="gpt-4", temperature=0.7)

# ✅ Test Query Without RAG
query = "What is OpenAI's marketing strategy?"
response = llm.invoke(query)  # Send the question directly to the model

# ✅ Display Result
print("🤖 OpenAI's Marketing Strategy (No RAG):")
print(response.content)  # The reply text is in the message's .content attribute


# ==================================================
# 📌 Comparing `RecursiveCharacterTextSplitter` vs. `CharacterTextSplitter`
# ==================================================

"""
# 📌 Comparing `RecursiveCharacterTextSplitter` vs. `CharacterTextSplitter`

## 🔍 Why Use a Text Splitter?
Large documents must be broken into smaller, manageable chunks for efficient **retrieval in RAG pipelines**. LangChain provides different text splitters for this purpose.

---

## ⚡ `CharacterTextSplitter`
**Basic, fast, but limited control.**  
✅ Splits text on a **single separator** (by default a blank line, `"\n\n"`) and merges the pieces into chunks of up to `chunk_size` characters (e.g., 500).  
✅ Has **no fallback**: a piece that contains no separator is kept whole, so a chunk can exceed `chunk_size`.  
✅ **Good for simple text division** without deep structure.

---

## 🔄 `RecursiveCharacterTextSplitter`
**More advanced & structure-aware.**  
✅ Attempts to **split text at logical breakpoints** (e.g., paragraph and line breaks).  
✅ Uses a **fallback mechanism**: tries **paragraphs → lines → words → characters**, moving to the next separator only when a piece is still too large.  
✅ **Better for structured documents** like PDFs, articles, or books.

---

## 📊 Summary Table

| Feature                        | `CharacterTextSplitter` | `RecursiveCharacterTextSplitter` |
|--------------------------------|-------------------------|----------------------------------|
| **Splitting Logic**             | Single separator, merged up to `chunk_size` | Tries paragraphs → lines → words → characters |
| **Maintains Logical Flow?**     | ⚠️ Only at its one separator | ✅ Yes |
| **Best for PDFs & Long Texts?** | ❌ No                   | ✅ Yes |
| **Computational Efficiency**    | ✅ Faster               | ⚠️ Slightly slower |

---

## ✅ **Which One Should You Use?**
- **For simple splitting** (e.g., short plain text) → Use `CharacterTextSplitter`.  
- **For structured documents** (PDFs, articles, books) → Use `RecursiveCharacterTextSplitter`.  
- **If unsure** → Default to `RecursiveCharacterTextSplitter` for better retrieval quality.
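
To make the fallback idea concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is left out, and separators are dropped at chunk boundaries. It only shows how a splitter can try coarse separators first and fall back to finer ones.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    # Return chunks of at most chunk_size characters, preferring coarse separators.
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Piece still fits: keep merging.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "RAG retrieves documents.\n\nThen the model answers using them as context."
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits in one chunk, while the second is too long and is re-split at word boundaries, so no word is cut in half.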

"""


In [None]:
# ==================================================
# 📂 Download & Load PDF from Dropbox for RAG
# ==================================================
# This cell loads a PDF hosted on Dropbox.
# Rather than using `UnstructuredURLLoader` (which requires extra dependencies),
# we:
# ✅ Step 1: Download the PDF from Dropbox and save it locally.
# ✅ Step 2: Use `PyPDFLoader` to extract text from the PDF.
# ✅ Step 3: Split the extracted text into small chunks for retrieval.
# ✅ Step 4: Preview the first few chunks to verify the content.

import requests  # Library for downloading files
from langchain.document_loaders import PyPDFLoader  # Stable PDF loader for LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splits text into smaller parts

# ==================================================
# ✅ Step 1: Download the PDF from Dropbox and Save Locally
# ==================================================
dropbox_url = "https://www.dropbox.com/scl/fi/wvvef7qxrq36czo4poquc/pdf.pdf?rlkey=yp0sn2f60bjumn7m943hh3o4u&dl=1"
pdf_path = "/content/document.pdf"  # Local path to save the downloaded file

# Download the file from Dropbox
response = requests.get(dropbox_url, timeout=30)
response.raise_for_status()  # Stop here if the download failed (e.g., 404)
with open(pdf_path, "wb") as file:
    file.write(response.content)

print("✅ PDF Downloaded Successfully!")

# ==================================================
# ✅ Step 2: Load the PDF Using `PyPDFLoader`
# ==================================================
# PyPDFLoader extracts the text of the PDF, returning one Document per page.
loader = PyPDFLoader(pdf_path)
documents = loader.load()  # Loads the text into a LangChain-compatible format

# ==================================================
# ✅ Step 3: Split Text into Chunks for Better Retrieval
# ==================================================
# Large text blocks make retrieval less precise, so we split the text into smaller pieces.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)  # Splits the document into multiple parts

# ==================================================
# ✅ Step 4: Preview the First Few Chunks to Ensure Proper Splitting
# ==================================================
print(f"✅ Total Chunks: {len(docs)}")  # Shows how many text chunks were created

# Display the first few chunks to verify extraction
for i in range(min(3, len(docs))):  # Avoid index error if the document is too short
    print(f"\n📜 Chunk {i+1}: {docs[i].page_content[:300]}...")  # Display first 300 chars of each chunk
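
To see what `chunk_size=500` and `chunk_overlap=50` mean, the sketch below uses a plain sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share 50 characters. This is a simplification under stated assumptions: the helper `sliding_chunks` is hypothetical, and the LangChain splitter applies overlap at separator boundaries, so its real chunks will not line up exactly like this.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks


# A 1,400-character sample made of repeating digits, so overlaps are easy to check.
sample = "".join(str(i % 10) for i in range(1400))
chunks = sliding_chunks(sample)

print(len(chunks))                          # number of windows
print([len(c) for c in chunks])             # size of each window
print(chunks[0][-50:] == chunks[1][:50])    # do neighbours share 50 characters?
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.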


# 📌 Vector Databases & FAISS: Efficient Retrieval in RAG

"""
# 🔍 What Are Vector Databases?
Vector databases store and search **high-dimensional embeddings**, allowing AI to find **similar text chunks efficiently**. They are essential for **Retrieval-Augmented Generation (RAG)**, where AI retrieves relevant context before generating responses.

## 🚀 Why Use a Vector Database?
- 🔎 **Fast similarity search** for large datasets.
- 📖 **Improves accuracy** in AI-generated responses.
- ⚡ **Optimized for large-scale AI applications** (chatbots, search engines, etc.).

---

## 🔹 FAISS: A Popular Choice
FAISS (**Facebook AI Similarity Search**) is an **open-source, fast, and efficient** vector search library optimized for **local similarity search**. It's widely used for:
✅ **Low-latency text retrieval**  
✅ **Handling millions of vectors efficiently**  
✅ **Offline or on-device AI applications**  

---

## 🔄 Other Vector Database Options
| Vector DB  | Best For | Key Features |
|------------|---------|--------------|
| **FAISS**  | Local, Fast Search | ✅ No server required, efficient indexing |
| **ChromaDB** | Simple RAG Pipelines | ✅ Lightweight, native LangChain support |
| **Pinecone** | Scalable Cloud Search | ✅ Fully managed, real-time retrieval |
| **Weaviate** | Hybrid Search (Text + Metadata) | ✅ GraphQL API, combined vector and keyword search |

### ✅ Choosing the Right One:
- **For local, fast retrieval** → Use **FAISS**.  
- **For easy cloud-based search** → Use **Pinecone**.  
- **For AI-driven search & metadata filtering** → Use **Weaviate**.  
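
The core operation behind all of these tools is nearest-neighbour search over vectors. The sketch below does it by brute force with cosine similarity; it is only an illustration. The three-dimensional vectors and their labels are made up, and real libraries such as FAISS use specialised index structures and may use other metrics (for example, L2 distance).

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as unrelated.
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    # Score every stored vector against the query and keep the k best.
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings" for three text chunks.
index = {
    "pricing page": [0.9, 0.1, 0.0],
    "developer docs": [0.1, 0.9, 0.2],
    "press release": [0.7, 0.2, 0.6],
}

query_vec = [1.0, 0.0, 0.1]
results = top_k(query_vec, index, k=2)
for score, label in results:
    print(f"{label}: {score:.3f}")
```

Brute force compares the query with every stored vector, which is fine for a few thousand chunks; vector databases exist to keep this fast at much larger scales.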

"""


In [None]:
# ==================================================
# 🔍 Step 3: Convert Text Chunks to Embeddings (FAISS Vector Database)
# ==================================================
# This step:
# ✅ Converts each text chunk into vector embeddings using OpenAI Embeddings.
# ✅ Stores the embeddings in FAISS, a fast and efficient vector search library.
# ✅ Confirms successful embedding of document chunks.

from langchain.vectorstores import FAISS  # FAISS for fast similarity search
from langchain.embeddings.openai import OpenAIEmbeddings  # OpenAI's embedding model

# ✅ Step 1: Initialize OpenAI Embeddings
embedding_model = OpenAIEmbeddings()

# ✅ Step 2: Convert Text Chunks into Vector Embeddings and Store in FAISS
vector_db = FAISS.from_documents(docs, embedding_model)

# ✅ Step 3: Confirm database is ready
print(f"✅ {len(docs)} document chunks embedded successfully!")
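
As a rough intuition for what "converting text to vectors" means, the sketch below builds word-count vectors over a tiny fixed vocabulary. This is a stand-in, not what `OpenAIEmbeddings` does: real embedding models produce dense, learned vectors that capture meaning rather than exact words. The vocabulary and example chunks are invented for illustration.

```python
import re
from collections import Counter


def bag_of_words_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and text chunks.
vocabulary = ["marketing", "pricing", "api", "developers"]
chunks = [
    "Marketing focuses on developers.",
    "API pricing and more pricing details.",
]

vectors = [bag_of_words_vector(chunk, vocabulary) for chunk in chunks]
for chunk, vector in zip(chunks, vectors):
    print(vector, "<-", chunk)
```

Each chunk becomes a list of numbers of the same length, which is what allows a vector store to compare chunks with a query numerically.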


In [None]:
# ==================================================
# 🔍 Step 4: Querying the Vector Database (Retrieval + Generation)
# ==================================================
# This step:
# ✅ Creates a retriever to fetch relevant document chunks.
# ✅ Uses OpenAI's LLM to generate an answer based on retrieved content.
# ✅ Prints the RAG-based response for comparison with the non-RAG response.

from langchain.chains import RetrievalQA

# ✅ Step 1: Create a Retriever (Finds Relevant Chunks)
retriever = vector_db.as_retriever()

# ✅ Step 2: Create a Retrieval-Augmented Generation (RAG) Chain
rag_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# ✅ Step 3: Ask the same question (now requesting three bullets), this time with retrieval
query = "What is OpenAI's marketing strategy in 3 bullets?"
response_rag = rag_chain.invoke({"query": query})["result"]  # The answer text is under "result"

# ✅ Step 4: Display RAG-Based Response
print("\n🔍 RAG-Based Response (With Retrieval):")
print(response_rag)


# 📌 Loading External Data: HTML & CSV in LangChain

"""
# 🔍 Loading External Data in LangChain
LangChain allows us to **ingest external data** from sources like **HTML (web pages)** and **CSV files (structured data)** for retrieval-based AI applications.

---

## ✅ Loading HTML Data (Web Scraping)
We can extract text from a web page using `WebBaseLoader`, which fetches a URL and parses the HTML (it requires the `beautifulsoup4` package):

```python
from langchain.document_loaders import WebBaseLoader

# Load a Wikipedia page (example)
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Renewable_energy")
documents = loader.load()
```
"""

### ✋ Hands-On:

In [None]:

---

### **📌 Hands-On Task: Load Wikipedia Data on Renewable Energy**
```python
# ==================================================
# ✋ **Hands-On: Load & Retrieve Renewable Energy Info from Wikipedia**
# ==================================================
# 📌 **Task Instructions:**
# 1️⃣ Fill in the missing placeholders (`-----`) to complete the process.
# 2️⃣ Use `WebBaseLoader` to load Wikipedia data.
# 3️⃣ Split text into retrievable chunks.
# 4️⃣ Convert chunks into vector embeddings and store them in FAISS.
# 5️⃣ Use retrieval to answer a question about renewable energy.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# ==================================================
# ✅ Step 1: Load Wikipedia Page on Renewable Energy
# ==================================================
wiki_url = "https://en.wikipedia.org/wiki/Renewable_energy"
loader = -----  # Load HTML from Wikipedia
documents = -----  # Extract text from the page

# ==================================================
# ✅ Step 2: Split Text into Chunks
# ==================================================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = -----  # Split extracted text into smaller chunks

# ==================================================
# ✅ Step 3: Convert Chunks to Embeddings & Store in FAISS
# ==================================================
embedding_model = -----  # Use OpenAIEmbeddings or another model
vector_db = -----  # Convert docs into vector embeddings and store in FAISS

# ==================================================
# ✅ Step 4: Create a Retriever to Fetch Relevant Information
# ==================================================
retriever = -----  # Convert FAISS vector store into a retriever

# ==================================================
# ✅ Step 5: Ask AI a Question About Renewable Energy
# ==================================================
rag_chain = RetrievalQA.from_chain_type(-----, retriever=retriever)  # Define the RAG pipeline

query = "What are the main types of renewable energy sources?"
response_rag = rag_chain.invoke({"query": query})["result"]  # The answer text is under "result"

# ✅ Step 6: Display Retrieved Answer
print("\n🌍 🔋 AI Answer on Renewable Energy:")
print(response_rag)
```