# Notebook 3: Foundational RAG Pipeline

**Retrieval-Augmented Generation**

Based on: https://github.com/NirDiamant/rag_techniques

## Learning Objectives
- Understand the RAG pipeline and why it matters
- Implement document chunking with different strategies
- Create embeddings and store them in a vector database
- Build a simple retriever to find relevant context


## 1. Setup

In [None]:
# Install required packages
!pip install langchain==1.2.7 langchain-community langchain-groq langchain-huggingface langchain-text-splitters faiss-cpu sentence-transformers python-dotenv

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set up Groq API key
if not os.getenv('GROQ_API_KEY'):
    os.environ['GROQ_API_KEY'] = input('Enter your Groq API key: ')

## 2. What is RAG?

**Retrieval-Augmented Generation (RAG)** addresses two key problems with LLMs:

1. **Knowledge**: LLMs only know what was in their training data, so they miss private, domain-specific, or recent information
2. **Hallucination**: LLMs can state made-up facts with confidence

**Solution**: Before generating, retrieve relevant information from a knowledge base and include it in the prompt.

### The RAG Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                     INDEXING (one-time)                         │
│        Document → Chunk → Embed → Store in Vector DB            │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                     RETRIEVAL (per query)                       │
│     Query → Embed → Search Vector DB → Get Relevant Chunks      │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                        GENERATION                               │
│       Query + Retrieved Context → LLM → Answer                  │
└─────────────────────────────────────────────────────────────────┘
```
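
The three stages can be sketched end to end in plain Python. This is a toy illustration, not the pipeline built later in this notebook: `embed` is a hypothetical word-count stand-in for a real embedding model, `fake_llm` is a placeholder for the LLM call, and the sample chunks are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# INDEXING (one-time): chunk -> embed -> store
toy_chunks = [
    "Students must complete 120 credit hours to graduate.",
    "The library is open from 8am to 10pm on weekdays.",
    "Computer science majors take data structures in year two.",
]
toy_index = [(chunk, embed(chunk)) for chunk in toy_chunks]

# RETRIEVAL (per query): embed the query, rank stored chunks by similarity
def retrieve(query: str, k: int = 1) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(toy_index, key=lambda item: similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# GENERATION: put the retrieved context into the prompt
def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return fake_llm(prompt)

toy_question = "How many credit hours do students need to graduate?"
print(retrieve(toy_question))
print(answer(toy_question))
```

Indexing runs once, while `retrieve` and `answer` run for every query, which mirrors the split shown in the diagram.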

## 3. Document Loading

First, let's load our sample document.

In [None]:
from langchain_community.document_loaders import TextLoader

# Load the CCI undergraduate catalog document
loader = TextLoader("data/CCI_2022-2023-Undergraduate-Catalog.txt")
documents = loader.load()

# Check what we loaded
print(f"Loaded {len(documents)} document(s)")
print(f"Document length: {len(documents[0].page_content)} characters")
print(f"\nFirst 500 characters:")
print(documents[0].page_content[:500])

## 4. Chunking

Documents are often too long to fit in an LLM's context window, and we only need relevant parts anyway. **Chunking** splits documents into smaller pieces.

### Key Parameters
- **chunk_size**: Maximum characters per chunk
- **chunk_overlap**: Characters shared between consecutive chunks (prevents cutting off context)
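
To see how the two parameters interact, here is a minimal fixed-window sketch. The helper `chunk_text` is hypothetical and much simpler than `RecursiveCharacterTextSplitter`, which additionally tries to break at separators such as paragraphs and newlines; the sketch only shows what size and overlap mean.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both chunks.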

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Maximum characters per chunk
    chunk_overlap=50,      # Overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split at these boundaries first
)

# Split the documents
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from the document")
print(f"\n--- Chunk 1 ---")
print(chunks[0].page_content)
print(f"\n--- Chunk 10 ---")
print(chunks[9].page_content)

### Experiment: Different Chunk Sizes

Let's see how chunk size affects the number and content of chunks.

In [None]:
# Try different chunk sizes
for size in [200, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size, 
        chunk_overlap=50,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try to split at these boundaries first
    )
    
    test_chunks = splitter.split_documents(documents)
    avg_len = sum(len(c.page_content) for c in test_chunks) / len(test_chunks)

    print(f"Chunk size {size}: {len(test_chunks)} chunks, avg length: {avg_len:.0f} chars")

**Trade-offs**:
- **Smaller chunks**: More precise retrieval, but may lose context
- **Larger chunks**: More context, but may include irrelevant information

A common starting point is **500-1000 characters** with **10-20% overlap**.
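
As a rough guide, the overlap and the resulting number of chunks can be estimated with a little arithmetic. The helpers below are hypothetical and assume fixed-size windows; the real splitter breaks at separators, so its chunks are often shorter than `chunk_size` and it usually produces more of them than this estimate.

```python
import math

def overlap_for(chunk_size: int, ratio: float = 0.1) -> int:
    """Overlap in characters for a given ratio of the chunk size."""
    return round(chunk_size * ratio)

def estimate_chunk_count(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the chunk count for fixed-size windows with overlap."""
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return math.ceil((doc_length - chunk_overlap) / step)

# Example: a 100,000-character document with 10% overlap
for size in (200, 500, 1000):
    overlap = overlap_for(size, 0.1)
    count = estimate_chunk_count(100_000, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: about {count} chunks")
```

Doubling the chunk size roughly halves the number of vectors to store and search, at the cost of coarser retrieval.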

## 5. Embeddings

**Embeddings** convert text into numerical vectors that capture meaning. Similar texts have similar vectors.

- a) "Machine learning is AI"  →  [0.2, -0.5, 0.8, ...]
- b) "AI and ML are related"   →  [0.3, -0.4, 0.7, ...]
- c) "I like pizza"            →  [-0.8, 0.1, 0.2, ...]

### Libraries:
**sentence-transformers**
- Library for semantic text embeddings, originally developed at the UKP Lab (TU Darmstadt) and now maintained by Hugging Face
- Provides pre-trained models that can convert text into dense vector representations (embeddings)

**langchain-huggingface**
- LangChain integration package that wraps sentence-transformers
- Provides LangChain-compatible interfaces to use HuggingFace models in LangChain workflows

**all-MiniLM-L6-v2 embedding model**
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embedding model (downloads on first run, ~90MB)
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",  # Fast and good quality
    model_kwargs={'device': 'cpu'}   # Use 'cuda' if you have a GPU
)
print("Embedding model loaded!")

In [None]:
# Let's see what embeddings look like
test_text = "Machine learning is a type of artificial intelligence."
test_embedding = embeddings.embed_query(test_text)

# We will only print the first 10 entries out of 384.
print(f"Text: '{test_text}'")
print(f"Embedding dimensions: {len(test_embedding)}")
print(f"First 10 values: {test_embedding[:10]}")

### How Similarity is Measured: Cosine Similarity

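**Cosine similarity** is the cosine of the angle between two vectors, ranging from -1 to 1:
- **1.0**: Same direction (0° angle), i.e. essentially the same meaning
- **0.0**: Orthogonal (90° angle), i.e. unrelated
- **-1.0**: Opposite direction (180° angle); sentence embeddings rarely score this low in practice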
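
As a worked example, the formula can be applied by hand to the illustrative three-dimensional vectors from the embeddings section. Those numbers are made up for teaching purposes (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `toy_cosine` is a hypothetical pure-Python helper.

```python
import math

def toy_cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

ml_is_ai = [0.2, -0.5, 0.8]    # "Machine learning is AI"
ai_and_ml = [0.3, -0.4, 0.7]   # "AI and ML are related"
pizza = [-0.8, 0.1, 0.2]       # "I like pizza"

sim_related = toy_cosine(ml_is_ai, ai_and_ml)
sim_unrelated = toy_cosine(ml_is_ai, pizza)

print(f"related:   {sim_related:.3f}")    # 0.988
print(f"unrelated: {sim_unrelated:.3f}")  # -0.062
```

The two AI sentences point in nearly the same direction, while the pizza sentence is close to orthogonal.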

In [None]:
# Demonstrate similarity - similar texts have similar embeddings
import numpy as np

texts = [
    "Machine learning is a type of AI",
    "AI and machine learning are closely related",
    "I like pizza"
]

embs = [embeddings.embed_query(t) for t in texts]

# Calculate cosine similarity between first text and others
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Similarity to 'Machine learning is a type of AI':")
for i, text in enumerate(texts):
    sim = cosine_similarity(embs[0], embs[i])
    print(f"  {sim:.3f} - '{text}'")

## 6. Vector Store (FAISS)

### Vector Store
A specialized database optimized for:
- **Storing** high-dimensional vectors (embeddings)
- **Indexing** vectors for fast retrieval
- **Searching** for similar vectors using a distance or similarity metric (e.g., Euclidean distance, cosine similarity)
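
Conceptually, the simplest vector store is a list that is scanned in full for every query. The class below is a hypothetical sketch of that idea using Euclidean distance; it is not how FAISS is implemented, and libraries like FAISS exist precisely because this brute-force scan becomes too slow for large collections.

```python
import math

class TinyVectorStore:
    """Hypothetical brute-force store: keeps vectors in a list and scans them all."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector: list[float], payload: str) -> None:
        """Store a vector together with the text it represents."""
        self._items.append((list(vector), payload))

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k nearest payloads as (distance, payload), closest first."""
        scored = [(math.dist(query, vector), payload) for vector, payload in self._items]
        scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
        return scored[:k]

store = TinyVectorStore()
store.add([0.2, -0.5, 0.8], "Machine learning is AI")
store.add([0.3, -0.4, 0.7], "AI and ML are related")
store.add([-0.8, 0.1, 0.2], "I like pizza")

results = store.search([0.2, -0.5, 0.7], k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

Note that with a distance metric, lower scores mean closer matches, the opposite of cosine similarity.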

### FAISS
- **Free & Open Source**: Developed by Meta AI Research
- **Runs Locally**: No API calls, no cloud costs
- **Fast**: Optimized for billion-scale similarity searches

**Alternative Vector Stores:**
- **Pinecone**, **Weaviate**, **Qdrant**: Cloud-hosted (require API keys)
- **Chroma**, **LanceDB**: Other local options similar to FAISS

**GitHub**: https://github.com/facebookresearch/faiss

In [None]:
from langchain_community.vectorstores import FAISS

# Create vector store from our chunks
print(f"Creating vector store from {len(chunks)} chunks...")
vectorstore = FAISS.from_documents(chunks, embeddings)
print("Vector store created!")

## 7. Building a Retriever

A **retriever** wraps the vector store and provides a clean interface for getting relevant documents.

In [None]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",  
    search_kwargs={"k": 3}     # Number of results to return
)

# Use the retriever
query = "What are the graduation requirements for CCI students?"
relevant_docs = retriever.invoke(query)

print(f"Query: '{query}'")
print(f"\nRetrieved {len(relevant_docs)} relevant documents")

for i, doc in enumerate(relevant_docs, 1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content)
    print()

## 8. Complete RAG Pipeline

Now let's put it all together: retrieve context and generate an answer!

In [None]:
from langchain_groq import ChatGroq
from langchain_core.messages import HumanMessage

# Initialize LLM
llm = ChatGroq(model="openai/gpt-oss-20b", temperature=0.3)

def simple_rag(question: str) -> str:
    """A simple RAG pipeline: retrieve context, then generate answer."""
    
    # Step 1: Retrieve relevant chunks
    relevant_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])
    
    # Step 2: Create prompt with context
    prompt = f"""Answer the question based ONLY on the following context. 

Context:
{context}

Question: {question}

Answer:"""
    
    # Step 3: Generate answer
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content


# Test the RAG pipeline
question = "What are the graduation requirements for CCI students?"
answer = simple_rag(question)

print(f"❓ Question: {question}")
print(f"\n💬 Answer: {answer}")

In [None]:
# Try more questions!
questions = [
    "What courses are required for computer science majors?",
    "How many credit hours are needed to graduate?",
    "What degree programs are within the College of Computing and Informatics?",
    "What is a recipe for chocolate cake?"  # Not in our document!
]

for q in questions:
    print(f"❓ {q}")
    print(f"💬 {simple_rag(q)}")
    print("-" * 50)
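
The chocolate cake question is a useful probe: the retriever always returns `k` chunks, even when none of them is relevant. One common mitigation is to tell the model explicitly what to do in that case. The sketch below is one possible prompt builder, not the prompt used in `simple_rag`; `build_rag_prompt` and `FALLBACK` are hypothetical names, and models do not always follow such instructions.

```python
FALLBACK = "I don't know based on the provided documents."

def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Build a grounded prompt with numbered chunks and an explicit fallback."""
    numbered = [f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, 1)]
    context = "\n\n".join(numbered)
    return (
        "Answer the question based ONLY on the following context.\n"
        f'If the context does not contain the answer, reply exactly: "{FALLBACK}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

demo_prompt = build_rag_prompt(
    "What is a recipe for chocolate cake?",
    ["Students must complete 120 credit hours.", "Advising is required each term."],
)
print(demo_prompt)
```

Numbering the chunks also makes it easier to ask the model to cite which chunk supports its answer.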

## Summary

In this notebook, you learned the foundational RAG pipeline:

1. **Document Loading**: Load documents from files
2. **Chunking**: Split documents into smaller pieces with `RecursiveCharacterTextSplitter`
3. **Embeddings**: Convert text to vectors with `HuggingFaceEmbeddings`
4. **Vector Store**: Index and search with `FAISS`
5. **Retriever**: Clean interface for getting relevant documents
6. **Generation**: Combine context with query and send to LLM

**Key Parameters to Tune**:
- `chunk_size`: 500-1000 is a good starting point
- `chunk_overlap`: 10-20% of chunk size
- `k`: Number of documents to retrieve (3-5 is common)