# RAG using LangChain

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline using LangChain.

## Objectives
- Load and embed documents
- Store in a FAISS vector store using LangChain
- Retrieve context from the store
- Use a language model to answer questions based on context

### Without LangChain
- You implement every step yourself:
    - Manually split documents
    - Manually embed them
    - Manually build and search the FAISS index
    - Manually construct prompts for OpenAI
    - Manually handle context formatting and errors

In [1]:
# Install LangChain and dependencies
%pip install langchain langchain-community faiss-cpu openai sentence-transformers -q

Note: you may need to restart the kernel to use updated packages.


## Step 1: Load Documents and Create Embeddings

| Scenario                          | Suggested `chunk_size` (chars) | Suggested `chunk_overlap` (chars) |
| --------------------------------- | ------------------------------ | --------------------------------- |
| Short, clean sentences            | 50–100                         | 0–10                              |
| Longer technical paragraphs       | 150–300                        | 10–50                             |
| Preparing for GPT context input   | 500–1000                       | 50–200                            |
| Sentence transformers (embedding) | 100–300                        | 20–50                             |

Generally, you want chunks to:
- Be large enough to carry meaning on their own
- Be small enough to embed efficiently
- Overlap with their neighbours when meaning carries across chunk boundaries (as in narrative or technical docs)
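
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The helper name `chunk_text` is hypothetical and this is not how LangChain's splitters are implemented; it only illustrates how the two parameters interact.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With an overlap of 3, the last three characters of each chunk reappear at the start of the next one, which is what keeps context from being cut off at a boundary.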

In [3]:
# The integration classes live in langchain-community (installed above);
# the old langchain.* import paths for them are deprecated.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter

# Sample data
text = """The mitochondria is the powerhouse of the cell.
Photosynthesis occurs in the chloroplasts of plant cells.
DNA is stored in the nucleus.
Proteins are synthesized by ribosomes.
ATP provides energy for cellular processes."""

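# Split into documents
# CharacterTextSplitter does not split by character count alone.
# It first splits on a single separator (default "\n\n", i.e. paragraph breaks),
# then merges the pieces back together up to chunk_size.
# It never cuts inside a piece, so a piece longer than chunk_size is kept whole.
# The sample text contains only single newlines, so there is no "\n\n" to split on
# and the whole text comes back as one chunk despite chunk_size=50.
# Pass separator="\n" to get one chunk per line instead.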
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0)
documents = text_splitter.create_documents([text])

# Embedding model - default device selection (a GPU is used if one is available)
# embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Embedding model - CPU
embedding = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"device": "cpu"},
)

# Create FAISS vector store
db = FAISS.from_documents(documents, embedding)
print("Stored documents:", len(documents))
print(documents)

Stored documents: 1
[Document(metadata={}, page_content='The mitochondria is the powerhouse of the cell.\nPhotosynthesis occurs in the chloroplasts of plant cells.\nDNA is stored in the nucleus.\nProteins are synthesized by ribosomes.\nATP provides energy for cellular processes.')]
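
The output above shows a single document even though `chunk_size=50`. The sketch below is a simplified pure-Python emulation of separator-first splitting, written to illustrate why; it is an assumption-laden stand-in, not LangChain's actual code, and the helper name `split_on_separator` is hypothetical.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on separator, then greedily merge pieces up to chunk_size.

    A piece is never cut internally, so a piece longer than chunk_size
    is emitted whole (simplified emulation for illustration only).
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = """The mitochondria is the powerhouse of the cell.
Photosynthesis occurs in the chloroplasts of plant cells.
DNA is stored in the nucleus.
Proteins are synthesized by ribosomes.
ATP provides energy for cellular processes."""

# No "\n\n" in the text: nothing to split on, so one oversized chunk
paragraph_chunks = split_on_separator(sample_text, "\n\n", chunk_size=50)
# Splitting on "\n" instead yields one chunk per sentence
line_chunks = split_on_separator(sample_text, "\n", chunk_size=50)

print(len(paragraph_chunks), len(line_chunks))
```

In the notebook itself, the equivalent change would be passing `separator="\n"` to `CharacterTextSplitter`, which should give the retriever five separate sentences to choose from.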


## Step 2: Create Retriever and Ask Questions

In [4]:
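# k=2 asks for the two nearest chunks. The store holds only one chunk here,
# so a single result comes back.
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})

query = "What produces energy in cells?"
# invoke() replaces the deprecated get_relevant_documents()
docs = retriever.invoke(query)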

print("Top matching chunks:")
for d in docs:
    print("-", d.page_content)

Top matching chunks:
- The mitochondria is the powerhouse of the cell.
Photosynthesis occurs in the chloroplasts of plant cells.
DNA is stored in the nucleus.
Proteins are synthesized by ribosomes.
ATP provides energy for cellular processes.
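
Conceptually, a flat (brute-force) index compares the query vector with every stored vector and returns the `k` closest. The sketch below shows that idea with Euclidean distance; cosine similarity would work analogously. The 2-D vectors are invented for illustration and are not real `all-MiniLM-L6-v2` embeddings, and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
import math


def top_k(query_vec, stored, k):
    """Return the texts of the k stored vectors nearest to query_vec."""
    scored = [(math.dist(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [text for _, text in scored[:k]]


# Hypothetical toy vectors (assumption: made up for this example)
stored = [
    ("The mitochondria is the powerhouse of the cell.", (0.9, 0.2)),
    ("DNA is stored in the nucleus.", (0.1, 0.9)),
    ("ATP provides energy for cellular processes.", (0.8, 0.1)),
]
query_vec = (0.85, 0.1)

results = top_k(query_vec, stored, k=2)
for text in results:
    print("-", text)
```

Asking for more results than the store holds simply returns everything available, which mirrors the notebook returning one chunk for `k=2`.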


## Step 3: RAG Chain with OpenAI LLM

In [5]:
from langchain.chains import RetrievalQA

# Requires the OPENAI_API_KEY environment variable to be set
llm = OpenAI(temperature=0.3)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# invoke() replaces the deprecated direct call rag_chain({...})
result = rag_chain.invoke({"query": query})
print("Answer:", result['result'])

Answer: 
ATP provides energy for cellular processes.
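
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows roughly what that step does when written by hand. The template wording and the `build_prompt` helper are illustrative assumptions, not LangChain's actual template.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "ATP provides energy for cellular processes.",
    "The mitochondria is the powerhouse of the cell.",
]
prompt = build_prompt("What produces energy in cells?", retrieved)
print(prompt)
```

The resulting string is what would be sent to the language model. Because every chunk is included verbatim, the number and size of retrieved chunks directly determine how much of the model's context window is used.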


## Summary
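- LangChain handled splitting, embedding, indexing, retrieval, and prompt construction in a few calls.
- We created a retriever from a FAISS vector store.
- We used LangChain’s `RetrievalQA` to fetch relevant context and answer a question.
- Because the sample text contained no `"\n\n"` separators, the store held a single chunk, so retrieval returned the whole text.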