<a href="https://colab.research.google.com/github/hamidb201214-svg/Lectures/blob/main/M3_3_NLG_3_RAG_Mistral_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Simple Retrieval Augmented Generation (RAG)
#### RAG with LangChain + Chroma + Hugging Face (Sentence Embeddings + Local ~2B LLM)


![](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*s_pbYF-jOTqSYrMG.png)

LangChain provides document loaders for reading external files from a variety of sources. Retrieving relevant passages from that external data and passing them to an LLM as context is generally referred to as Retrieval Augmented Generation (RAG).

## Using the **“Attention Is All You Need”** paper as the knowledge source (PDF)

This notebook builds a **Retrieval-Augmented Generation (RAG)** demo using:

- **Sentence embeddings** from Hugging Face (**sentence-transformers**)
- **ChromaDB** as the vector store
- A **local** small LLM from Hugging Face (example: **Gemma 2B Instruct**)
- A **PDF** corpus (default: *Attention Is All You Need*, Vaswani et al. 2017)

> If you prefer to use a different PDF, just set `PDF_PATH` to your file.


In [None]:
# Install (restart kernel after install if needed)
%pip -q install -U \
  langchain \
  langchain-classic \
  langchain-chroma \
  langchain-huggingface \
  langchain-community \
  langchain-text-splitters \
  sentence-transformers \
  transformers \
  accelerate \
  bitsandbytes \
  chromadb \
  pypdf \
  requests


## 1) Hugging Face token (needed for gated models)

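Set `HF_TOKEN` or `HUGGINGFACEHUB_API_TOKEN` in the environment before running. If neither is set, the cell below prompts for a token and stores it in `HF_TOKEN`.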
- If you're using Gemma, you must accept the model license on Hugging Face first.
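
The lookup order used by the next cell can be sketched as a small pure function. `resolve_hf_token` and `TOKEN_ENV_VARS` are hypothetical names chosen for illustration, not part of any Hugging Face library. The helper takes a plain mapping so it can be tried without touching the real environment, and unlike the cell it also treats whitespace-only values as missing.

```python
from typing import Mapping, Optional

# Env var names the notebook checks, in priority order.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN")


def resolve_hf_token(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty token found in `env`, or None.

    Hypothetical helper for illustration only.
    """
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


# Made-up token values; real tokens should never be hard-coded.
print(resolve_hf_token({"HUGGINGFACEHUB_API_TOKEN": "hf_example"}))
print(resolve_hf_token({}))
```

In the notebook itself, the same function could be called as `resolve_hf_token(os.environ)`.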


In [None]:
import os, getpass

# Pick one env var name and stick to it:
if not (os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")):
    token = getpass.getpass("Hugging Face token (for gated models): ")
    os.environ["HF_TOKEN"] = token

HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")


## 2) Download + load the Attention paper (PDF)

Default: **arXiv PDF** for *Attention Is All You Need*.

If your environment has **no internet**, download the PDF manually and set `PDF_PATH` to that file.
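
Whether the file was downloaded or copied in by hand, a quick check that it really is a PDF can save a confusing loader error later (for example when an HTML error page was saved under a `.pdf` name). The sketch below relies on the fact that PDF files start with the bytes `%PDF-`. `looks_like_pdf` is a hypothetical helper name, and the check is deliberately strict: it rejects files with junk bytes before the header even though some readers tolerate them.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if `data` starts with the PDF header marker b"%PDF-".

    Hypothetical helper for illustration; it only inspects the first 5 bytes.
    """
    return data[:5] == b"%PDF-"


print(looks_like_pdf(b"%PDF-1.5\nrest of file"))
print(looks_like_pdf(b"<html>Not found</html>"))
```

Once `PDF_PATH` exists, it could be applied as `looks_like_pdf(PDF_PATH.read_bytes())`.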


In [None]:
from pathlib import Path
import requests

# You can replace this with your own PDF.
ARXIV_PDF_URL = "https://arxiv.org/pdf/1706.03762.pdf"

DATA_DIR = Path("./data")
DATA_DIR.mkdir(exist_ok=True)

PDF_PATH = DATA_DIR / "attention_is_all_you_need.pdf"

# Download if missing (requires internet)
if not PDF_PATH.exists():
    print("Downloading:", ARXIV_PDF_URL)
    r = requests.get(ARXIV_PDF_URL, timeout=60)
    r.raise_for_status()
    PDF_PATH.write_bytes(r.content)

print("PDF path:", PDF_PATH.resolve())
print("PDF size (MB):", round(PDF_PATH.stat().st_size / 1e6, 2))


In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(str(PDF_PATH))
docs = loader.load()

print("Loaded pages:", len(docs))
print("Example metadata:", docs[0].metadata)
print("\nFirst page snippet:\n", docs[0].page_content[:500], "...")


## 3) Split documents into chunks
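
The splitter in the cell below is configured with `chunk_size=900` and `chunk_overlap=150`. As a rough illustration of what those two numbers mean, here is a simplified fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and line breaks, so its chunks vary in length. `sliding_chunks` is a hypothetical name used only in this sketch.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 900, chunk_overlap: int = 150) -> List[str]:
    """Split `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2000  # placeholder text, 2000 characters long
print([len(piece) for piece in sliding_chunks(sample)])
```

With a stride of 750 characters, each chunk repeats the last 150 characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.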

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=150)
splits = splitter.split_documents(docs)

print("Chunks:", len(splits))
print("Example chunk:\n", splits[0].page_content[:300], "...")


## 4) Sentence embeddings from Hugging Face (sentence-transformers)

`sentence-transformers/all-MiniLM-L6-v2` is a popular, fast baseline that maps text to 384-dimensional vectors.
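
The `normalize_embeddings` option used in the next cell scales every vector to unit length. The sketch below, written in plain Python with made-up 2-D vectors, shows why that is convenient: for unit-length vectors the dot product equals the cosine similarity. The helper names are illustrative and not part of any library.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale `vec` to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between `a` and `b`."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]  # toy vectors standing in for real embeddings
b = [4.0, 3.0]
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))  # same value up to rounding
```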


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    # Good default for cosine similarity:
    encode_kwargs={"normalize_embeddings": True},
)

# Quick sanity check:
vec = embeddings.embed_query("hello world")
print("Embedding dim:", len(vec))


## 5) Store embeddings in Chroma
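
When a persisted collection is written to more than once, identical chunks can end up stored several times. One common way to make indexing repeatable is to derive each chunk's ID from its content, so the same chunk always maps to the same ID. The sketch below is a minimal version of that idea. `chunk_id` is a hypothetical helper, and the `source` and `page` metadata keys are assumed to be present, as they are in the loader output printed earlier.

```python
import hashlib
from typing import Any, Mapping


def chunk_id(text: str, metadata: Mapping[str, Any]) -> str:
    """Derive a deterministic ID from a chunk's source, page, and text.

    Hypothetical helper for illustration only.
    """
    source = str(metadata.get("source", ""))
    page = str(metadata.get("page", ""))
    # A separator that is unlikely to appear in the fields keeps them distinct.
    payload = "\x1f".join((source, page, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


meta = {"source": "data/attention_is_all_you_need.pdf", "page": 2}
print(chunk_id("Scaled dot-product attention", meta))
```

Such IDs could then be passed alongside the chunks, for example `vector_store.add_documents(splits, ids=[chunk_id(d.page_content, d.metadata) for d in splits])`. This assumes the installed `langchain-chroma` version accepts an `ids` argument and overwrites entries whose IDs already exist, which is worth confirming in its documentation.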

In [None]:
from langchain_chroma import Chroma

PERSIST_DIR = "./chroma_attention_paper_db"

vector_store = Chroma(
    collection_name="attention_paper_rag",
    embedding_function=embeddings,
    persist_directory=PERSIST_DIR,
)

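# Add chunks only when the collection is empty, so re-running this cell
# does not store duplicate copies in the persisted directory.
# Delete PERSIST_DIR to rebuild the index from a different PDF.
if vector_store._collection.count() == 0:
    vector_store.add_documents(splits)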

print("Persist dir:", PERSIST_DIR)
print("Count:", vector_store._collection.count())


## 6) Load a ~2B model locally (example: Gemma 2B Instruct)

Model options you can try:
- `google/gemma-2b-it` (2B, gated on HF)
- `BSC-LT/salamandraTA-2b-instruct` (2B, Apache-2.0)
- `HuggingFaceTB/SmolLM2-1.7B-Instruct` (1.7B, if a slightly smaller model is acceptable)

The cell below tries **4-bit** (bitsandbytes) if available; otherwise it falls back to standard loading.
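
A back-of-the-envelope estimate shows why 4-bit loading matters on small GPUs. The sketch below counts weight storage only, for a nominal 2 billion parameters. Real checkpoints differ from the nominal count, quantized models typically keep some layers in higher precision, and activations and the KV cache need extra memory, so treat the numbers as a lower bound. `approx_weight_gb` is a hypothetical helper name.

```python
def approx_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{approx_weight_gb(2e9, bits):.1f} GB")
# prints:
# 16-bit: ~4.0 GB
# 8-bit: ~2.0 GB
# 4-bit: ~1.0 GB
```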


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

# Pick your model here:
MODEL_ID = "google/gemma-2b-it"   # gated
# MODEL_ID = "BSC-LT/salamandraTA-2b-instruct"  # open
# MODEL_ID = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # smaller but easy to run

bnb_config = BitsAndBytesConfig(load_in_4bit=True)

try:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        token=HF_TOKEN,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype="auto",
    )
    print("Loaded in 4-bit.")
except Exception as e:
    print("4-bit load failed, falling back to standard load. Error was:\n", str(e)[:500], "...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        token=HF_TOKEN,
        device_map="auto",
        torch_dtype="auto",
    )

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)

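gen_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    # Greedy decoding. `temperature` only applies when sampling, so it is
    # omitted here to avoid a conflicting-settings warning.
    do_sample=False,
    return_full_text=False,
)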

# Small smoke test:
out = gen_pipe("Say hello in one short sentence.")[0]["generated_text"]
print(out)


## 7) Build a RAG chain (retriever + prompt + LLM)
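
Before wiring up the real chain, it may help to see the two steps it performs in miniature: rank chunks by similarity to the query, then "stuff" the best ones into the prompt. The sketch below uses made-up 2-D unit vectors and toy chunk strings in place of real embeddings and PDF text, and it joins the selected chunks with blank lines. `top_k`, `stuff_prompt`, and `PROMPT_TEMPLATE` are names invented for this sketch, not LangChain APIs.

```python
from typing import List, Sequence

PROMPT_TEMPLATE = (
    "Answer the question using ONLY the context.\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {input}"
)


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 4) -> List[int]:
    """Return indices of the k vectors with the highest dot product with the query."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def stuff_prompt(question: str, chunks: Sequence[str]) -> str:
    """Join the retrieved chunks and place them in the prompt's context slot."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)


# Toy data: invented sentences and hand-picked unit vectors.
chunks = [
    "Multi-head attention runs several attention functions in parallel.",
    "The model was trained with the Adam optimizer.",
    "Each head can attend to a different representation subspace.",
]
doc_vecs = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6)]
query_vec = (1.0, 0.0)

best = top_k(query_vec, doc_vecs, k=2)
print(best)  # → [0, 2]
print(stuff_prompt("What is multi-head attention?", [chunks[i] for i in best]))
```

The retriever created below with `search_kwargs={"k": 4}` plays the role of `top_k`, and the documents chain plays the role of `stuff_prompt`.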

In [None]:
from langchain_huggingface import HuggingFacePipeline, ChatHuggingFace
from langchain_core.prompts import ChatPromptTemplate

# In LangChain v1+, helper chains live in langchain-classic:
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain

# Wrap pipeline for LangChain
llm = HuggingFacePipeline(pipeline=gen_pipe)

# Treat it like a chat model (works well with instruct models)
chat_model = ChatHuggingFace(llm=llm)

retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    """You are a helpful assistant. Answer the question using ONLY the context from the Attention paper.

<context>
{context}
</context>

Question: {input}

If the context does not contain the answer, say you don't know."""
)

doc_chain = create_stuff_documents_chain(chat_model, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)


## 8) Ask questions + inspect retrieved sources
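
When inspecting sources, a compact list of the pages the answer drew on is often easier to scan than full chunk dumps. The sketch below summarizes chunk metadata into page citations. It assumes each metadata dict carries a zero-based `page` key, which matches the loader metadata printed earlier but is worth verifying for your loader version. `pages_cited` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping


def pages_cited(metadatas: Iterable[Mapping[str, Any]]) -> List[str]:
    """Summarize chunk metadata as 'page N (xM)' strings with 1-based page numbers."""
    counts = Counter(int(m["page"]) + 1 for m in metadatas if "page" in m)
    return [f"page {page} (x{n})" for page, n in sorted(counts.items())]


# Made-up metadata standing in for retrieved chunks.
retrieved = [{"source": "paper.pdf", "page": 3}, {"page": 4}, {"page": 3}, {"page": 1}]
print(pages_cited(retrieved))  # → ['page 2 (x1)', 'page 4 (x2)', 'page 5 (x1)']
```

After the cell below has run, it could be called as `pages_cited(d.metadata for d in result["context"])`.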

In [None]:
question = "What is multi-head attention and why is it useful in the Transformer?"

result = rag_chain.invoke({"input": question})

print("ANSWER:\n", result["answer"])
print("\nSOURCES (top retrieved chunks):")
for i, d in enumerate(result["context"], 1):
    print(f"\n--- Chunk {i} ---")
    print("metadata:", d.metadata)
    print(d.page_content[:400], "...")


## 9) (Optional) Reload the persisted Chroma DB later
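
Before reloading, it can be useful to check whether anything was persisted at all, for instance to decide between reloading and re-indexing. The sketch below only tests that the directory exists and is non-empty. It does not validate the contents, and `has_persisted_store` is a hypothetical helper rather than a Chroma API.

```python
import tempfile
from pathlib import Path


def has_persisted_store(persist_dir) -> bool:
    """Return True if `persist_dir` is an existing directory with at least one entry."""
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    print(has_persisted_store(tmp))  # empty directory
    # Placeholder file standing in for whatever the vector store writes.
    (Path(tmp) / "placeholder.db").write_text("demo")
    print(has_persisted_store(tmp))  # directory now has content
```

In this notebook the check would be `has_persisted_store(PERSIST_DIR)`.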

In [None]:
# vector_store_reloaded = Chroma(
#     collection_name="attention_paper_rag",
#     embedding_function=embeddings,
#     persist_directory=PERSIST_DIR,
# )
# print(vector_store_reloaded._collection.count())
