<a href="https://colab.research.google.com/github/mdrk300902/demo-repo/blob/main/15_RAG_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Question Answering Pipeline with Hugging Face and LangChain

This project implements a retrieval-augmented question answering (QA) system that combines the following components:

- **Unstructured web content ingestion:** We load relevant documents from a trusted health website.
- **Text chunking and preprocessing:** Large documents are split into smaller, manageable text chunks based on token length to fit model context windows.
- **Semantic embeddings and vector search:** Each chunk is converted into dense vector embeddings using Hugging Face sentence-transformers, then indexed with ChromaDB for fast similarity search.
- **Large Language Model (LLM) generation:** We use Google's Flan-T5 Large model, loaded through Hugging Face Transformers, to generate answers from the retrieved chunks. It runs locally, with GPU acceleration if available.
- **Conversational memory:** The system supports multi-turn interactions by maintaining conversation history.
- **Prompt engineering:** Customized prompts guide the LLM to produce clear and contextually relevant answers.
- **Performance optimizations:** Chunk sizes and retrieved result counts are tuned to fit the model's token limits and reduce inference errors.


### 1. Install Required Packages (run once)

In [2]:
!pip install -U langchain langchain-huggingface langchain-community sentence-transformers transformers chromadb tiktoken huggingface_hub

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting transformers
  Downloading transformers-4.55.3-py3-none-any.whl.metadata (41 kB)
Collecting chromadb
  Downloading chromadb-1.0.20-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.2-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (8.7 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)

### 2. Imports and Environment Setup

In [3]:
import os
import torch
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
import tiktoken

# Placeholder value: substitute your own Hugging Face token before running.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "HUGGINGFACE_API_TOKEN"
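
Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the hypothetical helper `resolve_hf_token` below (not part of the original notebook) reads the token from the environment and treats the unfilled placeholder string as "no token".

```python
import os
from typing import Optional

PLACEHOLDER = "HUGGINGFACE_API_TOKEN"  # the unfilled value used in the cell above


def resolve_hf_token(env_var: str = "HUGGINGFACEHUB_API_TOKEN") -> Optional[str]:
    """Return the token stored in env_var, or None if it is unset or still the placeholder."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == PLACEHOLDER:
        return None
    return token


token = resolve_hf_token()
if token is None:
    print("No Hugging Face token found in the environment.")
```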



### 3. Load Document from the Web

In [4]:
loader = WebBaseLoader("https://my.clevelandclinic.org/health/diseases/10946-cavities")
documents = loader.load()

### 4. Tokenizer and Text Splitting - Reduce chunk size for model limit (<= 512 tokens)

In [5]:
tokenizer_encoding = tiktoken.get_encoding('cl100k_base')
def tiktoken_len(text):
    return len(tokenizer_encoding.encode(text, disallowed_special=()))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    length_function=tiktoken_len
)
chunks = splitter.split_documents(documents)
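
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows over a token list. It is an illustration only: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, so its chunk boundaries will differ. The function name `chunk_tokens` and the toy input are assumptions made for this example.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: ten placeholder tokens, windows of 4 with 1 token of overlap.
words = [f"w{i}" for i in range(10)]
toy_chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

With the notebook's settings (200 and 50), each new window would advance by 150 tokens under this simplified scheme.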


### 5. Create Hugging Face Embeddings for chunks

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db")
# chromadb >= 0.4 persists automatically when persist_directory is set, so no explicit persist() call is needed.
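
The idea behind the similarity search can be sketched with a brute-force cosine-similarity ranking over toy vectors. This is not how Chroma is implemented (it maintains its own index and has a configurable distance metric), and the three-dimensional vectors below are made-up stand-ins for real sentence embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 1) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three chunks and one query (illustrative values only).
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [0.9, 0.1, 0.0]
best = top_k(toy_query, toy_vectors, k=2)
print(best)
```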

### 6. Load Hugging Face Flan-T5 Large with explicit GPU usage check

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Model running on device: {device}")

hf_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
hf_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

pipe = pipeline(
    "text2text-generation",
    model=hf_model,
    tokenizer=hf_tokenizer,
    device=0 if device == "cuda" else -1,
    max_length=256,
    do_sample=False
)

llm = HuggingFacePipeline(pipeline=pipe)

### 7. Define a prompt template to guide the LLM responses

In [8]:
prompt_template = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are a helpful assistant. Answer the question based on the context below:\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
)
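
To see what the model actually receives, the same template string can be filled with plain `str.format`. This is a sketch under assumptions: the helper `render_prompt`, the blank-line separator between chunks, and the placeholder chunk text are illustrative and not taken from the chain's internals.

```python
from typing import List

TEMPLATE = (
    "You are a helpful assistant. Answer the question based on the context below:\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def render_prompt(context_chunks: List[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(context_chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk text, standing in for whatever the retriever returns.
rendered = render_prompt(["<text of retrieved chunk 1>"], "How can I prevent cavities?")
print(rendered)
```

Ending the template with `Answer:` leaves the model to continue from that point, which is why the generated text contains only the answer.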

### 8. Initialize conversational memory for multi-turn chat

In [9]:
# output_key tells the memory to store only the answer, not the source documents.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key="answer")
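
The reason for setting `output_key` is easier to see in a toy version of a conversation buffer. The class `SimpleChatBuffer` below is hypothetical and much simpler than `ConversationBufferMemory`. It only shows that when a chain returns several fields, the buffer needs to be told which one is the reply worth remembering.

```python
from typing import Any, Dict, List, Tuple


class SimpleChatBuffer:
    """Toy conversation memory: keeps (question, answer) pairs in order."""

    def __init__(self, output_key: str = "answer") -> None:
        self.output_key = output_key
        self.turns: List[Tuple[str, str]] = []

    def save_turn(self, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        # The chain output may hold several fields; keep only the one named by output_key.
        self.turns.append((inputs["question"], outputs[self.output_key]))

    def as_text(self) -> str:
        """Render the history as alternating Human/Assistant lines."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


buffer = SimpleChatBuffer(output_key="answer")
buffer.save_turn(
    {"question": "first question"},
    {"answer": "first answer", "source_documents": []},  # two output fields
)
print(buffer.as_text())
```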


### 9. Build the Conversational Retrieval QA Chain with explicit output_key to fix memory storage

In [10]:
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 1}),
    memory=memory,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": prompt_template},
    output_key="answer"   # Name of the chain's answer field; must match the memory's output_key
)
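
The choice of `k` interacts with the chunk size and the model's input limit. A rough budget check can be sketched as below. The allowances of 30 tokens for the question and 30 for the template text are assumptions for illustration, and the counts are approximate because the splitter measures length with tiktoken rather than with Flan-T5's own tokenizer.

```python
MODEL_INPUT_LIMIT = 512  # limit quoted in the section 4 heading


def estimate_prompt_tokens(k: int, chunk_size: int, question_tokens: int, template_tokens: int) -> int:
    """Upper-bound estimate: k full-size chunks plus the question and fixed template text."""
    return k * chunk_size + question_tokens + template_tokens


def fits_model_limit(
    k: int,
    chunk_size: int,
    question_tokens: int = 30,   # assumed allowance
    template_tokens: int = 30,   # assumed allowance
    limit: int = MODEL_INPUT_LIMIT,
) -> bool:
    """True if the estimated prompt stays within the model's input limit."""
    return estimate_prompt_tokens(k, chunk_size, question_tokens, template_tokens) <= limit


for k in (1, 2, 3):
    total = estimate_prompt_tokens(k, 200, 30, 30)
    print(f"k={k}: ~{total} tokens, fits={fits_model_limit(k, 200)}")
```

Under these assumptions, `k=1` leaves a wide margin at a chunk size of 200, while `k=3` would exceed the limit.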

### 10. Query Function for multi-turn interaction

In [11]:
def chat(query: str):
    result = qa_chain.invoke({"question": query})
    print("\nAnswer:\n", result["answer"])
    print("\nSources:")
    for doc in result["source_documents"]:
        print("-", doc.metadata.get("source", "Unknown"))
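
If `k` is raised above 1, several retrieved chunks can come from the same page, and the loop above would print the same URL repeatedly. A small order-preserving de-duplication step is one way to handle that. The helper `unique_sources` is hypothetical and works on plain metadata dictionaries shaped like `doc.metadata`.

```python
from typing import Dict, List


def unique_sources(metadatas: List[Dict[str, str]]) -> List[str]:
    """Return each source once, in first-seen order; missing sources become 'Unknown'."""
    seen: List[str] = []
    for metadata in metadatas:
        source = metadata.get("source", "Unknown")
        if source not in seen:
            seen.append(source)
    return seen


# Example with the document URL loaded earlier plus one entry lacking a source.
example_metadata = [
    {"source": "https://my.clevelandclinic.org/health/diseases/10946-cavities"},
    {"source": "https://my.clevelandclinic.org/health/diseases/10946-cavities"},
    {},
]
for source in unique_sources(example_metadata):
    print("-", source)
```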

### 11. Example Queries

In [12]:
chat("How can I prevent cavity in my tooth?")
chat("What causes bleeding gums?")
chat("Can you summarize the key dental hygiene tips?")




Answer:
 Brushing your teeth with a soft-bristled brush and fluoride toothpaste at least twice a day, and preferably after every meal

Sources:
- https://my.clevelandclinic.org/health/diseases/10946-cavities

Answer:
 a tooth abscess

Sources:
- https://my.clevelandclinic.org/health/diseases/10946-cavities

Answer:
 (iii)

Sources:
- https://my.clevelandclinic.org/health/diseases/10946-cavities


## Technologies Used

- **LangChain:** Framework for building composable LLM applications, including document loading, text splitting, memory management, and chain building.

- **Hugging Face Transformers & Pipelines:** Open-source library to load and run powerful pre-trained language models; here we use Flan-T5 for text-to-text generation.

- **Sentence-Transformers:** Hugging Face model family specialized in producing semantically meaningful embeddings for text, enabling efficient vector similarity search.

- **ChromaDB:** Fast, scalable vector database for storing and searching embeddings locally or in the cloud.

- **tiktoken:** OpenAI's BPE tokenizer library, used here as the length function for token-based text splitting. Its counts only approximate those of Flan-T5's own SentencePiece tokenizer.

- **PyTorch:** Deep learning framework providing GPU acceleration for model inference.

- **Google Colab/GPU (optional):** Cloud environment for running the pipeline with hardware acceleration.

- **Prompt Engineering:** Custom prompt templates to control and optimize model responses for better relevance and clarity.

- **Conversational Memory:** Tracks dialogue context across multiple user interactions for coherent multi-turn conversations.

This combination provides an efficient, cost-effective, and extensible retrieval-augmented generation (RAG) system.
