# Conversational RAG with LangChain, FAISS, and OpenAI: Step-by-Step Explanation

## 1. **Install Required Packages**



In [None]:
! pip install langchain langchain-community langchain-openai faiss-cpu sentence-transformers openai tiktoken python-dotenv rouge rouge-score nltk

*Installs the libraries used for document loading, embeddings, vector search, the LLM, and environment management. `rouge`, `rouge-score`, and `nltk` are evaluation utilities that the cells shown here do not use.*

---

## 2. **Load Environment Variables**



In [18]:
from dotenv import load_dotenv
load_dotenv()

True

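*Loads API keys and other secrets from a `.env` file, so you don’t hardcode them in your notebook. `ChatOpenAI` reads `OPENAI_API_KEY` from the environment; the `True` output indicates that at least one variable was set from the file.*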
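
A quick check that the expected key is present can prevent a confusing authentication error later on. The sketch below uses only the standard library; the helper name `missing_env_vars` is hypothetical, and it assumes `OPENAI_API_KEY` is the only variable this notebook needs.

```python
import os

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical usage, run after load_dotenv():
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing `env` explicitly makes the helper easy to test without touching the real environment.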

---

## 3. **Load and Preprocess Documents**



In [19]:
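import re
from pathlib import Path

# Current import paths; the older langchain.schema / langchain.embeddings /
# langchain.vectorstores locations are deprecated in recent LangChain releases.
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document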

def load_documents_from_folder(folder_path):
    """Load each .txt file in the folder as a separate Document."""
    txt_files = Path(folder_path).glob("*.txt")
    documents = []
    for file in txt_files:
        text = file.read_text(encoding="utf-8")
        clean_text = re.sub(r'\s+', ' ', text.strip())  # Clean and normalize
        doc = Document(page_content=clean_text, metadata={"source": file.name})
        documents.append(doc)
    return documents

# Load documents from folder
folder_path = "ancient_greece_data"
documents = load_documents_from_folder(folder_path)

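*Reads every `.txt` file in the folder, collapses each run of whitespace into a single space, and wraps the result in a LangChain `Document` whose `source` metadata holds the file name (used later for citations).*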
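
To see what the cleaning step does to a file's text, here is a minimal standalone sketch of the same regular expression. The function name `normalize_whitespace` and the sample string are illustrative only.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text.strip())

sample = "Socrates\n\n  lived in\tAthens.  \n"
print(normalize_whitespace(sample))  # Socrates lived in Athens.
```

One consequence worth keeping in mind: paragraph breaks are removed too, so a text splitter applied afterwards cannot split on `"\n\n"` or `"\n"`.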

---

## 4. **Create Embeddings and Build FAISS Index**



In [20]:
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embedding_model)
vectorstore.save_local("faiss_index_ancient_greece_notebook")

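*Embeds each document with the `all-MiniLM-L6-v2` sentence-transformer model, stores the vectors in a FAISS index for fast similarity search, and saves the index to disk. No text splitter is applied in this cell, so each file is indexed as a single vector.*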
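
Conceptually, similarity search means "find the stored vectors closest to the query vector". The following is a toy brute-force sketch of that idea in plain Python, assuming Euclidean distance; it is not how FAISS is implemented, and the three-dimensional vectors and file names are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k=2):
    """indexed: list of (label, vector) pairs. Return the k labels closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [label for label, _ in ranked[:k]]

# Made-up 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
index = [
    ("socrates.txt", [0.9, 0.1, 0.0]),
    ("sparta.txt", [0.1, 0.9, 0.0]),
    ("olympics.txt", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], index, k=2))  # ['socrates.txt', 'sparta.txt']
```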

---

## 5. **Load FAISS Index (for Reuse)**



In [21]:
vectorstore = FAISS.load_local("faiss_index_ancient_greece", embedding_model, allow_dangerous_deserialization=True)

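*Loads a saved FAISS index for querying. The folder name here (`faiss_index_ancient_greece`) differs from the one written in step 4 (`faiss_index_ancient_greece_notebook`), so this cell loads an index built separately; pass the same folder name to reload the index from step 4. `allow_dangerous_deserialization=True` is required because part of the index is stored as a pickle file, so only load indexes you created or trust.*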
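
Before loading, it can help to confirm that the index folder actually contains the expected files. This sketch assumes the default `index_name` of `"index"`, under which LangChain's `save_local` writes `index.faiss` and `index.pkl`; the helper name `faiss_index_exists` is hypothetical.

```python
from pathlib import Path

def faiss_index_exists(folder, index_name="index"):
    """Return True if both files of a saved LangChain FAISS index are present."""
    folder = Path(folder)
    return all(
        (folder / f"{index_name}{ext}").is_file() for ext in (".faiss", ".pkl")
    )

# Hypothetical usage: check the folder before calling FAISS.load_local.
if not faiss_index_exists("faiss_index_ancient_greece"):
    print("Index folder is missing or incomplete; rebuild it first.")
```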

---

## 6. **Create a Retriever**



In [22]:
retriever = vectorstore.as_retriever()

*Wraps the vectorstore as a retriever object for searching relevant documents. With no arguments, it runs a similarity search and returns the four closest matches per query.*

---

## 7. **Initialize the LLM**



In [23]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

*Sets up the OpenAI GPT-4o-mini model for answering questions.*

---

## 8. **Create a History-Aware Retriever**



In [24]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant that rewrites follow-up questions into standalone questions using chat history."),
        MessagesPlaceholder("chat_history"),
        ("human", """Given the above chat history and the latest user question below,
reformulate it into a standalone question. Do not answer the question.
If it's already standalone, return it as is.

Latest user question:
{input}"""),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

*Uses chat history to reformulate follow-up questions into standalone questions for better retrieval.*
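
The prompt above is assembled into a message list in which the stored conversation sits between the system instruction and the latest question. The sketch below mimics that assembly with plain tuples; it is not LangChain's implementation, the function name is hypothetical, and the prompt texts are shortened stand-ins.

```python
def build_rewrite_messages(chat_history, question):
    """chat_history: list of (role, text) tuples from earlier turns."""
    messages = [("system", "Rewrite follow-up questions into standalone questions.")]
    # The history takes the place of MessagesPlaceholder("chat_history").
    messages.extend(chat_history)
    messages.append(("human", f"Latest user question:\n{question}"))
    return messages

history = [
    ("human", "who is socrates"),
    ("ai", "Socrates was a philosopher who lived in Athens."),
]
for role, text in build_rewrite_messages(history, "where did he lived"):
    print(f"{role}: {text}")
```

With the earlier turn in view, the LLM has what it needs to turn "where did he lived" into a question that names Socrates. When the history is empty, `create_history_aware_retriever` passes the input straight to the retriever without a rewriting call.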

---

## 9. **Create the RAG Prompt and Chain**



In [25]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", (
            """You are an assistant for question-answering tasks.
            Answer this question using the provided context only.
            If you dont know the answer, just say 'I dont know'
            {context}"""
        )),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
contextual_rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

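*Defines how the LLM should answer using only retrieved context. `create_stuff_documents_chain` places the retrieved documents into the `{context}` slot of the prompt, and `create_retrieval_chain` connects the history-aware retriever to that question-answering chain.*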
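
"Stuffing" simply means concatenating the retrieved documents and substituting the result for `{context}`. A minimal sketch of that step follows, assuming a blank line as the separator between documents; the `Doc` class is a stand-in for LangChain's `Document`, and the helper names and sample texts are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def stuff_context(docs, separator="\n\n"):
    """Join the text of all retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)

def fill_system_prompt(template, docs):
    """Substitute the stuffed context into the {context} slot of the template."""
    return template.format(context=stuff_context(docs))

template = "Answer this question using the provided context only.\n{context}"
docs = [
    Doc("Socrates lived in Athens.", {"source": "a.txt"}),
    Doc("Plato established the Academy.", {"source": "b.txt"}),
]
print(fill_system_prompt(template, docs))
```

Because every retrieved document goes into a single prompt, the number and size of retrieved chunks directly affect prompt length.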

---

## 10. **Enable Conversational Memory**



In [26]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conversational_rag_chain = RunnableWithMessageHistory(
    contextual_rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

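*Adds chat history tracking for each session, enabling context-aware conversations. Histories live in the in-memory `store` dictionary, keyed by `session_id`, so they are lost when the kernel restarts.*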
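
The session mechanism boils down to a dictionary lookup: the same `session_id` always returns the same history object, and a new ID starts empty. Here is a toy sketch of that behaviour with plain lists in place of `ChatMessageHistory`; the names `toy_store`, `get_history`, and `record_turn` are hypothetical.

```python
toy_store = {}

def get_history(session_id):
    """Return the history list for a session, creating it on first use."""
    return toy_store.setdefault(session_id, [])

def record_turn(session_id, question, answer):
    """Append one question/answer exchange to the session's history."""
    history = get_history(session_id)
    history.append(("human", question))
    history.append(("ai", answer))

record_turn("abc1235", "who is socrates", "Socrates was a philosopher.")
print(len(get_history("abc1235")))   # 2
print(len(get_history("other-id")))  # 0
```

Reusing a session ID continues the conversation; switching to a new one starts from a clean slate.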

---

## 11. **Run the Conversational RAG Chain**



In [27]:
response = conversational_rag_chain.invoke(
    {"input": "who is socrates"},
    config={
        "configurable": {"session_id": "abc1235"}
    },
)
sources = set()
if "context" in response:
    for doc in response["context"]:
        if "source" in doc.metadata:
            sources.add(doc.metadata["source"])
citation = ""
if sources:
    citation = f"\n\nThis answer is based on information from: {', '.join(sorted(sources))}."
print(response["answer"] + citation)

# Optionally, still print retrieved chunks and sources for transparency (for debugging or evaluation)
if "context" in response:
    print("\nRetrieved Chunks and Sources:")
    for doc in response["context"]:
        print(f"\n---\nChunk:\n{doc.page_content}\nSource: {doc.metadata.get('source', 'N/A')}")
    print("\nSources:", ", ".join(sorted(sources)))
else:
    print("No sources found.")


Socrates was a philosopher who lived in Athens during the 5th century BCE. He is considered the father of Western philosophy and is famous for his method of inquiry known as the Socratic Method, which involved questioning others to stimulate critical thinking and the search for truth. Although he left no written works, his ideas and teachings were passed down through his disciple, Plato. Socrates believed that wisdom came from acknowledging one's own ignorance and engaging in open dialogue. His commitment to truth and ethical conduct ultimately led to his trial and sentence to death by drinking poison hemlock.

This answer is based on information from: 18.txt, 31.txt, 32.txt, 33.txt.

Retrieved Chunks and Sources:

---
Chunk:
Greek Philosophy: Socrates, Plato, and Aristotle Socrates and the Socratic Method Ancient Greece is widely regarded as the birthplace of Western philosophy, and at the forefront of this philosophical revolution were three influential thinkers: Socrates, Plato, and

*Asks a question and prints the answer, storing the conversation under a session ID.*
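
The source-collection logic in the cell above is repeated for every question, so it could be factored into a helper. The sketch below is one way to do that; `format_citation` is a hypothetical name, and `SimpleNamespace` objects stand in for the `Document` instances found in `response["context"]`.

```python
from types import SimpleNamespace

def format_citation(response):
    """Build the citation sentence from the sources of the retrieved documents."""
    sources = sorted(
        {
            doc.metadata["source"]
            for doc in response.get("context", [])
            if "source" in doc.metadata
        }
    )
    if not sources:
        return ""
    return f"\n\nThis answer is based on information from: {', '.join(sources)}."

# Made-up response with stand-in documents, for illustration only.
fake_response = {
    "answer": "Socrates lived in Athens.",
    "context": [
        SimpleNamespace(metadata={"source": "b.txt"}),
        SimpleNamespace(metadata={"source": "a.txt"}),
        SimpleNamespace(metadata={}),
    ],
}
print(fake_response["answer"] + format_citation(fake_response))
```

With such a helper, each question cell reduces to an `invoke` call followed by one `print`.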

---

## 12. **Ask a Follow-up Question**



In [28]:
response = conversational_rag_chain.invoke(
    {"input": "where did he lived"},
    config={
        "configurable": {"session_id": "abc1235"}
    },
)
# print(response["answer"])
sources = set()
if "context" in response:
    for doc in response["context"]:
        if "source" in doc.metadata:
            sources.add(doc.metadata["source"])
citation = ""
if sources:
    citation = f"\n\nThis answer is based on information from: {', '.join(sorted(sources))}."
print(response["answer"] + citation)

# Optionally, still print retrieved chunks and sources for transparency (for debugging or evaluation)
if "context" in response:
    print("\nRetrieved Chunks and Sources:")
    for doc in response["context"]:
        print(f"\n---\nChunk:\n{doc.page_content}\nSource: {doc.metadata.get('source', 'N/A')}")
    print("\nSources:", ", ".join(sorted(sources)))
else:
    print("No sources found.")

Socrates lived in Athens.

This answer is based on information from: 18.txt, 31.txt, 32.txt, 33.txt.

Retrieved Chunks and Sources:

---
Chunk:
Greek Philosophy: Socrates, Plato, and Aristotle Socrates and the Socratic Method Ancient Greece is widely regarded as the birthplace of Western philosophy, and at the forefront of this philosophical revolution were three influential thinkers: Socrates, Plato, and Aristotle. This chapter delves into their profound contributions to Greek philosophy and their enduring impact on the world of ideas. The first philosopher to be discussed is Socrates, who lived in Athens during the 5th century BCE. Although he left no written works behind, his ideas and teachings were passed down through his disciple, Plato. Socrates is famous for his method of inquiry known as the Socratic Method, which involved questioning others to stimulate critical thinking and the search for truth. He believed that wisdom came from acknowledging one's own ignorance and engaging

*Asks a follow-up question in the same session. The history-aware retriever uses the chat history to resolve "he" to Socrates before searching.*

---

## **Summary**

- **Load and preprocess documents** → **Embed and index with FAISS** → **Set up retriever and LLM** → **Enable conversational memory** → **Ask questions in context**.
- This workflow enables a chatbot to answer questions about your dataset, using retrieval-augmented generation and chat history for context.

---

In [17]:
# Optional helper (commented out): splits documents into smaller chunks with a
# recursive character-based text splitter, using a configurable chunk size and overlap.
# The splitter copies each document's metadata (including "source") to every chunk.
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# def split_documents_recursively(documents, chunk_size=1000, chunk_overlap=200):
#     splitter = RecursiveCharacterTextSplitter(
#         chunk_size=chunk_size,
#         chunk_overlap=chunk_overlap,
#         separators=["\n\n", "\n", ".", "!", "?", " ", ""]
#     )
#     split_docs = splitter.split_documents(documents)
#     print(f"Split into {len(split_docs)} chunks (chunk_size={chunk_size}, overlap={chunk_overlap})")
#     return split_docs
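
To illustrate what chunk size and overlap mean, here is a simplified fixed-window splitter in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break at the listed separators; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when it is shorter than the overlap.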

In [29]:
response = conversational_rag_chain.invoke(
    {"input": ".?p12#"},
    config={
        "configurable": {"session_id": "abc1235"}
    },
)
# print(response["answer"])
sources = set()
if "context" in response:
    for doc in response["context"]:
        if "source" in doc.metadata:
            sources.add(doc.metadata["source"])
citation = ""
if sources:
    citation = f"\n\nThis answer is based on information from: {', '.join(sorted(sources))}."
print(response["answer"] + citation)

# Optionally, still print retrieved chunks and sources for transparency (for debugging or evaluation)
if "context" in response:
    print("\nRetrieved Chunks and Sources:")
    for doc in response["context"]:
        print(f"\n---\nChunk:\n{doc.page_content}\nSource: {doc.metadata.get('source', 'N/A')}")
    print("\nSources:", ", ".join(sorted(sources)))
else:
    print("No sources found.")

I dont know

This answer is based on information from: 18.txt, 31.txt, 33.txt, 60.txt.

Retrieved Chunks and Sources:

---
Chunk:
Philosophy in Athens In addition to its democratic achievements, Athens also became the birthplace of philosophy, nurturing some of the greatest thinkers in history. The city's intellectual environment provided fertile ground for the development of new ideas and the exploration of fundamental questions about existence, ethics, and knowledge. Socrates, considered the father of Western philosophy, played a pivotal role in shaping the Athenian philosophical tradition. Through his Socratic method, he encouraged critical thinking and questioning of assumptions, inspiring his students, including Plato and Xenophon, to become influential philosophers in their own right. Plato, one of Socrates' most notable disciples, established the Academy, an institution that became a center for philosophical inquiry. Plato's writings explored a wide range of topics, including po