# Documentation

This notebook demonstrates a workflow for extracting, processing, and querying information from PDF documents using LangChain and Azure OpenAI services.

## Workflow Overview

1. **Data Extraction**  
    - Load PDF documents using `PyPDFLoader`.
    - Each page is treated as a separate `Document`.

2. **Data Processing**  
    - Split documents into manageable text chunks using `RecursiveCharacterTextSplitter`.

3. **Data Ingestion**  
    - Generate embeddings for text chunks using `AzureOpenAIEmbeddings`.
    - Store embeddings in a FAISS vector store for efficient similarity search.

4. **Data Retrieval**  
    - Use a retriever to find relevant document chunks based on user queries.

5. **User Query & RAG (Retrieval-Augmented Generation)**  
    - Format retrieved documents as context.
    - Use a prompt template and `AzureChatOpenAI` to generate answers based on the context.

## Key Variables

- `documents`: List of `Document` objects loaded from the PDF.
- `splitted_doc`: List of text chunks after splitting.
- `azOpenAIembeddings`: Azure OpenAI embeddings model instance.
- `vectorstore`: FAISS vector store containing embedded chunks.
- `retriever`: Retriever object for similarity search.
- `azOpenAIllm`: Azure OpenAI language model instance.
- `prompt`: Prompt template for question answering.
- `rag_chain`: End-to-end chain for RAG-based question answering.

## How to Use

- Update the `file_path` variable to point to your PDF file.
- Run the notebook cells in order to extract, process, and query your document.
- Call `rag_chain.invoke({"question": "<your question>"})` to get answers from your document.

---
For more details, refer to the comments in each code cell.

## 1. Data Extraction

In [None]:
import json
from langchain.schema import Document

def documents_to_json(docs: list[Document]):
    return json.dumps([
        {
            "page_content": doc.page_content,
            "metadata": doc.metadata
        } for doc in docs
    ], indent=2)

In [None]:
import os
from langchain.document_loaders import PyPDFLoader

# Build the PDF path relative to the current working directory
file_path = os.path.join(os.getcwd(), "data", "MLC_user_guide.pdf")

documents = PyPDFLoader(file_path).load() # Each page is a separate document

print(f"Number of pages in the document: {len(documents)}")
print(documents_to_json(documents))

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
)

splitted_doc = text_splitter.split_documents(documents)

print(f"Number of chunks after splitting: {len(splitted_doc)}")
print(documents_to_json(splitted_doc))
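
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the helper name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share overlapping characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, which is the role `chunk_overlap=100` plays in the cell above: text that straddles a chunk boundary still appears intact in at least one chunk.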

## 2. Data Ingestion

In [None]:
from langchain_openai import AzureOpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

azOpenAIembeddings = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    api_version="2023-05-15",
)

# Example usage of the embeddings
query_embeddings = azOpenAIembeddings.embed_query("What is the capital of France?")

In [None]:
from langchain.vectorstores import FAISS

# FAISS is an in-memory vector store, so the index is lost when the session ends
vectorstore = FAISS.from_documents(
    documents=splitted_doc,
    embedding=azOpenAIembeddings
)

# Example usage of the vector store
relevant_docs = vectorstore.similarity_search("When will your plan start?")
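
Similarity search compares the query's embedding against the stored chunk embeddings and returns the closest ones. The brute-force sketch below illustrates the idea with hand-written 3-dimensional vectors and Euclidean distance. The vectors, chunk labels, and the `nearest_chunks` helper are made up for illustration, the choice of Euclidean distance is an assumption, and FAISS relies on optimized index structures rather than this loop.

```python
import math


def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [
        (math.dist(query_vec, vec), text)
        for text, vec in chunk_vecs.items()
    ]
    return sorted(scored)[:k]


# Invented toy "embeddings"; real ones have hundreds or thousands of dimensions
toy_index = {
    "Plan start date": [0.9, 0.1, 0.0],
    "Claims procedure": [0.1, 0.8, 0.2],
    "Contact details": [0.0, 0.2, 0.9],
}

top = nearest_chunks([1.0, 0.0, 0.0], toy_index, k=2)
print([text for _, text in top])
```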

## 3. Data Retrieval

IMPORTANT: The vector store can also be converted into a retriever with `as_retriever()`. Many LangChain components are designed to work with retrievers rather than with a vector store directly, so the retriever serves as a convenient common interface.
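
As a rough illustration of the "interface" idea, the sketch below wraps a toy store in an object that exposes a single `invoke(query)` method with the search settings fixed up front. `ToyStore` and `ToyRetriever` are hypothetical stand-ins, not LangChain classes, and the word-overlap scoring only substitutes for embedding similarity.

```python
class ToyStore:
    """Hypothetical stand-in for a vector store; scores by shared words."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda t: len(query_words & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, search_kwargs=None):
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Fixes the search settings so callers only pass the query."""

    def __init__(self, store, search_kwargs):
        self.store = store
        self.search_kwargs = search_kwargs

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyStore(["the plan starts in may", "claims are filed online", "call the help desk"])
toy_retriever = store.as_retriever(search_kwargs={"k": 1})
print(toy_retriever.invoke("when does the plan start"))
```

Because callers only ever see `invoke(query)`, the retriever can be dropped into a chain without the rest of the chain knowing how the search is configured.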

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

result = retriever.invoke("When will your plan start?")

print(documents_to_json(result))

## 4. User Query

In [None]:
from langchain_openai import AzureChatOpenAI

azOpenAIllm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",
    api_version="2025-01-01-preview",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

# Example usage of the LLM
azOpenAIllm.invoke("When will your plan start?").content

In [None]:
def format_docs(docs: list[Document]) -> str:
    return "\n\n".join([doc.page_content for doc in docs])

In [None]:
from langchain.prompts import PromptTemplate

prompt_template = """
    Answer the question based on the context below. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {context}
    Question: {question}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
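
For intuition, filling the template amounts to substituting the two named placeholders. The sketch below uses plain `str.format` on a shortened copy of the template rather than `PromptTemplate` itself; the context string is invented.

```python
# Shortened copy of the template, kept only to show the substitution
toy_template = "Context: {context}\nQuestion: {question}"

filled = toy_template.format(
    context="(retrieved chunks would be pasted here)",
    question="When will your plan start?",
)
print(filled)
```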

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableLambda

# LCEL: LangChain Expression Language
# The chain takes a dict with the question, retrieves and formats the matching chunks as context,
# fills the prompt, sends it to the LLM, and parses the model output into a plain string.
# The final output is the answer to the question based on the retrieved context.

rag_chain = (
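    {
        "context": RunnableLambda(lambda x: format_docs(retriever.invoke(x["question"]))),
        # Extract the question string; RunnablePassthrough() would forward the whole input dict
        "question": RunnableLambda(lambda x: x["question"]),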
    }
    | prompt | azOpenAIllm | StrOutputParser()
)

response = rag_chain.invoke({"question": "Could you tell me when will your plan start?"})
print(f"Response: {response}")
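
A dict at the head of an LCEL chain hands the same input to every value and collects the results under the matching keys. The plain-Python sketch below mimics that fan-out with ordinary functions (`fan_out` and `fake_retrieve` are hypothetical names, not LangChain APIs) to show what each branch receives when the chain is invoked with `{"question": ...}`.

```python
def fan_out(branches, chain_input):
    """Call every branch with the same input and collect results by key."""
    return {key: branch(chain_input) for key, branch in branches.items()}


def fake_retrieve(question):
    # Stand-in for retriever.invoke followed by format_docs
    return f"[chunks relevant to: {question}]"


branches = {
    "context": lambda x: fake_retrieve(x["question"]),
    "question": lambda x: x["question"],  # pulls out the question string
    "raw_input": lambda x: x,             # identity branch: forwards the whole dict
}

result = fan_out(branches, {"question": "When will your plan start?"})
print(result)
```

An identity branch, which is what a passthrough amounts to, forwards the entire input dict rather than the question string, so a branch that should yield plain text needs to pull out the key it wants.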

### Experiment 2

In [None]:
rag_template = """
You are an intelligent assistant designed to answer only from the provided context.
Do not hallucinate. If unsure, say you don't know.

Context:
{context}

Chat History:
{chat_history}

Question:
{question}

Answer (with a follow-up question):
"""

prompt = PromptTemplate(template=rag_template, input_variables=["context", "chat_history", "question"])

In [None]:
from langchain.schema import AIMessage, HumanMessage, SystemMessage

def get_chat_history_string(chat_history: list[AIMessage | HumanMessage | SystemMessage]) -> str:
    return "\n".join([f"{msg.__class__.__name__}: {msg.content}" for msg in chat_history])

In [None]:
def chat(user_message: str, chat_history: str) -> str:
    rag_chain = (
        {
            "context": RunnableLambda(lambda x: format_docs(retriever.invoke(x["question"]))),
            "chat_history": RunnableLambda(lambda x: x["chat_history"]),
            "question": RunnableLambda(lambda x: x["question"]),
        }
        | prompt | azOpenAIllm | StrOutputParser()
    )
    
    response = rag_chain.invoke({"question": user_message, "chat_history": chat_history})
    
    return response

In [None]:
from langchain.schema import AIMessage, HumanMessage, SystemMessage

chat_history: list[AIMessage | HumanMessage | SystemMessage] = []

while True:
    user_message = input("You: ")
    
    if user_message.lower() in ["exit", "quit"]:
        break
    
    chat_history.append(HumanMessage(content=user_message))
    
    response = chat(user_message, get_chat_history_string(chat_history))

    chat_history.append(AIMessage(content=response))
    
    print(chat_history)

### Experiment 3

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

contextualize_question_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Task: Rewrite the user's latest message into a standalone question ONLY if it relies on prior context.\n"
     "Context source: You will receive the full conversation as `chat_history` followed by the latest `input`.\n\n"
     "How to use chat_history:\n"
     "1) Resolve anaphora/ellipsis (e.g., 'and pricing?', 'that one', 'them', 'it') using entities from chat_history.\n"
     "2) Pull ONLY the minimal details from chat_history needed to make the question self-contained.\n"
     "3) If chat_history lacks enough info to resolve the reference, return the input unchanged.\n\n"
     "Never rewrite—just return exactly as written—when the input is:\n"
     "- Greetings/farewells (hi, hello, hey, bye, good morning, etc.)\n"
     "- Courtesy/acknowledgment (thanks, thank you, ok/okay, sounds good, cool)\n"
     "- Apologies or chit-chat (sorry, no worries)\n"
     "- Interjections, fillers, emojis, or punctuation (..., ??, 👍)\n\n"
     "Additional rules:\n"
     "- If the input is already a complete standalone question, return it unchanged.\n"
     "- Only rewrite follow-ups that depend on chat_history.\n"
     "- Do NOT answer the question; output only the rewritten text.\n"
     "- Preserve meaning and tone; do not add new requests.\n"
     "- Guardrail: if the input has ≤3 words and contains no question mark or why/how term, return unchanged."
    ),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

context_qa_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant that must answer only using the information in the provided context.\n"
     "Rules:\n"
     "- Begin with a brief, friendly greeting.\n"
     "- Use only the provided context; do not rely on outside knowledge or assumptions.\n"
     "- If the answer cannot be found in the context, respond exactly with: I don't know.\n"
     "- Keep the answer concise and no longer than three sentences total (including the greeting).\n"
     "- After the answer, on a new line, ask exactly one short, relevant follow-up question.\n"
     "- Still ask the follow-up question even if you replied with 'I don't know.'\n"
     "- Use chat_history only to resolve references and maintain continuity."
    ),
    ("system", "Context:\n{context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
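
The final guardrail in the rewriting prompt is mechanical enough to check in code before spending an LLM call. The sketch below is one possible reading of that rule; `should_skip_rewrite` is a hypothetical helper, and treating only "why" and "how" as the trigger terms is an assumption.

```python
import re


def should_skip_rewrite(text: str) -> bool:
    """True when the input has at most 3 words, no '?', and no why/how term."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    has_question_mark = "?" in text
    has_wh_term = any(word in {"why", "how"} for word in words)
    return len(text.split()) <= 3 and not has_question_mark and not has_wh_term


for message in ["thanks a lot", "and pricing?", "how so", "what about the second plan"]:
    print(message, "->", should_skip_rewrite(message))
```

A check like this could return the input unchanged for short acknowledgments and fall through to the LLM rewriter for everything else.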

In [None]:
def chat_v2(user_message: str, chat_history: list[AIMessage | HumanMessage | SystemMessage]) -> str:
    # 1) Rewrite the user question into a standalone question using the chat history
    question_rewriter = (
        {
            "input": RunnableLambda(lambda x: x["question"]),
            "chat_history": RunnableLambda(lambda x: x["chat_history"]),
        }
        | contextualize_question_prompt | azOpenAIllm | StrOutputParser()
    )

    # Debug only: shows the rewritten question (the chain below runs the rewriter again)
    rewritten_question = question_rewriter.invoke({"question": user_message, "chat_history": chat_history})

    print(f"Rewritten question: {rewritten_question}")

    # 2) Retrieve docs for rewritten question
    retrieve_docs = question_rewriter | retriever | format_docs

    # 3) Answer using retrieved context + original input + chat history
    chain = (
        {
            "context": retrieve_docs,
            "input": RunnableLambda(lambda x: x["question"]),
            "chat_history": RunnableLambda(lambda x: x["chat_history"]),
        }
        | context_qa_prompt | azOpenAIllm | StrOutputParser()
    )
    
    result = chain.invoke(
        {
            "question": user_message,
            "chat_history": chat_history,
        }
    )

    return result

In [None]:
from langchain.schema import AIMessage, HumanMessage, SystemMessage

chat_history: list[AIMessage | HumanMessage | SystemMessage] = []

while True:
    user_message = input("You: ")
    
    if user_message.lower() in ["exit", "quit"]:
        break
    
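    # Pass only the earlier turns; both prompts add the latest message themselves via {input}
    response = chat_v2(user_message, chat_history)

    # Record the full turn once the answer is available
    chat_history.append(HumanMessage(content=user_message))
    chat_history.append(AIMessage(content=response))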
    
    print("---------------")
    print(chat_history)
    print("---------------")