<a href="https://colab.research.google.com/github/dipanjanS/mastering-intelligent-agents-langgraph-workshop-dhs2025/blob/main/Module-1-Introduction-to-Generative-AI-and-Agentic-AI/M1LC5_Build_a_RAG_System_with_LangGraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Two-Node LangGraph RAG System (Retrieve → Generate)

**Objective:** Build a standard Retrieval-Augmented Generation (RAG) system using **LangGraph** with two nodes:
1. **Retrieval Node**: Uses a hybrid retriever to fetch relevant chunks and stores them in `retrieved_docs` in the state.
2. **Generation Node**: Formats retrieved docs, runs a RAG prompt with an LLM, and stores the final string answer in `answer`.

**Architecture:** `question → retrieve_node → generate_node → answer`

![](https://i.imgur.com/FtKPvC8.png)

This notebook demonstrates how to use a Retrieval-Augmented Generation (RAG) system to assist operations, audit, and transformation teams in querying past findings and recommendations. The goal is to enable fast, accurate answers based on internal process narratives, helping organizations:

- Understand root causes of workflow inefficiencies
- Recommend remediation strategies
- Identify patterns across departments
- Track outcomes of automation interventions (e.g., AutoFlow Insight)


## Environment & Dependencies
Install libraries and set the API key.

In [None]:
!pip install langgraph==0.6.5 langchain==0.3.27 langchain-openai==0.3.29 langchain-community==0.3.27 langchain-chroma==0.2.5 rank-bm25==0.2.2 --quiet

## Enter API Keys & Setup Environment Variables

In [None]:
import os
import getpass

# OpenAI API Key (for chat & embeddings)
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key (https://platform.openai.com/account/api-keys):\n")


## Data Loading & Preprocessing

We will:
- Download the JSONL dataset from Google Drive
- Load JSON lines into Python
- Build `Document` objects
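
As a rough illustration of the "JSON line to text" step listed above, the sketch below parses one JSONL line and flattens its `sectioned_report` into the four-section text that becomes a document's `page_content`. The sample record and the helper name `report_to_text` are hypothetical; only the key names (`sectioned_report`, `Issue`, `Impact`, `Root Cause`, `Recommendation`) are taken from this notebook's loading code.

```python
import json

SECTIONS = ("Issue", "Impact", "Root Cause", "Recommendation")

def report_to_text(record: dict) -> str:
    """Flatten a record's sectioned report into one labelled text block."""
    sect = record.get("sectioned_report", {})
    return "\n\n".join(f"{name}:\n{sect.get(name, '')}" for name in SECTIONS)

# Hypothetical JSONL line (made-up content, for illustration only)
sample_line = json.dumps({
    "sectioned_report": {
        "Issue": "Approvals are delayed.",
        "Impact": "Invoices miss their SLA.",
        "Root Cause": "Manual routing.",
        # "Recommendation" is intentionally missing: it renders as an empty section
    }
})

sample_text = report_to_text(json.loads(sample_line))
print(sample_text)
```

Missing sections fall back to an empty string, so a partially filled report still produces a document with all four labels.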



In [None]:
!gdown 1u8ImzhGW2wgIib16Z_wYIaka7sYI_TGK

In [None]:
from pathlib import Path
import json
from langchain_core.documents import Document  # current import path; langchain.docstore is a legacy alias

# ---- Configure dataset path (update if needed) ----
DATA_PATH = Path("./rag_demo_docs052025.jsonl")  # same file name as earlier notebook

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Expected dataset at {DATA_PATH}. "
        "Please place the JSONL file here or update DATA_PATH."
    )

# Load JSONL (one JSON object per line; blank lines are skipped)
raw_docs = []
with DATA_PATH.open("r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            raw_docs.append(json.loads(line))

# Convert each record's sectioned report into a Document (text only, no metadata)
documents = []
for d in raw_docs:
    sect = d.get("sectioned_report", {})
    text = (
        f"Issue:\n{sect.get('Issue','')}\n\n"
        f"Impact:\n{sect.get('Impact','')}\n\n"
        f"Root Cause:\n{sect.get('Root Cause','')}\n\n"
        f"Recommendation:\n{sect.get('Recommendation','')}"
    )

    documents.append(Document(page_content=text))

In [None]:
print(documents[0].page_content)


## Embeddings & Vector Store (Chroma)

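- Embedding model: OpenAI `text-embedding-3-small`
- Vector store: Chroma, configured for cosine distance (`hnsw:space` set to `cosine`)
- Persist the index to disk so it can be reopened later without re-embedding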
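
To make "cosine space" concrete: vectors are compared by the angle between them rather than by their length, and cosine distance is commonly defined as one minus cosine similarity. The sketch below uses toy 2-D vectors as stand-ins for real embeddings (which have far more dimensions); the function names are illustrative and not part of Chroma's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance under the common convention: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector, same direction
orthogonal = [0.0, 2.0]      # unrelated direction

print(cosine_distance(query, same_direction))
print(cosine_distance(query, orthogonal))
```

Because only direction matters, the longer vector pointing the same way as the query is treated as an exact match.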


In [None]:
!rm -rf reports_db

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

persist_dir = "./reports_db"
collection = "reports_db"

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

# Build or rebuild the vector store
vectordb = Chroma.from_documents(
    documents=documents,
    embedding=embedder,
    collection_name=collection,
    collection_metadata={"hnsw:space": "cosine"},
    persist_directory=persist_dir
)


## Retrievers (Hybrid Search)

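We combine two retrievers into a hybrid strategy:
- Semantic similarity search over the Chroma index (with a relevance-score threshold)
- BM25 keyword retriever over the same documents
- An ensemble retriever that fuses the two rankings using per-retriever weights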
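
One common way to fuse two rankings is weighted Reciprocal Rank Fusion (RRF), which is the approach LangChain's `EnsembleRetriever` is documented to use. The sketch below is a simplified stand-in, not the library's implementation: the function name `weighted_rrf`, the document ids, and the smoothing constant `c=60` are assumptions chosen for illustration.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc ids: each list adds weight / (rank + c) per doc."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a keyword retriever and a semantic retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents that appear in both lists accumulate score from each, so `doc_b` and `doc_a` rank ahead of documents found by only one retriever.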



In [None]:
# Reopen handle (demonstrates persistence)
vectordb = Chroma(
    embedding_function=embedder,
    collection_name=collection,
    persist_directory=persist_dir,
)
vectordb._collection.count()

In [None]:
from langchain_community.retrievers import BM25Retriever  # BM25 lives in langchain-community
from langchain.retrievers import EnsembleRetriever

# Base semantic retriever (cosine sim + threshold)
semantic = vectordb.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.2},
)

# BM25 keyword retriever
bm25 = BM25Retriever.from_documents(documents)
bm25.k = 3

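# Ensemble (hybrid): fuses both rankings using the weights below.
# Note: EnsembleRetriever has no `k` argument; the number of candidates is
# set on each base retriever (k=3 for BM25, k=5 for semantic).
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25, semantic],
    weights=[0.6, 0.4],
)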

In [None]:
# Quick test
hybrid_retriever.invoke("What are the major issues in finance approval workflows?")[:3]


## LangGraph State Definition

We define a simple **state schema** with overwrite behavior (single-turn flow):

- `question: str`
- `retrieved_docs: list[Document]`
- `answer: str`
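
"Overwrite behavior" means that when a node returns a partial update, each returned key replaces the value already in the state, and untouched keys carry over. The plain-Python sketch below mimics that merge without using LangGraph; `DemoState` and `apply_update` are hypothetical names used only for illustration.

```python
from typing import List, TypedDict

class DemoState(TypedDict, total=False):
    question: str
    retrieved_docs: List[str]
    answer: str

def apply_update(state: DemoState, update: DemoState) -> DemoState:
    """Overwrite semantics: keys in the update replace existing values; other keys are kept."""
    return {**state, **update}

state: DemoState = {"question": "Why were approvals late?"}
state = apply_update(state, {"retrieved_docs": ["chunk 1", "chunk 2"]})
state = apply_update(state, {"retrieved_docs": ["chunk 3"]})  # replaces, does not append
state = apply_update(state, {"answer": "Because of manual routing."})
print(state)
```

For a single-turn flow this is sufficient; accumulating values across steps (for example, a chat history) would need a reducer instead of a plain overwrite.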


In [None]:

from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_core.documents import Document as LCDocument

# We keep overwrite semantics for all keys (no reducers needed for appends here).
class RAGState(TypedDict):
    question: str
    retrieved_docs: List[LCDocument]
    answer: str



## Node 1 — Retrieval Node

- Reads `state['question']`
- Calls `hybrid_retriever.invoke(question)` (the ensemble retriever built above)
- Writes documents into `state['retrieved_docs']`


In [None]:

def retrieve_node(state: RAGState) -> RAGState:
    query = state["question"]
    docs = hybrid_retriever.invoke(query)  # returns list[Document]
    return {"retrieved_docs": docs}



## Node 2 — Generation Node (RAG prompt)

- Formats retrieved docs into a context string
- Uses a grounded prompt: answer only from context, otherwise say you don't know
- Stores result text in `state['answer']`
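
The formatting and grounding steps can be sketched without an LLM: join the chunk texts into one context string, then slot the question and context into a template. The template wording, the helper names, and the sample chunks below are illustrative assumptions; plain `str.format` stands in for the prompt template class.

```python
RAG_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n"
    "If there is no relevant context, just say that you don't know.\n\n"
    "Question:\n{question}\n\nContext:\n{context}"
)

def format_context(chunks):
    """Join chunk texts with blank lines; an empty list yields an empty context."""
    return "\n\n".join(chunks) if chunks else ""

def build_prompt(question, chunks):
    """Fill the template with the question and the formatted context."""
    return RAG_TEMPLATE.format(question=question, context=format_context(chunks))

prompt_text = build_prompt(
    "What delayed approvals?",
    ["Issue:\nApprovals delayed.", "Root Cause:\nManual routing."],
)
empty_prompt = build_prompt("What delayed approvals?", [])
print(prompt_text)
```

When retrieval returns nothing, the context section is empty, which is the case the "say you don't know" instruction is meant to cover.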


In [None]:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # temperature=0 for more repeatable answers

PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for analyzing internal reports for operational insights.
       Use the following pieces of retrieved context to answer the question.
       If you don't know the answer or there is no relevant context, just say that you don't know.
       Give a well-structured, to-the-point answer using the context information.

       Question:
       {question}

       Context:
       {context}
    """
)

def _format_docs(docs: List[LCDocument]) -> str:
    return "\n\n".join(d.page_content for d in docs) if docs else ""

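def generate_node(state: RAGState) -> RAGState:
    question = state["question"]
    docs = state.get("retrieved_docs", [])
    context = _format_docs(docs)
    # format_messages keeps the chat-message structure; .format() would
    # flatten the prompt into a single "Human: ..." string.
    messages = PROMPT.format_messages(question=question, context=context)
    resp = llm.invoke(messages)
    return {"answer": resp.content}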



## Build the Graph & Edges

`START → retrieve_node → generate_node → END`
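
The linear flow above can be pictured as a walk along a chain of edges, where each node's partial update is merged into the state before moving on. The following is a toy runner in plain Python, not LangGraph itself: the stub nodes, the `run_linear` helper, and the sentinel strings are all hypothetical.

```python
START, END = "__start__", "__end__"  # stand-in sentinels for this sketch

def retrieve(state):
    # Stub retrieval: pretend one chunk matched the question
    return {"retrieved_docs": [f"chunk about: {state['question']}"]}

def generate(state):
    # Stub generation: report how many chunks were available
    return {"answer": f"Answer based on {len(state['retrieved_docs'])} chunk(s)"}

NODES = {"retrieve": retrieve, "generate": generate}
EDGES = {START: "retrieve", "retrieve": "generate", "generate": END}

def run_linear(initial_state):
    """Follow the edges from START to END, merging each node's update into the state."""
    state, visited = dict(initial_state), []
    current = EDGES[START]
    while current != END:
        state.update(NODES[current](state))
        visited.append(current)
        current = EDGES[current]
    return state, visited

final, order = run_linear({"question": "Why were approvals late?"})
print(order, final["answer"])
```

The real graph adds compilation, validation, and visualization on top of this idea, but the execution order for a two-node chain is the same.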


In [None]:
builder = StateGraph(RAGState)

builder.add_node("retrieve", retrieve_node)
builder.add_node("generate", generate_node)

builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

graph = builder.compile()

In [None]:
from IPython.display import Image, display, display_markdown

display(Image(graph.get_graph().draw_mermaid_png()))


## Run Examples

Invoke with a question and read back the final `answer` from the state.


In [None]:
example_q = "What are the major issues in finance approval workflows?"
final_state = graph.invoke({"question": example_q})
final_state

In [None]:
display_markdown(final_state["answer"], raw=True)

In [None]:
example_q = "What caused invoice SLA breaches in the last quarter?"
final_state = graph.invoke({"question": example_q})
display_markdown(final_state["answer"], raw=True)

In [None]:
example_q = "How did AutoFlow Insight improve SLA adherence?"
final_state = graph.invoke({"question": example_q})
display_markdown(final_state["answer"], raw=True)