### 1. Reading the OpenAI API Key

This cell imports the `os` module and reads the `OPENAI_API_KEY` from the environment variables.  
The key is stored in `openai_api_key` so that later cells can use OpenAI models securely without hard-coding the key in the notebook.


In [1]:
import os
openai_api_key = os.getenv("OPENAI_API_KEY")
#print(openai_api_key)

### 2. Verifying API Key Availability

Here we check for the presence of the environment variable `OPENAI_API_KEY`.
If the key is missing, the assertion will fail with a clear error message, preventing the rest of the notebook from running without proper configuration.

In [2]:
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY env variable."
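
The same check can be written as an explicit helper, which matters because `assert` statements are skipped when Python runs with the `-O` flag. The sketch below is one possible variant; `require_env` is a hypothetical helper name that does not appear elsewhere in the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the notebook."
        )
    return value
```

With this in place, the first cell could read `openai_api_key = require_env("OPENAI_API_KEY")`, which combines reading and validating the key in one step.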

### 3. Installing Required Packages

This cell installs all external Python packages used in the notebook, including:

- `langchain-core`, `langchain-community`, `langchain-openai`, `langchain-text-splitters`
- `langgraph`, `langsmith`
- `faiss-cpu` (for vector search)
- `pypdf` (for PDF handling)

These dependencies provide the building blocks for the RAG (Retrieval-Augmented Generation) pipeline.


In [3]:
%pip install langchain-core langchain-community langchain-openai langchain-text-splitters langgraph langsmith faiss-cpu pypdf


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### 4. Core Python Imports

This cell imports core Python utilities:

- `os` – operating system and environment variables
- `textwrap` – for building readable prompt templates
- `typing` – type hints such as `List`, `Dict`, `Any`, `TypedDict`, and `Annotated`

These are used throughout the notebook for cleaner code and better structure.


In [4]:
import os
import textwrap
from typing import List, Dict, Any, TypedDict, Annotated

### 5. LangChain and OpenAI Imports

This cell imports all the main components from LangChain and related libraries:

- `ChatOpenAI`, `OpenAIEmbeddings` for calling OpenAI chat models and generating vector embeddings.
- `PyPDFLoader` for loading and parsing PDF documents.
- `FAISS` for building a vector store used in semantic search.
- `RecursiveCharacterTextSplitter` for splitting long documents into chunks.
- `ChatPromptTemplate`, `RunnablePassthrough`, `RunnableLambda`, `RunnableWithMessageHistory` for building composable RAG chains.

These imports are the foundation of the RAG pipeline built in the rest of the notebook.


In [5]:
assert "OPENAI_API_KEY" in os.environ, "Please set OPENAI_API_KEY environment variable."

# LangChain / OpenAI (modern modular imports)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import (
    RunnablePassthrough,
    RunnableLambda,
    RunnableWithMessageHistory,
)

### 6. Conversation History and LangGraph Imports

This cell imports:

- Message history abstractions (`BaseChatMessageHistory`, `ChatMessageHistory`) to store past chat turns.
- Message types (`HumanMessage`, `AIMessage`, `AnyMessage`) to represent interactions.
- LangGraph tools (`StateGraph`, `END`, `add_messages`, `MemorySaver`) to construct a simple agent with memory.

These are later used to build a stateful conversational agent on top of the RAG pipeline.


In [6]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.messages import HumanMessage, AIMessage, AnyMessage

# LangGraph
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver


### 7. Utilities, Tracing, and Directory Setup

This cell:

- Imports `requests`, `pathlib`, `json`, and `re` for HTTP requests, filesystem paths, JSON handling, and regular expressions.
- Optionally configures LangChain/LangSmith tracing (disabled by default).
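- Defines base directories (`data`, `data/processed`, `data/processed/faiss_index`) and creates the first two if they are missing; the FAISS directory is created later, when the index is saved.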

These paths are used to store the GDPR PDF, processed chunks, and the FAISS index.


In [7]:
import requests
import pathlib
import json
import re

# LangSmith tracing is optional: default it to off unless the variable is already set
os.environ.setdefault("LANGCHAIN_TRACING_V2", "false")
#os.environ.setdefault("LANGCHAIN_PROJECT", "gdpr-rag-project")

# Global paths
BASE_DIR = pathlib.Path(".")
DATA_DIR = BASE_DIR / "data"
PROCESSED_DIR = DATA_DIR / "processed"
FAISS_DIR = PROCESSED_DIR / "faiss_index"
DATA_DIR.mkdir(exist_ok=True)
PROCESSED_DIR.mkdir(exist_ok=True)


### 8. GDPR PDF Location and Source URL

Here we define:

- The local file path where the GDPR PDF will be stored.
- The official URL from which the GDPR PDF is downloaded.

This setup is used in the next cell to ensure the document is available locally.


In [8]:
gdpr_pdf_path = DATA_DIR / "GDPR.pdf"
gdpr_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"

### 9. Downloading the GDPR PDF (If Needed)

This cell checks whether the GDPR PDF already exists locally.  
If it does not:

1. It downloads the PDF from the official URL using `requests`.
2. Saves the content to `gdpr_pdf_path`.

If it already exists, it simply prints a confirmation message.


In [9]:
if not gdpr_pdf_path.exists():
    print("Downloading GDPR PDF...")
    resp = requests.get(gdpr_url, timeout=60)  # fail instead of hanging on a stalled connection
    resp.raise_for_status()
    with open(gdpr_pdf_path, "wb") as f:
        f.write(resp.content)
    print("Downloaded GDPR to:", gdpr_pdf_path)
else:
    print("GDPR PDF already exists at:", gdpr_pdf_path)


GDPR PDF already exists at: data/GDPR.pdf


### 10. Loading the GDPR PDF into Documents

Here we:

- Create a `PyPDFLoader` using the local GDPR PDF path.
- Load the document into `docs`, which is a list of page-level documents.
- Print how many pages were successfully loaded.

These `docs` form the raw text data that will later be chunked and indexed.


In [10]:
print("Loading GDPR PDF...")
loader = PyPDFLoader(str(gdpr_pdf_path))
docs = loader.load()
print(f"Loaded {len(docs)} pages.")
 

Loading GDPR PDF...
Loaded 88 pages.


### 11. Configuring the Text Splitter

This cell defines a `RecursiveCharacterTextSplitter`:

- `chunk_size=1000` characters
- `chunk_overlap=150` characters
- Custom separators (paragraph breaks, newlines, periods, and spaces), tried in that order

The splitter will break each page into overlapping chunks suitable for embedding and retrieval.


In [11]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " "],
)
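
To make the relationship between `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break at the listed separators; `sliding_window_chunks` is a hypothetical function that cuts fixed-size windows and only illustrates how consecutive chunks share their boundary text.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 150
) -> List[str]:
    """Split text into fixed-size windows that overlap by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2500-character input, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(2500))
parts = sliding_window_chunks(sample)
print([len(p) for p in parts])            # [1000, 1000, 800]
print(parts[0][-150:] == parts[1][:150])  # True: the overlap is shared text
```

The overlap means a sentence that straddles a chunk boundary is still fully present in at least one chunk, at the cost of embedding some text twice.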

### 12. Splitting the Document into Chunks

We now apply the configured splitter to `docs`:

- The GDPR text is divided into overlapping chunks.
- The total number of resulting chunks is printed.

These chunks will be embedded and stored in a vector database.


In [12]:
print("Splitting into chunks...")
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks.")

Splitting into chunks...
Created 461 chunks.


### 13. Tagging Chunks with Article Metadata

This cell searches each chunk for the first `Article <number>` occurrence using a regular expression.

- If a match is found, the number is stored in the chunk’s metadata under `article`.
- Otherwise, the article is marked as `"unknown"`.

These metadata tags are later used to format and cite GDPR sections in answers. The tag is approximate: the pattern also matches cross-references inside running text, and chunks that mention no article number stay `"unknown"`.


In [13]:
for d in chunks:
    text = d.page_content
    match = re.search(r"Article\s+(\d+)", text)
    if match:
        d.metadata["article"] = match.group(1)
    else:
        d.metadata.setdefault("article", "unknown")
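
The behaviour of the pattern is easiest to see on a few inputs. The sketch below applies the same regular expression to hypothetical sample strings (they are not taken from the PDF); `tag_article` is an illustrative helper name.

```python
import re

ARTICLE_PATTERN = re.compile(r"Article\s+(\d+)")


def tag_article(text: str) -> str:
    """Return the first article number mentioned in text, or 'unknown'."""
    match = ARTICLE_PATTERN.search(text)
    return match.group(1) if match else "unknown"


# Hypothetical sample strings for illustration only.
samples = [
    "Article 6\nLawfulness of processing",         # heading-style mention
    "as referred to in Article 9(2), processing",  # cross-reference in running text
    "The controller shall implement measures.",    # no article mentioned
]
for s in samples:
    print(tag_article(s))
# 6
# 9
# unknown
```

The second sample shows why a tag can point at an article that the chunk merely refers to rather than belongs to.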


### 14. Saving a Preview of Chunks

This cell writes a JSON preview of the first 50 chunks into `chunks_preview.json`.  
For each chunk it stores:

- Page number
- Article number
- A short content preview

This makes it easy to inspect how the GDPR text was split and tagged.


In [14]:
with open(PROCESSED_DIR / "chunks_preview.json", "w", encoding="utf-8") as f:
    json.dump(
        [
            {
                "page": c.metadata.get("page"),
                "article": c.metadata.get("article"),
                "content_preview": c.page_content[:300],
            }
            for c in chunks[:50]
        ],
        f,
        indent=2,
    )

print("Chunks prepared and preview saved.")

Chunks prepared and preview saved.


### 15. Building and Saving the FAISS Vector Index

In this step we:

1. Create an `OpenAIEmbeddings` object using the `text-embedding-3-small` model.
2. Convert all chunks into embeddings and store them in a FAISS vector index.
3. Persist the FAISS index to disk under `FAISS_DIR`.

This index enables efficient semantic search over GDPR text.


In [15]:
print("Creating embeddings and FAISS index...")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectordb = FAISS.from_documents(chunks, embeddings)

FAISS_DIR.mkdir(parents=True, exist_ok=True)
vectordb.save_local(str(FAISS_DIR))

print("FAISS index saved to:", FAISS_DIR)

Creating embeddings and FAISS index...
FAISS index saved to: data/processed/faiss_index


### 16. Reloading the FAISS Index

This cell simulates a fresh start:

- It reloads the FAISS index from disk using the same embedding model.
- Prints a confirmation once the index is successfully loaded.

This demonstrates how the vector database can be reused across sessions.


In [16]:
print("Reloading FAISS index from disk...")
vectordb = FAISS.load_local(
    str(FAISS_DIR),
    embeddings,
    allow_dangerous_deserialization=True,  # the docstore is unpickled on load; only load index files you created yourself
)
print("FAISS index loaded.")

Reloading FAISS index from disk...
FAISS index loaded.


### 17. Creating the Retriever and LLM

Here we:

- Wrap the FAISS vector store as a `retriever` that returns the top 4 similar chunks.
- Instantiate a chat model (`gpt-4o-mini` with `temperature=0`) to generate grounded answers with minimal sampling randomness; outputs are largely, though not strictly, reproducible.

The retriever + LLM combination forms the core of the RAG pipeline.


In [17]:
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

### 18. Formatting Retrieved Chunks

This helper function takes a list of retrieved documents and formats them as:

`[page X, article Y]`  
followed by the text content.

The formatted string is used as context in the prompt so the model can both answer and cite relevant GDPR sections.


In [18]:
def format_docs(docs):
    out = []
    for d in docs:
        page = d.metadata.get("page", "?")
        article = d.metadata.get("article", "unknown")
        header = f"[page {page}, article {article}]"
        out.append(header + "\n" + d.page_content)
    return "\n\n".join(out)
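
To see the exact string the prompt receives, the helper can be exercised without a vector store. In the sketch below, `FakeDoc` is a hypothetical stand-in for a LangChain document (anything with `page_content` and `metadata` works), and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_docs(docs):
    # Same logic as the notebook cell, repeated so the example is self-contained.
    out = []
    for d in docs:
        page = d.metadata.get("page", "?")
        article = d.metadata.get("article", "unknown")
        out.append(f"[page {page}, article {article}]\n{d.page_content}")
    return "\n\n".join(out)


sample_docs = [
    FakeDoc("first chunk text", {"page": 41, "article": "6"}),
    FakeDoc("second chunk text"),  # no metadata: falls back to "?" and "unknown"
]
formatted = format_docs(sample_docs)
print(formatted)
# [page 41, article 6]
# first chunk text
#
# [page ?, article unknown]
# second chunk text
```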

### 19. Defining the RAG Prompt Template

This cell constructs a `ChatPromptTemplate` that:

- Sets the role of the assistant as a GDPR specialist.
- Instructs it to use **only** the provided context.
- Encourages citing articles and pages wherever possible.

The template has two dynamic fields: `{context}` and `{question}`, which are filled in by the RAG chain.


In [19]:
rag_prompt = ChatPromptTemplate.from_template(
    textwrap.dedent(
        """
        You are a GDPR assistant. Answer ONLY using the context below.
        If the answer is not contained in the context, say you don't know
        or that the regulation does not specify.

        Always:
        - Mention article and page if available.
        - Quote short relevant passages with [page, article] tags.

        Context:
        {context}

        Question:
        {question}
        """
    )
)
 

### 20. Building the Baseline RAG Chain

This cell composes the baseline RAG pipeline:

1. Given a question, it retrieves relevant chunks and formats them as context.
2. Fills the `rag_prompt` template.
3. Sends the prompt to the chat model.

The result is a grounded GDPR answer generated by the LLM.
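
The three steps above can be pictured in plain Python. This is only a conceptual sketch of the data flow, not how LangChain implements runnables; `make_rag_pipeline` and the three stub callables are hypothetical names standing in for the retriever, the prompt template, and the LLM.

```python
from typing import Callable, Dict, List


def make_rag_pipeline(
    retrieve: Callable[[str], List[str]],
    build_prompt: Callable[[Dict[str, str]], str],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> prompt -> generate into one callable."""

    def run(question: str) -> str:
        # The same question feeds both keys, like the dict at the head of the chain.
        inputs = {
            "context": "\n\n".join(retrieve(question)),
            "question": question,
        }
        return generate(build_prompt(inputs))

    return run


# Hypothetical stubs; a real pipeline would call the vector store and the chat model.
pipeline = make_rag_pipeline(
    retrieve=lambda q: ["chunk A", "chunk B"],
    build_prompt=lambda d: f"Context:\n{d['context']}\n\nQuestion:\n{d['question']}",
    generate=lambda prompt: "ANSWER based on: " + prompt,
)
result = pipeline("What is consent?")
print(result)
```

The point of the dict step is that the question fans out to two branches: one branch turns it into retrieved context, the other passes it through unchanged.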


In [20]:
rag_chain = (
    {
        "context": retriever | RunnableLambda(format_docs),
        "question": RunnablePassthrough(),
    }
    | rag_prompt
    | llm
)


### 21. Disabling LangSmith / LangChain Tracing

This cell explicitly disables all tracing-related environment variables:

- Turns off LangChain and LangSmith tracing.
- Ensures no external logging or monitoring endpoints are used.

This makes the notebook self-contained and avoids sending trace data elsewhere.


In [21]:
import os

# Hard-disable LangSmith / LangChain tracing in this notebook by removing
# every tracing-related variable (including API keys) from the environment
for var in [
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_TRACING",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_PROJECT",
    "LANGSMITH_API_KEY",
    "LANGCHAIN_API_KEY",
]:
    os.environ.pop(var, None)

print("Tracing disabled.")


Tracing disabled.


### 22. Testing the Baseline RAG Pipeline

Here we run a simple test query:

> *"What are the lawful bases for processing personal data under GDPR?"*

The question is passed through `rag_chain`, and the model’s grounded answer is printed.


In [22]:


test_question = "What are the lawful bases for processing personal data under GDPR?"
print("Testing baseline RAG...")
response = rag_chain.invoke(test_question)
print("\n=== Baseline RAG Answer ===\n")
print(response.content)

Testing baseline RAG...

=== Baseline RAG Answer ===

The lawful bases for processing personal data under GDPR include:

1. **Union law or Member State law**: The processing must be laid down by Union law or Member State law to which the controller is subject [page 35, article unknown].

2. **Public interest or official authority**: Processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller [page 35, article unknown; page 8, article unknown].

3. **Legitimate interests**: Processing may be based on the legitimate interests pursued by the controller or by a third party, provided that these interests do not override the fundamental rights and freedoms of the data subject [page 41, article 6].

4. **Consent**: Processing can also be based on the consent of the data subject, which can be withdrawn at any time [page 41, article 6].

These bases ensure that the processing of personal data is lawfu

### 23. Section: Memory-Enhanced RAG

This comment marks the start of the section where the basic RAG pipeline is extended with conversational memory to support multi-turn interactions.


In [23]:
#  Memory-Enhanced RAG

### 24. Building a Memory-Aware RAG Chain

This cell:

- Creates an in-memory dictionary `store` to hold chat histories keyed by `session_id`.
- Defines `get_session_history` to fetch or create a `ChatMessageHistory`.
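- Builds `rag_chain_for_memory`, which accepts a dict containing the question (plus the injected history) and runs the same retrieval + prompt + LLM pipeline on the question alone; the history is not passed to `rag_prompt`, which has no placeholder for it.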

It sets the foundation for multi-turn conversations grounded in GDPR.


In [24]:
store: Dict[str, ChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# This chain receives {"question": "...", "history": [...]} but reads only "question":
# rag_prompt has no {history} placeholder, so past turns never reach the LLM.
rag_chain_for_memory = (
    {
        "context": RunnableLambda(lambda x: x["question"]) | retriever | RunnableLambda(format_docs),
        "question": RunnableLambda(lambda x: x["question"]),
    }
    | rag_prompt
    | llm
)
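
A stored history only influences the answer if it is rendered into the prompt text. The sketch below illustrates that idea with plain Python stand-ins rather than LangChain classes; `history_store`, `build_prompt_with_history`, and `record_turn` are hypothetical names, and the prompt layout is an assumption.

```python
from typing import Dict, List, Tuple

# session_id -> list of (role, text) turns; a plain-Python stand-in for ChatMessageHistory
history_store: Dict[str, List[Tuple[str, str]]] = {}


def get_turns(session_id: str) -> List[Tuple[str, str]]:
    """Fetch the turn list for a session, creating it on first use."""
    return history_store.setdefault(session_id, [])


def build_prompt_with_history(question: str, context: str, session_id: str) -> str:
    """Render earlier turns, retrieved context, and the new question into one prompt."""
    turns = get_turns(session_id)
    history_text = "\n".join(f"{role}: {text}" for role, text in turns) or "(none)"
    return (
        f"Conversation so far:\n{history_text}\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


def record_turn(session_id: str, question: str, answer: str) -> None:
    """Append one question/answer pair to the session history."""
    turns = get_turns(session_id)
    turns.append(("human", question))
    turns.append(("ai", answer))
```

A follow-up such as "explain the one related to consent" can then be resolved by the model, because the first question and answer appear under "Conversation so far".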

### 25. Wrapping the Chain with Message History

This cell wraps `rag_chain_for_memory` inside `RunnableWithMessageHistory`, so:

- The agent automatically keeps track of past messages.
- Past turns for the session are injected under the `history` key on each call; the prompt needs a matching placeholder before the model can actually use them.

The result is `conversational_rag`, a conversational RAG agent with memory.


In [25]:
conversational_rag = RunnableWithMessageHistory(
    rag_chain_for_memory,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)

### 26. Helper Function for Memory-Based Chat

This helper function provides a simple interface:

- Takes a `question` and a `session_id`.
- Invokes the `conversational_rag` chain with that session.
- Returns only the generated answer text.

It is used to demonstrate memory-aware conversations in the next cell.


In [26]:
def chat_with_memory(question: str, session_id: str = "demo-session") -> str:
    result = conversational_rag.invoke(
        {"question": question},
        config={"configurable": {"session_id": session_id}},
    )
    return result.content

### 27. Demonstrating Memory-Enhanced RAG

This cell runs a two-turn conversation:

1. Asks about lawful bases for processing personal data.
2. Follows up with a clarification question about consent.

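Both turns share the `demo-session` history. Because `rag_prompt` has no history placeholder, the follow-up is answered from retrieval on the second question alone, and the `AttributeError` logged in the output shows that the history callback failed to store the messages in this run.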


In [27]:
print("\n=== Memory RAG Demo ===")
print("Q1:", "What are the lawful bases for processing personal data?")
print("A1:", chat_with_memory("What are the lawful bases for processing personal data?"))
print("\nQ2:", "Can you explain the one related to consent in simpler terms?")
print("A2:", chat_with_memory("Can you explain the one related to consent in simpler terms?"))


=== Memory RAG Demo ===
Q1: What are the lawful bases for processing personal data?


Error in RootListenersTracer.on_chain_end callback: AttributeError("'ChatMessageHistory' object has no attribute 'add_messages'")


A1: The lawful bases for processing personal data include:

1. **Contractual necessity**: Processing is lawful where it is necessary in the context of a contract or the intention to enter into a contract [page 7, article unknown].

2. **Legal obligation**: Processing carried out in accordance with a legal obligation to which the controller is subject or necessary for the performance of a task carried out in the public interest or in the exercise of official authority [page 7, article unknown].

3. **Public interest**: Processing necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller [page 8, article unknown].

4. **Legitimate interests**: The legitimate interests of a controller, provided that the interests or fundamental rights and freedoms of the data subject are not overriding [page 8, article unknown]. 

5. **Vital interests**: Processing based on the vital interest of another natural person, whic

Error in RootListenersTracer.on_chain_end callback: AttributeError("'ChatMessageHistory' object has no attribute 'add_messages'")


A2: Consent under GDPR means that individuals must clearly agree to the processing of their personal data. This agreement can be given in various ways, such as through a written statement, ticking a box online, or other clear actions. However, simply being silent or having pre-ticked boxes does not count as consent. 

Consent must cover all purposes for which the data will be used, and if there are multiple purposes, consent must be obtained for each one. When asking for consent electronically, the request should be straightforward and not disrupt the service being provided. 

Additionally, individuals have the right to withdraw their consent at any time, and it should be just as easy to withdraw consent as it was to give it. Consent is not considered freely given if there is an imbalance of power between the individual and the organization asking for consent, such as when a public authority is involved [page 5, article unknown; page 36, article 7].


### 28. Section: Guardrails for Input and Output

This comment marks the start of the section where simple guardrails are introduced to:

- Filter unsafe or non-compliant user queries.
- Post-process model outputs to detect possible hallucinations.


In [28]:
# Guardrails: Input & Output Filters (Phase 4)

### 29. Input Safety Filter

This function implements a basic input guard:

- It scans the question for suspicious patterns that suggest attempts to bypass GDPR.
- If any pattern is found, it raises an error instead of answering.
- Otherwise, it returns the original question.

This is a minimal example of adding compliance guardrails on user input.


In [29]:
def input_safety_filter(question: str) -> str:
    """
    Very simple adversarial filter.
    You can extend this with more NLP-based detectors.
    """
    lower_q = question.lower()
    blocked_patterns = [
        "bypass gdpr",
        "evade gdpr",
        "ignore gdpr",
        "trick the regulator",
        "fake consent",
    ]
    if any(p in lower_q for p in blocked_patterns):
        raise ValueError(
            "Your query appears to ask for non-compliant or unethical behavior. "
            "I cannot help with that."
        )
    return question
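
Plain substring matching misses small variations such as extra whitespace or "bypass the GDPR". One way to tolerate these is a regular-expression variant, sketched below. The patterns and the `is_blocked` helper are hypothetical illustrations, not a vetted block list.

```python
import re

# Hypothetical patterns: \s+ tolerates extra whitespace, \b avoids partial-word hits.
BLOCKED_REGEXES = [
    re.compile(r"\b(bypass|evade|ignore|circumvent)\s+(the\s+)?gdpr\b", re.IGNORECASE),
    re.compile(r"\bfake\s+consent\b", re.IGNORECASE),
    re.compile(r"\btrick\s+the\s+regulator\b", re.IGNORECASE),
]


def is_blocked(question: str) -> bool:
    """Return True if any blocked pattern occurs in the question."""
    return any(rx.search(question) for rx in BLOCKED_REGEXES)


print(is_blocked("How can I bypass  the GDPR?"))                   # True
print(is_blocked("Explain when consent is required under GDPR."))  # False
```

Keyword filters of either kind remain easy to paraphrase around, so they are best treated as a first line of defence.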

### 30. Output Hallucination Guard

This guard checks the final answer:

- If the answer does not mention “GDPR” or “Article”, it may not be sufficiently grounded in the regulation.
- In that case, it adds a warning message before returning the answer.

This is a lightweight keyword heuristic: it flags answers that never mention the regulation, but it cannot verify that a cited passage actually supports the claim.


In [30]:
def hallucination_guard(answer: str) -> str:
    """
    Simple output guard:
    - Enforce that answer mentions at least 'GDPR' or 'Article'
    - If not, warn user.
    """
    if ("article" not in answer.lower()) and ("gdpr" not in answer.lower()):
        return (
            "Warning: The model's answer may be insufficiently grounded in GDPR text.\n\n"
            + answer
        )
    return answer


### 31. Combining Guardrails with the RAG Chain

This cell:

1. Wraps the baseline `rag_chain` with:
   - `input_safety_filter` before the model call.
   - `hallucination_guard` after the model call.
2. Demonstrates:
   - A blocked unsafe query.
   - A normal GDPR query that passes and gets answered.

The result is a safer RAG pipeline for GDPR-related questions.


In [31]:
safe_rag_chain = (
    RunnableLambda(input_safety_filter)
    | rag_chain
    | RunnableLambda(lambda msg: hallucination_guard(msg.content))
)

print("\n=== Guardrails Demo ===")
try:
    bad_q = "How can I bypass GDPR to process user data without them knowing?"
    print("Bad query:", bad_q)
    print("Answer:", safe_rag_chain.invoke(bad_q))
except ValueError as e:
    print("Blocked bad query:", e)

good_q = "Explain when consent is required under GDPR."
print("\nGood query:", good_q)
print("Answer:", safe_rag_chain.invoke(good_q))


=== Guardrails Demo ===
Bad query: How can I bypass GDPR to process user data without them knowing?
Blocked bad query: Your query appears to ask for non-compliant or unethical behavior. I cannot help with that.

Good query: Explain when consent is required under GDPR.
Answer: The regulation specifies that consent is required when processing is based on the data subject's consent. The controller must be able to demonstrate that the data subject has consented to the processing of their personal data [page 5, article unknown]. Additionally, consent must be given in a manner that is clearly distinguishable from other matters, intelligible, and easily accessible, using clear and plain language [page 36, article unknown]. 

Consent should cover all processing activities for the same purpose or purposes, and if there are multiple purposes, consent must be given for all of them [page 5, article unknown].


### 32. Section: Agentic RAG with Tools

This comment introduces the section where the system is extended with simple “tools” that the agent can use, such as retrieval and summarization utilities.


In [32]:
# Agentic RAG with Simple Tools (Phase 5)

### 33. Retrieval and Citation Tools

This cell defines two utility “tools”:

- `retrieve_tool`: fetches the top-k most relevant chunks for a given query.
- `citation_check_tool`: performs a basic check to see whether the final answer appears to reference pages/articles from the retrieved documents.

These tools can be used by a more agent-like pipeline to reason about context and citations.


In [33]:
def retrieve_tool(query: str, k: int = 5):
    """Tool: retrieve top-k relevant GDPR chunks."""
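    # Query the vector store directly: the shared retriever is fixed at k=4,
    # so slicing its results could never return more than four chunks.
    return vectordb.similarity_search(query, k=k)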

def citation_check_tool(answer: str, docs: List[Any]) -> bool:
    """
    Very naive citation check:
    - Just check if any page/article markers used in format_docs appear in answer.
    """
    markers = []
    for d in docs:
        page = d.metadata.get("page", "?")
        article = d.metadata.get("article", "unknown")
        markers.append(f"page {page}")
        markers.append(f"article {article}")
    return any(m.lower() in answer.lower() for m in markers)
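
The check above also passes when the answer merely contains a fallback marker such as `article unknown`. A stricter variant could compare the page numbers cited in the answer against the pages that were actually retrieved. The sketch below is one such variant under stated assumptions: `strict_citation_check` is a hypothetical name, citations are assumed to use the `page N` form, and `SimpleNamespace` objects stand in for retrieved documents.

```python
import re
from types import SimpleNamespace
from typing import Any, List, Set

# Matches "page N" references such as those produced by format_docs.
PAGE_REF = re.compile(r"page\s+(\d+)", re.IGNORECASE)


def cited_pages(answer: str) -> Set[str]:
    """Collect every page number the answer cites."""
    return set(PAGE_REF.findall(answer))


def strict_citation_check(answer: str, docs: List[Any]) -> bool:
    """True only if at least one page is cited and every cited page was retrieved."""
    retrieved = {
        str(d.metadata["page"]) for d in docs if d.metadata.get("page") is not None
    }
    cited = cited_pages(answer)
    return bool(cited) and cited <= retrieved


# Hypothetical retrieved documents for illustration.
demo_docs = [
    SimpleNamespace(metadata={"page": 35}),
    SimpleNamespace(metadata={"page": 41, "article": "6"}),
]
print(strict_citation_check("Consent can be withdrawn [page 41, article 6].", demo_docs))  # True
print(strict_citation_check("See page 99 for details.", demo_docs))                        # False
print(strict_citation_check("No citations here.", demo_docs))                              # False
```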

### 34. Summarization Tool

This tool takes retrieved documents and a question, and uses the LLM to generate a focused summary or answer.  
It demonstrates how the RAG system can expose more specialized helper functions beyond basic retrieval.


In [34]:
def summarizer_tool(docs: List[Any], question: str) -> str:
    """Summarizer tool uses LLM to compress docs into a focused answer."""
    context = format_docs(docs)
    prompt = ChatPromptTemplate.from_template(
        textwrap.dedent(
            """
            You are a GDPR summarizer.
            Use ONLY the context below to answer the question.
            Provide a concise, well-structured explanation with citations.

            Context:
            {context}

            Question:
            {question}
            """
        )
    )
    chain = prompt | llm
    result = chain.invoke({"context": context, "question": question})
    return result.content

### 35. Agentic RAG Orchestration

This function coordinates the use of tools (retrieval, summarization, citation checking) to answer a question in a slightly more “agentic” manner, rather than a single-pass RAG call.  
It showcases how higher-level logic can be layered on top of the basic RAG building blocks.


In [35]:
def agentic_rag(question: str) -> str:
    """
    Simple agent:
    1. Apply input guardrails
    2. Use retrieve_tool
    3. Use summarizer_tool
    4. Run citation_check_tool; if fails, warn user
    """
    q = input_safety_filter(question)
    docs = retrieve_tool(q, k=5)
    if not docs:
        return "I couldn't find any relevant GDPR content to answer this question."

    answer = summarizer_tool(docs, q)
    has_citations = citation_check_tool(answer, docs)
    if not has_citations:
        answer = (
            "Note: The generated answer may be missing explicit citations; "
            "please verify against GDPR text.\n\n"
            + answer
        )
    return answer

print("\n=== Agentic RAG Demo ===")
print("Question:", "What are data subject rights under GDPR?")
print("Answer:", agentic_rag("What are data subject rights under GDPR?"))


=== Agentic RAG Demo ===
Question: What are data subject rights under GDPR?
Answer: Under the General Data Protection Regulation (GDPR), data subjects have several key rights designed to protect their personal data and ensure transparency in its processing. These rights include:

1. **Right of Access**: Data subjects have the right to obtain confirmation from the data controller regarding whether their personal data is being processed. If so, they can access their personal data and receive information about the purposes of processing, categories of data, recipients of the data, storage duration, and their rights to rectification or erasure (Article 15) (page 42).

2. **Right to Object**: Data subjects can object to the processing of their personal data, particularly when it is processed for scientific, historical, or statistical purposes, unless the processing is necessary for a task carried out in the public interest (Article 89) (page 45).

3. **Right to Transparent Information**: C

### 36. Section: Graph-RAG Style Retrieval

This comment introduces an experimental “Graph-RAG” approach, where the system can reason about related sections or neighbors in the text, instead of treating chunks as independent.


In [36]:
# Graph-RAG Style Retrieval (Phase 6)

### 37. Regulatory Rephrasing Helper

This helper function rewrites user questions into a more formal or regulation-oriented phrasing.  
The goal is to make queries align better with how GDPR is written, potentially improving retrieval quality.


In [37]:
def regulatory_rephrase(question: str) -> str:
    """
    Use the LLM to rephrase a natural question into more 'regulation-like' language.
    """
    prompt = ChatPromptTemplate.from_template(
        textwrap.dedent(
            """
            Rephrase the user question into a GDPR-regulation-style query,
            referencing articles or sections if appropriate.
            Do NOT answer the question, only rephrase it.

            User question: {question}
            """
        )
    )
    chain = prompt | llm
    result = chain.invoke({"question": question})
    return result.content.strip()


### 38. Retrieving Anchor and Neighbor Chunks

This function performs a retrieval step that not only finds the most relevant “anchor” chunks but also includes neighboring chunks.  
This can approximate a graph-style neighborhood around relevant GDPR passages.


In [38]:
def get_anchor_and_neighbors(question: str):
    """
    1. Rephrase to regulatory language
    2. Get anchor doc
    3. Get neighbors: same page +/- 1, or same article if available
    """
    ref_q = regulatory_rephrase(question)
    # retriever.invoke replaces the deprecated get_relevant_documents
    anchor_docs = retriever.invoke(ref_q)
    if not anchor_docs:
        return None, []

    anchor = anchor_docs[0]
    anchor_page = anchor.metadata.get("page", None)
    anchor_article = anchor.metadata.get("article", None)

    neighbors = []
    for d in chunks:
        page = d.metadata.get("page", None)
        article = d.metadata.get("article", None)

        # Same article OR neighboring pages
        if anchor_article != "unknown" and article == anchor_article:
            neighbors.append(d)
        elif anchor_page is not None and page in {anchor_page - 1, anchor_page, anchor_page + 1}:
            neighbors.append(d)

    # Deduplicate neighbors
    seen_ids = set()
    unique_neighbors = []
    for d in neighbors:
        key = (d.metadata.get("page"), d.metadata.get("article"), d.page_content[:50])
        if key not in seen_ids:
            seen_ids.add(key)
            unique_neighbors.append(d)

    return anchor, unique_neighbors
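
The neighbor rule can be isolated into a small predicate, which makes it easier to test on its own. The sketch below is a variant under one assumption: a missing `article` key is treated like `"unknown"`. `is_neighbor` is a hypothetical helper name.

```python
from typing import Any, Dict


def is_neighbor(anchor_meta: Dict[str, Any], meta: Dict[str, Any]) -> bool:
    """Neighbor rule: same known article, or a page within one of the anchor page."""
    anchor_article = anchor_meta.get("article", "unknown")
    if anchor_article != "unknown" and meta.get("article") == anchor_article:
        return True
    anchor_page = anchor_meta.get("page")
    page = meta.get("page")
    if anchor_page is not None and page is not None:
        return abs(page - anchor_page) <= 1
    return False


anchor_meta = {"page": 35, "article": "6"}
print(is_neighbor(anchor_meta, {"page": 41, "article": "6"}))        # True: same article
print(is_neighbor(anchor_meta, {"page": 36, "article": "unknown"}))  # True: adjacent page
print(is_neighbor(anchor_meta, {"page": 50, "article": "7"}))        # False
```

Note that the anchor chunk itself satisfies the rule, so callers that combine the anchor with its neighbors may want to filter it out first.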

### 39. Graph-RAG Style Answering

This function uses the anchor-and-neighbor context to generate an answer that accounts for related GDPR passages.  
It illustrates how retrieval strategies beyond simple top-k search can be integrated into RAG.


In [39]:
def graph_rag_answer(question: str) -> str:
    """
    Graph-like RAG:
    - Rephrase
    - Anchor retrieval + neighbors
    - Summarize
    - Append a traceability note with the anchor's page and article
    """
    try:
        q = input_safety_filter(question)
    except ValueError as e:
        return str(e)

    anchor, neighbors = get_anchor_and_neighbors(q)
    if anchor is None:
        return "I couldn't locate relevant GDPR sections for this question."

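    # Use anchor + top neighbors as context; drop the anchor from the neighbor
    # list first, since it also satisfies the same-page rule and would appear twice
    others = [d for d in neighbors if d.page_content != anchor.page_content]
    context_docs = [anchor] + others[:8]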
    context = format_docs(context_docs)

    prompt = ChatPromptTemplate.from_template(
        textwrap.dedent(
            """
            You are a GDPR legal assistant using graph-enhanced retrieval.
            Answer based ONLY on the context below.
            Make sure to:
            - Mention relevant Articles explicitly.
            - Describe any conditions or exceptions.
            - Provide page and article references.

            Context:
            {context}

            Question:
            {question}
            """
        )
    )
    chain = prompt | llm
    answer = chain.invoke({"context": context, "question": q}).content

    # Append a short traceability note (anchor page and article)
    anchor_page = anchor.metadata.get("page", "?")
    trailer = textwrap.dedent(
        f"""
        ---
        Traceability Note:
        - Anchor page: {anchor_page}
        - Anchor article: {anchor.metadata.get("article", "unknown")}
        """
    )
    return answer + "\n" + trailer

demo_question = "When is a Data Protection Impact Assessment (DPIA) required under GDPR?"
print("\n=== Graph-RAG Demo ===")
print("Question:", demo_question)
print("Answer:", graph_rag_answer(demo_question))



=== Graph-RAG Demo ===
Question: When is a Data Protection Impact Assessment (DPIA) required under GDPR?
Answer: A Data Protection Impact Assessment (DPIA) is required under the General Data Protection Regulation (GDPR) when processing operations are likely to result in a high risk to the rights and freedoms of natural persons. Specifically, this requirement is outlined in the context of Article 35 of the GDPR, which mandates that the controller must carry out a DPIA to evaluate the origin, nature, particularity, and severity of the risk involved in the processing.

According to the context provided, if the DPIA indicates that the processing operations involve a high risk that cannot be mitigated by appropriate measures (considering available technology and costs of implementation), the controller must consult the supervisory authority prior to the processing (see page 15).

Additionally, the likelihood and severity of the risk should be assessed based on the nature, scope, context, a

### 40. LangGraph Agent with Memory

This section introduces a minimal LangGraph-based agent that combines:

- RAG retrieval,
- conversational memory,
- and stateful graph execution.

It serves as a higher-level “shell” around the existing components.
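
To make the memory idea concrete, here is a minimal pure-Python sketch of the pattern: a reducer that appends new messages to the stored list, and a store keyed by thread ID. It is only an illustration under stated assumptions. `append_messages` and `ThreadMemory` are hypothetical names, messages are plain tuples rather than LangChain message objects, and LangGraph's own `add_messages` reducer and `MemorySaver` checkpointer do more than this (for example, the real reducer also matches messages by ID).

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, content); stand-in for HumanMessage / AIMessage


def append_messages(existing: List[Message], update: List[Message]) -> List[Message]:
    """Reducer: merge a node's partial update into the stored list by appending."""
    return existing + update


class ThreadMemory:
    """Hypothetical in-memory store keyed by thread_id (not LangGraph's MemorySaver)."""

    def __init__(self) -> None:
        self._threads: Dict[str, List[Message]] = {}

    def run_turn(self, thread_id: str, user_text: str) -> List[Message]:
        # Load the history saved for this thread, if any.
        history = self._threads.get(thread_id, [])
        history = append_messages(history, [("human", user_text)])
        # A node returns only its new messages; the reducer folds them in.
        node_update = [("ai", f"echo: {user_text}")]
        history = append_messages(history, node_update)
        self._threads[thread_id] = history
        return history


memory = ThreadMemory()
memory.run_turn("demo-thread-1", "first question")
second = memory.run_turn("demo-thread-1", "follow-up")
other = memory.run_turn("demo-thread-2", "unrelated")
print(len(second), len(other))
```

Reusing the same thread ID accumulates history across turns, while a different thread ID starts from an empty list.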


In [40]:
# LangGraph + Memory (Agent Shell) (Phase 5 + 3, minimal)

In [41]:
class GraphState(TypedDict):
    messages: Annotated[List[AnyMessage], add_messages]

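def rag_node(state: GraphState) -> GraphState:
    """
    LangGraph node that uses safe_rag_chain as its backend.
    Reads the latest human message and returns an AI message, which
    add_messages appends to the stored history. Earlier messages are
    not passed to the chain.
    """
    # Find the most recent human message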
    last_human = None
    for m in reversed(state["messages"]):
        if isinstance(m, HumanMessage):
            last_human = m
            break

    if last_human is None:
        return {"messages": [AIMessage(content="No user question found.")]}

    question = last_human.content
    try:
        answer = safe_rag_chain.invoke(question)
        if hasattr(answer, "content"):
            text = answer.content
        else:
            text = str(answer)
    except Exception as e:
        text = f"Error while answering question: {e}"

    return {"messages": [AIMessage(content=text)]}

# Build graph
graph_builder = StateGraph(GraphState)
graph_builder.add_node("rag_node", rag_node)
graph_builder.set_entry_point("rag_node")
graph_builder.add_edge("rag_node", END)

checkpointer = MemorySaver()
graph_app = graph_builder.compile(checkpointer=checkpointer)

print("\n=== LangGraph Conversation Demo ===")
initial_state: GraphState = {
    "messages": [HumanMessage(content="Summarize the principles of data processing under GDPR.")]
}
result_state = graph_app.invoke(
    initial_state,
    config={"configurable": {"thread_id": "demo-thread-1"}}
)
for msg in result_state["messages"]:
    print(type(msg).__name__, ":", msg.content)


=== LangGraph Conversation Demo ===
HumanMessage : Summarize the principles of data processing under GDPR.
AIMessage : The principles of data processing under GDPR are outlined in Article 5. They include:

1. **Lawfulness, Fairness, and Transparency**: Personal data must be processed lawfully, fairly, and in a transparent manner in relation to the data subject [page 34, article 5].

2. **Purpose Limitation**: Data should be collected for specified, explicit, and legitimate purposes and not further processed in a manner incompatible with those purposes. However, further processing for archiving in the public interest, scientific or historical research, or statistical purposes is permitted under certain conditions [page 34, article 5].

3. **Data Minimization**: Although not explicitly mentioned in the provided context, this principle generally requires that personal data collected should be adequate, relevant, and limited to what is necessary for the purposes for which they are process

In [42]:
# 10. Simple Robustness & Hallucination Test (Phase 7)

In [43]:
def robust_ask(question: str) -> str:
    """
    Guarded entry point:
    - Queries the retriever directly
    - Refuses to answer if the retriever returns no documents
    - Otherwise delegates to agentic_rag for a grounded answer
    """
    # The retriever is a Runnable, so it is called with .invoke(...)
    docs = retriever.invoke(question)
    # This guard only fires when the retriever returns an empty list
    if not docs:
        return "I could not find sufficiently relevant GDPR text. I prefer not to answer rather than hallucinate."

    return agentic_rag(question)

print("\n=== Robustness / Hallucination Demo ===")
print("Real question:", "What is Article 6 about?")
print("Answer:", robust_ask("What is Article 6 about?"))

print("\nFake question:", "What does non-existent Article 999 say?")
print("Answer:", robust_ask("What does non-existent Article 999 say?"))

print("\nAll phases executed. Your GDPR RAG system is up and running.")


=== Robustness / Hallucination Demo ===
Real question: What is Article 6 about?
Answer: Note: The generated answer may be missing explicit citations; please verify against GDPR text.

The provided context does not include specific information about Article 6 of the GDPR. Therefore, I cannot summarize or explain its content based on the given text. If you have access to the text of Article 6 or any additional context, please provide it for a more accurate response.

Fake question: What does non-existent Article 999 say?
Answer: The context provided does not contain any information regarding a non-existent Article 999. Therefore, I cannot provide a summary or explanation about it. The articles referenced include Article 52, which discusses the independence of supervisory authorities, Article 49, which outlines conditions for transferring personal data to third countries, and Article 94, which addresses the repeal of Directive 95/46/EC. If you have questions about these specific articles
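
In the Article 999 run above, the retriever still returned chunks, so the empty-result guard in `robust_ask` did not fire and the refusal was left to the model. One possible refinement is to gate retrieval on a minimum similarity score. The sketch below illustrates the idea with toy two-dimensional vectors. Everything in it is an assumption for illustration: `gated_retrieve`, `answer_or_refuse`, the `0.8` threshold, and the toy index are hypothetical and are not part of the notebook or of LangChain.

```python
import math
from typing import List, Sequence, Tuple

REFUSAL = "I could not find sufficiently relevant GDPR text."


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gated_retrieve(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
    min_score: float = 0.8,  # arbitrary threshold, chosen for the toy data
) -> List[str]:
    """Return up to k texts, keeping only those that score at least min_score."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in index), reverse=True)
    return [text for score, text in scored[:k] if score >= min_score]


def answer_or_refuse(query_vec: Sequence[float], index) -> str:
    hits = gated_retrieve(query_vec, index)
    if not hits:
        return REFUSAL  # nothing cleared the threshold: refuse instead of guessing
    return "Context: " + " | ".join(hits)


# Toy index with made-up embeddings.
INDEX = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.9, 0.1]),
    ("chunk C", [0.0, 1.0]),
]

print(answer_or_refuse([1.0, 0.05], INDEX))   # close to chunks A and B
print(answer_or_refuse([-1.0, -1.0], INDEX))  # far from every chunk
```

A suitable threshold depends on the embedding model and the distance metric, so it would need to be tuned on real queries rather than taken from this sketch.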