# EU GDPR RAG Assistant: Project Notebook


**Artifacts included:**
- End-to-end RAG pipeline (PDF → Article chunks → embeddings + FAISS → retrieval → generation)
- Evaluation table on sample questions
- Full Gradio app code


## 1. Problem Definition

**Goal:** Build a Retrieval-Augmented Generation (RAG) assistant that answers questions about the **EU GDPR** by retrieving relevant **Article** chunks and generating concise, grounded answers with **Article citations**.

**Why it matters:** GDPR is long and complex; the assistant enables faster navigation and verifiable answers (citations).

## 2. Data Source & Understanding

**Source:** EU GDPR PDF (Regulation (EU) 2016/679) from EUR-Lex.

**Data characteristics & challenges:**
- Public legal PDF
- PDF text extraction can introduce artifacts (soft hyphens, broken words, extra whitespace)
- Article headers are extracted inconsistently, so the chunker has to tolerate split or spaced-out headers

In [2]:

%pip install -q pypdf sentence-transformers faiss-cpu transformers accelerate requests pandas numpy gradio
#pip install tf-keras

import os, re
import numpy as np
import pandas as pd
import requests
from pypdf import PdfReader


Note: you may need to restart the kernel to use updated packages.


In [3]:
DATA_DIR = 'data'
RAW_DIR = os.path.join(DATA_DIR, 'raw')
os.makedirs(RAW_DIR, exist_ok=True)

PDF_URL = 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX%3A32016R0679'
PDF_PATH = os.path.join(RAW_DIR, 'gdpr_2016_679_oj.pdf')
PDF_URL

'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX%3A32016R0679'

In [4]:
def download_file(url: str, out_path: str, chunk_size: int = 1024 * 1024):
    if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
        print(f'PDF already exists: {out_path}')
        return
    print('Downloading GDPR PDF...')
    r = requests.get(url, stream=True, timeout=120)
    r.raise_for_status()
    with open(out_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:
                f.write(chunk)
    print('Saved:', out_path)

download_file(PDF_URL, PDF_PATH)

PDF already exists: data/raw/gdpr_2016_679_oj.pdf


## 3. Data Preparation

### 3.1 Text Cleaning
Cleaning removes common PDF extraction artifacts while preserving line breaks to support header detection.

In [5]:
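def clean_text(s: str) -> str:
    if not s:
        return ''
    s = s.replace('\u00ad', '')          # drop soft hyphens
    s = re.sub(r'-\s*\n\s*', '', s)      # rejoin words hyphenated across line breaks
    s = re.sub(r'\n{3,}', '\n\n', s)     # collapse runs of blank lines
    s = re.sub(r'[ \t]+', ' ', s)        # collapse spaces and tabs
    # Join letter-spaced words: "A r t i c l e" -> "Article"
    s = re.sub(r'\b(?:[A-Za-z]\s){2,}[A-Za-z]\b', lambda m: m.group(0).replace(' ', ''), s)
    # Join a word with a following 1-3 letter fragment (repeated for chained splits).
    # Caveat: this also glues genuine short words to the preceding word
    # ("processing of" -> "processingof"), which is visible in the outputs below.
    for _ in range(4):
        s = re.sub(r'(?<!\w)([A-Za-z]{2,})\s+([A-Za-z]{1,3})(?!\w)', r'\1\2', s)
    # Targeted repairs for two frequent splits
    s = re.sub(r'\bAr\s+ticle\b', 'Article', s, flags=re.IGNORECASE)
    s = re.sub(r'\bChar\s+ter\b', 'Charter', s, flags=re.IGNORECASE)
    s = re.sub(r'\s+([.,;:])', r'\1', s)  # remove whitespace before punctuation
    return s.strip()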
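To see what the short-fragment merge in `clean_text` does in isolation, the sketch below applies the same regular expression to a made-up string and compares it with a guarded variant. `merge_plain`, `merge_guarded` and `STOPWORDS` are hypothetical names introduced here for illustration and are not part of the notebook; the stop-word list is a small, incomplete assumption.

```python
import re

# Same pattern as the repeated substitution in clean_text
MERGE = re.compile(r"(?<!\w)([A-Za-z]{2,})\s+([A-Za-z]{1,3})(?!\w)")

# Hypothetical, deliberately small list of short words that should stay separate
STOPWORDS = {
    "a", "an", "as", "at", "be", "by", "in", "is", "of", "on", "or", "to",
    "and", "for", "his", "her", "its", "not", "the",
}

def merge_plain(s: str, passes: int = 4) -> str:
    """Merge every word with a following 1-3 letter fragment."""
    for _ in range(passes):
        s = MERGE.sub(r"\1\2", s)
    return s

def merge_guarded(s: str, passes: int = 4) -> str:
    """Same merge, but leave the pair alone if either side is a stop word."""
    def repl(m):
        left, right = m.group(1), m.group(2)
        if left.lower() in STOPWORDS or right.lower() in STOPWORDS:
            return m.group(0)
        return left + right
    for _ in range(passes):
        s = MERGE.sub(repl, s)
    return s

sample = "contr ol of personal data"  # made-up string with one genuine split
print(merge_plain(sample))    # controlof personal data
print(merge_guarded(sample))  # control of personal data
```

The guarded variant repairs the split word without gluing "of" onto it. Whether a stop-word list is the right guard for this PDF is an open question; a dictionary check or a different extractor may work better.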


In [6]:
def extract_pages(pdf_path: str) -> pd.DataFrame:
    reader = PdfReader(pdf_path)
    rows = []
    for i, page in enumerate(reader.pages):
        txt = clean_text(page.extract_text() or '')
        rows.append({'page': i + 1, 'text': txt, 'source_url': PDF_URL})
    df = pd.DataFrame(rows)
    df['n_chars'] = df['text'].str.len()
    return df

df_pages = extract_pages(PDF_PATH)
df_pages.head()

| | page | text | source_url | n_chars |
| --- | --- | --- | --- | --- |
| 0 | 1 | `I \n(Legislative acts) \nREGULA TIONS \nREGULA...` | https://eur-lex.europa.eu/legal-content/EN/TXT... | 2628 |
| 1 | 2 | `(4) The processingof personal data shouldbe de...` | https://eur-lex.europa.eu/legal-content/EN/TXT... | 5150 |
| 2 | 3 | `(11) Effe ctive prote ctionof personal data th...` | https://eur-lex.europa.eu/legal-content/EN/TXT... | 4883 |
| 3 | 4 | `household activities could includecor responde...` | https://eur-lex.europa.eu/legal-content/EN/TXT... | 5381 |
| 4 | 5 | `(23) In orderto ensure that natural personsare...` | https://eur-lex.europa.eu/legal-content/EN/TXT... | 4678 |


In [7]:
df_pages['n_chars'].describe()

count      88.000000
mean     3914.386364
std      1009.295241
min       285.000000
25%      3221.500000
50%      3803.000000
75%      4869.500000
max      5620.000000
Name: n_chars, dtype: float64

## 4. System / Model Design

**RAG architecture:**
1. PDF → cleaned text per page
2. Article-wise chunking
3. Embed each chunk (SentenceTransformers)
4. Store embeddings in FAISS index
5. Retrieve top-k chunks for a question
6. Generate answer using a free LLM (FLAN-T5) constrained to retrieved context

**Key design choice:** Article-wise chunks improve interpretability and allow citations as (Article X).

## 5. Implementation

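### 5.1 Article-wise chunking
The chunker detects Article headers with a regular expression, slices the text into Article blocks, and deduplicates by keeping the longest chunk per Article. The Gradio app in Appendix A uses a line-based variant of the same idea (header lookahead, `min_chars=50`).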

In [8]:
HDR = re.compile(
    r"(?im)(?:^|\n)\s*(A\s*R\s*T\s*I\s*C\s*L\s*E|ARTICLE|Article)"
    r"\s*[\n ]*\(?\s*(\d{1,2})\s*\)?\.?"
)

def chunk_by_articles(df_pages: pd.DataFrame, min_chars: int = 200):
    full_text = "\n".join(df_pages.sort_values('page')['text'].tolist())
    matches = list(HDR.finditer(full_text))
    chunks = []
    for i, m in enumerate(matches):
        art = int(m.group(2))
        if not (1 <= art <= 99):
            continue
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        block = full_text[start:end].strip()
        if len(block) < min_chars:
            continue
        chunks.append({'article': art, 'heading': f'Article {art}', 'text': block, 'len': len(block)})
    df_raw = pd.DataFrame(chunks)
    df_best = (
        df_raw.sort_values(['article', 'len'], ascending=[True, False])
              .drop_duplicates('article', keep='first')
              .sort_values('article')
    )
    df_chunks = pd.DataFrame({
        'chunk_id': df_best['article'].apply(lambda x: f'article_{int(x)}'),
        'article': df_best['article'].astype(int),
        'heading': df_best['heading'],
        'text': df_best['text'],
        'source_url': PDF_URL,
    })
    missing = sorted(set(range(1, 100)) - set(df_chunks['article']))
    return df_chunks, missing

df_chunks, missing = chunk_by_articles(df_pages)
print('Chunks:', len(df_chunks))
print('Missing articles:', missing)
df_chunks.head()

Chunks: 77
Missing articles: [1, 5, 12, 13, 16, 21, 23, 24, 32, 35, 37, 40, 44, 51, 55, 60, 63, 68, 77, 85, 92, 94]


| | chunk_id | article | heading | text | source_url |
| --- | --- | --- | --- | --- | --- |
| 1 | article_2 | 2 | Article 2 | `Article 2 \nMaterial scope \n1. This Regulatio...` | https://eur-lex.europa.eu/legal-content/EN/TXT... |
| 2 | article_3 | 3 | Article 3 | `Article 3 \nT err itorial scope \n1. This Regu...` | https://eur-lex.europa.eu/legal-content/EN/TXT... |
| 3 | article_4 | 4 | Article 4 | `Article 4 \nDef initionsFor thepur posesof thi...` | https://eur-lex.europa.eu/legal-content/EN/TXT... |
| 4 | article_6 | 6 | Article 6 | `Article 6 \nLawfulnessof processing \n1. Proce...` | https://eur-lex.europa.eu/legal-content/EN/TXT... |
| 5 | article_7 | 7 | Article 7 | `Article 7 \nConditionsfor consent \n1. Where p...` | https://eur-lex.europa.eu/legal-content/EN/TXT... |
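The "keep the longest chunk per Article" step is easier to follow on a toy input. The following is a dependency-free sketch of that idea, not the notebook's chunker: `HEADER` is a simplified pattern that only matches a line of the form `Article <n>`, and `split_articles` and the toy text are made up for illustration.

```python
import re

# Simplified header pattern (assumption): a whole line reading "Article <n>"
HEADER = re.compile(r"^[ \t]*Article[ \t]+(\d{1,2})[ \t]*$", re.MULTILINE | re.IGNORECASE)

def split_articles(text: str) -> dict:
    """Slice text at header lines and keep the longest block per Article number."""
    matches = list(HEADER.finditer(text))
    best = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        block = text[m.start():end].strip()
        art = int(m.group(1))
        # A repeated header (for example in a table of contents) yields a short
        # block; the longer block is assumed to be the real Article body.
        if art not in best or len(block) > len(best[art]):
            best[art] = block
    return dict(sorted(best.items()))

toy = (
    "Article 1\nSubject-matter\n"
    "Article 2\nMaterial scope\n"
    "Article 1\nSubject-matter and objectives\n1. This Regulation lays down rules.\n"
)
result = split_articles(toy)
print(list(result))           # [1, 2]
print(len(result[1]) > 30)    # True: the longer Article 1 block was kept
```

Keeping the longest block is a heuristic. If a header is missed entirely, the preceding Article's block absorbs the missing Article's text, which may explain part of the "Missing articles" list above.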


### 5.2 Embeddings + FAISS Index

In [9]:
from sentence_transformers import SentenceTransformer
import faiss

EMBED_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
embedder = SentenceTransformer(EMBED_MODEL_NAME)

emb = embedder.encode(df_chunks['text'].tolist(), normalize_embeddings=True, show_progress_bar=True)
emb = np.asarray(emb, dtype=np.float32)

index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
print('Index size:', index.ntotal, 'dim:', emb.shape[1])

Index size: 77 dim: 384
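The cell above normalizes the embeddings and stores them in an inner-product index. On unit-length vectors the inner product equals cosine similarity, which is why the scores can be read as cosine scores. A small pure-Python check with made-up two-dimensional vectors (the helper names are illustrative only):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]  # toy vectors, not real embeddings

# Inner product of the normalized vectors ...
ip = dot(l2_normalize(a), l2_normalize(b))
# ... matches the cosine similarity of the raw vectors
print(round(ip, 2), round(cosine(a, b), 2))  # 0.96 0.96
```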


### 5.3 Retrieval (top-k Articles)

In [10]:
def retrieve(query: str, k: int = 6):
    q_emb = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, idxs = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], idxs[0]):
        row = df_chunks.iloc[int(idx)]
        results.append({
            'score': float(score),
            'chunk_id': row['chunk_id'],
            'article': int(row['article']),
            'heading': row['heading'],
            'text': row['text'],
            'source_url': row['source_url'],
        })
    return results

retrieve('What conditions must be met for consent to be valid under GDPR?', k=5)[:2]

[{'score': 0.6129947900772095,
  'chunk_id': 'article_7',
  'article': 7,
  'heading': 'Article 7',
  'text': "Article 7 \nConditionsfor consent \n1. Where processingis basedon consent, the controller shallbe ableto demonstrate thatthe data subjecthas \nconsentedto processingof hisorher personal data. \n2. Ifthe data subject's consentis giveninthe contextofawr itten declaration which also concerns other matters, the \nrequestfor consent shallbe presentedina manner whichis clearly distinguishable fromthe other matters, inan \nintelli gibleand easily accessibleform, using clearand plain language. Anypartof sucha declaration which constitutesan infr ingementof this Regulation shallnotbe binding. \n3. The data subject shallhave ther ightto withdrawhis orher consentatany time. The withdrawalof consent shallnot affect thela wfulnessof processing basedon consentbef oreits withdrawal. Priorto giving consent, the data subject \nshallbe informed thereof. It shallbeas easyto withdraw asto give co
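One edge case worth keeping in mind: when `k` exceeds the number of vectors in the index, FAISS pads the result with index `-1`, and `df_chunks.iloc[-1]` would silently return the last row. The sketch below shows a filter on plain lists; `valid_hits` is a hypothetical helper and the scores are made up.

```python
def valid_hits(scores, idxs):
    """Pair scores with row positions, dropping FAISS padding entries (index -1)."""
    return [(float(s), int(i)) for s, i in zip(scores, idxs) if int(i) >= 0]

# Made-up search result where only two neighbours were available
scores = [0.61, 0.55, -3.4e38]
idxs = [4, 9, -1]
print(valid_hits(scores, idxs))  # [(0.61, 4), (0.55, 9)]
```

With 77 chunks and `k=6` the notebook does not hit this case, but it matters once `k` is raised or the index shrinks.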

### 5.4 Answer generation (Free LLM: FLAN-T5)

The generator is prompted to answer only from retrieved context and to output citations as (Article X).

In [11]:
from transformers import pipeline

LLM_NAME = 'google/flan-t5-base'
llm = pipeline('text2text-generation', model=LLM_NAME, device=-1)

def build_prompt(question: str, contexts: list):
    c = contexts[0]
    excerpt = re.sub(r'\s{2,}', ' ', c['text'].replace('\n', ' ')).strip()[:700]
    return (
        'You are a GDPR document assistant.\n'
        'Use ONLY the context.\n'
        'Return 3-6 bullet points.\n'
        'Each bullet MUST end with (Article X).\n\n'
        f'Question: {question}\n\n'
        f"Context:\n(Article {c['article']}) {excerpt}\n\n"
        'Answer:\n'
    )

def ensure_citations(answer: str, contexts: list) -> str:
    art = contexts[0]['article']
    cite = f'(Article {art})'
    lines = []
    for l in answer.splitlines():
        l = l.strip()
        if not l:
            continue
        if not re.sub(r'\(Article\s+\d+\)', '', l).strip():
            continue
        if not l.startswith('-'):
            l = '- ' + l
        if '(Article' not in l:
            l += ' ' + cite
        lines.append(l)
    return '\n'.join(lines).strip()

def rag_answer(question: str, k: int = 6):
    ctx = retrieve(question, k=k)
    prompt = build_prompt(question, ctx)
    raw = llm(prompt, max_new_tokens=180, do_sample=False)[0]['generated_text']
    ans = ensure_citations(raw, ctx)
    return {'question': question, 'answer': ans, 'contexts': ctx}

demo = rag_answer('When must a personal data breach be notified to the supervisory authority?', k=6)
print(demo['answer'])

Device set to use cpu


- not later than 72 hoursaf terhaving become aware ofit, notifythe personal data breach tothe super visory authority competentin accordance withAr ticle 55, unlessthe personal data breachis unlikelyto resultinar iskto ther ightsand freedomsof natural persons. (Article 33)
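To make the post-processing step concrete, here is `ensure_citations` run on a made-up model output. The function body is copied from the cell above with the loop variable renamed; the input string is invented for illustration.

```python
import re

def ensure_citations(answer: str, contexts: list) -> str:
    art = contexts[0]['article']
    cite = f'(Article {art})'
    lines = []
    for line in answer.splitlines():
        line = line.strip()
        if not line:
            continue
        # Skip lines that contain nothing except a citation
        if not re.sub(r'\(Article\s+\d+\)', '', line).strip():
            continue
        if not line.startswith('-'):
            line = '- ' + line
        # Fall back to the top retrieved Article when the model gave no citation
        if '(Article' not in line:
            line += ' ' + cite
        lines.append(line)
    return '\n'.join(lines).strip()

raw = "72 hours after becoming aware\n\n(Article 33)\n- notify the authority (Article 33)"
fixed = ensure_citations(raw, [{'article': 33}])
print(fixed)
# - 72 hours after becoming aware (Article 33)
# - notify the authority (Article 33)
```

Because the fallback citation is always the top retrieved Article, an appended citation records what was retrieved rather than confirming that the bullet is supported by that Article.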


## 6. Evaluation

We evaluate qualitatively on a small question set:
- verify the retrieved top Article
- inspect answer previews and citation format

In [12]:
EVAL_QUESTIONS = [
    'What are the lawful bases for processing personal data?',
    'What conditions must be met for consent to be valid under GDPR?',
    'What information must be provided under the right of access?',
    'What does GDPR require regarding security of processing?',
    'When must a personal data breach be notified to the supervisory authority?',
]

rows = []
for q in EVAL_QUESTIONS:
    out = rag_answer(q, k=6)
    top = out['contexts'][0]
    rows.append({
        'question': q,
        'top_article': top['article'],
        'top_score': round(top['score'], 3),
        'answer_preview': (out['answer'][:240] + '...') if len(out['answer']) > 240 else out['answer'],
    })

pd.DataFrame(rows)

| | question | top_article | top_score | answer_preview |
| --- | --- | --- | --- | --- |
| 0 | What are the lawful bases for processing perso... | 2 | 0.562 | `- a) inthe courseofan activity whichf alls out...` |
| 1 | What conditions must be met for consent to be ... | 7 | 0.613 | `- 1. Where processingis basedon consent, the c...` |
| 2 | What information must be provided under the ri... | 90 | 0.513 | `- personal data whichthe controlleror processo...` |
| 3 | What does GDPR require regarding security of p... | 3 | 0.552 | `- a) the offering of goods orser vices, ir res...` |
| 4 | When must a personal data breach be notified t... | 33 | 0.62 | `- not later than 72 hoursaf terhaving become a...` |
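The table can be turned into a simple top-1 retrieval check. The sketch below hard-codes the `top_article` values from the table. The expected Articles 6, 7, 15 and 33 follow the sample-question comments in Appendix A; Article 32 for security of processing is an assumption added here, and the short question keys and `hit_at_1` are made up for readability.

```python
# top_article values copied from the evaluation table
retrieved = {
    "lawful bases": 2,
    "consent conditions": 7,
    "right of access": 90,
    "security of processing": 3,
    "breach notification": 33,
}

# Expected Articles: 6, 7, 15, 33 per the Appendix A comments; 32 is an assumption
expected = {
    "lawful bases": 6,
    "consent conditions": 7,
    "right of access": 15,
    "security of processing": 32,
    "breach notification": 33,
}

def hit_at_1(retrieved: dict, expected: dict):
    """Fraction of questions whose top retrieved Article equals the expected one."""
    hits = [q for q in expected if retrieved.get(q) == expected[q]]
    return len(hits) / len(expected), hits

rate, hits = hit_at_1(retrieved, expected)
print(f"hit@1 = {rate:.2f}", hits)
# hit@1 = 0.40 ['consent conditions', 'breach notification']
```

Under these assumptions, two of the five questions retrieve the expected Article first. Article 32 appears in the "Missing articles" list in section 5.1, so it could not have been retrieved at all.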


## 7. Ethical Considerations

- **Not legal advice:** This assistant summarizes GDPR text.
- **Grounding & transparency:** Answers are constrained to retrieved context and include Article citations.
- **Privacy:** Uses a public document and does not process personal user data.
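- **Failure modes:** PDF extraction noise and LLM hallucination are reduced, but not eliminated, by strict prompting and citations. In the evaluation above the top retrieved Article was not always the relevant one, so answers should be checked against the cited text.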

## 8. Conclusion & Future Work

**Conclusion:** Implemented an end-to-end RAG pipeline for GDPR with Article-wise retrieval and free-model generation.

**Future work:**
- Improve Article segmentation with structure-aware parsing
- Add reranking (cross-encoder)
- Use a larger-context open-source instruction model
- Add automated evaluation with a curated QA set
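As a rough idea of how the reranking item could slot in after retrieval, the sketch below reorders hits using any function that scores (query, passage) pairs. `rerank` and `overlap_scorer` are hypothetical; the word-overlap scorer is only a stand-in so the example runs without a model download, and the hits are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple

PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]

def rerank(query: str, hits: List[Dict], score_pairs: PairScorer, top_n: int = 3) -> List[Dict]:
    """Re-score retrieved hits with a pairwise scorer and keep the best top_n."""
    pairs = [(query, h["text"]) for h in hits]
    scores = score_pairs(pairs)
    ranked = sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)
    return [dict(h, rerank_score=float(s)) for s, h in ranked[:top_n]]

def overlap_scorer(pairs):
    """Stand-in scorer: share of query words that also occur in the passage."""
    out = []
    for q, t in pairs:
        q_tokens, t_tokens = set(q.lower().split()), set(t.lower().split())
        out.append(len(q_tokens & t_tokens) / max(len(q_tokens), 1))
    return out

# Made-up hits in the same dict shape that retrieve() returns
hits = [
    {"article": 2, "text": "material scope of this regulation"},
    {"article": 6, "text": "lawfulness of processing personal data"},
]
top = rerank("lawfulness of processing", hits, overlap_scorer, top_n=1)
print(top[0]["article"])  # 6
```

With sentence-transformers, the `predict` method of a `CrossEncoder` model takes a list of (query, passage) pairs and could be passed as `score_pairs`; that combination is not tested here.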

## Appendix A: Gradio App Code

The following cell contains the complete Gradio app.


In [13]:
# gradio_app.py
# EU GDPR RAG Assistant

import os
import re
import numpy as np
import pandas as pd

import requests
from pypdf import PdfReader

from sentence_transformers import SentenceTransformer
import faiss

from transformers import pipeline
import gradio as gr


# -------------------- CONFIG --------------------
PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX%3A32016R0679"
DATA_DIR = "data"
RAW_DIR = os.path.join(DATA_DIR, "raw")
os.makedirs(RAW_DIR, exist_ok=True)

DEFAULT_PDF_PATH = os.path.join(RAW_DIR, "gdpr_2016_679_oj.pdf")
GDPR_SOURCE_URL = PDF_URL

EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
FREE_LLM_NAME = "google/flan-t5-base"  # can change to flan-t5-small for faster CPU


# -------------------- IO: DOWNLOAD PDF --------------------
def download_file(url: str, out_path: str, chunk_size: int = 1024 * 1024):
    if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
        return out_path
    r = requests.get(url, stream=True, timeout=120)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:
                f.write(chunk)
    return out_path


# -------------------- TEXT CLEANING --------------------
def clean_text(s: str) -> str:
    if not s:
        return ""

    s = s.replace("\u00ad", "")                 # soft hyphen
    s = re.sub(r"-\s*\n\s*", "", s)             # join hyphenated line breaks

    # keep line breaks (helpful for header detection)
    s = re.sub(r"\n{3,}", "\n\n", s)
    s = re.sub(r"[ \t]+", " ", s)

    # Join spaced letters: A r t i c l e -> Article
    s = re.sub(r"\b(?:[A-Za-z]\s){2,}[A-Za-z]\b",
               lambda m: m.group(0).replace(" ", ""), s)

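    # Join a word with a following 1-3 letter fragment (repeat for chained splits).
    # Caveat: this also merges genuine short words into the previous word
    # ("processing of" -> "processingof"); "Ar ticle" is targeted separately below.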
    for _ in range(4):
        s = re.sub(r"(?<!\w)([A-Za-z]{2,})\s+([A-Za-z]{1,3})(?!\w)", r"\1\2", s)

    s = re.sub(r"\bAr\s+ticle\b", "Article", s, flags=re.IGNORECASE)
    s = re.sub(r"\bChar\s+ter\b", "Charter", s, flags=re.IGNORECASE)

    # tidy spacing before punctuation
    s = re.sub(r"\s+([.,;:])", r"\1", s)

    return s.strip()


# -------------------- PDF -> PAGES --------------------
def extract_pages(pdf_path: str) -> pd.DataFrame:
    reader = PdfReader(pdf_path)
    rows = []
    for i, page in enumerate(reader.pages):
        txt = clean_text(page.extract_text() or "")
        rows.append({"page": i + 1, "text": txt, "source_url": GDPR_SOURCE_URL})
    df = pd.DataFrame(rows)
    df["n_chars"] = df["text"].str.len()
    return df


# -------------------- ARTICLE CHUNKING (ROBUST LINE-BASED) --------------------
ARTICLE_WORD = re.compile(
    r"(?i)^(?:A\s*R\s*T\s*I\s*C\s*L\s*E|ARTICLE|Article)\b\s*(\d+)?\b.*$"
)

def parse_article_number(s: str):
    s = s.strip()

    # allow: "49", "49.", "(49)", "(49)."
    m = re.match(r"^\(?\s*(\d{1,3})\s*\)?\.?\s*$", s)
    if m:
        return int(m.group(1))

    # allow spaced digits: "4 9" or "9 2"
    m2 = re.match(r"^(\d)\s+(\d)\s*$", s)
    if m2:
        return int(m2.group(1) + m2.group(2))

    return None


def chunk_by_articles(df_pages: pd.DataFrame, min_chars: int = 50, lookahead: int = 10):
    full_text = "\n".join(df_pages.sort_values("page")["text"].tolist())
    lines = full_text.splitlines()

    headers = []  # (line_index, article_number, heading_line)
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        m = ARTICLE_WORD.match(line)
        if m:
            num = m.group(1)
            heading = line

            # Look ahead for number if missing
            if num is None:
                for look in range(1, lookahead + 1):
                    if i + look >= len(lines):
                        break
                    cand = lines[i + look].strip()
                    maybe = parse_article_number(cand)
                    if maybe is not None and 1 <= maybe <= 99:
                        num = str(maybe)
                        heading = f"{heading} {cand}"
                        break

            if num is not None:
                art_num = int(num)
                if 1 <= art_num <= 99:
                    headers.append((i, art_num, heading))
        i += 1

    # Slice blocks between headers
    chunks = []
    for idx, (line_i, art_num, heading) in enumerate(headers):
        start = line_i
        end = headers[idx + 1][0] if idx + 1 < len(headers) else len(lines)
        block = "\n".join(lines[start:end]).strip()
        chunks.append({"article": art_num, "heading": heading, "text": block, "len": len(block)})

    df_raw = pd.DataFrame(chunks)
    if df_raw.empty:
        df_chunks = pd.DataFrame(columns=["chunk_id", "article", "heading", "text", "source_url"])
        return df_chunks, list(range(1, 100))

    # Deduplicate: keep the longest chunk per article (removes TOC duplicates / repeated headers)
    df_best = (
        df_raw.sort_values(["article", "len"], ascending=[True, False])
              .drop_duplicates("article", keep="first")
              .sort_values("article")
    )

    # Drop only tiny junk
    df_best = df_best[df_best["len"] >= min_chars]

    df_chunks = pd.DataFrame({
        "chunk_id": df_best["article"].apply(lambda x: f"article_{int(x)}"),
        "article": df_best["article"].astype(int),
        "heading": df_best["heading"],
        "text": df_best["text"],
        "source_url": GDPR_SOURCE_URL,
    })

    missing = sorted(set(range(1, 100)) - set(df_chunks["article"]))
    return df_chunks, missing


# -------------------- OPTIONAL: DETERMINISTIC ARTICLE 6 LIST EXTRACTION --------------------
def lawful_bases_from_article6(article6_text: str):
    """
    Extract ONLY the first (a)-(f) list in Article 6(1) (the lawful bases).
    Stops after (f) to avoid picking up later (a)-(f) lists in other paragraphs.
    """
    t = re.sub(r"\s{2,}", " ", article6_text.replace("\n", " ")).strip()

    # Start at "1." if present
    m1 = re.search(r"\b1\.\s", t)
    if m1:
        t = t[m1.start():]

    bullets = []
    for key in ["a", "b", "c", "d", "e", "f"]:
        m = re.search(rf"\({key}\)\s*(.*?)(?=\([a-f]\)\s*|$)", t, flags=re.IGNORECASE)
        if not m:
            return None
        content = m.group(1).strip().rstrip(";. ")
        bullets.append(f"- ({key}) {content}. (Article 6)")
        t = t[m.end():]

    return "\n".join(bullets)


# -------------------- BUILD INDEX (CACHE IN MEMORY) --------------------
_CACHE = {}

def build_or_get_index(pdf_path: str):
    """
    Builds: df_chunks, faiss_index, embedder, llm, missing
    Cached for speed.
    """
    key = os.path.abspath(pdf_path)
    if key in _CACHE:
        return _CACHE[key]

    df_pages = extract_pages(pdf_path)
    df_chunks, missing = chunk_by_articles(df_pages, min_chars=50, lookahead=10)

    embedder = SentenceTransformer(EMBED_MODEL_NAME)

    texts = df_chunks["text"].tolist()
    emb = embedder.encode(texts, normalize_embeddings=True, show_progress_bar=False).astype(np.float32)

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    llm = pipeline("text2text-generation", model=FREE_LLM_NAME, device=-1)

    _CACHE[key] = (df_chunks, index, embedder, llm, missing)
    return _CACHE[key]


# -------------------- RETRIEVER --------------------
def retrieve(df_chunks: pd.DataFrame, index, embedder, query: str, k: int = 6):
    q_emb = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
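    # Over-fetch candidates for rescoring, but never ask for more than the index holds
    n_candidates = max(1, min(k * 8, index.ntotal))
    scores, idxs = index.search(q_emb, n_candidates)

    q = query.lower()
    rescored = []
    for base, idx in zip(scores[0], idxs[0]):
        if idx < 0:  # FAISS pads missing results with -1
            continue
        row = df_chunks.iloc[int(idx)]
        bonus = 0.0
        # simple boost for lawful bases questions
        if ("lawful" in q or "legal basis" in q or "lawful bases" in q) and int(row["article"]) == 6:
            bonus += 0.5
        rescored.append((float(base) + bonus, int(idx)))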

    rescored.sort(reverse=True, key=lambda x: x[0])
    top = rescored[:k]

    out = []
    for score, i in top:
        row = df_chunks.iloc[i]
        out.append({
            "score": float(score),
            "chunk_id": row["chunk_id"],
            "article": int(row["article"]),
            "heading": row["heading"],
            "text": row["text"],
            "source_url": row["source_url"],
        })
    return out


# -------------------- GENERATION (FREE LLM) --------------------
def build_prompt(question: str, contexts: list, max_ctx: int = 1, per_ctx_chars: int = 700) -> str:
    blocks = []
    for c in contexts[:max_ctx]:
        excerpt = re.sub(r"\s{2,}", " ", c["text"].replace("\n", " ")).strip()[:per_ctx_chars]
        blocks.append(f"(Article {c['article']}) {excerpt}")

    return (
        "You are a GDPR document assistant.\n"
        "Use ONLY the provided context excerpts.\n"
        "Return 3-6 bullet points.\n"
        "Each bullet MUST end with a citation like (Article X).\n"
        "If the answer is not in the context, say: I could not find that in the provided GDPR excerpts.\n\n"
        f"Question: {question}\n\n"
        "Context:\n" + "\n".join(blocks) + "\n\n"
        "Answer:\n"
    )


def ensure_citations(answer: str, contexts: list) -> str:
    top_art = contexts[0]["article"] if contexts else None
    cite = f"(Article {top_art})" if top_art is not None else "(Article N/A)"

    lines = []
    for l in answer.splitlines():
        l = l.strip()
        if not l:
            continue
        if not re.sub(r"\(Article\s+\d+\)", "", l).strip():
            continue
        if not l.startswith("-"):
            l = "- " + l
        if "(Article" not in l:
            l += " " + cite
        lines.append(l)

    out = "\n".join(lines).strip()
    return out if out else answer.strip()


def truncate_answer(out: str, max_bullets: int = 8, max_chars: int = 1200) -> str:
    lines = [l for l in out.splitlines() if l.strip()]
    if len(lines) > max_bullets:
        out = "\n".join(lines[:max_bullets])
    if len(out) > max_chars:
        out = out[:max_chars].rstrip() + "..."
    return out


def rag_answer_core(pdf_path: str, question: str, top_k: int = 6, show_passages: bool = True):
    if not os.path.exists(pdf_path):
        return (
            f"❌ PDF not found at: `{pdf_path}`\n\n"
            f"Put the file there or download from: {PDF_URL}",
            ""
        )

    df_chunks, index, embedder, llm, missing = build_or_get_index(pdf_path)

    hits = retrieve(df_chunks, index, embedder, question, k=top_k)
    if not hits:
        return "I could not find that in the provided GDPR excerpts.", ""

    q = question.lower()

    # Deterministic Article 6 lawful bases extraction (optional)
    out = None
    if ("lawful" in q and ("basis" in q or "bases" in q or "legal basis" in q)) and hits[0]["article"] == 6:
        out = lawful_bases_from_article6(hits[0]["text"])

    if not out:
        prompt = build_prompt(question, hits, max_ctx=1, per_ctx_chars=700)
        raw = llm(prompt, max_new_tokens=180, do_sample=False)[0]["generated_text"]
        out = ensure_citations(raw, hits)

    out = truncate_answer(out, max_bullets=8, max_chars=1200)

    # Build "retrieved passages" text
    passages_md = ""
    if show_passages:
        parts = []
        for h in hits[:top_k]:
            snippet = re.sub(r"\s{2,}", " ", h["text"].replace("\n", " ")).strip()
            snippet = snippet[:900] + ("..." if len(snippet) > 900 else "")
            parts.append(
                f"### Article {h['article']} (score {h['score']:.3f})\n"
                f"**Heading:** {h['heading']}\n\n"
                f"{snippet}\n\n"
                f"Source: {h['source_url']}\n"
            )
        passages_md = "\n\n---\n\n".join(parts)

    # Optional: missing articles note (kept minimal; can remove if you don't want it)
    if missing:
        out += f"\n\n> Note: Some articles were not detected by PDF extraction/chunking: {missing}"

    return out, passages_md


# -------------------- GRADIO UI --------------------
SAMPLE_QUESTIONS = [
    "What are the lawful bases for processing personal data?",          # Article 6
    "What conditions must be met for consent to be valid under GDPR?",  # Article 7
    "What information must be provided under the right of access?",     # Article 15
    "When must a personal data breach be notified to the supervisory authority?",  # Article 33
    "What fines can be imposed under GDPR?",                            # Article 83
]


def fill_from_sample(x):
    return x or ""

def ui_answer(pdf_path, question, top_k, show_passages):
    ans, passages = rag_answer_core(pdf_path, question, top_k=top_k, show_passages=show_passages)
    return ans, passages


with gr.Blocks(title="EU GDPR RAG Assistant") as demo:
    gr.Markdown(
        """
# 📘 EU GDPR RAG Assistant (Gradio)

**Pipeline:** PDF → Article-wise chunks → embeddings + FAISS → retrieve → free LLM answer + citations  
⚠️ **Informational only**: not legal advice.
"""
    )

    with gr.Row():
        pdf_path_in = gr.Textbox(label="GDPR PDF path", value=DEFAULT_PDF_PATH)
        top_k_in = gr.Slider(label="Top-k retrieved Articles", minimum=3, maximum=10, value=6, step=1)
        show_passages_in = gr.Checkbox(label="Show retrieved passages", value=True)

    with gr.Row():
        sample_dd = gr.Dropdown(label="Sample questions", choices=[""] + SAMPLE_QUESTIONS, value="")
        question_in = gr.Textbox(label="Question", lines=2, placeholder="Ask a GDPR question...")

    sample_dd.change(fn=fill_from_sample, inputs=sample_dd, outputs=question_in)

    run_btn = gr.Button("Answer")

    answer_out = gr.Markdown(label="Answer")
    passages_out = gr.Markdown(label="Top Retrieved Articles")

    run_btn.click(
        fn=ui_answer,
        inputs=[pdf_path_in, question_in, top_k_in, show_passages_in],
        outputs=[answer_out, passages_out],
    )

    gr.Markdown(
        """
### Run notes
- If the PDF doesn't exist, download it from EUR-Lex and save it at the path above.
- You can also run this as a script:
  - `python gradio_app.py`
  - then open the printed local URL.
"""
    )


# -------------------- ENTRYPOINT --------------------
if __name__ == "__main__":
    # Ensure PDF is present (download if missing)
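    try:
        if not os.path.exists(DEFAULT_PDF_PATH):
            download_file(PDF_URL, DEFAULT_PDF_PATH)
    except Exception as exc:
        # Don't block the UI; the app reports a missing PDF when a question is asked
        print(f"Could not download the GDPR PDF: {exc}")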

    demo.launch(inline=False)


* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
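The app's chunker differs from the notebook version mainly in how it recovers an Article number that was extracted on a separate line from the word "Article". The sketch below copies `parse_article_number` from the app and wraps the lookahead loop in a hypothetical helper, `find_number`, so it can be tried on a made-up list of lines.

```python
import re

def parse_article_number(s: str):
    """Copy of the app helper: accept "49", "49.", "(49)", "(49)." and "4 9"."""
    s = s.strip()
    m = re.match(r"^\(?\s*(\d{1,3})\s*\)?\.?\s*$", s)
    if m:
        return int(m.group(1))
    m2 = re.match(r"^(\d)\s+(\d)\s*$", s)
    if m2:
        return int(m2.group(1) + m2.group(2))
    return None

def find_number(lines, start, lookahead=10):
    """Hypothetical wrapper: first valid Article number on the lines after `start`."""
    for cand in lines[start + 1 : start + 1 + lookahead]:
        n = parse_article_number(cand)
        if n is not None and 1 <= n <= 99:
            return n
    return None

# Made-up extraction where the header word and its number landed on different lines
lines = ["Article", "", "4 9", "Derogations for specific situations"]
print(find_number(lines, 0))                      # 49
print(parse_article_number("(49)."))              # 49
print(parse_article_number("Right of access"))    # None
```

A lookahead of ten lines is generous: a bare number further down the page (a paragraph number, for instance) could be mistaken for the Article number, so a smaller window may be safer.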


## Demo Launch

In [14]:
demo.launch(inline=True, prevent_thread_lock=True)

Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
* To create a public link, set `share=True` in `launch()`.


