# LLM Applications on a Legal Corpus (Python + Hugging Face)

This notebook demonstrates end-to-end LLM-based applications on a corpus of Brazilian bills (Projetos de Lei, PLs). It assumes a dataframe named `corpus_docs` with columns `id` and `text`.

**Modules shown:**
1. Setup & data load
2. Embeddings & Semantic Search (Sentence-Transformers)
3. Zero-shot Classification (XNLI)
4. Named Entity Recognition (NER)
5. Summarization (mT5)
6. Extractive QA (span-based)
7. RAG: Retrieval-Augmented Generation (Retriever + Generator)
8. Controlled Generation (bulleted summaries)
9. Anonymization via NER

---
### Intuition vs. Technique (Quick Recap)
- **Embeddings**: map texts/queries to a vector space; nearest neighbors = semantic similarity.
- **Zero-shot**: cast labels as natural-language hypotheses and compute entailment.
- **NER**: token classification to tag entities; aggregate subword spans.
- **Summarization**: encoder-decoder seq2seq; hierarchical for long docs.
- **Extractive QA**: predict start/end indices of an answer span in a context.
- **RAG**: retrieve top-k chunks → condition a generator on them → answer with citations.
- **Controlled Generation**: an instruction-tuned model steered by the prompt, with decoding limits such as a token cap.
- **Anonymization**: detect entities and mask them; combine with regex rules when needed.


## 0) Setup
Install dependencies. If running on Colab/Kaggle, you can use `pip`. On local environments, consider creating a venv/conda env first.

⚠️ **Note**: Restart the kernel after installing or upgrading packages.

In [None]:
# !pip install -U transformers==4.44.2 sentence-transformers==3.0.1 torch accelerate
# !pip install -U faiss-cpu evaluate numpy pandas umap-learn hdbscan
# Optional for PDF parsing: pdfminer.six or pypdf
# !pip install -U pdfminer.six

## 1) Imports & Utilities

In [None]:
from typing import List, Dict, Any
import numpy as np
import pandas as pd
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import textwrap

def chunk_text(text: str, chunk_size: int = 1200) -> List[str]:
    """Split text into chunks of about chunk_size characters, breaking at sentence ends.

    A single sentence longer than chunk_size is kept whole, so a chunk can exceed the limit.
    """
    if text is None:
        return []
    text = str(text)
    if len(text) <= chunk_size:
        return [text]
    # Naive sentence split on '. '; it also breaks after abbreviations such as 'Art.'
    sents = text.replace('\r', ' ').split('. ')
    out, cur = [], ''
    for i, s in enumerate(sents):
        # split() removed the period from every piece except the last one; restore it
        s2 = s + '.' if i < len(sents) - 1 else s
        if len(cur) + len(s2) + 1 <= chunk_size:
            cur = (cur + ' ' + s2).strip()
        else:
            if cur:
                out.append(cur)
            cur = s2
    if cur:
        out.append(cur)
    return out

def cosine_sim_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-9)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-9)
    return A @ B.T


## 2) Load your corpus
The loader expects a CSV with the columns `id` and `text`. If you already have a dataframe, skip the load and assign it to `corpus_docs` directly.

In [None]:
# Example: load from CSV (one id per row)
# corpus_docs = pd.read_csv("corpus_docs.csv")
# For demo purposes, build a tiny placeholder:
corpus_docs = pd.DataFrame({
    'id': ['PL1','PL2'],
    'text': [
        'Altera o Código Penal para estabelecer novas regras... Justificativa: ...',
        'Dispõe sobre política pública de saúde... Art. 1º ... prazo de 30 dias ...'
    ]
})
len(corpus_docs), corpus_docs.head()

## 3) Embeddings & Semantic Search (Bi-Encoder)
**Intuition.** Map documents and queries into the same vector space; nearest neighbors ≈ semantically closest.

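**Technique.** A Sentence-Transformers bi-encoder trained with contrastive/NLI objectives encodes each text independently; ranking uses cosine (or dot-product) similarity, and approximate nearest-neighbor (ANN) indexes such as FAISS keep search fast on large corpora.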


In [None]:
st_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedder = SentenceTransformer(st_model_name)

docs = corpus_docs['text'].fillna('').astype(str).tolist()
doc_emb = embedder.encode(docs, normalize_embeddings=True)
doc_emb = np.asarray(doc_emb)

def semantic_search(query: str, k: int = 5) -> pd.DataFrame:
    qv = embedder.encode([query], normalize_embeddings=True)
    S = cosine_sim_matrix(np.asarray(qv), doc_emb)[0]
    idx = np.argsort(-S)[:k]  # positions of the k highest scores
    # idx holds row positions, so use iloc; loc would fail on a non-default index
    hits = corpus_docs.iloc[idx]
    return pd.DataFrame({
        'rank': np.arange(1, len(idx)+1),
        'id': hits['id'].values,
        'score': S[idx],
        'snippet': [textwrap.shorten(str(t), width=180) for t in hits['text']]
    })

semantic_search("prisões cautelares", k=5)
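
To make the ranking step concrete without loading a model, here is a minimal sketch that uses hand-made 3-dimensional vectors. The vectors, the extra id `PL3`, and the helper names `l2_normalize` and `top_k` are invented for illustration. After L2 normalization the dot product equals cosine similarity, which is why encoding with `normalize_embeddings=True` lets the search reduce to a matrix product.

```python
import heapq
import math
from typing import Dict, List, Tuple

def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def top_k(query: List[float], docs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (doc_id, cosine) pairs most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in docs.items():
        d = l2_normalize(vec)
        # For unit vectors, the dot product is the cosine similarity
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy "embeddings" (hypothetical values, not produced by any model)
toy_docs = {
    "PL1": [1.0, 0.0, 0.0],
    "PL2": [0.0, 1.0, 0.0],
    "PL3": [1.0, 1.0, 0.0],
}
hits = top_k([2.0, 0.1, 0.0], toy_docs, k=2)
print(hits)
```

The query points almost along the first axis, so `PL1` ranks first and `PL3`, which shares that axis, ranks second.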

## 4) Zero-shot Classification (XNLI)
**Intuition.** Turn labels into hypotheses (natural language), then measure entailment probability.

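**Technique.** An NLI head estimates P(entailment | premise=text, hypothesis=label). With `multi_label=True`, each label is scored independently, so the scores do not sum to 1; assign every label whose score clears its threshold.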


In [None]:
zs = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
labels = ["segurança pública","tributação","direitos humanos","direito penal",
          "processo civil","política social","administração pública"]

def classify_zero_shot(text: str, candidate_labels=labels) -> pd.DataFrame:
    out = zs(text, candidate_labels=candidate_labels, multi_label=True)
    return pd.DataFrame({"label": out["labels"], "score": out["scores"]}).sort_values("score", ascending=False)

classify_zero_shot(corpus_docs.loc[0,'text'])
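
The two tunable pieces mentioned above, the label wording and the per-label threshold, can be sketched in plain Python. The Portuguese template, the scores, and the threshold values below are assumptions chosen for illustration; `format_hypotheses` and `select_labels` are hypothetical helpers. The Hugging Face zero-shot pipeline accepts a `hypothesis_template` argument, so a template like this one could be passed to `zs(...)` and compared against the default English wording.

```python
from typing import Dict, List

def format_hypotheses(labels: List[str], template: str = "Este texto é sobre {}.") -> List[str]:
    """Build one NLI hypothesis per label from a verbalizer template."""
    return [template.format(label) for label in labels]

def select_labels(scores: Dict[str, float],
                  thresholds: Dict[str, float],
                  default: float = 0.5) -> List[str]:
    """Keep labels whose score meets their own threshold, highest score first."""
    kept = [(lab, s) for lab, s in scores.items() if s >= thresholds.get(lab, default)]
    return [lab for lab, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]

# Invented scores, shaped like the label/score pairs returned by classify_zero_shot
example_scores = {"direito penal": 0.91, "segurança pública": 0.62, "tributação": 0.08}
# A stricter cut-off for one label; all other labels fall back to the default
example_thresholds = {"segurança pública": 0.70}

print(format_hypotheses(["direito penal"]))
print(select_labels(example_scores, example_thresholds))
```

With these numbers only `direito penal` is assigned: `segurança pública` misses its stricter 0.70 cut-off and `tributação` misses the default.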

## 5) Named Entity Recognition (NER)
**Intuition.** Tag names of people, organizations, and locations to enrich metadata or anonymize.

**Technique.** Token classification; aggregate subword spans into entity spans.

In [None]:
ner = pipeline("token-classification", model="Davlan/xlm-roberta-base-ner-hrl", aggregation_strategy="simple")

def ner_extract(text: str) -> pd.DataFrame:
    cols = ["entity", "word", "score", "start", "end"]
    rows = [{
        'entity': e.get('entity_group', e.get('entity')),
        'word': e['word'], 'score': e['score'],
        'start': e['start'], 'end': e['end'],
    } for e in ner(text)]
    return pd.DataFrame(rows, columns=cols)

ner_extract(corpus_docs.loc[0,'text'])
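
Span aggregation can be illustrated on word-level BIO tags. This is a simplified sketch, not the library's implementation: the pipeline's `aggregation_strategy="simple"` works on subword tokens, while the hypothetical `aggregate_bio` below assumes tokens already carry character offsets and a `B-`/`I-`/`O` tag. The tagged sentence is made up.

```python
from typing import Dict, List, Optional, Tuple

def aggregate_bio(tokens: List[Tuple[str, str, int, int]]) -> List[Dict[str, object]]:
    """Merge (text, tag, start, end) tokens tagged B-X / I-X / O into entity spans."""
    spans: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for _text, tag, start, end in tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current["entity"] == label:
            current["end"] = end              # extend the open entity
            continue
        if current is not None:
            spans.append(current)             # close the open entity
            current = None
        if prefix in {"B", "I"}:              # a stray I- tag also opens a span
            current = {"entity": label, "start": start, "end": end}
    if current is not None:
        spans.append(current)
    return spans

sentence = "João da Silva visitou Brasília."
tagged = [
    ("João", "B-PER", 0, 4), ("da", "I-PER", 5, 7), ("Silva", "I-PER", 8, 13),
    ("visitou", "O", 14, 21), ("Brasília", "B-LOC", 22, 30), (".", "O", 30, 31),
]
entities = aggregate_bio(tagged)
for e in entities:
    e["word"] = sentence[e["start"]:e["end"]]  # recover the surface form from offsets
print(entities)
```

Keeping character offsets on each span is what later makes masking by position possible.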

## 6) Summarization (mT5)
**Intuition.** Ask a seq2seq model to compress the core content of each document.

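**Technique.** mT5 (multilingual) fine-tuned on XL-Sum, a news summarization dataset. For long texts, go hierarchical: chunk → local summaries → meta-summary. `summarize_doc` below performs the first two steps and joins the local summaries; it does not run the final meta-summary pass.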

In [None]:
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

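def summarize_doc(text: str, min_len: int=60, max_len: int=180, chunk_chars: int=2500) -> str:
    """Summarize each chunk and join the local summaries with newlines."""
    parts = []
    for ch in chunk_text(text, chunk_chars):
        if not ch.strip():
            continue
        # truncation=True keeps an over-long chunk within the model's input limit
        out = summarizer(ch, min_length=min_len, max_length=max_len, truncation=True)
        parts.append(out[0]['summary_text'])
    return "\n".join(parts)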

print(summarize_doc(corpus_docs.loc[0,'text']))
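
The full chunk → local summaries → meta-summary flow can be sketched with a pluggable summarizer. Everything here is an assumption made for illustration: `hierarchical_summary` and its `max_chars` parameter are hypothetical, and `stub_summarizer` stands in for a real model call so the example runs without downloading anything.

```python
from typing import Callable, List

def hierarchical_summary(chunks: List[str],
                         summarize_fn: Callable[[str], str],
                         max_chars: int = 2500) -> str:
    """Summarize each chunk, shrink the joined summaries until they fit, then summarize once more."""
    summaries = [summarize_fn(c) for c in chunks if c.strip()]
    if not summaries:
        return ""
    joined = "\n".join(summaries)
    # Re-summarize in groups while the intermediate text is still too long
    while len(joined) > max_chars and len(summaries) > 1:
        grouped, cur = [], ""
        for s in summaries:
            if cur and len(cur) + len(s) + 1 > max_chars:
                grouped.append(cur)
                cur = s
            else:
                cur = (cur + "\n" + s).strip()
        grouped.append(cur)
        if len(grouped) == len(summaries):    # nothing could be merged; stop
            break
        summaries = [summarize_fn(g) for g in grouped]
        joined = "\n".join(summaries)
    return summarize_fn(joined)               # final meta-summary pass

calls: List[str] = []

def stub_summarizer(text: str) -> str:
    """Stand-in for a real model: record the input and keep its first 40 characters."""
    calls.append(text)
    return text[:40]

demo_chunks = [
    "Altera o Código Penal para estabelecer novas regras sobre prisão cautelar.",
    "Fixa prazo de 30 dias para regulamentação pelo Poder Executivo.",
]
meta = hierarchical_summary(demo_chunks, stub_summarizer)
print(len(calls), "summarizer calls")
```

In the notebook, `summarize_fn` could be a small wrapper around `summarizer(...)` that returns `out[0]['summary_text']`.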

## 7) Extractive QA (span-based)
**Intuition.** For factoid questions, predict the exact answer span from a given context.

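**Technique.** Machine reading comprehension (MRC): the model scores every token as a possible start and end of the answer. It only sees the context you pass in, so select an appropriate chunk first.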

In [None]:
qa_extractive = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

# choose a simple context (for demo); in practice select best chunk via semantic search
context = corpus_docs.loc[1,'text']
qa_extractive(question="Qual é o prazo mencionado?", context=context)
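
The start/end selection can be shown on made-up scores. This is a sketch of the general idea, assuming one start score and one end score per token; `best_span`, `max_len`, and the numbers below are illustrative and do not reproduce the pipeline's exact post-processing.

```python
from typing import List, Tuple

def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j] with i <= j < i + max_len."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best

tokens = ["O", "prazo", "é", "de", "30", "dias", "."]
start = [0.1, 0.2, 0.0, 0.3, 2.5, 0.4, 0.0]   # hypothetical start scores
end = [0.0, 0.1, 0.0, 0.1, 0.9, 2.8, 0.2]     # hypothetical end scores

i, j, score = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # → 30 dias
```

The constraint `i <= j` rules out spans that end before they start, and `max_len` prevents the model from returning an implausibly long answer.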

## 8) RAG: Retrieval-Augmented Generation
**Intuition.** Retrieve the most relevant chunks, then generate an answer strictly grounded in that context (with citations).

**Technique.** Bi-encoder retrieval (Sentence-Transformers) + instruction-tuned generator (e.g., FLAN-T5).

In [None]:
# Build chunk-level index
def make_chunk_df(df: pd.DataFrame, chunk_chars: int=900) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        chunks = chunk_text(row['text'], chunk_chars)
        for i, ch in enumerate(chunks, start=1):
            if ch.strip():
                rows.append({'id': row['id'], 'chunk_id': i, 'text': ch})
    return pd.DataFrame(rows)

corpus_chunks = make_chunk_df(corpus_docs, chunk_chars=900)
chunk_emb = embedder.encode(corpus_chunks['text'].tolist(), normalize_embeddings=True)
chunk_emb = np.asarray(chunk_emb)

def retrieve(query: str, k: int=4) -> pd.DataFrame:
    qv = embedder.encode([query], normalize_embeddings=True)
    S = cosine_sim_matrix(np.asarray(qv), chunk_emb)[0]
    idx = np.argsort(-S)[:k]
    return pd.DataFrame({
        'rank': np.arange(1, len(idx)+1),
        'id': corpus_chunks.loc[idx, 'id'].values,
        'chunk_id': corpus_chunks.loc[idx, 'chunk_id'].values,
        'score': S[idx],
        'text': corpus_chunks.loc[idx, 'text'].values
    })

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_with_context(question: str, k: int=4, max_new_tokens: int=220) -> Dict[str, Any]:
    ctx = retrieve(question, k)
    # Each chunk is cut to 500 characters; the prompt can still exceed the generator's
    # input limit for large k, so lower k or width if the tokenizer warns about length.
    prompt = (
        "Responda em português, usando APENAS o CONTEXTO a seguir. Cite trechos entre aspas.\n" +
        f"Pergunta: {question}\n\nContexto:\n" +
        "\n".join(["- " + textwrap.shorten(t, width=500) for t in ctx['text'].tolist()]) +
        "\n\nResposta:" )
    out = generator(prompt, max_new_tokens=max_new_tokens)
    return {"answer": out[0]['generated_text'], "retrieved": ctx}

qa = answer_with_context("Quais dispositivos são alterados?", k=3)
qa['answer']
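
One way to make citations traceable is to label each retrieved chunk with its source before it enters the prompt. The sketch below is an assumption layered on the retrieval output: `build_cited_prompt`, the `[id#chunk_id]` label format, and the `budget_chars` limit are all hypothetical, and the two example chunks are invented.

```python
from typing import Dict, List

def build_cited_prompt(question: str,
                       chunks: List[Dict[str, object]],
                       budget_chars: int = 1500) -> str:
    """Label each chunk as [id#chunk_id] and stop adding context once the budget is used up."""
    lines: List[str] = []
    used = 0
    for ch in chunks:                          # chunks are assumed sorted by relevance
        entry = f"[{ch['id']}#{ch['chunk_id']}] {ch['text']}"
        if lines and used + len(entry) > budget_chars:
            break                              # the top chunk is always kept
        lines.append(entry)
        used += len(entry)
    return (
        "Responda em português, usando APENAS o CONTEXTO a seguir. "
        "Indique a fonte entre colchetes ao final de cada afirmação.\n"
        f"Pergunta: {question}\n\nContexto:\n" + "\n".join(lines) + "\n\nResposta:"
    )

# Invented chunks, shaped like rows of the dataframe returned by retrieve()
retrieved = [
    {"id": "PL1", "chunk_id": 1, "text": "Altera o art. 312 do Código Penal."},
    {"id": "PL2", "chunk_id": 3, "text": "Fixa prazo de 30 dias para regulamentação."},
]
prompt = build_cited_prompt("Quais dispositivos são alterados?", retrieved)
print(prompt)
```

With the notebook's objects, the chunk list could come from `retrieve(question, k).to_dict("records")`. Whether a small generator actually reproduces the labels in its answer is something to verify on real outputs.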

## 9) Controlled Generation (bulleted summaries)
**Intuition.** Instruct a model to produce concise, structured bullet points in plain language.

**Technique.** Instruction-tuned generation with decoding constraints (max tokens, repetition penalty if needed).

In [None]:
def bullet_summary(text: str, bullets: int=5) -> str:
    # 4000 characters is a generous cap; reduce it if the generator ignores the end of long bills
    prompt = (
        f"Explique o seguinte Projeto de Lei em português simples, em {bullets} tópicos curtos e objetivos.\n\n" +
        textwrap.shorten(text, width=4000) +
        "\n\nTópicos:" )
    out = generator(prompt, max_new_tokens=bullets*60)
    return out[0]['generated_text']

print(bullet_summary(corpus_docs.loc[0,'text'], bullets=5))
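
Prompts and token limits do not guarantee a clean list, so a light post-processing step can enforce the format. The helper below is a heuristic sketch: `normalize_bullets` is hypothetical, the raw string is invented, and splitting on inline ` - ` markers can also split legitimate dashes in legal text.

```python
import re
from typing import List

def normalize_bullets(raw: str, bullets: int = 5) -> List[str]:
    """Split generated text into at most `bullets` cleaned items, dropping duplicates."""
    # Split on newlines, or on whitespace that precedes a marker such as '-', '•', '*' or '1.'
    parts = re.split(r"\n+|\s+(?=[-•*]\s)|\s+(?=\d+[.)]\s)", raw.strip())
    items: List[str] = []
    seen = set()
    for p in parts:
        # Remove the leading marker, if any
        item = re.sub(r"^\s*(?:[-•*]|\d+[.)])\s*", "", p).strip()
        key = item.lower()
        if item and key not in seen:
            seen.add(key)
            items.append(item)
    return items[:bullets]

# Invented model output with an inline marker, a repeated item, and a numbered item
raw_output = "- Altera o Código Penal - Define novas regras\n- Define novas regras\n3. Fixa prazo de 30 dias"
cleaned = normalize_bullets(raw_output, bullets=3)
for item in cleaned:
    print("- " + item)
```

Applied to the string returned by `bullet_summary`, this yields a list that can be rendered consistently regardless of how the model formatted its answer.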

## 10) Anonymization via NER
**Intuition.** Detect entities and mask them for privacy while preserving analytical utility.

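**Technique.** Run NER and replace spans from end to start to avoid offset drift. Inputs longer than the model's maximum length may be cut off before tagging, so mask long documents chunk by chunk.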

In [None]:
def mask_pii(text: str) -> str:
    ents = ner(text)
    if not ents:
        return text
    out = text
    # reverse by start index to avoid shifting
    for e in sorted(ents, key=lambda x: x['start'], reverse=True):
        label = e.get('entity_group', e.get('entity'))
        if label in {"PER","ORG","LOC"}:  # adjust as needed
            s, t = int(e['start']), int(e['end'])
            out = out[:s] + '█'*(t-s) + out[t:]
    return out

mask_pii("João da Silva, do Ministério da Justiça, apresentou proposta em Brasília.")
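
NER models are not trained to catch structured identifiers, which is where regex rules help. The sketch below assumes formatted CPF and CNPJ numbers and e-mail addresses as the targets; the patterns are illustrative rather than exhaustive (unformatted digit-only identifiers are not matched), and `PII_PATTERNS` and `mask_patterns` are hypothetical names.

```python
import re

# Formatted Brazilian identifiers and e-mail addresses (illustrative patterns)
PII_PATTERNS = {
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "CNPJ": re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def mask_patterns(text: str, mask_char: str = "█") -> str:
    """Replace every regex match with mask characters of the same length."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(lambda m: mask_char * len(m.group(0)), text)
    return text

sample = "CPF 123.456.789-09, contato: joao.silva@example.org"
masked = mask_patterns(sample)
print(masked)
```

Because each match is replaced by a mask of equal length, the text keeps its original offsets, so the function can be applied to the output of `mask_pii` without any re-alignment.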

---
### Notes & Tips
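- For Portuguese summarization, `csebuetnlp/mT5_multilingual_XLSum` tends to produce more natural PT than BART-CNN, which was trained on English news.
- For better RAG faithfulness, add a cross-encoder re-ranker to re-score the top-k retrieved chunks before generation.
- Tune chunk sizes (900–1500 chars) and overlaps (10–20%); note that `chunk_text` above produces non-overlapping chunks.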
- Zero-shot is sensitive to label wordings; treat verbalizers and thresholds as hyperparameters.
- Consider quantization (int8/int4) for larger generators (Llama/Mistral) if you have GPU constraints.
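
The overlap tip can be tried with a fixed-size sliding window. This is a minimal sketch under a simplifying assumption: it cuts on character positions and ignores sentence boundaries, unlike `chunk_text`. The name `chunk_with_overlap` and its parameters are hypothetical.

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 900, overlap: float = 0.15) -> List[str]:
    """Fixed-size character windows where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    overlap_chars = int(round(chunk_size * overlap))
    step = max(1, chunk_size - overlap_chars)    # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                                # the last window reached the end
    return chunks

demo = "".join(str(i % 10) for i in range(25))   # "0123456789012345678901234"
pieces = chunk_with_overlap(demo, chunk_size=10, overlap=0.2)
print(pieces)  # → ['0123456789', '8901234567', '678901234']
```

Each chunk starts with the last two characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is wide enough.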
