# Jaanch Lite - Full Pipeline Demo

This notebook demonstrates the simplified legal document intelligence pipeline:

1. **Parse** - Landing AI ADE (grounded chunks with bounding boxes)
2. **Embed** - Voyage AI voyage-law-2 (legal-specific embeddings)
3. **Extract Citations** - Hybrid regex + Instructor/Pydantic
4. **Verify Citations** - Against pre-indexed acts library
5. **Search** - RAG with Voyage rerank-2.5 (instruction-following)

## Comparison with Full Jaanch

| Component | Full Jaanch | Jaanch Lite |
|-----------|-------------|-------------|
| Lines of Code | ~50,000 | ~1,000 |
| Bbox Linking | 71 files | 0 (native from ADE) |
| Embeddings | OpenAI ada-002 | Voyage voyage-law-2 |
| Reranking | Cohere | Voyage rerank-2.5 |

## Setup

Install dependencies and configure API keys.

In [None]:
# Install dependencies (run once)
# !pip install -e ..

In [None]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

# Load environment variables
load_dotenv()

# Verify API keys are set
assert os.getenv('VISION_AGENT_API_KEY'), "Set VISION_AGENT_API_KEY in .env"
assert os.getenv('VOYAGE_API_KEY'), "Set VOYAGE_API_KEY in .env"
assert os.getenv('OPENAI_API_KEY'), "Set OPENAI_API_KEY in .env"

print("API keys configured!")

## 1. Parse Document with Landing AI ADE

ADE provides:
- OCR for scanned PDFs
- Chunking with semantic boundaries
- **Native bounding boxes** (no fuzzy matching!)
- Table and figure extraction
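
Because each chunk carries its own bounding box, highlighting it on a page image is a coordinate conversion rather than a text search. A minimal sketch follows; `ExampleChunk` and `bbox_to_pixels` are hypothetical names, the fields mirror the attributes printed in the cells below, and the bbox layout (normalized `left, top, right, bottom` in the 0..1 range) is an assumption, not the confirmed ADE format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExampleChunk:
    """Hypothetical stand-in for a grounded chunk returned by the parser."""
    chunk_id: str
    page: int
    chunk_type: str
    text: str
    # Assumed layout: (left, top, right, bottom), normalized to 0..1
    bbox: Tuple[float, float, float, float]


def bbox_to_pixels(
    bbox: Tuple[float, float, float, float],
    page_width: int,
    page_height: int,
) -> Tuple[int, int, int, int]:
    """Scale a normalized bbox to pixel coordinates on a rendered page."""
    left, top, right, bottom = bbox
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )


example = ExampleChunk(
    chunk_id="c1",
    page=1,
    chunk_type="text",
    text="Section 138 of the Negotiable Instruments Act, 1881",
    bbox=(0.1, 0.2, 0.5, 0.25),
)
pixel_box = bbox_to_pixels(example.bbox, page_width=1000, page_height=2000)
print(pixel_box)
```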

In [None]:
from src.parsers.ade_parser import parse_document

# Parse a sample document
# Replace with your own legal document
SAMPLE_PDF = Path("../data/samples/sample_legal_doc.pdf")

if SAMPLE_PDF.exists():
    chunks = parse_document(
        SAMPLE_PDF,
        document_id="sample_doc",
        matter_id="demo_matter"
    )
    print(f"Parsed {len(chunks)} chunks")
else:
    print(f"Sample file not found: {SAMPLE_PDF}")
    print("Place a legal PDF in data/samples/ to test")
    chunks = []

In [None]:
# Examine a chunk with visual grounding
if chunks:
    chunk = chunks[0]
    print(f"Chunk ID: {chunk.chunk_id}")
    print(f"Page: {chunk.page}")
    print(f"Type: {chunk.chunk_type}")
    print(f"BBox: {chunk.bbox}")
    print(f"\nText preview:\n{chunk.text[:500]}...")

## 2. Extract Citations (Hybrid Regex + LLM)

Two-stage extraction:
1. **Fast regex** - catches ~80% of citations
2. **LLM validation** - handles edge cases with Instructor/Pydantic
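
The regex stage can be sketched with the standard `re` module. The pattern, `RegexCitation`, and `extract_with_regex` below are hypothetical illustrations, not the actual `CitationExtractor` internals. The pattern only covers the `Section N of <Act>, <year>` and `u/s N of <Act>` forms; a short form such as "Section 34 IPC" is deliberately left unmatched, which is the kind of case the LLM stage would pick up.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern, not the project's real regex.
CITATION_RE = re.compile(
    r"(?:Section|Sec\.|u/s)\s+(?P<section>\d+[A-Z]?)\s+of\s+(?:the\s+)?"
    r"(?P<act>[A-Z][A-Za-z\s]+?(?:Act|Code))(?:,\s*(?P<year>\d{4}))?"
)


@dataclass
class RegexCitation:
    section: str
    act_name: str
    raw_text: str


def extract_with_regex(text: str) -> List[RegexCitation]:
    """Return every citation the pattern finds, in document order."""
    citations = []
    for match in CITATION_RE.finditer(text):
        # Collapse line breaks inside the act name into single spaces
        act = " ".join(match.group("act").split())
        if match.group("year"):
            act = f"{act}, {match.group('year')}"
        citations.append(
            RegexCitation(match.group("section"), act, match.group(0))
        )
    return citations


demo_text = (
    "Charged under Section 302 of the Indian Penal Code, 1860 and "
    "u/s 420 of the Indian Penal Code; also Section 34 IPC."
)
for cit in extract_with_regex(demo_text):
    print(f"Section {cit.section} of {cit.act_name}")
```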

In [None]:
from src.citations.extractor import CitationExtractor

# Initialize extractor
extractor = CitationExtractor(
    use_llm=True,  # Set to False for regex-only mode
    model="gpt-4o-mini"  # Fast and cheap
)

# Test on sample text
sample_text = """
The accused was charged under Section 302 of the Indian Penal Code, 1860 
for murder. The prosecution also invoked Section 34 IPC for common intention.

During the trial, the defense cited Article 21 of the Constitution regarding 
the right to fair trial. The court also considered Section 138 of the 
Negotiable Instruments Act, 1881 in relation to the dishonored cheque.

Under u/s 420 of the Indian Penal Code, the accused was also charged with cheating.
"""

result = extractor.extract(sample_text)
print(f"Found {len(result.citations)} citations\n")

for cit in result.citations:
    print(f"- Section {cit.section} of {cit.act_name}")
    print(f"  Confidence: {cit.confidence:.2f}, Method: {cit.extraction_method}")

In [None]:
# Extract from document chunks (with visual grounding)
if chunks:
    doc_citations = extractor.extract_from_chunks(chunks)
    print(f"Found {len(doc_citations.citations)} citations in document\n")
    
    for cit in doc_citations.citations[:5]:
        print(f"- S. {cit.section} {cit.act_name}")
        print(f"  Page: {cit.source_page}, BBox: {cit.source_bbox}\n")

## 3. Voyage AI Embeddings

This step uses `voyage-law-2`, an embedding model tuned for legal text, including:
- Legal terminology
- Citation formats
- Statutory language
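
Embeddings are compared by vector similarity. A minimal sketch of cosine similarity in plain Python, using toy 3-dimensional vectors as stand-ins for real `voyage-law-2` output (which is much longer). `cosine_similarity` is an illustrative helper, not part of `VoyageEmbedder`.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
query_vec = [1.0, 0.0, 1.0]
cheque_vec = [0.9, 0.1, 0.8]   # pretend: a text about cheque dishonour
murder_vec = [0.0, 1.0, 0.1]   # pretend: a text about murder

print(cosine_similarity(query_vec, cheque_vec) > cosine_similarity(query_vec, murder_vec))
```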

In [None]:
from src.embeddings.voyage import VoyageEmbedder

# Initialize embedder
embedder = VoyageEmbedder(model="voyage-law-2")

# Embed some legal texts
texts = [
    "Section 138 of the Negotiable Instruments Act deals with dishonour of cheque",
    "The accused was convicted under Section 302 IPC for murder",
    "Article 21 guarantees the right to life and personal liberty",
]

embeddings = embedder.embed_documents(texts)
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {len(embeddings[0])}")

In [None]:
# Embed a query
query = "What is the punishment for cheque bounce?"
query_embedding = embedder.embed_query(query)
print(f"Query embedding dimension: {len(query_embedding)}")

## 4. Voyage AI Reranking

This step uses `rerank-2.5`, which supports instruction-following:
you can specify which type of legal document to prioritize.
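
One common way to steer an instruction-following reranker is to prepend the instruction to the query string. The sketch below shows that idea only; whether `VoyageReranker` builds its query this way is an assumption, and `EXAMPLE_INSTRUCTIONS` and `build_instructed_query` are hypothetical names whose contents may differ from the real `LEGAL_RERANK_INSTRUCTIONS`.

```python
from typing import Optional

# Hypothetical instruction texts, written for this sketch
EXAMPLE_INSTRUCTIONS = {
    "statutes": "Prioritize statutory provisions over case law and commentary.",
    "case_law": "Prioritize judicial decisions over statutes and commentary.",
}


def build_instructed_query(query: str, category: Optional[str] = None) -> str:
    """Prepend the category's instruction to the query, if one is defined."""
    instruction = EXAMPLE_INSTRUCTIONS.get(category) if category else None
    if not instruction:
        # No category, or an unknown one: send the query unchanged
        return query
    return f"{instruction}\nQuery: {query}"


print(build_instructed_query("What is the punishment for cheque bounce?", "statutes"))
```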

In [None]:
from src.embeddings.voyage import VoyageReranker, LEGAL_RERANK_INSTRUCTIONS

# Initialize reranker
reranker = VoyageReranker(model="rerank-2.5")

# Sample documents to rerank
documents = [
    "Section 138 of NI Act provides for punishment of imprisonment up to 2 years",
    "In Kumar v. State, the court held that cheque bounce is a serious offense",
    "The drawer of a dishonored cheque shall be deemed to have committed an offense",
    "Legal commentary on Section 138 suggests strict liability",
    "The Negotiable Instruments Act, 1881 was amended in 2015",
]

query = "What is the punishment for cheque bounce?"

# Rerank with instruction to prioritize statutes
results = reranker.rerank(
    query=query,
    documents=documents,
    top_k=3,
    instruction=LEGAL_RERANK_INSTRUCTIONS["statutes"]
)

print(f"Query: {query}\n")
print("Reranked results (prioritizing statutes):\n")
for r in results:
    print(f"Score: {r['relevance_score']:.3f}")
    print(f"Text: {r['document']}\n")

## 5. Build Acts Library

Index Indian Central Acts for citation verification.
This is a one-time setup.
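
Citations often use short forms such as "IPC" or "NI Act", so an index of this kind usually needs a way to map them to a canonical act name. A minimal sketch of such a lookup follows. Only the `canonical_name` field appears in this notebook; the `aliases` field, `EXAMPLE_ACTS`, and both helper functions are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical entries; the real KNOWN_ACTS structure may differ
EXAMPLE_ACTS = [
    {
        "canonical_name": "Indian Penal Code, 1860",
        "aliases": ["IPC", "Indian Penal Code"],
    },
    {
        "canonical_name": "Negotiable Instruments Act, 1881",
        "aliases": ["NI Act", "Negotiable Instruments Act"],
    },
]


def normalize(name: str) -> str:
    """Lowercase, drop full stops, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def build_alias_index(acts: List[dict]) -> Dict[str, str]:
    """Map every normalized name or alias to its canonical act name."""
    index = {}
    for act in acts:
        for name in [act["canonical_name"], *act["aliases"]]:
            index[normalize(name)] = act["canonical_name"]
    return index


def resolve_act(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical name, or None if the act is not in the index."""
    return index.get(normalize(name))


alias_index = build_alias_index(EXAMPLE_ACTS)
print(resolve_act("N.I. Act", alias_index))
```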

In [None]:
from src.acts.indexer import ActsIndexer
from src.acts.india_code import KNOWN_ACTS

# Show known acts
print(f"Known acts: {len(KNOWN_ACTS)}\n")
for act in KNOWN_ACTS[:5]:
    print(f"- {act['canonical_name']}")
print("...")

In [None]:
# Initialize acts indexer
# Note: Run scripts/index_acts.py to build the full index
acts_indexer = ActsIndexer()
print(f"Acts index stats: {acts_indexer.get_stats()}")

## 6. Verify Citations

Check if citations reference real sections in actual acts.
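
The core check can be sketched as two lookups: does the act exist in the index, and does that act contain the cited section? The example below uses an in-memory dictionary and exact matching. `ExampleStatus`, `EXAMPLE_INDEX`, and `verify_citation` are hypothetical, and the sketch omits the similarity score that `ActsVerifier` reports.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class ExampleStatus(Enum):
    """Hypothetical status values; the real enum may differ."""
    VERIFIED = "verified"
    SECTION_NOT_FOUND = "section_not_found"
    ACT_NOT_FOUND = "act_not_found"


# Toy index: act name -> {section number -> section heading}
EXAMPLE_INDEX: Dict[str, Dict[str, str]] = {
    "Negotiable Instruments Act, 1881": {
        "138": "Dishonour of cheque for insufficiency, etc., of funds in the account.",
    },
}


def verify_citation(
    act_name: str,
    section: str,
    index: Dict[str, Dict[str, str]] = EXAMPLE_INDEX,
) -> Tuple[ExampleStatus, Optional[str]]:
    """Return a status and, when verified, the matched section text."""
    sections = index.get(act_name)
    if sections is None:
        return ExampleStatus.ACT_NOT_FOUND, None
    text = sections.get(section)
    if text is None:
        return ExampleStatus.SECTION_NOT_FOUND, None
    return ExampleStatus.VERIFIED, text


status, matched = verify_citation("Negotiable Instruments Act, 1881", "138")
print(status.value, "-", matched)
```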

In [None]:
from src.acts.verifier import ActsVerifier
from src.core.models import Citation

# Initialize verifier
verifier = ActsVerifier()

# Create a citation to verify
citation = Citation(
    act_name="Negotiable Instruments Act, 1881",
    section="138",
    raw_text="Section 138 of NI Act",
    confidence=0.9,
)

# Verify
result = verifier.verify(citation)
print(f"Citation: S. {citation.section} of {citation.act_name}")
print(f"Status: {result.status.value}")
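# Format the score separately: a conditional is not valid inside a format spec
similarity = f"{result.similarity_score:.2f}" if result.similarity_score is not None else "N/A"
print(f"Similarity: {similarity}")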
if result.matched_text:
    print(f"\nMatched text:\n{result.matched_text[:300]}...")

## 7. Full RAG Search

The search step combines:
- ChromaDB vector store
- Voyage embeddings
- Voyage reranking
- Visual grounding from ADE
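
The way these pieces fit together is a two-stage search: a cheap vector lookup shortlists candidates, then a reranker reorders only that shortlist. A minimal sketch follows, with a Python list standing in for ChromaDB, toy 2-dimensional vectors standing in for Voyage embeddings, and a toy scoring function standing in for the Voyage reranker. All names here are hypothetical and do not reflect the `RAGSearch` internals.

```python
import math
from typing import Callable, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve_then_rerank(
    query_vec: List[float],
    store: List[Tuple[str, List[float]]],
    rerank_score: Callable[[str], float],
    fetch_k: int = 4,
    top_k: int = 2,
) -> List[str]:
    # Stage 1: vector similarity over the whole store, keep fetch_k candidates
    candidates = sorted(
        store, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )[:fetch_k]
    # Stage 2: rerank only the shortlist, keep top_k
    reranked = sorted(candidates, key=lambda item: rerank_score(item[0]), reverse=True)
    return [text for text, _ in reranked[:top_k]]


def prefer_statutes(text: str) -> float:
    """Toy reranker: favor texts that begin with a section reference."""
    return 1.0 if text.startswith("Section") else 0.0


toy_store = [
    ("Commentary on cheque bounce cases", [0.95, 0.05]),
    ("Section 138 NI Act: punishment for dishonour of cheque", [0.8, 0.2]),
    ("Section 302 IPC: punishment for murder", [0.1, 0.9]),
]
top = retrieve_then_rerank([1.0, 0.0], toy_store, prefer_statutes, fetch_k=2, top_k=1)
print(top)
```

In this toy run the commentary is closest by vector similarity, but the rerank stage promotes the statute; the murder provision never reaches the reranker because stage 1 filters it out.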

In [None]:
from src.search.rag import RAGSearch

# Initialize RAG search
rag = RAGSearch()

# Add document chunks
if chunks:
    count = rag.add_document(chunks)
    print(f"Added {count} chunks to document store")

In [None]:
# Search with reranking
if chunks:
    query = "What are the key legal issues?"
    
    results = rag.search(
        query=query,
        matter_id="demo_matter",
        top_k=3,
        rerank=True,
        legal_category="statutes"
    )
    
    print(f"Query: {query}\n")
    print(f"Results: {len(results.results)} (reranked: {results.reranked})\n")
    
    for r in results.results:
        print(f"--- Result #{r.rank} ---")
        print(f"Page: {r.page}, Score: {r.score:.3f}")
        print(f"BBox: {r.bbox}")
        print(f"Text: {r.chunk.text[:200]}...\n")

## Summary

This notebook demonstrated the Jaanch Lite pipeline:

| Step | Component | What it does |
|------|-----------|-------------|
| 1 | Landing AI ADE | Parse PDF with native grounding |
| 2 | Instructor/Pydantic | Extract citations with schema |
| 3 | Voyage voyage-law-2 | Legal-specific embeddings |
| 4 | Voyage rerank-2.5 | Instruction-following reranking |
| 5 | ChromaDB | Vector storage |
| 6 | Acts Library | Citation verification |

**Total complexity: ~1,000 lines of code** (vs ~50,000 in full Jaanch)