

---

## Problem Statement for Ex5:

Organizations often need to analyze **large volumes of legal contracts** stored as PDFs, which contain nuanced clauses like **termination**, **non-compete**, and **confidentiality terms**. Purely semantic search (vector-based) or keyword search (BM25) alone may miss context or relevance.

**Objective:**
To build a more powerful **Hybrid RAG system** over multiple legal contracts that combines:

* **Semantic search** via FAISS for meaning-based retrieval
* **Keyword search** via BM25 (Whoosh) for exact match and legal precision
* **Instruction-tuned generation** using FLAN-T5 to answer questions using merged results

All achieved **without cloud APIs**, using open-source tools only.

---

## What's *additionally* happening compared to Ex4?

| Feature                   | Ex4: Multi-PDF RAG with FLAN-T5 | Ex5: Hybrid RAG over Legal Contracts                            |
| ------------------------- | ------------------------------- | --------------------------------------------------------------- |
| **Document Type**         | General HR policy PDFs          | Focused on **legal contracts**                                  |
| **Search Type**           | Only semantic (vector-based)    | **Hybrid retrieval** (semantic + BM25 keyword search)           |
| **Keyword Search Engine** | Not used                        | **Whoosh** (BM25-based inverted index)                          |
| **Retrieval Logic**       | FAISS top-k chunks              | Combines **FAISS + Whoosh**, removes duplicates                 |
| **Metadata Support**      | Minimal                         | Stores file-level metadata with chunks                          |
| **Summarization**         | Basic summary of top chunks     | Allows summarization of **each PDF** in a legal context         |
| **Use Case**              | HR Assistant                    | **Legal Contract QA System** with enhanced clause understanding |

---

## What this version demonstrates:

* A **hybrid retrieval pipeline** merging two fundamentally different techniques:

  * Semantic similarity (context awareness)
  * Exact term matching (clause compliance)
* A more **robust and accurate retrieval system** for documents where precision matters (legal, compliance, procurement).
* The **ability to summarize or extract clauses** from complex legal documents using FLAN-T5
* A scalable, script-based version that can be converted to a **Streamlit UI** or **API backend**

---

## Summary of Progression So Far:

| Version | Major Capability                                 |
| ------- | ------------------------------------------------ |
| **Ex1** | Naive RAG (script + static PDF)                  |
| **Ex2** | Streamlit UI                                     |
| **Ex3** | Dynamic PDF Upload                               |
| **Ex4** | Multi-PDF Upload + Summarization                 |
| **Ex5** | **Hybrid Retrieval (FAISS + BM25)** for Legal QA |

---

In [None]:
# Install dependencies
!pip install sentence-transformers faiss-cpu transformers PyPDF2 whoosh -q

# whoosh
# What is Whoosh?
# Whoosh is a fast, featureful pure-Python search engine library. It allows you to add search capabilities to your applications, enabling you to index and search text data efficiently.

In [3]:

# Imports
import os
import PyPDF2
import faiss
import numpy as np

# Whoosh imports
# Schema defines the structure of the indexed documents.
# TEXT is used for full-text search fields, and ID is used for unique identifiers.
# Precisely, we will use TEXT for the content of the documents and ID for their unique identifiers.
from whoosh.fields import Schema, TEXT, ID

# IndexWriter is used to add documents to the index.
# create_in is used to create a new index in a specified directory.
# Precisely, we will use IndexWriter to add documents to the index and create_in to create a new index in a specified directory.
from whoosh.index import create_in

# QueryParser is used to parse search queries.
# It allows you to define how queries are interpreted and executed against the indexed documents.
# Precisely, we will use QueryParser to parse user queries and search the indexed documents.
from whoosh.qparser import QueryParser


from sentence_transformers import SentenceTransformer
from transformers import pipeline
import tempfile # Temporary directory for storing files


In [5]:
# Step 1: Load & chunk PDFs from contracts/
def load_contract_chunks(folder_path, chunk_size=300):
    chunks = []
    filenames = os.listdir(folder_path)
    for filename in filenames:
        if filename.endswith(".pdf"):
            path = os.path.join(folder_path, filename)
            reader = PyPDF2.PdfReader(path)
            text = ""
            for page in reader.pages:
                text += page.extract_text()
            text = text.replace("\n", " ")
            for i in range(0, len(text), chunk_size):
                chunk = text[i:i + chunk_size]
                chunks.append((chunk, filename))
    return chunks

chunks_with_meta = load_contract_chunks("contracts", chunk_size=300)
chunks = [c[0] for c in chunks_with_meta]
print(f"Loaded {len(chunks)} chunks from contracts.")


Loaded 466 chunks from contracts.


In [6]:

# Step 2: FAISS vector index (semantic search)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, convert_to_tensor=False)
dimension = len(embeddings[0])
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(np.array(embeddings))
print("FAISS index ready.")


  return forward_call(*args, **kwargs)


FAISS index ready.


In [7]:

# Step 3: BM25 index using Whoosh (keyword search)
schema = Schema(content=TEXT(stored=True), path=ID(stored=True))
# content=TEXT(stored=True) means that the content of the document will be indexed and stored for retrieval.
# path=ID(stored=True) means that the path of the document will be indexed and stored for retrieval.
# This allows us to retrieve the original document path when searching.
# stored=True means that the field will be stored in the index and can be retrieved later.
# This is useful for retrieving the original content or metadata of the indexed documents.
# ID is used for unique identifiers, which allows us to retrieve the original document path when searching.
# TEXT is used for full-text search fields, which allows us to perform keyword searches on the content of the documents.
# Precisely, we will use TEXT for the content of the documents and ID for their unique identifiers.
# create_in creates a new index in the specified directory.
# It takes the directory path and the schema as arguments.
# The index will be created in the specified directory, and the schema defines the structure of the indexed documents.
# The index will be used to store the indexed documents and perform keyword searches.


index_dir = tempfile.mkdtemp() # Create a temporary directory for the index
ix = create_in(index_dir, schema) # Create a new index in the temporary directory
# ix is the index object that will be used to add documents and perform searches.
# ix.writer() creates a writer object that allows you to add documents to the index.
# The writer object is used to add documents to the index and commit changes.
writer = ix.writer()
for i, (chunk, fname) in enumerate(chunks_with_meta):
    writer.add_document(content=chunk, path=fname)
writer.commit()
print("Whoosh BM25 index ready.")


Whoosh BM25 index ready.


In [9]:
# View the first indexed document
with ix.searcher() as searcher:
    first_doc = searcher.document()
    print(f"First indexed document: {first_doc['content'][:100]}... (path: {first_doc['path']})")

First indexed document:   Page 1 Sample Contract    Contract No.___________  PROFESSIONAL SERVICES AGREEMENT      THIS AGREE... (path: 1SampleCo1ntract-Shuttle.pdf)


In [10]:

# Step 4: Hybrid retrieval
def hybrid_retrieve(query, top_k=3):
    # 1. Vector search (semantic)
    q_vec = embedder.encode([query])
    _, indices = faiss_index.search(np.array(q_vec), top_k)
    semantic_results = [chunks[i] for i in indices[0]]
    
    # 2. Keyword search (BM25)
    with ix.searcher() as searcher:
        parser = QueryParser("content", schema=ix.schema)
        parsed_query = parser.parse(query)
        results = searcher.search(parsed_query, limit=top_k)
        keyword_results = [r['content'] for r in results]

    # Merge and dedupe
    hybrid_results = list(dict.fromkeys(semantic_results + keyword_results))
    return hybrid_results[:top_k]


In [11]:

# Step 5: LLM - FLAN-T5
qa = pipeline("text2text-generation", model="google/flan-t5-base", max_length=256)


Device set to use mps:0


In [12]:

# Step 6: Answer query with hybrid context
def answer_query(query):
    contexts = hybrid_retrieve(query)
    full_context = "\n".join(contexts)
    prompt = f"Context:\n{full_context}\n\nQuestion: {query}\n\nAnswer:"
    print(f"Prompt:\n{prompt}\n")
    # Generate answer using the LLM
    result = qa(prompt)[0]["generated_text"]
    return result.strip()


In [13]:
# Test queries
print("\n🧠 Query 1:")
print(answer_query("What does the termination clause say?"))



🧠 Query 1:
Prompt:
Context:
 to perform his  obligations when that performance falls due.  9. At law, the right to terminate for breach arises  in three situations:   ( a )  r e p u d i a t i o n  –  w h e r e  a  p a r t y  e v i n c e s  a  c l e a r  a n d absolute refusal to  perform;   ( b )  i m p o s s i b i l i t y  – 
 performance. However, termination does not release the injured party from  his duty to perform obligations which accrued befor e termination. If the  injured party fails to exercise his option to termi nate, or positively affirms the  contract, the contract remains in force and each pa rty is bound
ght to terminate is distinct  from a common law right to terminate for breach, wh ich is discussed below).   B TERMINATION                                                    13 Shirlaw v Southern Foundries (1926) Ltd  [1939] 2 KB 206, 227 per MacKinnon LJ.  14 The Moorcock [1889] 14 PB 64. (i) Breac

Question: What does the termination clause say?

Answer:

does not 

In [14]:
print("\n🧠 Query 2:")
print(answer_query("Explain the non-compete obligations."))


🧠 Query 2:
Prompt:
Context:
n sells his business along with the goodwill  to another person, agrees  not to carry on same line of business in certain reasonable local limits, such an  agreement is valid.    An agreement of service  through which an employee commits not to compete with his  employer is not in restraint of trad
perform his obligations under a contr act. Breach brings an end  to the obligations created by a contract.     F) Discharge by impossibility of performance  –   Impossibility of performance results in the discharge of the contract. An agreement which is  impossible is void, because law does not comp
or other forms of compensation; a nd selection for training (including apprenticeship),  employment, upgrading, demotion, or transfer. The CONSULTANT agrees to post in conspicuous places,  available to employees and applicants for employme nt, notice setting forth the provisions of this non- discrim

Question: Explain the non-compete obligations.

Answer:

An agreeme

In [15]:
print("\n🧠 Query 3:")
print(answer_query("Describe the confidentiality terms."))



🧠 Query 3:
Prompt:
Context:
erms may be implied from th e nature of the relationship  between the parties – for example, contracts for pr ofessional services require  the professional to act with reasonable standards o f competence, a lawyer  must act in his client's best interests and a docto r has a duty of confidentiality  
oses a special duty to act with the utmost good faith  i.e., to disclose all material information. Failure to disclose such information will render the  contract voidable at the option of the ot her party   Examples –  a) Contract of insurance of all kinds   b) Company prospectus   c) Contract for t
f which the consideration or  object is unlawful – Section 23  - Agreement in restraint of legal  proceedings – Section 28   - Agreement of which the consideration or  object is unlawful in part – Section 24  - Agreements void for uncertainty –  Section 29     - Agreement made without consideration 

Question: Describe the confidentiality terms.

Answer:

a docto r h