# FAISS-Based RAG System Implementation

This notebook implements a Retrieval-Augmented Generation (RAG) system using FAISS for efficient similarity search and OpenAI for response generation. The system provides legal document search and question-answering capabilities.

## Overview
- Install and set up FAISS (a vector similarity search library) and `rank-bm25`
- Implement a RAG system that combines BM25 keyword search with semantic search
- Generate embeddings for legal documents using Legal-BERT
- Build the query processing and response generation pipeline
- Integrate with the OpenAI API for natural-language responses
- Test end-to-end search and question answering on a corpus of Sri Lankan parliamentary bills

## 1. Installation and Dependencies

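Install `faiss-cpu` for vector similarity search and `rank-bm25` for BM25 keyword ranking. The other packages imported below (`sentence-transformers`, `openai`, `pandas`, `joblib`, `gdown`) are assumed to be preinstalled in the Kaggle environment.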

In [1]:
!pip install faiss-cpu rank-bm25

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25, faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1 rank-bm25-0.2.2
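
Before moving on, it can be useful to confirm that the packages import cleanly. The helper below is a minimal sketch using only the standard library; `missing_modules` is a hypothetical name, and the module list assumes the import names used by this notebook (`faiss`, `rank_bm25`, `sentence_transformers`, `openai`).

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Import names, not pip names: faiss-cpu installs `faiss`,
# rank-bm25 installs `rank_bm25`.
required = ["faiss", "rank_bm25", "sentence_transformers", "openai"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, installing it with `pip` in a cell of its own before running the rest of the notebook should be enough.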


## 2. RAG System Class Implementation

Implement the core RAG system class with FAISS indexing, BM25 ranking, and OpenAI integration for legal document retrieval and question answering.

In [2]:
import os
import re
import pandas as pd
import numpy as np
import faiss
import joblib
from typing import List, Tuple
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from rank_bm25 import BM25Okapi


class RAGSystem:
    def __init__(self, tsv_path: str, openai_api_key: str, reload: bool = False):
        """
        Initialize the RAG system with TSV file and OpenAI API key.
        
        Args:
            tsv_path (str): Path to the TSV file (compressed .tsv.gz)
            openai_api_key (str): OpenAI API key
            reload (bool): Force re-indexing
        """
        self.model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')
        self.client = OpenAI(api_key=openai_api_key)
        self.index = None
        self.documents = []
        self.metadata = []
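        # Read the embedding size from the model instead of hard-coding it
        # (768 for Legal-BERT base).
        self.dimension = self.model.get_sentence_embedding_dimension()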
        self.bm25 = None
        self.bm25_corpus = []

        self.load_and_index_documents(tsv_path, reload)

    def load_and_index_documents(self, tsv_path: str, reload: bool = False) -> None:
        base_path = os.path.splitext(os.path.splitext(tsv_path)[0])[0]  # remove .tsv.gz
        index_path = f"{base_path}.faiss"
        data_path = f"{base_path}_data.pkl"
        bm25_path = f"{base_path}_bm25.pkl"

        # Load cached data if it exists and reload not requested
        if (
            not reload
            and os.path.exists(index_path)
            and os.path.exists(data_path)
            and os.path.exists(bm25_path)
        ):
            self.index = faiss.read_index(index_path)
            data = joblib.load(data_path)
            self.documents = data['documents']
            self.metadata = data['metadata']
            self.bm25_corpus = joblib.load(bm25_path)
            self.bm25 = BM25Okapi(self.bm25_corpus)
            return

        # Otherwise, load TSV and index from scratch
        df = pd.read_csv(tsv_path, sep="\t", compression="gzip")
        if not all(col in df.columns for col in ['name', 'type', 'content']):
            raise ValueError("TSV must contain 'name', 'type', and 'content' columns")

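        # Replace missing content with empty strings so encoding and BM25
        # tokenization do not fail on NaN values.
        self.documents = df['content'].fillna('').astype(str).tolist()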
        self.metadata = df[['name', 'type']].to_dict('records')

        # Create embeddings and index
        embeddings = self.model.encode(self.documents, batch_size=32, show_progress_bar=True)
        self.index = faiss.IndexFlatL2(self.dimension)
        self.index.add(np.array(embeddings, dtype=np.float32))

        # Save FAISS index and data
        faiss.write_index(self.index, index_path)
        joblib.dump({'documents': self.documents, 'metadata': self.metadata}, data_path)

        # Build and save BM25 index
        self.build_bm25()
        joblib.dump(self.bm25_corpus, bm25_path)

    def build_bm25(self):
        self.bm25_corpus = [re.findall(r"\w+", doc.lower()) for doc in self.documents]
        self.bm25 = BM25Okapi(self.bm25_corpus)

    def retrieve(self, query: str, k: int = 5) -> List[Tuple[str, dict, float]]:
        """Return up to k documents ranked by embedding similarity (FAISS)."""
        query_embedding = self.model.encode([query])[0]
        distances, indices = self.index.search(np.array([query_embedding], dtype=np.float32), k)
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            # FAISS pads the result with -1 when fewer than k vectors are
            # indexed; without the lower bound, -1 would select the last document.
            if 0 <= idx < len(self.documents):
                # IndexFlatL2 returns squared L2 distances; map them to (0, 1].
                score = float(1 / (1 + distance))
                results.append((self.documents[idx], self.metadata[idx], score))
        return results

    def bm25_retrieve(self, query: str, k: int = 5) -> List[Tuple[str, dict, float]]:
        # BM25 keyword retrieval
        tokens = re.findall(r"\w+", query.lower())
        scores = self.bm25.get_scores(tokens)
        top_indices = np.argsort(scores)[::-1][:k]
        results = []
        for i in top_indices:
            if scores[i] > 0:
                results.append((self.documents[i], self.metadata[i], float(scores[i])))
        return results

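    def hybrid_retrieve(self, query: str, k: int = 5) -> List[Tuple[str, dict, float]]:
        """Merge FAISS and BM25 results, de-duplicated by document text.

        Caveat: the two scores are on different scales. FAISS similarities
        lie in (0, 1], while raw BM25 scores are unbounded, so the merged
        ranking tends to favor BM25 hits.
        """
        faiss_docs = self.retrieve(query, k)
        bm25_docs = self.bm25_retrieve(query, k)

        # FAISS results take precedence when both retrievers return a document
        combined = {doc[0]: doc for doc in faiss_docs}
        for doc in bm25_docs:
            if doc[0] not in combined:
                combined[doc[0]] = doc

        # Sort descending by score and limit to k
        return sorted(combined.values(), key=lambda x: -x[2])[:k]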

    def generate_response(self, query: str, retrieved_docs: List[Tuple[str, dict, float]]) -> str:
        context = "\n\n".join([f"Document: {doc[0]}\nMetadata: {doc[1]}" for doc in retrieved_docs])
        prompt = f"""You are a legal assistant powered by a RAG system. Use the following context to answer the query accurately and concisely. If the context doesn't provide enough information, say so.

Context:
{context}

Query:
{query}

Answer:
"""
        response = self.client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[
                {"role": "system", "content": "You are a helpful legal assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=500,
            temperature=0.7
        )
        return response.choices[0].message.content.strip()

    def query(self, query: str, k: int = 5) -> dict:
        retrieved_docs = self.hybrid_retrieve(query, k)
        answer = self.generate_response(query, retrieved_docs)
        return {
            "query": query,
            "answer": answer,
            "retrieved_documents": [
                {"content": doc[0], "metadata": doc[1], "score": doc[2]}
                for doc in retrieved_docs
            ]
        }

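
`hybrid_retrieve` sorts FAISS similarities, which fall in (0, 1], together with raw BM25 scores, which are unbounded, so the two are not directly comparable. One common alternative is reciprocal rank fusion (RRF), which combines rankings rather than scores. The sketch below is not part of the class above; the function name `reciprocal_rank_fusion`, the constant `k_rrf=60`, and the document ids are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k_rrf: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked lists of document ids; each list is ordered best-first.

    A document's fused score is the sum over lists of 1 / (k_rrf + rank),
    with rank starting at 1. Only ranks are used, so score scales cannot clash.
    """
    fused: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k_rrf + rank)
    # Highest fused score first; ties keep first-appearance order (stable sort).
    return sorted(fused.items(), key=lambda item: -item[1])


# Hypothetical document ids returned by the two retrievers.
faiss_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_d", "doc_a"]
for doc_id, score in reciprocal_rank_fusion([faiss_ranking, bm25_ranking]):
    print(f"{doc_id}: {score:.5f}")
```

Using this with `RAGSystem` would require the two retrieval methods to return document indices (or another stable id) rather than only text, which is a change to the class as written.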


## 3. Data Preparation and Chunking

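Download the bills dataset and split each document into overlapping word-based chunks: 500 words per chunk with a 100-word overlap, so a new chunk starts every 400 words. Smaller chunks give the retriever more focused passages to match against. Note that a 500-word chunk will usually tokenize to more than the 512 tokens a BERT-base model such as Legal-BERT accepts, so the end of a chunk may be truncated when it is embedded. The chunked table is written back to `bills.tsv.gz`, overwriting the download.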

In [3]:
!gdown 1--7L-BtJwrQB7yXcfPgV9_zQfn9nK-S5

df = pd.read_csv("bills.tsv.gz", sep="\t", compression="gzip")
df = df.rename(columns={"filename": "name"})
df["type"] = "bill"

def chunk_df_by_words(df, chunk_size=500, overlap=100):
    chunks = []
    for _, row in df.iterrows():
        words = row['content'].split()
        for i in range(0, len(words), chunk_size - overlap):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append({
                'name': row['name'],
                'type': row['type'],
                'chunk': i // (chunk_size - overlap),
                'content': chunk
            })
    return pd.DataFrame(chunks)

df = chunk_df_by_words(df)

df.to_csv('bills.tsv.gz', sep='\t', index=False, compression='gzip')

df.head()

Downloading...
From: https://drive.google.com/uc?id=1--7L-BtJwrQB7yXcfPgV9_zQfn9nK-S5
To: /kaggle/working/bills.tsv.gz
100%|██████████████████████████████████████| 6.17M/6.17M [00:00<00:00, 93.6MB/s]


| | name | type | chunk | content |
|---|---|---|---|---|
| 0 | 2010-10-16-2010_E.txt | bill | 0 | THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBL... |
| 1 | 2010-10-16-2010_E.txt | bill | 1 | Local Authorities Elections Ordinance (Cap. 26... |
| 2 | 2010-10-16-2010_E.txt | bill | 2 | Order made under section 3C of the Local Autho... |
| 3 | 2010-10-16-2010_E.txt | bill | 3 | beginning from the words “Where a budget or su... |
| 4 | 2010-10-16-2010_E.txt | bill | 4 | and fraction, the integer shall be deemed to b... |
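
To see how the window arithmetic works, here is a small standalone sketch of the same word-window scheme on a toy input. `chunk_words` is a hypothetical helper, not the function used in the cell above. Unlike `chunk_df_by_words`, it stops once a window reaches the end of the text, so it does not emit a final chunk that is fully contained in the previous one.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already covers the tail of the text
    return chunks


# Toy input: ten "words", windows of 4 with an overlap of 1 (step of 3).
toy = " ".join(f"w{i}" for i in range(10))
for number, chunk in enumerate(chunk_words(toy, chunk_size=4, overlap=1)):
    print(number, chunk)
```

Each window starts `chunk_size - overlap` words after the previous one, which is why the chunk number in the notebook is computed as `i // (chunk_size - overlap)`.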


## 4. RAG System Testing and Query Processing

Initialize the RAG system with the processed data and test its functionality with sample legal queries.

In [4]:
from kaggle_secrets import UserSecretsClient
openai_api_key = UserSecretsClient().get_secret("openai_api_key")
tsv_path = "bills.tsv.gz"

# Initialize RAG system
rag = RAGSystem(tsv_path, openai_api_key)

# Example query
query = "What are the main objectives of the Jayanthipura association in community welfare and environment?"
result = rag.query(query, k=3)

# Print results
print(f"Query: {result['query']}")
print(f"Answer: {result['answer']}")
print("\nRetrieved Documents:")
for doc in result['retrieved_documents']:
    print(f"\nContent: {doc['content'][:100]}...")
    print(f"Metadata: {doc['metadata']}")
    print(f"Score: {doc['score']:.4f}")

Query: What are the main objectives of the Jayanthipura association in community welfare and environment?
Answer: The main objectives of the Jayanthipura association in community welfare and environment are to induce, foster, and promote the mutual welfare of its members and their dependents; protect and safeguard their rights, liberty, and privileges; and ensure that the environment in which they reside is not adversely affected. Specifically, they aim to prevent damage, destruction, or encroachment upon natural surroundings, greenery, and wetlands, and to maintain a congenial atmosphere. Additionally, they seek to protect and provide support during natural calamities, floods, cyclones, earthquakes, and fires, thereby contributing to community welfare and environmental preservation.

Retrieved Documents:

Content: PARLIAMENT OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA JAYANTHIPURA SUBASADAKA SANGAMAYA (INCO...
Metadata: {'name': '2011-5-21-2011_E.txt', 'type': 'bill'}
Score: 42.

## 5. Additional Testing and Performance Validation

Perform additional tests to validate the RAG system's performance and accuracy with various legal queries.
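
The notebook contains no code for this section. One way such a validation could be set up is a hit-rate check: for each test query, record whether an expected source file appears among the top-k retrieved documents. The sketch below is an assumption about how that might look; `hit_rate_at_k`, the stub retriever, and the example queries and file names are hypothetical placeholders, not results from this dataset.

```python
from typing import Callable, Dict, List, Tuple

# A retriever maps (query, k) to a list of (content, metadata, score) tuples,
# matching the return shape of RAGSystem.hybrid_retrieve.
Retriever = Callable[[str, int], List[Tuple[str, dict, float]]]


def hit_rate_at_k(retriever: Retriever, expected: Dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose expected file name appears in the top-k results."""
    if not expected:
        return 0.0
    hits = 0
    for query, expected_name in expected.items():
        names = [meta.get("name") for _, meta, _ in retriever(query, k)]
        hits += expected_name in names
    return hits / len(expected)


def stub_retriever(query: str, k: int) -> List[Tuple[str, dict, float]]:
    """Placeholder that always returns the same two made-up documents."""
    docs = [
        ("text one", {"name": "file_1.txt", "type": "bill"}, 0.9),
        ("text two", {"name": "file_2.txt", "type": "bill"}, 0.5),
    ]
    return docs[:k]


# Hypothetical labelled queries: query -> file that should be retrieved.
labelled = {"query about file 1": "file_1.txt", "query about file 3": "file_3.txt"}
print(f"hit rate@3: {hit_rate_at_k(stub_retriever, labelled, k=3):.2f}")
```

With a real system, `rag.hybrid_retrieve` could be passed in place of `stub_retriever`, together with a hand-labelled set of queries for the bills corpus.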