# Bills Preprocessing Pipeline

This notebook cleans the text of legislative bills, previously extracted from PDF files, and prepares it for embedding generation and indexing in the RAG system.

### Overview
- Download the bills and gazettes data files
- Load the bills dataset and inspect the extracted text
- Clean and normalize the OCR-extracted text
- Save the processed data as a gzip-compressed TSV file

In [None]:
import numpy as np
import pandas as pd

## 1. Data Download

Download the gazettes and bills data files (`gazettes.tsv.gz` and `bills.tsv.gz`) from Google Drive using `gdown`.

In [2]:
!gdown 1boSwHHySw3H6AttBX2iTxrNUnRfFgB_e
!gdown 1--7L-BtJwrQB7yXcfPgV9_zQfn9nK-S5

Downloading...
From (original): https://drive.google.com/uc?id=1boSwHHySw3H6AttBX2iTxrNUnRfFgB_e
From (redirected): https://drive.google.com/uc?id=1boSwHHySw3H6AttBX2iTxrNUnRfFgB_e&confirm=t&uuid=387b8bc6-1ac9-40ec-a96b-09c3a49101a7
To: /kaggle/working/gazettes.tsv.gz
100%|████████████████████████████████████████| 111M/111M [00:02<00:00, 42.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1--7L-BtJwrQB7yXcfPgV9_zQfn9nK-S5
To: /kaggle/working/bills.tsv.gz
100%|██████████████████████████████████████| 6.17M/6.17M [00:00<00:00, 6.79MB/s]


## 2. Data Loading and Inspection

Load the bills dataset and examine the text content structure for preprocessing requirements.

In [8]:
df = pd.read_csv("bills.tsv.gz", sep="\t", compression="gzip")


In [9]:
text = df['content'].iloc[5]
text

"THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA Part II of May 07, 2010 SUPPLEMENT (Issued on 10. 05. 2010) PINA ORGANISATION (INCORPORATION) BILL to incorporate the Pina Organization To be presented in Parliament by Hon. Tissa Attanayake, M.P. for Kandy District PRINTED AT THE DEPARTMENT OF GOVERNMENT PRINTING, SRI LANKA TO BE PURCHASED AT THE GOVERNMENT PUBLICATIONS BUREAU, COLOMBO 5. Price : Rs. 6.00 Postage : Rs. 5.00 (Private Member's Bill)\n\nPina Organisation (Incorporation) 2 -PL 004825-80 (05/2010) Short title. Preamble. AN ACT TO INCORPORATE THE PINA ORGANISATION WHEREAS an Organisation called and known as the “Pina Organisation” has been established in Sri Lanka, for the purpose of effectually carrying out and transacting all objects and matters connected with the said Organisation according to the rules agreed to by its members: AND WHEREAS the said Organisation has heretofore successfully carried out and transacted several objects and matters for which it wa
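
Before writing cleaning rules, it can help to quantify what the raw text looks like. The following is a minimal sketch, not part of the original notebook: `summarize_texts` and the sample values are hypothetical, and it uses plain Python so it runs without the dataset. With the real data, one could pass `df["content"].tolist()` instead of `sample`.

```python
from statistics import median
from typing import Iterable, Optional

def summarize_texts(texts: Iterable[Optional[str]]) -> dict:
    """Collect simple statistics that hint at which cleaning steps are needed."""
    rows = list(texts)
    # Keep only real, non-blank strings; None and NaN values count as missing
    valid = [t for t in rows if isinstance(t, str) and t.strip()]
    return {
        "rows": len(rows),
        "missing_or_blank": len(rows) - len(valid),
        "median_chars": median(len(t) for t in valid) if valid else 0,
        # Texts containing characters outside ASCII (for example curly quotes)
        "with_non_ascii": sum(not t.isascii() for t in valid),
        # Texts containing blank lines (page breaks in the sample above)
        "with_blank_lines": sum("\n\n" in t for t in valid),
    }

# Hypothetical stand-in values modelled on the sample output above
sample = [
    "Price : Rs. 6.00 (Private Member's Bill)\n\nShort title. Preamble.",
    "known as the \u201cPina Organisation\u201d",
    None,
    "   ",
]
summary = summarize_texts(sample)
print(summary)
```

Counts like these indicate whether missing values, non-ASCII characters, or page breaks are common enough to need dedicated handling.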

## 3. Text Preprocessing and Cleaning

Implement text cleaning functions to process OCR-extracted content and prepare it for embedding generation.

In [22]:
import re
import unicodedata
from typing import Optional

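def normalize_unicode(text: str) -> str:
    """Convert text to ASCII: decompose characters (NFKD), then drop anything with no ASCII equivalent."""
    try:
        # NFKD splits accented letters into base letter + combining mark, so the base letter survives
        return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
    except Exception:
        return text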

def normalize_whitespace(text: str) -> str:
    """Normalize whitespace, preserving essential structure."""
    try:
        # Replace multiple spaces with a single space
        text = re.sub(r'[ \t]+', ' ', text)
        # Normalize newlines (keep single newlines, remove excessive ones)
        text = re.sub(r'\n{2,}', '\n', text)
        return text.strip()
    except re.error:
        return text.strip()

def clean_legal_metadata(text: str) -> str:
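    """Remove repetitive header and footer lines while preserving unique legal content and presenter details.

    Every pattern is anchored with ^ and $, so it only matches metadata that sits on a line of its own.
    """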
    patterns = [
        # Gazette headers
        r'^THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA\s*Part\s+[IVX]+\s*of\s+[A-Za-z]+\s+\d+,\s+\d+\s*SUPPLEMENT\s*$',
        r'^\(Issued\s+on\s+\d+\.\s*\d+\.\s*\d+\)\s*$',
        # Printing and purchase information
        r'^PRINTED AT THE DEPARTMENT OF GOVERNMENT PRINTING.*?$',
        r'^TO BE PURCHASED AT THE GOVERNMENT PUBLICATIONS BUREAU.*?$',
        r'^Price\s*:\s*Rs\.\s*\d+\.\d+\s*Postage\s*:\s*Rs\.\s*\d+\.\d+\s*$',
        # Bill identifiers
        r'^\d+\s*-PL\s+\d+-\d+\s*\(\d+/\d+\)\s*$',
        # Subscription details
        r'^Annual subscription of English Bills and Acts of the Parliament.*?$',
        r'^Payable to the SUPERINTENDENT, GOVERNMENT PUBLICATIONS BUREAU.*?$',
        # Other repetitive metadata
        r'^N\.B\.- Part [A-Z0-9]+\s*of the Gazette No\.\s*\d+[,\d]*\s*of\s*\d{2}\.\d{2}\.\d{4}\s*was not published\.\s*$',
        r'^Published by Authority\s*$',
    ]
    try:
        for pattern in patterns:
            text = re.sub(pattern, '', text, flags=re.MULTILINE | re.IGNORECASE)
        # Remove empty lines after metadata cleaning
        text = '\n'.join(line for line in text.split('\n') if line.strip())
        return text
    except re.error:
        return text

def remove_special_characters(text: str) -> str:
    """Remove special characters, preserving essential punctuation for legal text."""
    try:
        # Keep word characters, whitespace, and common legal punctuation (.,;:-/()&);
        # everything else, including quotes and apostrophes, is removed
        text = re.sub(r'[^\w\s.,;:\-/()&]', '', text)
        return text
    except re.error:
        return text

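def preprocess_legal_document(text: Optional[str]) -> Optional[str]:
    """Main function to preprocess legal document text while preserving core content and unique details."""
    # Missing cells arrive from pandas as NaN (a float), so check the type before calling str methods
    if not isinstance(text, str) or not text.strip():
        return None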

    try:
        # Step 1: Normalize unicode
        text = normalize_unicode(text)

        # Step 2: Clean specific metadata
        text = clean_legal_metadata(text)

        # Step 3: Remove special characters
        text = remove_special_characters(text)

        # Step 4: Normalize whitespace
        text = normalize_whitespace(text)

        return text.strip()
    except Exception:
        # Fallback to minimal cleaning
        return normalize_whitespace(text)
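
The choice of Unicode normalization form affects what survives the ASCII conversion step. The sketch below is an illustration rather than part of the pipeline: `to_ascii` and the sample string are hypothetical. It compares NFKC and NFKD when each is followed by ASCII encoding with errors ignored.

```python
import unicodedata

def to_ascii(text: str, form: str) -> str:
    """Normalize with the given form, then drop every non-ASCII character."""
    return unicodedata.normalize(form, text).encode("ascii", "ignore").decode("ascii")

# Accented letter, curly quotes, and an "fi" ligature
sample = "caf\u00e9 \u201cPina\u201d \ufb01ne"

nfkc = to_ascii(sample, "NFKC")  # composed accented letter is dropped entirely
nfkd = to_ascii(sample, "NFKD")  # base letter survives, combining accent is dropped

print(repr(nfkc))  # 'caf Pina fine'
print(repr(nfkd))  # 'cafe Pina fine'
```

In both cases the curly quotes disappear, because they have no ASCII decomposition. Text in non-Latin scripts would be removed completely by either form.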

In [23]:
cleaned_text = preprocess_legal_document(text)
print(cleaned_text)

THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA Part II of May 07, 2010 SUPPLEMENT (Issued on 10. 05. 2010) PINA ORGANISATION (INCORPORATION) BILL to incorporate the Pina Organization To be presented in Parliament by Hon. Tissa Attanayake, M.P. for Kandy District PRINTED AT THE DEPARTMENT OF GOVERNMENT PRINTING, SRI LANKA TO BE PURCHASED AT THE GOVERNMENT PUBLICATIONS BUREAU, COLOMBO 5. Price : Rs. 6.00 Postage : Rs. 5.00 (Private Members Bill)
Pina Organisation (Incorporation) 2 -PL 004825-80 (05/2010) Short title. Preamble. AN ACT TO INCORPORATE THE PINA ORGANISATION WHEREAS an Organisation called and known as the Pina Organisation has been established in Sri Lanka, for the purpose of effectually carrying out and transacting all objects and matters connected with the said Organisation according to the rules agreed to by its members: AND WHEREAS the said Organisation has heretofore successfully carried out and transacted several objects and matters for which it was estab
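
The cleaned sample still begins with the gazette header. In this record the header fields run together on one line, while the patterns in `clean_legal_metadata` are anchored with `^` and `$` and therefore only match metadata that fills a whole line. The sketch below reproduces this on an excerpt of the sample and tries a hypothetical unanchored variant. It is one possible adjustment, not the notebook's method, and it would need checking against more records before use.

```python
import re

# Excerpt of the sample above: header fields share one line with the bill title
sample = (
    "THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA "
    "Part II of May 07, 2010 SUPPLEMENT (Issued on 10. 05. 2010) "
    "PINA ORGANISATION (INCORPORATION) BILL to incorporate the Pina Organization"
)

header_core = (
    r"THE GAZETTE OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA\s*"
    r"Part\s+[IVX]+\s*of\s+[A-Za-z]+\s+\d+,\s+\d+\s*SUPPLEMENT\s*"
)
issued_core = r"\(Issued\s+on\s+\d+\.\s*\d+\.\s*\d+\)\s*"
flags = re.MULTILINE | re.IGNORECASE

# Anchored form, as in clean_legal_metadata: no match, text is unchanged
anchored_result = re.sub("^" + header_core + "$", "", sample, flags=flags)

# Hypothetical unanchored form: matches the header wherever it appears in the line
unanchored_result = sample
for core in (header_core, issued_core):
    unanchored_result = re.sub(core, "", unanchored_result, flags=flags)

print(anchored_result == sample)
print(unanchored_result)
```

Dropping the anchors is only safe for patterns with a well-defined end. The patterns that finish with `.*?$` would need an explicit terminator, otherwise they could consume the rest of a line that also holds legal content.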

In [24]:
df['content'] = df['content'].apply(preprocess_legal_document)


In [25]:
df.to_csv('bills_clean.tsv.gz', sep='\t', index=False, compression='gzip')
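
A quick round trip can confirm that the saved file reads back as expected. This is a minimal sketch on a hypothetical two-row frame written to a temporary directory. It assumes the same `to_csv` and `read_csv` arguments used in this notebook, and checks that embedded newlines survive and that rows cleaned to `None` come back as missing values.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
frame = pd.DataFrame({
    "id": [1, 2],
    "content": ["Short title.\nPreamble.", None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bills_clean_demo.tsv.gz")
    frame.to_csv(path, sep="\t", index=False, compression="gzip")
    restored = pd.read_csv(path, sep="\t", compression="gzip")

same_shape = restored.shape == frame.shape
# Fields containing newlines are quoted on write, so they read back intact
newline_kept = restored.loc[0, "content"] == "Short title.\nPreamble."
# None is written as an empty field and read back as NaN
missing_rows = int(restored["content"].isna().sum())

print(same_shape, newline_kept, missing_rows)
```

Downstream steps that read `bills_clean.tsv.gz` may want to drop or skip the rows with missing `content` before generating embeddings.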
