# Text embedding pipeline

## Import libraries

In [1]:
from datasets import load_dataset, concatenate_datasets
import nltk

## Data
Read the datasets

### Pile of law
A curated large corpus of legal and administrative data (text).

**Best subsets for RAG**:
These subsets contain authoritative, structured, and extensively reasoned legal material that is well suited to RAG systems:
- courtlistener_opinions: Official U.S. court opinions with clearly structured arguments and extensive references to statutes, regulations, and precedents.
- scotus_filings: Legal argumentation by litigants before the Supreme Court, with rich citation networks and clear structure, useful for retrieving authoritative precedents.

In [2]:
n_sample = 10000
courtlistener_opinions = load_dataset("pile-of-law/pile-of-law", "courtlistener_opinions", split="train", streaming=True).take(n_sample)
scotus_filings = load_dataset("pile-of-law/pile-of-law", "scotus_filings", split="train", streaming=True).take(n_sample)

Loading Dataset Infos from /home/raati/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
Loading Dataset Infos from /home/raati/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


### Project Oyez
Audio dataset of SCOTUS oral arguments transcribed into text with OpenAI Whisper

In [3]:
project_oyez = load_dataset("Juh6973/project_oyez_oral_arguments_2000-2024")

Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4
Found cached dataset project_oyez_oral_arguments_2000-2024 (/home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4)
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4


Build an additional dataset from each case's key components: facts, question, and conclusion

In [4]:
def create_document(example):
    facts = (example['facts_of_the_case'] or '').strip()
    question = (example['question'] or '').strip()
    conclusion = (example['conclusion'] or '').strip()

    example["document"] = f"Facts: {facts}\nQuestion: {question}\nConclusion: {conclusion}"
    return example

scotus_case_concise = project_oyez.map(create_document)
scotus_case_concise = scotus_case_concise.remove_columns(['facts_of_the_case', 'question', 'conclusion', 'audio_links', 'oral_arguments', '__index_level_0__'])

Loading cached processed dataset at /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4/cache-89b200791f12feb1.arrow
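
To see how `create_document` treats missing fields, the sketch below runs the same function on a plain dictionary instead of a `datasets` row. The sample record is made up for illustration.

```python
def create_document(example):
    # `or ''` turns a None field into an empty string before stripping
    facts = (example['facts_of_the_case'] or '').strip()
    question = (example['question'] or '').strip()
    conclusion = (example['conclusion'] or '').strip()

    example["document"] = f"Facts: {facts}\nQuestion: {question}\nConclusion: {conclusion}"
    return example


# Hypothetical record: the conclusion is missing (None)
sample = {
    "facts_of_the_case": "  A company imported day planners.  ",
    "question": "Does the ruling letter receive deference?",
    "conclusion": None,
}

result = create_document(dict(sample))
print(result["document"])
```

A missing field leaves its label in place with an empty value, so every document keeps the same three-line layout.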


### Harvard Caselaw
A dataset of scanned PDFs converted into text with Google's Tesseract OCR model

In [5]:
harvard_caselaw = load_dataset("Juh6973/caselaw_latest_volumes_by_state")

Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb
Found cached dataset caselaw_latest_volumes_by_state (/home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb)
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb


## Preprocessing

Unify the data schemas

In [6]:
print(courtlistener_opinions)
print(scotus_filings)
print(scotus_case_concise)
print(harvard_caselaw)
print(project_oyez)


IterableDataset({
    features: ['text', 'created_timestamp', 'downloaded_timestamp', 'url'],
    num_shards: 16
})
IterableDataset({
    features: ['text', 'created_timestamp', 'downloaded_timestamp', 'url'],
    num_shards: 1
})
DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'description', 'document'],
        num_rows: 1693
    })
})
DatasetDict({
    train: Dataset({
        features: ['state', 'volume_number', 'pdf_url', 'term', 'jurisdictions', 'text'],
        num_rows: 50
    })
})
DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'facts_of_the_case', 'question', 'conclusion', 'description', 'audio_links', 'oral_arguments', '__index_level_0__'],
        num_rows: 1693
    })
})


In [8]:
project_oyez["train"][0]

{'id': 54856,
 'name': 'United States v. Mead Corporation',
 'facts_of_the_case': 'Under the Harmonized Tariff Schedule of the United States, the United States Customs Service is authorized to classify and fix the rate of duty on imports under rules and regulations issued by the Secretary of the Treasury. Under the Secretary\'s regulations, any port-of-entry Customs office and the Customs Headquarters Office may issue "ruling letters" setting tariff classifications for particular imports. The Mead Corporation\'s imported "day planners," were classified as duty-free until the Customs Headquarters issued a ruling letter classifying them as bound diaries subject to tariff. Subsequently, Mead filed suit in the Court of International Trade. The court granted the Government summary judgment. In reversing, the Court of Appeals found that ruling letters should not be treated like Customs regulations, which receive the highest level of deference, because they are not preceded by notice and comm

In [11]:
from datasets import DatasetDict

unified_dataset = DatasetDict()
# courtlistener_opinions
co_filtered = courtlistener_opinions.remove_columns(['created_timestamp', 'downloaded_timestamp'])
co_filtered = co_filtered.rename_column('text', 'document')
unified_dataset['courtlistener_opinions'] = co_filtered

# scotus_filings
sf_filtered = scotus_filings.remove_columns(['created_timestamp', 'downloaded_timestamp'])
sf_filtered = sf_filtered.rename_column('text', 'document')
unified_dataset['scotus_filings'] = sf_filtered

# scotus_case_concise
unified_dataset['scotus_case_concise'] = scotus_case_concise["train"]


# harvard_caselaw
harvard_caselaw_filtered = harvard_caselaw.remove_columns(['volume_number'])
harvard_caselaw_filtered = harvard_caselaw_filtered.rename_column('text', 'document')
harvard_caselaw_filtered = harvard_caselaw_filtered.rename_column('pdf_url', 'url')
unified_dataset['harvard_caselaw'] = harvard_caselaw_filtered["train"]

# project_oyez
project_oyez_filtered = project_oyez.remove_columns(['__index_level_0__', 'audio_links'])
project_oyez_filtered = project_oyez_filtered.rename_column('oral_arguments', 'document')
unified_dataset['project_oyez'] = project_oyez_filtered["train"]

In [12]:
unified_dataset

DatasetDict({
    courtlistener_opinions: IterableDataset({
        features: ['document', 'url'],
        num_shards: 16
    })
    scotus_filings: IterableDataset({
        features: ['document', 'url'],
        num_shards: 1
    })
    scotus_case_concise: Dataset({
        features: ['id', 'name', 'description', 'document'],
        num_rows: 1693
    })
    harvard_caselaw: Dataset({
        features: ['state', 'url', 'term', 'jurisdictions', 'document'],
        num_rows: 50
    })
    project_oyez: Dataset({
        features: ['id', 'name', 'facts_of_the_case', 'question', 'conclusion', 'description', 'document'],
        num_rows: 1693
    })
})
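
As a quick sanity check on the unified schema, the sketch below works on the column names copied from the printout above rather than on the datasets themselves. It confirms that every split has a `document` column and lists the columns that will end up as metadata.

```python
# Column names copied from the `unified_dataset` printout
features = {
    "courtlistener_opinions": ["document", "url"],
    "scotus_filings": ["document", "url"],
    "scotus_case_concise": ["id", "name", "description", "document"],
    "harvard_caselaw": ["state", "url", "term", "jurisdictions", "document"],
    "project_oyez": ["id", "name", "facts_of_the_case", "question",
                     "conclusion", "description", "document"],
}

# Every split needs a `document` column; the remaining columns become metadata
missing = [name for name, cols in features.items() if "document" not in cols]
metadata_columns = {
    name: [c for c in cols if c != "document"]
    for name, cols in features.items()
}

print("Splits missing `document`:", missing)
for name, cols in metadata_columns.items():
    print(f"{name}: {cols}")
```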

Import the libraries for chunking, embedding, and the vector store

In [None]:
import re
import chromadb
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from uuid import uuid4

Create or get vector database

In [17]:
# persist directory
persist_directory = "./database"
collection_name = "full_collection"

embedding_function = HuggingFaceEmbeddings(model_name="intfloat/e5-large-v2")

vector_store = Chroma(
    collection_name=collection_name,
    embedding_function=embedding_function,
    persist_directory=persist_directory,  # Where to save data locally, remove if not necessary
)
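
The `intfloat/e5-large-v2` model card recommends prefixing inputs with `passage: ` for indexed text and `query: ` for search queries, while the cell above passes raw text. A minimal sketch of a wrapper that adds the prefixes follows. `PrefixedEmbeddings` and `EchoEmbeddings` are hypothetical names, the stand-in model only records what it receives, and whether such a wrapper can be passed directly as `embedding_function` has not been verified here.

```python
class PrefixedEmbeddings:
    """Wrap an embedding object and add e5-style prefixes (hypothetical helper)."""

    def __init__(self, base):
        # `base` is any object with embed_documents / embed_query methods
        self.base = base

    def embed_documents(self, texts):
        return self.base.embed_documents([f"passage: {t}" for t in texts])

    def embed_query(self, text):
        return self.base.embed_query(f"query: {text}")


class EchoEmbeddings:
    """Stand-in for a real model: records inputs and returns dummy vectors."""

    def __init__(self):
        self.seen = []

    def embed_documents(self, texts):
        self.seen.extend(texts)
        return [[float(len(t))] for t in texts]

    def embed_query(self, text):
        self.seen.append(text)
        return [float(len(text))]


base = EchoEmbeddings()
wrapped = PrefixedEmbeddings(base)
wrapped.embed_documents(["Ordered that the judgment is affirmed."])
wrapped.embed_query("Can police search my phone?")
print(base.seen)
```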

Remove special characters with regex

In [18]:
# Clean function

def clean_text(text):
    # Step 1: Remove leftover hex escape tokens such as "x9c"
    text = re.sub(r'\b[x][0-9a-fA-F]+\b', '', text)
    # Step 2: Replace escaped newlines with actual newlines
    text = text.replace('\\n', '\n')

    # Step 3: Normalize paragraph breaks (two or more newlines become one blank line)
    text = re.sub(r'\n{2,}', '\n\n', text)
    # Step 4: Collapse runs of spaces while preserving paragraph structure
    text = re.sub(r' {2,}', ' ', text)

    # Step 5: Replace remaining special characters with spaces
    # (keeps word characters, whitespace, and basic punctuation)
    text = re.sub(r'[^\w\s.,;:?!()"\'-]', ' ', text)

    # Step 6: Collapse any double spaces created during cleaning
    text = re.sub(r' {2,}', ' ', text)

    # Step 7: Trim leading/trailing whitespace
    text = text.strip()

    return text
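
The following sketch runs the same cleaning steps on a short made-up string to show their combined effect. The input is hypothetical, and the function body repeats the regexes from the cell above so the block runs on its own.

```python
import re


def clean_text(text):
    text = re.sub(r'\b[x][0-9a-fA-F]+\b', '', text)    # drop tokens like "x9c"
    text = text.replace('\\n', '\n')                   # literal "\n" -> newline
    text = re.sub(r'\n{2,}', '\n\n', text)             # at most one blank line
    text = re.sub(r' {2,}', ' ', text)                 # collapse spaces
    text = re.sub(r'[^\w\s.,;:?!()"\'-]', ' ', text)   # other symbols -> space
    text = re.sub(r' {2,}', ' ', text)                 # collapse spaces again
    return text.strip()


# Hypothetical input with a hex token, escaped newlines, and citation symbols
raw = "Equitable  mootness x9c applies.\\n\\n\\nSee § 5 [n. 12]."
cleaned = clean_text(raw)
print(repr(cleaned))
```

Characters outside the allowed set, including the section sign and square brackets that are common in legal citations, are replaced with spaces, which is worth keeping in mind when retrieved text is compared with the source.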

Create Document objects

In [19]:
# Iterate over the datasets and build a Document object for each example

documents = []
nonetype_documents = []

for dataset_name, dataset in unified_dataset.items():
    print(f"Processing dataset: {dataset_name}")
    for example in dataset:

        # allow description with no text
        if "description" in example and example["description"] is None:
            example["description"] = ""

        # Remove data with None values
        if None in example.values():
            nonetype_documents.append(example)
            continue


        # Extract the text and metadata from the example
        text = example.pop("document")
        metadata = example
        metadata["dataset"] = dataset_name
        
        if dataset_name == "scotus_filings":
            text = clean_text(text)

        # Create a document object
        document = Document(
            page_content=text,
            metadata=metadata,
        )
        
        documents.append(document)
        

print(f"Number of documents: {len(documents)}")
print(f"Number of documents with None values: {len(nonetype_documents)}")

Processing dataset: courtlistener_opinions
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.0.jsonl.xz
Processing dataset: scotus_filings
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.scotus_docket_entries.jsonl.xz
Processing dataset: scotus_case_concise
Processing dataset: harvard_caselaw
Processing dataset: project_oyez
Number of documents: 23404
Number of documents with None values: 32
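
The filtering rule in the loop can be tried out on plain dictionaries. The sketch below is a simplified version under the assumption that each example is a flat dict. The helper name `to_text_and_metadata` and the two sample rows are hypothetical.

```python
def to_text_and_metadata(example, dataset_name):
    """Return (text, metadata), or None when any field is still None."""
    example = dict(example)  # work on a copy so the input row is untouched
    # An empty description is allowed, so replace None with ""
    if "description" in example and example["description"] is None:
        example["description"] = ""
    # Any other None value disqualifies the example
    if None in example.values():
        return None
    text = example.pop("document")
    example["dataset"] = dataset_name
    return text, example


rows = [
    {"id": 1, "name": "Case A", "description": None, "document": "Facts: A"},
    {"id": 2, "name": None, "description": "Argued 2001", "document": "Facts: B"},
]
results = [to_text_and_metadata(row, "scotus_case_concise") for row in rows]
kept = [r for r in results if r is not None]
skipped = results.count(None)
print(kept, skipped)
```

The first row survives because only its description was missing, while the second is skipped because its name is `None`.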


In [20]:
print(f"Adding {len(documents)} documents to the Chroma database")

Adding 23404 documents to the Chroma database


View first document

In [21]:
print(documents[0])

page_content='—Appeal by the defendant from a judgment of the Supreme Court, Queens County (Brennan, J.), rendered December 1, 1983, adjudicating him a youthful offender, upon his plea of guilty to robbery in the first degree (two counts), robbery in the second degree, and assault in the first degree (two counts), and imposing sentence.
Ordered that the judgment is affirmed.
We have reviewed the record and agree with the defendant’s assigned counsel that there are no meritorious issues which could be raised on appeal. Counsel’s application for leave to withdraw as counsel is granted (see, Anders v Califor*580nia, 386 US 738; People v Paige, 54 AD2d 631; cf., People v Gonzalez, 47 NY2d 606). Mollen, P. J., Bracken, Rubin and Spatt, JJ., concur.' metadata={'url': 'https://www.courtlistener.com/api/rest/v3/opinions/5901619/', 'dataset': 'courtlistener_opinions'}


In [22]:
# harvard_caselaw example
#print(documents[21700])

Chunk documents

In [23]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunked_documents = text_splitter.split_documents(documents)

print(f"Split {len(documents)} documents into {len(chunked_documents)} chunks")

Split 23404 documents into 702276 chunks
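
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits on separators such as paragraph breaks, newlines, and spaces and then merges pieces up to the size limit, so real chunks are usually shorter than 1000 characters. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows that each share `chunk_overlap`
    characters with the previous window."""
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# 2500 characters of repeating digits as a stand-in for a document
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.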


Inspect chunks between 50 and 100 characters long, then remove chunks of 50 characters or fewer.

In [None]:
# get documents under 100 characters and above 50 characters
short_documents = []
for doc in chunked_documents:
    if len(doc.page_content) > 50 and len(doc.page_content) < 100:
        short_documents.append(doc)

print(f"Number of short documents: {len(short_documents)}")

Number of short documents: 7050


In [26]:
# keep only chunks longer than 50 characters
chunked_documents = [doc for doc in chunked_documents if len(doc.page_content) > 50]
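
The two cells above use different conditions: the first only counts chunks of 51 to 99 characters, while the second drops everything of 50 characters or fewer. A small sketch with made-up lengths shows the boundaries, and the last lines compute how many chunks the filter removed from the counts reported in this notebook.

```python
# Hypothetical chunk lengths around the two thresholds
lengths = [12, 50, 51, 99, 100, 640]

inspected = [n for n in lengths if 50 < n < 100]  # counted as "short"
kept = [n for n in lengths if n > 50]             # survive the filter

print(inspected)
print(kept)

# Chunk counts reported before and after filtering
before, after = 702276, 690719
removed = before - after
print(removed, f"{removed / before:.1%}")
```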

## Upload to chromadb

Shuffle the chunks (only needed when the number of uploaded chunks is capped)

In [28]:
# shuffle the documents
import random
random.shuffle(chunked_documents)
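
If the shuffled order should be repeatable between runs, a seeded generator can be used instead of the module-level `random.shuffle`. This is a sketch with stand-in data. The seed value and the `limit` of 4 are arbitrary choices for illustration.

```python
import random

items = list(range(10))   # stand-in for chunked_documents
rng = random.Random(42)   # fixed seed (arbitrary) makes the order repeatable
rng.shuffle(items)

limit = 4                 # hypothetical cap on the number of uploaded chunks
subset = items[:limit]

# Shuffling a fresh copy with the same seed yields the same order
again = list(range(10))
random.Random(42).shuffle(again)
print(subset, items == again)
```

Because the shuffle happens before the IDs are generated, chunk and ID lists stay aligned by position.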

In [29]:
chunked_documents[0]

Document(metadata={'url': 'http://www.supremecourt.gov/DocketPDF/21/21-78/184083/20210716110934046_210160a%20Appendix%20for%20efiling.pdf', 'dataset': 'scotus_filings'}, page_content='m o ot n e s s d o ctri n e i s si m pl y\n\ni n a p pli c a bl e i n t hi s c a s e b e c a u s e t h e a p p e al d o e s n ot dir e ctl y c o n c er n t h e\nb a n kr u pt c y c o urt s or d er c o nfir mi n g\n\nWi n d str e a m s pl a n of r e or g a ni z ati o n.\n\nWe\n\ndi s a gr e e.\nO ur pr e c e d e nt i s cl e ar t h at e q uit a bl e m o ot n e s s c a n b e a p pli e d x9ci n a r a n g e of\nc o nt e xt s , i n cl u di n g a p p e al s i n v ol vi n g all m a n n er of b a n kr u pt c y c o urt or d er s.\nB GI , 7 7 2 F. 3 d at 1 0 9 n. 1 2 ( c oll e cti n g c a s e s). I n f a ct, C h a te a u g a y it s elf a p pli e d t h e\nd o ctri n e t o di s mi s s a cr e dit or s c h all en g e t o v ari o u s or d er s, s e v er al of w hi c h w er e\ni n d e p e n d e nt of t h e b a n kr u pt c

Generate a UUID for each chunk

In [30]:
uuids = [str(uuid4()) for _ in range(len(chunked_documents))]
print(len(uuids))

690719
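
Random `uuid4` values differ on every run, so a repeated upload cannot recognise chunks that were stored before. One alternative, shown here only as a sketch, derives the ID from the chunk content with `uuid5`. The helper `chunk_id` and the namespace string are hypothetical, and identical chunks within one dataset would map to the same ID, so they would need to be deduplicated first.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "legal-rag.example")


def chunk_id(dataset, text):
    """Deterministic UUID built from the dataset name and the chunk text."""
    return str(uuid.uuid5(NAMESPACE, f"{dataset}\n{text}"))


a = chunk_id("scotus_filings", "Ordered that the judgment is affirmed.")
b = chunk_id("scotus_filings", "Ordered that the judgment is affirmed.")
c = chunk_id("courtlistener_opinions", "Ordered that the judgment is affirmed.")
print(a == b, a == c)
```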


Free memory

In [31]:
# delete large objects that are no longer needed in this session
del documents
del scotus_filings
del courtlistener_opinions
del scotus_case_concise
del harvard_caselaw
del project_oyez

Upload to database

In [32]:
from tqdm import tqdm
# Add the chunked documents to the Chroma database in batches
limit = 100000  # not applied below; slice chunked_documents[:limit] and uuids[:limit] to cap the upload
failed = []  # start indices of batches that raised an error
step_len = 5000
for i in tqdm(range(0, len(chunked_documents), step_len)):
    try:
        vector_store.add_documents(
            documents=chunked_documents[i:i+step_len],
            ids=uuids[i:i+step_len],  # IDs are passed via the `ids` keyword
        )
    except Exception as e:
        print(f"Failed to add documents {i} to {i+step_len} to the Chroma database: {e}")
        failed.append(i)


100%|██████████| 139/139 [8:35:25<00:00, 222.49s/it]  
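
The progress bar shows 139 batches, which matches 690719 chunks split into steps of 5000. The sketch below checks that arithmetic and shows one way the start indices collected in `failed` could be retried. `retry_failed` and `fake_add` are hypothetical helpers. In the notebook, `add_fn` would wrap the `vector_store.add_documents` call.

```python
import math

total_chunks = 690719   # number of chunks after filtering
step_len = 5000
starts = range(0, total_chunks, step_len)
print(len(starts), math.ceil(total_chunks / step_len))


def retry_failed(failed, items, ids, add_fn, step_len):
    """Re-send the batches whose start index is in `failed`.
    Returns the start indices that failed again."""
    still_failed = []
    for i in failed:
        try:
            add_fn(items[i:i + step_len], ids[i:i + step_len])
        except Exception:
            still_failed.append(i)
    return still_failed


# Toy run: a fake add function that rejects any batch containing "bad"
items = ["a", "b", "bad", "d"]
ids = ["1", "2", "3", "4"]
stored = []


def fake_add(batch, batch_ids):
    if "bad" in batch:
        raise ValueError("rejected")
    stored.extend(batch_ids)


remaining = retry_failed([0, 2], items, ids, fake_add, step_len=2)
print(stored, remaining)
```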


Test the connection. The HTTP client requires a Chroma server running on port 5000.

In [9]:
import chromadb

client = chromadb.PersistentClient(path="./database")
print(client.list_collections())
client = chromadb.HttpClient(host="localhost", port=5000)
# List available collections
print(client.list_collections())

['full_collection']
['test_collection']


In [2]:
collection = client.get_collection('full_collection')
print(collection.count())

690719


Query results

In [35]:
# Query the Chroma database 
query = "Can police search my phone?"
results = vector_store.similarity_search(query, k=5)
print(f"Query: {query}\n")
for i, result in enumerate(results):
    print(f"Result {i}: {result.page_content}\n---\nMetadata: {result.metadata}\n")
    print()

Query: Can police search my phone?

Result 0: a

Modern cell phones are not just another technological con-
venience. With all they contain and all they may reveal, they
hold for many Americans “the privacies of life[.]” The fact
that technology now allows an individual to carry such
information in his hand does not make the information any
less worthy of the protection for which the Founders fought.
Our answer to the question of what police must do before
searching a cell phone seized incident to an arrest is accord-
ingly simple—get a warrant,
Id, at 2494-95,
---
Metadata: {'dataset': 'harvard_caselaw', 'jurisdictions': 'South Carolina', 'state': 'South Carolina', 'term': '2018-2018', 'url': 'https://static.case.law/sc/421.pdf'}


Result 1: Well, you could say the same thing about a cigarette pack that -- that has cocaine in it.
Or a gun.
Or -- or a gun.
And the police may seize and examine those containers--
Right.
--to see whether or not.
And why not the phone.
That's exactly the q