# Text embedding pipeline

## Import libraries

In [1]:
from datasets import load_dataset, concatenate_datasets
import nltk

## Data
Read the datasets

### Pile of law
A large curated corpus of legal and administrative text.

**Best subsets for RAG**:
These subsets contain authoritative, structured, and extensively reasoned legal material that suits RAG systems:
- courtlistener_opinions: official U.S. court opinions with clearly structured arguments and extensive references to statutes, regulations, and precedents.
- scotus_filings: legal argumentation by litigants before the Supreme Court, with rich citation networks and a clear structure, useful for retrieving authoritative precedents.

In [2]:
n_sample = 10000
courtlistener_opinions = load_dataset("pile-of-law/pile-of-law", "courtlistener_opinions", split="train", streaming=True).take(n_sample)
scotus_filings = load_dataset("pile-of-law/pile-of-law", "scotus_filings", split="train", streaming=True).take(n_sample)

Loading Dataset Infos from /home/raati/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
Loading Dataset Infos from /home/raati/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


### Project Oyez
A dataset of SCOTUS oral arguments whose audio was transcribed to text with OpenAI Whisper.

In [3]:
project_oyez = load_dataset("Juh6973/project_oyez_oral_arguments_2000-2024")

Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4
Found cached dataset project_oyez_oral_arguments_2000-2024 (/home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4)
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4


Build an additional, concise dataset from each case's key components (facts, question, and conclusion).

In [4]:
def create_document(example):
    facts = (example['facts_of_the_case'] or '').strip()
    question = (example['question'] or '').strip()
    conclusion = (example['conclusion'] or '').strip()

    example["document"] = f"Facts: {facts}\nQuestion: {question}\nConclusion: {conclusion}"
    return example

scotus_case_concise = project_oyez.map(create_document)
scotus_case_concise = scotus_case_concise.remove_columns(['facts_of_the_case', 'question', 'conclusion', 'audio_links', 'oral_arguments', '__index_level_0__'])

Loading cached processed dataset at /home/raati/.cache/huggingface/datasets/Juh6973___project_oyez_oral_arguments_2000-2024/default/0.0.0/e34cb3b6656734dbf42311585865c4d2f6345ed4/cache-89b200791f12feb1.arrow


### Harvard Caselaw
A dataset of scanned case-law PDFs converted to text with Google's Tesseract OCR engine.

In [5]:
harvard_caselaw = load_dataset("Juh6973/caselaw_latest_volumes_by_state")

Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb
Found cached dataset caselaw_latest_volumes_by_state (/home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb)
Loading Dataset info from /home/raati/.cache/huggingface/datasets/Juh6973___caselaw_latest_volumes_by_state/default/0.0.0/130a8e46ab5ea41e0d77e113e5146a621c3d46eb


## Preprocessing

Inspect each dataset's schema, then unify them so that every dataset stores its text in a document column.

In [6]:
print(courtlistener_opinions)
print(scotus_filings)
print(scotus_case_concise)
print(harvard_caselaw)
print(project_oyez)


IterableDataset({
    features: ['text', 'created_timestamp', 'downloaded_timestamp', 'url'],
    num_shards: 16
})
IterableDataset({
    features: ['text', 'created_timestamp', 'downloaded_timestamp', 'url'],
    num_shards: 1
})
DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'description', 'document'],
        num_rows: 1693
    })
})
DatasetDict({
    train: Dataset({
        features: ['state', 'volume_number', 'pdf_url', 'term', 'jurisdictions', 'text'],
        num_rows: 50
    })
})
DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'facts_of_the_case', 'question', 'conclusion', 'description', 'audio_links', 'oral_arguments', '__index_level_0__'],
        num_rows: 1693
    })
})


In [7]:
project_oyez["train"][0]

{'id': 54856,
 'name': 'United States v. Mead Corporation',
 'facts_of_the_case': 'Under the Harmonized Tariff Schedule of the United States, the United States Customs Service is authorized to classify and fix the rate of duty on imports under rules and regulations issued by the Secretary of the Treasury. Under the Secretary\'s regulations, any port-of-entry Customs office and the Customs Headquarters Office may issue "ruling letters" setting tariff classifications for particular imports. The Mead Corporation\'s imported "day planners," were classified as duty-free until the Customs Headquarters issued a ruling letter classifying them as bound diaries subject to tariff. Subsequently, Mead filed suit in the Court of International Trade. The court granted the Government summary judgment. In reversing, the Court of Appeals found that ruling letters should not be treated like Customs regulations, which receive the highest level of deference, because they are not preceded by notice and comm

In [8]:
harvard_caselaw["train"][0]

{'state': 'Alabama',
 'volume_number': 295,
 'pdf_url': 'https://static.case.law/ala/295.pdf',
 'term': '1975-1975',
 'jurisdictions': 'Alabama',

In [9]:
from datasets import DatasetDict

unified_dataset = DatasetDict()
# courtlistener_opinions
co_filtered = courtlistener_opinions.remove_columns(['created_timestamp', 'downloaded_timestamp'])
co_filtered = co_filtered.rename_column('text', 'document')
unified_dataset['courtlistener_opinions'] = co_filtered

# scotus_filings
sf_filtered = scotus_filings.remove_columns(['created_timestamp', 'downloaded_timestamp'])
sf_filtered = sf_filtered.rename_column('text', 'document')
unified_dataset['scotus_filings'] = sf_filtered

# scotus_case_concise
unified_dataset['scotus_case_concise'] = scotus_case_concise["train"]


# harvard_caselaw
harvard_caselaw_filtered = harvard_caselaw.remove_columns(['volume_number'])
harvard_caselaw_filtered = harvard_caselaw_filtered.rename_column('text', 'document')
harvard_caselaw_filtered = harvard_caselaw_filtered.rename_column('pdf_url', 'url')
unified_dataset['harvard_caselaw'] = harvard_caselaw_filtered["train"]

# project_oyez
project_oyez_filtered = project_oyez.remove_columns(['__index_level_0__', 'audio_links'])
project_oyez_filtered = project_oyez_filtered.rename_column('oral_arguments', 'document')
unified_dataset['project_oyez'] = project_oyez_filtered["train"]

In [10]:
unified_dataset

DatasetDict({
    courtlistener_opinions: IterableDataset({
        features: ['document', 'url'],
        num_shards: 16
    })
    scotus_filings: IterableDataset({
        features: ['document', 'url'],
        num_shards: 1
    })
    scotus_case_concise: Dataset({
        features: ['id', 'name', 'description', 'document'],
        num_rows: 1693
    })
    harvard_caselaw: Dataset({
        features: ['state', 'url', 'term', 'jurisdictions', 'document'],
        num_rows: 50
    })
    project_oyez: Dataset({
        features: ['id', 'name', 'facts_of_the_case', 'question', 'conclusion', 'description', 'document'],
        num_rows: 1693
    })
})

In [11]:
import re

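tmp_text = unified_dataset["harvard_caselaw"][0]["document"].strip()
# Join hard-wrapped lines: a single newline becomes a space, blank lines are kept
cleaned_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', tmp_text)
#tmp_graphs = re.split(r'\n{2,}', tmp_text)
# Split into paragraphs on blank lines
tmp_graphs = cleaned_text.split('\n\n')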

short = []
passed = []
for i, g in enumerate(tmp_graphs):
    #print(f"Graph {i}, length {len(g)}: {g}")
    #print()
    if len(g) < 75:
        short.append(g)
    else:
        passed.append(g)

print(f"Short: {len(short)}")
print(f"Passed: {len(passed)}")

Short: 2696
Passed: 2743
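
The unwrap-and-split steps above could be wrapped in a reusable function. The following is a minimal sketch that assumes the same 75-character threshold; the function name `split_paragraphs` and the parameter `min_len` are hypothetical and not part of the notebook.

```python
import re

def split_paragraphs(text, min_len=75):
    """Unwrap hard line breaks, split on blank lines, and separate short fragments."""
    # A single newline (not next to another newline) becomes a space
    unwrapped = re.sub(r'(?<!\n)\n(?!\n)', ' ', text.strip())
    # Two or more consecutive newlines mark a paragraph boundary
    paragraphs = [p.strip() for p in re.split(r'\n{2,}', unwrapped) if p.strip()]
    kept = [p for p in paragraphs if len(p) >= min_len]
    short = [p for p in paragraphs if len(p) < min_len]
    return kept, short

sample = (
    "HEADER\n\n"
    "This is a long paragraph that was hard-wrapped\n"
    "by the OCR step and easily exceeds the minimum length.\n\n"
    "12"
)
kept, short = split_paragraphs(sample)
print(f"Kept: {len(kept)}, short: {len(short)}")
```

Returning the short fragments instead of discarding them makes it easy to inspect what would be filtered out (page numbers, running headers, and similar OCR residue) before committing to a threshold.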


In [None]:
for i, s in enumerate(short):
    print(f"Short {i}: {s}")
    print()

    if i > 100:
        break

In [None]:
# Print every field of the first harvard_caselaw example
for example in unified_dataset["harvard_caselaw"]:
    for key in example:
        print(key, ":", example[key])
        print("----")
    break

Remove special characters with regex
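
A minimal sketch of what such a cleaning step might look like. The rules are assumptions based on the samples printed in this notebook (double-escaped newlines and form-feed markers in scotus_filings, irregular spacing in OCR text); the function name `clean_text` is hypothetical.

```python
import re

def clean_text(text):
    """Normalize raw OCR / scraped text before chunking (illustrative rules only)."""
    # Literal backslash-n sequences (double-escaped newlines) become real newlines
    text = text.replace('\\n', '\n')
    # Drop literal "\x0c" markers (an escaped form feed written out as four characters)
    text = text.replace('\\x0c', '')
    # Drop real control characters, keeping newline (\n) and tab (\t)
    text = re.sub(r'[\x00-\x08\x0b-\x1f\x7f]', '', text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r'[ \t]+', ' ', text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

raw = 'Ordot Landfill\\nSuperfund Site\\n\\n\\x0cJA-64   New\x0c London'
cleaned = clean_text(raw)
print(repr(cleaned))
```

Which characters count as "special" depends on the dataset, so the patterns are best checked against a sample from each source before being applied to the whole corpus.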

Split the documents into chunks

In [2]:
import os
import chromadb
from chromadb.config import Settings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from uuid import uuid4

In [3]:
import os
# print path to the current directory
print(os.getcwd())

# print files
print(os.listdir())

/home/raati/multimodal_legal_rag/chroma
['database', 'text_embedding_pipeline.ipynb', 'chroma_test.ipynb', 'Dockerfile', 'requirements.txt']


Create or get vector database

In [4]:
# persist directory
persist_directory = "./database"
collection_name = "test_collection"

embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vector_store = Chroma(
    collection_name=collection_name,
    embedding_function=embedding_function,
    persist_directory=persist_directory,  # Where to save data locally, remove if not necessary
)

Create Document objects

In [22]:
# Iterate over the datasets and build a Document object for each example

documents = []
nonetype_documents = []

for dataset_name, dataset in unified_dataset.items():
    print(f"Processing dataset: {dataset_name}")
    for example in dataset:

        # Remove data with None values
        if None in example.values():
            nonetype_documents.append(example)
            continue

        # Extract the text and metadata from the example
        text = example.pop("document")
        metadata = example
        metadata["dataset"] = dataset_name


        # Create a document object
        document = Document(
            page_content=text,
            metadata=metadata,
        )
        
        documents.append(document)
        

print(f"Number of documents: {len(documents)}")
print(f"Number of documents with None values: {len(nonetype_documents)}")

Processing dataset: courtlistener_opinions
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.0.jsonl.xz
Processing dataset: scotus_filings
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.scotus_docket_entries.jsonl.xz
Processing dataset: scotus_case_concise
Processing dataset: harvard_caselaw
Processing dataset: project_oyez
Number of documents: 22270
Number of documents with None values: 1166
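
The filter above discards a whole example as soon as any field is None, even when only a metadata field is missing. An alternative, sketched below on plain dicts, keeps the example and drops only the None-valued metadata fields. The function name `split_text_and_metadata` is hypothetical, and the sample record is abbreviated for illustration.

```python
def split_text_and_metadata(example, dataset_name, text_key="document"):
    """Return (text, metadata), or None when the example has no usable text."""
    text = example.get(text_key)
    if not text or not text.strip():
        return None
    # Copy the remaining fields, leaving out the text itself and any None values
    metadata = {k: v for k, v in example.items() if k != text_key and v is not None}
    metadata["dataset"] = dataset_name
    return text, metadata

# Abbreviated, illustrative record with a missing description
example = {
    "id": 54856,
    "name": "United States v. Mead Corporation",
    "description": None,
    "document": "Facts: (abbreviated)",
}
result = split_text_and_metadata(example, "scotus_case_concise")
print(result)
```

Because the function builds a new metadata dict instead of calling `pop` on the example, the original record is left unchanged.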


In [24]:
print(f"Adding {len(documents)} documents to the Chroma database")

Adding 22270 documents to the Chroma database


In [25]:
print(documents[0])

page_content='—Appeal by the defendant from a judgment of the Supreme Court, Queens County (Brennan, J.), rendered December 1, 1983, adjudicating him a youthful offender, upon his plea of guilty to robbery in the first degree (two counts), robbery in the second degree, and assault in the first degree (two counts), and imposing sentence.
Ordered that the judgment is affirmed.
We have reviewed the record and agree with the defendant’s assigned counsel that there are no meritorious issues which could be raised on appeal. Counsel’s application for leave to withdraw as counsel is granted (see, Anders v Califor*580nia, 386 US 738; People v Paige, 54 AD2d 631; cf., People v Gonzalez, 47 NY2d 606). Mollen, P. J., Bracken, Rubin and Spatt, JJ., concur.' metadata={'url': 'https://www.courtlistener.com/api/rest/v3/opinions/5901619/', 'dataset': 'courtlistener_opinions'}


In [26]:
print(nonetype_documents[0])

{'id': 54856, 'name': 'United States v. Mead Corporation', 'description': None, 'document': 'Facts: Under the Harmonized Tariff Schedule of the United States, the United States Customs Service is authorized to classify and fix the rate of duty on imports under rules and regulations issued by the Secretary of the Treasury. Under the Secretary\'s regulations, any port-of-entry Customs office and the Customs Headquarters Office may issue "ruling letters" setting tariff classifications for particular imports. The Mead Corporation\'s imported "day planners," were classified as duty-free until the Customs Headquarters issued a ruling letter classifying them as bound diaries subject to tariff. Subsequently, Mead filed suit in the Court of International Trade. The court granted the Government summary judgment. In reversing, the Court of Appeals found that ruling letters should not be treated like Customs regulations, which receive the highest level of deference, because they are not preceded b

In [None]:
# harvard_caselaw example
#print(documents[21700])

In [30]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
chunked_documents = text_splitter.split_documents(documents)

#tmp_doc = documents[21700:21701]
#chunked_documents = text_splitter.split_documents(tmp_doc)

print(f"Split {len(documents)} documents into {len(chunked_documents)} chunks")

Split 22270 documents into 279363 chunks
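
To illustrate what chunk_size and chunk_overlap mean, here is a simplified sliding-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to cut at separators such as paragraph breaks and spaces); the sketch only shows the effect of the two parameters, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(4500))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])
```

With these settings, the last 100 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.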


Init uuids for documents

In [31]:
uuids = [str(uuid4()) for _ in range(len(chunked_documents))]
print(len(uuids))

279363
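
uuid4 produces fresh random IDs on every run, so re-running the upload would presumably add the same chunks again under new IDs. One possible alternative is to derive each ID from the chunk itself with uuid5. This is a sketch, not what the notebook does; the function name `chunk_id`, its parameters, and the example URL are hypothetical.

```python
import uuid

def chunk_id(dataset, source, index, text):
    """Derive a stable UUID from a chunk's origin, position, and content."""
    key = f"{dataset}|{source}|{index}|{text}"
    # uuid5 hashes the key, so the same input always yields the same UUID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

first = chunk_id("scotus_filings", "https://example.org/doc.pdf", 0, "some chunk text")
again = chunk_id("scotus_filings", "https://example.org/doc.pdf", 0, "some chunk text")
other = chunk_id("scotus_filings", "https://example.org/doc.pdf", 1, "next chunk text")
print(first == again, first == other)
```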


View chunked documents

In [33]:
# view chunked documents
for i, doc in enumerate(chunked_documents[-15:]):
    print(f"{len(doc.page_content)}Document {i}: {doc.page_content}")
    print("---")

    if i > 10:
        break


1916Document 0: I do want to emphasize, though, that my friends have pointed to January 19th or nine days from now as a moment when TikTok might go dark.
At the outset, of course, Congress was hoping to prompt a divestiture, but I think the more important thing to  --to focus on now is that even if that were to happen, Congress specifically anticipated it and provided authority to lift these restrictions as soon as there's a qualified divestiture.
And the reason for that is because foreign adversaries do not willingly give up their control over this mass communications  channel in the United States, and I think Congress expected we might see something like a game of chicken, ByteDance saying we can't do it; China will never let us do it.
But, when push comes to shove and these restrictions take effect, I think it will fundamentally change the landscape with respect to what ByteDance is willing to consider, and it might be just the jolt that Congress expected the company would need to a

## Upload to chromadb

In [34]:
chunked_documents[0]

Document(metadata={'url': 'https://www.courtlistener.com/api/rest/v3/opinions/5901619/', 'dataset': 'courtlistener_opinions'}, page_content='—Appeal by the defendant from a judgment of the Supreme Court, Queens County (Brennan, J.), rendered December 1, 1983, adjudicating him a youthful offender, upon his plea of guilty to robbery in the first degree (two counts), robbery in the second degree, and assault in the first degree (two counts), and imposing sentence.\nOrdered that the judgment is affirmed.\nWe have reviewed the record and agree with the defendant’s assigned counsel that there are no meritorious issues which could be raised on appeal. Counsel’s application for leave to withdraw as counsel is granted (see, Anders v Califor*580nia, 386 US 738; People v Paige, 54 AD2d 631; cf., People v Gonzalez, 47 NY2d 606). Mollen, P. J., Bracken, Rubin and Spatt, JJ., concur.')

Shuffle the chunks so that, when the upload is limited to a subset, the subset is a random sample across all datasets

In [35]:
# shuffle the documents
import random
random.shuffle(chunked_documents)
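
If the uploaded subset should be reproducible between runs, the shuffle can be seeded. A minimal sketch; the helper name `shuffled_copy` and the seed value are arbitrary choices, not taken from the notebook.

```python
import random

def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    result = list(items)
    rng.shuffle(result)
    return result

items = list(range(10))
first_run = shuffled_copy(items)
second_run = shuffled_copy(items)
print(first_run == second_run)
```

Shuffling a copy also keeps the original order available, which helps when chunks need to be traced back to their position in the source document.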

In [36]:
chunked_documents[0]

Document(metadata={'url': 'http://www.supremecourt.gov/DocketPDF/20/20-382/169864/20210224143811931_2021-02-22%20Guam%20JA.pdf', 'dataset': 'scotus_filings'}, page_content='Security Act of 1947, as\\namended in 1949.\\n\\n2\\n\\nU.S. EPA Final Record of Decision, Ordot Landfill\\nSuperfund Site, September 1988.\\n\\n\\x0cJA-64\\nNew London located at Groton, Connecticut 06349.\\nDefendant Navy may be served via certified mail\\nreturn receipt requested at the following three\\naddresses:\\nUnited States of America\\nDepartment of the Navy\\nGeneral Litigation Division\\nAuthorized Agent for Service of Legal\\nDocuments\\n875 N Randolph Street\\nArlington, VA 22217\\nUnited States of America\\nDepartment of the Navy\\nGeneral Litigation Division\\nAuthorized Agent for Service of Legal\\nDocuments\\n720 Kennon Street, SE\\nWashington, DC 20374\\nUnited States of America\\nDepartment of the Navy\\nGeneral Litigation Division\\nAuthorized Agent for Service of Legal\\nDocuments\\n1322 Patte

Upload to database

In [None]:
from tqdm import tqdm
# Add the chunked documents to the Chroma database
limit = 100000  # upload at most this many chunks
failed = []  # start indices of the batches that could not be added
step_len = 5000
for i in tqdm(range(0, len(chunked_documents[:limit]), step_len)):
    try:
        vector_store.add_documents(
            documents=chunked_documents[i:i+step_len],
            ids=uuids[i:i+step_len],  # pass the pre-generated IDs via the ids keyword
            )
    except Exception as e:
        print(f"Failed to add documents {i} to {i+step_len} to the Chroma database: {e}")
        failed.append(i)


100%|██████████| 20/20 [29:29<00:00, 88.46s/it]
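
The loop above records failed batches but never retries them. A minimal sketch of a batch uploader with retries follows. It is written against a generic `add_batch` callable so it runs without a database; the names `upload_in_batches`, `add_batch`, and `max_retries` are hypothetical, and `flaky_add` only simulates a store that fails once.

```python
def upload_in_batches(items, ids, add_batch, batch_size=5000, max_retries=2):
    """Send items in batches; return (start_index, error) for batches that never succeeded."""
    failed = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        batch_ids = ids[start:start + batch_size]
        last_error = None
        for _ in range(max_retries + 1):
            try:
                add_batch(batch, batch_ids)
                break  # batch stored, move on to the next one
            except Exception as exc:
                last_error = exc
        else:
            # Runs only when every attempt raised
            failed.append((start, str(last_error)))
    return failed

# Simulated store that fails on the very first call only
calls = {"n": 0}
stored = []

def flaky_add(batch, batch_ids):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("temporary failure")
    stored.extend(zip(batch_ids, batch))

failed_batches = upload_in_batches(list("abcdef"), list(range(6)), flaky_add, batch_size=2)
print(failed_batches, len(stored))
```

In this notebook, `add_batch` would be a thin wrapper around `vector_store.add_documents`.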


Query results

In [14]:
# Query the Chroma database 
query = "Can police search my phone?"
results = vector_store.similarity_search(query, k=5)
print(f"Query: {query}\n")
for i, result in enumerate(results):
    print(f"Result {i}: {result.page_content}\n---\nMetadata: {result.metadata}\n")
    print()

Query: Can police search my phone?

Result 0: a warrant solely to obtain the telephone number and owner-
ship identification. In the present case, police removed the
phone’s SIM card and processed it for the limited purpose of
obtaining the telephone number. I recognize that even small
manipulations of personal property have been held to be
Fourth Amendment searches. See Arizona v. Hicks, 480 U.S.
821, 324-25, 107 S.Ct. 1149, 94 L.Ed.2d 347 (1987) (holding a
search occurred when a police officer briefly moved stereo
equipment inside a defendant’s apartment in order to record
the equipment’s serial numbers). However, under the facts of
this case, law enforcement’s limited search of the SIM card to
obtain the phone number did not constitute an unreasonable
search under the Fourth Amendment because Moore had no
reasonable expectation of privacy in the number itself.

Of significance here is the fact that police obtained a
warrant before performing further analysis to examine the
phone’s c

ChatGPT-enhanced version of the query

In [15]:
# Query the Chroma database 
query = "Find Supreme Court rulings and case law regarding digital privacy, Fourth Amendment protections, and warrant requirements for phone searches. Include Riley v. California and Carpenter v. United States."
results = vector_store.similarity_search(query, k=5)
print(f"Query: {query}\n")
for i, result in enumerate(results):
    print(f"Result {i}: {result.page_content}\n---\nMetadata: {result.metadata}\n")
    print()

Query: Find Supreme Court rulings and case law regarding digital privacy, Fourth Amendment protections, and warrant requirements for phone searches. Include Riley v. California and Carpenter v. United States.

Result 0: Learn more about the Roberts Court and the Fourth Amendment in Shifting Scales, a nonpartisan Oyez resource.
---
Metadata: {'dataset': 'scotus_case_concise', 'description': "A case in which the Court held that prisons that conduct suspicion-less strip searches on incoming inmates do not violate the prisoner's Fourth Amendment rights.", 'id': 55852, 'name': 'Florence v. Board of Chosen Freeholders of the County of Burlington'}


Result 1: 2:16-cr-00216-KJD-VCF Document 43 Filed 04/17/17 Page 5 of 18\n\n2. Stephen Torrez\xe2\x80\x99s Search of the SD Card Does Not Constitute Government Action\n\xe2\x80\x9c[T]he Fourth Amendment generally does not protect against unreasonable intrusions by private\n\n2\n3\n\nindividuals.\xe2\x80\x9d United States v. Reed, 15 F.3d 928, 930-

In [39]:
client = chromadb.PersistentClient(path="./database")

# List available collections
print(client.list_collections())

['test_collection']


In [None]:
collection = client.get_collection("test_collection")

# Get the number of documents in the collection
print(collection.count())

# Get the metadata of the first document

100000


AttributeError: 'Collection' object has no attribute 'get_metadata'

## Embedding

Remove stopwords from the text before embedding

In [None]:
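# Remove stopwords and stem the remaining words.
# Requires the NLTK data packages 'stopwords' and 'punkt'
# ('punkt_tab' on newer NLTK releases), installed with nltk.download(...)
def preprocess_text(text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.stem.PorterStemmer()

    tokens = nltk.word_tokenize(text)
    # Compare in lowercase, since the NLTK stopword list is lowercase
    tokens = [stemmer.stem(token) for token in tokens if token.lower() not in stop_words]
    return " ".join(tokens)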

Stem words before embedding

Build a pipeline that preprocesses queries the same way as the documents (stopword removal and stemming)
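
Such a query pipeline could be sketched as a chain of token-level steps, so that documents and queries pass through exactly the same functions. The sketch below uses only the standard library: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's stopword list, stemming is left out, and the names `tokenize`, `make_pipeline`, and `preprocess_query` are assumptions.

```python
import re

# Tiny hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOP_WORDS = {"a", "an", "the", "can", "my", "of", "to"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def drop_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def make_pipeline(*steps):
    """Compose token-level steps into a single text -> text function."""
    def run(text):
        tokens = tokenize(text)
        for step in steps:
            tokens = step(tokens)
        return " ".join(tokens)
    return run

preprocess_query = make_pipeline(lowercase, drop_stop_words)
print(preprocess_query("Can police search my phone?"))  # police search phone
```

A stemming step, for example one built on NLTK's PorterStemmer, could be appended as one more function in the `make_pipeline` call.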

Calculate embeddings