# Text embedding pipeline

## Import libraries

In [5]:
from datasets import load_dataset, concatenate_datasets
import nltk

## Data
Read the datasets

### Pile of law
A curated large corpus of legal and administrative data (text).

**Best subsets for RAG**:
These subsets contain authoritative, well-structured, and extensively reasoned legal material that suits RAG systems:
- courtlistener_opinions: Official U.S. court opinions with clearly structured arguments and extensive references to statutes, regulations, and precedents.
- scotus_filings: Legal argumentation and reasoning by litigants before the Supreme Court, with rich citation networks and a clear structure, useful for retrieving authoritative precedents.

In [23]:
n_sample = 10000
courtlistener_opinions = load_dataset("pile-of-law/pile-of-law", "courtlistener_opinions", split="train", streaming=True).take(n_sample)
scotus_filings = load_dataset("pile-of-law/pile-of-law", "scotus_filings", split="train", streaming=True).take(n_sample)



### Project Oyez
Transcripts of SCOTUS oral arguments, produced from the audio recordings with OpenAI Whisper.

In [3]:
project_oyez = load_dataset("Juh6973/project_oyez_oral_arguments_2000-2024")

Build an additional dataset from the key components of each case (facts, question, and conclusion).

In [7]:
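def create_document(example):
    """Combine the facts, question, and conclusion of a case into one text field."""
    # Missing fields arrive as None, so fall back to an empty string before stripping.
    facts = (example['facts_of_the_case'] or '').strip()
    question = (example['question'] or '').strip()
    conclusion = (example['conclusion'] or '').strip()

    example["document"] = f"Facts: {facts}\nQuestion: {question}\nConclusion: {conclusion}"
    return example

# Add the combined "document" column, then drop the columns it replaces
# along with the audio and transcript columns that this dataset does not need.
scotus_case_concise = project_oyez.map(create_document)
scotus_case_concise = scotus_case_concise.remove_columns([
    'facts_of_the_case', 'question', 'conclusion',
    'audio_links', 'oral_arguments', '__index_level_0__',
])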


### Harvard Caselaw
A dataset of scanned case law PDFs converted into text with the Tesseract OCR engine.

In [30]:
harvard_caselaw = load_dataset("Juh6973/caselaw_latest_volumes_by_state")



## Preprocessing

Unify the data schemas
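One way to approach this step is to map every source onto the same two fields. The sketch below is a minimal illustration on plain dictionaries; the helper name `unify_records`, the target fields `text` and `source`, and the toy rows are assumptions for illustration, and the actual column names of each dataset should be checked before use.

```python
from typing import Any, Dict, Iterable, Iterator


def unify_records(
    records: Iterable[Dict[str, Any]],
    text_field: str,
    source: str,
) -> Iterator[Dict[str, str]]:
    """Yield records reduced to a shared {"text", "source"} schema.

    `text_field` names the column holding the document text in this source.
    Records whose text is missing or empty are skipped.
    """
    for record in records:
        text = (record.get(text_field) or "").strip()
        if text:
            yield {"text": text, "source": source}


# Toy rows standing in for two sources with different column names.
oyez_rows = [{"document": "Facts: A\nQuestion: B\nConclusion: C"}, {"document": None}]
opinion_rows = [{"text": "  The judgment is affirmed.  ", "url": "hypothetical"}]

unified = list(unify_records(oyez_rows, "document", "project_oyez"))
unified += list(unify_records(opinion_rows, "text", "courtlistener_opinions"))
```

Keeping a `source` field makes it possible to filter or weight retrieval results by origin later on.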

Remove special characters with regex
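A minimal sketch of this cleaning step, using only the standard `re` module. The set of characters to keep is an assumption: it preserves common punctuation and the legal symbols § and ¶, and would likely need tuning for OCR or transcript noise in the real data.

```python
import re

# Anything that is not a word character, whitespace, or one of the listed
# punctuation marks is treated as a "special character" (an assumption).
_DISALLOWED = re.compile(r"[^\w\s.,;:!?'\"()\[\]§¶&/%$-]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace disallowed characters with spaces, then collapse whitespace."""
    text = _DISALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


sample = "Smith v.\x0c Jones,  § 1983 ~~ *held*\n"
cleaned = clean_text(sample)
```

Replacing with a space rather than deleting avoids gluing two words together when a stray symbol sits between them.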

Split the documents into chunks
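Chunking could be done in several ways; the sketch below uses fixed-size word windows with overlap, which is a common baseline rather than the method this notebook settles on. The function name and the default `chunk_size` and `overlap` values are assumptions.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


demo_chunks = chunk_words(" ".join(str(i) for i in range(10)), chunk_size=4, overlap=1)
```

Sentence-aware splitting (for example with NLTK's sentence tokenizer) is an alternative when chunk boundaries should not fall mid-sentence.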

## Embedding

Remove stopwords from the text before embedding
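A minimal sketch of stopword removal. The tiny `STOPWORDS` set and the simple lowercase tokenizer are assumptions made to keep the example self-contained; NLTK's English list (`nltk.corpus.stopwords.words("english")`, available after `nltk.download("stopwords")`) could be passed in instead.

```python
import re
from typing import AbstractSet, List

# Deliberately small illustrative list, not a complete stopword inventory.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "that", "for", "by", "on"})


def remove_stopwords(text: str, stopwords: AbstractSet[str] = STOPWORDS) -> List[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stopwords]


kept = remove_stopwords("The opinion of the Court is reversed.")
```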

Stem the remaining words before embedding
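Since the notebook already imports `nltk`, one option for stemming is its `PorterStemmer`, which needs no extra corpus download. This is a sketch under that assumption; the helper name `stem_tokens` is hypothetical, and another stemmer or a lemmatizer may suit legal text better.

```python
from typing import Iterable, List

from nltk.stem import PorterStemmer

# A single stemmer instance is reused across calls.
_stemmer = PorterStemmer()


def stem_tokens(tokens: Iterable[str]) -> List[str]:
    """Reduce each token to its Porter stem."""
    return [_stemmer.stem(token) for token in tokens]


stems = stem_tokens(["courts", "running", "arguing"])
```

Porter stems are not always dictionary words, which is acceptable here as long as documents and queries are stemmed the same way.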

Build a pipeline that preprocesses queries the same way as the documents (stopword removal and stemming)
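The query pipeline can be sketched as a small factory that closes over a stopword list and a stemming function, so the same configuration is applied to every query. The name `make_query_preprocessor` is hypothetical, and the toy stemmer in the example merely strips a trailing "s"; in practice something like `nltk.stem.PorterStemmer().stem` could be passed as `stem`.

```python
import re
from typing import Callable, Iterable, List


def make_query_preprocessor(
    stopwords: Iterable[str],
    stem: Callable[[str], str] = lambda token: token,
) -> Callable[[str], List[str]]:
    """Return a function that tokenizes, removes stopwords, then stems."""
    stopword_set = frozenset(word.lower() for word in stopwords)

    def preprocess(query: str) -> List[str]:
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        return [stem(token) for token in tokens if token not in stopword_set]

    return preprocess


# Toy configuration for illustration only.
preprocess_query = make_query_preprocessor(
    stopwords={"the", "of", "is", "what"},
    stem=lambda token: token.rstrip("s"),
)
query_tokens = preprocess_query("What is the scope of Miranda warnings?")
```

Stopwords are filtered before stemming so that the stopword list can stay in its unstemmed form.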

Calculate embeddings
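The notebook does not yet name an embedding model, so the sketch below uses a hashed bag-of-words vector purely as a placeholder to show the shape of this step: tokens in, fixed-length normalized vector out, cosine similarity for comparison. It is not a learned embedding, and the function names and the `dim` default are assumptions.

```python
import hashlib
import math
from typing import List, Sequence


def hash_embed(tokens: Sequence[str], dim: int = 256) -> List[float]:
    """Map tokens to an L2-normalized count vector using the hashing trick."""
    vector = [0.0] * dim
    for token in tokens:
        # hashlib gives a hash that is stable across Python processes,
        # unlike the built-in hash() for strings.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two already-normalized vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


doc_vector = hash_embed(["court", "revers", "judgment"])
query_vector = hash_embed(["judgment", "court", "revers"])
similarity = cosine(doc_vector, query_vector)
```

A sentence-embedding model would replace `hash_embed` in a real pipeline, while the normalize-then-dot-product comparison stays the same.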