# About Document-Word Matrix Construction for Legal Topic Modeling

This notebook is a core part of the **NM Law Data Pipeline**. It transforms cleaned legal texts into a **document-word matrix (DWM)** suitable for unsupervised topic modeling via **Hierarchical Nonnegative Matrix Factorization (HNMFk)** and related methods.

---

## Purpose

Legal texts, such as statutes, constitutional clauses, and case opinions, are rich in content but complex in structure. Before they can be used for clustering, topic modeling, or graph generation, they must be transformed into a numeric format that preserves the term usage of each document.

This notebook performs that transformation by:
1. **Loading** cleaned legal data from prior scraping and formatting steps,
2. **Cleaning and consolidating** the vocabulary using domain-specific text processors,
3. **Extracting a refined vocabulary**, and
4. **Constructing a sparse matrix** mapping each document to the TF-IDF weights of its terms.

---


# Define Paths 

This section initializes the file system paths required for reading the structured CSVs and saving intermediate results. These paths should point to:

- Preprocessed legal documents (e.g., statutes, cases),
- Vocabulary consolidation tools and configuration,
- Output directories for document-word matrices and vocabularies.

These paths form the backbone of this pipeline stage and must be consistent across notebook runs.
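
As a minimal sketch of how these paths might be checked before the rest of the notebook runs, the helper below fails early when the input CSV is missing and creates the output directory when needed. `check_paths` is a hypothetical helper (not part of TELF), and the file name is only a placeholder.

```python
import pathlib
import tempfile

def check_paths(csv_path, results_dir):
    """Hypothetical helper: validate the input CSV and prepare the output directory."""
    csv_path = pathlib.Path(csv_path)
    results_dir = pathlib.Path(results_dir)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Input CSV not found: {csv_path}")
    results_dir.mkdir(parents=True, exist_ok=True)
    return csv_path.resolve(), results_dir.resolve()

# Demonstration in a throwaway directory with a placeholder CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = pathlib.Path(tmp) / "STATUTE.csv"
    demo_csv.write_text("content\nplaceholder text\n")
    csv_ok, results_ok = check_paths(demo_csv, pathlib.Path(tmp) / "results")
    results_created = results_ok.is_dir()
```

Raising before any directory is created keeps a mistyped input path from leaving empty result folders behind.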


In [None]:
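import os
import pathlib

# Set this to the cleaned CSV produced by the earlier pipeline steps
# (e.g., STATUTE.csv); pd.read_csv in the next cell fails while it is None.
CSV_PATH = None
RESULTS_DIR = pathlib.Path('./results')
RESULTS_DIR.resolve().mkdir(parents=True, exist_ok=True)
# Sub-directory (under RESULTS_DIR) that receives the acronym-detection output
RESULTS_FILE = pathlib.Path('operated_documents')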

# Load Data

### Define column to extract as documents

This step loads the cleaned and structured legal texts generated in earlier pipeline steps. You must specify:

- The target CSV file (e.g., `STATUTE.csv`, `SUPREME.csv`, etc.),
- The column containing the primary text data (usually `content`, `provision_text`, or `opinion_text`).

This column is extracted and loaded into memory for vocabulary construction and matrix generation.
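
A small, self-contained sketch of this loading pattern is shown below. The in-memory CSV, the `doc_id` column, and the choice of `content` as the text column are assumptions for illustration; a real run would read the file at `CSV_PATH` instead.

```python
import io

import pandas as pd

# Stand-in for a file such as STATUTE.csv; the rows are made up.
csv_text = io.StringIO(
    "doc_id,content\n"
    "a1,No person shall be deprived of property without due process.\n"
    "a2,\n"
    "a3,The court shall have jurisdiction over civil matters.\n"
)

TEXT_COLUMN = "content"  # assumed column name; adjust to the CSV being loaded

df_demo = pd.read_csv(csv_text)
df_demo = df_demo.dropna(subset=[TEXT_COLUMN])    # drop rows with no text
documents_demo = df_demo[TEXT_COLUMN].to_dict()   # {row index: text}
```

Because rows without text are dropped, the dictionary keys are the surviving row indices and need not be contiguous.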


In [None]:
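import pickle

import pandas as pd

df = pd.read_csv(CSV_PATH)
# This cell assumes 'title' and 'abstract' columns. For legal CSVs, build
# 'title_abs' from the text column described above instead (e.g., 'content').
df['title_abs'] = df['title'] + ' ' + df['abstract']
# A missing title or abstract makes the concatenation NaN; drop those rows.
df = df.dropna(subset=['title_abs'])
df.info()
# Map each row index to its text: {index: document}
documents = df['title_abs'].to_dict()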

# Vocabulary refinement through cleaning

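This step standardizes and cleans the raw text using Vulture, the modular pre-processing framework in TELF.

Tasks performed by the `SimpleCleaner` configuration below include:
- Lowercasing,
- Removing numbers, special characters, and non-ASCII text,
- Stripping e-mail addresses, tags, and copyright statements,
- Dropping stop words and stop phrases,
- Discarding tokens shorter than `min_characters` (3 here).

Filtering of very frequent or very rare terms happens later, when the vocabulary is extracted with `min_df` and `max_df`.

This reduces noise in the vocabulary used for downstream matrix decomposition.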
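
To make the effect of these steps concrete, here is a plain-Python sketch of a few of them (lowercasing, stripping non-letters, dropping stop words and short tokens). It is not Vulture's implementation; `simple_clean` and the tiny stop-word list are hypothetical and exist only for illustration.

```python
import re

STOP_WORDS_DEMO = {"the", "of", "and", "shall", "be", "to", "in"}  # tiny illustrative list

def simple_clean(text, stop_words=STOP_WORDS_DEMO, min_characters=3):
    """Illustrative cleaner: lowercase, strip non-letters, drop stop words and short tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # digits, punctuation, non-ASCII -> space
    tokens = text.split()
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_characters]
    return " ".join(kept)

cleaned = simple_clean("The Court of Appeals shall review §12-3 claims in 2021.")
# cleaned == "court appeals review claims"
```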


In [None]:
from TELF.pre_processing import Vulture
from TELF.pre_processing.Vulture.modules import AcronymDetector
from TELF.pre_processing.Vulture.modules import SimpleCleaner
from TELF.pre_processing.Vulture.default_stop_words import STOP_WORDS
from TELF.pre_processing.Vulture.default_stop_phrases import STOP_PHRASES
from TELF.pre_processing import Beaver
from TELF.pre_processing.Vulture.tokens_analysis.vocab_consolidator import VocabularyConsolidator

In [None]:
vulture = Vulture(n_jobs  = -1, 
                  verbose = 10,  # Disable == 0, Verbose >= 1
                 )
steps = [SimpleCleaner( min_characters=3,
                        stop_words = STOP_WORDS,
                        stop_phrases = STOP_PHRASES,
                        order = [
                                'remove_numbers',
                                'standardize_hyphens',
                                'remove_stop_phrases',
                                'isolate_frozen',
                                'remove_copyright_statement',
                                'make_lower_case',
                                'remove_formulas',
                                'normalize',
                                'remove_next_line',
                                'remove_email',
                                'remove_()',
                                'remove_[]',
                                'remove_special_characters',
                                'remove_nonASCII_boundary',
                                'remove_nonASCII',
                                'remove_tags',
                                'remove_stop_words',
                                'remove_standalone_numbers',
                                'remove_extra_whitespace',
                                'min_characters',
        ])]

CLEAN_DOCS = os.path.join(RESULTS_DIR, "clean_documents")

vulture.clean(  documents, 
                steps=steps,
                save_path=CLEAN_DOCS)         

with open(CLEAN_DOCS, 'rb') as f:
    clean_documents = pickle.load(f)

# Build acronyms


This step detects acronyms in the legal text (e.g., "UCC", "SCOTUS") and maps each one to its expanded form.

This improves semantic resolution by:
- Counting an acronym and its expansion as a single term rather than as unrelated tokens,
- Linking short forms to their original terms in the matrix.

This is useful for clustering and document-understanding tasks, where expanded terms carry more meaning.
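
The idea can be sketched in a few lines of plain Python. This is not how `AcronymDetector` works internally; it is a simplified stand-in that assumes acronyms are introduced as "long form (ACR)" and that the initials of the preceding words spell the acronym. `find_acronyms` and `apply_substitutions` are hypothetical names.

```python
import re

def find_acronyms(text):
    """Return {long form: acronym} for 'long form (ACR)' patterns whose initials match."""
    found = {}
    for match in re.finditer(r"\(([A-Za-z]{2,})\)", text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]   # as many words as the acronym has letters
        initials = "".join(w[0] for w in candidate)
        if len(candidate) == len(acronym) and initials.lower() == acronym.lower():
            found[" ".join(candidate).lower()] = acronym.lower()
    return found

def apply_substitutions(text, substitutions):
    """Replace longer keys first so long forms are handled before bare acronyms."""
    for src in sorted(substitutions, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(src)}\b", substitutions[src], text)
    return text

demo_text = "the uniform commercial code (ucc) governs sales. the ucc applies statewide."
demo_acronyms = find_acronyms(demo_text)

# Map both the long form and the acronym to one underscore-joined token.
demo_substitutions = {}
for long_form, acronym in demo_acronyms.items():
    token = "_".join(long_form.split())
    demo_substitutions[long_form] = token
    demo_substitutions[acronym] = token

replaced = apply_substitutions(demo_text, demo_substitutions)
```

After substitution, every mention of the concept shares one token, so its frequency is no longer split between the acronym and the spelled-out form.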


In [None]:
OPERATION_RESULTS = (RESULTS_DIR / RESULTS_FILE)
OPERATION_RESULTS.mkdir(parents=True, exist_ok=True)

vulture.operate(clean_documents,
                steps=[AcronymDetector(replace_raw=True)],
                save_path=RESULTS_DIR,
                file_name=RESULTS_FILE)
with open(OPERATION_RESULTS / '_AcronymDetector.p', 'rb') as f:
    operated_documents = pickle.load(f)

def to_df(documents, operated_documents):
    data = {
        'id': [],
        'text': [],
        'acronyms': [],
        'acronym_replaced_text': [],
    }
    for _id, text in documents.items():
        result = operated_documents.get(_id, {})  # empty if the id has no result
        data['id'].append(_id)
        data['text'].append(text)
        data['acronyms'].append(result.get('Acronyms'))
        data['acronym_replaced_text'].append(result.get('replaced_text'))
    return pd.DataFrame.from_dict(data)

    
# Use the cleaned text so the SimpleCleaner output carries into later steps
df = to_df(clean_documents, operated_documents)
df.head()

In [None]:
# Map every long form and its acronym to one underscore-joined token,
# e.g. 'uniform commercial code' and 'ucc' -> 'uniform_commercial_code'.
substitutions = {}
for doc_id, acronym_data in operated_documents.items():
    for src_txt, acronym in acronym_data['Acronyms'].items():
        sub_to = '_'.join(src_txt.split())
        substitutions[src_txt] = sub_to
        substitutions[acronym] = sub_to

for src, sub in substitutions.items():
    print(f'{src} : {sub}')

from TELF.pre_processing.Vulture.modules import SubstitutionCleaner
initial_sub = SubstitutionCleaner(substitutions, permute=False, lower=True, lemmatize=False)
step1 = [initial_sub] 
dataframe_clean_args = {
    "df": df,
    "steps": step1,
    "columns": ['text',],
    "append_to_original_df": True,
    "concat_cleaned_cols": True,
}

df = vulture.clean_dataframe(**dataframe_clean_args) 
df.info()


# Consolidate terms

This step refines the vocabulary by merging different lexical variants of the same concept.

For example:
- Plurals → Singulars (e.g., "laws" → "law"),
- Variants → Canonical forms (e.g., "defences" → "defense").

The goal is to reduce redundancy in the vocabulary and improve topic resolution during matrix factorization.
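
One simple way to approximate this kind of merging is string similarity: map each term to a more frequent term that is nearly identical. The sketch below uses `difflib.SequenceMatcher` from the standard library with an arbitrary 0.85 threshold. It is an illustration under those assumptions, not the algorithm used by `VocabularyConsolidator`, and the term counts are invented.

```python
from collections import Counter
from difflib import SequenceMatcher

def consolidate(vocab_counts, threshold=0.85):
    """Map each term to a more frequent, highly similar term when one exists."""
    # Visit the most frequent terms first so they become the canonical forms.
    terms = sorted(vocab_counts, key=lambda t: (-vocab_counts[t], t))
    canonical = []   # terms kept unchanged
    mapping = {}     # variant -> canonical term
    for term in terms:
        for kept in canonical:
            if SequenceMatcher(None, term, kept).ratio() >= threshold:
                mapping[term] = kept
                break
        else:
            canonical.append(term)
    return mapping

counts = Counter({"defense": 40, "defence": 6, "defenses": 3,
                  "statute": 25, "statutes": 9, "tort": 12})
mapping = consolidate(counts)
```

A purely character-based rule like this can also merge unrelated words that happen to look alike, which is why reviewing the list of changes before using the result is worthwhile.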


In [None]:
ACS_RESULT = RESULTS_DIR / 'clean_acronyms.csv'
df.to_csv(ACS_RESULT)
documents = df.clean_text.to_dict()
CONSOLIDATE_PATH = (RESULTS_DIR / 'VOCAB_CONSOLIDATOR')
CONSOLIDATE_PATH.mkdir(parents=True,exist_ok=True)
CONSOLIDATE_OUT = CONSOLIDATE_PATH / '_SubstitutionOperator.p'

consolidator = VocabularyConsolidator()
changes_made_file = 'VOCAB_CONSOLIDATOR_changes.csv'
changes_made_save_path = (RESULTS_DIR / changes_made_file)
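# The consolidated texts are written under CONSOLIDATE_PATH and the list of
# merged terms to changes_made_save_path; the return value is not needed here.
consolidator.consolidate_terms(vocabulary=None,
                               texts=documents,
                               ignore_pairs=[('', '')],
                               changes_made_save_path=changes_made_save_path,
                               operated_text_save_path=str(CONSOLIDATE_PATH))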

# Restore data to df


Once the cleaned and consolidated vocabulary is finalized, this section restores it to a structured DataFrame that includes:

- Document IDs,
- Cleaned document text (tokenized or raw),
- Metadata columns, if needed (e.g., title, source, year).

This step ensures compatibility with matrix generation functions and enables traceability.
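
When writing results back, aligning by document ID is more robust than relying on row order. The sketch below assumes the consolidated output is a dictionary of the form `{id: {"replaced_text": ...}}`; the data and the `df_demo` frame are invented for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"id": [0, 2, 5], "title": ["Sec. 1", "Sec. 3", "Sec. 6"]})

# Stand-in for the consolidator output, deliberately in a different order.
consolidated = {
    5: {"replaced_text": "court jurisdiction civil"},
    0: {"replaced_text": "due process property"},
    2: {"replaced_text": "statute limitation claim"},
}

# Look up each row's text by id so dictionary order cannot misalign rows.
df_demo["cleaned_acs_consolidated"] = df_demo["id"].map(
    lambda doc_id: consolidated.get(doc_id, {}).get("replaced_text", "")
)
```

An ID with no consolidated entry receives an empty string instead of shifting every later row by one.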


In [None]:
df_changed = pd.read_csv(changes_made_save_path)
with open(CONSOLIDATE_OUT, "rb") as f:
    substitution_consolidate = pickle.load(f)
print(substitution_consolidate.keys())
# Collect the consolidated text per document; this relies on the dictionary
# preserving the same document order as the rows of df.
consolidate_data = [doc.get('replaced_text', '')
                    for doc in substitution_consolidate.values()]
df['cleaned_acs_consolidated'] = consolidate_data
df.to_csv(RESULTS_DIR/'cleaned_consolidated_supreme.csv', index=False)
df.info()

# Extract Vocabulary


Builds the final list of unique vocabulary terms across all cleaned documents.

This vocabulary will define the columns of the document-word matrix (DWM), where each term is a dimension in the resulting sparse matrix. This also forms the base for co-occurrence graphs and topic keyword extraction.
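
The sketch below shows one common way to filter a vocabulary by document frequency. It assumes `min_df` is an absolute document count and `max_df` is a fraction of the corpus, as in scikit-learn; whether Beaver interprets the settings identically is an assumption. `extract_vocabulary` is a hypothetical helper and the documents are invented.

```python
from collections import Counter

def extract_vocabulary(docs, min_df=2, max_df=0.7):
    """Keep terms appearing in at least min_df documents and at most max_df of them."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.split()))   # count each term once per document
    return sorted(t for t, c in doc_freq.items()
                  if c >= min_df and c <= max_df * n_docs)

demo_docs = [
    "court statute appeal",
    "court statute property",
    "court contract property",
    "court tort",
]
vocab_demo = extract_vocabulary(demo_docs)
# "court" appears in every document and is dropped; terms seen once are dropped too.
```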


In [None]:
from TELF.pre_processing import Beaver

beaver = Beaver()
settings = {
    "dataset":df,
    "target_column":"cleaned_acs_consolidated",
    "min_df":50,
    "max_df":0.7,
    'save_path':RESULTS_DIR
}

vocabulary = beaver.get_vocabulary(**settings)
len(vocabulary)

In [None]:
# Reload the vocabulary file written to RESULTS_DIR by the previous cell
with open(RESULTS_DIR / 'Vocabulary.txt') as f:
    VOC = [w.strip() for w in f]
len(VOC)


# Build the document word matrix for decomposition

Generates the actual document-word matrix (DWM), where:

- **Rows** represent individual legal documents,
- **Columns** represent unique vocabulary terms,
- **Values** are TF-IDF scores.

This matrix serves as the input for **Hierarchical Nonnegative Matrix Factorization (HNMFk)** or similar decomposition algorithms. The result enables interpretable topic models, legal concept discovery, and semantic clustering of legal texts.
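
As a rough illustration of what such a matrix contains, the sketch below builds TF-IDF weights in a sparse `{(row, column): value}` form using only the standard library. It assumes raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, with no row normalization; the exact weighting Beaver applies may differ. `tfidf_sparse` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_sparse(docs, vocabulary):
    """Return {(row, col): weight} using raw term frequency times a smoothed IDF."""
    col = {term: j for j, term in enumerate(vocabulary)}
    n_docs = len(docs)
    # Keep only tokens that belong to the fixed vocabulary.
    tokenized = [[t for t in d.split() if t in col] for d in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in col}
    matrix = {}
    for i, tokens in enumerate(tokenized):
        for term, tf in Counter(tokens).items():
            matrix[(i, col[term])] = tf * idf[term]
    return matrix

tfidf_docs = ["statute statute property", "property contract", "tort"]
tfidf_vocab = ["contract", "property", "statute"]
X_sparse = tfidf_sparse(tfidf_docs, tfidf_vocab)
```

Only nonzero entries are stored, and the third document has no entries at all because none of its words are in the vocabulary.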


In [None]:
settings = {
    "dataset" : df,
    "target_column" : "cleaned_acs_consolidated",
    "options" : { "vocabulary" : VOC },
    "matrix_type" : "tfidf",
    "save_path" : RESULTS_DIR
}

beaver.documents_words(**settings)

# HNMFk

Use the generated document-word matrix to run the HNMFk code in the local file `03-1_run_hnmfk.py`.

HNMFk is defined in TELF [here](../../HNMFk/00-HNMFk.ipynb).
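
For intuition about what the decomposition step does with the matrix, here is a sketch of plain nonnegative matrix factorization with multiplicative updates in NumPy. This is ordinary NMF with a fixed number of topics, not HNMFk and not TELF's implementation; the `nmf` function, the toy matrix, and the choice `k=2` are assumptions for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor X (docs x terms) into nonnegative W (docs x k) and H (k x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update topic-term weights
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update document-topic weights
    return W, H

# Toy matrix with two term groups (columns 0-1 and columns 2-3).
X_demo = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf(X_demo, k=2)
relative_error = np.linalg.norm(X_demo - W @ H) / np.linalg.norm(X_demo)
```

Each row of `H` can be read as a topic's term weights and each row of `W` as a document's topic weights.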