<div class="steps">

# <b>NEWSROOM LEGISLATION COMPARISON TOOL</b>
    
## COMPARE STATE BILLS TO EACH OTHER, AND TO FEDERAL, LOBBYING, and INDUSTRY DOCUMENTS.
<br>
<div class="steps-title3">Customizing notes:</div>
<div class="steps-body">

- Edit EMBEDDING_CONFIG below with folder paths and file patterns 
- Run load_all_embeddings() once at the start
- Make sure to upload documents with streamlined naming conventions for wildcard search (*)
- Make sure to utilize "reshape" whenever loading one document as embedding: \
   if Y_embedding.ndim == 1:  
       Y_embedding = Y_embedding.reshape(1, -1)</div>

<div class="steps-title3">Function usage:</div>
<div class="steps-body">
<br>
<u>MEGA-FUNCTION:</u> Use compare_any_doc() to compare one document to all categories at once AND to three specific docs of your choice
<br>
<u>ONE-TO-ONE FUNCTION:</u> Use compare_state_to_state to quickly compare state bills to each other 
<br>
<u>HELPER FUNCTIONS:</u>
    
- Use load_embeddings to load any kinds of documents all at once from EMBEDDING_CONFIG
- Use list_all_docs() to list documents in one category folder
- list_docs_by_category() to remind yourself what docs you have where</div>

<br>
<div class="steps-title3">File Structure and Naming Conventions:</div>
<div class="steps-body">
    
- all pdfs of any category go in one folder at the level of your notebook
- all pdfs should be labeled with their category before the full doc name, i.e. "lobbying_dec12_purposestmt.pdf"
- this way you will only have to navigate through a few folders instead of many sub-folders</div>

</div>

In [78]:
from IPython.display import HTML

HTML("""
<style>
.steps {
  border: 2px solid #2A27F5;
  padding-left: 14px;
  color: #080342;
  background: #F0F0FA;
}


.steps-title {
  font-size: 28px;
  font-weight: bold;
}

.steps-title2 {
  font-size: 22px;
  font-weight: bold;
}

.steps-title3 {
  font-size: 18px;
  font-weight: bold;
}

.steps-body {
  font-size: 14px;
}

.my-takeaways {
  font-size: 30px;
  font-weight: bold;
  padding: 6px;
  background: #F7E9E6;
  color: #F2441D;
}
</style>
""")



In [4]:
# pip install pymupdf tqdm

In [6]:
# pip install sentence-transformers

In [10]:
##import tools/libraries
import fitz
import re
from pathlib import Path
from tqdm import tqdm
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
import glob
import os
from sklearn.metrics.pairwise import cosine_similarity

<div class="steps">

  <div class="steps-title">
    Step 1: Create directories to documents for import
  </div>

  <div class="steps-body">
      
  - Make sure all state docs in the folder with "state_" before name   
  - Federal provision bill language with "fed_"
  - Lobbying group as "lobbying_" 
  - Rename industry doc as "industry_" 
  </div>

</div>

In [16]:
##create directories to docs
PDF_DIR = Path("new_PDFS/") ##all documents, all categories of documents
TXT_DIR = Path("new_textfiles/") ##where they will go as .txt
TXT_DIR.mkdir(exist_ok=True)

<div class="steps">

  <div class="steps-title">Step 2: Create functions to: clean all documents, convert from pdf to txt</div>
  </div>

In [19]:
##create functions to clean

def clean_text(text):
    # lowercase
    text = text.lower()

    # remove multiple spaces/newlines
    text = re.sub(r'\s+', ' ', text)

    # remove page numbers like "Page 3"
    text = re.sub(r'page\s*\d+', ' ', text)

    # remove standalone citations like [1], (a), (b)(3)
    text = re.sub(r'\[\d+\]', ' ', text)
    text = re.sub(r'\(\w\)', ' ', text)

    # remove section numbers like 12004(b)(3)
    text = re.sub(r'\d+\([a-z]\)(\(\d+\))?', ' ', text)

    return text.strip()


def pdf_to_clean_txt(pdf_path, txt_path):
    doc = fitz.open(pdf_path)
    full_text = ""

    for page in doc:
        full_text += page.get_text()

    cleaned = clean_text(full_text)

    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(cleaned)

<div class="steps">

  <div class="steps-title">Step 3: Convert PDFs to .TXTs using cleaning-converting multi-function</div>
  <div class="steps-body">
      
- Substitute PDF_DIR and TXT_DIR with your own paths created above 
    </div>

</div>
    

In [22]:
# Run conversion on every PDF (all docs in one folder)
for pdf_file in tqdm(PDF_DIR.glob("*.pdf")):
    txt_file = TXT_DIR / (pdf_file.stem + ".txt") ##change name in pdf to exact same name but with .txt extension
    pdf_to_clean_txt(pdf_file, txt_file)

65it [00:07,  9.14it/s]


<div class="steps">

  <div class="steps-title">Step 4: Create Embeddings from the cleaned .TXT files</div>
  <div class="steps-body">
      
- One .npy per .txt file, all stored in one folder
</div>

</div>

<div class="steps">
  <div class="steps-title3">Step 4A: import model, define paths, create single-doc embeddings (each doc in every cateogry becomes its own .npy)</div></div>

In [88]:
model = SentenceTransformer("bwang0911/jev2-legal") ##import the legal model from sentence-transformers

TXT_DIR = Path("new_textfiles/")
EMB_DIR = Path("new_embeddings/")
EMB_DIR.mkdir(exist_ok=True)

for txt_path in TXT_DIR.glob("*.txt"):
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()

    emb = model.encode([text])      # shape (1, d)
    out_path = EMB_DIR / f"{txt_path.stem}.npy"
    np.save(out_path, emb)
    print(f"Savaed {out_path}") ##check to make sure it worked

Savaed new_embeddings/state_CALIFORNIA - RHTP-Project-Abstract.npy
Savaed new_embeddings/state_NORTH CAROLINA - Rural Health Transformation Application_Final for sharing_0.npy
Savaed new_embeddings/state_MARYLAND MD-Project-Narrative-CMS-RHT-26-001-Cover-Updated-Watermark-(3).npy
Savaed new_embeddings/lobbying_Nov19_ModernizeCoord.npy
Savaed new_embeddings/state_CALIFORNIA CA Combined-Associations-RHTP-support-letter_10.28.25.npy
Savaed new_embeddings/state_OREGON - Dear Partners_OR RHTP_Application.npy
Savaed new_embeddings/state_TEXAS rural-txstrongproject-supp-mtrls.npy
Savaed new_embeddings/state_OREGON - Oregon RHT Program Application_Budget Narrative-2.npy
Savaed new_embeddings/state_COLORADO Project Summary (1).npy
Savaed new_embeddings/state_SOUTH DAKOTA project-summary.npy
Savaed new_embeddings/state_CONNECTICUT cms_project_narrative_20251105.npy
Savaed new_embeddings/state_NORTH DAKOTA nd-rhtp-project-summary.npy
Savaed new_embeddings/state_IDAHO RHTP Project Narrative.npy
Sa

<div class="steps">
  <div class="steps-title3">Step 4B: Now create combined embeddings for the categories with multiple docs (states, lobbying) for use in future functions</div>
    
  <div class="steps-body">
      
- This allows us to choose how we compare categories and docs. we can now compare a whole state's files to another, or all the state docs to all the lobbying docs, and any permutation in between</div>
</div>

In [92]:
EMB_DIR = Path("new_embeddings") ## mention our path for single embeddings again, this is where the combined ones will go 

# combined STATES embedding (average of all state_*.npy)
state_files = sorted(glob.glob(str(EMB_DIR / "state_*.npy")))
state_vecs = [np.load(f).reshape(1, -1) for f in state_files]
if state_vecs:
    combined_states = np.mean(np.vstack(state_vecs), axis=0, keepdims=True)
    np.save(EMB_DIR / "new_combined_states.npy", combined_states)
    print(f"Saved {EMB_DIR / 'new_combined_states.npy'}")

# combined LOBBYING embedding (average of all lobbying_*.npy)
lob_files_for_combined = sorted(glob.glob(str(EMB_DIR / "lobbying_*.npy")))
lob_vecs = [np.load(f).reshape(1, -1) for f in lob_files_for_combined]
if lob_vecs:
    combined_lobbying = np.mean(np.vstack(lob_vecs), axis=0, keepdims=True)
    np.save(EMB_DIR / "new_combined_lobbying.npy", combined_lobbying)
    print(f"Saved {EMB_DIR / 'new_combined_lobbying.npy'}")

Saved new_embeddings/new_combined_states.npy
Saved new_embeddings/new_combined_lobbying.npy


<div class="steps">

  <div class="steps-title">Step 5: Load all embeddings (individual and combined) into variables/ dictionaries</div>
  <div class="steps-body">
      
- For multi-document categories, create a dictionary and sort
- For single-document categories, just do a simple np.load
- For combined categories that we made into single .npys above, also do a simple np.load
- Reshape all to 2-D to run cosine similarity</div>

In [101]:
# states (individual)
state_files = sorted(glob.glob("new_embeddings/state_*.npy"))
state_embeddings_dict = {Path(f).stem: np.load(f) for f in state_files}

# lobbying (individual)
lob_files = sorted(glob.glob("new_embeddings/lobbying_*.npy"))
lob_embeddings_dict = {Path(f).stem: np.load(f) for f in lob_files}

# single-file categories (federal, industry)
fed_embedding = np.load("new_embeddings/federal_RHTPbill.npy")
industry_embedding = np.load("new_embeddings/industry_BioIntell.npy")

# combined categories
combined_states   = np.load("new_embeddings/new_combined_states.npy")
combined_lobbying = np.load("new_embeddings/new_combined_lobbying.npy")

# reshape all embeddings to 2-D
for k, v in state_embeddings_dict.items(): ##reshape state dict
    if v.ndim == 1:
        state_embeddings_dict[k] = v.reshape(1, -1)

for k, v in lob_embeddings_dict.items(): ## reshape lobbying dic
    if v.ndim == 1:
        lob_embeddings_dict[k] = v.reshape(1, -1)

for name in ["fed_embedding", "industry_embedding", ##reshape each document in each category
             "combined_states", "combined_lobbying"]:
    arr = globals()[name] ##i used Perplexity to help me with this global thing, it looks up the actual variable by name in the notebook 
    if arr.ndim == 1: ##check if arrray is 1-D
        globals()[name] = arr.reshape(1, -1) ##overwrites original variable with a 2-D reshaped version

print(f"Loaded {len(state_embeddings_dict)} state documents")
print(f"Loaded {len(lob_embeddings_dict)} lobbying documents")
print(f"Loaded combined_states and combined_lobbying documents")



Loaded 56 state documents
Loaded 7 lobbying documents
Loaded combined_states and combined_lobbying documents


<div class="steps">
  <div class="steps-title3">Step 5B: Function _load_doc_embedding returns numeric embedding of any document name from above to use on functions</div>
    
  <div class="steps-body">
      
- Specifically needed for the final mega-function, but can be resued anytime you want to turn a document name into its embedding without looking up dict/variable its in
- This function is the only one we don't really need to use on its own</div>

In [189]:
def _load_doc_embedding(doc_name):
    """
    Take doc name and return embedding
    """
    if doc_name in state_embeddings_dict:
        return state_embeddings_dict[doc_name]
    if doc_name in lob_embeddings_dict:
        return lob_embeddings_dict[doc_name]
    if doc_name in ["federal_RHTPbill", "fed_RHTP"]:
        return fed_embedding
    if doc_name in ["industry_BioIntell", "industry_biointel"]:
        return industry_embedding
    if doc_name == "combined_states":
        return combined_states
    if doc_name == "combined_lobbying":
        return combined_lobbying

    print(f"Document '{doc_name}' not found in embeddings.")
    return None


<div class="steps">
    <div class="steps-title">
Step 6: Create dual helper functions that list out docs/return embeddings in a category</div>
<div class="steps-title3">
Step 6A: list_state_docs function lists out all documents in chosen state</div>
  <div class="steps-body">
      
- For other kinds of projects, can replace "state" with whatever category folder has a large number of different documents that vary by sub-category (i.e., there are four docs for North Dakota but only two for Texas)</div>

<div class="steps-title3">Step 6B: list_docs_by_category lists out document names in each category</div
  <div class="steps-body">
      
- These will help look through categories to see what can be compared </div>

In [193]:
def list_state_docs(state_name=None): ##returns all state documents with the same uppercase string in the name, ie "ALABAMA"
    """List all documents, optionally filtered by state name"""
    if state_name:
        state_name = state_name.upper()
        return [name for name in state_embeddings_dict.keys() if state_name in name]
    return list(state_embeddings_dict.keys())

def list_docs_by_category(category=None):
    """
    List document names by category.

    category options:
      - 'states'   : all state_* docs
      - 'lobbying' : all lobbying_* docs
      - 'federal'  : the federal bill doc name
      - 'industry' : the industry doc name
      - 'combined' : combined_states and combined_lobbying
      - None       : return a dict with all of the above
    """
    cats = {
        "states":   list(state_embeddings_dict.keys()),
        "lobbying": list(lob_embeddings_dict.keys()),
        "federal":  ["federal_RHTPbill"],
        "industry": ["industry_BioIntell"],
        "combined": ["combined_states", "combined_lobbying"],
    }

    if category is None:
        return cats
    return cats.get(category, [])




<div class="steps">
    <div class="steps-title3">Run Step 6A</div></div>

In [196]:
# See that states are visible
print(list_state_docs())           
# check for one state
list_state_docs("NORTH DAKOTA")



['state_ALABAMA ARHTP-Project-Narrative', 'state_ALASKA ak-rhtp-initiatives', 'state_ALASKA ak-rhtp-program-summary', 'state_ARKANSAS Arkansas-Rural-Health-Transformation-Program-Application', 'state_CALIFORNIA - RHTP-Project-Abstract', 'state_CALIFORNIA CA Combined-Associations-RHTP-support-letter_10.28.25', 'state_COLORADO Project Narrative (acc)', 'state_COLORADO Project Summary (1)', 'state_COLORADO RHTP Budget Narrative', 'state_CONNECTICUT cms_budget-narrative_final', 'state_CONNECTICUT cms_project_narrative_20251105', 'state_GEORGIA - GREAT Health Program FAQs_November 2025', 'state_GEORGIA Rural Health Program Survey', 'state_GEORGIAN - RHT Program Application 11.25', 'state_IDAHO RHTP Project Narrative', 'state_INDIANA - RHTP-NarrativeIN', 'state_INDIANA - RHTP-WorkingGroupMeeting5Nov2025', 'state_IOWA - Project Narrative Iowa RHTP CMS-RHT-26-001', 'state_IOWA Project Narrative Iowa RHTP CMS-RHT-26-001', 'state_IOWA RHT Application Letter Iowa Governor Kim Reynolds 2025.11.03'

['state_NORTH DAKOTA nd-rhtp-budget-narrative',
 'state_NORTH DAKOTA nd-rhtp-governor-letter',
 'state_NORTH DAKOTA nd-rhtp-project-narrative',
 'state_NORTH DAKOTA nd-rhtp-project-summary']

<div class="steps">
    <div class="steps-title3">Run Step 6B</div></div>

In [199]:
list_docs_by_category('lobbying')

['lobbying_Jun16_IntroLetter',
 'lobbying_Jun25_HarnerssingDataHearing',
 'lobbying_Nov19_ModernizeCoord',
 'lobbying_jun25_CoachCareCEO',
 'lobbying_launchSTMT',
 'lobbying_principleSTMT',
 'lobbying_sep4_EmpowerPatients']

<div class="steps">
    <div class="steps-title">Step 7: Functions for comparison!</div>

<div class="steps-title3">Step 7A: compare_state_to_state_docs compares all docs for one state to all docs for another</div>
<div class="steps-body">
    
- Instead of comparing one individual state doc to each other, this function will summarize the differences between each state's doc package
- Transform to df so its easy to read, will return a table for easy side-by-side comparison of each doc</div></div>

In [203]:
def compare_state_to_state_docs(state1, state2):
    """
    Compare all documents for state1 to all documents for state2.
    Returns a pandas DataFrame: rows = state1 docs, columns = state2 docs,
    values = similarity percentages rounded to 3 decimal places.
    """
    state1_docs = list_state_docs(state1)
    state2_docs = list_state_docs(state2)

    if not state1_docs:
        print(f"No documents found for {state1}")
        return None
    if not state2_docs:
        print(f"No documents found for {state2}")
        return None

    data = []

    for doc1 in state1_docs:
        row = []
        emb1 = state_embeddings_dict[doc1].reshape(1, -1)

        for doc2 in state2_docs:
            emb2 = state_embeddings_dict[doc2].reshape(1, -1)
            sim = float(cosine_similarity(emb1, emb2)[0][0])
            row.append(sim * 100)   # percentage

        data.append(row)

    df = pd.DataFrame(data, index=state1_docs, columns=state2_docs)
    return df.round(3)


<div class="steps">
<div class="steps-title3">Run 7A: compare_state_to_state_docs</div></div>

In [206]:
tx_vs_ca = compare_state_to_state_docs("TEXAS", "CALIFORNIA")
tx_vs_ca

Unnamed: 0,state_CALIFORNIA - RHTP-Project-Abstract,state_CALIFORNIA CA Combined-Associations-RHTP-support-letter_10.28.25
state_TEXAS rural-txstrong-prjct-narr,80.078,73.523
state_TEXAS rural-txstrongproject-supp-mtrls,74.651,70.907


<div class="steps">
<div class="steps-title3">Step 7B: compare_state_doc compares one state doc to all other categories AND each other state doc</div>
  <div class="steps-body">
      
- This function will be reused within our mega-fuction (step 7)
- This function can also run on its own for a simple doc-to-doc comparison
- Returns two pandas DataFrames: one for category comparisons and one for individual documents, both in percentages rounded to three decimals.</div></div>

In [209]:
def compare_state_doc(state_doc_name):
    """
    Compare a specific state document to:
      - federal and industry (category table)
      - each lobbying doc
      - each other state doc

    Returns:
      cat_df  : 1-row DataFrame with percentage category similarities 
      docs_df : DataFrame with rows of doc names, column of 'similarity_%'
    """
    if state_doc_name not in state_embeddings_dict:
        print(f"State document '{state_doc_name}' not found")
        return None, None

    state_emb = state_embeddings_dict[state_doc_name].reshape(1, -1)

    ## categories
    to_federal  = float(cosine_similarity(state_emb, fed_embedding)[0][0]) * 100
    to_industry = float(cosine_similarity(state_emb, industry_embedding)[0][0]) * 100

    cat_df = pd.DataFrame(
        [[to_federal, to_industry]],
        index=[state_doc_name],
        columns=["federal_%", "industry_%"]
    ).round(3)

##individual docs
    doc_names = []
    sims = []

    # lobbying docs
    for lob_name, lob_emb in lob_embeddings_dict.items():
        sim = float(cosine_similarity(state_emb, lob_emb.reshape(1, -1))[0][0]) * 100
        doc_names.append(lob_name)
        sims.append(sim)

    # other states
    for other_name, other_emb in state_embeddings_dict.items():
        if other_name == state_doc_name:
            continue
        sim = float(cosine_similarity(state_emb, other_emb.reshape(1, -1))[0][0]) * 100
        doc_names.append(other_name)
        sims.append(sim)

    docs_df = pd.DataFrame(
        {"similarity_%": sims},
        index=doc_names
    ).round(3).sort_values("similarity_%", ascending=False)

    return cat_df, docs_df


<div class="steps">
<div class="steps-title3">Run 6A again to remind ourselves what docs are available in our chosen state</div></div>

In [212]:
list_state_docs("OHIO")

['state_OHIO Ohio+RHT+PROGRAM+NARRATIVE']

<div class="steps">
<div class="steps-title3">Run 7B</div></div>

In [214]:
## can save this as a dataframe(s) if reporting requires diving in to one specific state bill!
compare_state_doc("state_OHIO Ohio+RHT+PROGRAM+NARRATIVE")


(                                       federal_%  industry_%
 state_OHIO Ohio+RHT+PROGRAM+NARRATIVE     63.258      59.406,
                                                     similarity_%
 state_OREGON - Oregon RHT Project Summary and N...        84.853
 state_OREGON RHT Project Summary and Narrative            84.853
 state_COLORADO Project Narrative (acc)                    83.261
 state_MASSACHUSETTS Final RHTP Summary and Proj...        82.873
 state_WASHINGTON rhtp-project-narrative                   82.643
 ...                                                          ...
 lobbying_Nov19_ModernizeCoord                             55.136
 lobbying_principleSTMT                                    50.075
 state_IOWA RHT Application Letter Iowa Governor...         2.395
 state_NORTH DAKOTA nd-rhtp-governor-letter                 2.395
 state_WASHINGTON rhtp-governors-endorsement                2.395
 
 [62 rows x 1 columns])

<div class="steps">
    <div class="steps-title">Step 8: Mega-Function that compares any doc (of any type/category) to all categories AND three specific docs of your choice (compare_any_doc)</div>

<div class="steps-body">
    
- Remember that the load_doc_embedding will be used here
- Parameters: target_doc of your choice, doc1-doc3 of your choice
- Output: summary and dictionary of cosine similarity scores comparing target_doc to each category and then comparing target_doc to the three chosen docs1-3</div></div>

In [217]:
def compare_any_doc(target_doc, doc1, doc2, doc3):
    """
    Compare any document to:
      - federal (single)
      - industry (single)
      - combined states
      - combined lobbying
      - three specific documents of your choice

    Prints a readable summary and also returns a dict of scores.
    """
    target_emb = _load_doc_embedding(target_doc)
    if target_emb is None:
        return None

    results = {
        "target": target_doc,
        "categories": {},
        "specific_docs": {}
    }

    print("\n" + "="*70)
    print(f"ANALYZING: {target_doc}")
    print("="*70)

    # --- category similarities ---
    print("\nCATEGORY SIMILARITIES:")
    print("-" * 70)

    fed_sim = float(cosine_similarity(target_emb, fed_embedding)[0][0])
    ind_sim = float(cosine_similarity(target_emb, industry_embedding)[0][0])
    st_sim  = float(cosine_similarity(target_emb, combined_states)[0][0])
    lob_sim = float(cosine_similarity(target_emb, combined_lobbying)[0][0])

    results["categories"] = {
        "federal": fed_sim,
        "industry": ind_sim,
        "states_combined": st_sim,
        "lobbying_combined": lob_sim,
    }

    print(f"Federal (bill):          {fed_sim*100:6.1f}%")
    print(f"Industry (BioIntell):    {ind_sim*100:6.1f}%")
    print(f"States (combined):       {st_sim*100:6.1f}%")
    print(f"Lobbying (combined):     {lob_sim*100:6.1f}%")

    # --- three specific documents ---
    print("\nSPECIFIC DOCUMENT COMPARISONS:")
    print("-" * 70)

    for name in [doc1, doc2, doc3]:
        emb = _load_doc_embedding(name)
        if emb is None:
            continue
        sim = float(cosine_similarity(target_emb, emb)[0][0])
        results["specific_docs"][name] = sim
        print(f"{name:40s} {sim*100:6.1f}%")

    print("\n" + "="*70 + "\n")
    return results


<div class="steps">
<div class="steps-title3">Run 6B again to check docs in category im targeting</div></div>

In [220]:
list_docs_by_category('lobbying')

['lobbying_Jun16_IntroLetter',
 'lobbying_Jun25_HarnerssingDataHearing',
 'lobbying_Nov19_ModernizeCoord',
 'lobbying_jun25_CoachCareCEO',
 'lobbying_launchSTMT',
 'lobbying_principleSTMT',
 'lobbying_sep4_EmpowerPatients']

<div class="steps">
<div class="steps-title3">Check out all the state files to see what states i have</div></div>

In [227]:
state_files

['new_embeddings/state_ALABAMA ARHTP-Project-Narrative.npy',
 'new_embeddings/state_ALASKA ak-rhtp-initiatives.npy',
 'new_embeddings/state_ALASKA ak-rhtp-program-summary.npy',
 'new_embeddings/state_ARKANSAS Arkansas-Rural-Health-Transformation-Program-Application.npy',
 'new_embeddings/state_CALIFORNIA - RHTP-Project-Abstract.npy',
 'new_embeddings/state_CALIFORNIA CA Combined-Associations-RHTP-support-letter_10.28.25.npy',
 'new_embeddings/state_COLORADO Project Narrative (acc).npy',
 'new_embeddings/state_COLORADO Project Summary (1).npy',
 'new_embeddings/state_COLORADO RHTP Budget Narrative.npy',
 'new_embeddings/state_CONNECTICUT cms_budget-narrative_final.npy',
 'new_embeddings/state_CONNECTICUT cms_project_narrative_20251105.npy',
 'new_embeddings/state_GEORGIA - GREAT Health Program FAQs_November 2025.npy',
 'new_embeddings/state_GEORGIA Rural Health Program Survey.npy',
 'new_embeddings/state_GEORGIAN - RHT Program Application 11.25.npy',
 'new_embeddings/state_IDAHO RHTP Pr

In [231]:
print(list_state_docs('TEXAS'))
print(list_state_docs ('COLORADO'))

['state_TEXAS rural-txstrong-prjct-narr', 'state_TEXAS rural-txstrongproject-supp-mtrls']
['state_COLORADO Project Narrative (acc)', 'state_COLORADO Project Summary (1)', 'state_COLORADO RHTP Budget Narrative']


In [233]:
compare_any_doc('lobbying_principleSTMT', 'state_TEXAS rural-txstrong-prjct-narr','state_COLORADO Project Narrative (acc)', 'state_TEXAS rural-txstrongproject-supp-mtrls')


ANALYZING: lobbying_principleSTMT

CATEGORY SIMILARITIES:
----------------------------------------------------------------------
Federal (bill):            39.1%
Industry (BioIntell):      72.5%
States (combined):         57.1%
Lobbying (combined):       90.1%

SPECIFIC DOCUMENT COMPARISONS:
----------------------------------------------------------------------
state_TEXAS rural-txstrong-prjct-narr      53.9%
state_COLORADO Project Narrative (acc)     52.4%
state_TEXAS rural-txstrongproject-supp-mtrls   47.9%




{'target': 'lobbying_principleSTMT',
 'categories': {'federal': 0.39086276292800903,
  'industry': 0.7246020436286926,
  'states_combined': 0.5711296200752258,
  'lobbying_combined': 0.9014018774032593},
 'specific_docs': {'state_TEXAS rural-txstrong-prjct-narr': 0.5389922857284546,
  'state_COLORADO Project Narrative (acc)': 0.5244195461273193,
  'state_TEXAS rural-txstrongproject-supp-mtrls': 0.47871044278144836}}

<div class="steps">
    <div class="steps-title">Step 9: Using NLTK and chunking to print out matching text</div>

<div class="steps-title3"> Step 9A: Chunk-ify</div>
<div class="steps-body">
    
- Return tables showing text that matches across docs
-  This will show us semantic similarities, not just text matches: highly applicable for things like state bills that are often influenced by the same advocacy group/federal law but read slightly differently
    - finding near-duplicates is more important than finding exact text matches (that's easy to do with command+F)

- *NOTE*: Here we are going back to .txt files</div></div>

In [249]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/lizziewalsh/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [251]:


def get_doc_chunks_and_embeddings(txt_path, model, max_len=400):
    """
    Split one cleaned .txt into shorter chunks (sentences or short paragraphs),
    encode each chunk, and return (chunks, embeddings).
    """
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()

    # simple sentence split; you can swap for paragraph logic if you prefer
    sentences = nltk.sent_tokenize(text)

    # optional: merge very short sentences into bigger chunks
    chunks = []
    current = ""
    for s in sentences:
        if len(current) + len(s) < max_len:
            current += " " + s
        else:
            if current.strip():
                chunks.append(current.strip())
            current = s
    if current.strip():
        chunks.append(current.strip())

    emb = model.encode(chunks)   # shape (num_chunks, d)
    return chunks, emb


<div class="steps">
<div class="steps-title3">Run 9A for two or more docs i want to compare</div>
<div class="steps-body"> 
    
*NOTE*: You need to run this every time you want to do 9B</div></div>

In [256]:
txt_path1 = "new_textfiles/state_ALASKA ak-rhtp-initiatives.txt"  # adjust to your folder/name

chunks_al_AK_init, emb_al_AK_init = get_doc_chunks_and_embeddings(txt_path1, model)

len(chunks_al_AK_init), emb_al_AK_init.shape   # quick sanity check


(142, (142, 768))

In [258]:

txt_path2 = "new_textfiles/lobbying_principleSTMT.txt"  # adjust to your folder/name

chunks_al_lob_princ, emb_al_lob_princ = get_doc_chunks_and_embeddings(txt_path2, model)

len(chunks_al_lob_princ), emb_al_lob_princ.shape   # quick sanity check

(21, (21, 768))

<div class="steps">
<div class="steps-title3">Step 9B: function top_similar_chunks will return dataframe with the top most similar pairs of chunks</div>

<div class="steps-body">
    
- This function should be used after running the larger cosine comparisons. Now that you've narrowed in on an interesting doc pair or group of docs, you can dig deeper</div></div>

In [261]:
def top_similar_chunks(chunks_a, emb_a, chunks_b, emb_b, top_n=10):
    """
    Given chunks + embeddings for two docs, return a DataFrame with
    the top-N most similar pairs of chunks.
    """
    sims = cosine_similarity(emb_a, emb_b)   # matrix [len(a) x len(b)]
    pairs = []

    for i in range(sims.shape[0]):
        for j in range(sims.shape[1]):
            pairs.append((i, j, sims[i, j]))

    # sort by similarity descending and take top_n
    pairs = sorted(pairs, key=lambda x: x[2], reverse=True)[:top_n]

    rows = []
    for i, j, s in pairs:
        rows.append({
            "docA_chunk_id": i, 
            "docB_chunk_id": j, 
            "docA_text": chunks_a[i],
            "docB_text": chunks_b[j],
            "similarity_%": round(float(s) * 100, 3)
        })

    return pd.DataFrame(rows)


<div class="steps">
<div class="steps-title3">Run 9B</div></div>

In [268]:
##name these in variables so you can save for later

AK_init_vs_lobbying_princ = top_similar_chunks(chunks_al_AK_init, emb_al_AK_init, chunks_al_lob_princ, emb_al_lob_princ,  top_n=10)

AK_init_vs_lobbying_princ

Unnamed: 0,docA_chunk_id,docB_chunk_id,docA_text,docB_text,similarity_%
0,93,15,• establish alternative payment methodologies ...,given the unprecedented clinical improvements ...,72.318
1,104,15,support may include guidance on application de...,given the unprecedented clinical improvements ...,72.272
2,84,17,specific targeted support will be provided to ...,"moreover, we commit to only collect data relev...",70.812
3,120,2,this initiative empowers providers with innova...,when a patient leaves a hospital or physician ...,70.519
4,122,2,these tools can support symptom tracking and m...,when a patient leaves a hospital or physician ...,70.509
5,134,15,• build health it infrastructure to support pr...,given the unprecedented clinical improvements ...,70.168
6,104,12,support may include guidance on application de...,"in working with these clinicians, we prioritiz...",70.007
7,85,15,of 14 • support value-based care and alternati...,given the unprecedented clinical improvements ...,69.963
8,69,7,"enable care teams, health care providers, and ...",these clinicians serve as healthcare liaisons ...,69.887
9,104,17,support may include guidance on application de...,"moreover, we commit to only collect data relev...",69.747


<div class="steps">
    <div class="steps-title">Step 10: Function that shows top similar text spans across all documents you choose</div>
<div class="steps-body">
    
- This function should be used after running the larger cosine comparisons. Now that you've narrowed in on an interesting doc pair or group of docs, you can dig deeper</div>

<div class="steps-title3">Compares every pair of different docs in a set of 2-6 documents (any more will crash probably)</div>

<div class="steps-body">
    
- This will pull out top chunk matches across multiple docs, not just pairs
- Choose between 2-6 single documents you suspect are similar 
- Each doc will produce 100-500 chunks so this comparison can be REALLY slow/noisy because its looking at tens of thousands of similarity calculations
- For best use, choose 2-3 state bills, 1 federal bill, and one lobbying/industry doc (or similar permutation)</div></div>


<div class="steps">
    <div class="steps-title">Step 10A: Label the chunks for 10B</div>

<b>docs_chunk</b> is a dictionary of two dictionaires:
-  Dict 1: Document Key
    - <b>doc_id</b> is a document's name string ("state_ALABAMA_narrative")
    - <b>txt_path</b> is the exact location of the doc ("new_textfiles/state_ALABAMA_narrative.txt")
- Dict 2: Value contains chunks and emb for that doc</b>
    - <b>chunks</b> is the list of text spans for the doc
    - <b>emb</b> is array of embeddings for those spans</div>

In [318]:
## need this now
from itertools import combinations

In [298]:
## you'll probably need to look up the lists again, but for .txts
## we remember that our TXT_DIR holds all the text files, and they all have a prefix of their category
## states also have a state_ prefix and a state name in all caps ("state_ALABAMA..")

def list_textfiles_by_prefix(prefix=None, contains=None):
    """
    List .txt files in TXT_DIR.

    - prefix: leading part of filename like 'state_' or 'lobbying_'
              (if None, list all .txt files)
    - contains: optional substring to filter on, e.g. 'ALABAMA'
    """
    pattern = "*.txt" if prefix is None else f"{prefix}*.txt"
    files = sorted(TXT_DIR.glob(pattern))
    names = [p.name for p in files]

    if contains:
        contains = contains.upper()
        names = [n for n in names if contains in n.upper()]

    return names

In [336]:
print(list_textfiles_by_prefix("state_INDIANA"), list_textfiles_by_prefix("state_ALABAMA"),list_textfiles_by_prefix("state_COLORADO"), list_textfiles_by_prefix("state_TEXAS"))

['state_INDIANA - RHTP-NarrativeIN.txt', 'state_INDIANA - RHTP-WorkingGroupMeeting5Nov2025.txt'] ['state_ALABAMA ARHTP-Project-Narrative.txt'] ['state_COLORADO Project Narrative (acc).txt', 'state_COLORADO Project Summary (1).txt', 'state_COLORADO RHTP Budget Narrative.txt'] ['state_TEXAS rural-txstrong-prjct-narr.txt', 'state_TEXAS rural-txstrongproject-supp-mtrls.txt']


In [343]:
# example: build for a few docs you care about
# Give the big dictionary a unique name AND change the function below to reflect it OR:
# Keep it doc_chunk always but overwrite the dicts each time.
docs_chunks_narratives_IN_TX_AL_CO = {}

for doc_id, txt_path in [
    ("state_INDIANA_narrative", "new_textfiles/state_INDIANA - RHTP-NarrativeIN.txt"),
    ("state_TEXAS_narrative",   "new_textfiles/state_TEXAS rural-txstrong-prjct-narr.txt"),
    ("state_ALABAMA_narrative",          "new_textfiles/state_ALABAMA ARHTP-Project-Narrative.txt"),
    ("state_COLORADO_narrative",        "new_textfiles/state_COLORADO Project Narrative (acc).txt"),
]:
    chunks, emb = get_doc_chunks_and_embeddings(txt_path, model)
    docs_chunks_narratives_IN_TX_AL_CO[doc_id] = {"chunks": chunks, "emb": emb}


<div class="steps">
    <div class="steps-title">Step 10B: top_similar_spans_across_docs function takes a dict of doc_id chunks and embeds and computes similarities</div>
<div class="steps-body">
    
- Compute similarities of each possible document pair as defined above
- Returns dataframe of top-N most similar text span pairs
    - Sorts all span‑to‑span similarities from all pairs, returns the top rows. Each row tells you which two docs, which two spans, and the similarity score between</div></div>


In [328]:
def top_similar_spans_across_docs(docs_chunks, top_n=5, min_sim=0.0):
    """
    Given a dict:
      {doc_id: {"chunks": [text...], "emb": array[num_chunks, dim]}}
    compute similarities between all pairs of docs and return
    a DataFrame of the top-N most similar span pairs overall.
    """
    rows = []

    # all unordered pairs of different docs
    for doc_a, doc_b in combinations(docs_chunks.keys(), 2):
        chunks_a = docs_chunks[doc_a]["chunks"]
        emb_a    = docs_chunks[doc_a]["emb"]
        chunks_b = docs_chunks[doc_b]["chunks"]
        emb_b    = docs_chunks[doc_b]["emb"]

        sim_matrix = cosine_similarity(emb_a, emb_b)  # [len(a) x len(b)]

        for i in range(sim_matrix.shape[0]):
            for j in range(sim_matrix.shape[1]):
                sim = float(sim_matrix[i, j])
                if sim < min_sim:
                    continue
                rows.append({
                    "docA": doc_a,
                    "spanA_id": i,
                    "docA_text": chunks_a[i],
                    "docB": doc_b,
                    "spanB_id": j,
                    "docB_text": chunks_b[j],
                    "similarity_%": round(sim * 100, 3)
                })

    if not rows:
        return pd.DataFrame()

    df = pd.DataFrame(rows)
    df = df.sort_values("similarity_%", ascending=False).head(top_n)
    return df


In [345]:
top_similar_spans_across_docs(docs_chunks_narratives_IN_TX_AL_CO, top_n=5, min_sim=0.0)

Unnamed: 0,docA,spanA_id,docA_text,docB,spanB_id,docB_text,similarity_%
656247,state_TEXAS_narrative,263,main strategic goals: tech innovation uses of ...,state_COLORADO_narrative,161,40 main strategic goal: tech innovation use of...,91.606
633455,state_TEXAS_narrative,186,"main strategic goals: tech innovation, sustain...",state_COLORADO_narrative,161,40 main strategic goal: tech innovation use of...,90.192
410870,state_INDIANA_narrative,302,main strategic goal: make rural america health...,state_COLORADO_narrative,106,main strategic goal: make rural america health...,90.1
276799,state_INDIANA_narrative,302,main strategic goal: make rural america health...,state_ALABAMA_narrative,127,rural health initiative summary category infor...,89.856
410814,state_INDIANA_narrative,302,main strategic goal: make rural america health...,state_COLORADO_narrative,50,"goals, initiatives, and outputs goal initiativ...",89.653


---

<div class="my-takeaways">TAKEAWAYS:

- Texas, Indiana, and Colorado narrative RHTP proposals are 89-90% identical
- 
</div>
