# JFK and RFK declassified Information Notebook analysis #
Tamir Abu Haya @ 2025
The following notebook analyzes the JFK and RFK declassified files that were released to the public.

Requirements to run the code:
* Access to your Google Drive, since all downloaded files are stored there.
* At least 120-150 GB of free space (150 GB is the safer figure).
* A lot of runtime: the full run took around 60 hours.
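
The space requirement above can be checked before starting a long run. A minimal sketch using only the standard library; `has_free_space` is a hypothetical helper, not part of the notebook, and it assumes the filesystem reports meaningful free space for the path you pass (for a mounted Drive folder, it is worth verifying that the reported value matches your Drive quota).

```python
import shutil

def has_free_space(path: str, required_gb: float = 150.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal gigabytes
    return free_gb >= required_gb

# "." is a stand-in; in Colab you would pass the mounted project folder instead.
print("enough space:", has_free_space(".", required_gb=150.0))
```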



Each section starts with its imports and constants; the constants can be changed to make a run faster or more accurate.

Note that only the first section must be run before any other section, since it is preliminary and handles the Google Drive mount. All other sections can be run in separate sessions. Dependency conflicts may occur; restarting the runtime, or deleting the session and trying again, resolved most of these issues for me.


I believe the analysis section is still very rudimentary and can be greatly expanded. The text extraction seems reliable, but I believe it can be improved further with more accurate methods (bear in mind that this will greatly increase the runtime).


# IMPORTANT NOTICE : #
Since the sections can't be run simultaneously in the same session, some cells in this notebook have no output. I understand this may be a problem, but the runs took too much processing time and it wasn't feasible to keep the output cells of every section. As compensation, I have shared the Google Drive folder containing what the code produced; it can be found at the following link:

https://drive.google.com/drive/folders/1u6-9GXIEElA4AJjVQIxhtC4eDiNWfLaB?usp=sharing



## Section 1: Mounting Google Drive ##

This section mounts Google Drive into the project and creates the folder schema required for the assignment.

It is advised to keep the constants as they are.

In [1]:
#google drive mount
from google.colab import drive
drive.mount('/content/drive')

#imports
import os, shutil, glob, zipfile, hashlib, time, json, re

Mounted at /content/drive


In [2]:
#constants
DATA_ROOT = "/content/drive/MyDrive/JFK_RFK_Project"

JFK_DIR = os.path.join(DATA_ROOT, "JFK")
RFK_DIR = os.path.join(DATA_ROOT, "RFK")
REF_DIR = os.path.join(DATA_ROOT, "Reference_Reports")

JFK_BY_RELEASE = os.path.join(JFK_DIR, "By_Release_Date")
RFK_BY_RELEASE = os.path.join(RFK_DIR, "By_Release_Date")

JFK_REL_PRE_2017  = os.path.join(JFK_BY_RELEASE, "Pre_2017_Releases")
JFK_REL_2017_2023 = os.path.join(JFK_BY_RELEASE, "2017_2023_Releases")
JFK_REL_2025      = os.path.join(JFK_BY_RELEASE, "2025_Releases")

RFK_REL_PRE_2025  = os.path.join(RFK_BY_RELEASE, "Pre_2025_Releases")
RFK_REL_2025      = os.path.join(RFK_BY_RELEASE, "2025_Releases")

JFK_AGENCY = os.path.join(JFK_DIR, "Agency_Releases")
RFK_AGENCY = os.path.join(RFK_DIR, "Agency_Releases")

#Supported types of files
KIND_DIR = {"pdf":"pdf", "audio":"audio", "image":"image", "other":"other"}

TMP_ON_DRIVE     = os.path.join(DATA_ROOT, ".tmp")
MARKERS_DIR      = os.path.join(DATA_ROOT, ".zip_markers")
MANIFEST_PARQUET = os.path.join(DATA_ROOT, "manifest.parquet")

#extensions supported
INTERESTING_EXTS = (".pdf", ".mp3", ".wav", ".jpg", ".jpeg", ".png", ".tif", ".tiff")



In [3]:
#functions to create folder schema for project
def ensure_dir(p: str):
    os.makedirs(p, exist_ok=True)

def precreate_schema():
    for p in [
        DATA_ROOT, JFK_DIR, RFK_DIR, REF_DIR,
        JFK_BY_RELEASE, RFK_BY_RELEASE,
        JFK_REL_PRE_2017, JFK_REL_2017_2023, JFK_REL_2025,
        RFK_REL_PRE_2025, RFK_REL_2025,
        JFK_AGENCY, RFK_AGENCY,
        TMP_ON_DRIVE, MARKERS_DIR,
    ]:
        ensure_dir(p)

    leaves = [
        JFK_REL_PRE_2017, JFK_REL_2017_2023, JFK_REL_2025,
        RFK_REL_PRE_2025, RFK_REL_2025,
        JFK_AGENCY, RFK_AGENCY,
        REF_DIR
    ]
    for leaf in leaves:
        for kd in KIND_DIR.values():
            ensure_dir(os.path.join(leaf, kd))

precreate_schema()
print("Drive mounted and Project root:", DATA_ROOT)

Drive mounted and Project root: /content/drive/MyDrive/JFK_RFK_Project
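
After `precreate_schema()` has run, each leaf folder should contain one subfolder per file kind. The check below is a small sketch of how that could be verified; `missing_kind_dirs` and the `KINDS` tuple are illustrative names (the tuple mirrors the values of `KIND_DIR`), and the demo runs against a temporary directory rather than Drive.

```python
import os
import tempfile

KINDS = ("pdf", "audio", "image", "other")  # mirrors KIND_DIR.values()

def missing_kind_dirs(leaf_dirs, kinds=KINDS):
    """Return every expected <leaf>/<kind> directory that does not exist."""
    return [
        os.path.join(leaf, kind)
        for leaf in leaf_dirs
        for kind in kinds
        if not os.path.isdir(os.path.join(leaf, kind))
    ]

# Demo: create only two of the four kind folders and report the rest.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "JFK", "Agency_Releases")
    for kind in ("pdf", "audio"):
        os.makedirs(os.path.join(leaf, kind))
    print([os.path.basename(p) for p in missing_kind_dirs([leaf])])
```

In the notebook itself the list of leaves built inside `precreate_schema()` would be the natural argument; an empty result means the schema is complete.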


## Section 2: Loading the Files ##

This section is a script that downloads the files from the archives and stores them in the relevant location in Google Drive.

IMPORTANT: the JFK files from the NSA (https://www.nsa.gov/Helpful-Links/NSA-FOIA/Declassification-Transparency-Initiatives/Historical-Releases/JFK/) need to be loaded manually. Additionally, the RFK files from the FBI vault (https://vault.fbi.gov/Robert%20F%20Kennedy%20(Assassination)%20/Robert%20F%20Kennedy%20(Assassination)%20Part%2001/view) also need to be loaded manually.
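
One way to file the manually downloaded files is sketched below. It assumes they belong under the `Agency_Releases` folders, split by file kind like the rest of the project; that destination is an assumption, and `file_manual_downloads` and `kind_of` are hypothetical helpers. Files are copied, not moved, and existing destination files are left untouched.

```python
import os
import shutil
import tempfile

AUDIO_EXT = (".mp3", ".wav")
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".tif", ".tiff")

def kind_of(name: str) -> str:
    """Map a file name to one of the project's kind folders."""
    ext = os.path.splitext(name)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in AUDIO_EXT:
        return "audio"
    if ext in IMAGE_EXT:
        return "image"
    return "other"

def file_manual_downloads(src_dir: str, agency_root: str) -> list:
    """Copy every file in `src_dir` into `agency_root/<kind>/`; return the destinations."""
    placed = []
    for fn in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, fn)
        if not os.path.isfile(src):
            continue  # ignore sub-folders
        dst_dir = os.path.join(agency_root, kind_of(fn))
        os.makedirs(dst_dir, exist_ok=True)
        dst = os.path.join(dst_dir, fn)
        if not os.path.exists(dst):  # never overwrite an existing file
            shutil.copy2(src, dst)
        placed.append(dst)
    return placed

# Demo on throwaway folders; the file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    downloads = os.path.join(tmp, "downloads")
    agency = os.path.join(tmp, "JFK", "Agency_Releases")
    os.makedirs(downloads)
    for name in ("sample.pdf", "clip.wav", "notes.txt"):
        open(os.path.join(downloads, name), "w").close()
    for path in file_manual_downloads(downloads, agency):
        print(os.path.relpath(path, agency))
```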


In [4]:
#pip installs
!pip -q install beautifulsoup4

In [10]:
#imports
import os, re, io, sys, time, json, shutil, zipfile, hashlib, glob
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import pandas

In [6]:
#Constants (mainly sources)
JFK_BULK_PAGE = "https://www.archives.gov/research/jfk/jfkbulkdownload"
JFK_2025_PAGE = "https://www.archives.gov/research/jfk/release-2025"
RFK_2025_PAGE = "https://www.archives.gov/research/rfk"


WARREN_TOC = "https://www.archives.gov/research/jfk/warren-commission-report/toc"
HSCA_TOC   = "https://www.archives.gov/research/jfk/select-committee-report/toc"
ARRB_PDF_SOURCES = [
    "https://www.archives.gov/files/research/jfk/review-board/report/arrb-final-report.pdf",
    "https://www.govinfo.gov/content/pkg/GOVPUB-Y3-PURL-LPS11922/pdf/GOVPUB-Y3-PURL-LPS11922.pdf",
    "https://ia801705.us.archive.org/7/items/AssassinationsRecordsReviewBoard/Final%20Report%20of%20the%20Assassinations%20Records%20Review%20Board.pdf"]

#HTTP header for script
HTTP_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept": "text/html,application/pdf;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.archives.gov/",
    "Connection": "keep-alive",
}

TIMEOUT = 120
HTTP_RETRIES = 5


AUDIO_EXT = (".mp3", ".wav")
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".tif", ".tiff")
PDF_EXT = ".pdf"  #a plain string (not a tuple): classify_ext compares it with ==


In [7]:
#session initialization
SESSION = requests.Session()
retries = Retry(
    total=HTTP_RETRIES,
    backoff_factor=0.6,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=frozenset(['GET'])
)

SESSION.mount("https://", HTTPAdapter(max_retries=retries))
SESSION.mount("http://", HTTPAdapter(max_retries=retries))


In [8]:
#function utilities
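def http_get(url: str, stream=True) -> requests.Response:
    """GET `url` with the shared session and return the first 200 response.

    Makes up to HTTP_RETRIES attempts on top of the adapter-level Retry policy.
    A streamed request answered with 403/406 is retried once without streaming.
    """
    last_exc = None
    last_status = None
    for _ in range(HTTP_RETRIES):
        try:
            r = SESSION.get(url, headers=HTTP_HEADERS, stream=stream, timeout=TIMEOUT, allow_redirects=True)
            if r.status_code == 200:
                return r
            last_status = r.status_code
            if stream and r.status_code in (403, 406):
                r2 = SESSION.get(url, headers=HTTP_HEADERS, stream=False, timeout=TIMEOUT, allow_redirects=True)
                if r2.status_code == 200:
                    return r2
                last_status = r2.status_code
        except Exception as e:
            last_exc = e
        time.sleep(0.8)
    if last_exc:
        raise last_exc
    raise RuntimeError(f"HTTP failed (last status {last_status}): {url}")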


def classify_ext(name: str) -> str:
    ext = os.path.splitext(name)[1].lower()
    if ext == PDF_EXT: return "pdf"
    if ext in AUDIO_EXT: return "audio"
    if ext in IMAGE_EXT: return "image"
    return "other"

def download_file(url: str, dst: str) -> dict:
    try:
        if os.path.exists(dst):
            return {"status":"skipped", "reason":"exists"}
        r = http_get(url, stream=True)
        total = int(r.headers.get("content-length", 0))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        tmp = dst + ".part"
        with open(tmp, "wb") as f, tqdm(total=total, unit="B", unit_scale=True, desc=os.path.basename(dst), disable=(total==0)) as pbar:
            for chunk in r.iter_content(1024*64):
                if not chunk: continue
                f.write(chunk)
                if total: pbar.update(len(chunk))
        os.replace(tmp, dst)
        return {"status":"ok"}
    except Exception as e:
        try:
            if os.path.exists(dst + ".part"): os.remove(dst + ".part")
        except: pass
        return {"status":"error", "reason": str(e)}

def _marker_key(url: str, label: str, era_root: str) -> str:
    m = hashlib.md5()
    m.update(url.encode("utf-8"))
    m.update(label.encode("utf-8"))
    m.update(era_root.encode("utf-8"))
    return os.path.join(MARKERS_DIR, f"{m.hexdigest()}_{os.path.basename(label)}.done")

def move_by_kind(src_path: str, era_root: str) -> str:
    kind = classify_ext(src_path)
    dst_dir = os.path.join(era_root, KIND_DIR[kind])
    dst = os.path.join(dst_dir, os.path.basename(src_path))
    if os.path.abspath(src_path) == os.path.abspath(dst):
        return dst
    if not os.path.exists(dst):
        shutil.move(src_path, dst)
    else:
        if os.path.abspath(src_path) != os.path.abspath(dst):
            try: os.remove(src_path)
            except: pass
    return dst

def safe_unzip(zip_path: str, era_root: str, marker_path: str) -> dict:
    try:
        if not os.path.exists(zip_path):
            return {"status":"skipped", "reason":"zip missing"}
        tmp_extract = os.path.join(TMP_ON_DRIVE, f"_extract_{os.path.splitext(os.path.basename(zip_path))[0]}")
        os.makedirs(tmp_extract, exist_ok=True)
        with zipfile.ZipFile(zip_path, 'r') as z:
            z.extractall(tmp_extract)

        moved = 0
        for root, _, files in os.walk(tmp_extract):
            for fn in files:
                src = os.path.join(root, fn)
                if os.path.isfile(src):
                    newp = move_by_kind(src, era_root)
                    if newp != src: moved += 1

        shutil.rmtree(tmp_extract, ignore_errors=True)
        try: os.remove(zip_path)
        except: pass

        with open(marker_path, "w") as f: f.write("ok")
        return {"status":"ok", "moved": moved}
    except Exception as e:
        try: os.remove(zip_path)
        except: pass
        return {"status":"error", "reason": str(e)}

def write_manifest(rows: list):
    if not rows: return
    df = pandas.DataFrame(rows)
    if os.path.exists(MANIFEST_PARQUET):
        old = pandas.read_parquet(MANIFEST_PARQUET)
        df = pandas.concat([old, df], ignore_index=True).drop_duplicates(subset=["url","local_path"])
    df.to_parquet(MANIFEST_PARQUET)

In [11]:
def parse_nara_jfk_bulk_zip_links(bulk_page_url=JFK_BULK_PAGE):
    r = http_get(bulk_page_url, stream=False)
    soup = BeautifulSoup(r.content, "html.parser")
    year_pat = re.compile(r"(2017|2018|2021|2022|2023|2025)")
    out = []

    sections = []
    for tag in soup.find_all(["h2", "h3"]):
        txt = (tag.get_text(" ", strip=True) or "")
        m = year_pat.search(txt)
        if m:
            sections.append((tag, m.group(1), txt))

    if not sections:
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.lower().endswith(".zip"):
                url = urljoin(bulk_page_url, href)
                text = a.parent.get_text(" ", strip=True) if a.parent else a.get_text(strip=True)
                md5 = None
                m = re.search(r"MD5:\s*([0-9a-fA-F]{32})", text or "")
                if m: md5 = m.group(1).lower()
                out.append({"url": url, "label": os.path.basename(urlparse(url).path), "md5": md5, "year": None, "section_title":"unknown"})
        return out

    for idx, (heading, year, title) in enumerate(sections):
        next_heading = sections[idx + 1][0] if idx + 1 < len(sections) else None
        node = heading.next_sibling
        while node and node != next_heading:
            if hasattr(node, "find_all"):
                for a in node.find_all("a", href=True):
                    href = a["href"]
                    if href.lower().endswith(".zip"):
                        url = urljoin(bulk_page_url, href)
                        text = a.parent.get_text(" ", strip=True) if getattr(a, "parent", None) else a.get_text(strip=True)
                        md5 = None
                        m = re.search(r"MD5:\s*([0-9a-fA-F]{32})", text or "")
                        if m: md5 = m.group(1).lower()
                        out.append({"url": url, "label": os.path.basename(urlparse(url).path), "md5": md5, "year": year, "section_title": title})
            node = node.next_sibling
    return out

def jfk_year_bin(year: str) -> str:
    if year in {"2017","2018","2021","2022","2023"}: return JFK_REL_2017_2023
    if year == "2025": return JFK_REL_2025
    return JFK_REL_PRE_2017

def download_nara_jfk_bulk():
    links = parse_nara_jfk_bulk_zip_links()
    results, manifest = [], []

    for it in links:
        era_root = jfk_year_bin(it["year"] or "2017")
        marker = _marker_key(it["url"], it["label"], era_root)

        if os.path.exists(marker):
            results.append((it["label"], "skip:marker", f"already extracted → {os.path.basename(era_root)}"))
            continue

        zip_tmp  = os.path.join(TMP_ON_DRIVE, it["label"])
        res = download_file(it["url"], zip_tmp)

        if res["status"] == "skipped":
            results.append((it["label"], "skip:file", "zip already present; extracting"))
        elif res["status"] == "error":
            results.append((it["label"], "error", res.get("reason","")))
            continue
        else:
            results.append((it["label"], "ok:downloaded", f"saved → {zip_tmp}"))

        if it.get("md5"):
            try:
                md5 = hashlib.md5()
                with open(zip_tmp,"rb") as f:
                    for chunk in iter(lambda: f.read(2**20), b""): md5.update(chunk)
                if md5.hexdigest().lower() != it["md5"]:
                    results.append((it["label"], "error", "md5 mismatch; removed"))
                    try: os.remove(zip_tmp)
                    except: pass
                    continue
            except Exception as e:
                results.append((it["label"], "error", f"md5 check failed: {e}"))
                try: os.remove(zip_tmp)
                except: pass
                continue

        ez = safe_unzip(zip_tmp, era_root, marker)
        if ez["status"] == "ok":
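            #NOTE: this lists every file currently in the bin, not only the files
            #extracted from this archive, so rows may be attributed to the wrong zip URL
            for kd in KIND_DIR.values():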
                kdir = os.path.join(era_root, kd)
                for fn in os.listdir(kdir):
                    full = os.path.join(kdir, fn)
                    if os.path.isfile(full) and any(full.lower().endswith(ext) for ext in INTERESTING_EXTS):
                        manifest.append({
                            "collection":"JFK",
                            "bin": os.path.basename(era_root),
                            "source":"NARA_JFK_BULK",
                            "section_title": it.get("section_title",""),
                            "url": it["url"],
                            "local_path": full,
                            "file_kind": classify_ext(full)
                        })
            results.append((it["label"], "ok:extracted", f"→ {os.path.basename(era_root)} (moved={ez.get('moved',0)})"))
        else:
            results.append((it["label"], "error", f"extract failed: {ez.get('reason')}"))

    write_manifest(manifest)
    return results

def parse_archives_page_files(page_url: str):
    r = http_get(page_url, stream=False)
    soup = BeautifulSoup(r.content, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if any(href.lower().endswith(ext) for ext in INTERESTING_EXTS):
            links.append(urljoin(page_url, href))
    return sorted(set(links))

def download_jfk_2025_page():
    era_root = JFK_REL_2025
    manifest, results = [], []
    for url in parse_archives_page_files(JFK_2025_PAGE):
        fn = os.path.basename(urlparse(url).path)
        kind = classify_ext(fn)
        dst = os.path.join(era_root, KIND_DIR[kind], fn)
        res = download_file(url, dst)
        if res["status"] == "ok":
            results.append((fn, "ok:downloaded", f"→ {dst}"))
            manifest.append({"collection":"JFK","bin": os.path.basename(era_root),"source":"NARA_JFK_2025","url": url,"local_path": dst,"file_kind": kind})
        elif res["status"] == "skipped":
            results.append((fn, "skip:file", "already present"))
        else:
            results.append((fn, "error", res.get("reason","")))
    write_manifest(manifest)
    return results

def download_jfk_pre2017_core():
    era_root = JFK_REL_PRE_2017
    manifest, results = [], []

    for landing in [WARREN_TOC, HSCA_TOC]:
        dst = os.path.join(era_root, KIND_DIR["other"], os.path.basename(urlparse(landing).path) + ".url.txt")
        if not os.path.exists(dst):
            with open(dst, "w") as f: f.write(landing + "\n")
            results.append((os.path.basename(dst), "ok:recorded", "landing saved"))
        else:
            results.append((os.path.basename(dst), "skip:file", "landing already recorded"))
        manifest.append({"collection":"JFK","bin":os.path.basename(era_root),"source":"NARA_TOC","url":landing,"local_path":dst,"file_kind":"other"})

    for url in ARRB_PDF_SOURCES:
        fn = os.path.basename(urlparse(url).path)
        dst = os.path.join(era_root, KIND_DIR["pdf"], fn)
        res = download_file(url, dst)
        if res["status"] == "ok":
            results.append((fn, "ok:downloaded", f"→ {dst}"))
            manifest.append({"collection":"JFK","bin": os.path.basename(era_root),"source":"ARRB","url": url,"local_path": dst,"file_kind":"pdf"})
        elif res["status"] == "skipped":
            results.append((fn, "skip:file", "already present"))
        else:
            results.append((fn, "error", res.get("reason","")))
    write_manifest(manifest)
    return results

def download_rfk_2025_page():
    era_root = RFK_REL_2025
    manifest, results = [], []
    for url in parse_archives_page_files(RFK_2025_PAGE):
        fn = os.path.basename(urlparse(url).path)
        kind = classify_ext(fn)
        dst = os.path.join(era_root, KIND_DIR[kind], fn)
        res = download_file(url, dst)
        if res["status"] == "ok":
            results.append((fn, "ok:downloaded", f"→ {dst}"))
            manifest.append({"collection":"RFK","bin": os.path.basename(era_root),"source":"NARA_RFK_2025","url": url,"local_path": dst,"file_kind": kind})
        elif res["status"] == "skipped":
            results.append((fn, "skip:file", "already present"))
        else:
            results.append((fn, "error", res.get("reason","")))
    write_manifest(manifest)
    return results

def download_reference_reports():
    base_dir = REF_DIR
    manifest, results = [], []

    for landing in [WARREN_TOC, HSCA_TOC]:
        dst = os.path.join(base_dir, KIND_DIR["other"], os.path.basename(urlparse(landing).path) + ".url.txt")
        if not os.path.exists(dst):
            with open(dst, "w") as f: f.write(landing + "\n")
            results.append((os.path.basename(dst), "ok:recorded", "landing saved"))
        else:
            results.append((os.path.basename(dst), "skip:file", "landing already recorded"))
        manifest.append({"collection":"REFERENCE","bin":"TOC","source":"NARA_TOC","url":landing,"local_path":dst,"file_kind":"other"})

    for url in ARRB_PDF_SOURCES:
        fn = os.path.basename(urlparse(url).path)
        dst = os.path.join(base_dir, KIND_DIR["pdf"], fn)
        res = download_file(url, dst)
        if res["status"] == "ok":
            results.append((fn, "ok:downloaded", f"→ {dst}"))
            manifest.append({"collection":"REFERENCE","bin":"ARRB","source":"ARRB","url":url,"local_path":dst,"file_kind":"pdf"})
        elif res["status"] == "skipped":
            results.append((fn, "skip:file", "already present"))
        else:
            results.append((fn, "error", res.get("reason","")))

    write_manifest(manifest)
    return results

def run_full_acquisition():
    print("1. NARA JFK bulk (2017_2023 / 2025)")
    for label, status, msg in download_nara_jfk_bulk():
        print(f"[{status}] {label} — {msg}")

    print("2. JFK (2025)")
    for label, status, msg in download_jfk_2025_page():
        print(f"[{status}] {label} — {msg}")

    print("3. JFK (Pre_2017) (Warren/HSCA TOCs + ARRB PDFs)")
    for label, status, msg in download_jfk_pre2017_core():
        print(f"[{status}] {label} — {msg}")

    print("4. RFK (2025)")
    for label, status, msg in download_rfk_2025_page():
        print(f"[{status}] {label} — {msg}")

    print("5. Reference Reports")
    for label, status, msg in download_reference_reports():
        print(f"[{status}] {label} — {msg}")

    print("Done")

run_full_acquisition()


1. NARA JFK bulk (2017_2023 / 2025)
[skip:marker] jfk-pdf1.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-pdf2.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-pdf3.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav1.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav2.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav3.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav4.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav5.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk-wav6.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk20171103.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk20171109.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk2017111710.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk20171215a.zip — already extracted → 2017_2023_Releases
[skip:marker] jfk20171215b.zip — already extracted → 2017_2023_Re

In [12]:
def folder_preview(root: str, max_rows: int=8):
    rows = []
    for cur, _, files in os.walk(root):
        keep = [f for f in files if f.lower().endswith(INTERESTING_EXTS)]
        if keep:
            rows.append({
                "folder": cur.replace(root, "").lstrip("/"),
                "count": len(keep),
                "samples": ", ".join(sorted(keep)[:max_rows])
            })
    if not rows:
        return pd.DataFrame(columns=["folder","count","samples"])
    return pd.DataFrame(rows).sort_values(by=["folder","count"], ascending=[True, False])

def preview_all():
    print("JFK / By_Release_Date:")
    display(folder_preview(JFK_BY_RELEASE))
    print("\nJFK / Agency_Releases:")
    display(folder_preview(JFK_AGENCY))
    print("\nRFK / By_Release_Date:")
    display(folder_preview(RFK_BY_RELEASE))
    print("\nRFK / Agency_Releases:")
    display(folder_preview(RFK_AGENCY))
    print("\nReference_Reports:")
    display(folder_preview(REF_DIR))
    if os.path.exists(MANIFEST_PARQUET):
        print("\nManifest head:")
        display(pd.read_parquet(MANIFEST_PARQUET).head(40))

preview_all()


JFK / By_Release_Date:


| | folder | count | samples |
|---|---|---|---|
| 1 | 2017_2023_Releases/audio | 19 | 104-10534-10000.wav, 104-10534-10001.wav, 104-... |
| 2 | 2025_Releases/pdf | 2566 | 104-10003-10041.pdf, 104-10004-10143 (C0693220... |
| 0 | Pre_2017_Releases/pdf | 3 | Final%20Report%20of%20the%20Assassinations%20R... |



JFK / Agency_Releases:


| | folder | count | samples |
|---|---|---|---|
| 0 | pdf | 353 | JFK00001.pdf, JFK00002.pdf, JFK00003.pdf, JFK0... |



RFK / By_Release_Date:


| | folder | count | samples |
|---|---|---|---|
| 1 | 2025_Releases/audio | 13 | fbi_experience_rfk-2017_0108-box_005_tray_001-... |
| 0 | 2025_Releases/pdf | 1969 | 00010472_no_information_identifiable_with_shar... |



RFK / Agency_Releases:


| | folder | count | samples |
|---|---|---|---|
| 0 | pdf | 3 | rfksumm1a.pdf, rfksumm1b.pdf, rfksumm1c.pdf |



Reference_Reports:


| | folder | count | samples |
|---|---|---|---|
| 0 | pdf | 3 | Final%20Report%20of%20the%20Assassinations%20R... |



Manifest head:


| | collection | bin | source | section_title | url | local_path | file_kind |
|---|---|---|---|---|---|---|---|
| 0 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 1 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 2 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 3 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 4 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 5 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 6 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 7 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 8 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
| 9 | JFK | 2017_2023_Releases | NARA_JFK_BULK | unknown | https://www.archives.gov/files/research/jfk/re... | /content/drive/MyDrive/JFK_RFK_Project/JFK/By_... | pdf |
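
The manifest rows are plain dictionaries before `write_manifest` stores them, so a quick tally by collection and file kind needs nothing beyond the standard library. A small sketch with made-up rows (the URLs and paths below are placeholders, not real entries); `summarize_manifest` is an illustrative name, not a function from the notebook.

```python
from collections import Counter

def summarize_manifest(rows):
    """Count manifest rows per (collection, file_kind) pair."""
    return Counter((r["collection"], r["file_kind"]) for r in rows)

# Placeholder rows in the same shape as the manifest entries.
rows = [
    {"collection": "JFK", "file_kind": "pdf", "url": "https://example.org/a.zip", "local_path": "/tmp/a.pdf"},
    {"collection": "JFK", "file_kind": "pdf", "url": "https://example.org/a.zip", "local_path": "/tmp/b.pdf"},
    {"collection": "JFK", "file_kind": "audio", "url": "https://example.org/w.zip", "local_path": "/tmp/c.wav"},
    {"collection": "RFK", "file_kind": "pdf", "url": "https://example.org/r.pdf", "local_path": "/tmp/r.pdf"},
]

for (collection, kind), n in sorted(summarize_manifest(rows).items()):
    print(collection, kind, n)
```

For the stored manifest, `pd.read_parquet(MANIFEST_PARQUET).to_dict("records")` yields rows in the same shape.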


## Section 3: Extracting unstructured data ##

The following code extracts text, mainly from the audio and PDF files. We use faster-whisper to transcribe audio and PyMuPDF to read PDFs; scanned pages go through Tesseract OCR for text extraction. The text is stored in JSONL sidecar files.
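
A JSONL sidecar holds one JSON object per line. The sketch below assumes each record carries the `type`, `page`, `engine` and `text` fields that the PDF extractor yields; `write_sidecar` and `read_sidecar` are hypothetical helpers, and the notebook's own writer may differ in its details.

```python
import json
import os
import tempfile

def write_sidecar(path: str, records) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_sidecar(path: str) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with placeholder records; the sidecar name follows the "<file>.jsonl" pattern.
records = [
    {"type": "span", "page": None, "engine": "pymupdf4llm", "text": "First block of text"},
    {"type": "span", "page": 2, "engine": "tesseract", "text": "OCR text with a\nline break"},
]
with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, "example.pdf.jsonl")
    write_sidecar(sidecar, records)
    print(read_sidecar(sidecar) == records)
```

Because `json.dumps` escapes embedded newlines, each record stays on a single line even when the extracted text spans several.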

In [3]:
#pip installs
!apt -yq install tesseract-ocr ffmpeg >/dev/null
!pip -q install "pymupdf==1.24.10" "pymupdf4llm>=0.0.17" \
                pdfminer.six pytesseract \
                pillow pandas pyarrow fastparquet tqdm \
                faster-whisper==1.0.3

In [4]:
#imports
import os, io, json, time, hashlib, random, sys, shutil, threading, gc, inspect, re
from pathlib import Path
from typing import List, Dict, Any, Iterable, Tuple

import pandas as pd
from tqdm.auto import tqdm
from PIL import Image

import fitz
import pymupdf4llm
from pdfminer.high_level import extract_text as pdfminer_extract_text
import pytesseract
from faster_whisper import WhisperModel
from IPython.display import display, Markdown

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

In [5]:
#constants

DRIVE_ROOT = "/content/drive/MyDrive/JFK_RFK_Project"
DRIVE_JFK  = os.path.join(DRIVE_ROOT, "JFK")
DRIVE_RFK  = os.path.join(DRIVE_ROOT, "RFK")
DRIVE_REF  = os.path.join(DRIVE_ROOT, "Reference_Reports")

DATA_ROOT = DRIVE_ROOT
JFK_DIR   = DRIVE_JFK
RFK_DIR   = DRIVE_RFK
REF_DIR   = DRIVE_REF

RX_DOC_EXTS   = (".pdf",)
RX_IMAGE_EXTS = (".png",".jpg",".jpeg",".tif",".tiff")
RX_AUDIO_EXTS = (".mp3",".wav")

#FAST MODE: triage pass so you can move on quickly.

FAST_MODE = True
FAST_SCANNED_OCR_PAGES = 0   #0 = skip scanned OCR; N (e.g. 2) = OCR only first N pages

RX_PDF_MAX_PAGES = None
TESS_LANG        = "eng"
RX_VISUALIZE     = False
RX_VIZ_MAX_PAGES = 1
RAW_INDEX        = os.path.join(DRIVE_ROOT, "raw_index.parquet")

RX_OCR_DPI        = 170
_OCR_CONCURRENCY  = 2

def _has_cuda() -> bool:
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        return os.path.exists("/usr/local/cuda") or (shutil.which("nvidia-smi") is not None)

ASR_DEVICE = "cuda" if _has_cuda() else "cpu"
ASR_MODEL  = WhisperModel(
    "tiny",
    device=ASR_DEVICE,
    compute_type=("float16" if ASR_DEVICE=="cuda" else "int8"))

_ASR_CONCURRENCY = 1 if ASR_DEVICE == "cuda" else 2
_ASR_SEM  = threading.Semaphore(_ASR_CONCURRENCY)
_OCR_SEM  = threading.Semaphore(_OCR_CONCURRENCY)
_PRINT_LOCK = threading.Lock()

BATCH_ROOT    = "/content/JFK_RFK_BATCH"
BATCH_SIZE    = 1000          #how many files per batch
BATCH_WORKERS = 5             #threads per batch
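
How `FAST_MODE`, `FAST_SCANNED_OCR_PAGES` and `RX_PDF_MAX_PAGES` combine into a page budget for scanned PDFs can be summarised in one function. This is a sketch of how the constants appear to interact, not code taken from the notebook; `ocr_page_limit` and its parameter names are hypothetical.

```python
from typing import Optional

def ocr_page_limit(total_pages: int,
                   fast_mode: bool,
                   fast_pages: Optional[int],
                   pdf_max_pages: Optional[int] = None) -> int:
    """Return how many pages of a scanned PDF would be sent to OCR."""
    if fast_mode and fast_pages is not None:
        limit = max(fast_pages, 0)   # 0 means: skip OCR entirely
    else:
        limit = pdf_max_pages        # None means: no cap
    return total_pages if limit is None else min(limit, total_pages)

print(ocr_page_limit(10, fast_mode=True, fast_pages=0))    # 0
print(ocr_page_limit(10, fast_mode=True, fast_pages=2))    # 2
print(ocr_page_limit(10, fast_mode=False, fast_pages=0))   # 10
```

With the defaults in the cell above (`FAST_MODE = True`, `FAST_SCANNED_OCR_PAGES = 0`), scanned PDFs are skipped; setting `FAST_MODE = False` gives full OCR at the cost of much more runtime.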




In [6]:
#helpers
def log(msg: str):
    with _PRINT_LOCK:
        try:
            tqdm.write(str(msg))
        except Exception:
            print(msg, flush=True)

def rx_sha256_of_path(path: str, block=2**20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""): h.update(chunk)
    return h.hexdigest()

def rx_bin_root_for(src_path: str) -> str:
    cur = Path(src_path).parent
    while True:
        if cur.name in {"pdf","audio","image","other"}:
            return str(cur.parent)
        if str(cur) == "/": break
        cur = cur.parent
    return str(Path(src_path).parent)

def rx_make_sidecar_path(src_path: str) -> str:
    bin_root = rx_bin_root_for(src_path)
    type_dir = Path(src_path).parent.name
    fn = Path(src_path).name
    sidecar_name = fn + ".jsonl"
    dest = Path(bin_root) / "Extracted_JSONL" / type_dir / sidecar_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    return str(dest)

def rx_sidecar_done(sidecar: str) -> bool:
    if os.path.exists(sidecar) and os.path.getsize(sidecar) > 0:
        return True
    if sidecar.startswith("/content/"):
        drive_sc = sidecar.replace("/content/JFK_RFK_BATCH", DRIVE_ROOT, 1)
        if os.path.exists(drive_sc) and os.path.getsize(drive_sc) > 0:
            return True
    return False

def rx_update_raw_index(rows: List[Dict[str, Any]]):
    if not rows: return
    df = pd.DataFrame(rows)
    if os.path.exists(RAW_INDEX):
        old = pd.read_parquet(RAW_INDEX)
        df = pd.concat([old, df], ignore_index=True).drop_duplicates(subset=["sidecar"], keep="last")
    df.to_parquet(RAW_INDEX)

def rx_is_likely_digital_pdf(pdf_path: str) -> bool:
    try:
        with fitz.open(pdf_path) as doc:
            take = min(len(doc), 3)
            hits = sum(1 for i in range(take) if doc[i].get_text("text").strip())
            return hits >= 1
    except Exception:
        return False

def pdf_to_markdown_with_tables(pdf_path: str) -> str:
    fn = pymupdf4llm.to_markdown
    try:
        params = inspect.signature(fn).parameters
        if "image_ocr" in params:
            return fn(pdf_path, image_ocr=False)
        if "ocr_images" in params:
            return fn(pdf_path, ocr_images=False)
        return fn(pdf_path)
    except Exception as e:
        parts = []
        try:
            with fitz.open(pdf_path) as doc:
                total = len(doc) if RX_PDF_MAX_PAGES is None else min(RX_PDF_MAX_PAGES, len(doc))
                for i in range(total):
                    parts.append(doc[i].get_text("text"))
            md = "\n\n".join(parts)
            log(f"[fallback] to_markdown failed ({e}); returned plain text ({len(md)} chars)")
            return md
        except Exception as e2:
            raise RuntimeError(f"pymupdf4llm.to_markdown failed and fallback text extraction also failed: {e2}") from e

def pdf_pages_to_pixmap_stream(pdf_path: str, dpi: int, max_pages: int = None):
    with fitz.open(pdf_path) as doc:
        total_all = len(doc)
        total = total_all if max_pages is None else min(max_pages, total_all)
        zoom = dpi / 72.0
        mat = fitz.Matrix(zoom, zoom)
        for i in range(total):
            pix = doc[i].get_pixmap(matrix=mat, alpha=False)
            yield i, pix

def ocr_pil(img: Image.Image, lang=TESS_LANG) -> str:
    try:
        return pytesseract.image_to_string(img, lang=lang) or ""
    except Exception:
        return ""

def rx_iter_pdf_records(pdf_path: str) -> Iterable[Dict[str, Any]]:
    if rx_is_likely_digital_pdf(pdf_path):
        md = pdf_to_markdown_with_tables(pdf_path)
        if RX_VISUALIZE:
            display(Markdown(f"**Digital PDF** — {os.path.basename(pdf_path)}"))
            print(md[:800], "…")
        yield {"type":"span","page":None,"engine":"pymupdf4llm","text":md}
        return

    if FAST_MODE and FAST_SCANNED_OCR_PAGES == 0:
        yield {"type":"notice", "page":None, "engine":"none", "text":"[FAST_MODE] scanned PDF OCR skipped"}
        return

    max_pages = RX_PDF_MAX_PAGES
    if FAST_MODE and FAST_SCANNED_OCR_PAGES is not None:
        max_pages = FAST_SCANNED_OCR_PAGES if FAST_SCANNED_OCR_PAGES > 0 else 0

    with _OCR_SEM:
        for i, pix in pdf_pages_to_pixmap_stream(pdf_path, dpi=RX_OCR_DPI, max_pages=max_pages):
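            #initialise first so `del im` in the finally block cannot fail if rendering raises
            im, txt = None, ""
            try: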
                im = Image.open(io.BytesIO(pix.tobytes("png"))).convert("L")
                txt = ocr_pil(im)
                if RX_VISUALIZE and i < RX_VIZ_MAX_PAGES:
                    display(Markdown(f"**Scanned PDF (OCR)** — {os.path.basename(pdf_path)} p.{i+1}"))
                    display(im); print(txt[:600], "…")
            finally:
                del im, pix
            yield {"type":"span","page":i+1,"engine":"tesseract","text":txt}
            if (i % 5) == 0:
                gc.collect()

def rx_iter_audio_records(audio_path: str) -> Iterable[Dict[str, Any]]:
    """Yield segment records as they are decoded; then a 'full' summary."""
    with _ASR_SEM:
        segments, info = ASR_MODEL.transcribe(audio_path, beam_size=5)
        lang = getattr(info, "language", "unk")
        full_parts = []
        last_end = None
        count = 0
        for s in segments:
            yield {"type":"segment","segment":count,"start":s.start,"end":s.end,"text":s.text}
            full_parts.append(s.text)
            last_end = s.end
            count += 1
        full_text = " ".join(full_parts).strip()
    log(f"[Audio ASR] {os.path.basename(audio_path)} → {count} segments, {len(full_text)} chars")
    if RX_VISUALIZE and full_text:
        display(Markdown(f"**Audio ASR** — {os.path.basename(audio_path)}"))
        print(full_text[:800], "…")
    yield {"type":"segment","segment":"full","text":full_text,"language":lang,"duration":last_end}

def rx_ingest_one(src_path: str) -> Dict[str, Any]:
    sidecar = rx_make_sidecar_path(src_path)
    if rx_sidecar_done(sidecar):
        return {"status":"skipped", "sidecar": sidecar}

    ext = Path(src_path).suffix.lower()
    t0 = time.time()
    try:
        meta = {
            "type":"meta",
            "source_path": src_path,
            "source_rel": os.path.relpath(src_path, DATA_ROOT),
            "sha256": rx_sha256_of_path(src_path),
            "bytes": os.path.getsize(src_path),
            "extracted_at": time.strftime("%Y-%m-%d %H:%M:%S"),
            "mime": "application/pdf" if ext==".pdf" else ("image" if ext in RX_IMAGE_EXTS else "audio")
        }
        Path(sidecar).parent.mkdir(parents=True, exist_ok=True)
        with open(sidecar, "w", encoding="utf-8") as f:
            f.write(json.dumps(meta, ensure_ascii=False) + "\n")

            if ext in RX_DOC_EXTS:
                for r in rx_iter_pdf_records(src_path):
                    f.write(json.dumps(r, ensure_ascii=False) + "\n")

            elif ext in RX_IMAGE_EXTS:
                with _OCR_SEM:
                    im = Image.open(src_path).convert("L")
                    txt = ocr_pil(im)
                    del im
                f.write(json.dumps({"type":"span","page":1,"engine":"tesseract","text":txt},
                                   ensure_ascii=False) + "\n")

            elif ext in RX_AUDIO_EXTS:
                for r in rx_iter_audio_records(src_path):
                    f.write(json.dumps(r, ensure_ascii=False) + "\n")
            else:
                return {"status":"skipped","sidecar":sidecar}

        secs = round(time.time()-t0,2)
        return {"status":"ok","sidecar":sidecar,"secs":secs}

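    except Exception as e:
        #remove the partial sidecar, otherwise rx_sidecar_done() treats the failed file as finished
        try:
            os.remove(sidecar)
        except OSError:
            pass
        return {"status":"error","reason":str(e),"sidecar":sidecar}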

def rx_all_source_files() -> List[str]:
    include_dirs = [
        os.path.join(JFK_DIR, "By_Release_Date"),
        os.path.join(JFK_DIR, "Agency_Releases"),
        os.path.join(RFK_DIR, "By_Release_Date"),
        os.path.join(RFK_DIR, "Agency_Releases"),
        REF_DIR,
    ]
    paths = []
    for root in include_dirs:
        if not os.path.exists(root): continue
        for cur, _, files in os.walk(root):
            for fn in files:
                ext = Path(fn).suffix.lower()
                if ext in RX_DOC_EXTS + RX_IMAGE_EXTS + RX_AUDIO_EXTS:
                    paths.append(os.path.join(cur, fn))
    return paths

def split_paths(paths: List[str]) -> Tuple[List[str], List[str]]:
    pdf_img = []
    audio = []
    for p in paths:
        ext = Path(p).suffix.lower()
        if ext in RX_AUDIO_EXTS:
            audio.append(p)
        else:
            pdf_img.append(p)
    return pdf_img, audio

def rx_run_raw_extraction(limit_files: int = None,
                          shuffle: bool = False,
                          workers: int = 5,
                          progress_every: int = 200):
    rows: List[Dict[str, Any]] = []
    paths = rx_all_source_files()
    if shuffle: random.shuffle(paths)
    if limit_files: paths = paths[:limit_files]

    total = len(paths)
    if total == 0:
        log("No eligible files found. Check your folder paths.")
        return

    skipped = processed = errors = 0
    with ThreadPoolExecutor(max_workers=workers) as ex, \
         tqdm(total=total, desc=f"Extracting 0/{total}", leave=True, mininterval=0.2, dynamic_ncols=True) as pbar:
        fut2p = {ex.submit(rx_ingest_one, p): p for p in paths}
        for fut in as_completed(fut2p):
            p = fut2p[fut]
            rel = os.path.relpath(p, DATA_ROOT)
            try:
                res = fut.result()
            except Exception as e:
                res = {"status":"error","reason":str(e),"sidecar":rx_make_sidecar_path(p)}
            rows.append({**res, "source": rel})

            st = res.get("status")
            if st == "ok": processed += 1
            elif st == "skipped": skipped += 1
            else: errors += 1

            done = processed + skipped + errors
            pbar.update(1)
            pbar.set_description_str(f"Extracting {done}/{total}")
            pbar.set_postfix_str(f"extracted {processed}/{total} • skip {skipped} • err {errors}")
            if progress_every and (done % progress_every == 0 or done == total):
                log(f"[progress] {processed}/{total} extracted • {done}/{total} processed (skip={skipped}, err={errors})")

    rx_update_raw_index(rows)
    log(f"\nSummary → processed: {processed}, skipped: {skipped}, errors: {errors}")
    log(f"Done. Index: {RAW_INDEX}")


def _drive_equivalent(path: str) -> str:
    return path.replace(BATCH_ROOT, DRIVE_ROOT, 1)

def _local_equivalent(path: str) -> str:
    return path.replace(DRIVE_ROOT, BATCH_ROOT, 1)

def _drive_all_source_files():
    include_dirs = [DRIVE_JFK, DRIVE_RFK, DRIVE_REF]
    for root in include_dirs:
        if not os.path.exists(root):
            continue
        for cur, _, files in os.walk(root):
            for fn in files:
                ext = Path(fn).suffix.lower()
                if ext in RX_DOC_EXTS + RX_IMAGE_EXTS + RX_AUDIO_EXTS:
                    yield os.path.join(cur, fn)

def _expected_drive_sidecar_for_src(drive_src_path: str) -> str:
    return rx_make_sidecar_path(drive_src_path)

def _copy_batch_to_local(drive_paths: List[str]) -> List[str]:
    local_paths: List[str] = []
    for dp in drive_paths:
        lp = _local_equivalent(dp)
        Path(os.path.dirname(lp)).mkdir(parents=True, exist_ok=True)
        shutil.copy2(dp, lp)
        local_paths.append(lp)
    return local_paths

def _copy_sidecars_back_to_drive(rows: List[Dict[str, Any]]):
    copied = 0
    for r in rows:
        if r.get("status") != "ok":
            continue
        sc_local = r.get("sidecar")
        if not sc_local or not os.path.exists(sc_local):
            continue
        sc_drive = _drive_equivalent(sc_local)
        Path(os.path.dirname(sc_drive)).mkdir(parents=True, exist_ok=True)
        shutil.copy2(sc_local, sc_drive)
        copied += 1
    log(f"[sync] Copied {copied} new sidecars → Drive")

def _process_local_paths(local_paths: List[str], drive_paths: List[str], workers: int = BATCH_WORKERS) -> List[Dict[str, Any]]:
    """Run rx_ingest_one on local_paths with a thread pool; index rows store Drive-relative 'source'."""
    rows: List[Dict[str, Any]] = []
    total = len(local_paths)
    if total == 0:
        return rows

    global DATA_ROOT
    old_data_root = DATA_ROOT
    DATA_ROOT = DRIVE_ROOT
    try:
        with ThreadPoolExecutor(max_workers=workers) as ex, \
             tqdm(total=total, desc=f"Batch extracting 0/{total}", mininterval=0.2, dynamic_ncols=True) as pbar:
            fut2pair = {ex.submit(rx_ingest_one, lp): (lp, dp) for lp, dp in zip(local_paths, drive_paths)}
            done = processed = skipped = errors = 0
            for fut in as_completed(fut2pair):
                lp, dp = fut2pair[fut]
                rel = os.path.relpath(dp, DRIVE_ROOT)
                try:
                    res = fut.result()
                except Exception as e:
                    res = {"status":"error","reason":str(e),"sidecar":rx_make_sidecar_path(lp)}
                rows.append({**res, "source": rel})

                st = res.get("status")
                if st == "ok": processed += 1
                elif st == "skipped": skipped += 1
                else: errors += 1

                done = processed + skipped + errors
                pbar.update(1)
                pbar.set_description_str(f"Batch extracting {done}/{total}")
                pbar.set_postfix_str(f"extracted {processed}/{total} • skip {skipped} • err {errors}")
    finally:
        DATA_ROOT = old_data_root
    return rows

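def _append_rows_to_drive_index(rows: List[Dict[str, Any]]):
    if not rows:
        return
    #index the Drive copy of each sidecar; the local batch copy is deleted after every batch
    rx_update_raw_index([
        {**r, "sidecar": _drive_equivalent(r["sidecar"])} if r.get("sidecar") else r
        for r in rows
    ])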

def rx_run_batched_from_drive(batch_size: int = BATCH_SIZE, limit_files: int = None, shuffle: bool = False):
    all_paths = list(_drive_all_source_files())
    if shuffle:
        random.shuffle(all_paths)
    if limit_files:
        all_paths = all_paths[:limit_files]

    total = len(all_paths)
    if total == 0:
        log("No eligible files found on Drive.")
        return

    todo_drive: List[str] = []
    for dp in all_paths:
        sc_drive = _expected_drive_sidecar_for_src(dp)
        if os.path.exists(sc_drive) and os.path.getsize(sc_drive) > 0:
            continue
        todo_drive.append(dp)

    total_todo = len(todo_drive)
    already_done = total - total_todo
    log(f"Total on Drive: {total}  |  already extracted (skipped): {already_done}  |  to process: {total_todo}")

    processed_global = 0
    start = 0
    while start < total_todo:
        end = min(start + batch_size, total_todo)
        batch_drive = todo_drive[start:end]

        if os.path.exists(BATCH_ROOT):
            shutil.rmtree(BATCH_ROOT)
        Path(BATCH_ROOT).mkdir(parents=True, exist_ok=True)

        log(f"\n[batch] Copying {len(batch_drive)} files Drive → local SSD …")
        batch_local = _copy_batch_to_local(batch_drive)

        log("[batch] Extracting locally …")
        rows = _process_local_paths(batch_local, batch_drive, workers=BATCH_WORKERS)

        log("[batch] Syncing sidecars back to Drive …")
        _copy_sidecars_back_to_drive(rows)

        _append_rows_to_drive_index(rows)

        processed_now = sum(1 for r in rows if r.get("status") == "ok")
        processed_global += processed_now
        log(f"[progress] extracted {processed_global}/{total_todo} in this run "
            f"(overall done incl. existing: {already_done + processed_global}/{total})")

        shutil.rmtree(BATCH_ROOT, ignore_errors=True)

        start = end

    log(f"Final: extracted {processed_global}/{total_todo} newly • already existed {already_done} • total {total}")

def rx_preview_extracted(n: int = 10):
    idx = RAW_INDEX
    if not os.path.exists(idx):
        print("No raw_index.parquet yet."); return
    df = pd.read_parquet(idx)
    display(df.head(n))
    ok = df[df.status=="ok"]
    if len(ok):
        sample = ok.iloc[0]["sidecar"]
        print("Sample sidecar:", sample)
        with open(sample,"r",encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i>30: break
                print(line.rstrip())
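
The sidecar location used throughout this section follows one rule: the JSONL file lives in an `Extracted_JSONL/<type>/` folder next to the type folder of its source file. The sketch below restates that rule without touching the disk. It is an illustration only; `sidecar_path_for` and `TYPE_DIRS` are hypothetical names that are not part of the notebook.

```python
from pathlib import PurePosixPath

# Folder names that mark the file type level (taken from rx_bin_root_for above)
TYPE_DIRS = {"pdf", "audio", "image", "other"}

def sidecar_path_for(src_path: str) -> str:
    """Hypothetical, side-effect-free restatement of the sidecar path rule."""
    src = PurePosixPath(src_path)
    # Default: no type folder found, so the sidecar root is the file's own folder
    release_root = src.parent
    # Walk upwards until a type folder is found; its parent holds Extracted_JSONL
    for folder in src.parents:
        if folder.name in TYPE_DIRS:
            release_root = folder.parent
            break
    return str(release_root / "Extracted_JSONL" / src.parent.name / (src.name + ".jsonl"))

print(sidecar_path_for("/data/JFK/By_Release_Date/2025_Releases/pdf/example.pdf"))
```

Unlike `rx_make_sidecar_path`, this version does not create the destination folder, so it is safe to call when only the expected path is needed.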



In [None]:
#Batched: process 1000 files at a time from Drive → local SSD:
rx_run_batched_from_drive(batch_size=1000, shuffle=False)

In [7]:
#Inspect extracted files:
rx_preview_extracted()

Unnamed: 0,status,sidecar,secs,source
0,ok,/content/drive/MyDrive/JFK_RFK_Project/JFK/By_...,5.31,JFK/By_Release_Date/2017_2023_Releases/pdf/104...
1,ok,/content/drive/MyDrive/JFK_RFK_Project/JFK/By_...,6.49,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
2,ok,/content/drive/MyDrive/JFK_RFK_Project/JFK/By_...,6.22,JFK/By_Release_Date/2017_2023_Releases/pdf/124...
3,ok,/content/drive/MyDrive/JFK_RFK_Project/JFK/By_...,5.77,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
4,ok,/content/drive/MyDrive/JFK_RFK_Project/JFK/By_...,18.56,JFK/By_Release_Date/2025_Releases/pdf/104-1016...
5,ok,/content/JFK_RFK_BATCH/JFK/By_Release_Date/201...,0.2,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
6,ok,/content/JFK_RFK_BATCH/JFK/By_Release_Date/201...,0.16,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
7,ok,/content/JFK_RFK_BATCH/JFK/By_Release_Date/201...,0.24,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
8,ok,/content/JFK_RFK_BATCH/JFK/By_Release_Date/201...,0.02,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...
9,ok,/content/JFK_RFK_BATCH/JFK/By_Release_Date/201...,0.0,JFK/By_Release_Date/2017_2023_Releases/pdf/doc...


Sample sidecar: /content/drive/MyDrive/JFK_RFK_Project/JFK/By_Release_Date/2017_2023_Releases/Extracted_JSONL/pdf/104-10529-10372.pdf.jsonl
{"type": "meta", "source_path": "/content/drive/MyDrive/JFK_RFK_Project/JFK/By_Release_Date/2017_2023_Releases/pdf/104-10529-10372.pdf", "source_rel": "JFK/By_Release_Date/2017_2023_Releases/pdf/104-10529-10372.pdf", "sha256": "7fed020033c727d69e53c212f18cb8467b8e5b5adc8a514b68358aa2c729fe0e", "bytes": 124538, "extracted_at": "2025-08-17 01:24:08", "mime": "application/pdf"}
{"type": "span", "page": 1, "engine": "tesseract", "text": " gck-< . = et\n\nekl\n\n.\n\nare - a ree 90768\n\n*RE pwOOD a\n\n \n\n \n\na neta\n* yomiait, soy . femme\nbags SRY 88\n| Pe aM tee , Bitten\n\nPROP S yea”\n\n \n\f"}
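
As the sample above shows, a sidecar is a JSONL file: one `meta` record followed by `span` records (PDFs and images) or `segment` records (audio). The following is a minimal sketch of how such a file could be summarized. `summarize_sidecar` is a hypothetical helper, and the sample records are made up for illustration rather than taken from the real data.

```python
import json
from collections import Counter

def summarize_sidecar(lines):
    """Count record types and join the text of span/segment records."""
    kinds = Counter()
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # malformed line, skip it
        if not isinstance(rec, dict):
            continue
        kinds[rec.get("type")] += 1
        # The audio "full" record repeats every segment, so it is left out of the text
        if rec.get("type") in ("span", "segment") and rec.get("segment") != "full":
            text = (rec.get("text") or "").strip()
            if text:
                parts.append(text)
    return kinds, " ".join(parts)

# Made-up records that mimic the shape of an audio sidecar
sample = [
    '{"type": "meta", "source_path": "example.mp3", "mime": "audio"}',
    '{"type": "segment", "segment": 0, "start": 0.0, "end": 1.5, "text": "first part"}',
    '{"type": "segment", "segment": 1, "start": 1.5, "end": 3.0, "text": "second part"}',
    '{"type": "segment", "segment": "full", "text": "first part second part"}',
]
kinds, text = summarize_sidecar(sample)
print(dict(kinds))
print(text)
```

For a real sidecar, the same function can be fed an open file handle, since iterating over a text file yields its lines.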


## Section 4: Unstructured to Structured data ##

In [None]:
#pip installs
!pip -q install --upgrade pip setuptools wheel
!pip -q install "numpy==2.0.2"
!pip -q install "spacy==3.8.2" "thinc==8.3.2"
!pip -q install "pandas==2.2.2" "dateparser==1.2.0" "yake==0.4.8" \
                 "nltk==3.9.1" "matplotlib==3.9.0" "tqdm==4.66.4"
!python -m spacy download en_core_web_sm -q

In [None]:
#imports
import os, re, io, json, gc, glob, math, random, warnings, time, csv
from pathlib import Path
from collections import Counter, defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import spacy
from spacy.tokens import Doc
import dateparser
from yake import KeywordExtractor

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.auto import tqdm
nltk.download('vader_lexicon', quiet=True)

random.seed(7)


In [None]:
#constants
DRIVE_ROOT = "/content/drive/MyDrive/JFK_RFK_Project"
SIDECARE_GLOB = os.path.join(DRIVE_ROOT, "**", "Extracted_JSONL", "**", "*.jsonl")

OUT_ROOT = os.path.join(DRIVE_ROOT, "Structured")
FIG_ROOT = os.path.join(OUT_ROOT, "figs")
os.makedirs(OUT_ROOT, exist_ok=True)
os.makedirs(FIG_ROOT, exist_ok=True)


DOCS_PER_CHUNK    = 150           #sidecars per batch
MAX_TEXT_PER_DOC  = 80000        #cap chars per doc
NLP_BATCH_SIZE    = 12            #spaCy sub-batch
NLP_WORKERS       = 1             #IMPORTANT: single-process avoids Colab EOFError

ERA_INCLUDE = None

PLOT_MAX_DOCS_PER_ERA = 4000
TOP_ENTS_N = 20

ERA_PAT = re.compile(r"(Pre_2017_Releases|2017_2023_Releases|2025_Releases|Pre_2025_Releases|Agency_Releases|Reference_Reports)")
COL_PAT = re.compile(r"/(JFK|RFK|Reference_Reports)/")

ASSASSINATION_LEXICON = set("""
assassination assassinate assassin sniper shot shooting gun rifle handgun bullet
conspiracy cover-up coverup plot cabal collusion
cia fbi nsa secret service oswald sirhan ballistics autopsy autopsies
declassified classified redaction redacted memo cable dispatch
""".split())

DATE_SETTINGS = {"PREFER_DAY_OF_MONTH": "first", "PREFER_DATES_FROM": "past"}
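
One detail of `ASSASSINATION_LEXICON` is worth keeping in mind: because the block is split on whitespace, a phrase such as "secret service" is stored as the two separate words "secret" and "service", while "cover-up" stays a single entry. The sketch below shows the effect on a small subset of the lexicon, using the same token pattern that `doc_features` applies later in this section. `LEXICON_SUBSET`, `TOKEN_RE` and `lexicon_hits` are hypothetical names used only for this illustration.

```python
import re
from collections import Counter

# A small subset of the lexicon, built the same way (whitespace split)
LEXICON_SUBSET = set("""
cia fbi secret service cover-up coverup memo
""".split())

# Same token pattern as in doc_features: letters, hyphens and apostrophes, 3+ chars
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z\-']{2,}")

def lexicon_hits(text):
    """Count lowercase tokens that appear in the lexicon subset."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return Counter(t for t in tokens if t in LEXICON_SUBSET)

hits = lexicon_hits("The CIA memo mentions a cover-up by the Secret Service.")
print(hits)
print("secret service" in LEXICON_SUBSET)
```

This means any document mentioning "secret" or "service" on its own also counts as a lexicon hit, which may inflate `lex_hits` for unrelated text.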



In [None]:
#functions
def detect_bucket(p):
    era = ERA_PAT.search(p)
    col = COL_PAT.search(p)
    era_str = era.group(1) if era else "UnknownEra"
    col_str = col.group(1) if col else "OTHER"
    if col_str == "Reference_Reports": col_str = "REF"
    return f"{col_str}::{era_str}"

def load_spacy():
    nlp = spacy.load("en_core_web_sm", disable=["textcat","lemmatizer","senter"])
    return nlp

nlp = load_spacy()
sia = SentimentIntensityAnalyzer()
yake_kw = KeywordExtractor(lan="en", n=1, top=20)


def iter_sidecar_text(sidecar_path, cap=MAX_TEXT_PER_DOC):
    """Stream a sidecar into a single text blob (capped) + meta."""
    meta = {"source": None, "mime": None, "file_kind": None}
    parts, total = [], 0
    with open(sidecar_path, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except ValueError:  #malformed JSON line, skip it
                continue
            t = obj.get("type")
            if t == "meta":
                meta["source"] = obj.get("source_path")
                meta["mime"]   = obj.get("mime")
                sp = meta["source"] or ""
                ext = os.path.splitext(sp)[1].lower()
                if ext == ".pdf": meta["file_kind"] = "pdf"
                elif ext in (".mp3",".wav"): meta["file_kind"] = "audio"
                elif ext in (".jpg",".jpeg",".png",".tif",".tiff"): meta["file_kind"] = "image"
                else: meta["file_kind"] = "other"
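            elif t in ("span","segment"):
                #audio sidecars end with a "full" record that repeats every segment; skip it to avoid doubling the text
                if obj.get("segment") == "full":
                    continue
                text = obj.get("text") or ""
                if not text:
                    continue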
                if total + len(text) > cap:
                    text = text[: max(0, cap - total)]
                parts.append(text)
                total += len(text)
                if total >= cap:
                    break
    return (" ".join(parts).strip(), meta)

def parse_dates_free(text):
    """Grab simple forms: YYYY or dd/mm/yyyy (etc) → ISO date strings."""
    hits = set()
    for m in re.finditer(r"\b(?:\d{1,2}[\-/]\d{1,2}[\-/]\d{2,4}|\d{4})\b", text):
        s = m.group(0)
        dt = dateparser.parse(s, settings=DATE_SETTINGS)
        if dt:
            hits.add(dt.date().isoformat())
    return sorted(hits)

def doc_features(doc: Doc):
    ents = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in {"PERSON","ORG","GPE","LOC","DATE"}]
    sent = sia.polarity_scores(doc.text)["compound"]
    kws  = [k for k,_ in yake_kw.extract_keywords(doc.text)]
    toks = [t.lower() for t in re.findall(r"[A-Za-z][A-Za-z\-']{2,}", doc.text)]
    lex_hits = [w for w in toks if w in ASSASSINATION_LEXICON]
    return ents, sent, kws, Counter(lex_hits)

def out_paths_for_era(era_key):
    base = os.path.join(OUT_ROOT, era_key.replace("::","/"))
    fig_dir = os.path.join(FIG_ROOT, era_key.replace("::","_"))
    os.makedirs(base, exist_ok=True)
    os.makedirs(fig_dir, exist_ok=True)
    return {
        "base": base,
        "fig_dir": fig_dir,
        "ledger": os.path.join(base, "ledger.csv"),
        "docs_glob": os.path.join(base, "docs-*.csv"),
        "ents_glob": os.path.join(base, "entities-*.csv"),
        "kw_glob":   os.path.join(base, "keywords-*.csv"),
        "dates_glob":os.path.join(base, "dates-*.csv"),
        "cooc_file": os.path.join(base, "cooccurrence.csv"),
    }

def has_extractable_text(sidecar_path, quick_stop=1500):
    """Fast preflight: does the sidecar contain any span/segment text?"""
    n = 0
    with open(sidecar_path, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except ValueError:  #malformed JSON line, skip it
                continue
            if obj.get("type") in ("span","segment") and (obj.get("text") or "").strip():
                return True
            n += 1
            if n > quick_stop:
                break
    return False

def read_ledger(ledger_path):
    if not os.path.exists(ledger_path):
        return set()
    done = set()
    try:
        with open(ledger_path, "r", encoding="utf-8", errors="ignore") as f:
            r = csv.reader(f)
            header = next(r, None)
            for row in r:
                if not row: continue
                done.add(row[0])
    except Exception:
        try:
            df = pd.read_csv(ledger_path, usecols=["sidecar"])
            done = set(df["sidecar"].astype(str))
        except Exception:
            pass
    return done

def append_ledger_rows(ledger_path, rows):
    hdr = not os.path.exists(ledger_path)
    with open(ledger_path, "a", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        if hdr:
            w.writerow(["sidecar","status","chars","tokens","lex_hits","processed_at"])
        for r in rows:
            w.writerow(r)

def next_seq_id(glob_pattern):
    files = sorted(glob.glob(glob_pattern))
    if not files:
        return 1
    nums = []
    for p in files:
        m = re.search(r"-(\d+)\.(?:csv)$", os.path.basename(p))
        if m:
            nums.append(int(m.group(1)))
    return (max(nums) + 1) if nums else 1

In [None]:
def run_structured_extraction():
    all_sidecars = sorted(glob.glob(SIDECARE_GLOB, recursive=True))
    buckets = defaultdict(list)
    for sc in all_sidecars:
        key = detect_bucket(sc)
        if (ERA_INCLUDE is None) or (key in ERA_INCLUDE):
            buckets[key].append(sc)

    print("Discovered eras:")
    for era, lst in buckets.items():
        print(f"  {era:<28} {len(lst):>6} sidecars")

    for era, sidecars in buckets.items():
        print(f"\nERA: {era} — {len(sidecars)} sidecars")
        paths = out_paths_for_era(era)

        if not any(has_extractable_text(sc) for sc in sidecars[:min(200, len(sidecars))]):
            print("[skip] This era appears to have no extractable text.")
            continue

        done = read_ledger(paths["ledger"])
        todo = [sc for sc in sidecars if sc not in done]
        print(f"  {len(done)} already processed; {len(todo)} to do.")

        cooc_counter = Counter()

        docs_seq = next_seq_id(paths["docs_glob"])
        ents_seq = next_seq_id(paths["ents_glob"])
        kw_seq   = next_seq_id(paths["kw_glob"])
        dates_seq= next_seq_id(paths["dates_glob"])

        for i in tqdm(range(0, len(todo), DOCS_PER_CHUNK), desc=f"{era} batches (resumable)"):
            batch = todo[i:i+DOCS_PER_CHUNK]

            texts, metas, names = [], [], []
            for sc in batch:
                t, m = iter_sidecar_text(sc)
                if t:
                    texts.append(t); metas.append(m); names.append(sc)

            if not texts:
                continue

            docs_rows, ents_rows, kw_rows, date_rows, ledger_rows = [], [], [], [], []
            for doc, meta, scp in zip(nlp.pipe(texts, batch_size=NLP_BATCH_SIZE, n_process=NLP_WORKERS),
                                      metas, names):
                try:
                    ents, sent, kws, lex = doc_features(doc)

                    token_count = len(re.findall(r"\b\w+\b", doc.text))
                    lex_hits_total = sum(lex.values())

                    docs_rows.append({
                        "era": era,
                        "sidecar": scp,
                        "source": meta.get("source"),
                        "file_kind": meta.get("file_kind"),
                        "chars": len(doc.text),
                        "sentiment": sent,
                        "tokens": token_count,
                        "lex_hits": lex_hits_total
                    })
                    for text,label in ents:
                        ents_rows.append({"era":era,"sidecar":scp,"label":label,"text":text})
                    for k in kws:
                        kw_rows.append({"era":era,"sidecar":scp,"keyword":k})
                    for d in parse_dates_free(doc.text):
                        date_rows.append({"era":era,"sidecar":scp,"date":d})

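                    #each pair (a, b) with a < b co-occurs lex[a] * lex[b] times; this gives the same
                    #counts as expanding every hit and looping over all pairs, without the quadratic cost
                    lex_words = sorted(lex)
                    for ai, a in enumerate(lex_words):
                        for b in lex_words[ai+1:]:
                            cooc_counter[(a,b)] += lex[a] * lex[b]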

                    ledger_rows.append([scp, "done", len(doc.text), token_count, lex_hits_total, time.strftime("%Y-%m-%d %H:%M:%S")])

                except Exception as e:
                    ledger_rows.append([scp, f"error:{str(e)[:60]}", 0, 0, 0, time.strftime("%Y-%m-%d %H:%M:%S")])
                    continue

            base = paths["base"]

            if docs_rows:
                pd.DataFrame(docs_rows).to_csv(os.path.join(base, f"docs-{docs_seq:04d}.csv"), index=False)
                docs_seq += 1
            if ents_rows:
                pd.DataFrame(ents_rows).to_csv(os.path.join(base, f"entities-{ents_seq:04d}.csv"), index=False)
                ents_seq += 1
            if kw_rows:
                pd.DataFrame(kw_rows).to_csv(os.path.join(base, f"keywords-{kw_seq:04d}.csv"), index=False)
                kw_seq += 1
            if date_rows:
                pd.DataFrame(date_rows).to_csv(os.path.join(base, f"dates-{dates_seq:04d}.csv"), index=False)
                dates_seq += 1

            if ledger_rows:
                append_ledger_rows(paths["ledger"], ledger_rows)

            #freeing memory
            del docs_rows, ents_rows, kw_rows, date_rows, ledger_rows
            gc.collect()

        #writing the co-occurrence summary
        if cooc_counter:
            pairs, counts = zip(*cooc_counter.items())
            a = [p[0] for p in pairs]; b = [p[1] for p in pairs]
            cooc_df = pd.DataFrame({"w1":a,"w2":b,"count":counts})
            cooc_df = cooc_df.sort_values("count", ascending=False)
            cooc_df.to_csv(paths["cooc_file"], index=False)

    print("Structured outputs under:", OUT_ROOT)
    print("Figures under:", FIG_ROOT)

def _reservoir_sample_csv(files, sample_target, usecols=None):
    """Uniformly sample up to sample_target rows across CSV files (reservoir sampling)."""
    reservoir, n = [], 0
    for p in files:
        try:
            for chunk in pd.read_csv(p, chunksize=20_000, usecols=usecols):
                for _, row in chunk.iterrows():
                    n += 1
                    if len(reservoir) < sample_target:
                        reservoir.append(row)
                    else:
                        #replace a random slot with probability sample_target / n
                        j = random.randint(0, n-1)
                        if j < sample_target:
                            reservoir[j] = row
        except Exception:
            #skip unreadable shards
            continue
    if not reservoir:
        return pd.DataFrame(columns=usecols if usecols else None)
    return pd.DataFrame(reservoir)

run_structured_extraction()


Discovered eras:
  JFK::Agency_Releases            353 sidecars
  JFK::2017_2023_Releases          19 sidecars
  JFK::2025_Releases             2566 sidecars
  JFK::Pre_2017_Releases            3 sidecars
  RFK::Agency_Releases              3 sidecars
  RFK::2025_Releases             1982 sidecars
  REF::Reference_Reports            3 sidecars

=== ERA: JFK::Agency_Releases — 353 sidecars ===
  353 already processed; 0 to do.


JFK::Agency_Releases batches (resumable): 0it [00:00, ?it/s]


=== ERA: JFK::2017_2023_Releases — 19 sidecars ===
  19 already processed; 0 to do.


JFK::2017_2023_Releases batches (resumable): 0it [00:00, ?it/s]


=== ERA: JFK::2025_Releases — 2566 sidecars ===
  [skip] This era appears to have no extractable text (likely FAST_MODE notice-only).

=== ERA: JFK::Pre_2017_Releases — 3 sidecars ===
  3 already processed; 0 to do.


JFK::Pre_2017_Releases batches (resumable): 0it [00:00, ?it/s]


=== ERA: RFK::Agency_Releases — 3 sidecars ===
  3 already processed; 0 to do.


RFK::Agency_Releases batches (resumable): 0it [00:00, ?it/s]


=== ERA: RFK::2025_Releases — 1982 sidecars ===
  1855 already processed; 127 to do.


RFK::2025_Releases batches (resumable):   0%|          | 0/1 [00:00<?, ?it/s]


=== ERA: REF::Reference_Reports — 3 sidecars ===
  3 already processed; 0 to do.


REF::Reference_Reports batches (resumable): 0it [00:00, ?it/s]


All done. Structured outputs under: /content/drive/MyDrive/JFK_RFK_Project/Structured
Figures under: /content/drive/MyDrive/JFK_RFK_Project/Structured/figs
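
Since each era keeps a `ledger.csv` with one row per processed sidecar, a quick way to check a run is to count the rows per status. The following is a minimal sketch that assumes the ledger header written by `append_ledger_rows` (`sidecar,status,chars,tokens,lex_hits,processed_at`). `ledger_status_counts` is a hypothetical helper, and the two demo rows are made up.

```python
import csv
import io
from collections import Counter

def ledger_status_counts(fh):
    """Count ledger rows per status; all 'error:...' statuses are grouped as 'error'."""
    counts = Counter()
    for row in csv.DictReader(fh):
        status = (row.get("status") or "").strip()
        counts["error" if status.startswith("error:") else status] += 1
    return counts

# Made-up ledger content with the same header as the real ledger
demo = io.StringIO(
    "sidecar,status,chars,tokens,lex_hits,processed_at\n"
    "a.pdf.jsonl,done,1200,210,3,2025-08-17 01:24:08\n"
    "b.pdf.jsonl,error:example failure,0,0,0,2025-08-17 01:24:09\n"
)
counts = ledger_status_counts(demo)
print(dict(counts))
```

On a real ledger, open the file with `newline=""` and `encoding="utf-8"` and pass the handle in. Rows with an error status are still listed in the ledger, so `read_ledger` treats them as processed and they are not retried on the next run.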


## Section 5: Analysis ##

In [13]:
#pip installs
!pip -q install wordcloud


In [14]:
#imports

import os, re, glob, math, itertools, warnings
from collections import Counter, defaultdict
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud


In [15]:
#constants
OUT_ROOT = "/content/drive/MyDrive/JFK_RFK_Project/Structured"
FIGS_COMPARE = os.path.join(OUT_ROOT, "figs_compare")
os.makedirs(FIGS_COMPARE, exist_ok=True)
plt.rcParams.update({"figure.dpi": 150})

In [16]:
#helper functions
def parse_host(era_folder_name: str) -> str:
    return "Agency" if "Agency_Releases" in era_folder_name else "NARA"

def era_label(era_folder_name: str) -> str:
    return (era_folder_name
            .replace("Pre_2017_Releases","Pre-2017")
            .replace("Pre_2025_Releases","Pre-2025")
            .replace("2017_2023_Releases","2017–2023")
            .replace("2025_Releases","2025")
            .replace("_"," "))

def list_era_dirs(root=OUT_ROOT):
    """
    Return a list of dicts (per <COL>/<ERA> folder) with available file-family paths.
    Detects entities/keywords/dates/docs as:
      - CSV shards:    <family>-*.csv
      - Single CSV:    <family>.csv
      - Parquet:       <family>.parquet
    """
    rows = []
    for col in ("JFK","RFK","REF"):
        base = os.path.join(root, col)
        if not os.path.exists(base):
            continue
        for era_folder in sorted(os.listdir(base)):
            edir = os.path.join(base, era_folder)
            if not os.path.isdir(edir):
                continue
            fam = {}
            for name in ("entities","keywords","dates","docs"):
                paths = []
                paths += sorted(glob.glob(os.path.join(edir, f"{name}-*.csv")))
                p_csv = os.path.join(edir, f"{name}.csv")
                if os.path.exists(p_csv): paths.append(p_csv)
                p_parq = os.path.join(edir, f"{name}.parquet")
                if os.path.exists(p_parq): paths.append(p_parq)
                fam[name] = paths
            if any(fam.values()):
                rows.append({
                    "collection": os.path.basename(base),
                    "era_folder": era_folder,
                    "era_label":  era_label(era_folder),
                    "host":       parse_host(era_folder),
                    "edir":       edir,
                    "families":   fam
                })
    return rows

INV = list_era_dirs()
if not INV:
    raise SystemExit("No structured outputs found under Structured/. Run Section 4 first.")

print("Structured inventory:")
for r in INV:
    f = r["families"]
    print(f"  {r['collection']:<3} | {r['era_label']:<12} | {r['host']:<6} "
          f"| ents:{int(bool(f['entities']))} kw:{int(bool(f['keywords']))} dates:{int(bool(f['dates']))} docs:{int(bool(f['docs']))}")

def _iter_csv_chunks(paths, usecols=None, chunksize=150_000):
    for p in paths:
        if not p.endswith(".csv"):
            continue
        try:
            for chunk in pd.read_csv(p, usecols=usecols, chunksize=chunksize):
                yield chunk
        except Exception:
            for chunk in pd.read_csv(p, chunksize=chunksize):
                if usecols:
                    cols = [c for c in usecols if c in chunk.columns]
                    chunk = chunk[cols]
                yield chunk

def _iter_parquet_chunks(paths, columns=None, batch_rows=200_000):
    try:
        import pyarrow.dataset as ds
        for p in paths:
            if not p.endswith(".parquet"):
                continue
            try:
                dataset = ds.dataset(p, format="parquet")
                scanner = dataset.scan(columns=columns, batch_size=batch_rows)
                for rec in scanner.to_batches():
                    yield rec.to_pandas()
            except Exception:
                try:
                    df = pd.read_parquet(p, columns=columns)
                    yield df
                except Exception:
                    continue
    except Exception:
        for p in paths:
            if p.endswith(".parquet"):
                try:
                    df = pd.read_parquet(p, columns=columns)
                    yield df
                except Exception:
                    continue

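def iter_family(edir, family, usecols=None, chunksize=150_000):
    """Yield DataFrame chunks for one file family; CSV (shards + single file) wins over Parquet."""
    paths = sorted(glob.glob(os.path.join(edir, f"{family}-*.csv")))
    csv_single = os.path.join(edir, f"{family}.csv")
    if os.path.exists(csv_single): paths.append(csv_single)
    if paths:
        yield from _iter_csv_chunks(paths, usecols=usecols, chunksize=chunksize)
        return
    # Fall back to Parquet only when no CSV exists, so the same rows are never read twice.
    p_parq = os.path.join(edir, f"{family}.parquet")
    if os.path.exists(p_parq):
        cols = list(usecols) if usecols else None  # the Parquet readers expect a list, not a tuple
        yield from _iter_parquet_chunks([p_parq], columns=cols)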

def normalize_entity(s: str) -> str:
    if not isinstance(s, str): return ""
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"^[\"'()]+|[\"'(),.:;]+$", "", s)
    s = re.sub(r"^(Mr|Mrs|Ms|Dr|Sen|Rep|Pres|President)\.?\s+", "", s, flags=re.I)
    return s

def _count_terms_entities(edir, label_pool=("PERSON","ORG","GPE"), chunksize=150_000):
    c = Counter()
    for ch in iter_family(edir, "entities", usecols=("label","text"), chunksize=chunksize):
        ch = ch[ch["label"].isin(label_pool)]
        if ch.empty: continue
        names = ch["text"].astype(str).map(normalize_entity)
        names = names[names.str.len() > 1]
        c.update(names.value_counts().to_dict())
    return c

def _count_terms_keywords(edir, chunksize=150_000):
    c = Counter()
    for ch in iter_family(edir, "keywords", usecols=("keyword",), chunksize=chunksize):
        kw = ch["keyword"].astype(str).str.strip()
        kw = kw[kw.str.len() > 1]
        c.update(kw.value_counts().to_dict())
    return c

def _choose_mode(rows):
    """Prefer entities if present across *all* rows; else keywords if present; else None."""
    if rows and all(bool(r["families"]["entities"]) for r in rows): return "entities"
    if rows and all(bool(r["families"]["keywords"]) for r in rows): return "keywords"
    return None

def _counts_for_rows(rows, mode="entities", label_pool=("PERSON","ORG","GPE")):
    total = Counter()
    for r in rows:
        if mode == "entities":
            total.update(_count_terms_entities(r["edir"], label_pool=label_pool))
        else:
            total.update(_count_terms_keywords(r["edir"]))
    return total

Structured inventory:
  JFK | 2017–2023    | NARA   | ents:1 kw:1 dates:1 docs:1
  JFK | 2025         | NARA   | ents:1 kw:1 dates:1 docs:1
  JFK | Agency Releases | Agency | ents:1 kw:1 dates:1 docs:1
  JFK | Pre-2017     | NARA   | ents:1 kw:1 dates:1 docs:1
  RFK | 2025         | NARA   | ents:1 kw:1 dates:1 docs:1
  RFK | Agency Releases | Agency | ents:1 kw:1 dates:1 docs:1
  REF | Reference Reports | NARA   | ents:1 kw:1 dates:1 docs:1
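
To show what the normalization helper does to raw NER strings, here is a small self-contained sketch. It copies the three regex steps from `normalize_entity` above; the sample mentions are made up for illustration and do not come from the released files.

```python
import re
from collections import Counter

def normalize_entity(s):
    # Same three steps as the helper above: collapse whitespace,
    # strip surrounding quotes/punctuation, drop a leading honorific.
    if not isinstance(s, str):
        return ""
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"^[\"'()]+|[\"'(),.:;]+$", "", s)
    s = re.sub(r"^(Mr|Mrs|Ms|Dr|Sen|Rep|Pres|President)\.?\s+", "", s, flags=re.I)
    return s

# Hypothetical raw mentions, roughly as an NER model might emit them.
raw = ["Mr. Oswald", "Oswald,", "(Oswald)", "President  Kennedy", "Kennedy."]
counts = Counter(normalize_entity(x) for x in raw)
print(counts)
```

Without this step the five variants would be counted as five different names, which would dilute every frequency-based figure in this section.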


In [17]:
#word-cloud grids
def make_wordcloud_grids():
    for collection in ("JFK","RFK"):
        rows = [r for r in INV if r["collection"] == collection]
        if not rows:
            continue
        mode = "keywords" if any(r["families"]["keywords"] for r in rows) else "entities"
        buckets = defaultdict(Counter)
        for r in rows:
            key = (r["era_label"], r["host"])
            if mode == "keywords" and r["families"]["keywords"]:
                buckets[key].update(_count_terms_keywords(r["edir"]))
            elif mode == "entities" and r["families"]["entities"]:
                buckets[key].update(_count_terms_entities(r["edir"]))
        if not buckets:
            print(f"[wordcloud] skip {collection}: no data.")
            continue
        keys_sorted = sorted(buckets.keys(), key=lambda t: (t[0], t[1]))
        n = len(keys_sorted); cols = min(3, n); rows_n = math.ceil(n/cols)
        fig = plt.figure(figsize=(5*cols, 4*rows_n))
        for i, key in enumerate(keys_sorted, 1):
            freqs = {k:int(v) for k,v in buckets[key].most_common(5000)}
            if not freqs: continue
            wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freqs)
            ax = fig.add_subplot(rows_n, cols, i)
            ax.imshow(wc, interpolation="bilinear"); ax.axis("off")
            ax.set_title(f"{collection} — {key[0]} ({key[1]})")
        fig.tight_layout()
        out = os.path.join(FIGS_COMPARE, f"wordclouds_{collection}.png")
        plt.savefig(out); plt.close(); print("wrote", out)


In [18]:
#Referenced-year overlays by era (per collection)
def plot_year_overlays(collection="JFK"):
    frames = []
    for r in INV:
        if r["collection"] != collection or not r["families"]["dates"]:
            continue
        yearly = Counter()
        for ch in iter_family(r["edir"], "dates", usecols=("date",), chunksize=150_000):
            if ch.empty: continue
            yrs = pd.to_datetime(ch["date"], errors="coerce").dt.year.dropna().astype(int)
            yearly.update(yrs.value_counts().to_dict())
        if not yearly: continue
        y,c = zip(*sorted(yearly.items()))
        frames.append(pd.DataFrame({"year": y, r["era_label"]: c}))
    if not frames:
        print(f"[years] no dates for {collection}")
        return
    X = frames[0]
    for f in frames[1:]:
        X = X.merge(f, on="year", how="outer")
    X = X.fillna(0).sort_values("year")
    plt.figure(figsize=(10,5))
    for c in X.columns:
        if c == "year": continue
        plt.plot(X["year"], X[c], label=c)
    plt.legend(); plt.title(f"Referenced years by era — {collection}")
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"year_overlay_{collection}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
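
The overlay relies on an outer merge so that a year mentioned in only one era still gets a row (with 0 for the other eras). A minimal sketch of that step, assuming invented per-era counts:

```python
import pandas as pd

# Hypothetical per-era counts (year -> number of date mentions).
era_counts = {
    "2017–2023": {1962: 3, 1963: 10},
    "2025": {1963: 7, 1968: 4},
}

frames = [
    pd.DataFrame({"year": list(c.keys()), era: list(c.values())})
    for era, c in era_counts.items()
]
wide = frames[0]
for f in frames[1:]:
    # Outer join keeps years that appear in only one era.
    wide = wide.merge(f, on="year", how="outer")
wide = wide.fillna(0).sort_values("year").reset_index(drop=True)
print(wide)
```

An inner merge would silently drop 1962 and 1968 here, so the plotted lines would only cover years shared by every era.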


In [19]:
#Host contrast (Agency vs NARA in 2025) — per collection
#Note: era_label() turns Agency_Releases folders into "Agency Releases", so with the inventory printed above no Agency row matches era="2025" and this function prints its skip message.
def host_contrast(collection="JFK", era="2025", label="PERSON", k=15):
    eras = [r for r in INV if r["collection"]==collection and r["era_label"]==era]
    have_agency = any(r["host"]=="Agency" for r in eras)
    have_nara   = any(r["host"]=="NARA"   for r in eras)
    if not (eras and have_agency and have_nara):
        print(f"[host] skip {collection} {era}: need Agency and NARA.")
        return
    mode = _choose_mode(eras)
    if mode is None:
        print(f"[host] skip {collection} {era}: no entities/keywords across hosts.")
        return
    counts = {"Agency": Counter(), "NARA": Counter()}; totals = {"Agency":0,"NARA":0}
    for r in eras:
        host = r["host"]
        if mode == "entities":
            cnt = Counter()
            for ch in iter_family(r["edir"], "entities", usecols=("label","text"), chunksize=150_000):
                ch = ch[ch["label"]==label] if label else ch
                if ch.empty: continue
                names = ch["text"].astype(str).map(normalize_entity)
                names = names[names.str.len() > 1]
                cnt.update(names.value_counts().to_dict())
            counts[host].update(cnt); totals[host]+=int(sum(cnt.values()))
        else:
            cnt = _count_terms_keywords(r["edir"])
            counts[host].update(cnt); totals[host]+=int(sum(cnt.values()))
    if totals["Agency"]==0 or totals["NARA"]==0:
        print(f"[host] skip {collection} {era}: empty counts.")
        return
    rows = []
    names = set(counts["Agency"]) | set(counts["NARA"])
    for n in names:
        r_ag = 1e4 * counts["Agency"].get(n,0)/max(1,totals["Agency"])
        r_na = 1e4 * counts["NARA"].get(n,0)/max(1,totals["NARA"])
        rows.append((n, r_ag - r_na))
    df = pd.DataFrame(rows, columns=["name","diff"])
    more_agency = df.sort_values("diff", ascending=False).head(k)
    more_nara   = df.sort_values("diff", ascending=True).head(k)
    fig, axes = plt.subplots(1,2, figsize=(14,6))
    axes[0].barh(list(more_agency["name"])[::-1], list(more_agency["diff"])[::-1]); axes[0].set_title(f"{collection}: Agency − NARA (per 10k), {era} — {mode}")
    axes[1].barh(list(more_nara["name"]), list(more_nara["diff"])); axes[1].invert_yaxis(); axes[1].set_title(f"{collection}: NARA − Agency (negative), {era} — {mode}")
    fig.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"host_contrast_{collection}_{label}_{era}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
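
The contrast above compares rates per 10,000 mentions rather than raw counts, because the two hosts can differ a lot in size. A minimal sketch of that normalization; the helper name `rate_diff_per_10k` and the counts are hypothetical:

```python
from collections import Counter

def rate_diff_per_10k(counts_a, counts_b):
    """Return {term: rate_a - rate_b}, with rates per 10,000 mentions."""
    total_a = max(1, sum(counts_a.values()))
    total_b = max(1, sum(counts_b.values()))
    terms = set(counts_a) | set(counts_b)
    return {
        t: 1e4 * counts_a.get(t, 0) / total_a - 1e4 * counts_b.get(t, 0) / total_b
        for t in terms
    }

# Invented counts: the second group is ten times larger than the first.
agency = Counter({"Oswald": 50, "Hoover": 50})               # 100 mentions
nara = Counter({"Oswald": 300, "Hoover": 100, "Ruby": 600})  # 1000 mentions
diff = rate_diff_per_10k(agency, nara)
for name, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<8} {d:+.0f}")
```

On raw counts "Oswald" looks six times more common in the larger group, yet per 10k mentions it is relatively more prominent in the smaller one.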


In [20]:
#PMI pair drift (NARA only): 2017–2023 -> 2025 (per collection)
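def _collect_doc_sets(r, mode="entities", label_pool=("PERSON","ORG","GPE"), max_docs=None):
    """Yield (sidecar, set_of_terms) once per sidecar.
    Limitation: a sidecar whose rows straddle two chunks only contributes the terms of its first chunk."""
    seen = set(); yielded = 0
    if mode == "entities" and r["families"]["entities"]:
        for ch in iter_family(r["edir"], "entities", usecols=("label","text","sidecar"), chunksize=120_000):
            # .copy() so the text_norm assignment below does not write into a filtered view
            ch = ch[ch["label"].isin(label_pool)].copy()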
            if ch.empty: continue
            ch["text_norm"] = ch["text"].astype(str).map(normalize_entity)
            ch = ch[ch["text_norm"].str.len()>1]
            for sid, sub in ch.groupby("sidecar"):
                if sid in seen: continue
                seen.add(sid)
                terms = set(sub["text_norm"].astype(str))
                if terms:
                    yield sid, terms; yielded+=1
                    if max_docs and yielded>=max_docs: return
    elif mode == "keywords" and r["families"]["keywords"]:
        for ch in iter_family(r["edir"], "keywords", usecols=("keyword","sidecar"), chunksize=120_000):
            if ch.empty: continue
            ch["kw"] = ch["keyword"].astype(str).str.strip()
            ch = ch[ch["kw"].str.len()>1]
            for sid, sub in ch.groupby("sidecar"):
                if sid in seen: continue
                seen.add(sid)
                terms = set(sub["kw"].astype(str))
                if terms:
                    yield sid, terms; yielded+=1
                    if max_docs and yielded>=max_docs: return

def _pmi_for_collection(collection="JFK", top_terms=300, min_pair_docs=8, label_pool=("PERSON","ORG","GPE")):
    e1723 = [r for r in INV if r["collection"]==collection and r["era_label"]=="2017–2023" and r["host"]=="NARA"]
    e2025 = [r for r in INV if r["collection"]==collection and r["era_label"]=="2025"       and r["host"]=="NARA"]
    if not e1723 or not e2025:
        print(f"[PMI] skip {collection}: need NARA 2017–2023 & 2025.")
        return None
    mode = _choose_mode(e1723+e2025)
    if mode is None:
        print(f"[PMI] skip {collection}: no entities/keywords across both eras.")
        return None

    def docfreq(rows):
        DF = Counter(); N=0
        for r in rows:
            for _, terms in _collect_doc_sets(r, mode=mode, label_pool=label_pool):
                N+=1
                for t in terms: DF[t]+=1
        return DF,N

    DF1,N1 = docfreq(e1723); DF2,N2 = docfreq(e2025)
    if N1==0 or N2==0:
        print(f"[PMI] zero docs for {collection}."); return None

    allow = sorted(set([t for t,_ in DF1.most_common(top_terms)]) | set([t for t,_ in DF2.most_common(top_terms)]))

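    def pairfreq(rows, ALLOW):
        ALLOW = set(ALLOW)  # `allow` arrives as a sorted list; a set makes the membership test O(1)
        pair=Counter(); DF=Counter(); N=0
        for r in rows:
            # pass the caller's label_pool through instead of hard-coding the default
            for _, terms in _collect_doc_sets(r, mode=mode, label_pool=label_pool):
                N+=1
                terms2 = sorted(t for t in terms if t in ALLOW)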
                for t in set(terms2): DF[t]+=1
                for i in range(len(terms2)):
                    a = terms2[i]
                    for b in terms2[i+1:]:
                        pair[(a,b)]+=1
        return pair,DF,N

    P1,DF1b,N1b = pairfreq(e1723, allow)
    P2,DF2b,N2b = pairfreq(e2025, allow)

    def pmi(pair_counts, DF, N):
        out = {}
        for (a,b), n_ab in pair_counts.items():
            if n_ab < min_pair_docs: continue
            n_a = DF.get(a,0); n_b=DF.get(b,0)
            if n_a==0 or n_b==0: continue
            out[(a,b)] = math.log2((n_ab*max(1,N))/(max(1,n_a)*max(1,n_b)))
        return out

    PMI1 = pmi(P1,DF1b,N1b); PMI2 = pmi(P2,DF2b,N2b)
    if not PMI1 and not PMI2:
        print(f"[PMI] no pairs for {collection}."); return None
    pairs = set(PMI1)|set(PMI2)
    rows=[]
    for k in pairs:
        v1 = PMI1.get(k,0.0); v2 = PMI2.get(k,0.0)
        rows.append((k[0]+" ↔ "+k[1], v2-v1))
    df = pd.DataFrame(rows, columns=["pair","delta_pmi"]).sort_values("delta_pmi", ascending=False)
    return df, mode

def plot_pmi_drift(collection="JFK", top=20):
    res = _pmi_for_collection(collection=collection, top_terms=300, min_pair_docs=8)
    if res is None: return
    df, mode = res
    up = df.head(top); down = df.tail(top).sort_values("delta_pmi")
    fig, axes = plt.subplots(1,2, figsize=(14,6))
    axes[0].barh(list(up["pair"])[::-1], list(up["delta_pmi"])[::-1]);   axes[0].set_title(f"{collection} PMI↑ (2025 − 2017–2023) — {mode}")
    axes[1].barh(list(down["pair"]), list(down["delta_pmi"])); axes[1].invert_yaxis(); axes[1].set_title(f"{collection} PMI↓ — {mode}")
    fig.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"pmi_drift_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
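
The PMI used above is computed over document co-occurrence: log2(n_ab * N / (n_a * n_b)), where N is the number of documents and n_a, n_b, n_ab are document frequencies. A small sketch with invented documents; `doc_pmi` is a hypothetical helper, not part of the notebook:

```python
import math
from collections import Counter
from itertools import combinations

def doc_pmi(docs, min_pair_docs=1):
    """PMI over document co-occurrence: log2(n_ab * N / (n_a * n_b))."""
    n_docs = len(docs)
    df = Counter()    # number of documents containing each term
    pair = Counter()  # number of documents containing both terms
    for terms in docs:
        terms = sorted(set(terms))
        df.update(terms)
        pair.update(combinations(terms, 2))
    return {
        (a, b): math.log2(n_ab * n_docs / (df[a] * df[b]))
        for (a, b), n_ab in pair.items()
        if n_ab >= min_pair_docs
    }

# Eight invented documents, each reduced to its set of entity names.
docs = [
    {"Oswald", "Ruby"}, {"Oswald", "Ruby"}, {"Oswald"}, {"Oswald"},
    {"CIA"}, {"CIA"}, {"CIA"}, {"CIA"},
]
scores = doc_pmi(docs)
print(scores)
```

Here "Ruby" always appears with "Oswald", so the pair scores log2(2 * 8 / (4 * 2)) = 1 bit. Note that `_pmi_for_collection` treats a pair that is missing in one era as PMI 0.0, so a large delta can simply mean the pair fell below `min_pair_docs` in the other era.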

In [21]:
#Log-odds keyness (Monroe): era & host contrasts
def _logodds(A_rows, B_rows, label_pool=("PERSON","ORG","GPE"), alpha0=1000):
    mode = _choose_mode(A_rows+B_rows)
    if mode is None: return None, None
    cA = _counts_for_rows(A_rows, mode, label_pool); cB = _counts_for_rows(B_rows, mode, label_pool)
    if not cA or not cB: return None, mode
    bg = cA + cB; m0 = sum(bg.values())
    if m0 == 0: return None, mode
    nA = sum(cA.values()); nB = sum(cB.values())
    rows=[]
    for t in set(bg.keys()):
        m_i = bg[t] / m0
        yA,yB = cA.get(t,0), cB.get(t,0)
        numA = yA + alpha0*m_i; denA = (nA+alpha0) - numA
        numB = yB + alpha0*m_i; denB = (nB+alpha0) - numB
        if min(numA,denA,numB,denB) <= 0: continue
        delta = math.log(numB/denB) - math.log(numA/denA)
        var   = 1.0/numA + 1.0/numB
        z     = delta / math.sqrt(var)
        rows.append((t, delta, z))
    if not rows: return None, mode
    return pd.DataFrame(rows, columns=["term","delta","z"]).sort_values("z", ascending=False), mode

def plot_logodds_era(collection="JFK", top=20):
    A = [r for r in INV if r["collection"]==collection and r["era_label"]=="2017–2023" and r["host"]=="NARA"]
    B = [r for r in INV if r["collection"]==collection and r["era_label"]=="2025"       and r["host"]=="NARA"]
    if not A or not B:
        print(f"[logodds era] skip {collection}: need NARA 2017–2023 and 2025.")
        return
    DF, mode = _logodds(A,B)
    if DF is None:
        print(f"[logodds era] skip {collection}: no data.")
        return
    up2025 = DF.head(top); up1723 = DF.tail(top).sort_values("z")
    fig, axes = plt.subplots(1,2, figsize=(14,6))
    axes[0].barh(list(up2025["term"])[::-1], list(up2025["z"])[::-1]); axes[0].set_title(f"{collection}: distinctive for 2025 (z)")
    axes[1].barh(list(up1723["term"]), list(up1723["z"])); axes[1].invert_yaxis(); axes[1].set_title(f"{collection}: distinctive for 2017–2023 (z)")
    fig.suptitle(f"Log-odds keyness (mode: {mode})"); fig.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"logodds_era_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)

def plot_logodds_host(collection="JFK", top=20):
    A = [r for r in INV if r["collection"]==collection and r["era_label"]=="2025" and r["host"]=="Agency"]
    B = [r for r in INV if r["collection"]==collection and r["era_label"]=="2025" and r["host"]=="NARA"]
    if not A or not B:
        print(f"[logodds host] skip {collection}: need 2025 Agency and NARA.")
        return
    DF, mode = _logodds(A,B)
    if DF is None:
        print(f"[logodds host] skip {collection}: no data.")
        return
    agency = DF.head(top); nara = DF.tail(top).sort_values("z")
    fig, axes = plt.subplots(1,2, figsize=(14,6))
    axes[0].barh(list(agency["term"])[::-1], list(agency["z"])[::-1]); axes[0].set_title(f"{collection}: distinctive in Agency (2025)")
    axes[1].barh(list(nara["term"]), list(nara["z"])); axes[1].invert_yaxis(); axes[1].set_title(f"{collection}: distinctive in NARA (2025)")
    fig.suptitle(f"Log-odds keyness by host (mode: {mode})"); fig.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"logodds_host_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
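
The keyness score above smooths each group's counts with a prior proportional to the pooled counts before taking the log-odds difference. A toy sketch of the same arithmetic, assuming made-up counts and a small `alpha0` so the effect is visible; `logodds_z` is a hypothetical name:

```python
import math
from collections import Counter

def logodds_z(c_a, c_b, alpha0=1000):
    """z-scored log-odds difference (B minus A) with a pooled-count prior."""
    bg = c_a + c_b
    m0 = sum(bg.values())
    n_a, n_b = sum(c_a.values()), sum(c_b.values())
    out = {}
    for t, n in bg.items():
        prior = alpha0 * n / m0
        num_a = c_a.get(t, 0) + prior; den_a = (n_a + alpha0) - num_a
        num_b = c_b.get(t, 0) + prior; den_b = (n_b + alpha0) - num_b
        if min(num_a, den_a, num_b, den_b) <= 0:
            continue  # log-odds undefined for this term
        delta = math.log(num_b / den_b) - math.log(num_a / den_a)
        out[t] = delta / math.sqrt(1.0 / num_a + 1.0 / num_b)
    return out

# Invented counts: "Cuba" is relatively more frequent in group B.
a = Counter({"Oswald": 90, "Cuba": 10})
b = Counter({"Oswald": 50, "Cuba": 50})
z = logodds_z(a, b, alpha0=10)
print(z)
```

A positive z means the term is distinctive for the second group (B), which matches how the plotting functions read the head and tail of the sorted table.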

In [22]:
#Overlap heatmap (Jaccard) of top-K terms across groups
def plot_overlap_heatmap(collection="JFK", top_k=100, use="entities"):
    groups = [r for r in INV if r["collection"]==collection]
    if not groups:
        print(f"[overlap] no groups for {collection}"); return
    mode = use if use in ("entities","keywords") else _choose_mode(groups)
    if mode is None:
        print(f"[overlap] skip {collection}: no entities/keywords."); return
    labels=[]; sets=[]
    for r in groups:
        lab = f"{r['era_label']} ({r['host']})"
        if mode == "entities" and r["families"]["entities"]:
            cnt = _count_terms_entities(r["edir"])
        elif mode == "keywords" and r["families"]["keywords"]:
            cnt = _count_terms_keywords(r["edir"])
        else:
            continue
        if not cnt: continue
        labels.append(lab); sets.append(set([t for t,_ in cnt.most_common(top_k)]))
    if len(sets)<2:
        print(f"[overlap] insufficient groups for {collection}"); return
    n=len(sets); M=np.zeros((n,n), dtype=float)
    for i in range(n):
        for j in range(n):
            inter=len(sets[i]&sets[j]); union=len(sets[i]|sets[j])
            M[i,j]= inter/max(1,union)
    plt.figure(figsize=(1.2*n+4, 1.2*n+4))
    plt.imshow(M, interpolation="nearest"); plt.xticks(range(n), labels, rotation=45, ha="right")
    plt.yticks(range(n), labels); plt.colorbar(label=f"Jaccard (top-{top_k} {mode})")
    plt.title(f"{collection}: overlap of prominent terms")
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"overlap_heatmap_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
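
Each heatmap cell is the Jaccard index of two top-K term sets: the size of the intersection divided by the size of the union. A minimal sketch with invented term sets:

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical top-term sets for three groups.
groups = {
    "2017–2023 (NARA)": {"Oswald", "Ruby", "Dallas"},
    "2025 (NARA)": {"Ruby", "Dallas", "Cuba"},
    "Agency Releases (Agency)": {"Hoover"},
}
names = list(groups)
matrix = [[jaccard(groups[i], groups[j]) for j in names] for i in names]
for name, row in zip(names, matrix):
    print(f"{name:<26}", [round(v, 2) for v in row])
```

The diagonal is always 1.0, so only the off-diagonal cells carry information when reading the heatmap.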

In [23]:
#Lexicon-density vs Sentiment scatter (needs docs with tokens/lex_hits)
def plot_lexicon_sentiment_scatter(collection="JFK", sample_per_group=5000):
    groups = [r for r in INV if r["collection"]==collection and r["families"]["docs"]]
    if not groups:
        print(f"[scatter] no docs for {collection}"); return
    plt.figure(figsize=(10,7)); legends=[]
    for r in groups:
        parts=[]
        for ch in iter_family(r["edir"], "docs", chunksize=120_000):
            if "sentiment" not in ch.columns: continue
            if "tokens" in ch.columns and "lex_hits" in ch.columns:
                ch = ch[["sentiment","tokens","lex_hits"]].copy()
                ch = ch[ch["tokens"].fillna(0) > 0]
                ch["lex_rate_10k"] = 1e4 * ch["lex_hits"].fillna(0) / ch["tokens"]
            else:
                continue
            parts.append(ch)
        if not parts:
            continue
        D = pd.concat(parts, ignore_index=True)
        if len(D) > sample_per_group: D = D.sample(sample_per_group, random_state=7)
        # label each scatter directly so the legend cannot be mis-assigned to the axvline
        plt.scatter(D["sentiment"], D["lex_rate_10k"], alpha=0.25, s=10,
                    label=f"{r['era_label']} ({r['host']})")
        legends.append(f"{r['era_label']} ({r['host']})")
    plt.axvline(0, linestyle="--", linewidth=0.8)
    plt.xlabel("VADER compound sentiment"); plt.ylabel("Lexicon hits per 10k tokens")
    plt.title(f"{collection}: sentiment vs redaction-lexicon density")
    if legends: plt.legend(fontsize=8)
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"scatter_lex_vs_sent_{collection}.png")
    plt.savefig(out); plt.close(); print("wrote", out)

In [24]:
#Era vs Era: “New in 2025” (names present in 2025 but not 2017–2023), NARA
def plot_new_in_2025(collection="JFK", top=20):
    e1723 = [r for r in INV if r["collection"]==collection and r["era_label"]=="2017–2023" and r["host"]=="NARA"]
    e2025 = [r for r in INV if r["collection"]==collection and r["era_label"]=="2025"       and r["host"]=="NARA"]
    if not e1723 or not e2025:
        print(f"[new2025] skip {collection}: need NARA 2017–2023 & 2025"); return
    mode = _choose_mode(e1723+e2025)
    if mode is None:
        print(f"[new2025] skip {collection}: no entities/keywords"); return
    c1 = _counts_for_rows(e1723, mode); c2=_counts_for_rows(e2025, mode)
    names_2025_only = [n for n in c2 if n not in c1]
    rows = sorted([(n, c2[n]) for n in names_2025_only], key=lambda x: x[1], reverse=True)[:top]
    if not rows:
        print(f"[new2025] none for {collection}"); return
    labels=[n for n,_ in rows][::-1]; vals=[v for _,v in rows][::-1]
    plt.figure(figsize=(10,6)); plt.barh(labels, vals)
    plt.title(f"{collection}: names appearing in 2025 but not 2017–2023 (NARA, mode {mode})")
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"new_in_2025_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)

In [25]:
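#Jensen–Shannon divergence heatmap across eras (NARA only)
def _jsd(p, q):
    """Jensen–Shannon divergence in bits (0 = identical, 1 = disjoint).
    The JS *distance* would be the square root of this value."""
    p = np.array(p, dtype=float); q = np.array(q, dtype=float)
    p /= p.sum() if p.sum() else 1.0
    q /= q.sum() if q.sum() else 1.0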
    m = 0.5*(p+q)
    def _kl(a,b):
        a = np.where(a==0, 1e-12, a); b=np.where(b==0, 1e-12, b)
        return np.sum(a*np.log2(a/b))
    return 0.5*_kl(p,m)+0.5*_kl(q,m)

def plot_jsd_heatmap(collection="JFK", top_k=200):
    rows = [r for r in INV if r["collection"]==collection and r["host"]=="NARA"]
    if not rows:
        print(f"[jsd] no NARA groups for {collection}"); return
    mode = _choose_mode(rows)
    if mode is None:
        print(f"[jsd] skip {collection}: no entities/keywords"); return
    vocab=set(); dists=[]; labels=[]
    tmp_counters=[]
    for r in rows:
        cnt = _count_terms_entities(r["edir"]) if mode=="entities" else _count_terms_keywords(r["edir"])
        tmp_counters.append((r["era_label"], cnt))
        vocab |= set([t for t,_ in cnt.most_common(top_k)])
    vocab = sorted(vocab)
    if not vocab:
        print(f"[jsd] empty vocab {collection}"); return
    for era, cnt in tmp_counters:
        vec = [cnt.get(t,0) for t in vocab]
        dists.append(vec); labels.append(era)
    n=len(dists); M=np.zeros((n,n), dtype=float)
    for i in range(n):
        for j in range(n):
            M[i,j]=_jsd(dists[i], dists[j])
    plt.figure(figsize=(1.2*n+4, 1.2*n+4))
    plt.imshow(M, interpolation="nearest"); plt.xticks(range(n), labels, rotation=45, ha="right")
    plt.yticks(range(n), labels); plt.colorbar(label=f"Jensen–Shannon divergence (top-{top_k} {mode})")
    plt.title(f"{collection}: distributional distance across eras (NARA)")
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"jsd_eras_{collection}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
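
For reference, here is a small sketch of the Jensen–Shannon divergence on toy count vectors. It masks zero entries instead of using the `1e-12` epsilon from `_jsd`, which is a swapped-in detail; the vectors are invented.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two count vectors."""
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0, so skip zero entries
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([1, 2, 3], [2, 4, 6])  # same shape, different scale
disjoint = js_divergence([1, 0], [0, 1])    # no shared vocabulary
print(same, disjoint, np.sqrt(disjoint))    # last value is the JS distance
```

With base-2 logs the divergence is bounded by 1, so heatmap values close to 1 mean the two eras share almost none of their prominent terms.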

In [26]:
#Bar charts: mean lexicon density per group (needs docs with tokens/lex_hits)
def plot_lexicon_rate_bars(collection="JFK"):
    groups = [r for r in INV if r["collection"]==collection and r["families"]["docs"]]
    if not groups:
        print(f"[lex-bars] no docs for {collection}"); return
    rows=[]
    for r in groups:
        rates=[]
        for ch in iter_family(r["edir"], "docs", chunksize=120_000):
            if {"tokens","lex_hits"}.issubset(ch.columns):
                ch = ch[["tokens","lex_hits"]].copy()
                ch = ch[ch["tokens"].fillna(0) > 0]
                if ch.empty: continue
                rates.extend((1e4 * ch["lex_hits"].fillna(0) / ch["tokens"]).tolist())
        if rates:
            arr = np.array(rates, dtype=float)
            m = float(np.mean(arr))
            # a single document has no sample std; report SE = 0 instead of NaN
            se = float(np.std(arr, ddof=1)/np.sqrt(len(arr))) if len(arr) > 1 else 0.0
            rows.append((f"{r['era_label']} ({r['host']})", m, se))
    if not rows:
        print(f"[lex-bars] no usable docs for {collection}"); return
    rows = sorted(rows, key=lambda x: x[0])
    labels=[r[0] for r in rows]; mean=[r[1] for r in rows]; se=[r[2] for r in rows]
    x = np.arange(len(labels))
    plt.figure(figsize=(max(8, 0.55*len(labels)+6),5))
    plt.bar(x, mean, yerr=se, capsize=3)
    plt.xticks(x, labels, rotation=30, ha="right")
    plt.ylabel("Lexicon hits per 10k tokens (mean ± SE)")
    plt.title(f"{collection}: redaction/assassination lexicon density by group")
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"lex_rate_bars_{collection}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
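
The bars above show the mean of per-document rates, with the standard error as the error bar. A quick sketch on three invented documents, which also shows how this differs from a pooled rate:

```python
import numpy as np

# Hypothetical documents: token counts and lexicon hits.
tokens = np.array([1000, 2000, 4000])
lex_hits = np.array([1, 4, 12])

rates = 1e4 * lex_hits / tokens                   # per-document hits per 10k tokens
mean = rates.mean()
se = rates.std(ddof=1) / np.sqrt(len(rates))      # standard error of the mean
pooled = 1e4 * lex_hits.sum() / tokens.sum()      # one rate over all tokens

print(rates, mean, se, pooled)
```

The unweighted mean gives every document the same weight regardless of length, while the pooled rate is dominated by long documents. Which one is preferable depends on the question being asked.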

In [27]:
#co-mentions of focal terms
def co_mentions(target="Oswald", collection="JFK", era="2025",
                label_pool=("PERSON","ORG","GPE"), top=20):
    dirs = [r for r in INV if r["collection"]==collection and r["era_label"]==era]
    if not dirs:
        print(f"[co-mentions] no dirs for {collection} {era}"); return
    use_entities = any(r["families"]["entities"] for r in dirs)
    mode = "entities" if use_entities else ("keywords" if any(r["families"]["keywords"] for r in dirs) else None)
    if mode is None:
        print(f"[co-mentions] skip {collection} {era}: no entities/keywords."); return

    target_sidecars=set()
    if mode == "entities":
        for r in dirs:
            if not r["families"]["entities"]: continue
            for ch in iter_family(r["edir"], "entities", usecols=("label","text","sidecar"), chunksize=150_000):
                ch = ch[ch["label"].isin(label_pool)]
                if ch.empty: continue
                txt = ch["text"].astype(str)
                mask = txt.str.contains(target, case=False, na=False, regex=False)  # literal match, not a regex
                if mask.any():
                    target_sidecars.update(ch.loc[mask,"sidecar"].astype(str).unique())
    else:
        for r in dirs:
            if not r["families"]["keywords"]: continue
            for ch in iter_family(r["edir"], "keywords", usecols=("keyword","sidecar"), chunksize=150_000):
                kw = ch["keyword"].astype(str)
                mask = kw.str.contains(target, case=False, na=False, regex=False)  # literal match, not a regex
                if mask.any():
                    target_sidecars.update(ch.loc[mask,"sidecar"].astype(str).unique())

    if not target_sidecars:
        print(f"[co-mentions] none found for '{target}' — {collection} {era}"); return

    counts = Counter()
    if mode == "entities":
        for r in dirs:
            if not r["families"]["entities"]: continue
            for ch in iter_family(r["edir"], "entities", usecols=("label","text","sidecar"), chunksize=150_000):
                ch = ch[ch["label"].isin(label_pool)]
                if ch.empty: continue
                ch = ch[ch["sidecar"].astype(str).isin(target_sidecars)]
                if ch.empty: continue
                names = ch["text"].astype(str).map(normalize_entity)
                names = names[~names.str.contains(target, case=False, na=False, regex=False)]
                names = names[names.str.len() > 1]
                counts.update(names.value_counts().to_dict())
    else:
        for r in dirs:
            if not r["families"]["keywords"]: continue
            for ch in iter_family(r["edir"], "keywords", usecols=("keyword","sidecar"), chunksize=150_000):
                ch = ch[ch["sidecar"].astype(str).isin(target_sidecars)]
                if ch.empty: continue
                kw = ch["keyword"].astype(str).str.strip()
                kw = kw[~kw.str.contains(target, case=False, na=False, regex=False)]
                kw = kw[kw.str.len() > 1]
                counts.update(kw.value_counts().to_dict())

    if not counts:
        print(f"[co-mentions] none found for '{target}' — {collection} {era}"); return

    top_items = counts.most_common(top)
    labels = [n for n,_ in top_items][::-1]; values = [v for _,v in top_items][::-1]
    plt.figure(figsize=(10,6)); plt.barh(labels, values)
    plt.title(f'Co-mentions with "{target}" — {collection} {era} ({mode})')
    plt.tight_layout()
    out = os.path.join(FIGS_COMPARE, f"co_mentions_{collection}_{era}_{target.replace(' ','_')}_{mode}.png")
    plt.savefig(out); plt.close(); print("wrote", out)
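
The co-mention logic is a two-pass filter: first find the sidecars that mention the target, then count every other name in those sidecars. A compact sketch on an invented entity table (column names follow the ones used above):

```python
import pandas as pd
from collections import Counter

# Hypothetical entity rows: one row per mention.
ents = pd.DataFrame({
    "sidecar": ["d1", "d1", "d2", "d2", "d3"],
    "text": ["Oswald", "Ruby", "Lee Harvey Oswald", "Dallas", "Ruby"],
})

target = "Oswald"
hit = ents["text"].str.contains(target, case=False, regex=False)
target_docs = set(ents.loc[hit, "sidecar"])  # pass 1: documents mentioning the target

# pass 2: other names inside those documents only
co = ents[ents["sidecar"].isin(target_docs) & ~hit]
counts = Counter(co["text"])
print(counts)
```

The "Ruby" mention in `d3` is not counted because that document never mentions the target. The substring match also means "Lee Harvey Oswald" counts as a hit for "Oswald".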

In [28]:
print("[1/9] Word-cloud grids")
make_wordcloud_grids()

print("[2/9] Year overlays")
plot_year_overlays("JFK"); plot_year_overlays("RFK")

print("[3/9] Host contrasts (2025)")
for col in ["JFK","RFK"]:
    for lab in ["PERSON","ORG","GPE"]:
        host_contrast(collection=col, era="2025", label=lab, k=15)

print("[4/9] PMI pair drift (NARA 2017–2023 -> 2025)")
plot_pmi_drift("JFK", top=20); plot_pmi_drift("RFK", top=20)

print("[5/9] Log-odds keyness (era)")
plot_logodds_era("JFK"); plot_logodds_era("RFK")

print("[6/9] Log-odds keyness (host, 2025)")
plot_logodds_host("JFK"); plot_logodds_host("RFK")

print("[7/9] Overlap heatmaps")
plot_overlap_heatmap("JFK"); plot_overlap_heatmap("RFK")

print("[8/9] Lexicon density-bars & scatter")
plot_lexicon_rate_bars("JFK"); plot_lexicon_rate_bars("RFK")
plot_lexicon_sentiment_scatter("JFK"); plot_lexicon_sentiment_scatter("RFK")

print("[9/9] New in 2025 (NARA)")
plot_new_in_2025("JFK", top=20); plot_new_in_2025("RFK", top=20)

# Some focal co-mentions as well
targets = ["Oswald","Sirhan","Mossad","KGB","CIA","FBI","Kennedy","Cuba"]
for col in ["JFK","RFK"]:
    for t in targets:
        co_mentions(target=t, collection=col, era="2025", top=20)

print("All figures written to:", FIGS_COMPARE)


[1/9] Word-cloud grids
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/wordclouds_JFK.png
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/wordclouds_RFK.png
[2/9] Year overlays


  yrs = pd.to_datetime(ch["date"], errors="coerce").dt.year.dropna().astype(int)


wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/year_overlay_JFK.png


  yrs = pd.to_datetime(ch["date"], errors="coerce").dt.year.dropna().astype(int)

wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/year_overlay_RFK.png
[3/9] Host contrasts (2025)
[host] skip JFK 2025: need Agency and NARA.
[host] skip JFK 2025: need Agency and NARA.
[host] skip JFK 2025: need Agency and NARA.
[host] skip RFK 2025: need Agency and NARA.
[host] skip RFK 2025: need Agency and NARA.
[host] skip RFK 2025: need Agency and NARA.
[4/9] PMI pair drift (NARA 2017–2023 -> 2025)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ch["text_norm"] = ch["text"].astype(str).map(normalize_entity)

wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/pmi_drift_JFK_entities.png
[PMI] skip RFK: need NARA 2017–2023 & 2025.
[5/9] Log-odds keyness (era)
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/logodds_era_JFK_entities.png
[logodds era] skip RFK: need NARA 2017–2023 and 2025.
[6/9] Log-odds keyness (host, 2025)
[logodds host] skip JFK: need 2025 Agency and NARA.
[logodds host] skip RFK: need 2025 Agency and NARA.
[7/9] Overlap heatmaps
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/overlap_heatmap_JFK_entities.png
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/overlap_heatmap_RFK_entities.png
[8/9] Lexicon density-bars & scatter
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/lex_rate_bars_JFK.png
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/lex_rate_bars_RFK.png
wrote /content/drive/MyDrive/JFK_RFK_Project/Structured/figs_compare/scatter_lex_vs_sent_JFK.