## Config + Paths

## Data access (for class/demo)

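- **Sample WET file location:** the notebook checks each path in `CANDIDATE_WET_PATHS` for `test_compression.warc.wet` and uses the first one that exists. As configured below, only one path is active:
  - `data-v2/test_compression.warc.wet` (active)
  - `data/test_compression.warc.wet` (commented out)
  - `test_compression.warc.wet` (commented out)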

- **Generated artifacts (written to `./data-v2/`):**
  - `wet_raw_extracted.jsonl` (raw text extracted from conversion records)
  - `wet_cleaned_filtered.jsonl` (cleaned, English-only, quality-filtered)
  - `wet_deduped.jsonl` (exact duplicates removed)
  - `wet_report.json` (summary stats)


In [1]:
from pathlib import Path

# Input WET
# Place your sample WET file at the active path below, or uncomment one of the
# alternative locations. The first path that exists is used.

CANDIDATE_WET_PATHS = [
    Path("data-v2/test_compression.warc.wet"),
    # Path("data/test_compression.warc.wet"),
    # Path("test_compression.warc.wet"),
]

WET_PATH = next((p for p in CANDIDATE_WET_PATHS if p.exists()), None)
assert WET_PATH is not None, f"Missing WET file. Tried: {CANDIDATE_WET_PATHS}"
print("✅ Using WET file:", WET_PATH)

# Output directory
OUT_DIR = Path("data-v2")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Output files
RAW_EXTRACT_JSONL = OUT_DIR / "wet_raw_extracted.jsonl"
CLEANED_JSONL     = OUT_DIR / "wet_cleaned_filtered.jsonl"
DEDUPED_JSONL     = OUT_DIR / "wet_deduped.jsonl"
REPORT_JSON       = OUT_DIR / "wet_report.json"

# Thresholds
MIN_CHARS = 300
MAX_NONASCII_RATIO = 0.25
MAX_DIGIT_RATIO    = 0.30
MAX_PUNCT_RATIO    = 0.35
MIN_STOPWORD_RATIO = 0.05
MAX_REPEAT_LINE_RATIO = 0.30

RANDOM_SEED = 42

✅ Using WET file: data-v2/test_compression.warc.wet


## Install + imports

In [2]:
!pip -q install warcio tqdm langid nltk numpy scikit-learn datasketch

In [3]:
import json, re, hashlib, random
from collections import Counter, defaultdict
from urllib.parse import urlparse

import numpy as np
from tqdm import tqdm
from warcio.archiveiterator import ArchiveIterator
import langid

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

random.seed(RANDOM_SEED)

## Download NLTK resources

In [4]:
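nltk.download("punkt")
# Assumption: recent NLTK releases (3.9+) load tokenizer data from "punkt_tab".
# If word_tokenize raises a LookupError, also run: nltk.download("punkt_tab")
nltk.download("stopwords")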

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ostwalaman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ostwalaman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## WET sanity scan (first N records)

In [5]:
WS_RE = re.compile(r"\s+")

def normalize_ws(text: str) -> str:
    return WS_RE.sub(" ", text).strip()

def preview_wet(wet_path, max_records=60, max_preview_chars=220):
    type_counts = Counter()
    samples = []

    seen = 0
    with open(wet_path, "rb") as stream:
        for rec in ArchiveIterator(stream):
            seen += 1
            type_counts[rec.rec_type] += 1

            if rec.rec_type == "conversion":
                url = rec.rec_headers.get_header("WARC-Target-URI") or ""
                raw = rec.content_stream().read()
                text = raw.decode("utf-8", errors="ignore")
                prev = normalize_ws(text[:max_preview_chars])
                samples.append({"url": url, "chars": len(text), "preview": prev})

            if seen >= max_records:
                break

    print(f"✅ Scanned first {seen} records")
    print("Record types:", dict(type_counts))
    print("\n--- conversion samples (up to 5) ---")
    for i, s in enumerate(samples[:5], 1):
        print(f"\n[{i}] chars={s['chars']}")
        print("URL:", s["url"])
        print("Preview:", s["preview"])

preview_wet(WET_PATH)

✅ Scanned first 60 records
Record types: {'warcinfo': 1, 'conversion': 59}

--- conversion samples (up to 5) ---

[1] chars=16478
URL: http://000af36.netsolhost.com/wordpress1/2004/09/page/2/
Preview: September | 2004 | Bob Griendling | Page 2 Menu About Home Experience Op-eds Bio Blog Top Monthly Archives: September 2004 Fantasyland Date: September 27, 2004 Author: Bob Griendling Categories: Uncategorized Funny thing

[2] chars=2087
URL: http://0400425.netsolhost.com/beiseker/calendar-2/action~month/exact_date~1672556400/request_format~html/
Preview: Calendar | Village of Beiseker | Page 0400425.netsolhost.com|beiseker|calendar-2|action~month|exact_date~1672556400|request_format~html| Village of Beiseker Crossroads to the Future Search Main menu Skip to primary conte

[3] chars=4542
URL: http://055-237-0928.com/css/0pgxv9khyq5k0ar63xv97/index.html
Preview: 춘천출장만남 최신뉴스▶ 출장샵,출장마사지,출장안마 [새책]종화동안마,익산여대생출장 [새책]비천동안마,서랑동안마 대덕소개팅,웅진동안마 '지하철에서 출장30대소개팅 위험.jpg,성북출장만남 출장대행 콜걸샾 오피콜걸 여대생' 창원성인출장마

## Extract conversion records → wet_raw_extracted.jsonl

In [6]:
def host_from_url(url: str) -> str:
    try:
        return urlparse(url).netloc.lower()
    except Exception:
        return ""

def extract_wet_to_jsonl(wet_path: Path, out_jsonl: Path, min_chars: int = 300):
    total = 0
    conv = 0
    kept = 0
    dropped_short = 0

    with open(wet_path, "rb") as stream, open(out_jsonl, "w", encoding="utf-8") as out:
        for rec in tqdm(ArchiveIterator(stream), desc="Extracting WET"):
            total += 1
            if rec.rec_type != "conversion":
                continue

            conv += 1
            url = rec.rec_headers.get_header("WARC-Target-URI") or ""
            raw = rec.content_stream().read()
            text = raw.decode("utf-8", errors="ignore")
            text = normalize_ws(text)

            if len(text) < min_chars:
                dropped_short += 1
                continue

            obj = {"source": "wet", "url": url, "host": host_from_url(url), "content": text}
            out.write(json.dumps(obj, ensure_ascii=False) + "\n")
            kept += 1

    print("\n✅ Extraction complete")
    print("Total records      :", total)
    print("Conversion records :", conv)
    print("Kept docs          :", kept)
    print("Dropped too short  :", dropped_short)
    print("Saved to           :", out_jsonl)

extract_wet_to_jsonl(WET_PATH, RAW_EXTRACT_JSONL, min_chars=MIN_CHARS)

Extracting WET: 34318it [00:09, 3795.36it/s]


✅ Extraction complete
Total records      : 34318
Conversion records : 34317
Kept docs          : 32879
Dropped too short  : 1438
Saved to           : data-v2/wet_raw_extracted.jsonl





## JSONL iterator + basic reporting

In [7]:
def iter_jsonl(path: Path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def summarize_numeric(values):
    arr = np.array(values, dtype=float) if values else np.array([], dtype=float)
    if arr.size == 0:
        return {"n": 0}
    return {
        "n": int(arr.size),
        "mean": float(arr.mean()),
        "p50": float(np.median(arr)),
        "p90": float(np.quantile(arr, 0.9)),
        "max": float(arr.max())
    }

def analyze_raw(jsonl_path: Path, top_k_hosts=20):
    lengths = []
    hosts = Counter()
    for obj in iter_jsonl(jsonl_path):
        t = obj.get("content", "")
        lengths.append(len(t))
        hosts[obj.get("host", "")] += 1

    total = sum(hosts.values()) if hosts else 0
    return {
        "docs": len(lengths),
        "length_chars": summarize_numeric(lengths),
        "top_hosts": hosts.most_common(top_k_hosts),
        "host_concentration": {
            "top1_share": (hosts.most_common(1)[0][1] / total) if total else 0.0,
            "top5_share": (sum(c for _, c in hosts.most_common(5)) / total) if total else 0.0,
            "top20_share": (sum(c for _, c in hosts.most_common(20)) / total) if total else 0.0,
        }
    }

raw_report = analyze_raw(RAW_EXTRACT_JSONL)
raw_report

{'docs': 32879,
 'length_chars': {'n': 32879,
  'mean': 7480.538246297028,
  'p50': 3807.0,
  'p90': 13459.600000000002,
  'max': 877216.0},
 'top_hosts': [('cdha.cuny.edu', 14),
  ('courseware.zcu.cz', 13),
  ('turbotax.intuit.com', 10),
  ('diecezja.pl', 9),
  ('alcoholpolicy.niaaa.nih.gov', 8),
  ('businessfig.com', 8),
  ('www.besport.com', 8),
  ('www.library.univ.kiev.ua', 7),
  ('yscholarhub.yonsei.ac.kr', 6),
  ('headquarters.s4.xrea.com', 5),
  ('viavca.in2p3.fr', 5),
  ('5ka-sale.ru', 5),
  ('b-port.com', 5),
  ('bryansk.news', 5),
  ('ca.news.yahoo.com', 5),
  ('andpremium.jp', 4),
  ('arquivo.cienciaviva.pt', 4),
  ('art.ceskatelevize.cz', 4),
  ('burbujasweb.com', 4),
  ('cleanindiajournal.com', 4)],
 'host_concentration': {'top1_share': 0.0004258037044922291,
  'top5_share': 0.0016423857173271693,
  'top20_share': 0.004045135192676176}}

## Cleaning (line-based boilerplate removal)

In [8]:
NAV_LINE_RE = re.compile(r"^(menu|home|about|contact|search|skip to|privacy|terms|login|sign in)$", re.I)

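def clean_doc_text(text: str):
    """Drop short/navigation lines and lines repeated 3+ times.

    Returns (cleaned_text, repeat_line_ratio).

    NOTE: the rules below work on newline-separated lines, but
    extract_wet_to_jsonl() already collapses all whitespace (newlines included)
    via normalize_ws(). Content read back from wet_raw_extracted.jsonl therefore
    arrives as a single line, so the per-line rules have little effect on it.
    """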
    raw_lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in raw_lines if ln]

    if not lines:
        return "", 1.0

    filtered = []
    for ln in lines:
        low = ln.lower().strip()
        if len(low) <= 2:
            continue
        if len(low) <= 25 and NAV_LINE_RE.match(low):
            continue
        filtered.append(ln)

    if not filtered:
        return "", 1.0

    cnt = Counter([ln.lower() for ln in filtered])
    kept = [ln for ln in filtered if cnt[ln.lower()] < 3]

    repeat_ratio = 1.0 - (len(set([ln.lower() for ln in kept])) / max(1, len(kept)))
    cleaned = normalize_ws("\n".join(kept))
    return cleaned, float(repeat_ratio)

## NLTK-based tokenization + stopword ratio + quality metrics

In [9]:
EN_STOPWORDS = set(stopwords.words("english"))

DIGIT_RE = re.compile(r"\d")
PUNCT_RE = re.compile(r"[^\w\s]")

def nltk_tokens(text: str):
    # NLTK word tokenization
    # Note: word_tokenize needs punkt.
    return [t.lower() for t in word_tokenize(text)]

def stopword_ratio_nltk(text: str):
    toks = nltk_tokens(text)
    # Keep alphabetic tokens only for stopword ratio stability
    words = [t for t in toks if t.isalpha()]
    if not words:
        return 0.0
    sw = sum(1 for w in words if w in EN_STOPWORDS)
    return sw / len(words)

def basic_ratios(text: str):
    if not text:
        return {"nonascii": 1.0, "digit": 1.0, "punct": 1.0}
    n = len(text)
    nonascii = sum(1 for c in text if ord(c) > 127) / n
    digit = len(DIGIT_RE.findall(text)) / n
    punct = len(PUNCT_RE.findall(text)) / n
    return {"nonascii": nonascii, "digit": digit, "punct": punct}

def is_english_langid(text: str):
    lang, score = langid.classify(text[:5000])
    return (lang == "en"), {"lang": lang, "lang_score": float(score)}

def quality_pass(text: str, repeat_line_ratio: float):
    if len(text) < MIN_CHARS:
        return False, {"fail": "too_short"}

    r = basic_ratios(text)
    swr = stopword_ratio_nltk(text)

    if r["nonascii"] > MAX_NONASCII_RATIO:
        return False, {"fail":"nonascii", **r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}
    if r["digit"] > MAX_DIGIT_RATIO:
        return False, {"fail":"digit", **r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}
    if r["punct"] > MAX_PUNCT_RATIO:
        return False, {"fail":"punct", **r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}
    if swr < MIN_STOPWORD_RATIO:
        return False, {"fail":"low_stopwords", **r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}
    if repeat_line_ratio > MAX_REPEAT_LINE_RATIO:
        return False, {"fail":"too_repetitive", **r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}

    return True, {**r, "stopword_ratio": swr, "repeat_line_ratio": repeat_line_ratio}

## Clean + English + quality filter → wet_cleaned_filtered.jsonl

In [10]:
from collections import Counter

def filter_and_clean(in_jsonl: Path, out_jsonl: Path):
    stats = Counter()

    with open(out_jsonl, "w", encoding="utf-8") as out:
        for obj in tqdm(iter_jsonl(in_jsonl), desc="Cleaning+Filtering"):
            content = obj.get("content", "")
            if not content:
                stats["drop_empty"] += 1
                continue

            cleaned, rep = clean_doc_text(content)
            if not cleaned:
                stats["drop_clean_empty"] += 1
                continue

            ok_lang, lang_meta = is_english_langid(cleaned)
            if not ok_lang:
                stats["drop_non_en"] += 1
                continue

            ok_q, q_meta = quality_pass(cleaned, rep)
            if not ok_q:
                stats[f"drop_quality_{q_meta.get('fail','unknown')}"] += 1
                continue

            out_obj = {
                "url": obj.get("url", ""),
                "host": obj.get("host", ""),
                "source": obj.get("source", "wet"),
                "content": cleaned,
                "meta": {"lang": lang_meta, "quality": q_meta}
            }
            out.write(json.dumps(out_obj, ensure_ascii=False) + "\n")
            stats["kept"] += 1

    return stats

filter_stats = filter_and_clean(RAW_EXTRACT_JSONL, CLEANED_JSONL)
filter_stats

Cleaning+Filtering: 32879it [03:08, 173.99it/s]


Counter({'drop_non_en': 19166,
         'kept': 13436,
         'drop_quality_low_stopwords': 223,
         'drop_quality_digit': 51,
         'drop_quality_nonascii': 3})

## Exact dedup → wet_deduped.jsonl

In [11]:
def content_hash(text: str) -> str:
    t = normalize_ws(text)
    return hashlib.md5(t.encode("utf-8", errors="ignore")).hexdigest()

def exact_dedup(in_jsonl: Path, out_jsonl: Path):
    seen = set()
    stats = Counter()

    with open(out_jsonl, "w", encoding="utf-8") as out:
        for obj in tqdm(iter_jsonl(in_jsonl), desc="Exact dedup"):
            t = obj.get("content", "")
            h = content_hash(t)
            if h in seen:
                stats["drop_exact_dup"] += 1
                continue
            seen.add(h)
            out.write(json.dumps(obj, ensure_ascii=False) + "\n")
            stats["kept"] += 1

    stats["unique_hashes"] = len(seen)
    return stats

dedup_stats = exact_dedup(CLEANED_JSONL, DEDUPED_JSONL)
dedup_stats

Exact dedup: 13436it [00:03, 3910.88it/s]


Counter({'kept': 13406, 'unique_hashes': 13406, 'drop_exact_dup': 30})

## Language distribution sanity check

In [12]:
def lang_distribution(jsonl_path: Path, sample_n=20000, seed=42):
    cnt = Counter()
    rows = list(iter_jsonl(jsonl_path))
    random.Random(seed).shuffle(rows)
    rows = rows[:min(sample_n, len(rows))]

    for obj in tqdm(rows, desc="LangID dist"):
        text = obj.get("content", "")
        if not text:
            continue
        lang, score = langid.classify(text[:5000])
        cnt[lang] += 1

    print("Total classified:", sum(cnt.values()))
    print("Top 15 languages:", cnt.most_common(15))
    return cnt

lang_cnt = lang_distribution(RAW_EXTRACT_JSONL, sample_n=20000)

LangID dist: 100%|██████████| 20000/20000 [01:11<00:00, 279.66it/s]


Total classified: 20000
Top 15 languages: [('en', 8266), ('ru', 1281), ('de', 1140), ('es', 1057), ('ja', 1037), ('fr', 992), ('zh', 902), ('it', 624), ('pt', 444), ('nl', 423), ('pl', 409), ('la', 292), ('cs', 253), ('id', 252), ('vi', 244)]


## Save a report JSON

In [13]:
post_report = analyze_raw(DEDUPED_JSONL)

final_report = {
    "input_wet": str(WET_PATH),
    "raw_extract_jsonl": str(RAW_EXTRACT_JSONL),
    "cleaned_jsonl": str(CLEANED_JSONL),
    "deduped_jsonl": str(DEDUPED_JSONL),
    "raw_report": raw_report,
    "filter_stats": dict(filter_stats),
    "exact_dedup_stats": dict(dedup_stats),
    "post_report": post_report,
    "notes": {
        "tokenization": "NLTK word_tokenize + NLTK stopwords used for stopword_ratio",
        "language_id": "langid",
        "dedup": "md5 of whitespace-normalized content"
    }
}

with open(REPORT_JSON, "w", encoding="utf-8") as f:
    json.dump(final_report, f, indent=2, ensure_ascii=False)

print("✅ Saved report:", REPORT_JSON)

✅ Saved report: data-v2/wet_report.json


## Near-duplicate detection (MinHash + LSH)

This section identifies *near-duplicate* documents (e.g., mirrors, syndicated copies, template variants) that are not caught by exact MD5 deduplication.

**Method:**  
- Convert each document into word-shingles (default 5-grams)  
- Build a MinHash signature per document  
- Use LSH to retrieve candidate near-duplicates efficiently  
- Cluster using union-find (transitive closure)

> For class/demo purposes, the code defaults to running on a **sample** of the deduped dataset to keep runtime reasonable.  
> Set `NEAR_DUP_SAMPLE_N = None` to run on all documents (may take longer).
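
To make the method and the 0.90 threshold more concrete, here is a minimal brute-force sketch of the quantity that MinHash approximates: the exact Jaccard similarity between two documents' shingle sets, followed by the same transitive clustering idea. It uses only the standard library. The names `shingle_set`, `jaccard`, and `brute_force_clusters`, and the toy texts, are illustrative assumptions and are not part of the notebook. Comparing every pair is O(n²), so this is only a reference for tiny samples, not a replacement for LSH.

```python
import re
from itertools import combinations

WORD_RE = re.compile(r"[A-Za-z]+")

def shingle_set(text, k=5):
    """Set of lowercase word k-grams (ASCII-letter words only)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def brute_force_clusters(texts, threshold=0.90, k=5):
    """Link every pair with Jaccard >= threshold, then return connected components."""
    sets = [shingle_set(t, k) for t in texts]
    parent = list(range(len(texts)))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted((g for g in groups.values() if len(g) > 1), key=len, reverse=True)

# Toy data (hypothetical): 60 distinct alphabetic "words".
words = [chr(97 + i // 26) + chr(97 + i % 26) + "x" for i in range(60)]
texts = [
    " ".join(words),                # original
    " ".join(words) + " extra",     # same text plus one trailing word
    " ".join(reversed(words)),      # same vocabulary, different word order
]

sim_01 = jaccard(shingle_set(texts[0]), shingle_set(texts[1]))
sim_02 = jaccard(shingle_set(texts[0]), shingle_set(texts[2]))
clusters_demo = brute_force_clusters(texts, threshold=0.90)
print("sim(0,1) =", round(sim_01, 3), "| sim(0,2) =", round(sim_02, 3))
print("clusters:", clusters_demo)
```

Because shingles encode word order, the reordered text shares no 5-grams with the original even though its vocabulary is identical. That is why shingling finds template variants and mirrors but not loosely paraphrased pages.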


In [14]:

# If datasketch isn't installed, install it:
# !pip -q install datasketch

from datasketch import MinHash, MinHashLSH

# Controls
NEAR_DUP_SAMPLE_N = 3000     # None = use all docs
MINHASH_NUM_PERM  = 128
LSH_THRESHOLD     = 0.90
SHINGLE_SIZE      = 5

WORD_RE = re.compile(r"[A-Za-z]+")

def load_docs(jsonl_path: Path, limit=None, seed=42):
    docs = list(iter_jsonl(jsonl_path))
    if limit is not None and len(docs) > limit:
        rng = random.Random(seed)
        rng.shuffle(docs)
        docs = docs[:limit]
    return docs

def word_shingles(text: str, k=5):
    words = [w.lower() for w in WORD_RE.findall(text)]
    if len(words) < k:
        return []
    return [" ".join(words[i:i+k]) for i in range(len(words)-k+1)]

def build_minhash(shingles, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8", errors="ignore"))
    return m

class UnionFind:
    def __init__(self, n):
        self.p = list(range(n))
        self.r = [0]*n
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: 
            return
        if self.r[ra] < self.r[rb]:
            self.p[ra] = rb
        elif self.r[ra] > self.r[rb]:
            self.p[rb] = ra
        else:
            self.p[rb] = ra
            self.r[ra] += 1

docs_nd = load_docs(DEDUPED_JSONL, limit=NEAR_DUP_SAMPLE_N, seed=RANDOM_SEED)
print("Docs for near-dup:", len(docs_nd))

lsh = MinHashLSH(threshold=LSH_THRESHOLD, num_perm=MINHASH_NUM_PERM)
minhashes = []

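for i, d in enumerate(tqdm(docs_nd, desc="MinHash build")):
    sh = word_shingles(d["content"], SHINGLE_SIZE)
    m  = build_minhash(sh, num_perm=MINHASH_NUM_PERM)
    # Only index docs that produced shingles: empty MinHashes are all identical
    # and would otherwise be grouped together as false near-duplicates.
    if sh:
        lsh.insert(str(i), m)
    minhashes.append(m)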

uf = UnionFind(len(docs_nd))
edges = 0

for i, m in enumerate(tqdm(minhashes, desc="LSH query")):
    hits = lsh.query(m)
    for h in hits:
        j = int(h)
        if j <= i:
            continue
        uf.union(i, j)
        edges += 1

comp = defaultdict(list)
for i in range(len(docs_nd)):
    comp[uf.find(i)].append(i)

clusters = [v for v in comp.values() if len(v) > 1]
clusters.sort(key=len, reverse=True)

print("Near-dup edges:", edges)
print("Clusters (size>1):", len(clusters))
print("Top cluster sizes:", [len(c) for c in clusters[:10]])

# Show a few example clusters
for ci, c in enumerate(clusters[:3], 1):
    print("\n" + "="*90)
    print(f"Cluster {ci} | size={len(c)}")
    for idx in c[:3]:
        d = docs_nd[idx]
        print("-"*90)
        print("URL:", d.get("url",""))
        print(d.get("content","")[:280].replace("\n"," "), "...")


Docs for near-dup: 3000


MinHash build: 100%|██████████| 3000/3000 [00:17<00:00, 169.00it/s]
LSH query: 100%|██████████| 3000/3000 [00:00<00:00, 180775.98it/s]

Near-dup edges: 4
Clusters (size>1): 2
Top cluster sizes: [3, 2]

Cluster 1 | size=3
------------------------------------------------------------------------------------------
URL: https://turbotax.intuit.com/reviews/online/deluxe/?page=10982
TurboTax® Deluxe 2023-2024 - Customer Reviews - Page 10982 Skip To Main Content Only from TurboTax - file 100% FREE with expert help File 100% FREE with expert help ~37% of filers qualify. Form 1040 + limited credits only. Must file by 3/31. Start for free expand navigation optio ...
------------------------------------------------------------------------------------------
URL: https://turbotax.intuit.com/reviews/online/?page=9711
TurboTax® Online 2023-2024 - Customer Reviews - Page 9711 Skip To Main Content Only from TurboTax - file 100% FREE with expert help File 100% FREE with expert help ~37% of filers qualify. Form 1040 + limited credits only. Must file by 3/31. Start for free expand navigation option ...
-------------------------------------




## Topic modeling (TF-IDF + NMF)

This section provides an interpretable view of corpus themes using:
- TF-IDF features (unigrams + bigrams)
- NMF topic model (interpretable topic-word lists)

> For class/demo purposes, it defaults to a **sample** of documents.  
> Set `TOPIC_SAMPLE_N = None` to use all documents.
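
As a rough illustration of the TF-IDF half of this pipeline, the sketch below computes the weights by hand for a toy corpus. It assumes the smoothed IDF that scikit-learn documents for `TfidfVectorizer` defaults, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It uses whitespace-split unigrams only, with no stop-word removal, bigrams, or `min_df`/`max_df` pruning, so its numbers will not match the vectorizer configured in the notebook. The function name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (idf, rows): per-term IDF and one L2-normalized weight dict per doc."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)

    # Document frequency: number of docs containing each term at least once.
    df = Counter(t for toks in tokenized for t in set(toks))

    # Smoothed IDF (assumed to mirror scikit-learn's smooth_idf=True default).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return idf, rows

toy_docs = [
    "cookie consent cookie banner",
    "cookie policy shop cart",
    "shop cart price cookie",
]
idf, rows = tfidf_rows(toy_docs)

# Terms present in every document get the minimum IDF; rarer terms score higher.
for term in sorted(idf, key=idf.get):
    print(f"{term:8s} idf={idf[term]:.3f}")
```

Terms that appear on nearly every page, such as cookie-banner text, receive the lowest IDF. The notebook's `max_df=0.9` setting goes one step further and drops terms found in more than 90% of documents before NMF sees them.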


In [15]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Controls
TOPIC_SAMPLE_N = 5000    # None = use all docs
N_TOPICS       = 8
TOP_TERMS      = 12

docs_tm = load_docs(DEDUPED_JSONL, limit=TOPIC_SAMPLE_N, seed=RANDOM_SEED)
texts = [d["content"] for d in docs_tm]
print("Docs for topic modeling:", len(texts))

if len(texts) < 20:
    print("Not enough documents to run topic modeling reliably.")
else:
    vectorizer = TfidfVectorizer(
        max_features=30000,
        min_df=2,
        max_df=0.9,
        ngram_range=(1,2),
        stop_words="english"
    )
    X = vectorizer.fit_transform(texts)

    nmf = NMF(n_components=N_TOPICS, random_state=RANDOM_SEED)
    W = nmf.fit_transform(X)
    H = nmf.components_
    terms = np.array(vectorizer.get_feature_names_out())

    def top_terms(topic_idx, topn=12):
        idx = np.argsort(H[topic_idx])[::-1][:topn]
        return terms[idx].tolist()

    topic_terms = {f"topic_{i}": top_terms(i, TOP_TERMS) for i in range(N_TOPICS)}
    dominant = W.argmax(axis=1)

    print("\nTopic terms:")
    for k, v in topic_terms.items():
        print(k, ":", v)

    print("\nExample documents per topic (first 2):")
    for t in range(N_TOPICS):
        idxs = np.where(dominant == t)[0][:2]
        if len(idxs) == 0:
            continue
        print("\n" + "="*90)
        print(f"Topic {t}: {topic_terms[f'topic_{t}']}")
        for i in idxs:
            print("-"*90)
            print("URL:", docs_tm[i].get("url",""))
            print(docs_tm[i]["content"][:260].replace("\n"," "), "...")


Docs for topic modeling: 5000

Topic terms:
topic_0 : ['news', 'services', 'business', 'contact', 'home', 'new', '2024', 'health', 'events', 'information', 'data', 'people']
topic_1 : ['october', 'september', 'november', 'december', 'january', 'april', 'february', 'july', 'march', 'june', 'august', '2021']
topic_2 : ['cookies', 'cookie', 'consent', 'website', 'necessary', 'cookie set', 'gdpr cookie', 'set gdpr', 'months cookie', 'user', '11 months', 'gdpr']
topic_3 : ['redirect notice', 'redirect', 'previous page', '3d', 'page', 'notice', 'sending http', 'page sending', 'notice redirect', 'notice previous', 'visit page', 'page return']
topic_4 : ['00', 'cart', 'price', 'shop', 'sale', 'shipping', 'accessories', 'add', 'product', 'products', '99', 'add cart']
topic_5 : ['function', 'var', 'return', 'function var', 'function return', 'null', 'length', 'data', 'css', 'document', 'left', 'window']
topic_6 : ['phpbb', 'forum', 'board', 'search', 'login', 'forums', 'register', 'password', 't

## NLTK vs Regex demo (for class discussion)

This cell demonstrates how **NLTK tokenization** differs from a simple **regex tokenization** approach on real web text.


In [16]:

# Pick one random English document from the deduped set
docs_demo = load_docs(DEDUPED_JSONL, limit=200, seed=RANDOM_SEED)
sample_doc = docs_demo[0]["content"]

# Regex tokenization (simple)
regex_tokens = re.findall(r"[A-Za-z]+", sample_doc.lower())[:60]

# NLTK tokenization (word_tokenize)
nltk_tok = [t.lower() for t in word_tokenize(sample_doc)][:60]

print("REGEX TOKENS (first 60):")
print(regex_tokens)

print("\nNLTK TOKENS (first 60):")
print(nltk_tok)

print("\nNotes:")
print("- NLTK keeps punctuation as separate tokens; regex strips it.")
print("- NLTK handles contractions and punctuation boundaries more explicitly.")


REGEX TOKENS (first 60):
['articles', 'by', 'author', 'about', 'pensoftbooksjournalsnews', 'blogcontact', 'register', 'login', 'full', 'text', 'author', 'title', 'submit', 'manuscript', 'about', 'articles', 'issues', 'topical', 'collections', 'author', 'guidelines', 'editorial', 'team', 'contacts', 'author', 'valentina', 'l', 'a', 'laface', 'article', 'by', 'this', 'author', 'sort', 'by', 'publication', 'date', 'newest', 'publication', 'date', 'oldest', 'total', 'views', 'unique', 'views', 'best', 'match', 'citations', 'counthighly', 'accessed', 'last', 'month', 'highly', 'accessed', 'last', 'months', 'highly', 'accessed', 'last', 'months']

NLTK TOKENS (first 60):
['articles', 'by', 'author', 'about', 'pensoftbooksjournalsnews', '&', 'blogcontact', 'register', '|', 'login', 'full', 'text', 'author', 'title', 'submit', 'manuscript', 'about', 'articles', 'issues', 'topical', 'collections', 'author', 'guidelines', 'editorial', 'team', 'contacts', 'author', 'valentina', 'l.a.', 'laface', 