### Documentation for the First Cell

The first cell in this Jupyter Notebook performs the following tasks:

1. **Imports Required Modules**:
    - `Path` from `pathlib` for handling file paths.
    - `re` for working with regular expressions.
    - `defaultdict` from `collections` for grouping data.
    - `pandas` as `pd` for data manipulation.

2. **Configures Paths**:
    - Defines `BASE_DIR` as the root directory for input files.
    - Defines `OUTPUT_ROOT` as the directory where results will be stored, mirroring the structure of `BASE_DIR`.

3. **Mirrors Folder Structure**:
    - Creates the `OUTPUT_ROOT` directory if it doesn't exist.
    - Recursively mirrors the folder structure of `BASE_DIR` into `OUTPUT_ROOT`.

4. **Groups JSON Files by Issues**:
    - Uses a regular expression to match filenames in the format `<issue>_<page>.json`.
    - Groups files by their issue ID and relative directory using a `defaultdict`.

5. **Sorts and Processes Files**:
    - Defines a helper function `page_num` to extract page numbers from filenames.
    - Sorts files within each group by page number.
    - Constructs a list of records containing metadata for each issue, including:
        - `relative_dir`: The relative directory of the files.
        - `issue_id`: The unique identifier for the issue.
        - `output_path`: The path where the consolidated JSON will be saved.
        - `num_pages`: The number of pages in the issue.
        - `source_files`: A list of source file paths.

6. **Creates a DataFrame**:
    - Converts the list of records into a pandas DataFrame `df`.
    - Sorts the DataFrame by `relative_dir` and `issue_id`.

7. **Displays or Prints the DataFrame**:
    - Includes commented-out code to display or print the DataFrame and a summary of the results.
    - Provides a summary of the number of consolidated issues and any skipped files that did not match the expected filename pattern.
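
Steps 4 to 6 can be sketched as follows. This is a minimal illustration, not the notebook's actual first cell: the helper name `group_pages` and the exact pattern `ISSUE_PAGE_RE` are assumptions made for the example, the input is a plain list of relative paths rather than a directory scan, and the `output_path` field is left out because its naming scheme is not shown here.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed pattern for "<issue>_<page>.json"; the notebook's real regex may differ.
ISSUE_PAGE_RE = re.compile(r"^(?P<issue>.+)_(?P<page>\d+)\.json$")


def page_num(path: PurePosixPath) -> int:
    """Extract the numeric page suffix from a filename that matches the pattern."""
    return int(ISSUE_PAGE_RE.match(path.name).group("page"))


def group_pages(paths):
    """Group page files by (relative_dir, issue_id) and build one record per issue."""
    groups = defaultdict(list)
    skipped = []
    for p in map(PurePosixPath, paths):
        m = ISSUE_PAGE_RE.match(p.name)
        if m is None:
            skipped.append(str(p))  # filename does not follow <issue>_<page>.json
            continue
        groups[(str(p.parent), m.group("issue"))].append(p)

    records = []
    for (rel_dir, issue_id), files in sorted(groups.items()):
        files.sort(key=page_num)  # numeric sort, so page 10 follows page 9
        records.append({
            "relative_dir": rel_dir,
            "issue_id": issue_id,
            "num_pages": len(files),
            "source_files": [str(f) for f in files],
        })
    return records, skipped


records, skipped = group_pages([
    "1951/pi_vol_8_2_1951_002.json",
    "1951/pi_vol_8_2_1951_001.json",
    "1951/pi_vol_7_5_1951_001.json",
    "1951/notes.txt",
])
for r in records:
    print(r["issue_id"], r["num_pages"], r["source_files"])
print("skipped:", skipped)
```

The `records` list is a list of dictionaries with identical keys, which is a shape that `pd.DataFrame(records)` accepts directly; that is one way the DataFrame in step 6 could be built.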


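The following cell cleans every page-level JSON file: it converts curly quotes to straight quotes and collapses dotted abbreviations (for example, `U.C.L.` becomes `UCL`) in all string fields. Each result is saved as `<name>-cleaned.json` in an output folder that mirrors the input folder structure.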

In [15]:
# --- Clean each page JSON and save as "*-cleaned.json" ------------------------
from pathlib import Path
import json
import re

# 0) Configure paths (reuse your BASE_DIR; fallback to your full path if missing)
try:
    BASE_DIR
except NameError:
    BASE_DIR = Path(
        "/Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/"
        "Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/"
        "NER/Newswire - Jupyter/Pi-OCR-Articles"
    )
BASE_DIR = Path(BASE_DIR)
OUTPUT_ROOT = BASE_DIR.parent / "Pi-OCR-Articles-Cleaned"  # <- change to BASE_DIR if you want to save alongside originals

# Make sure output root and mirror subfolders exist
OUTPUT_ROOT.mkdir(exist_ok=True)
for d in [BASE_DIR] + [p for p in BASE_DIR.rglob("*") if p.is_dir()]:
    (OUTPUT_ROOT / d.relative_to(BASE_DIR)).mkdir(parents=True, exist_ok=True)

# --- Text cleaning helpers ----------------------------------------------------
def normalize_quotes(s: str) -> str:
    # Curly quotes → straight quotes
    return (
        s.replace("\u2018", "'")
         .replace("\u2019", "'")
         .replace("\u201C", '"')
         .replace("\u201D", '"')
    )

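# Collapse dotted abbreviations like U.C.L. or U.S.A., incl. spaced variants U. S. A. → UCL/USA.
# Spaced form: whitespace is allowed on either side of each dot, and the closing dot is
# required so that "J. Smith" or "Plan B. A new..." is left alone.
_abbr_with_space = re.compile(r'\b(?:[A-Za-z]\s*\.\s*)+[A-Za-z]\s*\.')
# Compact form: the closing dot is optional, but the last letter must not start a longer
# word (so "J.Smith" is not turned into "JSmith").
_abbr_no_space   = re.compile(r'\b(?:[A-Za-z]\.)+[A-Za-z](?![A-Za-z])\.?')
def collapse_dotted_abbreviations(s: str) -> str:
    def repl(m):  # keep letters only
        return ''.join(ch for ch in m.group(0) if ch.isalpha())
    s = _abbr_with_space.sub(repl, s)
    s = _abbr_no_space.sub(repl, s)
    return s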

def clean_string(s: str) -> str:
    s = normalize_quotes(s)
    s = collapse_dotted_abbreviations(s)
    return s

def clean_json(obj):
    """Recursively clean ALL string fields while preserving structure."""
    if isinstance(obj, str):
        return clean_string(obj)
    if isinstance(obj, list):
        return [clean_json(v) for v in obj]
    if isinstance(obj, dict):
        return {k: clean_json(v) for k, v in obj.items()}
    return obj

# --- Process files ------------------------------------------------------------
processed, skipped, errors = 0, 0, []

for f in BASE_DIR.rglob("*.json"):
    # Skip files that already look cleaned (relevant when OUTPUT_ROOT is BASE_DIR)
    if f.name.endswith("-cleaned.json"):
        skipped += 1
        continue
    rel = f.relative_to(BASE_DIR)
    out_dir = OUTPUT_ROOT / rel.parent
    out_path = out_dir / (f.stem + "-cleaned.json")
    try:
        with open(f, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except Exception as e:
        errors.append((str(f), f"read: {e}"))
        continue

    try:
        cleaned = clean_json(data)
    except Exception as e:
        errors.append((str(f), f"clean: {e}"))
        continue

    try:
        with open(out_path, "w", encoding="utf-8") as oh:
            json.dump(cleaned, oh, ensure_ascii=False, indent=2)
        processed += 1
    except Exception as e:
        errors.append((str(out_path), f"write: {e}"))

print(f"Done. Processed: {processed} file(s). Errors: {len(errors)}.")
if errors:
    for path, msg in errors[:10]:  # show up to first 10 errors
        print(f"- {path} -> {msg}")
    if len(errors) > 10:
        print(f"... and {len(errors)-10} more")
print(f"Cleaned files are under: {OUTPUT_ROOT}")


Done. Processed: 373 file(s). Errors: 0.
Cleaned files are under: /Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/NER/Newswire - Jupyter/Pi-OCR-Articles-Cleaned
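
To make the effect of the cleaning step concrete, here is a small self-contained sketch applied to an in-memory payload instead of files on disk. It restates the helpers in condensed form; the sample `page` dictionary and the single simplified abbreviation pattern `ABBR_RE` are illustrative assumptions, not data or code taken from the corpus.

```python
import re

# Simplified stand-in for the cell's abbreviation patterns (compact form only).
ABBR_RE = re.compile(r"\b(?:[A-Za-z]\.)+[A-Za-z]\.?")

# Curly single/double quotes mapped to their straight equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201C": '"', "\u201D": '"'})


def clean_string(s: str) -> str:
    """Straighten quotes, then keep only the letters of each dotted abbreviation."""
    s = s.translate(QUOTE_MAP)
    return ABBR_RE.sub(lambda m: "".join(ch for ch in m.group(0) if ch.isalpha()), s)


def clean_json(obj):
    """Recursively clean every string value; other types pass through unchanged."""
    if isinstance(obj, str):
        return clean_string(obj)
    if isinstance(obj, list):
        return [clean_json(v) for v in obj]
    if isinstance(obj, dict):
        return {k: clean_json(v) for k, v in obj.items()}
    return obj


# Hypothetical page payload, used only for illustration.
page = {
    "articles": [
        {
            "title": "\u201CU.C.L.\u201D students\u2019 union",
            "page": 1,
            "tags": ["U.S.A.", None],
        }
    ]
}

cleaned = clean_json(page)
print(cleaned["articles"][0]["title"])  # "UCL" students' union
print(cleaned["articles"][0]["tags"])   # ['USA', None]
```

Because `clean_json` builds new lists and dictionaries instead of editing in place, the original `page` object is left untouched; only string values change, while numbers and `None` pass through as they are.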


### Combining Individual Pages of an Issue into a Single JSON

The following code combines the individual pages of each issue, which were cleaned above, into a single JSON file per issue.
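
As a condensed illustration of the merge rule, the sketch below works on in-memory page payloads rather than files. The function name `merge_pages` and the sample payloads are hypothetical; the rule itself (accept either an `{"articles": [...]}` wrapper, a bare list, or a single object, stamp each article with its source page, then number the articles sequentially from 1) follows the cell below.

```python
from typing import Any, Dict, List, Tuple


def merge_pages(pages: List[Tuple[int, Any]]) -> Dict[str, list]:
    """Merge (page_number, payload) pairs into one {"articles": [...]} document."""
    combined = []
    for page_no, payload in sorted(pages, key=lambda item: item[0]):
        # Normalize the payload into a list of articles.
        if isinstance(payload, dict) and "articles" in payload:
            articles = payload["articles"]
            if not isinstance(articles, list):
                articles = [articles]
        elif isinstance(payload, dict):
            articles = [payload]
        elif isinstance(payload, list):
            articles = payload
        else:
            articles = []
        for art in articles:
            # Copy dicts so the inputs are not mutated; wrap anything else.
            art = dict(art) if isinstance(art, dict) else {"_raw": art}
            art["page"] = page_no
            combined.append(art)
    # Renumber across the whole issue, replacing any per-page ids.
    for new_id, art in enumerate(combined, start=1):
        art["id"] = new_id
    return {"articles": combined}


# Hypothetical payloads: page 2 is a bare list, page 1 uses the wrapper form.
merged = merge_pages([
    (2, [{"id": 1, "title": "Sports"}]),
    (1, {"articles": [{"id": 1, "title": "Editorial"}, {"id": 2, "title": "News"}]}),
])
print([(a["id"], a["page"], a["title"]) for a in merged["articles"]])
# [(1, 1, 'Editorial'), (2, 1, 'News'), (3, 2, 'Sports')]
```

Per-page `id` values are overwritten on purpose, so ids stay unique within the combined issue even when every page restarts its numbering at 1.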

In [3]:
# Combine multi-page OCR JSONs into one file per issue.
# Jupyter-friendly: hard-coded paths, no CLI args.

import os, re, json
from json import JSONDecodeError
from collections import defaultdict
from typing import List, Tuple, Union, Any

# ---- EDIT THESE TWO LINES IF YOUR PATHS CHANGE ----
INPUT_ROOT  = "/Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/NER/Newswire - Jupyter/Pi-OCR-Articles-Cleaned"
OUTPUT_ROOT = "/Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/NER/Newswire - Jupyter/Pi-OCR-Articles-Cleaned-Combined"
# ---------------------------------------------------

OVERWRITE = False      # set True to overwrite existing combined files
DRY_RUN   = False      # set True to preview without writing

PAGE_FILE_RE = re.compile(r"^(?P<base>.+?)_(?P<page>\d{3})-cleaned\.json$")

def assert_paths():
    if not os.path.isdir(INPUT_ROOT):
        raise FileNotFoundError(f"INPUT_ROOT not found: {INPUT_ROOT}")
    # Prevent accidental recursion if OUTPUT_ROOT is inside INPUT_ROOT
    in_abs  = os.path.abspath(INPUT_ROOT)
    out_abs = os.path.abspath(OUTPUT_ROOT)
    if out_abs.startswith(in_abs + os.sep):
        raise ValueError("OUTPUT_ROOT must not be inside INPUT_ROOT. Choose a sibling or separate folder.")
    os.makedirs(OUTPUT_ROOT, exist_ok=True)

def mirror_dir_structure(in_root: str, out_root: str, dry_run: bool = False):
    for dirpath, dirnames, _ in os.walk(in_root):
        rel = os.path.relpath(dirpath, start=in_root)
        target_dir = os.path.join(out_root, rel) if rel != os.curdir else out_root
        if not dry_run:
            os.makedirs(target_dir, exist_ok=True)

def find_issue_groups(root: str):
    """
    Return mapping: (dirpath, base) -> list of (page_num, abs_path, dirpath, filename, base)
    Only matches files like <base>_NNN-cleaned.json
    """
    groups = defaultdict(list)
    for dirpath, _, files in os.walk(root):
        for fn in files:
            if not fn.endswith("-cleaned.json"):
                continue
            m = PAGE_FILE_RE.match(fn)
            if not m:
                # Skip already-combined files like <base>-cleaned.json (no _NNN)
                continue
            base = m.group("base")
            page = int(m.group("page"))
            abs_path = os.path.join(dirpath, fn)
            groups[(dirpath, base)].append((page, abs_path, dirpath, fn, base))
    # Sort by page number
    for k in groups:
        groups[k].sort(key=lambda x: x[0])
    return groups

def load_json_or_text(path: str):
    with open(path, "r", encoding="utf-8") as fh:
        try:
            return json.load(fh), "json"
        except JSONDecodeError:
            fh.seek(0)
            return fh.read(), "text"

def _extract_articles(data: Any) -> list:
    """
    Normalize a single page payload into a list of articles:
      - {"articles": [...]} -> [...]
      - [...]                -> [...]
      - {...}                -> [ {...} ]
      - other                -> []
    """
    if isinstance(data, dict):
        if "articles" in data:
            arts = data["articles"]
            return arts if isinstance(arts, list) else [arts]
        return [data]
    if isinstance(data, list):
        return data
    return []

def combine_pages(paged_files: List[Tuple[int, str]]) -> Tuple[Union[dict, str], str]:
    """
    Returns (combined, mode)
      - mode == 'json'  -> {"articles": [...]} where each article has 'id' and 'page'
      - mode == 'text'  -> newline-delimited fallback when any page is non-JSON
    """
    parsed = []
    all_json = True
    for page, path in paged_files:
        data, kind = load_json_or_text(path)
        parsed.append((page, data, kind))
        if kind != "json":
            all_json = False

    if not all_json:
        # Fallback: newline-delimited serialization (no normalization possible)
        lines = []
        for page, data, kind in parsed:
            if kind == "json":
                lines.append(json.dumps(data, ensure_ascii=False))
            else:
                lines.append(str(data))
        return "\n".join(lines), "text"

    # All JSON: normalize, annotate with page, then assign sequential ids
    combined_articles = []
    for page, data, _ in parsed:
        for art in _extract_articles(data):
            # Ensure each article is a dict; if not, preserve raw under "_raw"
            if not isinstance(art, dict):
                art = {"_raw": art}
            # Work on a shallow copy to avoid mutating inputs
            art = dict(art)
            # Always set/overwrite page to the source page number
            art["page"] = page
            combined_articles.append(art)

    # Assign unique sequential IDs starting from 1 across the issue
    for i, art in enumerate(combined_articles, start=1):
        art["id"] = i

    return {"articles": combined_articles}, "json"

def ensure_out_dir(in_root: str, dirpath: str, out_root: str) -> str:
    rel = os.path.relpath(dirpath, start=in_root)
    target_dir = os.path.join(out_root, rel) if rel != os.curdir else out_root
    os.makedirs(target_dir, exist_ok=True)
    return target_dir

# ---- run ----
print(f"[INPUT ROOT]  {INPUT_ROOT}")
print(f"[OUTPUT ROOT] {OUTPUT_ROOT}")

assert_paths()
mirror_dir_structure(INPUT_ROOT, OUTPUT_ROOT, dry_run=DRY_RUN)

groups = find_issue_groups(INPUT_ROOT)
if not groups:
    print("No page groups found under INPUT_ROOT.")
else:
    total_written = 0
    for (dirpath, base), entries in groups.items():
        pages = [e[0] for e in entries]
        file_paths = [e[1] for e in entries]
        out_dir = ensure_out_dir(INPUT_ROOT, dirpath, OUTPUT_ROOT)
        out_filename = f"{base}-cleaned.json"  # remove the _NNN segment
        out_path = os.path.join(out_dir, out_filename)

        if os.path.exists(out_path) and not OVERWRITE:
            print(f"[SKIP] exists: {os.path.relpath(out_path, start=OUTPUT_ROOT)}")
            continue

        # Pass (page, path) pairs so we can annotate 'page' during normalization
        paged_files = list(zip(pages, file_paths))
        combined, mode = combine_pages(paged_files)

        rel_out = os.path.relpath(out_path, start=OUTPUT_ROOT)
        print(f"[WRITE] {rel_out}  pages={pages}  mode={mode}  articles={len(combined['articles']) if mode=='json' else 'N/A'}")

        if not DRY_RUN:
            if mode == "json":
                with open(out_path, "w", encoding="utf-8") as fh:
                    json.dump(combined, fh, ensure_ascii=False, indent=2)
                    fh.write("\n")
            else:
                with open(out_path, "w", encoding="utf-8") as fh:
                    fh.write(combined)
        total_written += 1

    print(f"Done. {'Would write' if DRY_RUN else 'Wrote'} {total_written} file(s).")


[INPUT ROOT]  /Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/NER/Newswire - Jupyter/Pi-OCR-Articles-Cleaned
[OUTPUT ROOT] /Users/stepanyan/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Knowledge Exchange and Innovation - UCL - Wikimedia - Shared working space/NER/Newswire - Jupyter/Pi-OCR-Articles-Cleaned-Combined
[WRITE] 1951/pi_vol_8_2_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=41
[WRITE] 1951/pi_vol_7_5_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=32
[WRITE] 1951/pi_vol_7_7_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=32
[WRITE] 1951/pi_vol_9_4_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=30
[WRITE] 1951/pi_vol_9_3_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=36
[WRITE] 1951/pi_vol_7_8_1951-cleaned.json  pages=[1, 2, 3, 4]  mode=json  articles=24
[WRITE] 1951/pi_vol_7_9_1951-cleaned.json  pages=[1, 2, 3, 4]